Using scatter plots for multivariate data: Python Data Science Cookbook

Source: Internet | Published by: 数据分析师修炼 | Editor: 程序博客网 | Time: 2024/05/17 07:09

Everything below comes from the textbook examples in Python Data Science Cookbook (《python数据科学指南》).

In multivariate data analysis, we are interested in seeing whether there are any relationships between the columns that we are analyzing. In the two-column (two-variable) case, the best place to start is a standard scatter plot. There can be four types of relationships, as follows:

No relationship
Strong
Simple
Multivariate (not simple) relationship

Example:

We will use the Iris dataset. It’s a multivariate dataset introduced by Sir Ronald Fisher. Refer to https://archive.ics.uci.edu/ml/datasets/Iris for more information. The Iris dataset has 150 instances and four attributes/columns. The 150 instances are composed of 50 records from each of the three species of the Iris flower (setosa, versicolor, and virginica). The four attributes are the sepal length, sepal width, petal length, and petal width, all in cm. Because every record carries a species label, the Iris dataset also serves as a great classification dataset.
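Before plotting, it can help to confirm that the copy of the dataset you loaded matches this description. The following is a minimal sketch, assuming the scikit-learn copy of Iris returned by `load_iris()`; the variable name `counts` is only illustrative.

```python
from collections import Counter

from sklearn.datasets import load_iris

data = load_iris()
x = data['data']
y = data['target']

# 150 instances (rows) and four attributes (columns).
print(x.shape)

# Names of the four attributes, in column order.
print(data['feature_names'])

# Number of records per species; target_names maps label index to species name.
counts = Counter(str(data['target_names'][label]) for label in y)
print(dict(counts))
```

If the shape or the per-species counts differ from 150 x 4 and 50 per species, the plots that follow will not match the description in the text.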

    #!/usr/bin/env python2
    # -*- coding: utf-8 -*-
    """
    @author: snaildove
    """
    # Load libraries
    from sklearn.datasets import load_iris
    import matplotlib.pyplot as plt
    import itertools

    # 1. Load Iris dataset
    data = load_iris()
    x = data['data']
    y = data['target']
    feature_names = data['feature_names']

    # We will proceed with demonstrating with a scatter plot:
    # 2. Perform a simple scatter plot.
    # Plot 6 graphs, one per pair of features: sepal length, sepal width,
    # petal length and petal width.
    plt.close('all')
    plt.figure(1)
    # We want a plot with 3 rows and 2 columns;
    # the 3 and 2 in the variable below signify that.
    subplot_start = 321
    col_numbers = range(0, 4)
    # Need it for labeling the graph
    col_pairs = itertools.combinations(col_numbers, 2)
    plt.subplots_adjust(wspace=0.5)

    for col_pair in col_pairs:
        plt.subplot(subplot_start)
        # Color each point by its class label
        plt.scatter(x[:, col_pair[0]], x[:, col_pair[1]], c=y)
        plt.xlabel(feature_names[col_pair[0]])
        plt.ylabel(feature_names[col_pair[1]])
        subplot_start += 1
    plt.show()

As you can see, we have plotted all six pairwise combinations of our four columns. We also have the class labels represented using three different colors. Let’s look at the bottom right plot, petal length versus petal width (the last pair the loop draws). We see that different ranges of values belong to different class labels. This gives us a great clue for classification: the petal width and length variables are good candidates if the problem at hand is classification.
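The visual impression that each class occupies its own range of petal values can be checked numerically. The sketch below is not part of the original recipe; it assumes the same `load_iris()` data and simply prints the minimum and maximum petal length and width for each species. The dictionary name `ranges` is illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
x = data['data']
y = data['target']

# Columns 2 and 3 hold petal length and petal width.
petal_cols = [2, 3]

ranges = {}
for label, name in enumerate(data['target_names']):
    rows = x[y == label][:, petal_cols]
    # Per-class (min, max) for petal length and petal width.
    ranges[str(name)] = (rows.min(axis=0), rows.max(axis=0))
    print(name, "min:", rows.min(axis=0), "max:", rows.max(axis=0))
```

Comparing the printed ranges shows which classes are cleanly separated on these two features and which ones overlap at the boundary.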

Note

For the Iris dataset, petal width and length alone can assign most records to their respective species: setosa separates cleanly, while versicolor and virginica overlap slightly.
These kinds of observations can be quickly made during the feature selection process with the help of bivariate scatter plots.
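To get a rough feel for how far the two petal features carry on their own, one can turn the scatter plot into a pair of hand-picked cut-offs. The following is only a sketch: the thresholds `LENGTH_CUT` and `WIDTH_CUT` are hypothetical values chosen by eye, not taken from the book, and the score is measured on the same records used to pick them, so it is not a held-out estimate.

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
x = data['data']
y = data['target']  # 0 = setosa, 1 = versicolor, 2 = virginica
petal_length = x[:, 2]
petal_width = x[:, 3]

# Hypothetical cut-offs read off the scatter plot by eye (assumptions).
LENGTH_CUT = 2.5   # below this petal length: call it setosa
WIDTH_CUT = 1.75   # otherwise, below this petal width: call it versicolor

predicted = np.where(petal_length < LENGTH_CUT, 0,
                     np.where(petal_width < WIDTH_CUT, 1, 2))

# Fraction of records whose rule-based label matches the true label.
accuracy = (predicted == y).mean()
print("fraction labelled correctly: %.3f" % accuracy)
```

A rule this crude is not a substitute for a trained classifier, but a high score from two features and two thresholds supports keeping petal length and width during feature selection.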

0 0