Using scatter plots for multivariate data: Python Data Science Cookbook
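Source: Internet. Published by: 数据分析师修炼. Editor: 程序博客网. Date: 2024/05/17 07:09
All of the following comes from the worked examples in the textbook 《python数据科学指南》 (Python Data Science Cookbook).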
In multivariate data analysis, we are interested in seeing whether there are any relationships between the columns that we are analyzing. In the two-column (two-variable) case, the best place to start is a standard scatter plot. There can be four types of relationships, as follows:
- No relationship
- Strong
- Simple
- Multivariate (not simple) relationship
Example:
We will use the Iris dataset. It's a multivariate dataset introduced by Sir Ronald Fisher. Refer to https://archive.ics.uci.edu/ml/datasets/Iris for more information. The Iris dataset has 150 instances and four attributes/columns. The 150 instances are composed of 50 records from each of the three species of the Iris flower (setosa, versicolor, and virginica). The four attributes are the sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm. Because every record is labeled with its species, the Iris dataset also serves as a great classification dataset.
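A quick way to confirm these figures is to load the dataset and count its rows, columns, and records per species. The sketch below assumes the copy of Iris bundled with scikit-learn (the same load_iris loader this recipe uses); the helper name summarize_iris is made up for illustration.

```python
from collections import Counter

from sklearn.datasets import load_iris


def summarize_iris():
    """Return (row count, column count, records per species) for Iris."""
    data = load_iris()
    x, y = data['data'], data['target']
    # Map each integer class label to its species name before counting.
    species = [str(data['target_names'][label]) for label in y]
    return x.shape[0], x.shape[1], dict(Counter(species))


n_rows, n_cols, per_species = summarize_iris()
print(n_rows, n_cols)
print(per_species)
```

If the bundled copy matches the description above, this should report 150 rows, 4 columns, and 50 records for each species.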
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
@author: snaildove
"""
# Load Libraries
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import itertools

# 1. Load Iris dataset
data = load_iris()
x = data['data']
y = data['target']
feature_names = data['feature_names']

# We will proceed with demonstrating with a scatter plot:
# 2. Perform a simple scatter plot.
# Plot 6 graphs, combinations of our features, sepal length, sepal width,
# petal length and petal width.
plt.close('all')
plt.figure(1)
# We want a plot with 3 rows and 2 columns; the 3 and 2 in
# the variable below signify that.
subplot_start = 321
col_numbers = range(0, 4)
# Need it for labeling the graph
col_pairs = itertools.combinations(col_numbers, 2)
plt.subplots_adjust(wspace=0.5)

for col_pair in col_pairs:
    plt.subplot(subplot_start)
    plt.scatter(x[:, col_pair[0]], x[:, col_pair[1]], c=y)
    plt.xlabel(feature_names[col_pair[0]])
    plt.ylabel(feature_names[col_pair[1]])
    subplot_start += 1
plt.show()
As you can see, we have plotted all six pairwise combinations of our four columns. The class labels are represented using three different colors. Let's look at the bottom right plot, petal length versus petal width. We see that different ranges of values belong to different class labels. This gives us a great clue for classification: the petal width and length variables are good candidates if the problem at hand is classification.
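The same observation can be checked numerically by listing the minimum and maximum petal length and width for each class. This is only a rough sketch: it assumes columns 2 and 3 of the scikit-learn Iris array hold petal length and petal width, and the helper name petal_ranges is hypothetical.

```python
from sklearn.datasets import load_iris


def petal_ranges():
    """Return the (min, max) petal length and width in cm for each species."""
    data = load_iris()
    x, y = data['data'], data['target']
    ranges = {}
    for label, name in enumerate(data['target_names']):
        rows = x[y == label]
        # Assumption: column 2 is petal length and column 3 is petal width.
        ranges[str(name)] = {
            'petal length': (float(rows[:, 2].min()), float(rows[:, 2].max())),
            'petal width': (float(rows[:, 3].min()), float(rows[:, 3].max())),
        }
    return ranges


ranges = petal_ranges()
for name, values in ranges.items():
    print(name, values)
```

The setosa ranges should sit entirely below those of the other two species, while versicolor and virginica should overlap a little at the boundary, which is what the scatter plot suggests visually.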
Note
For the Iris dataset, the petal width and length alone can classify most records into their respective species. These kinds of observations can be made quickly during the feature selection process with the help of bivariate scatter plots.
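One rough way to test this note is to classify every record with two simple cut-offs on the petal measurements and count how many match the true labels. The thresholds below are hand-picked by eye for illustration (they are assumptions, not values from the book), and the function name classify_by_petal is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris

# Hand-picked, illustrative thresholds in cm (assumptions, not from the book).
SETOSA_MAX_PETAL_LENGTH = 2.5
VERSICOLOR_MAX_PETAL_WIDTH = 1.75


def classify_by_petal(petal_length, petal_width):
    """Predict a class label (0, 1 or 2) per record from petal size only."""
    petal_length = np.asarray(petal_length)
    petal_width = np.asarray(petal_width)
    # Short petals -> class 0; otherwise narrow petals -> class 1, else 2.
    return np.where(
        petal_length < SETOSA_MAX_PETAL_LENGTH, 0,
        np.where(petal_width < VERSICOLOR_MAX_PETAL_WIDTH, 1, 2))


data = load_iris()
x, y = data['data'], data['target']
# Assumption: column 2 is petal length and column 3 is petal width.
predicted = classify_by_petal(x[:, 2], x[:, 3])
accuracy = float(np.mean(predicted == y))
print('Fraction classified correctly: %.3f' % accuracy)
```

A high but imperfect score here would be consistent with the plot: setosa separates cleanly, while a few versicolor and virginica records fall on the wrong side of any single cut-off.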