Python: Plotting with pandas (Summary): Plotting Tools

Source: Internet  Editor: 程序博客网  Date: 2024/05/29 17:12

Python for Data Analysis

Chapter 8: Plotting and Visualization

pandas plotting tools

>>> from pandas.plotting import scatter_matrix
>>> from pandas import Series, DataFrame
>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt

1. Scatter Matrix Plot

The plotting tools in this summary are imported from the pandas.plotting module and take a Series or DataFrame as an argument.
>>> df = pd.DataFrame(np.random.randn(1000, 4), columns=['a', 'b', 'c', 'd'])
>>> scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal='kde')
>>> plt.show()
This produces a 4x4 grid of 16 subplots: the diagonal holds density plots and the remaining cells are scatter plots.
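As a minimal sketch of how the result of the call above can be inspected: scatter_matrix returns a 2-D array of matplotlib Axes, one per pair of columns. The seed, the smaller sample size, and the switch to diagonal="hist" (which avoids the SciPy dependency that the KDE diagonal needs) are assumptions made for this illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to open a window with plt.show()
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
df = pd.DataFrame(rng.standard_normal((200, 4)), columns=["a", "b", "c", "d"])

# diagonal="hist" draws histograms on the diagonal instead of density curves
axes = scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal="hist")

# One Axes per (row column, column column) pair
print(axes.shape)

# axes[i, j] plots column j on the x-axis against column i on the y-axis,
# so individual panels can be customized after the fact
axes[0, 1].set_facecolor("#f5f5f5")
```

Indexing into the returned array is the usual way to tweak a single panel without redrawing the whole grid.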

2. Density Plot

You can create density plots with the Series.plot.kde() and DataFrame.plot.kde() methods.
np.random.randn(1000) draws 1000 samples from the standard normal distribution.
>>> ser = pd.Series(np.random.randn(1000))
>>> ser.plot.kde()
The result is a smooth curve that approximates the standard normal density.
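To make clear what a density plot estimates, here is a rough NumPy-only sketch of a Gaussian kernel density estimate. It is not pandas' implementation (pandas delegates the computation to SciPy); the bandwidth rule, grid size, and seed are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
samples = rng.standard_normal(1000)
n = samples.size

# Scott's rule of thumb for the bandwidth (an assumption of this sketch)
bandwidth = samples.std(ddof=1) * n ** (-1 / 5)

# Evaluation grid, padded so the tails of the outermost kernels are covered
grid = np.linspace(samples.min() - 4 * bandwidth,
                   samples.max() + 4 * bandwidth, 512)

# Place a Gaussian bump on every sample and average the bumps
z = (grid[:, None] - samples[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# A density should integrate to roughly 1 (Riemann-sum approximation)
dx = grid[1] - grid[0]
area = density.sum() * dx
print(f"area under the curve: {area:.4f}")
print(f"peak height: {density.max():.4f}")
```

Plotting density against grid should give a curve of the same kind as ser.plot.kde(), up to differences in bandwidth selection.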

3. Andrews Curves

Andrews curves plot multivariate data as a large number of curves, each built by using the attributes of one sample as the coefficients of a Fourier series. Coloring the curves by class makes clustering visible: curves belonging to samples of the same class will usually lie closer together and form larger structures. Use andrews_curves() to draw them.
>>> from pandas.plotting import andrews_curves
>>> df=DataFrame(np.random.rand(10,10), columns=range(1,11))
>>> df

     1         2         3         4         5         6         7   \

0 0.657668 0.234840 0.187963 0.480384 0.676935 0.644506 0.849955
1 0.347819 0.278945 0.482548 0.856854 0.369824 0.921871 0.195208
2 0.481188 0.886892 0.269874 0.992266 0.663039 0.285274 0.222589
3 0.999133 0.932073 0.656683 0.607936 0.362180 0.756532 0.479407
4 0.918229 0.965718 0.243416 0.042666 0.932310 0.734750 0.142455
5 0.393881 0.821673 0.598786 0.715335 0.525187 0.763766 0.570982
6 0.998222 0.770152 0.803504 0.932111 0.629249 0.632741 0.230093
7 0.730399 0.127948 0.586990 0.890208 0.885532 0.821200 0.216378
8 0.823925 0.741674 0.690356 0.269986 0.530224 0.446307 0.265048
9 0.497035 0.830702 0.399065 0.242242 0.192078 0.622756 0.867983

     8         9         10  

0 0.428669 0.921396 0.865082
1 0.897575 0.000369 0.019511
2 0.004554 0.093646 0.152874
3 0.376975 0.512618 0.385439
4 0.314657 0.032770 0.406077
5 0.087637 0.525262 0.095010
6 0.841192 0.115266 0.358726
7 0.957213 0.709480 0.013137
8 0.483483 0.687900 0.431011
9 0.924797 0.119433 0.386189
>>> plt.figure()
>>> andrews_curves(df, 1)
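The second argument names the class column. Here column 1 of df is used as the class column, and since every row holds a different value in it, each row is treated as its own class and drawn as a separate curve.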
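The call above passes a column of random floats as the class column, so no grouping is visible. A more typical use passes a categorical label column. The following is a minimal sketch with synthetic data; the column names f1 to f4 and label, the two groups, and the seed are all hypothetical and chosen only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to open a window with plt.show()
import numpy as np
import pandas as pd
from pandas.plotting import andrews_curves

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only

# Two synthetic groups of 20 samples with 4 features each; the means differ by 3
group_a = rng.normal(loc=0.0, scale=1.0, size=(20, 4))
group_b = rng.normal(loc=3.0, scale=1.0, size=(20, 4))

features = ["f1", "f2", "f3", "f4"]  # hypothetical feature names
df = pd.DataFrame(np.vstack([group_a, group_b]), columns=features)
df["label"] = ["A"] * 20 + ["B"] * 20  # categorical class column

# One curve per row, colored by the value in the "label" column;
# samples sets how many points each curve is evaluated at
ax = andrews_curves(df, "label", samples=100)
print(len(ax.lines))  # number of curves drawn
```

With well-separated groups like these, the curves of each class should bundle together, which is the clustering effect described above.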

4. Parallel Coordinates

Parallel coordinates is a technique for plotting multivariate data. It allows one to see clusters in the data and to estimate other statistics visually. In a parallel coordinates plot, each vertical line represents one attribute and each data point is drawn as one set of connected line segments crossing those vertical lines. Points that tend to cluster will appear closer together.
>>> from pandas.plotting import parallel_coordinates
>>> df=DataFrame(np.random.rand(10,10), columns=range(1,11))
>>> df

     1         2         3         4         5         6         7   \

0 0.467659 0.978732 0.179538 0.685182 0.229915 0.882398 0.924433
1 0.863878 0.992446 0.732572 0.543559 0.164539 0.710433 0.220690
2 0.816937 0.866524 0.561880 0.136630 0.972659 0.352004 0.650383
3 0.351081 0.341353 0.004663 0.600008 0.880758 0.440976 0.111892
4 0.226553 0.014078 0.379845 0.598606 0.341625 0.675299 0.708234
5 0.170063 0.342096 0.813045 0.860868 0.905096 0.737247 0.652726
6 0.797142 0.777763 0.737259 0.100391 0.551292 0.739408 0.266556
7 0.130778 0.201388 0.896418 0.549645 0.587309 0.548748 0.009598
8 0.467129 0.298170 0.861704 0.217054 0.761984 0.110673 0.493671
9 0.778196 0.456548 0.171519 0.745076 0.905559 0.390150 0.727006

     8         9         10  

0 0.494924 0.612457 0.026332
1 0.430576 0.064443 0.970996
2 0.776737 0.251197 0.410517
3 0.763297 0.365974 0.889982
4 0.947055 0.200605 0.179035
5 0.435712 0.694421 0.101725
6 0.581694 0.719693 0.588572
7 0.998294 0.138834 0.059504
8 0.549928 0.096064 0.312498
9 0.854901 0.985777 0.691980
>>> plt.figure()
>>> parallel_coordinates(df, 1)
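Column 1 of df is again used as the class column, so each row is drawn as one polyline of its own class; the vertical lines that the polylines cross correspond to columns 2 through 10.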
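As with Andrews curves, the plot is more informative when the class column is categorical. A minimal sketch under that assumption follows; the data is synthetic, and the names f1 to f4 and label are hypothetical. The cols argument restricts which attribute columns become vertical axes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to open a window with plt.show()
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(1)  # fixed seed, for reproducibility only

# Two synthetic groups of 15 samples; group B is shifted upward in every feature
group_a = rng.normal(loc=0.0, scale=0.5, size=(15, 4))
group_b = rng.normal(loc=2.0, scale=0.5, size=(15, 4))

features = ["f1", "f2", "f3", "f4"]  # hypothetical feature names
df = pd.DataFrame(np.vstack([group_a, group_b]), columns=features)
df["label"] = ["A"] * 15 + ["B"] * 15  # categorical class column

# Draw only three of the four attributes as vertical axes
ax = parallel_coordinates(df, "label", cols=["f1", "f2", "f3"])
print(len(ax.lines))  # polylines for the rows, plus the vertical axis lines
```

Each row becomes one polyline, so the two groups should appear as two bands at different heights.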

5. Lag Plot

Lag plots are used to check whether a data set or time series is random. Random data should not exhibit any structure in the lag plot; visible structure implies that the underlying data are not random.
>>> from pandas.plotting import lag_plot
>>> plt.figure()
>>> data = pd.Series(0.1 * np.random.rand(1000) + 0.9 * np.sin(np.linspace(-99 * np.pi, 99 * np.pi, num=1000)))
>>> lag_plot(data)
The x-axis of the plot is y(t) and the y-axis is y(t+1).
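The pairs that a lag plot scatters can also be built by hand, which makes the "structure versus no structure" idea measurable. This is a sketch, not pandas' own code; the helper lag_pairs is hypothetical, and the seed is arbitrary. The structured series uses the same formula as the example above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
n = 1000

noise = pd.Series(rng.random(n))
structured = pd.Series(
    0.1 * rng.random(n)
    + 0.9 * np.sin(np.linspace(-99 * np.pi, 99 * np.pi, num=n))
)

def lag_pairs(series, lag=1):
    """Return the (y(t), y(t+lag)) pairs a lag plot would scatter (lag >= 1)."""
    values = series.to_numpy()
    return values[:-lag], values[lag:]

x, y = lag_pairs(structured, lag=1)

# Correlation between each value and its successor
r_noise = noise.autocorr(lag=1)
r_structured = structured.autocorr(lag=1)
print(f"lag-1 correlation, noise:      {r_noise:.3f}")
print(f"lag-1 correlation, structured: {r_structured:.3f}")
```

A correlation near zero corresponds to a shapeless cloud in the lag plot, while a correlation far from zero corresponds to a visible pattern such as an ellipse.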

6. Autocorrelation Plot

Autocorrelation plots are often used for checking randomness in time series. This is done by computing autocorrelations for data values at varying time lags. If the time series is random, such autocorrelations should be near zero for all time-lag separations. If the time series is non-random, one or more of the autocorrelations will be significantly non-zero. The horizontal lines in the plot correspond to 95% and 99% confidence bands; the dashed line is the 99% band.
>>> from pandas.plotting import autocorrelation_plot
>>> data = pd.Series(0.7 * np.random.rand(1000) + 0.3 * np.sin(np.linspace(-9 * np.pi, 9 * np.pi, num=1000)))
>>> autocorrelation_plot(data)
In the resulting figure the x-axis is labeled Lag and the y-axis is labeled Autocorrelation.
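The quantities behind such a plot can be sketched directly in NumPy. This is an illustration of the standard sample autocorrelation and of white-noise confidence bands of the form z / sqrt(n); it is not taken from pandas' source, and the seed and the choice of 100 lags are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
n = 1000
data = 0.7 * rng.random(n) + 0.3 * np.sin(np.linspace(-9 * np.pi, 9 * np.pi, num=n))

mean = data.mean()
c0 = np.sum((data - mean) ** 2) / n  # lag-0 autocovariance, i.e. the variance

def autocorr(h):
    """Sample autocorrelation at lag h, for 0 <= h < n."""
    return np.sum((data[: n - h] - mean) * (data[h:] - mean)) / n / c0

lags = np.arange(1, 101)
r = np.array([autocorr(h) for h in lags])

# Approximate bands for white noise, using standard normal quantiles
band95 = 1.96 / np.sqrt(n)
band99 = 2.576 / np.sqrt(n)

# Lags whose autocorrelation falls outside the 99% band suggest non-randomness
outside = int(np.sum(np.abs(r) > band99))
print(f"95% band: +/-{band95:.3f}, 99% band: +/-{band99:.3f}")
print(f"lags outside the 99% band (of {lags.size}): {outside}")
```

Because the series contains a sine component, many lags should fall outside the bands, which is what the plot shows as a slowly oscillating curve.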

7. Bootstrap Plot

Bootstrap plots are used to visually assess the uncertainty of a statistic, such as mean, median, midrange, etc. A random subset of a specified size is selected from a data set, the statistic in question is computed for this subset and the process is repeated a specified number of times. Resulting plots and histograms are what constitutes the bootstrap plot.
>>> from pandas.plotting import bootstrap_plot
>>> data = pd.Series(np.random.rand(1000))
>>> bootstrap_plot(data, size=50, samples=500, color='green')
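The resampling procedure described above can be sketched in a few lines of NumPy. This illustrates the idea rather than reproducing pandas' implementation; the seed and the use of rng.choice without replacement to pick each random subset are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
data = rng.random(1000)

size, samples = 50, 500  # same values as in the bootstrap_plot call
means = np.empty(samples)
medians = np.empty(samples)
midranges = np.empty(samples)

for i in range(samples):
    # Draw a random subset of the data and compute each statistic on it
    subset = rng.choice(data, size=size, replace=False)
    means[i] = subset.mean()
    medians[i] = np.median(subset)
    midranges[i] = 0.5 * (subset.min() + subset.max())

# The spread of each statistic across subsets reflects its uncertainty
for name, stat in [("mean", means), ("median", medians), ("midrange", midranges)]:
    print(f"{name:>8}: {stat.mean():.3f} +/- {stat.std(ddof=1):.3f}")
```

The histograms in a bootstrap plot show the distribution of arrays like means, medians, and midranges; a narrower histogram indicates a more stable statistic for this data.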

8. RadViz

RadViz is a way of visualizing multi-variate data. It is based on a simple spring tension minimization algorithm. Basically you set up a bunch of points in a plane. In our case they are equally spaced on a unit circle. Each point represents a single attribute. You then pretend that each sample in the data set is attached to each of these points by a spring, the stiffness of which is proportional to the numerical value of that attribute (they are normalized to unit interval). The point in the plane, where our sample settles to (where the forces acting on our sample are at an equilibrium) is where a dot representing our sample will be drawn. Depending on which class that sample belongs it will be colored differently.
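>>> from pandas.plotting import radviz
>>> df=DataFrame(np.array([[2,4,6,79,23,190,552,1314,23457], [4,9,6,97,32,110,555,1210,4325]]).T, columns=['a','b'])
>>> # 'a' is used as the class column, leaving 'b' as the only attribute
>>> radviz(df, 'a')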
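With a single attribute column the spring picture above has only one anchor, so a sketch with several attributes and a categorical class column may be easier to read. The data below is synthetic, the names f1 to f4 and label are hypothetical, and the by-hand placement is my reading of the description above (normalize each attribute to the unit interval, then take the weighted average of the anchor points), not a copy of pandas' code.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to open a window with plt.show()
import numpy as np
import pandas as pd
from pandas.plotting import radviz

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
features = ["f1", "f2", "f3", "f4"]  # hypothetical attribute names
df = pd.DataFrame(rng.random((30, 4)), columns=features)
df["label"] = ["A"] * 15 + ["B"] * 15  # categorical class column

ax = radviz(df, "label")

# By-hand placement: normalize every attribute to [0, 1] ...
values = df[features].to_numpy()
weights = (values - values.min(axis=0)) / (values.max(axis=0) - values.min(axis=0))

# ... put one anchor per attribute, equally spaced on the unit circle ...
angles = 2 * np.pi * np.arange(len(features)) / len(features)
anchors = np.column_stack([np.cos(angles), np.sin(angles)])

# ... and let each sample settle at the weighted average of the anchors
positions = weights @ anchors / weights.sum(axis=1, keepdims=True)
radii = np.linalg.norm(positions, axis=1)
print(f"largest distance from the center: {radii.max():.3f}")
```

Since each position is a weighted average of points on the unit circle, every sample lands inside the circle, pulled toward the attributes in which it has large values.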
