Pandas Statistical Analysis Basics

Source: Internet  Editor: 程序博客网  Time: 2024/05/22 02:28

Pandas Statistical Analysis

Basic statistical analysis of pandas data

These functions closely resemble their NumPy counterparts.

import pandas as pd
dates = pd.date_range('20130101', periods=10)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
               '2013-01-09', '2013-01-10'],
              dtype='datetime64[ns]', freq='D')

import numpy as np
df = pd.DataFrame(np.random.randn(10, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df

                   A         B         C         D
2013-01-01 -1.587560 -0.198819  0.720054  1.921686
2013-01-02  0.296288  1.876570  0.338344  0.597835
2013-01-03 -1.832852  0.752045  2.184984 -0.157722
2013-01-04 -0.650829  1.690322 -1.145963 -0.798702
2013-01-05 -0.729986 -0.494417  2.166254  1.131232
2013-01-06 -1.759444 -1.104058  0.462934  2.050315
2013-01-07  0.760111 -1.753986  0.104831  1.075343
2013-01-08  0.096572  0.383660  0.604831  0.715224
2013-01-09  0.126292  1.025429  0.019330 -0.417396
2013-01-10 -0.179047  0.175366  0.826219 -0.451984

df.describe()  # quick summary statistics

               A          B          C          D
count  10.000000  10.000000  10.000000  10.000000
mean   -0.546045   0.235211   0.628182   0.566583
std     0.923341   1.164277   0.985506   1.001821
min    -1.832852  -1.753986  -1.145963  -0.798702
25%    -1.373167  -0.420517   0.163209  -0.352477
50%    -0.414938   0.279513   0.533883   0.656529
75%     0.118862   0.957083   0.799678   1.117260
max     0.760111   1.876570   2.184984   2.050315

df.mean()  # mean of each column

A   -0.546045
B    0.235211
C    0.628182
D    0.566583
dtype: float64

df.mean(axis=1)  # mean of each row

2013-01-01    0.213840
2013-01-02    0.777259
2013-01-03    0.236614
2013-01-04   -0.226293
2013-01-05    0.518271
2013-01-06   -0.087563
2013-01-07    0.046575
2013-01-08    0.450072
2013-01-09    0.188414
2013-01-10    0.092638
Freq: D, dtype: float64
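To make the comparison with NumPy concrete, the sketch below checks that the column and row means agree with the equivalent ndarray calls. The seeded generator and all variable names are assumptions made for illustration; they are not part of the example above. One difference worth knowing: pandas .std() defaults to ddof=1 (sample standard deviation), while NumPy's .std() defaults to ddof=0.

```python
import numpy as np
import pandas as pd

# Seeded data so the sketch is reproducible (the article's data is unseeded).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 4)), columns=["A", "B", "C", "D"])
values = df.to_numpy()

# Column means: pandas reduces along axis 0 by default.
col_means_match = np.allclose(df.mean().to_numpy(), values.mean(axis=0))

# Row means: axis=1 in both libraries.
row_means_match = np.allclose(df.mean(axis=1).to_numpy(), values.mean(axis=1))

# Standard deviation: the defaults differ (pandas ddof=1, NumPy ddof=0).
std_match_default = np.allclose(df.std().to_numpy(), values.std(axis=0))
std_match_ddof1 = np.allclose(df.std().to_numpy(), values.std(axis=0, ddof=1))

print(col_means_match, row_means_match, std_match_default, std_match_ddof1)
```

The third check is expected to fail and the fourth to pass, which is why results from the two libraries can disagree slightly unless ddof is set explicitly.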

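Basic statistical functions

  • .describe() Summary statistics along axis 0 (one summary per column): count, mean, standard deviation, minimum, quartiles, maximum
  • .sum() Sum of the data. By default it works along axis 0, adding down the rows to give one total per column; pass axis=1 to get one total per row. The same applies to the functions below
  • .count() Number of non-NaN values
  • .mean() .median() .mode() Arithmetic mean / median / mode of the data
  • .var() .std() Variance / standard deviation of the data
  • .min() .max() Minimum / maximum of the data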
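The axis convention in the list above is easy to mix up, so here is a minimal sketch on a tiny frame. The frame and its column names x and y are made up for illustration. It also shows that NaN values are skipped by default.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column x has one missing value.
df = pd.DataFrame({"x": [1.0, 2.0, np.nan], "y": [4.0, 5.0, 6.0]})

col_sums = df.sum()         # axis=0 (default): one total per column
row_sums = df.sum(axis=1)   # axis=1: one total per row
counts = df.count()         # non-NaN values per column
medians = df.median()       # median per column, NaN skipped
variances = df.var()        # sample variance (ddof=1) per column

print(col_sums.tolist())    # [3.0, 15.0]
print(row_sums.tolist())    # [5.0, 7.0, 6.0]
print(counts.tolist())      # [2, 3]
print(medians.tolist())     # [1.5, 5.0]
print(variances.tolist())   # [0.5, 1.0]
```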

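Locating the minimum and maximum (.argmin() and .argmax() apply to Series only; .idxmin() and .idxmax() also work on a DataFrame):

  • .argmin(), .argmax() Integer position of the minimum/maximum value (the automatic 0-based index, which is convenient for slicing and similar operations)
  • .idxmin(), .idxmax() Index label of the minimum/maximum value (the custom index)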
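The difference between a position and a label can be sketched as follows, assuming pandas 1.0 or later (in older releases Series.argmin behaved differently). The small frame named scores and its labels are hypothetical.

```python
import pandas as pd

a = pd.Series([9, 8, 7, 6], index=["a", "b", "c", "d"])

pos_min = a.argmin()     # integer position of the smallest value
label_min = a.idxmin()   # index label of the smallest value
pos_max = a.argmax()
label_max = a.idxmax()

# A position pairs with .iloc, a label pairs with .loc.
value_by_pos = a.iloc[pos_min]
value_by_label = a.loc[label_min]

# idxmin also works column by column on a DataFrame (hypothetical data).
scores = pd.DataFrame({"p": [3, 1, 2], "q": [0, 5, 4]}, index=["r0", "r1", "r2"])
min_labels = scores.idxmin()

print(pos_min, label_min, pos_max, label_max)   # 3 d 0 a
print(min_labels.tolist())                      # ['r1', 'r0']
```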
a = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])
a

a    9
b    8
c    7
d    6
dtype: int64

b = pd.DataFrame(np.arange(20).reshape(4, 5), index=['c', 'a', 'd', 'b'])
b

    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19

a.describe()

count    4.000000
mean     7.500000
std      1.290994
min      6.000000
25%      6.750000
50%      7.500000
75%      8.250000
max      9.000000
dtype: float64

type(a.describe())  # a Series object

pandas.core.series.Series

a.describe()['count']

4.0

b.describe()  # operates along axis 0 by default

               0          1          2          3          4
count   4.000000   4.000000   4.000000   4.000000   4.000000
mean    7.500000   8.500000   9.500000  10.500000  11.500000
std     6.454972   6.454972   6.454972   6.454972   6.454972
min     0.000000   1.000000   2.000000   3.000000   4.000000
25%     3.750000   4.750000   5.750000   6.750000   7.750000
50%     7.500000   8.500000   9.500000  10.500000  11.500000
75%    11.250000  12.250000  13.250000  14.250000  15.250000
max    15.000000  16.000000  17.000000  18.000000  19.000000

type(b.describe())  # a DataFrame object

pandas.core.frame.DataFrame

# Select one row of the summary; the result is a Series
b.describe().loc['max']

0    15.0
1    16.0
2    17.0
3    18.0
4    19.0
Name: max, dtype: float64

b.describe().iloc[7]

0    15.0
1    16.0
2    17.0
3    18.0
4    19.0
Name: max, dtype: float64

# Select one column of the summary, here the column labeled 2
b.describe()[2]

count     4.000000
mean      9.500000
std       6.454972
min       2.000000
25%       5.750000
50%       9.500000
75%      13.250000
max      17.000000
Name: 2, dtype: float64

b.describe().loc[:, 2]

count     4.000000
mean      9.500000
std       6.454972
min       2.000000
25%       5.750000
50%       9.500000
75%      13.250000
max      17.000000
Name: 2, dtype: float64
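Because the summary of a DataFrame is itself a DataFrame, a single statistic can be read by combining a row label and a column label. The sketch below is one way to do that; the variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])
summary = b.describe()

max_of_col_2 = summary.loc["max", 2]    # by labels: row 'max', column labeled 2
mean_of_col_0 = summary.at["mean", 0]   # .at is a scalar-only label lookup
same_by_position = summary.iloc[7, 2]   # by position: 8th row, 3rd column

print(max_of_col_2, mean_of_col_0, same_by_position)
```

Label-based access tends to be the safer choice here, since the row order of the summary changes if custom percentiles are requested.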

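Cumulative statistical analysis of data

  • Applies a cumulative operation over the first 1 to n elements of a sequence
  • Reduces the need for explicit for loops

Cumulative statistics functions, applicable to both Series and DataFrame

  • .cumsum() Sum of the first 1/2/…/n elements, in order
  • .cumprod() Product of the first 1/2/…/n elements, in order
  • .cummax() Maximum of the first 1/2/…/n elements, in order
  • .cummin() Minimum of the first 1/2/…/n elements, in order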
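As a rough illustration of how these functions replace a for loop, the sketch below computes a running total twice, once by hand and once with .cumsum(). The series s is made-up data.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Hand-written running total.
running = []
total = 0
for value in s:
    total += value
    running.append(total)

# The same result in a single call.
vectorized = s.cumsum().tolist()

print(running)      # [1, 3, 6, 10]
print(vectorized)   # [1, 3, 6, 10]
```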
b = pd.DataFrame(np.arange(20).reshape(4, 5), index=['c', 'a', 'd', 'b'])
b

    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19

b.cumsum()  # cumulative sum down each column

    0   1   2   3   4
c   0   1   2   3   4
a   5   7   9  11  13
d  15  18  21  24  27
b  30  34  38  42  46

b.cumprod()  # cumulative product down each column

   0     1     2     3     4
c  0     1     2     3     4
a  0     6    14    24    36
d  0    66   168   312   504
b  0  1056  2856  5616  9576
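The examples above cover .cumsum() and .cumprod(). A minimal sketch of the other two functions, plus accumulation across rows with axis=1, might look like this. The series s is hypothetical data chosen so that the running maximum and minimum actually change.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5])
running_max = s.cummax().tolist()   # largest value seen so far
running_min = s.cummin().tolist()   # smallest value seen so far

b = pd.DataFrame(np.arange(20).reshape(4, 5), index=["c", "a", "d", "b"])
across_rows = b.cumsum(axis=1)      # accumulate left to right within each row

print(running_max)                     # [3, 3, 4, 4, 5]
print(running_min)                     # [3, 1, 1, 1, 1]
print(across_rows.loc["c"].tolist())   # [0, 1, 3, 6, 10]
```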

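Rolling (window) calculation functions

Applicable to both Series and DataFrame

  • .rolling(w).sum() Sum of each run of w adjacent elements
  • .rolling(w).mean() Arithmetic mean of each run of w adjacent elements
  • .rolling(w).var() Variance of each run of w adjacent elements
  • .rolling(w).std() Standard deviation of each run of w adjacent elements
  • .rolling(w).min() .rolling(w).max() Minimum / maximum of each run of w adjacent elements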
b.rolling(2).sum()  # down each column, sum every window of two adjacent elements

      0     1     2     3     4
c   NaN   NaN   NaN   NaN   NaN
a   5.0   7.0   9.0  11.0  13.0
d  15.0  17.0  19.0  21.0  23.0
b  25.0  27.0  29.0  31.0  33.0

b.rolling(3).sum()

      0     1     2     3     4
c   NaN   NaN   NaN   NaN   NaN
a   NaN   NaN   NaN   NaN   NaN
d  15.0  18.0  21.0  24.0  27.0
b  30.0  33.0  36.0  39.0  42.0
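The NaN rows in the output above appear because the first w-1 positions do not yet have a full window. The sketch below shows the same effect on a small made-up Series, along with the min_periods parameter of .rolling(), which allows partial windows at the start.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

roll_mean = s.rolling(3).mean()              # first two entries are NaN
roll_max = s.rolling(2).max()                # first entry is NaN
partial = s.rolling(3, min_periods=1).sum()  # partial windows allowed, no NaN

print(roll_mean.tolist())   # [nan, nan, 2.0, 3.0, 4.0]
print(roll_max.tolist())    # [nan, 2.0, 3.0, 4.0, 5.0]
print(partial.tolist())     # [1.0, 3.0, 6.0, 9.0, 12.0]
```

Whether partial windows are acceptable depends on the analysis; leaving min_periods at its default keeps every reported value based on exactly w elements.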