Pandas Beginner's Study Summary

Source: Internet. Editor: 程序博客网. Time: 2024/05/21 06:51

# 1. pandas

import pandas as pd

import numpy as np
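# Data structures in pandas
# Series: a one-dimensional array, similar to a one-dimensional NumPy array. Both resemble Python's
# built-in list. The difference is that a list may hold elements of different types, while an array
# or a Series stores all of its elements under a single dtype,
# which uses memory more efficiently and speeds up computation.
# Time series: a Series indexed by time
# DataFrame: a two-dimensional tabular data structure; it can be thought of as a container of Series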
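The single-dtype point above can be checked directly. This is a minimal sketch with illustrative variable names (not from the original post); note that a Series given mixed values does not fail, it falls back to the generic `object` dtype.

```python
import pandas as pd

# A plain list can mix types freely.
mixed_list = [1, "two", 3.0]

# A Series always carries exactly one dtype for all of its values.
int_series = pd.Series([1, 2, 3])        # integer dtype
float_series = pd.Series([1, 2.5, 3])    # the ints are upcast to float64
mixed_series = pd.Series(mixed_list)     # falls back to the generic object dtype

for name, s in [("int", int_series), ("float", float_series), ("mixed", mixed_series)]:
    print(name, s.dtype)
```

The `object` fallback keeps the data but loses most of the memory and speed benefits mentioned above.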

# 2. pd.Series

s = pd.Series([1, 3, 5, np.nan, 6, 8])  # NaN means "not a number"

print(s)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
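Because NaN is the missing-value marker, most reductions skip it by default. The sketch below reuses the same Series; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

missing_mask = s.isna()      # True only at the NaN position
valid_count = s.count()      # number of non-missing values
mean_value = s.mean()        # NaN is skipped by default (skipna=True)
filled = s.fillna(0)         # replace NaN with a chosen value

print(missing_mask.tolist())
print(valid_count, mean_value)
print(filled.tolist())
```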

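# DataFrame
# A DataFrame is a two-dimensional data structure that is essentially a container of Series.
# It therefore consists of an index plus the Series aligned to that index.
# Within one Series every value has the same dtype, but different Series may have different dtypes.
# So in a DataFrame each column has a single dtype, while different columns may have different dtypes.
# In database terms, each row of a DataFrame is a record, labelled by one element of the index,
# and each column is a field, i.e. one attribute of that record.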

# 3. Creating a DataFrame

# Pattern 1:
# data = {'col1': ts1, 'col2': ts2}  # each tsx can be a list, dict, tuple, or pd.Series
# df = pd.DataFrame(data=data, index=index)  # index can also be a pd.date_range, among other types
# Pattern 2:
# df = pd.DataFrame(data=data, index=index, columns=columns)
# Pattern 3:
# data = {'column1_name': pd.Series(), 'column2_name': pd.Series(), ..., 'columnn_name': pd.Series()}
# df = pd.DataFrame(data, index=index)  # without an explicit index, row labels default to 0 .. n-1, like np.arange(n)
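Pattern 2 above can be sketched as follows. The values and labels are made up for illustration.

```python
import pandas as pd

data = [[1, 2], [3, 4], [5, 6]]      # hypothetical values, one inner list per row
index = ['x', 'y', 'z']              # row labels
columns = ['col1', 'col2']           # column labels, in the order given

df = pd.DataFrame(data=data, index=index, columns=columns)
print(df)

# Without index/columns, both default to 0 .. n-1.
df_default = pd.DataFrame(data)
print(df_default.index.tolist(), df_default.columns.tolist())
```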

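# 3.1 Building a DataFrame from a dict of Series (and the resulting column order)

# Column order is not random: older pandas sorted the columns by name (as in the outputs in this section),
# while pandas 0.23+ on Python 3.6+ keeps the dict's insertion order.
# The keys of the outer dict become the DataFrame's columns,
# and the nested dicts or Series supply the values of each column.
# Here d is a dict in which 'one' is a Series with 3 values and 'two' is a Series with 4 values.
# The DataFrame built from d has 4 rows and 2 columns. Since 'one' has only 3 values, the cell at row 'd', column 'one' is NaN (Not a Number),
# the default missing-value marker in pandas.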
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df)

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

# Swap the two column names (the 4-value Series also gets a different index here)

d = {'two' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
'one' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'g', 'h'])}
df = pd.DataFrame(d)
print('----------\n',df)

   one  two
a  1.0  1.0
b  2.0  2.0
c  NaN  3.0
g  3.0  NaN
h  4.0  NaN

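# The column order is unchanged ('one' still comes first), which shows that this pandas version sorts the columns by name.
# On pandas 0.23+ with Python 3.6+, the columns instead follow the dict's insertion order ('two', 'one').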
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),

'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'g', 'h'])}
df = pd.DataFrame(d)

print('----------\n',df)

----------
    one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  NaN
g  NaN  3.0
h  NaN  4.0
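Since the default column order depends on the pandas version, one option is to fix it explicitly. The sketch below shows two ways to do that, passing `columns=` to the constructor or sorting the column labels afterwards; the variable names `df_explicit` and `df_sorted` are illustrative.

```python
import pandas as pd

d = {'two': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'one': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'g', 'h'])}

# Option 1: state the column order in the constructor.
df_explicit = pd.DataFrame(d, columns=['one', 'two'])

# Option 2: build first, then sort the column labels.
df_sorted = pd.DataFrame(d).sort_index(axis=1)

print(df_explicit)
print(df_sorted.columns.tolist())
```

Either way the row index is the union of the two Series' indexes, and cells with no value are NaN.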

# 3.2 Building a DataFrame from a list of dicts, where each dict is one record (one row of the DataFrame)

# Each value in a dict is one attribute of that record
d = [{'one' : 1,'two':1},{'one' : 2,'two' : 2},{'one' : 3,'two' : 3},{'two' : 4}]
df = pd.DataFrame(d,index=['a','b','c','d'])

print('----------\n',df)

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4

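# The statements above produce the same DataFrame as the dict-of-Series form; only the approach differs.
# One builds column by column: the attributes of all records become several Series, so the row labels are repeated.
# The other builds row by row: each record becomes a dict, so the column labels are repeated.
# With this approach, if the column order is not given through columns, it is determined as described in 3.1.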
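To illustrate that the row-wise and column-wise constructions describe the same table, here is a small sketch. It passes `columns=` so the order does not depend on the pandas version, and casts to float because the row-wise `'two'` column is inferred as integers.

```python
import pandas as pd

# Row-wise: one dict per record.
rows = [{'one': 1, 'two': 1}, {'one': 2, 'two': 2}, {'one': 3, 'two': 3}, {'two': 4}]
by_rows = pd.DataFrame(rows, index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Column-wise: one Series per field.
cols = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
        'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
by_cols = pd.DataFrame(cols, columns=['one', 'two'])

# Same labels and values once the dtypes are aligned (NaN positions compare equal in .equals).
same_table = by_rows.astype(float).equals(by_cols)
print(same_table)
```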

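# 3.3 Creating a DataFrame from a NumPy array, with a date index and labelled columns

dates = pd.date_range('20130101', periods=6)  # 6 days in total

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# list(str) splits a string into a list of its characters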

print(df)

                   A         B         C         D
2013-01-01  0.550842 -0.344767 -1.159734 -0.738592
2013-01-02 -1.122310 -0.058075 -0.912459 -0.233126
2013-01-03 -0.147547  0.552087 -0.592051  1.361595
2013-01-04 -1.152434  0.423264 -0.954811  0.562591
2013-01-05 -0.064692  0.121858 -0.741079  0.408168
2013-01-06  0.797960  0.291905  0.071038  0.509477

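# pd.date_range(start, end, periods, freq, normalize)
# start: the start date
# end: the end date
# periods: the number of dates in the range; together with freq it determines the length of the range
# freq: the step between dates (every n days, or another explicit frequency); defaults to 'D', i.e. one day
# normalize: normalize the dates to midnight, i.e. set the time to 00:00:00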
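The parameters listed above can be tried out as follows. This is a small sketch with made-up dates.

```python
import pandas as pd

# periods only: freq defaults to 'D', so this yields 6 consecutive days.
daily = pd.date_range(start='2013-01-01', periods=6)

# start + end + freq: every second day between the two dates (inclusive).
every_other_day = pd.date_range(start='2013-01-01', end='2013-01-10', freq='2D')

# normalize=True resets the time of day to midnight before generating the range.
normalized = pd.date_range(start='2013-01-01 15:30', periods=2, normalize=True)

print(daily)
print(every_other_day)
print(normalized)
```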

# 3.4 Creating a DataFrame from a dict of objects that can be converted to series-like

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),  # pd.Timestamp(date) parses a date string into a timestamp
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),  # list(range(4)) is [0, 1, 2, 3]
                    'D': np.array([3] * 4, dtype='int32'),  # np.array([3] * 4, dtype='int32') is [3 3 3 3]
                    # np.array([x] * k) is an array holding k copies of x
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

print('df2\n',df2)

df2
      A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
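In the frame above, scalars such as `1.` and `'foo'` are broadcast to every row, and each column keeps its own dtype. The sketch below rebuilds the same frame and inspects the dtypes; exact dtype names for the timestamp and string columns can vary between pandas versions, so only the stable ones are noted in the comments.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'A': 1.,                                   # scalar, broadcast to all rows
                    'B': pd.Timestamp('20130102'),             # scalar timestamp, broadcast
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})                               # scalar string, broadcast

# One dtype per column: A float64, C float32, D int32, E category.
print(df2.dtypes)
print(len(df2))
```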

# The head and tail methods show the first N and last N records of a DataFrame; N is the argument and defaults to 5.

print('head')
print(df2.head(2))
print('tail')

print(df2.tail(2))

head
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
tail
     A          B    C  D      E    F
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

# Show the index, the columns, and the underlying NumPy data
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print('df.index\n',df.index)
print('df.columns\n',df.columns)
print('df.values\n',df.values)

print(df.describe())

df.index
 Index(['a', 'b', 'c', 'd'], dtype='object')
df.columns
 Index(['one', 'two'], dtype='object')
df.values
 [[  1.   1.]
 [  2.   2.]
 [  3.   3.]
 [ nan   4.]]
       one       two
count  3.0  4.000000
mean   2.0  2.500000
std    1.0  1.290994
min    1.0  1.000000
25%    1.5  1.750000
50%    2.0  2.500000
75%    2.5  3.250000
max    3.0  4.000000
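Note that `describe()` ignores missing values, which is why the count for `'one'` is 3 rather than 4. A minimal sketch that checks a few of the summary numbers against direct computations:

```python
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

summary = df.describe()

# NaN in column 'one' is excluded from every statistic.
count_one = summary.loc['count', 'one']
mean_one = summary.loc['mean', 'one']
std_two = summary.loc['std', 'two']    # sample standard deviation (ddof=1)

print(count_one, mean_one, std_two)
print(df['two'].std())                 # same value computed directly
```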

# Transpose

print(df.T)

       a    b    c    d
one  1.0  2.0  3.0  NaN
two  1.0  2.0  3.0  4.0

# 4. Sorting a DataFrame

# 4.1 Sorting by axis labels (row names, column names)

d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
'two' : pd.Series([5., -1., 3., -2.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

print('Original df\n',df)

Original df
    one  two
a  1.0  5.0
b  2.0 -1.0
c  3.0  3.0
d  NaN -2.0

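x = df.sort_index(axis=1, ascending=False)  # ascending=False sorts in descending order, ascending=True in ascending order

# axis=0 sorts by row labels, axis=1 sorts by column labels

print('Sorted by axis labels\n',x)

Sorted by axis labels
    two  one
a  5.0  1.0
b -1.0  2.0
c  3.0  3.0
d -2.0  NaN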
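For comparison, the sketch below applies `sort_index` along both axes of the same frame. The result names are illustrative.

```python
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([5., -1., 3., -2.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

rows_descending = df.sort_index(axis=0, ascending=False)   # reorder rows by row label
cols_descending = df.sort_index(axis=1, ascending=False)   # reorder columns by column label

print(rows_descending.index.tolist())
print(cols_descending.columns.tolist())
```

Only the labels drive the ordering here; the cell values just travel with their row or column.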

# 4.2 Sorting by values

d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
'two' : pd.Series([5., -1., 3., -2.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

print(df)

   one  two
a  1.0  5.0
b  2.0 -1.0
c  3.0  3.0
d  NaN -2.0

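x = df.sort_values(['a','b'],axis=1,ascending=False)

# or, equivalently:

#x = df.sort_values(axis=1,by =['a','b'] ,ascending=False)

# The columns are ordered by the values in row 'a' first; ties are broken by the values in row 'b'.

# axis=1 cannot be omitted here: with the default axis=0, the names in by are looked up among the column labels, and 'a' and 'b' would raise a KeyError.

# The old df.sort(columns='two', ascending=False) has been removed from pandas; use df.sort_values(by='two', ascending=False) instead.
# ascending=True sorts in ascending order and ascending=False in descending order.

print('Sorted\n',x)

Sorted
    two  one
a  5.0  1.0
b -1.0  2.0
c  3.0  3.0
d -2.0  NaN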
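The more common case is sorting rows by the values of a column. The sketch below shows that alongside the row-based sort used above, plus the `na_position` option for where NaN ends up; the result names are illustrative.

```python
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([5., -1., 3., -2.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# axis=0 (default): reorder rows by the values in column 'two'.
rows_by_two = df.sort_values(by='two', ascending=False)

# NaN goes last by default; na_position='first' moves it to the top.
rows_by_one_nan_first = df.sort_values(by='one', na_position='first')

# axis=1: reorder columns by the values in row 'a'.
cols_by_row_a = df.sort_values(by='a', axis=1, ascending=False)

print(rows_by_two.index.tolist())
print(rows_by_one_nan_first.index.tolist())
print(cols_by_row_a.columns.tolist())
```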

# 5. Selecting data

# 5.1 Selecting whole columns or rows

dates = pd.date_range('1/1/2000', periods=5)
df = pd.DataFrame(np.random.randn(5, 4), index=dates,
columns=['A', 'B', 'C', 'D'])
print('df\n',df)

print("Column 'A':\n",df['A'])

df
                    A         B         C         D
2000-01-01  1.293043 -1.333675  0.480421  0.751432
2000-01-02 -0.234011 -1.275571 -1.312495  0.200552
2000-01-03  0.203873  2.104864  0.223676  0.100973
2000-01-04  0.227849 -1.345697  1.799077 -1.784205
2000-01-05 -0.326033  0.610245 -1.575445  0.870782
Column 'A':
 2000-01-01    1.293043
2000-01-02   -0.234011
2000-01-03    0.203873
2000-01-04    0.227849
2000-01-05   -0.326033
Freq: D, Name: A, dtype: float64

s = df['A']
print('s[0:3]\n',s[0:3])

print('\nOriginal df:\n',df)

s[0:3]
 2000-01-01    1.293043
2000-01-02   -0.234011
2000-01-03    0.203873
Freq: D, Name: A, dtype: float64

Original df:
                    A         B         C         D
2000-01-01  1.293043 -1.333675  0.480421  0.751432
2000-01-02 -0.234011 -1.275571 -1.312495  0.200552
2000-01-03  0.203873  2.104864  0.223676  0.100973
2000-01-04  0.227849 -1.345697  1.799077 -1.784205
2000-01-05 -0.326033  0.610245 -1.575445  0.870782

df[['B', 'A']] = df[['A', 'B']]
print("\ndf after swapping columns 'A' and 'B':\n",df)
print('First three rows of df\n',df[:3])
print('\ndf:\n',df)

print('\ndf in reverse row order\n',df[::-1])

df after swapping columns 'A' and 'B':
                    A         B         C         D
2000-01-01 -1.333675  1.293043  0.480421  0.751432
2000-01-02 -1.275571 -0.234011 -1.312495  0.200552
2000-01-03  2.104864  0.203873  0.223676  0.100973
2000-01-04 -1.345697  0.227849  1.799077 -1.784205
2000-01-05  0.610245 -0.326033 -1.575445  0.870782
First three rows of df
                    A         B         C         D
2000-01-01 -1.333675  1.293043  0.480421  0.751432
2000-01-02 -1.275571 -0.234011 -1.312495  0.200552
2000-01-03  2.104864  0.203873  0.223676  0.100973

df:
                    A         B         C         D
2000-01-01 -1.333675  1.293043  0.480421  0.751432
2000-01-02 -1.275571 -0.234011 -1.312495  0.200552
2000-01-03  2.104864  0.203873  0.223676  0.100973
2000-01-04 -1.345697  0.227849  1.799077 -1.784205
2000-01-05  0.610245 -0.326033 -1.575445  0.870782

df in reverse row order
                    A         B         C         D
2000-01-05  0.610245 -0.326033 -1.575445  0.870782
2000-01-04 -1.345697  0.227849  1.799077 -1.784205
2000-01-03  2.104864  0.203873  0.223676  0.100973
2000-01-02 -1.275571 -0.234011 -1.312495  0.200552
2000-01-01 -1.333675  1.293043  0.480421  0.751432
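The column swap can be reproduced on a small deterministic frame. This sketch uses made-up integers; the second swap goes through `.loc` with `to_numpy()`, which assigns the raw values by position instead of aligning on column labels.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# Assigning a list of columns with [] pairs the columns by position, so the values swap.
df[['B', 'A']] = df[['A', 'B']]
a_after_first_swap = df['A'].tolist()

# With .loc, pass raw values so that label alignment does not undo the swap.
df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()
a_after_second_swap = df['A'].tolist()

print(a_after_first_swap)
print(a_after_second_swap)
```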

df = pd.DataFrame(np.random.randn(5,4), columns=list('ABCD'),
index=pd.date_range('20130101',periods=5))
# This fails: integer slices are not valid labels for .loc on a DatetimeIndex
# df.loc[2:3]
# Slicing with date labels works

print(df.loc['20130102':'20130104'])

                   A         B         C         D
2013-01-02  0.619954  0.899184  1.550459  0.057953
2013-01-03  0.247558  0.672721 -0.188366  0.443095
2013-01-04 -2.458783  0.798642  0.064322  0.310493
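As the output shows, a label slice with `.loc` includes both endpoints, unlike a positional slice. A minimal sketch with deterministic data (the values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=list('ABCD'),
                  index=pd.date_range('20130101', periods=5))

by_label = df.loc['20130102':'20130104']   # label slice: both ends included
by_position = df.iloc[1:3]                 # positional slice: end excluded

print(len(by_label), len(by_position))
```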

s = df['A']

print('Slicing a single column works too, but use .loc\n',s.loc['20130102':'20130104'])

Slicing a single column works too, but use .loc
 2013-01-02    0.619954
2013-01-03    0.247558
2013-01-04   -2.458783
Freq: D, Name: A, dtype: float64

df1 = pd.DataFrame(np.random.randn(6,4),index=list('abcdef'),
columns=list('ABCD'))
print(df1)

print("df1.loc[['a', 'b', 'd'], :]\n",df1.loc[['a', 'b', 'd'], :])

          A         B         C         D
a  0.305018  0.554215  2.143252  1.918004
b -0.346651 -0.411049 -1.442780  0.500032
c -0.249730  1.973843 -0.155139 -0.295213
d  0.129608  0.059382 -0.829285 -0.986293
e -2.577511  1.313742  0.207580  0.243185
f  1.266195 -0.031532  0.571188 -1.308195
df1.loc[['a', 'b', 'd'], :]
           A         B         C         D
a  0.305018  0.554215  2.143252  1.918004
b -0.346651 -0.411049 -1.442780  0.500032
d  0.129608  0.059382 -0.829285 -0.986293

print("Label-based slicing of rows and columns: df1.loc['d':, 'A':'C']\n",df1.loc['d':, 'A':'C'])

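# 5.2 Boolean operations

print(df1.loc['a'] > 0)  # boolean Series: which columns are positive in row 'a'
print(df1.loc[:, df1.loc['a'] > 0])  # keep only those columns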
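Because `df1` above holds random numbers, here is the same boolean selection on a small frame with made-up values, so the result can be followed by hand.

```python
import pandas as pd

df1 = pd.DataFrame([[1, -2, 3],
                    [4, 5, -6]], index=['a', 'b'], columns=list('ABC'))

positive_in_a = df1.loc['a'] > 0          # one boolean per column, based on row 'a'
selected = df1.loc[:, positive_in_a]      # keep the columns where the mask is True

print(positive_in_a.tolist())
print(selected)
```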

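# 6. Ranking

# Ranking is closely related to sorting. It assigns each value a rank, from 1 up to the number of valid data points.
# It is similar to the indirect sort indices produced by numpy.argsort, except that it can break ties according to a rule.
# The rank method of Series and DataFrame: by default, rank breaks ties by assigning each tied group its average rank.

# 6.1 Ranking a Series

# The method option controls how ties are broken
# Method     Description
# 'average'  Default: assign the average rank to every value in a tied group
# 'min'      Use the lowest rank of the whole group
# 'max'      Use the highest rank of the whole group
# 'first'    Assign ranks in the order the values appear in the original data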
obj = pd.Series([7,-5,7,4,2,0,4])
print("\nOriginal obj:\n",obj)

Original obj:
0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

print("\nDefault mode of ranking (average ranking):\n",obj.rank()) # the default is the average rank

Default mode of ranking (average ranking):
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

print("\nFirst mode of ranking (group by the ordering of presentation):\n",obj.rank(method='first')) # ties are ranked in the order the values appear in the original data

First mode of ranking (group by the ordering of presentation):
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

print("\nDescending order, highest rank of each tied group:\n",obj.rank(ascending=False, method='max'))

Descending order, highest rank of each tied group:
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

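print("\nDescending order, lowest rank of each tied group:\n",obj.rank(ascending=False, method='min')) # the ranking people usually expect

Descending order, lowest rank of each tied group:
0    1.0
1    7.0
2    1.0
3    3.0
4    5.0
5    6.0
6    3.0
dtype: float64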
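The tie-breaking methods are easier to compare side by side. The sketch below puts them in one table for the same Series; it also includes `'dense'`, a further pandas option not listed above, in which ranks increase by 1 between groups.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

methods = ['average', 'min', 'max', 'first', 'dense']
# One column of ascending ranks per tie-breaking method.
comparison = pd.DataFrame({m: obj.rank(method=m) for m in methods})
comparison.insert(0, 'value', obj)

print(comparison)
```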

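# 6.2 Ranking a DataFrame along rows or columns

# Related to numpy.argsort, but rank() returns ranks rather than sorting indices (numpy.argsort has no ascending parameter)
# Ranking a DataFrame matches the intuitive notion of ranking
data = {'a':pd.Series([0,1,0,1],index = [1,2,3,4]),
'b':[4.3, 7, -3, 2],
'c':[-2, 5, 8, -2.5]}
df3 = pd.DataFrame(data)
print("\noriginal df3:\n",df3)
print("\ndf3 ranked within each row (ascending):\n",df3.rank(axis=1)) # rank() returns each value's 1-based rank, not the indices argsort() returns
print("\ndf3 ranked within each column (ascending):\n",df3.rank(axis=0))

original df3:
    a    b    c
1  0  4.3 -2.0
2  1  7.0  5.0
3  0 -3.0  8.0
4  1  2.0 -2.5

df3 ranked within each row (ascending):
      a    b    c
1  2.0  3.0  1.0
2  1.0  3.0  2.0
3  2.0  1.0  3.0
4  2.0  3.0  1.0

df3 ranked within each column (ascending):
      a    b    c
1  1.5  3.0  2.0
2  3.5  4.0  3.0
3  1.5  1.0  4.0
4  3.5  2.0  1.0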
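The relation between `rank` and `numpy.argsort` is easy to mix up, so here is a small sketch on made-up values without ties: `argsort` answers "which position holds the k-th smallest value", while `rank` answers "what is the rank of the value at this position".

```python
import numpy as np
import pandas as pd

values = pd.Series([30, 10, 20])

ranks = values.rank()                          # rank of each element, 1-based
order = np.argsort(values.to_numpy())          # positions that would sort the data

# Applying argsort twice recovers 0-based ranks when there are no ties.
ranks_from_argsort = np.argsort(order) + 1

print(ranks.tolist())
print(order.tolist())
print(ranks_from_argsort.tolist())
```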

# 7. Indexing and selection in pandas

data = np.random.randn(6,4)
dates_index = pd.date_range('20170802',periods = 6)
labeled_columns = list('ABCD')
df = pd.DataFrame(data,index = dates_index,columns = labeled_columns)
# Data can be retrieved by position (integer) or by name (label), for example through the ix indexer

# 7.1 The .ix[row_selector, column_selector] indexer

# .ix accepts row and column labels as well as integer positions and slices.
# Note: .ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the calls here only run on older versions.
# On current pandas, use .loc for labels and .iloc for positions.
print("\noriginal df:\n",df)
print("\ndf.ix['20170802']\n",df.ix['20170802'])
print("\ndf.ix[:,'A']\n",df.ix[:,'A'])
print("\ndf.ix[:,['A','B']]\n",df.ix[:,['A','B']])
print("\ndf.ix[:,'A':'C']\n",df.ix[:,'A':'C'])
print("\ndf.ix[0:3,0:3]\n",df.ix[0:3,0:3])

original df:
                    A         B         C         D
2017-08-02  0.087691  0.219709 -1.415308 -0.904164
2017-08-03 -1.093569  0.416883  0.198400  1.045094
2017-08-04  0.427113  0.306400 -0.340359 -0.819978
2017-08-05  0.030368  0.852479 -1.733437  1.355039
2017-08-06 -1.257529  0.142805 -1.545963  1.265029
2017-08-07  0.148612  0.289964  0.661178  1.134609

df.ix['20170802']
A    0.087691
B    0.219709
C   -1.415308
D   -0.904164
Name: 2017-08-02 00:00:00, dtype: float64

df.ix[:,'A']
2017-08-02    0.087691
2017-08-03   -1.093569
2017-08-04    0.427113
2017-08-05    0.030368
2017-08-06   -1.257529
2017-08-07    0.148612
Freq: D, Name: A, dtype: float64

df.ix[:,['A','B']]
                    A         B
2017-08-02  0.087691  0.219709
2017-08-03 -1.093569  0.416883
2017-08-04  0.427113  0.306400
2017-08-05  0.030368  0.852479
2017-08-06 -1.257529  0.142805
2017-08-07  0.148612  0.289964

df.ix[:,'A':'C']
                    A         B         C
2017-08-02  0.087691  0.219709 -1.415308
2017-08-03 -1.093569  0.416883  0.198400
2017-08-04  0.427113  0.306400 -0.340359
2017-08-05  0.030368  0.852479 -1.733437
2017-08-06 -1.257529  0.142805 -1.545963
2017-08-07  0.148612  0.289964  0.661178

df.ix[0:3,0:3]
                    A         B         C
2017-08-02  0.087691  0.219709 -1.415308
2017-08-03 -1.093569  0.416883  0.198400
2017-08-04  0.427113  0.306400 -0.340359
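On pandas versions without `.ix`, each of the selections above can be written with `.loc` (labels) or `.iloc` (positions). The following is a sketch of one possible mapping, using deterministic made-up data and an explicit `pd.Timestamp` for the row label; the result names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4),
                  index=pd.date_range('20170802', periods=6),
                  columns=list('ABCD'))

row = df.loc[pd.Timestamp('2017-08-02')]   # was df.ix['20170802']
col_a = df.loc[:, 'A']                     # was df.ix[:, 'A']
cols_ab = df.loc[:, ['A', 'B']]            # was df.ix[:, ['A', 'B']]
cols_a_to_c = df.loc[:, 'A':'C']           # was df.ix[:, 'A':'C'] (label slice, 'C' included)
block = df.iloc[0:3, 0:3]                  # was df.ix[0:3, 0:3] (positional, end excluded)

print(row.tolist())
print(cols_a_to_c.columns.tolist())
print(block.shape)
```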

# 7.2 The .iloc[:, :] indexer: purely integer-position based

print("\noriginal df:\n",df)
print("\ndf.iloc[0,:]\n",df.iloc[0,:])
print("\ndf.iloc[:,0]\n",df.iloc[:,0])
print("\ndf.iloc[0:3,0:3]\n",df.iloc[0:3,0:3])

original df:
                    A         B         C         D
2017-08-02 -0.385499 -0.306875  0.315595 -1.896589
2017-08-03 -1.245150  1.637986  0.902566 -0.369912
2017-08-04 -1.238186  0.181560  0.214879 -1.114181
2017-08-05 -0.440366 -1.993538 -0.688619  2.331878
2017-08-06 -0.011589 -1.807583 -1.482700  2.105531
2017-08-07 -1.348804 -0.044709  2.163649 -0.174644

df.iloc[0,:]
A   -0.385499
B   -0.306875
C    0.315595
D   -1.896589
Name: 2017-08-02 00:00:00, dtype: float64

df.iloc[:,0]
2017-08-02   -0.385499
2017-08-03   -1.245150
2017-08-04   -1.238186
2017-08-05   -0.440366
2017-08-06   -0.011589
2017-08-07   -1.348804
Freq: D, Name: A, dtype: float64

df.iloc[0:3,0:3]
                    A         B         C
2017-08-02 -0.385499 -0.306875  0.315595
2017-08-03 -1.245150  1.637986  0.902566
2017-08-04 -1.238186  0.181560  0.214879

# 7.3 Selecting rows with df.loc[row:row, column:column]

print("\noriginal df:\n",df)
print("\nFirst row df.loc[dates_index[0],:]\n",df.loc[dates_index[0],:])
print("\nRows 0-1, columns A to C df.loc[dates_index[0:2],'A':'C']\n",df.loc[dates_index[0:2],'A':'C'])
print("\nFrom row 2 (0-based) to the end df.loc[dates_index[2:],:]\n",df.loc[dates_index[2:],:])

original df:
                    A         B         C         D
2017-08-02  0.852031 -0.361794 -0.131658  1.293972
2017-08-03  0.042461 -0.405303  0.424170  0.207510
2017-08-04 -0.105956 -1.225223  0.545878 -0.256659
2017-08-05 -0.048689 -0.184247  1.389649 -0.209020
2017-08-06 -0.355166 -1.208937 -1.588495 -0.442820
2017-08-07  1.676568  1.037785  1.682274 -0.012409

First row df.loc[dates_index[0],:]
A    0.852031
B   -0.361794
C   -0.131658
D    1.293972
Name: 2017-08-02 00:00:00, dtype: float64

Rows 0-1, columns A to C df.loc[dates_index[0:2],'A':'C']
                    A         B         C
2017-08-02  0.852031 -0.361794 -0.131658
2017-08-03  0.042461 -0.405303  0.424170

From row 2 (0-based) to the end df.loc[dates_index[2:],:]
                    A         B         C         D
2017-08-04 -0.105956 -1.225223  0.545878 -0.256659
2017-08-05 -0.048689 -0.184247  1.389649 -0.209020
2017-08-06 -0.355166 -1.208937 -1.588495 -0.442820
2017-08-07  1.676568  1.037785  1.682274 -0.012409

# 7.4 Boolean indexing

print("\noriginal df:\n",df)
print("\ndf.A>0.5:\n",df.A>0.5) # df.A refers to column 'A' of df; this is equivalent to df['A']>0.5
print("\ndf[df.A>0.5]\n",df[df.A>0.5])
print("\ndf[df['A']>0.5]\n",df[df['A']>0.5])

original df:
                    A         B         C         D
2017-08-02  1.174908  0.294394  1.458811 -0.229504
2017-08-03  1.543965 -0.822035  0.019391 -0.236707
2017-08-04 -0.073200  0.614149 -0.385161  1.383675
2017-08-05 -0.125898 -0.264261 -0.237433  1.522009
2017-08-06  1.391828 -1.292755  0.779818  0.184977
2017-08-07  0.393325  1.358546 -1.131494 -0.036899

df.A>0.5:
 2017-08-02     True
2017-08-03     True
2017-08-04    False
2017-08-05    False
2017-08-06     True
2017-08-07    False
Freq: D, Name: A, dtype: bool

df[df.A>0.5]
                    A         B         C         D
2017-08-02  1.174908  0.294394  1.458811 -0.229504
2017-08-03  1.543965 -0.822035  0.019391 -0.236707
2017-08-06  1.391828 -1.292755  0.779818  0.184977

df[df['A']>0.5]
                    A         B         C         D
2017-08-02  1.174908  0.294394  1.458811 -0.229504
2017-08-03  1.543965 -0.822035  0.019391 -0.236707
2017-08-06  1.391828 -1.292755  0.779818  0.184977
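Several conditions can be combined with `&` (and) and `|` (or). Each comparison needs its own parentheses because these operators bind more tightly than `>`. A small sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.2, -0.1, 0.7, 0.3],
                   'B': [0.5, 1.5, -2.0, 0.9]})

# Rows where both conditions hold.
both = df[(df['A'] > 0.5) & (df['B'] > 0)]

# Rows where at least one condition holds.
either = df[(df['A'] > 0.5) | (df['B'] > 1)]

print(both.index.tolist())
print(either.index.tolist())
```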

# 8. Viewing the column names (columns), the row names (index), and the data (values)

data = np.random.randn(6,4)
dates_index = pd.date_range('20170802',periods = 6)
labeled_columns = list('ABCD')
df = pd.DataFrame(data,index = dates_index,columns = labeled_columns)

print('df.index:\n',df.index) # not offered by the editor's autocompletion, but it prints fine
print("df.columns\n",df.columns)
print("\nvalues:\n",df.values)

df.index:
 DatetimeIndex(['2017-08-02', '2017-08-03', '2017-08-04', '2017-08-05',
               '2017-08-06', '2017-08-07'],
              dtype='datetime64[ns]', freq='D')
df.columns
 Index(['A', 'B', 'C', 'D'], dtype='object')

values:
 [[-2.02446357 -0.714604    1.3423228  -1.6556258 ]
 [ 0.12293179 -0.18787551 -0.4763062   0.04510573]
 [ 0.92268382  0.86067222  0.14025608 -0.39711331]
 [-0.38946469 -0.95570081  0.05857063 -0.1566773 ]
 [ 1.39358622 -0.22103091  1.14675195 -0.90141337]
 [-1.64881973  1.26254782  1.11880424 -0.42361963]]

# 9. Traversing and iterating over a DataFrame

# 9.1 df.iterrows()

# Iterates over the rows of the DataFrame, yielding a tuple (index, Series) for each row
print("\nIterating over df; the original df is:\n",df)
print("\n&&&&&&&&&&&&&&&&&&&&\n")
for idx,row in df.iterrows():
    print('\nidx\n',idx)
    print(row)

Iterating over df; the original df is:
                    A         B         C         D
2017-08-02 -2.024464 -0.714604  1.342323 -1.655626
2017-08-03  0.122932 -0.187876 -0.476306  0.045106
2017-08-04  0.922684  0.860672  0.140256 -0.397113
2017-08-05 -0.389465 -0.955701  0.058571 -0.156677
2017-08-06  1.393586 -0.221031  1.146752 -0.901413
2017-08-07 -1.648820  1.262548  1.118804 -0.423620

&&&&&&&&&&&&&&&&&&&&

idx
 2017-08-02 00:00:00
A   -2.024464
B   -0.714604
C    1.342323
D   -1.655626
Name: 2017-08-02 00:00:00, dtype: float64

idx
 2017-08-03 00:00:00
A    0.122932
B   -0.187876
C   -0.476306
D    0.045106
Name: 2017-08-03 00:00:00, dtype: float64

idx
 2017-08-04 00:00:00
A    0.922684
B    0.860672
C    0.140256
D   -0.397113
Name: 2017-08-04 00:00:00, dtype: float64

idx
 2017-08-05 00:00:00
A   -0.389465
B   -0.955701
C    0.058571
D   -0.156677
Name: 2017-08-05 00:00:00, dtype: float64

idx
 2017-08-06 00:00:00
A    1.393586
B   -0.221031
C    1.146752
D   -0.901413
Name: 2017-08-06 00:00:00, dtype: float64

idx
 2017-08-07 00:00:00
A   -1.648820
B    1.262548
C    1.118804
D   -0.423620
Name: 2017-08-07 00:00:00, dtype: float64
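One consequence of getting each row as a Series is that the row has a single dtype, so mixed columns are upcast. The sketch below shows this on a small made-up frame with one integer and one float column.

```python
import pandas as pd

df = pd.DataFrame({'n': [1, 2], 'x': [0.5, 1.5]})

# Each row is a Series with one common dtype, so the integers in 'n' arrive as floats.
row_dtypes = [str(row.dtype) for _, row in df.iterrows()]
first_n_values = [row['n'] for _, row in df.iterrows()]

print(df.dtypes.to_dict())   # the column dtypes in df itself are unchanged
print(row_dtypes)
print(first_n_values)
```

If the original column dtypes matter, `itertuples()` in the next subsection is usually the safer choice.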

# 9.2 df.itertuples()

# Also iterates row by row, but yields a namedtuple; it is usually faster than iterrows
# because no conversion to a Series is needed
print("\n@@@@@@@@@@@@@@@@@@@@\n")
for row in df.itertuples():
    print(row)

@@@@@@@@@@@@@@@@@@@@

Pandas(Index=Timestamp('2017-08-02 00:00:00', freq='D'), A=-2.0244635731150744, B=-0.71460400319097506, C=1.342322802457689, D=-1.6556258003597193)
Pandas(Index=Timestamp('2017-08-03 00:00:00', freq='D'), A=0.12293179203239156, B=-0.18787551355369914, C=-0.47630620483341407, D=0.045105732709614245)
Pandas(Index=Timestamp('2017-08-04 00:00:00', freq='D'), A=0.92268382357994816, B=0.86067222202020854, C=0.14025607716961588, D=-0.39711330744921625)
Pandas(Index=Timestamp('2017-08-05 00:00:00', freq='D'), A=-0.389464692996358, B=-0.95570081033916021, C=0.058570629560594063, D=-0.15667729517068432)
Pandas(Index=Timestamp('2017-08-06 00:00:00', freq='D'), A=1.393586215688051, B=-0.22103091191411234, C=1.1467519519104892, D=-0.90141336789197646)
Pandas(Index=Timestamp('2017-08-07 00:00:00', freq='D'), A=-1.6488197318775826, B=1.2625478230114269, C=1.118804243889175, D=-0.42361962564746392)
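The fields of each namedtuple can be read by attribute, and `index=False` with `name=None` yields plain tuples. A minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3.0, 4.0]}, index=['x', 'y'])

# Column values are available as attributes; row.Index holds the row label.
labels = [row.Index for row in df.itertuples()]
totals = [row.A + row.B for row in df.itertuples()]

# Plain tuples without the index, e.g. for feeding another API.
plain_rows = list(df.itertuples(index=False, name=None))

print(labels)
print(totals)
print(plain_rows)
```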

# 9.3 df.items() (formerly df.iteritems())

# For a DataFrame this iterates over the columns, yielding (column name, Series) pairs.
# iteritems() was the older name for this method; it was removed in pandas 2.0, so items() is used here.
print("\n^^^^^^^^^^^^^^^^\n")
for c, col in df.items():
    print(c)
    print(col)

^^^^^^^^^^^^^^^^

A
2017-08-02   -2.024464
2017-08-03    0.122932
2017-08-04    0.922684
2017-08-05   -0.389465
2017-08-06    1.393586
2017-08-07   -1.648820
Freq: D, Name: A, dtype: float64
B
2017-08-02   -0.714604
2017-08-03   -0.187876
2017-08-04    0.860672
2017-08-05   -0.955701
2017-08-06   -0.221031
2017-08-07    1.262548
Freq: D, Name: B, dtype: float64
C
2017-08-02    1.342323
2017-08-03   -0.476306
2017-08-04    0.140256
2017-08-05    0.058571
2017-08-06    1.146752
2017-08-07    1.118804
Freq: D, Name: C, dtype: float64
D
2017-08-02   -1.655626
2017-08-03    0.045106
2017-08-04   -0.397113
2017-08-05   -0.156677
2017-08-06   -0.901413
2017-08-07   -0.423620
Freq: D, Name: D, dtype: float64
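Column-wise iteration is handy for building a per-column summary. A minimal sketch with `items()` on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# items() yields (column name, column Series) pairs, in column order.
column_names = [name for name, _ in df.items()]
column_sums = {name: int(col.sum()) for name, col in df.items()}

print(column_names)
print(column_sums)
```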

# 10. Number of rows and dtypes

print("\nNumber of rows in df:",df.shape[0])
print("\ndtypes of df:\n",df.dtypes)

Number of rows in df: 6

dtypes of df:
A    float64
B    float64
C    float64
D    float64
dtype: object
