Python pandas 入门

来源：互联网发布：嗅探软件编辑：程序博客网时间：2024/05/20 23:05

pandas读取csv文件

pandas中为我们提供了丰富的读取文件的接口，对数据处理极为的方便。

创建数据

from pandas import DataFrame,read_csvimport matplotlib.pyplot as pltimport pandas as pdimport sysimport matplotlibprint('Pandas version' + sys.version)print('Pandas version' + pd.__version__)print('Matplotlib version' + matplotlib.__version__)# create datanames = ['zhangsan','lisi','wangwu','zhaoliu']ages = [11,12,13,14]dataset = zip(names,ages)dataset

将我们的数据转变为pandas的格式：

df = pd.DataFrame(data = dataset,columns = ['Names','Age'])dfOut[47]:       Names  Age0  zhangsan   111      lisi   122    wangwu   133   zhaoliu   14

将我们的数据写入到一个csv文件中去，利用pandas的接口

df.to_csv('age.csv',index = False,header = False)

这里的两个参数分别是前面的索引和表头，可以看到我们的文件已经生成了。

获取数据

Location = 'E:\\code\\python\\pandas\\age.csv'df = pd.read_csv(Location)dfOut[50]:   zhangsan  110     lisi  121   wangwu  132  zhaoliu  14

出现一个问题是我们的df会将数据的第一条判定为header，所以我们需要指定名字。

df = pd.read_csv(Location,names =['Names','Age'])dfOut[53]:       Names  Age0  zhangsan   111      lisi   122    wangwu   133   zhaoliu   14

分析数据

pandas提供了一些接口就像数据库的操作一样。

Sorted = df.sort_values(['Age'], ascending = False)SortedOut[56]:       Names  Age3   zhaoliu   142    wangwu   131      lisi   120  zhangsan   11df['Age'].max()Out[55]: 14

呈现数据

pandas提供了数据的可视化

df['Age'].plot()

这里写图片描述

pandas读取txt文件

pandas读取txt文件和读取csv很像。

pandas读取excel文件

创建数据

写入一个函数中

import pandas as pdimport matplotlib.pyplot as pltimport sysimport matplotlibimport numpy.random as npnp.seed(111)def CreateDataSet(Number):    Output = []    for i in range(Number):        #Create a weekly data rang        rng = pd.date_range(start='1/1/2009',end='12/31/2012',freq='W-MON')        #Create random data        data = np.randint(low=25,high=1000,size=len(rng))        #Status pool        status = [1,2,3]        random_status = [status[np.randint(low=0,high=len(status))] for i in range(len(rng))]                states = ['GA','FL','f1','NY','NJ','TX']        random_states = [states[np.randint(low=0,high=len(states))] for i in range(len(rng))]        Output.extend(zip(random_states,random_status,data,rng))    return Outputdataset = CreateDataSet(4)print datasetdf = pd.DataFrame(data = dataset,columns = ['state','status','CustomerCount','StatusDate'])df.to_excel('Lesson3.xlsx',index = False)print ('Done')

这里只要注意一下它产生随机数时候的方法就可以了。

从文件中读取数据

Location = ‘E:\code\python\pandas\Lesson3.xlsx’

parse a specific sheet

Location = 'E:\\code\\python\\pandas\\Lesson3.xlsx'#parse a specific sheetdf = pd.read_excel(Location,0,index_col = 'StatusDate')df.dtypesdf.indexdf.head()Out[4]:            state  status  CustomerCountStatusDate                             2009-01-05    GA       1            8772009-01-12    FL       1            9012009-02-02    GA       1            3002009-03-09    NY       1            992

数据预处理

df['state'].unique()#可以筛选出一共几种statedf['state'] = df.state.apply(lambda x:x.upper())mask = df['status'] == 1df = df[mask]df.state[df.state=='NJ']='NY'

数据的一些基本的操作:

df['CustomerCount'].plot(figsize=(15,5))#画图sortdf = df[df['state']=='NY'].sort_index(axis=0)#按照横轴进行排序sortdf.head(10)Daily.loc['FL'].plot()Daily.loc['GA'].plot()

画图呈现我们的数据
这里写图片描述

这里写图片描述

同时可以精确到月份
这里写图片描述

数据行列的增加删除

In [5]: dfOut[5]:    Rev0    01    12    23    34    45    56    67    78    89    9In [6]: df['NelCol'] = 5In [7]: dfOut[7]:    Rev  NelCol0    0       51    1       52    2       53    3       54    4       55    5       56    6       57    7       58    8       59    9       5In [9]: df['NelCol'] = df['NelCol'] + 1In [10]: dfOut[10]:    Rev  NelCol0    0       61    1       62    2       63    3       64    4       65    5       66    6       67    7       68    8       69    9       6In [11]: del df['NelCol']In [12]: dfOut[12]:    Rev0    01    12    23    34    45    56    67    78    89    9In [13]: df['test'] = 3In [14]: dfOut[14]:    Rev  test0    0     31    1     32    2     33    3     34    4     35    5     36    6     37    7     38    8     39    9     3In [15]: df['col'] = df['Rev']In [16]: dfOut[16]:    Rev  test  col0    0     3    01    1     3    12    2     3    23    3     3    34    4     3    45    5     3    56    6     3    67    7     3    78    8     3    89    9     3    9In [17]: i = ['a','b','c','d','e','f','g','h','i','j']In [18]: df.index = iIn [19]: dfOut[19]:    Rev  test  cola    0     3    0b    1     3    1c    2     3    2d    3     3    3e    4     3    4f    5     3    5g    6     3    6h    7     3    7i    8     3    8j    9     3    9In [21]: df.loc['a']Out[21]: Rev     0test    3col     0Name: a, dtype: int64In [22]: df.loc['a':'d']Out[22]:    Rev  test  cola    0     3    0b    1     3    1c    2     3    2d    3     3    3In [23]: df['Rev']Out[23]: a    0b    1c    2d    3e    4f    5g    6h    7i    8j    9Name: Rev, dtype: int32In [24]: df[0:3,'Rev']In [25]: df.ix[0:3,'Rev']Out[25]: a    0b    1c    2Name: Rev, dtype: int32In [26]: df.ix[5:,'col']Out[26]: f    5g    6h    7i    8j    9Name: col, dtype: int32In [27]: df.head()Out[27]:    Rev  test  cola    0     3    0b    1     3    1c    2     3    2d    3     3    3e    4     3    4In [28]: df.head(2)Out[28]:    Rev  test  cola    0     3    0b    1     3    1In [29]: df.tail(2)Out[29]:    Rev  test  coli    8     3    8j    9     3    9

pandas中的groupby语句

首先创建数据框

In [11]: dfOut[11]:   letter  one  two0      a    1    21      a    1    22      b    1    23      b    1    24      c    1    2

看一下group的效果：

In [14]: one = df.groupby('letter')In [15]: oneOut[15]: <pandas.core.groupby.DataFrameGroupBy object at 0x00000000094BDFD0>In [16]: one.sum()Out[16]:         one  twoletter          a         2    4b         2    4c         1    2

groupby 两个属性

In [18]: letterone = df.groupby(['letter','one']).sum()In [19]: letteroneOut[19]:             twoletter one     a      1      4b      1      4c      1      2

不想将我们的letter作为索引

In [21]: letterone = df.groupby(['letter','one'],as_index = False).sum()In [22]: letteroneOut[22]:   letter  one  two0      a    1    41      b    1    42      c    1    2

pandas计算极端值

1:创建我们的数据
In [14]: df

Out[14]:

        Revenue State

2012-01-01 1.0 NY

2012-02-01 2.0 NY

2012-03-01 3.0 NY

2012-04-01 4.0 NY

2012-05-01 5.0 FL

2012-06-01 6.0 FL

2012-07-01 7.0 GA

2012-08-01 8.0 GA

2012-09-01 9.0 FL

2012-10-01 10.0 FL

2013-01-01 10.0 NY

2013-02-01 10.0 NY

2013-03-01 9.0 NY

2013-04-01 9.0 NY

2013-05-01 8.0 FL

2013-06-01 8.0 FL

2013-07-01 7.0 GA

2013-08-01 7.0 GA

2013-09-01 6.0 FL

2013-10-01 6.0 FL

计算均值，方差而特殊值

In [15]: newdf = df.copy()In [16]: newdf['x_Mean'] = abs(newdf['Revenue'] - newdf['Revenue'].mean())In [17]: newdf['1.96*std'] = 1.96*newdf['Revenue'].std()In [19]: newdf['Outlier'] = newdf['x_Mean'] > newdf['1.96*std']newdf.head()Out[31]:             Revenue State Outlier  x_Mean  1.96*std2012-01-01      1.0    NY   False    5.00  7.5548132012-02-01      2.0    NY   False    4.00  7.5548132012-03-01      3.0    NY   False    3.00  7.5548132012-04-01      4.0    NY   False    2.00  7.5548132012-05-01      5.0    FL   False    2.25  3.434996

用状态进行分组
In [22]: newdf = df.copy()
In [23]: State = newdf.groupby(‘State’)

运用lambda函数：

In [27]: newdf['Outlier']=State.transform(lambda x:abs(x - x.mean())>1.96*x.std())In [28]: newdf['x_Mean']=State.transform(lambda x:abs(x - x.mean()))In [29]: newdf['1.96*std'] = State.transform(lambda x:x.std())In [30]: newdf['1.96*std'] = State.transform(lambda x:x.std()*1.96)

可以用一个函数实现上述功能：

newdf = df.copy()StateMonth = newdf.groupby(['State', lambda x: x.month])def s(group):    group['x-Mean'] = abs(group['Revenue'] - group['Revenue'].mean())    group['1.96*std'] = 1.96*group['Revenue'].std()      group['Outlier'] = abs(group['Revenue'] - group['Revenue'].mean()) > 1.96*group['Revenue'].std()    return groupNewdf2 = StateMonth.apply(s)

0 0