Python pandas 入门
来源:互联网 发布:嗅探 软件 编辑:程序博客网 时间:2024/05/20 23:05
pandas读取csv文件
pandas中为我们提供了丰富的读取文件的接口,对数据处理极为的方便。
创建数据
from pandas import DataFrame,read_csvimport matplotlib.pyplot as pltimport pandas as pdimport sysimport matplotlibprint('Pandas version' + sys.version)print('Pandas version' + pd.__version__)print('Matplotlib version' + matplotlib.__version__)# create datanames = ['zhangsan','lisi','wangwu','zhaoliu']ages = [11,12,13,14]dataset = zip(names,ages)dataset
将我们的数据转变为pandas的格式:
df = pd.DataFrame(data = dataset,columns = ['Names','Age'])dfOut[47]: Names Age0 zhangsan 111 lisi 122 wangwu 133 zhaoliu 14
将我们的数据写入到一个csv文件中去,利用pandas的接口
df.to_csv('age.csv',index = False,header = False)
这里的两个参数分别是前面的索引和表头,可以看到我们的文件已经生成了。
获取数据
Location = 'E:\\code\\python\\pandas\\age.csv'df = pd.read_csv(Location)dfOut[50]: zhangsan 110 lisi 121 wangwu 132 zhaoliu 14
出现一个问题是我们的df会将数据的第一条判定为header,所以我们需要指定名字。
df = pd.read_csv(Location,names =['Names','Age'])dfOut[53]: Names Age0 zhangsan 111 lisi 122 wangwu 133 zhaoliu 14
分析数据
pandas提供了一些接口就像数据库的操作一样。
Sorted = df.sort_values(['Age'], ascending = False)SortedOut[56]: Names Age3 zhaoliu 142 wangwu 131 lisi 120 zhangsan 11df['Age'].max()Out[55]: 14
呈现数据
pandas提供了数据的可视化
df['Age'].plot()
pandas读取txt文件
pandas读取txt文件和读取csv很像。
pandas读取excel文件
创建数据
写入一个函数中
import pandas as pdimport matplotlib.pyplot as pltimport sysimport matplotlibimport numpy.random as npnp.seed(111)def CreateDataSet(Number): Output = [] for i in range(Number): #Create a weekly data rang rng = pd.date_range(start='1/1/2009',end='12/31/2012',freq='W-MON') #Create random data data = np.randint(low=25,high=1000,size=len(rng)) #Status pool status = [1,2,3] random_status = [status[np.randint(low=0,high=len(status))] for i in range(len(rng))] states = ['GA','FL','f1','NY','NJ','TX'] random_states = [states[np.randint(low=0,high=len(states))] for i in range(len(rng))] Output.extend(zip(random_states,random_status,data,rng)) return Outputdataset = CreateDataSet(4)print datasetdf = pd.DataFrame(data = dataset,columns = ['state','status','CustomerCount','StatusDate'])df.to_excel('Lesson3.xlsx',index = False)print ('Done')
这里只要注意一下它产生随机数时候的方法就可以了。
从文件中读取数据
Location = ‘E:\code\python\pandas\Lesson3.xlsx’
parse a specific sheet
Location = 'E:\\code\\python\\pandas\\Lesson3.xlsx'#parse a specific sheetdf = pd.read_excel(Location,0,index_col = 'StatusDate')df.dtypesdf.indexdf.head()Out[4]: state status CustomerCountStatusDate 2009-01-05 GA 1 8772009-01-12 FL 1 9012009-02-02 GA 1 3002009-03-09 NY 1 992
数据预处理
df['state'].unique()#可以筛选出一共几种statedf['state'] = df.state.apply(lambda x:x.upper())mask = df['status'] == 1df = df[mask]df.state[df.state=='NJ']='NY'
数据的一些基本的操作:
df['CustomerCount'].plot(figsize=(15,5))#画图sortdf = df[df['state']=='NY'].sort_index(axis=0)#按照横轴进行排序sortdf.head(10)Daily.loc['FL'].plot()Daily.loc['GA'].plot()
画图呈现我们的数据
同时可以精确到月份
数据行列的增加删除
In [5]: dfOut[5]: Rev0 01 12 23 34 45 56 67 78 89 9In [6]: df['NelCol'] = 5In [7]: dfOut[7]: Rev NelCol0 0 51 1 52 2 53 3 54 4 55 5 56 6 57 7 58 8 59 9 5In [9]: df['NelCol'] = df['NelCol'] + 1In [10]: dfOut[10]: Rev NelCol0 0 61 1 62 2 63 3 64 4 65 5 66 6 67 7 68 8 69 9 6In [11]: del df['NelCol']In [12]: dfOut[12]: Rev0 01 12 23 34 45 56 67 78 89 9In [13]: df['test'] = 3In [14]: dfOut[14]: Rev test0 0 31 1 32 2 33 3 34 4 35 5 36 6 37 7 38 8 39 9 3In [15]: df['col'] = df['Rev']In [16]: dfOut[16]: Rev test col0 0 3 01 1 3 12 2 3 23 3 3 34 4 3 45 5 3 56 6 3 67 7 3 78 8 3 89 9 3 9In [17]: i = ['a','b','c','d','e','f','g','h','i','j']In [18]: df.index = iIn [19]: dfOut[19]: Rev test cola 0 3 0b 1 3 1c 2 3 2d 3 3 3e 4 3 4f 5 3 5g 6 3 6h 7 3 7i 8 3 8j 9 3 9In [21]: df.loc['a']Out[21]: Rev 0test 3col 0Name: a, dtype: int64In [22]: df.loc['a':'d']Out[22]: Rev test cola 0 3 0b 1 3 1c 2 3 2d 3 3 3In [23]: df['Rev']Out[23]: a 0b 1c 2d 3e 4f 5g 6h 7i 8j 9Name: Rev, dtype: int32In [24]: df[0:3,'Rev']In [25]: df.ix[0:3,'Rev']Out[25]: a 0b 1c 2Name: Rev, dtype: int32In [26]: df.ix[5:,'col']Out[26]: f 5g 6h 7i 8j 9Name: col, dtype: int32In [27]: df.head()Out[27]: Rev test cola 0 3 0b 1 3 1c 2 3 2d 3 3 3e 4 3 4In [28]: df.head(2)Out[28]: Rev test cola 0 3 0b 1 3 1In [29]: df.tail(2)Out[29]: Rev test coli 8 3 8j 9 3 9
pandas中的groupby语句
首先创建数据框
In [11]: dfOut[11]: letter one two0 a 1 21 a 1 22 b 1 23 b 1 24 c 1 2
看一下group的效果:
In [14]: one = df.groupby('letter')In [15]: oneOut[15]: <pandas.core.groupby.DataFrameGroupBy object at 0x00000000094BDFD0>In [16]: one.sum()Out[16]: one twoletter a 2 4b 2 4c 1 2
groupby 两个属性
In [18]: letterone = df.groupby(['letter','one']).sum()In [19]: letteroneOut[19]: twoletter one a 1 4b 1 4c 1 2
不想将我们的letter作为索引
In [21]: letterone = df.groupby(['letter','one'],as_index = False).sum()In [22]: letteroneOut[22]: letter one two0 a 1 41 b 1 42 c 1 2
pandas计算极端值
1:创建我们的数据
In [14]: df
Out[14]:
Revenue State
2012-01-01 1.0 NY
2012-02-01 2.0 NY
2012-03-01 3.0 NY
2012-04-01 4.0 NY
2012-05-01 5.0 FL
2012-06-01 6.0 FL
2012-07-01 7.0 GA
2012-08-01 8.0 GA
2012-09-01 9.0 FL
2012-10-01 10.0 FL
2013-01-01 10.0 NY
2013-02-01 10.0 NY
2013-03-01 9.0 NY
2013-04-01 9.0 NY
2013-05-01 8.0 FL
2013-06-01 8.0 FL
2013-07-01 7.0 GA
2013-08-01 7.0 GA
2013-09-01 6.0 FL
2013-10-01 6.0 FL
计算均值,方差而特殊值
In [15]: newdf = df.copy()In [16]: newdf['x_Mean'] = abs(newdf['Revenue'] - newdf['Revenue'].mean())In [17]: newdf['1.96*std'] = 1.96*newdf['Revenue'].std()In [19]: newdf['Outlier'] = newdf['x_Mean'] > newdf['1.96*std']newdf.head()Out[31]: Revenue State Outlier x_Mean 1.96*std2012-01-01 1.0 NY False 5.00 7.5548132012-02-01 2.0 NY False 4.00 7.5548132012-03-01 3.0 NY False 3.00 7.5548132012-04-01 4.0 NY False 2.00 7.5548132012-05-01 5.0 FL False 2.25 3.434996
用状态进行分组
In [22]: newdf = df.copy()
In [23]: State = newdf.groupby(‘State’)
运用lambda函数:
In [27]: newdf['Outlier']=State.transform(lambda x:abs(x - x.mean())>1.96*x.std())In [28]: newdf['x_Mean']=State.transform(lambda x:abs(x - x.mean()))In [29]: newdf['1.96*std'] = State.transform(lambda x:x.std())In [30]: newdf['1.96*std'] = State.transform(lambda x:x.std()*1.96)
可以用一个函数实现上述功能:
newdf = df.copy()StateMonth = newdf.groupby(['State', lambda x: x.month])def s(group): group['x-Mean'] = abs(group['Revenue'] - group['Revenue'].mean()) group['1.96*std'] = 1.96*group['Revenue'].std() group['Outlier'] = abs(group['Revenue'] - group['Revenue'].mean()) > 1.96*group['Revenue'].std() return groupNewdf2 = StateMonth.apply(s)
- Python pandas 入门
- Python pandas快速入门
- Python 数据分析包:pandas 入门
- 利用python进行数据分析-pandas入门
- python学习笔记一(pandas入门)
- python机器学习库入门之pandas
- 利用Python数据分析:pandas入门(四)
- 利用Python数据分析:pandas入门(五)
- 利用Python数据分析:pandas入门(六)
- Python——数据分析Pandas入门
- Python数据分析入门-Pandas环境搭建
- python数据分析(pandas入门)
- python包pandas安装和入门
- pandas入门
- pandas入门
- Pandas入门
- Pandas入门
- pandas入门
- 单片机AD-DA转换
- 6. ZigZag Conversion(Z字形输出数组)
- 分销算法 无限级树形结构算法
- 【翻译】Android Interface Definition Language (AIDL)
- 脚本增加浏览量
- Python pandas 入门
- 广播的实践-强制下线功能
- uiautomator 之uiautomator 2.0 UiObject2学习收藏
- linux学习笔记2----Torvalds的Linux发展
- 笔记5:HTML总结!
- Android Studio 使用过程中出现的问题及解决办法
- linux:select()函数详解
- 仿淘宝商品详情页中(继续拖动到图文详情)
- jquery中的ajax