Python3 pandas DataFrame usage (basic notes)
Source: Internet · Editor: 程序博客网 · Time: 2024/06/07 12:36
Creating a DataFrame
(1) From a dict whose values are lists
(2) From Series
(3) From a list of dicts
Broadcasting
Locating elements in a DataFrame
(1) Boolean indexing
(2) loc, iloc and ix
If you can locate a value, you can assign to it
info() and describe()
Filling missing values: fillna, ffill, bfill
Hierarchical index
unstack: converting a Series into a DataFrame
CSV reading and writing: read_csv/to_csv
A DataFrame is a table. A Series can be thought of as one-dimensional data, while a DataFrame is two-dimensional; a DataFrame can be built from multiple Series (a DataFrame can be understood as a collection of Series that share an index).
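As a minimal sketch of that relationship (the column names `x` and `y` are made up for illustration), two Series sharing an index become the columns of one DataFrame:

```python
import pandas as pd

# Two Series with the same index labels...
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'], name='x')
s2 = pd.Series([4, 5, 6], index=['a', 'b', 'c'], name='y')

# ...become the columns of a single DataFrame.
df = pd.DataFrame({'x': s1, 'y': s2})
print(df)
```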
Creating a DataFrame
(1) From a dict whose values are lists
import pandas as pd

population = {'city': ['Beijing','Shanghai','Guangzhou','Shenzhen','Hangzhou','Chongqing'],
              'year': [2016,2017,2016,2017,2016,2016],
              'population': [2100,2300,1000,700,500,500]}
population = pd.DataFrame(population)
print(population)
        city  population  year
0    Beijing        2100  2016
1   Shanghai        2300  2017
2  Guangzhou        1000  2016
3   Shenzhen         700  2017
4   Hangzhou         500  2016
5  Chongqing         500  2016
pdc = pd.DataFrame(population, columns=['year','city','population'])  # the columns parameter sets the column order
print(pdc)
   year       city  population
0  2016    Beijing        2100
1  2017   Shanghai        2300
2  2016  Guangzhou        1000
3  2017   Shenzhen         700
4  2016   Hangzhou         500
5  2016  Chongqing         500
tmp = {'city': ['Beijing','Shanghai','Guangzhou','Shenzhen','Hangzhou','Chongqing'],
       'year': [2016,2017,2016,2017,2016,2016],
       'population': [2100,2300,1000,700,500,500]}
pdci = pd.DataFrame(tmp, columns=['year','city','population'],
                    index=['one','two','three','four','five','six'])  # set both the row index and the column order
print(pdci)
       year       city  population
one    2016    Beijing        2100
two    2017   Shanghai        2300
three  2016  Guangzhou        1000
four   2017   Shenzhen         700
five   2016   Hangzhou         500
six    2016  Chongqing         500
(2) From Series
cities = {'Beijing': 55000, 'Shanghai': 60000, 'shenzhen': 50000,
          'Hangzhou': 20000, 'Guangzhou': 45000, 'Suzhou': None}
apts = pd.Series(cities, name='income')
apts['shenzhen'] = 70000
less_than_50000 = (apts < 50000)
apts[less_than_50000] = 40000
apts2 = pd.Series({'Beijing': 10000, 'Shanghai': 8000, 'shenzhen': 6000,
                   'Tianjin': 40000, 'Guangzhou': 7000, 'Chongqing': 30000})
# print(apts2)
apts = apts + apts2               # indexes are aligned; labels missing on one side become NaN
apts[apts.isnull()] = apts.mean()
# print(apts)
df = pd.DataFrame({'apts': apts, 'apts2': apts2})
print(df)
              apts    apts2
Beijing    65000.0  10000.0
Chongqing  64000.0  30000.0
Guangzhou  47000.0   7000.0
Hangzhou   64000.0      NaN
Shanghai   68000.0   8000.0
Suzhou     64000.0      NaN
Tianjin    64000.0  40000.0
shenzhen   76000.0   6000.0
(3) From a list of dicts
data = [{'JackMa': 99999999999, 'Han': 5000, 'David': 10000},
        {'JackMa': 99999999998, 'Han': 4000, 'David': 11000}]
pdl = pd.DataFrame(data, index=['salary1','salary2'])
print(pdl)
         David   Han       JackMa
salary1  10000  5000  99999999999
salary2  11000  4000  99999999998
Broadcasting
cities = {'Beijing': 55000, 'Shanghai': 60000, 'shenzhen': 50000,
          'Hangzhou': 20000, 'Guangzhou': 45000, 'Suzhou': None}
apts = pd.Series(cities, name='income')
apts['shenzhen'] = 70000
less_than_50000 = (apts < 50000)
apts[less_than_50000] = 40000
apts2 = pd.Series({'Beijing': 10000, 'Shanghai': 8000, 'shenzhen': 6000,
                   'Tianjin': 40000, 'Guangzhou': 7000, 'Chongqing': 30000})
apts = apts + apts2
apts[apts.isnull()] = apts.mean()
df = pd.DataFrame({'apts': apts, 'apts2': apts2})
#print(df)
df['bonus'] = 2000   # add a new column bonus; the scalar 2000 is broadcast to every row
print(df)
              apts    apts2  bonus
Beijing    65000.0  10000.0   2000
Chongqing  64000.0  30000.0   2000
Guangzhou  47000.0   7000.0   2000
Hangzhou   64000.0      NaN   2000
Shanghai   68000.0   8000.0   2000
Suzhou     64000.0      NaN   2000
Tianjin    64000.0  40000.0   2000
shenzhen   76000.0   6000.0   2000
df['income'] = df['apts']*2 + df['apts2']*1.5 + df['bonus']
print(df)
              apts    apts2  bonus    income
Beijing    65000.0  10000.0   2000  147000.0
Chongqing  64000.0  30000.0   2000  175000.0
Guangzhou  47000.0   7000.0   2000  106500.0
Hangzhou   64000.0      NaN   2000       NaN
Shanghai   68000.0   8000.0   2000  150000.0
Suzhou     64000.0      NaN   2000       NaN
Tianjin    64000.0  40000.0   2000  190000.0
shenzhen   76000.0   6000.0   2000  163000.0
print(df.index)
Index(['Beijing', 'Chongqing', 'Guangzhou', 'Hangzhou', 'Shanghai', 'Suzhou', 'Tianjin', 'shenzhen'], dtype='object')
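The assignments above rely on two broadcasting cases: a scalar broadcast down a whole column, and element-wise arithmetic between columns aligned on the index. A small self-contained sketch (this frame is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

df['c'] = df['a'] + 100       # scalar broadcast to every row
df['d'] = df['a'] * df['b']   # element-wise, aligned on the index
print(df)
```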
Locating elements in a DataFrame
(1) Boolean indexing
import pandas as pd

cities = {'Beijing': 55000, 'Shanghai': 60000, 'shenzhen': 50000,
          'Hangzhou': 20000, 'Guangzhou': 45000, 'Suzhou': None}
apts = pd.Series(cities, name='income')
apts['shenzhen'] = 70000
less_than_50000 = (apts < 50000)
apts[less_than_50000] = 40000
apts2 = pd.Series({'Beijing': 10000, 'Shanghai': 8000, 'shenzhen': 6000,
                   'Tianjin': 40000, 'Guangzhou': 7000, 'Chongqing': 30000})
apts = apts + apts2
apts[apts.isnull()] = apts.mean()
df = pd.DataFrame({'apts': apts, 'apts2': apts2})
df['bonus'] = 2000                # new column bonus, scalar broadcast to every row
df['income'] = df['apts']*2 + df['apts2']*1.5 + df['bonus']
#print(df)
#              apts    apts2  bonus    income
#Beijing    65000.0  10000.0   2000  147000.0
#Chongqing  64000.0  30000.0   2000  175000.0
#Guangzhou  47000.0   7000.0   2000  106500.0
#Hangzhou   64000.0      NaN   2000       NaN
#Shanghai   68000.0   8000.0   2000  150000.0
#Suzhou     64000.0      NaN   2000       NaN
#Tianjin    64000.0  40000.0   2000  190000.0
#shenzhen   76000.0   6000.0   2000  163000.0
print(df.apts == 64000)
print(df['apts'] == 64000)        # the same boolean condition
Beijing      False
Chongqing     True
Guangzhou    False
Hangzhou      True
Shanghai     False
Suzhou        True
Tianjin       True
shenzhen     False
Name: apts, dtype: bool
print(df[df['apts'] == 64000])   # row selection: keep the rows whose apts column equals 64000
              apts    apts2  bonus    income
Chongqing  64000.0  30000.0   2000  175000.0
Hangzhou   64000.0      NaN   2000       NaN
Suzhou     64000.0      NaN   2000       NaN
Tianjin    64000.0  40000.0   2000  190000.0
df[df.apts == 64000]['income'] = 200000   # chained assignment: triggers a SettingWithCopyWarning; the write goes to a temporary copy, so the original df is unchanged
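The safe way to write that assignment is a single `.loc` call that selects rows and column in one indexing operation (sketched here on a made-up three-row frame):

```python
import pandas as pd

df = pd.DataFrame({'apts': [64000.0, 47000.0, 64000.0],
                   'income': [175000.0, 106500.0, 190000.0]},
                  index=['Chongqing', 'Guangzhou', 'Tianjin'])

# Row mask and column label in one .loc call: the assignment hits df itself.
df.loc[df['apts'] == 64000, 'income'] = 200000
print(df['income'])
```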
(2) loc, iloc and ix
loc: index rows by row label
print(df.loc['Hangzhou'])   # select a single row
apts      64000.0
apts2         NaN
bonus      2000.0
income        NaN
Name: Hangzhou, dtype: float64
print(df.loc[['Hangzhou','Shanghai']])
             apts   apts2  bonus    income
Hangzhou  64000.0     NaN   2000       NaN
Shanghai  68000.0  8000.0   2000  150000.0
print(df.loc[df['apts'] == 64000, ['apts2','apts','bonus']])   # the first argument selects rows, the second selects columns
             apts2     apts  bonus
Chongqing  30000.0  64000.0   2000
Hangzhou       NaN  64000.0   2000
Suzhou         NaN  64000.0   2000
Tianjin    40000.0  64000.0   2000
iloc: index rows by integer position
print(df.iloc[0:5])
              apts    apts2  bonus    income
Beijing    65000.0  10000.0   2000  147000.0
Chongqing  64000.0  30000.0   2000  175000.0
Guangzhou  47000.0   7000.0   2000  106500.0
Hangzhou   64000.0      NaN   2000       NaN
Shanghai   68000.0   8000.0   2000  150000.0
ix: index rows by label or by integer position (a mix of loc and iloc). Note that ix was deprecated in pandas 0.20 and removed in 1.0.
print(df.ix[1:4, 1:3])   # select by row and column position
             apts2  bonus
Chongqing  30000.0   2000
Guangzhou   7000.0   2000
Hangzhou       NaN   2000
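In current pandas the purely positional version of that call is written with `iloc`, and the label-based equivalent with `loc` (a sketch on a made-up frame with the same shape):

```python
import pandas as pd

df = pd.DataFrame({'apts': [65000.0, 64000.0, 47000.0, 64000.0],
                   'apts2': [10000.0, 30000.0, 7000.0, None],
                   'bonus': [2000, 2000, 2000, 2000]},
                  index=['Beijing', 'Chongqing', 'Guangzhou', 'Hangzhou'])

# Positional: rows 1..3 and columns 1..2 (end excluded, like Python slices).
print(df.iloc[1:4, 1:3])
# Label-based: loc slices are inclusive on both ends.
print(df.loc['Chongqing':'Hangzhou', 'apts2':'bonus'])
```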
If you can locate a value, you can assign to it
df.loc[:, 'income'] = 5000
print(df)
              apts    apts2  bonus  income
Beijing    65000.0  10000.0   2000    5000
Chongqing  64000.0  30000.0   2000    5000
Guangzhou  47000.0   7000.0   2000    5000
Hangzhou   64000.0      NaN   2000    5000
Shanghai   68000.0   8000.0   2000    5000
Suzhou     64000.0      NaN   2000    5000
Tianjin    64000.0  40000.0   2000    5000
shenzhen   76000.0   6000.0   2000    5000
info() and describe()
info
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, Beijing to shenzhen
Data columns (total 4 columns):
apts      8 non-null float64
apts2     6 non-null float64
bonus     8 non-null int64
income    8 non-null int64
dtypes: float64(2), int64(2)
memory usage: 320.0+ bytes
None
describe
print(df.describe())
               apts         apts2   bonus  income
count      8.000000      6.000000     8.0     8.0
mean   64000.000000  16833.333333  2000.0  5000.0
std     8017.837257  14483.323744     0.0     0.0
min    47000.000000   6000.000000  2000.0  5000.0
25%    64000.000000   7250.000000  2000.0  5000.0
50%    64000.000000   9000.000000  2000.0  5000.0
75%    65750.000000  25000.000000  2000.0  5000.0
max    76000.000000  40000.000000  2000.0  5000.0
head
print(df.head(2))
              apts    apts2  bonus  income
Beijing    65000.0  10000.0   2000    5000
Chongqing  64000.0  30000.0   2000    5000
tail
print(df.tail(2))
             apts    apts2  bonus  income
Tianjin   64000.0  40000.0   2000    5000
shenzhen  76000.0   6000.0   2000    5000
Combining boolean conditions
#df2.loc[((df2['dow']==0) | (df2['dow']==2) | (df2['dow']==4)), :]
#df2.loc[df2['dow'].isin([0,2,4]), :]   # the argument can be a list, a numpy array, or a Series
#~(df2['dow'].isin([0,2,4]))            # negation of a boolean mask
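The commented lines above, made runnable on a small made-up frame (`df2` and its `dow` column are illustrative only):

```python
import pandas as pd

df2 = pd.DataFrame({'dow': [0, 1, 2, 3, 4, 5, 6],
                    'sales': [10, 20, 30, 40, 50, 60, 70]})

# Combine conditions with | (or), & (and); each clause needs parentheses.
a = df2.loc[(df2['dow'] == 0) | (df2['dow'] == 2) | (df2['dow'] == 4), :]
b = df2.loc[df2['dow'].isin([0, 2, 4]), :]    # same rows, more compact
c = df2.loc[~df2['dow'].isin([0, 2, 4]), :]   # ~ negates the mask: the complement
print(a.equals(b))
```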
Filling missing values: fillna, ffill, bfill
fillna
import pandas as pd

cities = {'Beijing': 55000, 'Shanghai': 60000, 'shenzhen': 50000,
          'Hangzhou': 20000, 'Guangzhou': 45000, 'Suzhou': None}
apts = pd.Series(cities, name='income')
apts['shenzhen'] = 70000
less_than_50000 = (apts < 50000)
apts[less_than_50000] = 40000
apts2 = pd.Series({'Beijing': 10000, 'Shanghai': 8000, 'shenzhen': 6000,
                   'Tianjin': 40000, 'Guangzhou': 7000, 'Chongqing': 30000})
apts = apts + apts2
apts[apts.isnull()] = apts.mean()
df = pd.DataFrame({'apts': apts, 'apts2': apts2})
df['bonus'] = 2000
df['income'] = df['apts']*2 + df['apts2']*1.5 + df['bonus']
#print(df)
#              apts    apts2  bonus    income
#Beijing    65000.0  10000.0   2000  147000.0
#Chongqing  64000.0  30000.0   2000  175000.0
#Guangzhou  47000.0   7000.0   2000  106500.0
#Hangzhou   64000.0      NaN   2000       NaN
#Shanghai   68000.0   8000.0   2000  150000.0
#Suzhou     64000.0      NaN   2000       NaN
#Tianjin    64000.0  40000.0   2000  190000.0
#shenzhen   76000.0   6000.0   2000  163000.0
dff = df.fillna(value=0)   # returns a filled copy; df itself is unchanged
print(dff)
              apts    apts2  bonus    income
Beijing    65000.0  10000.0   2000  147000.0
Chongqing  64000.0  30000.0   2000  175000.0
Guangzhou  47000.0   7000.0   2000  106500.0
Hangzhou   64000.0      0.0   2000       0.0
Shanghai   68000.0   8000.0   2000  150000.0
Suzhou     64000.0      0.0   2000       0.0
Tianjin    64000.0  40000.0   2000  190000.0
shenzhen   76000.0   6000.0   2000  163000.0
inplace
dff = df.fillna(value=0, inplace=True)
print(df); print(dff)   # with inplace=True, df is modified in place and no new copy is returned (the return value is None)
              apts    apts2  bonus    income
Beijing    65000.0  10000.0   2000  147000.0
Chongqing  64000.0  30000.0   2000  175000.0
Guangzhou  47000.0   7000.0   2000  106500.0
Hangzhou   64000.0      0.0   2000       0.0
Shanghai   68000.0   8000.0   2000  150000.0
Suzhou     64000.0      0.0   2000       0.0
Tianjin    64000.0  40000.0   2000  190000.0
shenzhen   76000.0   6000.0   2000  163000.0
None
ffill
dffr = df.fillna(method='ffill')   # forward fill: each NaN copies the previous non-NaN value in its column; df is unchanged
print(dffr)
              apts    apts2  bonus    income
Beijing    65000.0  10000.0   2000  147000.0
Chongqing  64000.0  30000.0   2000  175000.0
Guangzhou  47000.0   7000.0   2000  106500.0
Hangzhou   64000.0   7000.0   2000  106500.0
Shanghai   68000.0   8000.0   2000  150000.0
Suzhou     64000.0   8000.0   2000  150000.0
Tianjin    64000.0  40000.0   2000  190000.0
shenzhen   76000.0   6000.0   2000  163000.0
bfill
dfba = df.fillna(method='bfill')   # backward fill: each NaN copies the next non-NaN value in its column; df is unchanged
print(dfba)
              apts    apts2  bonus    income
Beijing    65000.0  10000.0   2000  147000.0
Chongqing  64000.0  30000.0   2000  175000.0
Guangzhou  47000.0   7000.0   2000  106500.0
Hangzhou   64000.0   8000.0   2000  150000.0
Shanghai   68000.0   8000.0   2000  150000.0
Suzhou     64000.0  40000.0   2000  190000.0
Tianjin    64000.0  40000.0   2000  190000.0
shenzhen   76000.0   6000.0   2000  163000.0
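Recent pandas versions deprecate the `method=` argument of `fillna` in favour of the dedicated `ffill()`/`bfill()` methods; a sketch on a made-up Series:

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])

print(s.ffill().tolist())   # forward fill: [1.0, 1.0, 1.0, 4.0]
print(s.bfill().tolist())   # backward fill: [1.0, 4.0, 4.0, 4.0]
```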
Hierarchical index
import pandas as pd
import numpy as np

data = pd.Series(np.random.randn(10),
                 index=[['a','a','a','b','b','c','c','d','d','d'],
                        [1,2,3,1,2,1,2,1,2,3]])
print(data)
print(type(data))
a  1    0.346467
   2   -0.043077
   3    0.043878
b  1    0.107763
   2   -0.175726
c  1   -1.833683
   2    0.033884
d  1   -1.807021
   2    0.819740
   3    0.294679
dtype: float64
<class 'pandas.core.series.Series'>
print(data.index)
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 0, 1, 0, 1, 2]])
print(data['b':'c'])
b  1    0.353241
   2    0.379744
c  1   -0.860706
   2   -0.795483
dtype: float64
print(data[:2])
a  1    0.763116
   2    0.058009
dtype: float64
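Individual elements of a hierarchical index can also be addressed by (outer, inner) label pairs; sketched here with fixed values instead of random ones:

```python
import pandas as pd

data = pd.Series([10.0, 20.0, 30.0, 40.0],
                 index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

print(data['a'])           # all inner entries under outer label 'a'
print(data['a', 2])        # one element: outer 'a', inner 2
print(data.loc[('b', 1)])  # same idea via .loc with a tuple
```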
unstack: converting a Series into a DataFrame
unstacked = data.unstack()   # pivot the inner index level into columns; missing combinations become NaN
print(unstacked)
print(type(unstacked))
          1         2         3
a -0.637935 -0.104897 -1.536381
b  2.448302  1.679833       NaN
c -0.845155  0.829459       NaN
d  0.597535 -0.464255 -0.898994
<class 'pandas.core.frame.DataFrame'>   # compare with the type of data (a Series)
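`stack()` is the inverse: it folds the columns back into an inner index level; an explicit `dropna()` removes the cell that unstacking padded with NaN. A sketch with fixed values:

```python
import pandas as pd

data = pd.Series([1.0, 2.0, 3.0],
                 index=[['a', 'a', 'b'], [1, 2, 1]])

df = data.unstack()            # 2x2 DataFrame with NaN at ('b', 2)
back = df.stack().dropna()     # fold columns back and drop the padded NaN
print(back)
```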
CSV reading and writing: read_csv/to_csv
import os

import pandas as pd

cities = {'Beijing': 55000, 'Shanghai': 60000, 'shenzhen': 50000,
          'Hangzhou': 20000, 'Guangzhou': 45000, 'Suzhou': None}
apts = pd.Series(cities, name='income')
apts['shenzhen'] = 70000
less_than_50000 = (apts < 50000)
apts[less_than_50000] = 40000
apts2 = pd.Series({'Beijing': 10000, 'Shanghai': 8000, 'shenzhen': 6000,
                   'Tianjin': 40000, 'Guangzhou': 7000, 'Chongqing': 30000})
apts = apts + apts2
apts[apts.isnull()] = apts.mean()
df = pd.DataFrame({'apts': apts, 'apts2': apts2})
df['bonus'] = 2000
df['income'] = df['apts']*2 + df['apts2']*1.5 + df['bonus']
#print(df)
df.to_csv('df.csv')
df.to_csv('df2.csv', index=False)   # drop the first column (the row index)

df2_site = r"D:\PYTHON35\idle\df2.csv"
pwd = os.getcwd()                   # remember the current working directory
os.chdir(os.path.dirname(df2_site))
tmp_df = pd.read_csv(os.path.basename(df2_site))
print(tmp_df)
      apts    apts2  bonus    income
0  65000.0  10000.0   2000  147000.0
1  64000.0  30000.0   2000  175000.0
2  47000.0   7000.0   2000  106500.0
3  64000.0      NaN   2000       NaN
4  68000.0   8000.0   2000  150000.0
5  64000.0      NaN   2000       NaN
6  64000.0  40000.0   2000  190000.0
7  76000.0   6000.0   2000  163000.0
tmp_df_index = pd.Index(['Beijing','Shanghai','Suzhou','Hangzhou',
                         'Tianjin','Chongqing','Nanjing','Shenzhen'])
tmp_df.index = tmp_df_index   # replace the default integer index
print(tmp_df)
              apts    apts2  bonus    income
Beijing    65000.0  10000.0   2000  147000.0
Shanghai   64000.0  30000.0   2000  175000.0
Suzhou     47000.0   7000.0   2000  106500.0
Hangzhou   64000.0      NaN   2000       NaN
Tianjin    68000.0   8000.0   2000  150000.0
Chongqing  64000.0      NaN   2000       NaN
Nanjing    64000.0  40000.0   2000  190000.0
Shenzhen   76000.0   6000.0   2000  163000.0
df.to_csv('df3.csv',sep='\t')
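A file written with a custom separator must be read back with the same `sep`; a self-contained round trip through a temporary directory (the frame and file name here are illustrative):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'apts': [65000.0, 64000.0], 'bonus': [2000, 2000]},
                  index=['Beijing', 'Chongqing'])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'df3.csv')
    df.to_csv(path, sep='\t')                        # tab-separated, row index kept
    back = pd.read_csv(path, sep='\t', index_col=0)  # first column becomes the index again
    print(back)
```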