[Python Data Analysis and Presentation] (6) Handling Missing Data and Hierarchical Indexing

Source: Internet  Editor: 程序博客网  Time: 2024/06/06 17:25

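Handling missing data

pandas uses the floating-point value NaN (Not a Number) to represent missing data. It is simply a sentinel that is easy to detect.

Method   Description
dropna   Filter out missing data; a threshold (thresh) adjusts how much missing data is tolerated
fillna   Fill missing data with a specified value or an interpolation method (such as ffill or bfill)
isnull   Return boolean values marking which entries are NaN
notnull  The negation of isnull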
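
As a quick illustration of the four methods in the table, here is a minimal sketch on a small Series. The variable names (s, mask, present, dropped, filled) are made up for this example and do not come from the original post.

```python
import numpy as np
import pandas as pd

# A small Series with two missing entries; None is also treated as missing
s = pd.Series([1.0, np.nan, 3.5, None])

mask = s.isnull()       # True where the value is missing
present = s.notnull()   # element-wise negation of isnull
dropped = s.dropna()    # keeps the same rows as s[s.notnull()]
filled = s.fillna(0.0)  # missing values replaced by 0.0

print(mask.tolist())     # [False, True, False, True]
print(dropped.tolist())  # [1.0, 3.5]
print(filled.tolist())   # [1.0, 0.0, 3.5, 0.0]
```

Note that none of these calls change s itself; each returns a new object.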

Examples

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, 5]],
...                   columns=list('ABCD'))
>>> df
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
Drop the columns where all elements are NaN:

>>> df.dropna(axis=1, how='all')
     A    B  D
0  NaN  2.0  0
1  3.0  4.0  1
2  NaN  NaN  5
Drop the columns where any of the elements is NaN:

>>> df.dropna(axis=1, how='any')
   D
0  0
1  1
2  5
Drop the rows where all of the elements are NaN (there is no such row, so df stays the same):

>>> df.dropna(axis=0, how='all')
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
Keep only the rows with at least 2 non-NA values:

>>> df.dropna(thresh=2)
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1

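fillna parameters

Parameter  Description
value      Fill value: a scalar or a dict
method     Fill method: ffill or bfill
axis       Axis along which to fill
inplace    Modify the object in place instead of returning a new one
limit      Maximum number of values to fill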

Examples

>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, 5],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list('ABCD'))
>>> df
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4
Replace all NaN elements with 0s.

>>> df.fillna(0)
     A    B    C  D
0  0.0  2.0  0.0  0
1  3.0  4.0  0.0  1
2  0.0  0.0  0.0  5
3  0.0  3.0  0.0  4
We can also propagate non-null values forward or backward. (Since pandas 2.1 the method argument is deprecated; df.ffill() gives the same result.)

>>> df.fillna(method='ffill')
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  3.0  4.0 NaN  5
3  3.0  3.0 NaN  4
Replace all NaN elements in columns 'A', 'B', 'C', and 'D' with 0, 1, 2, and 3 respectively.

>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
>>> df.fillna(value=values)
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  2.0  1
2  0.0  1.0  2.0  5
3  0.0  3.0  2.0  4
Only replace the first NaN element in each column.

>>> df.fillna(value=values, limit=1)
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  NaN  1
2  NaN  1.0  NaN  5
3  NaN  3.0  NaN  4
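
The examples above do not exercise backward filling, limit combined with a fill direction, or inplace. The sketch below tries those on the same four-row frame, using DataFrame.ffill and DataFrame.bfill. The names forward, backward, and work are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list("ABCD"))

# Forward fill down each column, filling at most one consecutive NaN
forward = df.ffill(limit=1)

# Backward fill: each NaN takes the next valid value below it, if any
backward = df.bfill()

# inplace=True changes the object itself, so work on a copy to keep df intact
work = df.copy()
work.fillna(0, inplace=True)

print(forward["A"].tolist())   # [nan, 3.0, 3.0, nan]
print(backward["B"].tolist())  # [2.0, 4.0, 3.0, 3.0]
```

In column A the last NaN is left alone by ffill(limit=1), because it would be the second consecutive fill after the 3.0 in row 1.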

Hierarchical indexing

Hierarchical indexing is an important pandas feature that lets you have multiple index levels on a single axis.
Example: Series

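import numpy as np
from pandas import Series, DataFrame

data = Series(np.random.rand(10),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     ['1', '2', '3', '1', '2', '3', '1', '2', '1', '2']])
data
# a  1    0.974478
#    2    0.638362
#    3    0.101788
# b  1    0.713843
#    2    0.106504
#    3    0.175605
# c  1    0.608555
#    2    0.399577
# d  1    0.102047
#    2    0.726674
# dtype: float64

data.index
# MultiIndex(levels=[['a', 'b', 'c', 'd'], ['1', '2', '3']],
#            labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 0, 1]])
# (repr of an older pandas; newer versions call `labels` `codes` and print the index as tuples)

data.unstack()  # values come from a separate random run, so they differ from those above
#           1         2         3
# a  0.972446  0.058712  0.758507
# b  0.648166  0.169893  0.423814
# c  0.655640  0.869214       NaN
# d  0.141091  0.251400       NaN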
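
One practical benefit of a hierarchical index is partial indexing: selecting by the outer level alone, or by the inner level across all outer labels. The sketch below uses fixed values (np.arange) instead of random ones so that the results are reproducible. The names outer, inner, wide, and back are only for this illustration.

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10.0),
                 index=[list("aaabbbccdd"),
                        ["1", "2", "3", "1", "2", "3", "1", "2", "1", "2"]])

outer = data["b"]           # all rows under outer label 'b'
inner = data.loc[:, "2"]    # rows whose inner label is '2', across all outer labels
wide = data.unstack()       # inner level becomes the columns; missing pairs are NaN
back = wide.stack().dropna()  # back to a Series; dropna removes any NaN cells

print(outer.tolist())  # [3.0, 4.0, 5.0]
print(inner.tolist())  # [1.0, 4.0, 7.0, 9.0]
print(wide.shape)      # (4, 3)
```

The explicit dropna after stack is there because whether stack itself discards the NaN cells depends on the pandas version.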

For a DataFrame, either axis can have a hierarchical index.

frame = DataFrame(np.arange(12).reshape(4, 3),
                  index=[['a', 'a', 'b', 'b'], ['1', '2', '1', '2']],
                  columns=[['Ohio', 'Ohio', 'Colorado'], ['green', 'red', 'green']])
frame
#      Ohio     Colorado
#     green red    green
# a 1     0   1        2
#   2     3   4        5
# b 1     6   7        8
#   2     9  10       11

frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
# state      Ohio     Colorado
# color     green red    green
# key1 key2
# a    1        0   1        2
#      2        3   4        5
# b    1        6   7        8
#      2        9  10       11

frame['Ohio']
# color      green  red
# key1 key2
# a    1         0    1
#      2         3    4
# b    1         6    7
#      2         9   10

frame.swaplevel('key1', 'key2')  # swap two index levels, given by name or by number
# state      Ohio     Colorado
# color     green red    green
# key2 key1
# 1    a        0   1        2
# 2    a        3   4        5
# 1    b        6   7        8
# 2    b        9  10       11

frame.sort_index(level=1)  # sort by the second index level
frame.swaplevel('key1', 'key2').sort_index(level=0)  # swap, then sort by the first level
frame.groupby(level=1).sum()  # sum rows by second-level label; frame.sum(axis=0, level=1) was removed in pandas 2.0
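
To make the level-based operations concrete, the following sketch rebuilds the same 4x3 frame and checks what the per-level sum and the swap-then-sort actually return. It assumes a current pandas, where summing by level is written with groupby. The names by_key2 and swapped are only for this illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=[["a", "a", "b", "b"], ["1", "2", "1", "2"]],
                     columns=[["Ohio", "Ohio", "Colorado"],
                              ["green", "red", "green"]])
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]

# Sum the rows that share the same key2 label
by_key2 = frame.groupby(level="key2").sum()

# Swap the two row levels, then sort so equal key2 labels sit together
swapped = frame.swaplevel("key1", "key2").sort_index(level=0)

print(by_key2.values.tolist())  # [[6, 8, 10], [12, 14, 16]]
print(list(swapped.index))      # [('1', 'a'), ('1', 'b'), ('2', 'a'), ('2', 'b')]
```

Row '1' of by_key2 is the sum of rows (a, 1) and (b, 1), that is [0, 1, 2] + [6, 7, 8].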

You can also use columns as the row index, with the set_index function.

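frame1 = DataFrame({'a': range(7), 'b': range(7, 0, -1),
                    'c': ['a', 'a', 'a', 'b', 'b', 'b', 'b'],
                    'd': [0, 1, 2, 0, 1, 2, 3]})
frame1
#    a  b  c  d
# 0  0  7  a  0
# 1  1  6  a  1
# 2  2  5  a  2
# 3  3  4  b  0
# 4  4  3  b  1
# 5  5  2  b  2
# 6  6  1  b  3

frame2 = frame1.set_index(['c', 'd'])
frame2
#      a  b
# c d
# a 0  0  7
#   1  1  6
#   2  2  5
# b 0  3  4
#   1  4  3
#   2  5  2
#   3  6  1

frame1.set_index(['c', 'd'], drop=False)  # keep the columns that became the index
frame1.set_index(['c', 'd'])              # default drop=True removes them from the columns
frame2.reset_index()                      # move the index levels back into columns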
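
A short sketch of the round trip between columns and index levels, with the same data as above. The names kept, restored, and row are only for this illustration.

```python
import pandas as pd

frame1 = pd.DataFrame({"a": range(7),
                       "b": range(7, 0, -1),
                       "c": ["a", "a", "a", "b", "b", "b", "b"],
                       "d": [0, 1, 2, 0, 1, 2, 3]})

frame2 = frame1.set_index(["c", "d"])            # c and d leave the columns
kept = frame1.set_index(["c", "d"], drop=False)  # c and d stay as columns too
restored = frame2.reset_index()                  # index levels become leading columns

# With the two-level index in place, a (c, d) pair selects a single row
row = frame2.loc[("b", 2)]

print(list(frame2.columns))    # ['a', 'b']
print(list(restored.columns))  # ['c', 'd', 'a', 'b']
print(row["a"], row["b"])      # 5 2
```

After reset_index the data is the same as in frame1, but the former index levels come first in the column order.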
