Python Basics: pandas

Source: Internet; Editor: 程序博客网; Time: 2024/05/04 15:10

coding: utf-8

In[3]:

Creating a Series:

import pandas as pd
import numpy as np
# NaN ("not a number") is the floating-point marker pandas uses for a missing value
s = pd.Series([1, 3, 6, np.nan, 44, 1])

print(s)
"""
0     1.0
1     3.0
2     6.0
3     NaN
4    44.0
5     1.0
dtype: float64
"""
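The output above shows dtype float64 even though most of the inputs are integers. A minimal sketch of why, using two small hypothetical Series (`ints` and `with_nan` are illustrative names, not part of the original notebook): a single NaN makes pandas store the whole Series as floats, and reductions such as `sum` skip the NaN by default.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 3, 6])              # integers only
with_nan = pd.Series([1, 3, 6, np.nan])  # same values plus one missing entry

print(ints.dtype)              # int64
print(with_nan.dtype)          # float64
print(with_nan.isna().sum())   # 1
print(with_nan.sum())          # 10.0 (the NaN is skipped)
```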

In[2]:

Creating a DataFrame:

dates = pd.date_range('20160101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=['a', 'b', 'c', 'd'])

print(df)  # the values are random, so they differ on every run
"""
                   a         b         c         d
2016-01-01 -0.253065 -2.071051 -0.640515  0.613663
2016-01-02 -1.147178  1.532470  0.989255 -0.499761
2016-01-03  1.221656 -2.390171  1.862914  0.778070
2016-01-04  1.473877 -0.046419  0.610046  0.204672
2016-01-05 -1.584752 -0.700592  1.487264 -1.778293
2016-01-06  0.633675 -1.414157 -0.277066 -0.442545
"""

A DataFrame is a tabular data structure that holds an ordered collection of columns, each of which can have a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be thought of as a large dictionary of Series that share one index.
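The "dictionary of Series" view can be sketched as follows. The `names` and `scores` Series and their values are made up for illustration.

```python
import pandas as pd

# Two Series with the same index but different value types.
names = pd.Series(["ann", "bob", "cid"], index=[10, 20, 30])
scores = pd.Series([90.5, 72.0, 88.0], index=[10, 20, 30])

# Build the DataFrame from a dict that maps column names to Series.
table = pd.DataFrame({"name": names, "score": scores})

# Looking up a column by key returns a Series, much like a dict lookup.
col = table["score"]
print(type(col).__name__)   # Series
print(list(table.keys()))   # ['name', 'score']
print(col[20])              # 72.0
```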

In[3]:

Some simple DataFrame operations

print(df['b'])

"""
2016-01-01   -2.071051
2016-01-02    1.532470
2016-01-03   -2.390171
2016-01-04   -0.046419
2016-01-05   -0.700592
2016-01-06   -1.414157
Freq: D, Name: b, dtype: float64
"""

In[4]:

df1 = pd.DataFrame(np.arange(12).reshape((3, 4)))
print(df1)

"""
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
"""

When no index or columns are given, both default to integer labels starting at 0.

In[5]:

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

print(df2)

"""
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
"""
print(df2.dtypes)

"""
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
"""

In[6]:

print(df2.index)

"""
Int64Index([0, 1, 2, 3], dtype='int64')
"""
# pandas 2.0 and later print Index([0, 1, 2, 3], dtype='int64') instead

In[7]:

print(df2.columns)

"""
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
"""

In[8]:

print(df2.values)

"""
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']], dtype=object)
"""

In[10]:

df2.describe()

"""
         A    C    D
count  4.0  4.0  4.0
mean   1.0  1.0  3.0
std    0.0  0.0  0.0
min    1.0  1.0  3.0
25%    1.0  1.0  3.0
50%    1.0  1.0  3.0
75%    1.0  1.0  3.0
max    1.0  1.0  3.0
"""

In[11]:

print(df2.T)  # transpose: rows become columns and columns become rows

In[16]:

print(df2.sort_index(axis=1, ascending=False))  # sort by column labels (axis=1) in descending order

"""
     F      E  D    C          B    A
0  foo   test  3  1.0 2013-01-02  1.0
1  foo  train  3  1.0 2013-01-02  1.0
2  foo   test  3  1.0 2013-01-02  1.0
3  foo  train  3  1.0 2013-01-02  1.0
"""

In[17]:

print(df2.sort_values(by='B'))  # sort the rows by the values in column 'B'
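Because every row of df2 holds the same values, the sorted output looks no different from the input. A minimal sketch with a small hypothetical `frame` (not from the original notebook) makes the difference between sorting by labels and sorting by values visible:

```python
import pandas as pd

frame = pd.DataFrame(
    {"b": [3, 1, 2], "a": [30, 10, 20]},
    index=["z", "x", "y"],
)

by_row_label = frame.sort_index(axis=0)                     # order rows by index label
by_col_label = frame.sort_index(axis=1)                     # order columns by column label
by_value_desc = frame.sort_values(by="b", ascending=False)  # order rows by the values in column b

print(list(by_row_label.index))    # ['x', 'y', 'z']
print(list(by_col_label.columns))  # ['a', 'b']
print(list(by_value_desc["b"]))    # [3, 2, 1]
```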

In[18]:

Selecting data with pandas

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])

"""
             A   B   C   D
2013-01-01   0   1   2   3
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23
"""

In[19]:

Simple selection

print(df['A'])
print(df.A)

In[22]:

Slicing with [] selects a range of rows, either by position or by index label (a label slice includes its end point):

print(df[0:3])

"""
            A  B   C   D
2013-01-01  0  1   2   3
2013-01-02  4  5   6   7
2013-01-03  8  9  10  11
"""

print(df['20130102':'20130104'])

"""
             A   B   C   D
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
"""

In[23]:

We can select data by label with loc. The examples below select one row by its label, or select one row or all rows (: means all rows) and then pick one or more columns from them:

print(df.loc['20130102'])
"""
A    4
B    5
C    6
D    7
Name: 2013-01-02 00:00:00, dtype: int64
"""

print(df.loc[:, ['A', 'B']])
"""
             A   B
2013-01-01   0   1
2013-01-02   4   5
2013-01-03   8   9
2013-01-04  12  13
2013-01-05  16  17
2013-01-06  20  21
"""

print(df.loc['20130102', ['A', 'B']])
"""
A    4
B    5
Name: 2013-01-02 00:00:00, dtype: int64
"""

In[25]:

We can also select by position with iloc. Depending on what is needed, this can pick a single value, a contiguous block, or a set of non-adjacent rows.

print(df)
print(df.iloc[3, 1])  # 13

print(df.iloc[3:5, 1:3])
"""
             B   C
2013-01-04  13  14
2013-01-05  17  18
"""

print(df.iloc[[1, 3, 5], 1:3])
"""
             B   C
2013-01-02   5   6
2013-01-04  13  14
2013-01-06  21  22
"""
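One detail worth checking when switching between the two accessors: a loc slice includes its end label, while an iloc slice excludes its end position. A minimal sketch on the same 6x4 frame (rebuilt here so the block runs on its own):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=list("ABCD"))

by_label = df.loc["20130102":"20130104", "A":"C"]   # end labels are included
by_position = df.iloc[1:4, 0:3]                     # end positions are excluded

print(by_label.shape)                 # (3, 3)
print(by_label.equals(by_position))   # True
```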

In[26]:

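Mixed selection (positions for the rows, labels for the columns) used to be done with ix. That accessor was deprecated in pandas 0.20 and removed in 1.0, so the example below chains iloc and loc instead to select the first three rows of columns 'A' and 'C'.

print(df.iloc[:3].loc[:, ['A', 'C']])
"""
            A   C
2013-01-01  0   2
2013-01-02  4   6
2013-01-03  8  10
"""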
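Two other ways to express the same mixed selection in a single indexing step are sketched below; both rely only on `Index.get_indexer`, `iloc`, and `loc`. The frame is rebuilt so the block is self-contained.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=list("ABCD"))

# Option 1: turn the column labels into positions, then use iloc only.
col_positions = df.columns.get_indexer(["A", "C"])
via_iloc = df.iloc[:3, col_positions]

# Option 2: turn the row positions into labels, then use loc only.
via_loc = df.loc[df.index[:3], ["A", "C"]]

print(via_iloc.equals(via_loc))   # True
print(via_iloc.values.tolist())   # [[0, 2], [4, 6], [8, 10]]
```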

In[27]:

We can also select with a condition (Boolean indexing). The comparison produces a True/False mask, and only the rows where the mask is True are returned.

print(df[df.A > 8])
"""
             A   B   C   D
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23
"""
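Masks can be combined. The sketch below assumes the same 6x4 frame and uses `&`, `|`, and `~`; each comparison needs its own parentheses because these operators bind more tightly than `>` and `<`.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=list("ABCD"))

both = df[(df.A > 8) & (df.D < 23)]     # AND: both conditions must hold
either = df[(df.A < 4) | (df.A > 16)]   # OR: at least one condition holds
negated = df[~(df.A > 8)]               # NOT: invert the mask

print(both.A.tolist())      # [12, 16]
print(either.A.tolist())    # [0, 20]
print(negated.A.tolist())   # [0, 4, 8]
```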

In[28]:

Setting values with pandas

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])

"""
             A   B   C   D
2013-01-01   0   1   2   3
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23
"""

The value to change can be located either by position (iloc) or by label (loc).

df.iloc[2, 2] = 1111
df.loc['20130101', 'B'] = 2222

"""
             A     B     C   D
2013-01-01   0  2222     2   3
2013-01-02   4     5     6   7
2013-01-03   8     9  1111  11
2013-01-04  12    13    14  15
2013-01-05  16    17    18  19
2013-01-06  20    21    22  23
"""

In[29]:

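Now suppose the change is conditional: we want to modify values in column B, but which rows to modify depends on column A. Wherever A is greater than 4, set the corresponding value in B to 0.

# Use a single loc call; the chained form df.B[df.A > 4] = 0 may write to a temporary copy
df.loc[df.A > 4, 'B'] = 0
"""
             A     B     C   D
2013-01-01   0  2222     2   3
2013-01-02   4     5     6   7
2013-01-03   8     0  1111  11
2013-01-04  12     0    14  15
2013-01-05  16     0    18  19
2013-01-06  20     0    22  23
"""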
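The conditional update can be sketched in two styles on a fresh copy of the frame: an in-place assignment through `loc`, and `Series.mask`, which returns a new Series and leaves the original untouched. The second condition (`A > 12`) and the replacement value `-1` are made up for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=list("ABCD"))

# In place: rows chosen by a condition on A, column chosen by label.
df.loc[df["A"] > 4, "B"] = 0

# Functional style: mask() replaces the values where the condition is True.
df["C"] = df["C"].mask(df["A"] > 12, -1)

print(df["B"].tolist())   # [1, 5, 0, 0, 0, 0]
print(df["C"].tolist())   # [2, 6, 10, 14, -1, -1]
```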

In[30]:

To set a whole column at once, for example to add a column 'F' filled entirely with NaN:

df['F'] = np.nan
"""
             A     B     C   D   F
2013-01-01   0  2222     2   3 NaN
2013-01-02   4     5     6   7 NaN
2013-01-03   8     0  1111  11 NaN
2013-01-04  12     0    14  15 NaN
2013-01-05  16     0    18  19 NaN
2013-01-06  20     0    22  23 NaN
"""

In[31]:

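The same approach can add a Series as a new column. The Series is aligned on its index, not by position, so its index should match the DataFrame's; any row label missing from the Series gets NaN.

df['E'] = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130101', periods=6))
"""
             A     B     C   D   F  E
2013-01-01   0  2222     2   3 NaN  1
2013-01-02   4     5     6   7 NaN  2
2013-01-03   8     0  1111  11 NaN  3
2013-01-04  12     0    14  15 NaN  4
2013-01-05  16     0    18  19 NaN  5
2013-01-06  20     0    22  23 NaN  6
"""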
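To see what index alignment means in practice, here is a minimal sketch in which the new Series only partly overlaps the frame's index. The `shifted` Series and its values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 4, 8]}, index=pd.date_range("20130101", periods=3))

# This Series covers 2 Jan to 4 Jan, so it overlaps df's index on two dates only.
shifted = pd.Series([10, 20, 30], index=pd.date_range("20130102", periods=3))
df["E"] = shifted

print(df["E"].tolist())   # [nan, 10.0, 20.0]
print(len(df))            # 3 (the extra label 4 Jan is not added)
```

The value 30 is dropped because its label is not in the frame, and 1 Jan gets NaN because the Series has no entry for it.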

In[32]:

Handling missing data with pandas

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])
# NaN is a float, so make B and C float columns before storing NaN in them
df = df.astype({'B': 'float64', 'C': 'float64'})
df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan
"""
             A     B     C   D
2013-01-01   0   NaN   2.0   3
2013-01-02   4   5.0   NaN   7
2013-01-03   8   9.0  10.0  11
2013-01-04  12  13.0  14.0  15
2013-01-05  16  17.0  18.0  19
2013-01-06  20  21.0  22.0  23
"""

In[33]:

# To drop the rows or columns that contain NaN, use dropna
df.dropna(
    axis=0,     # 0: drop rows; 1: drop columns
    how='any'   # 'any': drop if any value is NaN; 'all': drop only if every value is NaN
)
"""
             A     B     C   D
2013-01-03   8   9.0  10.0  11
2013-01-04  12  13.0  14.0  15
2013-01-05  16  17.0  18.0  19
2013-01-06  20  21.0  22.0  23
"""
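The effect of `how` and `axis`, plus the `subset` parameter that the example above does not use, can be compared on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, np.nan, 6.0],
    "C": [7.0, np.nan, 9.0],
})

print(len(df.dropna(how="any")))            # 1: only the last row is complete
print(len(df.dropna(how="all")))            # 2: only the all-NaN middle row is dropped
print(len(df.dropna(subset=["A"])))         # 2: only NaN in column A counts
print(df.dropna(axis=1, how="any").shape)   # (3, 0): every column contains a NaN
```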

In[34]:

# To replace NaN with another value instead, for example 0:
df.fillna(value=0)
"""
             A     B     C   D
2013-01-01   0   0.0   2.0   3
2013-01-02   4   5.0   0.0   7
2013-01-03   8   9.0  10.0  11
2013-01-04  12  13.0  14.0  15
2013-01-05  16  17.0  18.0  19
2013-01-06  20  21.0  22.0  23
"""
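A single constant is not always the right replacement. As a sketch on a small hypothetical frame, `fillna` also accepts a dict with one fill value per column, and `ffill` carries the last valid value forward:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"B": [np.nan, 5.0, 9.0], "C": [2.0, np.nan, 10.0]})

# One fill value per column: 0 for B, the column mean for C.
per_column = df.fillna({"B": 0, "C": df["C"].mean()})

# Forward fill: a leading NaN stays NaN because nothing precedes it.
forward = df.ffill()

print(per_column["B"].tolist())   # [0.0, 5.0, 9.0]
print(per_column["C"].tolist())   # [2.0, 6.0, 10.0]
print(forward["C"].tolist())      # [2.0, 2.0, 10.0]
```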

In[35]:

df.isnull()  # test each value for NaN; True marks a missing value
"""
                A      B      C      D
2013-01-01  False   True  False  False
2013-01-02  False  False   True  False
2013-01-03  False  False  False  False
2013-01-04  False  False  False  False
2013-01-05  False  False  False  False
2013-01-06  False  False  False  False
"""

In[36]:

df.isnull().values.any()  # True if at least one value anywhere in the DataFrame is NaN

"""
True
"""

In[17]:

import pandas as pd  # load the module

Read a CSV file

data = pd.read_csv("E:/Learning materials/codes/numpy and pandas/student.csv")

Print data

print(data)

In[18]:

data.to_pickle('student.pickle')  # save the data as a pickle file
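The pickle can be loaded back with `pd.read_pickle`. The sketch below runs the full round trip in a temporary directory; the two-row frame is a hypothetical stand-in for student.csv, whose real contents are not shown in this document.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the student data.
data = pd.DataFrame({"name": ["ann", "bob"], "age": [20, 21]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "student.csv")
    pickle_path = os.path.join(tmp, "student.pickle")

    data.to_csv(csv_path, index=False)     # write the CSV without the index column
    from_csv = pd.read_csv(csv_path)       # read it back
    from_csv.to_pickle(pickle_path)        # save as pickle
    restored = pd.read_pickle(pickle_path) # load the pickle

print(restored.equals(data))   # True
```

Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.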

In[19]:

Combining DataFrames with concat

import pandas as pd
import numpy as np

Define the datasets

df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'])
df3 = pd.DataFrame(np.ones((3, 4)) * 2, columns=['a', 'b', 'c', 'd'])

Concatenate vertically (axis=0)

res = pd.concat([df1, df2, df3], axis=0)

Print the result

print(res)

"""
     a    b    c    d
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
0  1.0  1.0  1.0  1.0
1  1.0  1.0  1.0  1.0
2  1.0  1.0  1.0  1.0
0  2.0  2.0  2.0  2.0
1  2.0  2.0  2.0  2.0
2  2.0  2.0  2.0  2.0
"""

In[21]:

Continuing the previous example, now set ignore_index to True

res = pd.concat([df1, df2, df3], axis=0, ignore_index=True)

Print the result; the index is now 0, 1, 2, 3, 4, 5, 6, 7, 8

print(res)

"""
     a    b    c    d
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  1.0  1.0  1.0  1.0
4  1.0  1.0  1.0  1.0
5  1.0  1.0  1.0  1.0
6  2.0  2.0  2.0  2.0
7  2.0  2.0  2.0  2.0
8  2.0  2.0  2.0  2.0
"""

In[22]:

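Define the datasets

join='outer' is the default, so concat behaves this way whenever join is not passed.

With axis=0 the frames are stacked vertically and aligned on column names: columns that exist in both frames are stacked on top of each other, columns that exist in only one frame get a column of their own, and positions that had no value are filled with NaN.

df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])

Vertical "outer" concatenation of df1 and df2

res = pd.concat([df1, df2], axis=0, join='outer')

print(res)

"""
     a    b    c    d    e
1  0.0  0.0  0.0  0.0  NaN
2  0.0  0.0  0.0  0.0  NaN
3  0.0  0.0  0.0  0.0  NaN
2  NaN  1.0  1.0  1.0  1.0
3  NaN  1.0  1.0  1.0  1.0
4  NaN  1.0  1.0  1.0  1.0
"""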

In[24]:

Continuing the previous example

Vertical "inner" concatenation of df1 and df2: only the columns present in both frames are kept

res = pd.concat([df1, df2], axis=0, join='inner')

Print the result

print(res)

"""
     b    c    d
1  0.0  0.0  0.0
2  0.0  0.0  0.0
3  0.0  0.0  0.0
2  1.0  1.0  1.0
3  1.0  1.0  1.0
4  1.0  1.0  1.0
"""

Reset the index and print the result

res = pd.concat([df1, df2], axis=0, join='inner', ignore_index=True)
print(res)

"""
     b    c    d
0  0.0  0.0  0.0
1  0.0  0.0  0.0
2  0.0  0.0  0.0
3  1.0  1.0  1.0
4  1.0  1.0  1.0
5  1.0  1.0  1.0
"""
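The difference between the two join modes comes down to which columns survive. A minimal sketch that checks this directly, with df1 and df2 redefined so the block runs on its own:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.zeros((3, 4)), columns=["a", "b", "c", "d"], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)), columns=["b", "c", "d", "e"], index=[2, 3, 4])

outer = pd.concat([df1, df2], axis=0, join="outer")   # union of the columns
inner = pd.concat([df1, df2], axis=0, join="inner")   # intersection of the columns

print(list(outer.columns))              # ['a', 'b', 'c', 'd', 'e']
print(list(inner.columns))              # ['b', 'c', 'd']
print(int(outer.isna().sum().sum()))    # 6 cells had no source value
print(int(inner.isna().sum().sum()))    # 0
```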

In[25]:

Define the datasets

df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])

Concatenate horizontally, keeping only the rows whose labels are in `df1.index`

# join_axes was removed in pandas 1.0; reindexing the result has the same effect
res = pd.concat([df1, df2], axis=1).reindex(df1.index)

Print the result

print(res)

"""
     a    b    c    d    b    c    d    e
1  0.0  0.0  0.0  0.0  NaN  NaN  NaN  NaN
2  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
3  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
"""

Without the reindex step, every index label from either frame is kept

res = pd.concat([df1, df2], axis=1)
print(res)

"""
     a    b    c    d    b    c    d    e
1  0.0  0.0  0.0  0.0  NaN  NaN  NaN  NaN
2  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
3  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
4  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0
"""

In[26]:

import pandas as pd
import numpy as np

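Define the datasets

# The definitions are missing from the source; these are inferred from the printed results below
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'])
df3 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'])
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

Stack df2 below df1, reset the index, and print the result

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
res = pd.concat([df1, df2], ignore_index=True)
print(res)

"""
     a    b    c    d
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  1.0  1.0  1.0  1.0
4  1.0  1.0  1.0  1.0
5  1.0  1.0  1.0  1.0
"""

Combine several DataFrames: stack df2 and df3 below df1, reset the index, and print the result

res = pd.concat([df1, df2, df3], ignore_index=True)
print(res)

"""
     a    b    c    d
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  1.0  1.0  1.0  1.0
4  1.0  1.0  1.0  1.0
5  1.0  1.0  1.0  1.0
6  1.0  1.0  1.0  1.0
7  1.0  1.0  1.0  1.0
8  1.0  1.0  1.0  1.0
"""

Add a Series: append s1 to df1 as a new row, reset the index, and print the result

# Turn the Series into a one-row DataFrame first
res = pd.concat([df1, s1.to_frame().T], ignore_index=True)
print(res)

"""
     a    b    c    d
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  1.0  2.0  3.0  4.0
"""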
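When rows arrive one at a time, a common pattern is to collect the pieces in a list and concatenate once at the end rather than growing a DataFrame inside the loop, since every concatenation copies the data. A minimal sketch with made-up one-row pieces:

```python
import pandas as pd

pieces = []
for i in range(3):
    # Each piece is a one-row DataFrame; in practice it might come from a file or a query.
    pieces.append(pd.DataFrame({"a": [i], "b": [i * i]}))

# One concat call at the end, with a fresh 0..n-1 index.
res = pd.concat(pieces, ignore_index=True)

print(res["b"].tolist())    # [0, 1, 4]
print(res.index.tolist())   # [0, 1, 2]
```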

In[28]:

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

print(left)

# The column order below comes from the older pandas used for the original run;
# current versions keep the dict's insertion order, so key is printed first.
"""
    A   B key
0  A0  B0  K0
1  A1  B1  K1
2  A2  B2  K2
3  A3  B3  K3
"""

print(right)

"""
    C   D key
0  C0  D0  K0
1  C1  D1  K1
2  C2  D2  K2
3  C3  D3  K3
"""

Merge on the key column and print the result

res = pd.merge(left, right, on='key')

print(res)

"""
    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K2  C2  D2
3  A3  B3  K3  C3  D3
"""

In[29]:

import pandas as pd

Define the datasets and print them

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

print(left)

"""
    A   B key1 key2
0  A0  B0   K0   K0
1  A1  B1   K0   K1
2  A2  B2   K1   K0
3  A3  B3   K2   K1
"""

print(right)

"""
    C   D key1 key2
0  C0  D0   K0   K0
1  C1  D1   K1   K0
2  C2  D2   K1   K0
3  C3  D3   K2   K0
"""

Merge on the key1 and key2 columns and print the result for each of the four how options ['left', 'right', 'outer', 'inner']

res = pd.merge(left, right, on=['key1', 'key2'], how='inner')
print(res)

"""
    A   B key1 key2   C   D
0  A0  B0   K0   K0  C0  D0
1  A2  B2   K1   K0  C1  D1
2  A2  B2   K1   K0  C2  D2
"""

res = pd.merge(left, right, on=['key1', 'key2'], how='outer')
print(res)

"""
     A    B key1 key2    C    D
0   A0   B0   K0   K0   C0   D0
1   A1   B1   K0   K1  NaN  NaN
2   A2   B2   K1   K0   C1   D1
3   A2   B2   K1   K0   C2   D2
4   A3   B3   K2   K1  NaN  NaN
5  NaN  NaN   K2   K0   C3   D3
"""

res = pd.merge(left, right, on=['key1', 'key2'], how='left')
print(res)

"""
    A   B key1 key2    C    D
0  A0  B0   K0   K0   C0   D0
1  A1  B1   K0   K1  NaN  NaN
2  A2  B2   K1   K0   C1   D1
3  A2  B2   K1   K0   C2   D2
4  A3  B3   K2   K1  NaN  NaN
"""

res = pd.merge(left, right, on=['key1', 'key2'], how='right')
print(res)

"""
     A    B key1 key2   C   D
0   A0   B0   K0   K0  C0  D0
1   A2   B2   K1   K0  C1  D1
2   A2   B2   K1   K0  C2  D2
3  NaN  NaN   K2   K0  C3  D3
"""
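The four results differ mainly in which unmatched rows they keep. As a quick check, this sketch rebuilds the same two frames and counts the rows each `how` option returns:

```python
import pandas as pd

left = pd.DataFrame({"key1": ["K0", "K0", "K1", "K2"],
                     "key2": ["K0", "K1", "K0", "K1"],
                     "A": ["A0", "A1", "A2", "A3"],
                     "B": ["B0", "B1", "B2", "B3"]})
right = pd.DataFrame({"key1": ["K0", "K1", "K1", "K2"],
                      "key2": ["K0", "K0", "K0", "K0"],
                      "C": ["C0", "C1", "C2", "C3"],
                      "D": ["D0", "D1", "D2", "D3"]})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    row_counts[how] = len(pd.merge(left, right, on=["key1", "key2"], how=how))

print(row_counts["inner"])   # 3: matched key pairs only (K1/K0 matches twice)
print(row_counts["left"])    # 5: inner rows plus 2 unmatched left rows
print(row_counts["right"])   # 4: inner rows plus 1 unmatched right row
print(row_counts["outer"])   # 6: inner rows plus all unmatched rows
```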

In[30]:

Define the datasets and print them

df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})

print(df1)

"""
   col1 col_left
0     0        a
1     1        b
"""

print(df2)

"""
   col1  col_right
0     1          2
1     2          2
2     2          2
"""

Merge on col1 with indicator=True, which adds a _merge column recording where each row came from, then print the result

res = pd.merge(df1, df2, on='col1', how='outer', indicator=True)
print(res)

"""
   col1 col_left  col_right      _merge
0   0.0        a        NaN   left_only
1   1.0        b        2.0        both
2   2.0      NaN        2.0  right_only
3   2.0      NaN        2.0  right_only
"""

Give the indicator column a custom name and print the result

res = pd.merge(df1, df2, on='col1', how='outer', indicator='indicator_column')
print(res)

"""
   col1 col_left  col_right indicator_column
0   0.0        a        NaN        left_only
1   1.0        b        2.0             both
2   2.0      NaN        2.0       right_only
3   2.0      NaN        2.0       right_only
"""
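One practical use of the indicator column is an anti-join: keeping only the rows of df1 that have no partner in df2. A minimal sketch on the same two frames:

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [0, 1], "col_left": ["a", "b"]})
df2 = pd.DataFrame({"col1": [1, 2, 2], "col_right": [2, 2, 2]})

merged = pd.merge(df1, df2, on="col1", how="outer", indicator=True)

# Rows that exist only in df1, restricted to df1's own columns.
left_only = merged.loc[merged["_merge"] == "left_only", ["col1", "col_left"]]

# How many rows came from each side.
counts = merged["_merge"].value_counts().to_dict()

print(left_only["col_left"].tolist())   # ['a']
print(counts["both"])                   # 1
print(counts["right_only"])             # 2
```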

In[32]:

Merging on the index

import pandas as pd

Define the datasets and print them

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])

print(left)

"""
     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2
"""

print(right)

"""
     C   D
K0  C0  D0
K2  C2  D2
K3  C3  D3
"""

Merge on the index of both datasets with how='outer' and print the result

res = pd.merge(left, right, left_index=True, right_index=True, how='outer')
print(res)

"""
      A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3  NaN  NaN   C3   D3
"""

Merge on the index of both datasets with how='inner' and print the result

res = pd.merge(left, right, left_index=True, right_index=True, how='inner')
print(res)

"""
     A   B   C   D
K0  A0  B0  C0  D0
K2  A2  B2  C2  D2
"""
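For index-on-index merges, `DataFrame.join` is a shorter spelling of the same operation. The sketch below rebuilds the two frames and compares both forms; note that `join` defaults to how='left', so the mode is passed explicitly.

```python
import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]},
                    index=["K0", "K1", "K2"])
right = pd.DataFrame({"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]},
                     index=["K0", "K2", "K3"])

merged = pd.merge(left, right, left_index=True, right_index=True, how="outer")
joined = left.join(right, how="outer")

# Sort both by index before comparing so row order cannot affect the check.
print(joined.sort_index().equals(merged.sort_index()))   # True
print(sorted(joined.index))                               # ['K0', 'K1', 'K2', 'K3']
print(list(left.join(right, how="inner").index))          # ['K0', 'K2']
```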

In[35]:

Handling overlapping column names

Define the datasets

boys = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'age': [1, 2, 3]})
girls = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'age': [4, 5, 6]})

Use suffixes to tell the two overlapping age columns apart

print(boys)
print(girls)
res = pd.merge(boys, girls, on='k', how='outer')  # without suffixes, pandas falls back to age_x and age_y
print(res)
res = pd.merge(boys, girls, on='k', suffixes=('_boy', '_girl'), how='inner')
print(res)

"""
   age_boy   k  age_girl
0        1  K0         4
1        1  K0         5
"""
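To make the effect of `suffixes` concrete, this sketch compares the column names produced with and without it on the same boys and girls frames. The column order shown assumes a current pandas version, which lists the key column first.

```python
import pandas as pd

boys = pd.DataFrame({"k": ["K0", "K1", "K2"], "age": [1, 2, 3]})
girls = pd.DataFrame({"k": ["K0", "K0", "K3"], "age": [4, 5, 6]})

default = pd.merge(boys, girls, on="k")                             # default suffixes
named = pd.merge(boys, girls, on="k", suffixes=("_boy", "_girl"))   # custom suffixes

print(sorted(default.columns))      # ['age_x', 'age_y', 'k']
print(sorted(named.columns))        # ['age_boy', 'age_girl', 'k']
print(named["age_girl"].tolist())   # [4, 5]
```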

In[54]:

Plotting with pandas

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Generate 1000 random values

data = pd.Series(np.random.randn(1000), index=np.arange(1000))

Take the cumulative sum so that the trend is easier to see

data = data.cumsum()

pandas data can be plotted directly

data.plot()

plt.show()

In[56]:

data = pd.DataFrame(
    np.random.randn(1000, 4),
    index=np.arange(1000),
    columns=list("ABCD")
)
data = data.cumsum()
data.plot()
plt.show()

In[57]:

ax = data.plot.scatter(x='A', y='B', color='DarkBlue', label='Class1')

In[58]:

Draw the next scatter plot on the same ax as the previous one

data.plot.scatter(x='A', y='C', color='LightGreen', label='Class2', ax=ax)
plt.show()
