Common Pandas Operations

Source: Internet. Editor: 程序博客网. Date: 2024/06/06 16:35

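Pandas is a Python-based data analysis package. It is convenient to use and takes much of the tedious work off a data analyst's hands. Strongly recommended!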

Series

import pandas as pd

About the Series class

class Series(pandas.core.base.IndexOpsMixin, pandas.core.generic.NDFrame)
| One-dimensional ndarray with axis labels (including time series).

help(pd.Series(a[1:,0]).map)
Examples

>>> x
one   1
two   2
three 3
>>> y
1  foo
2  bar
3  baz
>>> x.map(y)
one   foo
two   bar
three baz

>>> pd.Series(a[1:,0])
0    4
1    7
dtype: int64
>>> pd.Series(a[1:,0]).map(lambda x: "|"+str(x)+"|")
0    |4|
1    |7|
dtype: object

>>> pda
   c1  c2
0   1   4
1   2   5
2   3   6
>>> pdb
   c1  c2
0  10   7
1   9   6
2   8   5
>>> pda+10+pdb
   c1  c2
0  21  21
1  21  21
2  21  21
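The map examples above can be reproduced end to end with a small sketch. The Series x and y are rebuilt here from the help output; the dict passed to map is an illustrative assumption, added to show what happens when a value has no match.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=["one", "two", "three"])
y = pd.Series(["foo", "bar", "baz"], index=[1, 2, 3])

# Mapping with a Series: each value of x is looked up in the index of y.
mapped = x.map(y)

# Mapping with a dict works the same way; values without a match become NaN.
partial = x.map({1: "foo", 2: "bar"})

# Mapping with a function applies it to every element.
wrapped = x.map(lambda v: "|" + str(v) + "|")

print(mapped)
print(partial)
print(wrapped)
```

In all three cases the index of x is preserved; only the values change.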

Boolean selection on a Series

>>> pda[pda>2]
2    3
3    4
4    5
dtype: int64

A slightly more complex example

>>> ser = pd.Series(['111', '112', '122'])
>>> ser
0    111
1    112
2    122
dtype: object

>>> ser[[x.startswith('11') for x in ser]]
0    111
1    112
dtype: object

You cannot simply write ser[ser.startswith('11')] here, because ser is a Series, not a str, and a Series has no startswith method.
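As an alternative to the list comprehension, pandas exposes string methods on a Series through the .str accessor. The sketch below reuses the ser values from above; the combined condition is an added illustration, not part of the original post.

```python
import pandas as pd

ser = pd.Series(["111", "112", "122"])

# .str.startswith returns a boolean Series aligned with ser.
mask = ser.str.startswith("11")
selected = ser[mask]

# Conditions are combined with &, | and ~ (not with and/or/not).
combined = ser[ser.str.startswith("11") & ~ser.str.endswith("2")]

print(selected)
print(combined)
```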

DataFrame

Common ways to construct a DataFrame:

Combine several Series
Read directly from a CSV file: pd.read_csv("filename.csv")
Build from a multi-dimensional array (requires import numpy as np):
df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
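A minimal sketch of the first and third construction methods listed above. The Series names and values are made up for illustration, and a seeded NumPy generator is used in place of np.random.randn so the result is repeatable.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3], name="c1")
s2 = pd.Series([4, 5, 6], name="c2")

# From a dict of Series: the keys become the column names.
df_from_dict = pd.DataFrame({"c1": s1, "c2": s2})

# From a list of Series joined side by side: the Series names become the column names.
df_from_concat = pd.concat([s1, s2], axis=1)

# From a 2-D array, with explicit column names.
rng = np.random.default_rng(0)
df_from_array = pd.DataFrame(rng.standard_normal((10, 4)), columns=["A", "B", "C", "D"])

print(df_from_dict)
print(df_from_array.shape)
```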

Once the data is built, save it to a file on disk

df = pd.DataFrame({
    'order_adslot': order_adslot_ds,
    'weight': weight
})
df.to_csv('data/gm_ranking_model.dat', index=False, sep='\t')
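To check that a tab-separated file written this way reads back correctly, a round trip can be sketched as follows. The column values are invented placeholders (the original order_adslot_ds and weight are not shown), and a temporary directory stands in for the data/ folder.

```python
import os
import tempfile

import pandas as pd

# Placeholder data; the real order_adslot_ds and weight are not given in the post.
df = pd.DataFrame({"order_adslot": ["a_1", "b_2"], "weight": [0.5, 1.5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "gm_ranking_model.dat")
    # index=False leaves the row labels out of the file; sep="\t" writes tabs.
    df.to_csv(path, index=False, sep="\t")
    # The same separator must be passed when reading the file back.
    restored = pd.read_csv(path, sep="\t")

print(restored)
```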

Append a row to an existing DataFrame

>>> # DataFrame.append was removed in pandas 2.0; pd.concat produces the same result
>>> df = pd.concat([df, pd.DataFrame({"c1":[100], "c2":[200]})])
>>> df
    c1   c2
0    1   10
1    2    9
2    3    8
3    4    7
4    5    6
5    1   10
0  100  200
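Note that the appended row keeps its own index label 0, so the result has a duplicate label. A small sketch of how ignore_index changes that, using a shortened version of the data above:

```python
import pandas as pd

df = pd.DataFrame({"c1": [1, 2, 3], "c2": [10, 9, 8]})
new_row = pd.DataFrame({"c1": [100], "c2": [200]})

# Default: the original index labels are kept, so label 0 appears twice.
kept = pd.concat([df, new_row])

# ignore_index=True renumbers the rows 0..n-1.
renumbered = pd.concat([df, new_row], ignore_index=True)

print(kept)
print(renumbered)
```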

Group and sum (aggregation)

>>> df.groupby('c1').sum()    # or df.groupby(['c1']).sum()
      c2
c1
1     20
2      9
3      8
4      7
5      6
100  200

As you can see, the result of the aggregation will have the group names as the new index along the grouped axis. To keep the group names as an ordinary column instead, pass as_index=False:

>>> df.groupby(['c1'], as_index=False).sum()
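The difference between the two groupby calls can be seen side by side in the sketch below, which rebuilds the seven-row frame from the previous section.

```python
import pandas as pd

df = pd.DataFrame({
    "c1": [1, 2, 3, 4, 5, 1, 100],
    "c2": [10, 9, 8, 7, 6, 10, 200],
})

# Default: the group keys become the index of the result.
indexed = df.groupby("c1").sum()

# as_index=False: the group keys stay as a regular column.
flat = df.groupby("c1", as_index=False).sum()

print(indexed)
print(flat)
```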

Arithmetic between columns

>>> df
   c1  c2
0   1   1
1   2   1
2   3   4
3   4   4
4   5   5
>>> df['c3'] = df.c1+df.c2
>>> df
   c1  c2  c3
0   1   1   2
1   2   1   3
2   3   4   7
3   4   4   8
4   5   5  10

Delete a column

>>> del df['c3']
>>> df
   c1  c2
0   1   1
1   2   1
2   3   4
3   4   4
4   5   5
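Both the column assignment and del modify df in place. Where the original frame should stay untouched, assign and drop return new frames instead. This is an optional alternative, sketched with the same data:

```python
import pandas as pd

df = pd.DataFrame({"c1": [1, 2, 3, 4, 5], "c2": [1, 1, 4, 4, 5]})

# assign returns a new DataFrame with the extra column; df itself is unchanged.
with_sum = df.assign(c3=df["c1"] + df["c2"])

# drop returns a new DataFrame without the listed columns.
without_sum = with_sum.drop(columns=["c3"])

print(with_sum)
print(without_sum)
```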

Boolean selection: choose the rows that satisfy a condition

>>> df[df.c1>=3]
   c1  c2  c3
2   3   4   7
3   4   4   8
4   5   5  10
>>> df[df["c1"]>=3]
   c1  c2  c3
2   3   4   7
3   4   4   8
4   5   5  10
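Several conditions can be combined in one selection. The sketch below is an added illustration on the same c1/c2 data; each condition needs its own parentheses because & and | bind more tightly than the comparison operators.

```python
import pandas as pd

df = pd.DataFrame({"c1": [1, 2, 3, 4, 5], "c2": [1, 1, 4, 4, 5]})

# Rows where both conditions hold.
both = df[(df["c1"] >= 3) & (df["c2"] < 5)]

# Rows where at least one condition holds.
either = df[(df["c1"] < 2) | (df["c2"] == 5)]

# The same "both" filter written as a query string.
via_query = df.query("c1 >= 3 and c2 < 5")

print(both)
print(either)
```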

A digression on a strange problem:

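>>> df
    c1        c2                                           c3   c4
0  昂科拉  10-14-15  太平洋汽车移动PCauto手机客户端\n（IOS&Android）资讯/图文列表第十位    1
1    a         b                                            c  NaN
>>> import math
>>> df[[math.isnan(float(x)) for x in df["c4"]]]
  c1 c2 c3   c4
1  a  b  c  NaN
>>> df[[math.isnan(float(x)) for x in df["c4"] if type(x) is float]]

This one raises an error. The cause was not tracked down at the time; a likely explanation is that the if clause drops elements from the list, so the boolean list can end up shorter than the DataFrame, and a boolean indexer must have one entry per row. 金勇 suggested another way:

>>> df[df["c4"].isnull()]

So how do we select the rows that are not null?

>>> df[df["c4"].isnull() is False]    # this also raises an error

The expression df["c4"].isnull() is False is a Python identity test, so it evaluates to the plain value False rather than to a boolean Series, and the lookup fails. 金勇 suggested another way:

>>> df = df.dropna()

dropna() removes the rows in which at least one column is null (the default, how='any') or the rows in which every column is null (how='all'). To keep the rows where one specific column is not null, use notnull():

>>> df[df["c4"].notnull()]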
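The null-handling options discussed above can be compared on a small frame. The data here is a made-up stand-in (three rows, with nulls in different columns) so that dropna() and a per-column filter give visibly different results.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 has a null in c4, row 2 has a null in c1.
df = pd.DataFrame({"c1": ["x", "a", None], "c4": [1.0, np.nan, 3.0]})

null_rows = df[df["c4"].isnull()]         # rows where c4 is null
non_null_rows = df[df["c4"].notnull()]    # rows where c4 is not null
negated = df[~df["c4"].isnull()]          # same selection, using ~ to negate the mask

all_clean = df.dropna()                   # drops rows with a null in any column
c4_clean = df.dropna(subset=["c4"])       # only considers nulls in c4

# "is False" compares object identity, so this is a plain bool, not a mask.
not_a_mask = df["c4"].isnull() is False

print(non_null_rows)
print(all_clean)
```

dropna() with no arguments is stricter than df[df["c4"].notnull()], because it also discards rows that are null in other columns.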

Select rows and columns by position

>>> df.iloc[:, [1,2]]
   c2  c3
0   1   2
1   1   3
2   4   7
3   4   8
4   5  10
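For comparison, here is a short sketch of iloc (position-based) next to loc (label-based) on the same data. The slices are added illustrations; the main thing to note is that iloc slices exclude the end position while loc slices include the end label.

```python
import pandas as pd

df = pd.DataFrame({
    "c1": [1, 2, 3, 4, 5],
    "c2": [1, 1, 4, 4, 5],
    "c3": [2, 3, 7, 8, 10],
})

by_position = df.iloc[:, [1, 2]]          # second and third columns
by_label = df.loc[:, ["c2", "c3"]]        # the same columns, selected by name

pos_block = df.iloc[1:3, 0:2]             # rows 1 and 2, columns c1 and c2 (end excluded)
label_block = df.loc[1:3, "c1":"c2"]      # rows 1, 2 and 3, columns c1 and c2 (end included)

print(by_position)
print(pos_block)
print(label_block)
```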

Deep copy and shallow copy

>>> df2 = df.copy()
>>> del df2["c3"]
>>> df
   c1  c2  c3
0   1   1   2
1   2   1   3
2   3   4   7
3   4   4   8
4   5   5  10
>>> df2
   c1  c2
0   1   1
1   2   1
2   3   4
3   4   4
4   5   5
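To see why copy() matters, the sketch below contrasts it with plain assignment, which only creates a second name for the same object. The data is a shortened version of the frame above. Shallow copies made with copy(deep=False) are left out, since their behaviour depends on the pandas version in use.

```python
import pandas as pd

df = pd.DataFrame({"c1": [1, 2], "c2": [1, 1], "c3": [2, 3]})

alias = df          # no copy: alias and df are the same object
deep = df.copy()    # deep=True is the default: data is duplicated

# Changes to the deep copy do not reach df.
del deep["c3"]
deep.loc[0, "c1"] = 99
print(df)           # still has c3, and c1 still starts with 1

# Changes through the alias do reach df.
del alias["c3"]
print(df)           # c3 is gone
```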

Join operations between DataFrames

>>> df2
   c1  c2  c4
0   1   1   1
1   2   1   2
2   3   4  12
3   4   4  16
4   5   5  25
>>> df.merge(df2)
   c1  c2  c3  c4
0   1   1   2   1
1   2   1   3   2
2   3   4   7  12
3   4   4   8  16
4   5   5  10  25
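Called without arguments, merge performs an inner join on the columns the two frames have in common (here c1 and c2). The sketch below spells those defaults out and adds a left join for contrast; the names left, right and right_partial are illustrative.

```python
import pandas as pd

left = pd.DataFrame({"c1": [1, 2, 3, 4, 5], "c2": [1, 1, 4, 4, 5], "c3": [2, 3, 7, 8, 10]})
right = pd.DataFrame({"c1": [1, 2, 3, 4, 5], "c2": [1, 1, 4, 4, 5], "c4": [1, 2, 12, 16, 25]})

# Default merge: inner join on all shared column names.
default = left.merge(right)

# The same join with the keys and join type written out.
explicit = left.merge(right, on=["c1", "c2"], how="inner")

# A left join keeps every row of left; unmatched rows get NaN in c4.
right_partial = right[right["c1"] <= 3]
left_join = left.merge(right_partial, on=["c1", "c2"], how="left")

print(default)
print(left_join)
```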

Reference: http://pandas.pydata.org/pandas-docs/stable/api.html

0 0