pandas study notes

Source: the Internet · Editor: 程序博客网 · Date: 2024/06/07 22:46

pandas: powerful Python data analysis toolkit

                             —— Wes McKinney & PyData Development Team, Release 0.18.0, March 17, 2016

Customarily, we import as follows:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

1. Object Creation

1.1 Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

>>> s = pd.Series(data, index=index)

Here, data can be many different things:

  • a Python dict
  • an ndarray
  • a scalar value

The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is:

From ndarray
If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].

>>> s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
>>> s
a   -0.159223
b    2.317106
c   -0.341460
d   -1.499552
e    0.400351
dtype: float64
>>> s.index
Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')
>>> pd.Series(np.random.randn(5))
0    0.785536
1   -1.014011
2   -0.120812
3    0.289870
4    0.705393
dtype: float64

From dict
If data is a dict and an index is passed, the values in data corresponding to the labels in the index will be pulled out. Otherwise, an index will be constructed from the sorted keys of the dict, if possible.

>>> d = {'a': 0., 'b': 1., 'c': 2.}
>>> pd.Series(d)
a    0
b    1
c    2
dtype: float64
>>> pd.Series(d, index=list('bcda'))
b     1
c     2
d   NaN
a     0
dtype: float64

**Note:** NaN (not a number) is the standard missing data marker used in pandas.
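To illustrate the note, a supplementary sketch (the Series here is illustrative, not from the session above): NaN never compares equal to itself, so detection goes through isnull()/notnull(), and reductions skip it by default.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# NaN never compares equal to itself, so use isnull()/notnull() to detect it
mask = s.isnull()

# reductions such as sum() and mean() skip NaN by default (skipna=True)
total = s.sum()   # 1.0 + 3.0, the NaN is ignored
avg = s.mean()    # mean over the two non-missing values
```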

From scalar value
If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

>>> pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
a    5
b    5
c    5
d    5
e    5
dtype: float64

1.1.1 Series is ndarray-like

Series acts very similarly to an ndarray, and is a valid argument to most numpy functions. However, things like slicing also slice the index.

>>> s[0]
-0.15922308848832653
>>> s[:3]
a   -0.159223
b    2.317106
c   -0.341460
dtype: float64
>>> s[s > s.median()]
b    2.317106
e    0.400351
dtype: float64
>>> s[[4, 3, 2]]
e    0.400351
d   -1.499552
c   -0.341460
dtype: float64
>>> np.exp(s)
a     0.852806
b    10.146272
c     0.710732
d     0.223230
e     1.492348
dtype: float64

1.1.2 Series is dict-like

>>> s['a']
-0.15922308848832653
>>> s['e'] = 12
>>> s
a    -0.159223
b     2.317106
c    -0.341460
d    -1.499552
e    12.000000
dtype: float64
>>> 'e' in s
True
>>> 'f' in s
False
>>> s['f']
KeyError: 'f'

Using the get method, a missing label will return None or a specified default:

>>> s.get('f')
>>> s.get('f', np.nan)
nan
>>> s.get('e', np.nan)
12.0

1.1.3 Vectorized operations and label alignment with Series

When doing data analysis, as with raw numpy arrays, looping through a Series value-by-value is usually not necessary. A Series can also be passed into most numpy methods expecting an ndarray.

>>> s + s
a    -0.318446
b     4.634213
c    -0.682919
d    -2.999105
e    24.000000
dtype: float64
>>> s * 2
>>> np.exp(s)

A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.

>>> s[1:] + s[:-1]
a          NaN
b     4.634213
c    -0.682919
d    -2.999105
e          NaN
dtype: float64
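When the NaN produced by partial label overlap is unwanted, Series.add() accepts a fill_value that substitutes a default for the missing side (a supplementary sketch; the Series a and b below are illustrative):

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['y', 'z'])

# plain + aligns on labels; 'x' exists only in a, so the result is NaN there
summed = a + b

# add() with fill_value treats the missing side as 0 instead of propagating NaN
filled = a.add(b, fill_value=0)
```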

Creating a Series by passing a list of values, letting pandas create a default integer index:

>>> s = pd.Series([1, 3, 5, np.nan, 6, 8])
>>> s
0     1
1     3
2     5
3   NaN
4     6
5     8
dtype: float64

1.2 DataFrame

Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns:

>>> dates = pd.date_range('20130101', periods=6)
>>> dates
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None
>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
>>> df
                   A         B         C         D
2013-01-01  0.716330 -1.782610  0.809990  0.319876
2013-01-02 -0.171806 -0.526268  0.206743 -1.246213
2013-01-03 -1.774970  1.890517 -0.773496 -0.930083
2013-01-04  0.537348 -0.870212 -1.227291  1.322823
2013-01-05 -0.897589  1.275171  1.064439 -2.021186
2013-01-06  0.130427 -1.067145 -1.273118  1.786337
[6 rows x 4 columns]

Creating a DataFrame by passing a dict of objects that can be converted to series-like:

>>> df2 = pd.DataFrame({'A': 1.,
...                     'B': pd.Timestamp('20130102'),
...                     'C': pd.Series(1, index=range(4), dtype='float32'),
...                     'D': np.array([3] * 4, dtype='int32'),
...                     'E': pd.Categorical(['test', 'train', 'test', 'train']),
...                     'F': 'foo'})
>>> df2
   A          B  C  D      E    F
0  1 2013-01-02  1  3   test  foo
1  1 2013-01-02  1  3  train  foo
2  1 2013-01-02  1  3   test  foo
3  1 2013-01-02  1  3  train  foo
[4 rows x 6 columns]

Having specific dtypes

>>> df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E            object
F            object
dtype: object

If you’re using IPython,tab completion for column names (as well as public attributes) is automatically enabled.

>>> df2.<TAB>
Display all 210 possibilities? (y or n)
df2.A                  df2.append             df2.dropna
df2.B                  df2.apply              df2.dtypes
df2.C                  df2.boxplot            df2.groupby
df2.D                  df2.columns            df2.head
df2.E                  df2.copy               df2.index
df2.F                  df2.corr               df2.loc
df2.T                  df2.count              df2.sort_index
df2.abs                df2.describe           df2.to_csv
...                    ...                    ...
df2.values             df2.where              df2.xs
>>> df2.

2. Viewing Data

See the top & bottom rows of the frame:

>>> df.head()
>>> df.tail()
>>> df.head(10)
>>> df.tail(10)

Display the index, columns, and the underlying numpy data:

>>> df.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None
>>> df.columns
Index([u'A', u'B', u'C', u'D'], dtype='object')
>>> df.values
array([[ 0.71632974, -1.78261015,  0.80999048,  0.31987599],
       [-0.17180562, -0.52626809,  0.20674317, -1.24621339],
       [-1.77496978,  1.89051681, -0.77349583, -0.93008323],
       [ 0.53734751, -0.87021202, -1.22729091,  1.32282329],
       [-0.89758898,  1.27517093,  1.06443943, -2.02118609],
       [ 0.13042695, -1.06714528, -1.27311829,  1.78633711]])

describe() shows a quick statistic summary of your data:

>>> df.describe()
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.243377 -0.180091 -0.198789 -0.128074
std    0.943312  1.439183  1.031516  1.513146
min   -1.774970 -1.782610 -1.273118 -2.021186
25%   -0.716143 -1.017912 -1.113842 -1.167181
50%   -0.020689 -0.698240 -0.283376 -0.305104
75%    0.435617  0.824811  0.659179  1.072086
max    0.716330  1.890517  1.064439  1.786337
[8 rows x 4 columns]
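Beyond the default summary, describe() also accepts a percentiles argument controlling which quantile rows appear (a supplementary sketch; the tiny one-column frame below is illustrative, not from the session above):

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0]})

# request the 10% and 90% quantiles; the 50% (median) row is always included
summary = df.describe(percentiles=[0.1, 0.9])
```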

Transposing your data:

>>> df.T
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.716330   -0.171806   -1.774970    0.537348   -0.897589    0.130427
B   -1.782610   -0.526268    1.890517   -0.870212    1.275171   -1.067145
C    0.809990    0.206743   -0.773496   -1.227291    1.064439   -1.273118
D    0.319876   -1.246213   -0.930083    1.322823   -2.021186    1.786337
[4 rows x 6 columns]

Sorting by an axis (the sort_index method):

>>> df.sort_index(axis=1, ascending=False)
                   D         C         B         A
2013-01-01  0.319876  0.809990 -1.782610  0.716330
2013-01-02 -1.246213  0.206743 -0.526268 -0.171806
2013-01-03 -0.930083 -0.773496  1.890517 -1.774970
2013-01-04  1.322823 -1.227291 -0.870212  0.537348
2013-01-05 -2.021186  1.064439  1.275171 -0.897589
2013-01-06  1.786337 -1.273118 -1.067145  0.130427
[6 rows x 4 columns]
>>> df.sort_index(axis=0, ascending=False)
                   A         B         C         D
2013-01-06  0.130427 -1.067145 -1.273118  1.786337
2013-01-05 -0.897589  1.275171  1.064439 -2.021186
2013-01-04  0.537348 -0.870212 -1.227291  1.322823
2013-01-03 -1.774970  1.890517 -0.773496 -0.930083
2013-01-02 -0.171806 -0.526268  0.206743 -1.246213
2013-01-01  0.716330 -1.782610  0.809990  0.319876
[6 rows x 4 columns]
>>> df.sort_index(axis=1, ascending=False).sort_index(axis=0, ascending=False)
                   D         C         B         A
2013-01-06  1.786337 -1.273118 -1.067145  0.130427
2013-01-05 -2.021186  1.064439  1.275171 -0.897589
2013-01-04  1.322823 -1.227291 -0.870212  0.537348
2013-01-03 -0.930083 -0.773496  1.890517 -1.774970
2013-01-02 -1.246213  0.206743 -0.526268 -0.171806
2013-01-01  0.319876  0.809990 -1.782610  0.716330
[6 rows x 4 columns]

Sorting by values (the sort method; in the ascending list, 0 means descending and 1 means ascending):

>>> df.sort(['A', 'B'], ascending=[0, 1])
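Note that in pandas releases after 0.17, df.sort() was deprecated in favour of sort_values(); an equivalent call looks like this (a sketch with a small illustrative frame, not the random df from the session above):

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 1, 2], 'B': [9, 5, 3]})

# sort by column A descending, breaking ties by column B ascending;
# booleans (or 0/1) in `ascending` pair up positionally with `by`
out = df.sort_values(by=['A', 'B'], ascending=[False, True])
```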

3. Selection

3.1 Getting

Selecting a single column yields a Series; df['A'] is equivalent to df.A:

>>> df.A
2013-01-01    0.716330
2013-01-02   -0.171806
2013-01-03   -1.774970
2013-01-04    0.537348
2013-01-05   -0.897589
2013-01-06    0.130427
Freq: D, Name: A, dtype: float64
>>> df['A']
2013-01-01    0.716330
2013-01-02   -0.171806
2013-01-03   -1.774970
2013-01-04    0.537348
2013-01-05   -0.897589
2013-01-06    0.130427
Freq: D, Name: A, dtype: float64

Selecting via [], which slices the rows:

>>> df[0:3]
                   A         B         C         D
2013-01-01  0.716330 -1.782610  0.809990  0.319876
2013-01-02 -0.171806 -0.526268  0.206743 -1.246213
2013-01-03 -1.774970  1.890517 -0.773496 -0.930083
>>> df['20130101':'20130103']
                   A         B         C         D
2013-01-01  0.716330 -1.782610  0.809990  0.319876
2013-01-02 -0.171806 -0.526268  0.206743 -1.246213
2013-01-03 -1.774970  1.890517 -0.773496 -0.930083
[3 rows x 4 columns]

3.2 Selecting by Label

For getting a cross-section using a label:

>>> df.loc[dates[0]]
A    0.716330
B   -1.782610
C    0.809990
D    0.319876
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on a multi-axis by label:

>>> df.loc[:, ['A', 'B']]
                   A         B
2013-01-01  0.716330 -1.782610
2013-01-02 -0.171806 -0.526268
2013-01-03 -1.774970  1.890517
2013-01-04  0.537348 -0.870212
2013-01-05 -0.897589  1.275171
2013-01-06  0.130427 -1.067145
[6 rows x 2 columns]

Showing label slicing; both endpoints are included:

>>> df.loc['20130102':'20130104', ['A', 'B']]
                   A         B
2013-01-02 -0.171806 -0.526268
2013-01-03 -1.774970  1.890517
2013-01-04  0.537348 -0.870212
[3 rows x 2 columns]

Reduction in the dimensions of the returned object:

>>> df.loc['20130102', ['A', 'B']]
A   -0.171806
B   -0.526268
Name: 2013-01-02 00:00:00, dtype: float64

For getting a scalar value:

>>> df.loc[dates[0], 'A']
0.71632974391895454

For getting fast access to a scalar (equivalent to the prior method):

>>> df.at[dates[0], 'A']
0.71632974391895454

3.3 Selection by Position

Select via the position of the passed integers:

>>> df.iloc[3]    # the fourth row
A    0.537348
B   -0.870212
C   -1.227291
D    1.322823
Name: 2013-01-04 00:00:00, dtype: float64

By integer slices, acting similarly to numpy/python:

>>> df.iloc[3:5, 0:2]
                   A         B
2013-01-04  0.537348 -0.870212
2013-01-05 -0.897589  1.275171
[2 rows x 2 columns]

By lists of integer position locations, similar to the numpy/python style:

>>> df.iloc[[1, 2, 4], [0, 2]]
                   A         C
2013-01-02 -0.171806  0.206743
2013-01-03 -1.774970 -0.773496
2013-01-05 -0.897589  1.064439
[3 rows x 2 columns]

For slicing rows explicitly:

>>> df.iloc[1:3, :]
                   A         B         C         D
2013-01-02 -0.171806 -0.526268  0.206743 -1.246213
2013-01-03 -1.774970  1.890517 -0.773496 -0.930083
[2 rows x 4 columns]

For slicing columns explicitly:

>>> df.iloc[:, 1:3]
                   B         C
2013-01-01 -1.782610  0.809990
2013-01-02 -0.526268  0.206743
2013-01-03  1.890517 -0.773496
2013-01-04 -0.870212 -1.227291
2013-01-05  1.275171  1.064439
2013-01-06 -1.067145 -1.273118
[6 rows x 2 columns]

For getting a value explicitly:

>>> df.iloc[1, 1]
-0.52626808513391488

For getting fast access to a scalar (equivalent to the prior method):

>>> df.iat[1, 1]
-0.52626808513391488
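One contrast between the label-based and position-based subsections worth spelling out (a supplementary sketch; the Series s below is illustrative): .loc slices include both endpoints, while .iloc follows Python's usual half-open convention.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# label slicing with .loc includes BOTH endpoints: 'b' and 'c'
by_label = s.loc['b':'c']

# integer-position slicing with .iloc excludes the stop,
# just like ordinary Python slicing: position 1 only
by_pos = s.iloc[1:2]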

3.4 Boolean Indexing

Using a single column's values to select data (note: the outputs below come from a re-generated random df, so the numbers differ from the earlier session):

>>> df[df.A > 0]    # equivalently df[df['A'] > 0]
                   A         B         C         D
2013-01-02  0.859761  0.755971  1.371420  0.271600
2013-01-03  0.606392  0.077458  0.251290  2.134013
2013-01-05  0.022155 -0.216343 -1.179598  0.431374
2013-01-06  2.676268  2.295133 -2.132639  0.702915
[4 rows x 4 columns]

A where operation for getting:

>>> df[df > 0]
                   A         B         C         D
2013-01-01       NaN       NaN       NaN  0.209321
2013-01-02  0.859761  0.755971  1.371420  0.271600
2013-01-03  0.606392  0.077458  0.251290  2.134013
2013-01-04       NaN       NaN  0.946518       NaN
2013-01-05  0.022155       NaN       NaN  0.431374
2013-01-06  2.676268  2.295133       NaN  0.702915
[6 rows x 4 columns]

Using the isin() method for filtering:

>>> df2 = df.copy()
>>> df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
>>> df2
                   A         B         C         D      E
2013-01-01 -0.234954 -1.346601 -1.030691  0.209321    one
2013-01-02  0.859761  0.755971  1.371420  0.271600    one
2013-01-03  0.606392  0.077458  0.251290  2.134013    two
2013-01-04 -0.938926 -0.749240  0.946518 -0.248072  three
2013-01-05  0.022155 -0.216343 -1.179598  0.431374   four
2013-01-06  2.676268  2.295133 -2.132639  0.702915  three
[6 rows x 5 columns]
>>> df2[df2.E.isin(['two', 'four'])]
                   A         B         C         D     E
2013-01-03  0.606392  0.077458  0.251290  2.134013   two
2013-01-05  0.022155 -0.216343 -1.179598  0.431374  four
[2 rows x 5 columns]
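Multiple boolean conditions can also be combined, a common extension of the single-column case above (a sketch; the frame and column names are illustrative). Note the parentheses around each comparison and the element-wise & and | operators; Python's `and`/`or` do not work element-wise on Series.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, -1, 2, -2], 'B': [5, 6, -7, 8]})

# rows where both columns are positive
both_pos = df[(df.A > 0) & (df.B > 0)]

# rows where at least one column is negative
either_neg = df[(df.A < 0) | (df.B < 0)]
```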

3.5 Setting

Setting a new column automatically aligns the data by the indexes

>>> s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
>>> s1
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64
>>> df['F'] = s1

Setting values by label

>>> df.at[dates[0], 'A'] = 0

Setting values by position

>>> df.iat[0, 1] = 0

Setting by assigning with a numpy array

>>> df.loc[:, 'D'] = np.array([5 * len(df)])    # array([30]); cf. np.array([5] * len(df)) for a column of fives

The result of the prior setting operations

>>> df
                   A         B         C   D   F
2013-01-01  0.000000  0.000000 -1.030691  30 NaN
2013-01-02  0.859761  0.755971  1.371420  30   1
2013-01-03  0.606392  0.077458  0.251290  30   2
2013-01-04 -0.938926 -0.749240  0.946518  30   3
2013-01-05  0.022155 -0.216343 -1.179598  30   4
2013-01-06  2.676268  2.295133 -2.132639  30   5
[6 rows x 5 columns]

A where operation with setting

>>> df2 = df.copy()
>>> df2[df2 > 0] = -df2
>>> df2
                   A         B         C   D   F
2013-01-01  0.000000  0.000000 -1.030691 -30 NaN
2013-01-02 -0.859761 -0.755971 -1.371420 -30  -1
2013-01-03 -0.606392 -0.077458 -0.251290 -30  -2
2013-01-04 -0.938926 -0.749240 -0.946518 -30  -3
2013-01-05 -0.022155 -0.216343 -1.179598 -30  -4
2013-01-06 -2.676268 -2.295133 -2.132639 -30  -5
[6 rows x 5 columns]

4. Missing Data

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations.

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

>>> df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
>>> df1.loc[dates[0]:dates[1], 'E'] = 1
>>> df
                   A         B         C   D   F
2013-01-01  0.000000  0.000000 -1.030691  30 NaN
2013-01-02  0.859761  0.755971  1.371420  30   1
2013-01-03  0.606392  0.077458  0.251290  30   2
2013-01-04 -0.938926 -0.749240  0.946518  30   3
2013-01-05  0.022155 -0.216343 -1.179598  30   4
2013-01-06  2.676268  2.295133 -2.132639  30   5
[6 rows x 5 columns]
>>> df1
                   A         B         C   D   F   E
2013-01-01  0.000000  0.000000 -1.030691  30 NaN   1
2013-01-02  0.859761  0.755971  1.371420  30   1   1
2013-01-03  0.606392  0.077458  0.251290  30   2 NaN
2013-01-04 -0.938926 -0.749240  0.946518  30   3 NaN
[4 rows x 6 columns]

To drop any rows that have missing data, or to fill missing data with a value:

>>> df1.dropna(how='any')
                   A         B        C   D  F  E
2013-01-02  0.859761  0.755971  1.37142  30  1  1
[1 rows x 6 columns]
>>> df1.fillna(value=5)
                   A         B         C   D  F  E
2013-01-01  0.000000  0.000000 -1.030691  30  5  1
2013-01-02  0.859761  0.755971  1.371420  30  1  1
2013-01-03  0.606392  0.077458  0.251290  30  2  5
2013-01-04 -0.938926 -0.749240  0.946518  30  3  5
[4 rows x 6 columns]

To get the boolean mask where values are NaN:

>>> pd.isnull(df1)    # equivalently df1.isnull()
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True
[4 rows x 6 columns]
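One common use of this boolean mask (a supplementary sketch, with an illustrative two-column frame): summing it counts the missing values per column, since True counts as 1.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                    'E': [np.nan, 1.0, np.nan]})

# isnull() gives a frame of booleans; sum() reduces each column,
# counting True (i.e. NaN positions) as 1
missing_per_column = df1.isnull().sum()
```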

5. Operations

5.1 Stats

Operations in general exclude missing data.
Performing a descriptive statistic

>>> df.mean()
A     0.537608
B     0.360497
C    -0.295617
D    30.000000
F     3.000000
dtype: float64

Same operation on the other axis

>>> df.mean(1)
2013-01-01    7.242327
2013-01-02    6.797430
2013-01-03    6.587028
2013-01-04    6.451670
2013-01-05    6.525243
2013-01-06    7.567752
Freq: D, dtype: float64

Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.

>>> s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
>>> s
2013-01-01   NaN
2013-01-02   NaN
2013-01-03     1
2013-01-04     3
2013-01-05     5
2013-01-06   NaN
Freq: D, dtype: float64
>>> df.sub(s, axis='index')
                   A         B         C   D   F
2013-01-01       NaN       NaN       NaN NaN NaN
2013-01-02       NaN       NaN       NaN NaN NaN
2013-01-03 -0.393608 -0.922542 -0.748710  29   1
2013-01-04 -3.938926 -3.749240 -2.053482  27   0
2013-01-05 -4.977845 -5.216343 -6.179598  25  -1
2013-01-06       NaN       NaN       NaN NaN NaN
[6 rows x 5 columns]

5.2 Apply

Applying functions to the data:

>>> df.apply(np.cumsum)    # cumulative sum
                   A         B         C    D   F
2013-01-01  0.000000  0.000000 -1.030691   30 NaN
2013-01-02  0.859761  0.755971  0.340729   60   1
2013-01-03  1.466153  0.833429  0.592019   90   3
2013-01-04  0.527227  0.084189  1.538537  120   6
2013-01-05  0.549382 -0.132154  0.358939  150  10
2013-01-06  3.225651  2.162979 -1.773700  180  15
[6 rows x 5 columns]
>>> df.apply(lambda x: x.max() - x.min())
A    3.615194
B    3.044374
C    3.504059
D    0.000000
F    4.000000
dtype: float64

5.3 Histogramming

>>> s = pd.Series(np.random.randint(0, 7, size=10))
>>> s
0    6
1    3
2    6
3    0
4    0
5    6
6    4
7    3
8    0
9    1
dtype: int64
>>> s.value_counts()
6    3
0    3
3    2
4    1
1    1
dtype: int64

5.4 String Methods

Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them).

>>> s = pd.Series(['A', 'B', 'C', 'AaBa', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
>>> s.str.lower()
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
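As a supplementary sketch of the regex behaviour mentioned above (the data and patterns here are illustrative; regex=True is passed explicitly for compatibility with newer pandas, where literal replacement became the default):

```python
import pandas as pd

s = pd.Series(['item_01', 'item_02', 'other'])

# str.contains interprets its pattern as a regular expression by default
mask = s.str.contains(r'^item_\d+$')

# str.replace can also take a regex pattern
stripped = s.str.replace(r'_\d+', '', regex=True)
```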

6. Merge

6.1 Concat
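A minimal sketch of row-wise concatenation with pd.concat, in the spirit of the examples above (the frame and the split points below are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4))

# break the frame into pieces and glue them back together row-wise
pieces = [df[:3], df[3:7], df[7:]]
result = pd.concat(pieces)

# the reassembled frame has the same shape and values as the original
same = result.equals(df)
```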
