Python Data Analysis: Notes on pandas

Source: Internet; Editor: 程序博客网; Time: 2024/05/21 21:46

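1. Series basics
A Series is like a column vector with an index attached on its left. It has two main attributes, values and index, accessed as Series.values and Series.index. Both the Series object and its index have a name attribute (Series.name and Series.index.name), which lets you name the Series and its index; these names are used by many other pandas features.
A DataFrame is like a two-dimensional array with an index on both rows and columns. To select a row, use frame.loc[i] (the frame.ix[i] accessor found in older code has been removed from pandas); to select a column, use frame[i] or frame.i, where i is the row or column label.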
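The attributes described above can be tried out with a minimal sketch like the following. The data, the labels, and the names "score" and "key" are made up for illustration.

```python
import pandas as pd

# A Series is a column of values plus an index.
s = pd.Series([4, 7, -3], index=["d", "a", "b"], name="score")
s.index.name = "key"
print(s.values)   # the underlying NumPy array
print(s.index)    # the index labels

# A DataFrame has both a row index and a column index.
frame = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r1", "r2"])
row = frame.loc["r1"]   # select a row by label
col = frame["x"]        # select a column; frame.x also works for valid identifiers
print(row)
print(col)
```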

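2. The apply method: in Python 2, the built-in apply(func) simply calls func(). It was removed in Python 3, where you call the function directly. For example,
when func takes no arguments:

Input:
    def say():
        print('say in')

    say()    # Python 2: apply(say)
Output:
    say in

When func takes arguments:

Input:
    def say(a, b):
        print(a, b)

    say(*('hello', 'zhangsan'))    # Python 2: apply(say, ('hello', 'zhangsan'))
Output:
    hello zhangsan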

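For a DataFrame, the apply method runs a function along each column or each row. In the example below, apply() uses the default axis=0, which passes each column to the function (it works down the rows), so the result has one value per column. apply(f, axis=1) passes each row to the function instead, so the result has one value per row.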

Input:
    import numpy as np
    from pandas import DataFrame

    frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
                      index=['Utah', 'Ohio', 'Texas', 'Oregon'])
    f = lambda x: x.max() - x.min()    # range (max minus min) of a column or row
    print(frame)
    print(frame.apply(f))
    print(frame.apply(f, axis=1))
Output (the values are random and differ between runs):
                   b         d         e
    Utah   -0.613367 -0.689123  0.001532
    Ohio    0.835977  1.377497 -0.681188
    Texas  -1.865279 -0.587092  0.057747
    Oregon -0.770581  1.244155  0.060371
    # frame
    b    2.701256
    d    2.066620
    e    0.741559
    dtype: float64
    # frame.apply(f): one value per column
    Utah      0.690655
    Ohio      2.058685
    Texas     1.923026
    Oregon    2.014736
    dtype: float64
    # frame.apply(f, axis=1): one value per row
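To make the axis behavior easy to check by hand, here is a minimal sketch that uses small fixed integers instead of random data. The frame contents and the name spread are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"b": [1, 5, 2], "d": [10, 0, 4]},
                     index=["Utah", "Ohio", "Texas"])

def spread(x):
    # x is one column (axis=0) or one row (axis=1), passed in as a Series
    return x.max() - x.min()

per_column = frame.apply(spread)          # default axis=0: one result per column
per_row = frame.apply(spread, axis=1)     # axis=1: one result per row
print(per_column)
print(per_row)
```

The result of the default call is indexed by the column labels, and the result of the axis=1 call is indexed by the row labels, which is a quick way to tell which direction was used.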

3. Sorting
For a Series, use sort_index() to sort by index labels and sort_values() to sort by values. Both return a new object.

Input:
    from pandas import Series

    obj = Series([4, 7, -3, 2], index=['d', 'a', 'b', 'c'])
    print(obj.sort_index())
    print(obj.sort_values())
Output:
    a    7
    b   -3
    c    2
    d    4
    dtype: int64
    # sorted by index
    b   -3
    c    2
    d    4
    a    7
    dtype: int64
    # sorted by value
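As a small sketch of the "returns a new object" point, the following reuses the same Series and also shows descending order via the ascending parameter. The variable names are made up for illustration.

```python
import pandas as pd

obj = pd.Series([4, 7, -3, 2], index=["d", "a", "b", "c"])
by_label = obj.sort_index()                        # sorted by index label
by_value_desc = obj.sort_values(ascending=False)   # sorted by value, largest first

# The original Series keeps its order; both calls returned new objects.
print(list(obj.index))
print(by_label)
print(by_value_desc)
```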

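For a DataFrame, sort_index() sorts by the row index and sort_index(axis=1) sorts by the column index; the values in each row or column move together with their labels. Sorting is ascending by default. For descending order, pass ascending=False, for example sort_index(axis=1, ascending=False). Note that the order method used in the book is no longer available; it has been replaced by sort_values.

Input:
    import numpy as np
    from pandas import DataFrame

    frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                      columns=['d', 'a', 'b', 'c'])
    print(frame.sort_index())
    print(frame.sort_index(axis=1))
Output:
           d  a  b  c
    one    4  5  6  7
    three  0  1  2  3
    # row index sorted: 'one' comes before 'three'
           a  b  c  d
    three  1  2  3  0
    one    5  6  7  4
    # column index sorted: 'a' comes first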
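The descending variant mentioned above can be sketched on the same frame as follows. The variable name desc_columns is made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=["three", "one"],
                     columns=["d", "a", "b", "c"])

# Sort the column labels in reverse alphabetical order; values follow their labels.
desc_columns = frame.sort_index(axis=1, ascending=False)
print(desc_columns)
```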

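To sort a DataFrame by the values in one or more columns, use sort_values(by=...). For example,
sorting by one column:

Input:
    from pandas import DataFrame

    frame = DataFrame({'a': [0, 1, 0, 1], 'b': [4, 7, -3, 2]})
    print(frame.sort_values(by='b'))
Output:
       a  b
    2  0 -3
    3  1  2
    0  0  4
    1  1  7

Sorting by two columns:

Input:
    from pandas import DataFrame

    frame = DataFrame({'a': [0, 1, 0, 1], 'b': [4, 7, -3, 2]})
    print(frame.sort_values(by=['a', 'b']))
Output:
       a  b
    2  0 -3
    0  0  4
    3  1  2
    1  1  7
    # sorted by column a first; rows with equal a are ordered by column b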
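When sorting by several columns, each column can have its own direction by passing a list to ascending. This is a minimal sketch on the same data; the variable name mixed is made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"a": [0, 1, 0, 1], "b": [4, 7, -3, 2]})

# Column a ascending, then column b descending within each group of equal a.
mixed = frame.sort_values(by=["a", "b"], ascending=[True, False])
print(mixed)
```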

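4. Ranking
Use rank() to rank the values of a Series or DataFrame. When values are tied, the method option controls how the tie is broken:
average: the default; assigns each value in the tied group the average rank
min: uses the lowest rank for the whole group
max: uses the highest rank for the whole group
first: assigns ranks in the order the values appear in the original data
Ranking a Series: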

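Input:
    from pandas import Series

    obj = Series([7, -5, 7, 4, 2, 0, 4])
    print(obj.rank())
Output:
    0    6.5
    1    1.0
    2    6.5
    3    4.5
    4    3.0
    5    2.0
    6    4.5
    dtype: float64
    # the right column is the rank; the two 4s would take ranks 4 and 5, so with average each gets 4.5

Ranking is ascending by default (ascending=True). To rank in descending order and give tied values the highest rank of their group, use obj.rank(ascending=False, method='max').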
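To compare the tie-breaking options side by side, here is a minimal sketch that ranks the same Series with each method. The dictionary name ranks and the variable desc_max are made up for illustration.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One ranking per tie-breaking method, all ascending.
ranks = {m: obj.rank(method=m).tolist() for m in ["average", "min", "max", "first"]}
for method, values in ranks.items():
    print(method, values)

# Descending ranking where tied values share the highest rank of their group.
desc_max = obj.rank(ascending=False, method="max").tolist()
print(desc_max)
```

Only the tied values (the two 7s and the two 4s) change between methods; every other value keeps the same rank.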
Ranking a DataFrame:

Input:
    from pandas import DataFrame

    frame = DataFrame({'a': [0, 1, 0, 1], 'b': [4.3, 7, -3, 2],
                       'c': [-2, 5, 8, -2.5]})
    print(frame.rank(axis=1))
Output:
         a    b    c
    0  2.0  3.0  1.0
    1  1.0  3.0  2.0
    2  2.0  1.0  3.0
    3  2.0  3.0  1.0
    # axis=1 ranks across the columns within each row
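For contrast, the following sketch ranks the same frame in both directions. The variable names down_columns and across_rows are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"a": [0, 1, 0, 1], "b": [4.3, 7, -3, 2],
                      "c": [-2, 5, 8, -2.5]})

down_columns = frame.rank()        # default axis=0: rank values within each column
across_rows = frame.rank(axis=1)   # axis=1: rank values within each row
print(down_columns)
print(across_rows)
```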

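5. value_counts: counting how often each value occurs in a Series or DataFrame
For a Series:

Input:
    import pandas as pd
    from pandas import Series

    obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
    print(obj.value_counts())
    print(pd.value_counts(obj.values, sort=False))
Output:
    c    3
    a    3
    b    2
    d    1
    # the right column is the count; sorted by count in descending order by default; c comes before a because of the original Series
    a    3
    c    3
    b    2
    d    1
    # sort=False: no sorting, only counting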

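At first I misread the output of pd.value_counts(obj.values, sort=False) as being sorted in some way. It is not: with sort=False the function does no sorting and only counts, so the order of the result should not be relied on. (Recent pandas releases deprecate the top-level pd.value_counts in favor of the Series.value_counts method.) The usage of pandas.Series.value_counts is documented at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html.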
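Since the order of an unsorted result is not something to depend on, one safe habit is to look counts up by label. The following is a minimal sketch using only the Series method; the variable names are made up for illustration.

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

counts = obj.value_counts()                  # sorted by count, descending
unsorted = obj.value_counts(sort=False)      # same counts, order not guaranteed
shares = obj.value_counts(normalize=True)    # relative frequencies instead of counts

# Look values up by label rather than by position.
print(counts["c"], unsorted["d"], shares["b"])
```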
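For a DataFrame:

Input:
    import pandas as pd
    from pandas import DataFrame

    data = DataFrame({'Qu1': [1, 3, 4, 3, 4],
                      'Qu2': [2, 3, 1, 2, 3],
                      'Qu3': [1, 5, 2, 4, 4]})
    result = data.apply(pd.value_counts).fillna(0)    # values missing from a column get count 0
    print(result)
Output:
       Qu1  Qu2  Qu3
    1  1.0  1.0  1.0
    2  0.0  2.0  1.0
    3  2.0  2.0  0.0
    4  2.0  0.0  2.0
    5  0.0  0.0  1.0
    # the left column lists the values that occur; the table holds their counts per column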
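The same per-column count can be written with the Series method inside a lambda, which avoids the top-level function. This is a sketch under that assumption; converting the counts back to integers with astype(int) is an optional extra step not in the original.

```python
import pandas as pd

data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 2, 4, 4]})

# Count values column by column; values absent from a column become 0.
result = data.apply(lambda col: col.value_counts()).fillna(0).astype(int)
print(result)
```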

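6. Integer indexes
When a Series or DataFrame has an integer index, indexing needs extra care. For example:

Input:
    import numpy as np
    from pandas import Series

    ser = Series(np.arange(3))
    print(ser[-1])    # raises KeyError

This raises an error. ser has the default integer index 0, 1, 2, so pandas treats -1 as a label and looks it up in the index; since -1 is not there, the lookup fails. With a non-integer index there is no such ambiguity, for example:

Input:
    import numpy as np
    from pandas import Series

    ser2 = Series(np.arange(3.), index=['a', 'b', 'c'])
    print(ser2[-1])    # falls back to position; newer pandas releases deprecate this fallback
Output:
    2.0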

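For a Series, the fix is to use iloc[i] (the iget_value() method from the book has been removed). iloc[i] provides reliable, position-based indexing regardless of the index type.

Input:
    import numpy as np
    from pandas import Series

    ser = Series(np.arange(3))
    print(ser.iloc[-1])
Output:
    2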
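The failure and the fix can be seen together in a minimal sketch like this one. The flag name raised_key_error is made up for illustration.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3))   # default integer index 0, 1, 2

# Plain [] treats -1 as a label, and no such label exists.
try:
    ser[-1]
    raised_key_error = False
except KeyError:
    raised_key_error = True

last = ser.iloc[-1]             # position-based: always the last element
print(raised_key_error, last)
```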

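For a DataFrame, iloc[i] selects a row by position and iloc[:, i] selects a column by position, as in this example:

Input:
    import numpy as np
    from pandas import DataFrame

    frame = DataFrame(np.arange(6).reshape(3, 2), index=[2, 0, 1], columns=[2, 4])
    print(frame)
    print(frame.iloc[1])
    print(frame.iloc[:, 1])
Output (the integer dtype is platform-dependent):
       2  4
    2  0  1
    0  2  3
    1  4  5
    # the original frame
    2    2
    4    3
    Name: 0, dtype: int32
    # the second row of frame: 2, 3
    2    1
    0    3
    1    5
    Name: 4, dtype: int32
    # the second column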

As this shows, .iloc[i] ignores the labels defined on a Series or DataFrame and looks only at the default integer position.
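The difference between position and label is easiest to see when both are integers. The following sketch contrasts iloc with the label-based loc on the same frame; the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape(3, 2), index=[2, 0, 1], columns=[2, 4])

by_position = frame.iloc[1]      # second row, whatever its label (here the label is 0)
by_label = frame.loc[1]          # the row whose label is 1 (here the third row)
col_by_position = frame.iloc[:, 1]   # second column
col_by_label = frame.loc[:, 4]       # the column labeled 4 (the same column here)
print(by_position.tolist(), by_label.tolist())
```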
