Learning pandas in Python

Source: Internet. Editor: 程序博客网. Time: 2024/05/22 17:01

This tutorial uses Python's pandas module for data analysis.

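The rest of this introduction to pandas covers the following eight topics:
1. Overview of the data structures: DataFrame and Series
2. The data index
3. Querying data with pandas
4. Statistical analysis with pandas DataFrames
5. SQL-style operations with pandas
6. Handling missing values with pandas
7. Excel-style pivot tables with pandas
8. Using multi-level indexes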
I. Introduction to the data structures
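pandas has two particularly important data structures: the Series and the DataFrame. A Series resembles a one-dimensional NumPy array: it works with the functions and methods available for one-dimensional arrays, its data can also be retrieved by index label, and its indexes are aligned automatically in operations. A DataFrame resembles a two-dimensional NumPy array: it likewise works with NumPy array functions and methods, and it offers further flexible features that are introduced later.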
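To make label-based access and automatic index alignment concrete, here is a minimal sketch. The names `s_a`, `s_b`, `value_b`, and `total` and the data are made up for illustration and do not come from the original tutorial.

```python
import pandas as pd

# Two Series with partially overlapping index labels (made-up data).
s_a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s_b = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Access a value by its index label rather than by position.
value_b = s_a['b']

# Arithmetic aligns the two Series on their labels; a label that
# exists in only one of them produces NaN in the result.
total = s_a + s_b

print(value_b)
print(total)
```

Only the labels `b` and `c` exist in both Series, so only those two positions hold a numeric sum. The other two are NaN.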
1. Creating a Series

There are three main ways to create a Series:

1) From a one-dimensional array

import numpy as np
import pandas as pd
arr1 = np.arange(10)
print(arr1)
print(type(arr1))
s1 = pd.Series(arr1)
print(s1)
print(type(s1))

Output:

[0 1 2 3 4 5 6 7 8 9]
<class 'numpy.ndarray'>
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32
<class 'pandas.core.series.Series'>

2) From a dictionary

dic1 = {'a':10,'b':20,'c':30,'d':40,'e':50}
print(dic1)
print(type(dic1))
s2 = pd.Series(dic1)
print(s2)
print(type(s2))

Output:

{'c': 30, 'b': 20, 'e': 50, 'a': 10, 'd': 40}
<class 'dict'>
a    10
b    20
c    30
d    40
e    50
dtype: int64
<class 'pandas.core.series.Series'>
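As a small extension of the dictionary approach, the sketch below passes an explicit `index` argument to `pd.Series`. The dictionary `prices` and the variable names are hypothetical and not part of the original example.

```python
import pandas as pd

prices = {'a': 10, 'b': 20, 'c': 30}  # hypothetical data

# Without index=, the dictionary keys become the index labels.
s_default = pd.Series(prices)

# With index=, only the listed labels are kept, in the listed order;
# a label missing from the dictionary ('z' here) becomes NaN.
s_picked = pd.Series(prices, index=['c', 'a', 'z'])

print(s_default)
print(s_picked)
```

Because of the NaN entry, `s_picked` is stored as floating point rather than integer.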

3) From a row or a column of a DataFrame

This is covered later, because creating a DataFrame is the next topic.
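For readers who want a preview, the following sketch shows one way to pull a Series out of a DataFrame, assuming a small made-up frame `df`. Column selection with `df['one']` also appears later in this tutorial. Row selection with `.loc` is added here as an illustration.

```python
import pandas as pd

# A small made-up DataFrame with row labels 'a', 'b', 'c'.
df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]},
                  index=['a', 'b', 'c'])

col = df['one']    # one column -> Series indexed by the row labels
row = df.loc['b']  # one row    -> Series indexed by the column labels

print(col)
print(row)
```

In both cases the resulting Series takes its name from the column or row label it was selected by.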
2. Creating a DataFrame
There are three main ways to create a DataFrame:
1) From a two-dimensional array

import numpy as np
import pandas as pd
arr2 = np.arange(12).reshape(4,3)
print(arr2)
print(type(arr2))
df1 = pd.DataFrame(arr2)
print(df1)
print(type(df1))

Output:

[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
<class 'numpy.ndarray'>
   0   1   2
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
<class 'pandas.core.frame.DataFrame'>

2) From dictionaries

The following creates DataFrames from two kinds of dictionary: a dictionary of lists and a nested dictionary.

dic2 = {'a':[1,2,3,4],'b':[5,6,7,8],'c':[9,10,11,12],'d':[13,14,15,16]}
print(dic2)
print(type(dic2))
df2 = pd.DataFrame(dic2)
print(df2)
print(type(df2))
dic3 = {'one':{'a':1,'b':2,'c':3,'d':4},'two':{'a':5,'b':6,'c':7,'d':8},'three':{'a':9,'b':10,'c':11,'d':12}}
print(dic3)
print(type(dic3))
df3 = pd.DataFrame(dic3)
print(df3)
print(type(df3))

Output:

{'d': [13, 14, 15, 16], 'b': [5, 6, 7, 8], 'a': [1, 2, 3, 4], 'c': [9, 10, 11, 12]}
<class 'dict'>
   a  b   c   d
0  1  5   9  13
1  2  6  10  14
2  3  7  11  15
3  4  8  12  16
<class 'pandas.core.frame.DataFrame'>
{'two': {'d': 8, 'b': 6, 'a': 5, 'c': 7}, 'one': {'d': 4, 'b': 2, 'a': 1, 'c': 3}, 'three': {'d': 12, 'b': 10, 'a': 9, 'c': 11}}
<class 'dict'>
   one  three  two
a    1      9    5
b    2     10    6
c    3     11    7
d    4     12    8
<class 'pandas.core.frame.DataFrame'>
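In the nested-dictionary case, the outer keys become column names and the inner keys become row labels. The sketch below uses a hypothetical dictionary `nested` whose inner dictionaries do not share all keys, to suggest how pandas lines the rows up by label.

```python
import pandas as pd

# Hypothetical nested dictionary: the inner dictionaries share only key 'b'.
nested = {'one': {'a': 1, 'b': 2}, 'two': {'b': 3, 'c': 4}}

df = pd.DataFrame(nested)

# Rows are the union of the inner keys; a cell with no value is NaN.
print(df)
```

The frame has three rows (`a`, `b`, `c`), and the two cells without a source value are filled with NaN.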

3) From an existing DataFrame

dic3 = {'one':{'a':1,'b':2,'c':3,'d':4},'two':{'a':5,'b':6,'c':7,'d':8},'three':{'a':9,'b':10,'c':11,'d':12}}
print(dic3)
print(type(dic3))
df3 = pd.DataFrame(dic3)
# print(df3)
# print(type(df3))
df4 = df3[['one','three']]
print(df4)
print(type(df4))
s3 = df3['one']
print(s3)
print(type(s3))

Output:

{'one': {'d': 4, 'b': 2, 'a': 1, 'c': 3}, 'three': {'d': 12, 'b': 10, 'a': 9, 'c': 11}, 'two': {'d': 8, 'b': 6, 'a': 5, 'c': 7}}
<class 'dict'>
   one  three
a    1      9
b    2     10
c    3     11
d    4     12
<class 'pandas.core.frame.DataFrame'>
a    1
b    2
c    3
d    4
Name: one, dtype: int64
<class 'pandas.core.series.Series'>

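The pandas module provides many functions for descriptive statistics, such as the sum, mean, minimum, and maximum. Let us look at these functions in detail.
First, randomly generate three sets of data: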

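import numpy as np
import pandas as pd
np.random.seed(1234)
d1 = pd.Series(2*np.random.normal(size = 10)+3)
print(d1)
d2 = np.random.f(2,4,size = 10)
print(d2)
d3 = np.random.randint(1,100,size = 10)
print(d3)
print(d1.count()) # number of non-null elements
print(d1.min()) # minimum
print(d1.max()) # maximum
print(d1.idxmin()) # index label of the minimum, similar to which.min in R
print(d1.idxmax()) # index label of the maximum, similar to which.max in R
print(d1.quantile(0.1)) # 10% quantile
print(d1.sum()) # sum
print(d1.mean()) # mean
print(d1.median()) # median
print(d1.mode()) # mode
print(d1.var()) # variance
print(d1.std()) # standard deviation
print((d1 - d1.mean()).abs().mean()) # mean absolute deviation; the original d1.mad() was removed in pandas 2.0
print(d1.skew()) # skewness
print(d1.kurt()) # kurtosis
print(d1.describe()) # several descriptive statistics at once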

Output:

0    3.942870
1    0.618049
2    5.865414
3    2.374696
4    1.558823
5    4.774326
6    4.719177
7    1.726953
8    3.031393
9   -1.485370
dtype: float64
[ 2.95903083  0.32784914  2.27321231  0.05147861  9.10291941  0.15691116  0.99021894  1.84169938  0.32196418  0.04276792]
[57 71 57 87 45 91 84 48 50 19]
10
-1.48536990837
5.86541393685
9
2
0.4077067586912829
27.1263301511
2.71263301511
2.703044476022716
Series([], dtype: float64)
4.91771912101
2.21759309185
1.75400292821
-0.453758801844
-0.116260760058
count    10.000000
mean      2.712633
std       2.217593
min      -1.485370
25%       1.600855
50%       2.703044
75%       4.525100
max       5.865414
dtype: float64
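The same statistics can also be computed for several columns at once. The following is a minimal sketch, assuming the three samples are placed in one DataFrame. The column names `x1`, `x2`, `x3` and the variables `summary` and `mean_abs_dev` are chosen here for illustration and are not from the original text.

```python
import numpy as np
import pandas as pd

np.random.seed(1234)

# Put the three random samples into one DataFrame, one column each.
df = pd.DataFrame({
    'x1': 2 * np.random.normal(size=10) + 3,
    'x2': np.random.f(2, 4, size=10),
    'x3': np.random.randint(1, 100, size=10),
})

# describe() returns count, mean, std, min, quartiles and max per column.
summary = df.describe()

# Mean absolute deviation per column, computed by hand so that it does
# not rely on the mad() method, which newer pandas versions no longer have.
mean_abs_dev = (df - df.mean()).abs().mean()

print(summary)
print(mean_abs_dev)
```

Most reduction methods shown for a Series, such as `mean`, `std`, and `quantile`, can be called on a DataFrame in the same way and return one value per column.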

0 0