Python Data Analysis (Getting Started with pandas)
Source: Internet  Editor: 程序博客网  Time: 2024/06/16 05:12
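1. The pandas DataFrame data structure
A DataFrame can be created in several ways: (1) from another DataFrame; (2) from a two-dimensional NumPy array or a composite structure of arrays; (3) from Series objects; (4) from a file such as a CSV. The rest of this section covers basic DataFrame usage.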
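The four creation routes listed above can be sketched as follows. This is a minimal illustration in Python 3 syntax; the sample values are hypothetical (not taken from WHO.csv), and io.StringIO stands in for a CSV file on disk.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from WHO.csv
base = pd.DataFrame({"Country": ["A", "B"], "CountryID": [1, 2]})

# 1. From another DataFrame (copy=True gives an independent copy)
from_frame = pd.DataFrame(base, copy=True)

# 2. From a two-dimensional NumPy array, naming the columns explicitly
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

# 3. From Series objects; the dict keys become the column names
from_series = pd.DataFrame({"x": pd.Series([1, 2, 3]),
                            "y": pd.Series([4.0, 5.0, 6.0])})

# 4. From CSV text; io.StringIO stands in for a file on disk
from_csv = pd.read_csv(io.StringIO("Country,CountryID\nA,1\nB,2\n"))

print(from_frame.shape, from_array.shape, from_series.shape, from_csv.shape)
```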
a) Reading a file
Code:
from pandas.io.parsers import read_csv

df = read_csv(r"H:\Python\data\WHO.csv")  # raw string keeps the backslashes literal
print "DataFrame:", df
Output (partial):
DataFrame:        Country  CountryID  Continent  \
0  Afghanistan          1          1
1      Albania          2          2
2      Algeria          3          3
3      Andorra          4          2
4       Angola          5          3
Code:
print "Shape:", df.shape    # dimensions as (rows, columns)
print "Length:", len(df)    # number of rows
Output:
Shape: (202, 358)
Length: 202
Code:
print "Column Headers", df.columns  # the label of every column
print "Data type", df.dtypes        # the data type of every column
Output (partial):
Column Headers Index([u'Country', u'CountryID', u'Continent',
       u'Adolescent fertility rate (%)', u'Adult literacy rate (%)',
       u'Gross national income per capita (PPP international $)',
       u'Net primary school enrolment ratio female (%)',
       u'Net primary school enrolment ratio male (%)',
       u'Population (in thousands) total', u'Population annual growth rate (%)',
       ...
       u'Total_CO2_emissions', u'Total_income', u'Total_reserves',
       u'Trade_balance_goods_and_services', u'Under_five_mortality_from_CME',
       u'Under_five_mortality_from_IHME', u'Under_five_mortality_rate',
       u'Urban_population', u'Urban_population_growth',
       u'Urban_population_pct_of_total'],
      dtype='object', length=358)
Data type Country                                          object
CountryID                                                   int64
Continent                                                   int64
Adolescent fertility rate (%)                             float64
Adult literacy rate (%)                                   float64
Gross national income per capita (PPP international $)    float64
Net primary school enrolment ratio female (%)             float64
Net primary school enrolment ratio male (%)               float64
Code:
print "Index:", df.index
Output:
Index: RangeIndex(start=0, stop=202, step=1)
Code:
print "Values:", df.values  # the underlying data as a NumPy array
Output:
Values: [['Afghanistan' 1L 1L ..., 5740436.0 5.44 22.9]
 ['Albania' 2L 2L ..., 1431793.9 2.21 45.4]
 ['Algeria' 3L 3L ..., 20800000.0 2.61 63.3]
 ...,
 ['Yemen' 200L 1L ..., 5759120.5 4.37 27.3]
 ['Zambia' 201L 3L ..., 4017411.0 1.95 35.0]
 ['Zimbabwe' 202L 3L ..., 4709965.0 1.9 35.9]]
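2. The pandas Series data structure
A pandas Series is a one-dimensional array that can hold elements of different types, and it also carries labels. A Series can be created from a Python dict, from a NumPy array, or from a single scalar value.
a) Type. Selecting a single column of a DataFrame yields a Series.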
Code:
country_df = df["Country"]
print "Type df:", type(df)
print "Type country_df:", type(country_df)
Output:
Type df: <class 'pandas.core.frame.DataFrame'>
Type country_df: <class 'pandas.core.series.Series'>
Code:
print "Series Shape:", country_df.shape    # shape of the column
print "Series index:", country_df.index    # index of the column
print "Series values:", country_df.values  # all values in the column
print "Series name:", country_df.name      # name (header) of the column
Output:
Series Shape: (202L,)
Series index: RangeIndex(start=0, stop=202, step=1)
Series values: ['Afghanistan' 'Albania' 'Algeria' 'Andorra' 'Angola'
 'Antigua and Barbuda' 'Argentina' 'Armenia' 'Australia' 'Austria'
 'Azerbaijan' 'Bahamas' 'Bahrain' 'Bangladesh' 'Barbados' 'Belarus'
 'Belgium' 'Belize' 'Benin' 'Bermuda' 'Bhutan' 'Bolivia'
 'Bosnia and Herzegovina' 'Botswana' 'Brazil' 'Brunei Darussalam'
 'Bulgaria' 'Burkina Faso' 'Burundi' 'Cambodia' 'Cameroon' 'Canada'
 'Cape Verde' 'Central African Republic' 'Chad' 'Chile' 'China' 'Colombia'
 'Comoros' 'Congo, Dem. Rep.' 'Congo, Rep.' 'Cook Islands' 'Costa Rica'
 "Cote d'Ivoire" 'Croatia' 'Cuba' 'Cyprus' 'Czech Republic' 'Denmark'
 'Djibouti' 'Dominica' 'Dominican Republic' 'Ecuador' 'Egypt'
 'El Salvador' 'Equatorial Guinea' 'Eritrea' 'Estonia' 'Ethiopia' 'Fiji'
 'Finland' 'France' 'French Polynesia' 'Gabon' 'Gambia' 'Georgia'
 'Germany' 'Ghana' 'Greece' 'Grenada' 'Guatemala' 'Guinea' 'Guinea-Bissau'
 'Guyana' 'Haiti' 'Honduras' 'Hong Kong, China' 'Hungary' 'Iceland'
 'India' 'Indonesia' 'Iran (Islamic Republic of)' 'Iraq' 'Ireland'
 'Israel' 'Italy' 'Jamaica' 'Japan' 'Jordan' 'Kazakhstan' 'Kenya'
 'Kiribati' 'Korea, Dem. Rep.' 'Korea, Rep.' 'Kuwait' 'Kyrgyzstan'
 "Lao People's Democratic Republic" 'Latvia' 'Lebanon' 'Lesotho' 'Liberia'
 'Libyan Arab Jamahiriya' 'Lithuania' 'Luxembourg' 'Macao, China'
 'Macedonia' 'Madagascar' 'Malawi' 'Malaysia' 'Maldives' 'Mali' 'Malta'
 'Marshall Islands' 'Mauritania' 'Mauritius' 'Mexico'
 'Micronesia (Federated States of)' 'Moldova' 'Monaco' 'Mongolia'
 'Montenegro' 'Morocco' 'Mozambique' 'Myanmar' 'Namibia' 'Nauru' 'Nepal'
 'Netherlands' 'Netherlands Antilles' 'New Caledonia' 'New Zealand'
 'Nicaragua' 'Niger' 'Nigeria' 'Niue' 'Norway' 'Oman' 'Pakistan' 'Palau'
 'Panama' 'Papua New Guinea' 'Paraguay' 'Peru' 'Philippines' 'Poland'
 'Portugal' 'Puerto Rico' 'Qatar' 'Romania' 'Russia' 'Rwanda'
 'Saint Kitts and Nevis' 'Saint Lucia' 'Saint Vincent and the Grenadines'
 'Samoa' 'San Marino' 'Sao Tome and Principe' 'Saudi Arabia' 'Senegal'
 'Serbia' 'Seychelles' 'Sierra Leone' 'Singapore' 'Slovakia' 'Slovenia'
 'Solomon Islands' 'Somalia' 'South Africa' 'Spain' 'Sri Lanka' 'Sudan'
 'Suriname' 'Swaziland' 'Sweden' 'Switzerland' 'Syria' 'Taiwan'
 'Tajikistan' 'Tanzania' 'Thailand' 'Timor-Leste' 'Togo' 'Tonga'
 'Trinidad and Tobago' 'Tunisia' 'Turkey' 'Turkmenistan' 'Tuvalu' 'Uganda'
 'Ukraine' 'United Arab Emirates' 'United Kingdom'
 'United States of America' 'Uruguay' 'Uzbekistan' 'Vanuatu' 'Venezuela'
 'Vietnam' 'West Bank and Gaza' 'Yemen' 'Zambia' 'Zimbabwe']
Series name: Country
Code:
print "Last 2 countries:", country_df[-2:]             # slicing a Series returns a Series
print "Last 2 countries type:", type(country_df[-2:])
Output:
Last 2 countries: 200      Zambia
201    Zimbabwe
Name: Country, dtype: object
Last 2 countries type: <class 'pandas.core.series.Series'>
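Section 2 names three ways to build a Series without showing them. The following is a minimal sketch in Python 3 syntax; the numeric values and labels are hypothetical and only the two country names come from the output above.

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the index labels
from_dict = pd.Series({"Zambia": 201, "Zimbabwe": 202})

# From a NumPy array, with explicit labels
from_array = pd.Series(np.array([1.5, 2.5, 3.5]), index=["a", "b", "c"])

# From a scalar: the value is repeated once per index label
from_scalar = pd.Series(7, index=["x", "y", "z"])

# Selecting one column of a DataFrame also yields a Series
frame = pd.DataFrame({"Country": ["Zambia", "Zimbabwe"], "CountryID": [201, 202]})
column = frame["Country"]

print(type(column).__name__, column.name, column.shape)
print(from_dict["Zimbabwe"], from_array["b"], from_scalar.tolist())
```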
3. Querying data with pandas
a) The head() and tail() functions:
Code:
sunspots = read_csv(r"H:\Python\data\sunspots.csv")
print "Head 2:", sunspots.head(2)  # first two rows
print "Tail 2:", sunspots.tail(2)  # last two rows
Output:
Head 2:          Date  Yearly Mean Total Sunspot Number
0  2016/12/31                              39.8
1  2015/12/31                              69.8
Tail 2:            Date  Yearly Mean Total Sunspot Number
316  1701-12-31                              18.3
317  1700-12-31                               8.3
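Code:
last_date = sunspots.index[-1]                  # label of the last row
print "Last value:\n", sunspots.loc[last_date]  # label-based lookup of that row
This prints the last row of the table, which according to the tail() output above is index 317 (Date 1700-12-31, value 8.3).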
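The lookup above can be tried without the CSV file. This sketch, in Python 3 syntax, types in three of the rows shown by head() and tail(), so its index runs from 0 to 2 rather than 0 to 317. The comparison with iloc is an addition for illustration.

```python
import pandas as pd

# Three rows copied from the head()/tail() output above
sunspots = pd.DataFrame({
    "Date": ["2016/12/31", "1701-12-31", "1700-12-31"],
    "Yearly Mean Total Sunspot Number": [39.8, 18.3, 8.3],
})

first_row = sunspots.head(1)          # DataFrame holding the first row
last_label = sunspots.index[-1]       # label of the final row
last_row = sunspots.loc[last_label]   # label-based lookup returns a Series
same_row = sunspots.iloc[-1]          # position-based lookup of the same row

print(last_row)
```

With the default RangeIndex the label and the position of a row coincide, so loc and iloc agree here; they can differ once the index is something else, such as dates.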
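4. Statistical computation with pandas DataFrames
The pandas DataFrame provides a number of statistical functions. Some of these methods are listed below with a brief description.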
Method     Description
describe   Returns descriptive statistics
count      Returns the number of non-NaN items
mad        Computes the mean absolute deviation, a robust statistic similar to the standard deviation
median     Returns the median, equivalent to the 50th percentile
min        Returns the minimum
max        Returns the maximum
mode       Returns the mode, i.e. the value that occurs most often
std        Returns the standard deviation, a measure of dispersion equal to the square root of the variance
var        Returns the variance
skew       Returns the skewness, which indicates how symmetric the distribution is
kurt       Returns the kurtosis, which indicates how peaked or flat the top of the distribution is
Code:
print "Describe:\n", sunspots.describe()
print "Non NaN observations:\n", sunspots.count()
print "MAD:\n", sunspots.mad()
print "Median:\n", sunspots.median()
print "Min:\n", sunspots.min()
print "Max:\n", sunspots.max()
print "Mode:\n", sunspots.mode()
print "Standard Deviation:\n", sunspots.std()
print "Variance:\n", sunspots.var()
print "Skewness:\n", sunspots.skew()
print "Kurtosis:\n", sunspots.kurt()
Output:
Describe:
       Yearly Mean Total Sunspot Number
count                        318.000000
mean                          79.193396
std                           61.988788
min                            0.000000
25%                           24.950000
50%                           66.250000
75%                          116.025000
max                          269.300000
Non NaN observations:
Date                                318
Yearly Mean Total Sunspot Number    318
dtype: int64
MAD:
Yearly Mean Total Sunspot Number    50.925104
dtype: float64
Median:
Yearly Mean Total Sunspot Number    66.25
dtype: float64
Min:
Date                                1700-12-31
Yearly Mean Total Sunspot Number             0
dtype: object
Max:
Date                                2016/12/31
Yearly Mean Total Sunspot Number         269.3
dtype: object
Mode:
         Date  Yearly Mean Total Sunspot Number
0  1985/12/31                              18.3
Standard Deviation:
Yearly Mean Total Sunspot Number    61.988788
dtype: float64
Variance:
Yearly Mean Total Sunspot Number    3842.60983
dtype: float64
Skewness:
Yearly Mean Total Sunspot Number    0.808551
dtype: float64
Kurtosis:
Yearly Mean Total Sunspot Number   -0.130045
dtype: float64
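To see what a few of these statistics compute, here is a small sketch in Python 3 syntax on hypothetical values chosen so the results can be checked by hand. Recent pandas releases no longer provide mad(), so the mean absolute deviation is computed explicitly here; that substitution is this sketch's choice, not part of the original listing.

```python
import pandas as pd

# Hypothetical values chosen so the results are easy to verify by hand
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()              # arithmetic mean
median = values.median()          # 50th percentile
mode = values.mode().tolist()     # most frequent value(s)
variance = values.var()           # sample variance (divides by n - 1)
std = values.std()                # square root of the sample variance

# Mean absolute deviation: average distance of each value from the mean
mad = (values - mean).abs().mean()

print(mean, median, mode, variance, std, mad)
```

Note that var() and std() use the sample definition (ddof=1) by default, which is why the variance here is 32/7 rather than 32/8.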
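5. Aggregating data with pandas DataFrames
a) Seed NumPy's random number generator so that repeated runs of the program produce the same data. The data has four columns:
1. Weather (a string);
2. Food (a string);
3. Price (a random float);
4. Number (a random integer; randint(1, 9) excludes its upper bound, so the values range from 1 to 8).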
Code:
import pandas as pd
from numpy.random import seed
from numpy.random import rand
from numpy.random import randint
import numpy as np

seed(42)  # fix the seed so repeated runs produce the same data
# rand(n) returns n random floats in [0, 1)
# randint(low, high, size) returns integers from the half-open interval [low, high),
# unlike random_integers(low, high), which samples the closed interval [low, high]
df = pd.DataFrame({'Weather': ['cold', 'hot', 'cold', 'hot', 'cold', 'hot', 'cold'],
                   'Food': ['soup', 'soup', 'icecream', 'chocolate', 'icecream', 'icecream', 'soup'],
                   'Price': 10 * rand(7),
                   'Number': randint(1, 9, size=(7,))})
print df
Output:
        Food  Number     Price Weather
0       soup       8  3.745401    cold
1       soup       5  9.507143     hot
2   icecream       4  7.319939    cold
3  chocolate       8  5.986585     hot
4   icecream       8  1.560186    cold
5   icecream       3  1.559945     hot
6       soup       6  0.580836    cold
b) Group the data by the Weather column, then iterate over the groups.
Code:
weather_group = df.groupby('Weather')  # group rows by weather
i = 0
for name, group in weather_group:
    i = i + 1
    print "Group ", i, name
    print group
Output:
Group  1 cold
       Food  Number     Price Weather
0      soup       8  3.745401    cold
2  icecream       4  7.319939    cold
4  icecream       8  1.560186    cold
6      soup       6  0.580836    cold
Group  2 hot
        Food  Number     Price Weather
1       soup       5  9.507143     hot
3  chocolate       8  5.986585     hot
5   icecream       3  1.559945     hot
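The same iteration can be reproduced without random numbers. This sketch, in Python 3 syntax, types in the seven rows from the output above; the group_sizes dictionary is a hypothetical helper added to make the result easy to inspect.

```python
import pandas as pd

# Same rows as the example output above, typed in rather than generated randomly
df = pd.DataFrame({
    "Weather": ["cold", "hot", "cold", "hot", "cold", "hot", "cold"],
    "Food": ["soup", "soup", "icecream", "chocolate", "icecream", "icecream", "soup"],
    "Price": [3.745401, 9.507143, 7.319939, 5.986585, 1.560186, 1.559945, 0.580836],
    "Number": [8, 5, 4, 8, 8, 3, 6],
})

group_sizes = {}
for name, group in df.groupby("Weather"):
    # name is the group key; group is a DataFrame holding that key's rows
    group_sizes[name] = len(group)
    print(name, group.index.tolist())
```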
c) The variable weather_group is a special pandas object produced by groupby(). It provides aggregation functions, used as follows:
Code:
print "Weather group first:\n", weather_group.first()  # first row of each group
print "Weather group last:\n", weather_group.last()    # last row of each group
print "Weather group mean:\n", weather_group.mean()    # mean of each group
Output:
Weather group first:
         Food  Number     Price
Weather
cold     soup       8  3.745401
hot      soup       5  9.507143
Weather group last:
             Food  Number     Price
Weather
cold         soup       6  0.580836
hot      icecream       3  1.559945
Weather group mean:
           Number     Price
Weather
cold     6.500000  3.301591
hot      5.333333  5.684558
d) Just as in a database query, data can also be grouped by several columns.
The groups attribute then shows which groups were created and which rows each group contains:
Code:
wf_group = df.groupby(['Weather', 'Food'])
print "WF Group:\n", wf_group.groups
Output:
WF Group:
{('hot', 'chocolate'): Int64Index([3], dtype='int64'),
 ('cold', 'icecream'): Int64Index([2, 4], dtype='int64'),
 ('cold', 'soup'): Int64Index([0, 6], dtype='int64'),
 ('hot', 'soup'): Int64Index([1], dtype='int64'),
 ('hot', 'icecream'): Int64Index([5], dtype='int64')}
e) The agg method applies a list of NumPy functions to the groups:
Code:
print "WF Aggregated:\n", wf_group.agg([np.mean, np.median])
Output:
WF Aggregated:
                  Number            Price
                    mean median      mean    median
Weather Food
cold    icecream       6      6  4.440063  4.440063
        soup           7      7  2.163119  2.163119
hot     chocolate      8      8  5.986585  5.986585
        icecream       3      3  1.559945  1.559945
        soup           5      5  9.507143  9.507143
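A sketch of the same two-key aggregation in Python 3 syntax follows. It makes two choices that differ from the listing above: the aggregations are named by the strings "mean" and "median" instead of passing NumPy functions, and the numeric columns are selected explicitly before aggregating. The rows are typed in from the earlier output.

```python
import pandas as pd

# Rows typed in from the example output, so no random seed is needed
df = pd.DataFrame({
    "Weather": ["cold", "hot", "cold", "hot", "cold", "hot", "cold"],
    "Food": ["soup", "soup", "icecream", "chocolate", "icecream", "icecream", "soup"],
    "Price": [3.745401, 9.507143, 7.319939, 5.986585, 1.560186, 1.559945, 0.580836],
    "Number": [8, 5, 4, 8, 8, 3, 6],
})

wf_group = df.groupby(["Weather", "Food"])

# One row per (Weather, Food) pair, one column per (column, statistic) pair
summary = wf_group[["Number", "Price"]].agg(["mean", "median"])

# Individual cells are addressed by a row tuple and a column tuple
cold_soup_mean = summary.loc[("cold", "soup"), ("Number", "mean")]
print(summary)
```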
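6. Concatenating and appending DataFrames
a) Database tables support two kinds of join: inner and outer. pandas DataFrames offer similar operations, and they can also be concatenated and appended.
The concat() function concatenates DataFrames. For example, concatenating the first three rows with the remaining rows, pd.concat([df[:3], df[3:]]), rebuilds the original DataFrame. The code below instead concatenates df[:3] with itself, so the same three rows appear twice:
Code:
print "df:3\n", df[:3]
print "Concat back together:\n", pd.concat([df[:3], df[:3]])
Output:
df:3
       Food  Number     Price Weather
0      soup       8  3.745401    cold
1      soup       5  9.507143     hot
2  icecream       4  7.319939    cold
Concat back together:
       Food  Number     Price Weather
0      soup       8  3.745401    cold
1      soup       5  9.507143     hot
2  icecream       4  7.319939    cold
0      soup       8  3.745401    cold
1      soup       5  9.507143     hot
2  icecream       4  7.319939    cold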
Code:
print "Appending rows:\n", df[3:].append(df[5:])  # append the rows from index 5 on to the rows from index 3 on
Output:
Appending rows:
        Food  Number     Price Weather
3  chocolate       8  5.986585     hot
4   icecream       8  1.560186    cold
5   icecream       3  1.559945     hot
6       soup       6  0.580836    cold
5   icecream       3  1.559945     hot
6       soup       6  0.580836    cold
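The following sketch, in Python 3 syntax, contrasts the cases discussed in this section on a shortened, hand-typed version of the table. It uses pd.concat throughout, since recent pandas releases no longer provide DataFrame.append; the ignore_index variant is an addition for illustration.

```python
import pandas as pd

# A shortened version of the example table
df = pd.DataFrame({"Food": ["soup", "soup", "icecream", "chocolate"],
                   "Number": [8, 5, 4, 8]})

# Splitting at row 3 and concatenating the pieces rebuilds the original frame
rebuilt = pd.concat([df[:3], df[3:]])

# Concatenating a slice with itself keeps the duplicated labels 0, 1, 2
doubled = pd.concat([df[:3], df[:3]])

# ignore_index=True asks for a fresh 0..n-1 index instead
renumbered = pd.concat([df[:3], df[:3]], ignore_index=True)

# concat also covers the append() use case: rows of one frame after another
appended = pd.concat([df[2:], df[3:]])

print(doubled.index.tolist(), renumbered.index.tolist())
```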
7. Joining DataFrames
a) Create two CSV files: dest.csv and tips.csv
Code:
dests = pd.read_csv(r"H:\Python\data\dest.csv")
tips = pd.read_csv(r"H:\Python\data\tips.csv")
print "dests:\n", dests
print "tips:\n", tips
Output:
dests:
   EmpNr       Dest
0      5  The Hague
1      3  Amsterdam
2      9  Rotterdam
tips:
   EmpNr  Amount
0      5    10.0
1      9     5.0
2      7     2.5
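pandas supports all the usual database join types; only the inner join and the full outer join are covered here.
- Use the merge() function to join on the employee number:
print "Merge() on key:\n", pd.merge(dests, tips, on='EmpNr')
Output:
Merge() on key:
   EmpNr       Dest  Amount
0      5  The Hague    10.0
1      9  Rotterdam     5.0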
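- The join() method can also combine the tables; suffixes are required to tell the columns of the left and right operands apart. Note that join() aligns rows by index rather than by EmpNr, which is why Amsterdam (employee 3) is paired with the tip of employee 9 in the output:
print "Dest join() tips:\n", dests.join(tips, lsuffix='Dest', rsuffix='Tips')
Output:
Dest join() tips:
   EmpNrDest       Dest  EmpNrTips  Amount
0          5  The Hague          5    10.0
1          3  Amsterdam          9     5.0
2          9  Rotterdam          7     2.5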
- A more explicit way to perform inner and outer joins with merge() is to pass the how parameter:
Code:
print "Inner join with merge():\n", pd.merge(dests, tips, how='inner')  # inner join
print "Outer join with merge():\n", pd.merge(dests, tips, how='outer')  # full outer join
Output:
Inner join with merge():
   EmpNr       Dest  Amount
0      5  The Hague    10.0
1      9  Rotterdam     5.0
Outer join with merge():
   EmpNr       Dest  Amount
0      5  The Hague    10.0
1      3  Amsterdam     NaN
2      9  Rotterdam     5.0
3      7        NaN     2.5
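The joins above can be reproduced without the two CSV files. This sketch, in Python 3 syntax, builds both tables from the values shown in the output. The left join and the set_index step before join() are additions for illustration.

```python
import pandas as pd

# The contents of dest.csv and tips.csv as shown in the output above
dests = pd.DataFrame({"EmpNr": [5, 3, 9],
                      "Dest": ["The Hague", "Amsterdam", "Rotterdam"]})
tips = pd.DataFrame({"EmpNr": [5, 9, 7], "Amount": [10.0, 5.0, 2.5]})

inner = pd.merge(dests, tips, on="EmpNr", how="inner")  # keys present in both
outer = pd.merge(dests, tips, on="EmpNr", how="outer")  # keys present in either
left = pd.merge(dests, tips, on="EmpNr", how="left")    # every row of dests

# join() matches on the index, so EmpNr has to become the index first
joined = dests.set_index("EmpNr").join(tips.set_index("EmpNr"), how="inner")

print(outer)
```

Rows without a partner in the other table are filled with NaN, which is how employee 3 (no tip) and employee 7 (no destination) appear in the outer join.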
8. Handling missing data
a) Read the data.
Code:
df = pd.read_csv(r"H:\Python\data\WHO.csv")
# print df.head()
df = df[['Country', df.columns[6]]][:2]  # keep Country and the column at position 6 (zero-based), first two rows only
print "New df:\n", df
Output:
New df:
       Country  Net primary school enrolment ratio female (%)
0  Afghanistan                                            NaN
1      Albania                                           93.0
b) pandas marks missing values as NaN (Not a Number). The isnull() function checks for missing values.
Code:
print "Null Values:\n", pd.isnull(df)                           # True where a value is missing
print "Not Null Values:\n", pd.notnull(df)                      # True where a value is present
print "Last Column Doubled:\n", 2 * df[df.columns[-1]]          # NaN times a number is still NaN
print "Last Column plus NaN:\n", df[df.columns[-1]] + np.nan    # adding NaN turns every value into NaN
print "Zero filled:\n", df.fillna(0)                            # replace NaN with 0
Output:
Null Values:
  Country Net primary school enrolment ratio female (%)
0   False                                          True
1   False                                         False
Not Null Values:
  Country Net primary school enrolment ratio female (%)
0    True                                         False
1    True                                          True
Last Column Doubled:
0      NaN
1    186.0
Name: Net primary school enrolment ratio female (%), dtype: float64
Last Column plus NaN:
0   NaN
1   NaN
Name: Net primary school enrolment ratio female (%), dtype: float64
Zero filled:
       Country  Net primary school enrolment ratio female (%)
0  Afghanistan                                            0.0
1      Albania                                           93.0
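Filling with zero is only one way to deal with NaN. This sketch, in Python 3 syntax, uses the two rows shown above and illustrates two alternatives that are not in the original listing: dropping incomplete rows with dropna() and filling with the column mean.

```python
import numpy as np
import pandas as pd

# The two rows shown above, typed in by hand
df = pd.DataFrame({
    "Country": ["Afghanistan", "Albania"],
    "Net primary school enrolment ratio female (%)": [np.nan, 93.0],
})
ratio = df.columns[-1]

missing_per_column = df.isnull().sum()             # number of NaN values per column
complete_rows = df.dropna()                        # drop rows that contain any NaN
mean_filled = df[ratio].fillna(df[ratio].mean())   # mean() skips NaN by default

print(missing_per_column)
print(mean_filled)
```

Which strategy is appropriate depends on the analysis: a zero enrolment ratio is a very different claim from an unknown one.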
9. Working with dates
a) Define a date range of 42 periods starting on January 1, 1900.
Code:
# periods=42 is the number of dates to generate; freq='D' means daily frequency.
# With freq='W' the range instead holds 42 weekly dates.
print "Date range:\n", pd.date_range('1/1/1900', periods=42, freq='D')
Output (this listing corresponds to freq='W': 42 weekly dates, each falling on a Sunday):
Date range:
DatetimeIndex(['1900-01-07', '1900-01-14', '1900-01-21', '1900-01-28',
               '1900-02-04', '1900-02-11', '1900-02-18', '1900-02-25',
               '1900-03-04', '1900-03-11', '1900-03-18', '1900-03-25',
               '1900-04-01', '1900-04-08', '1900-04-15', '1900-04-22',
               '1900-04-29', '1900-05-06', '1900-05-13', '1900-05-20',
               '1900-05-27', '1900-06-03', '1900-06-10', '1900-06-17',
               '1900-06-24', '1900-07-01', '1900-07-08', '1900-07-15',
               '1900-07-22', '1900-07-29', '1900-08-05', '1900-08-12',
               '1900-08-19', '1900-08-26', '1900-09-02', '1900-09-09',
               '1900-09-16', '1900-09-23', '1900-09-30', '1900-10-07',
               '1900-10-14', '1900-10-21'],
              dtype='datetime64[ns]', freq='W-SUN')
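The difference between the daily and weekly frequencies can be checked side by side. A minimal sketch in Python 3 syntax, using the same start date and number of periods:

```python
import pandas as pd

# 42 consecutive days starting on the given date
daily = pd.date_range("1900-01-01", periods=42, freq="D")

# 42 week-ending dates; "W" anchors on Sundays, so the first one is 1900-01-07
weekly = pd.date_range("1900-01-01", periods=42, freq="W")

print(daily[0].date(), daily[-1].date())
print(weekly[0].date(), weekly[-1].date())
```

periods always counts the number of generated dates, whatever the frequency; only the spacing between them changes.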
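b) The range of dates in pandas is limited. pandas timestamps are based on the NumPy datetime64 type with nanosecond resolution, and the value is stored in a 64-bit integer. Valid timestamps therefore fall between the years 1677 and 2262, although not every date in those two boundary years is valid. The exact midpoint of the range is January 1, 1970. As a result, January 1, 1677 cannot be represented as a pandas timestamp, while September 30, 1677 can. The code below demonstrates this:
Code:
import pandas as pd
import sys

try:
    print "Date range:\n", pd.date_range('1/1/1677', periods=4, freq='D')
except Exception:
    etype, value, _ = sys.exc_info()            # get the exception type and value
    print "Error encountered:\n", etype, value  # print them
Output:
Date range:
Error encountered:
<class 'pandas.tslib.OutOfBoundsDatetime'> Out of bounds nanosecond timestamp: 1677-01-01 00:00:00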
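c) Use the pandas DateOffset function to compute the allowed date range:
Code:
offset = pd.DateOffset(seconds=2**63/10**9)  # half of the 64-bit nanosecond range, in whole seconds
mid = pd.to_datetime('1/1/1970')             # midpoint of the valid range
print "Start valid range:\n", mid - offset
print "End valid range:\n", mid + offset
Output:
Start valid range:
1677-09-21 00:12:44
End valid range:
2262-04-11 23:47:16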
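pandas also exposes these limits directly as Timestamp.min and Timestamp.max. The sketch below, in Python 3 syntax, prints them and repeats the arithmetic from the listing above; note that Python 3 needs the integer division operator // to obtain a whole number of seconds.

```python
import pandas as pd

lowest = pd.Timestamp.min    # earliest representable nanosecond timestamp
highest = pd.Timestamp.max   # latest representable nanosecond timestamp
epoch = pd.Timestamp("1970-01-01")

# 2**63 nanoseconds expressed in whole seconds (integer division)
half_range_seconds = 2**63 // 10**9

print(lowest, highest, half_range_seconds)
```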
A format string can be passed to to_datetime() to parse date strings in a known layout:
Code:
print "With format:\n", pd.to_datetime(['1901113', '19031230'], format='%Y%m%d')
Output:
With format:
DatetimeIndex(['1901-11-03', '1903-12-30'], dtype='datetime64[ns]', freq=None)
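d) A string that is clearly not a date cannot be converted. Setting the coerce parameter to True forces the conversion and turns such entries into NaT (Not a Time). Newer pandas versions request the same behavior with errors='coerce'.
Code:
# pd.to_datetime(['1901-11-13', 'not a date']) raises an error, because the second string cannot be parsed
print "Illegal date:\n", pd.to_datetime(['1901-11-13', 'not a date'], coerce=True)  # forced conversion; the bad entry becomes NaT
Output:
Illegal date:
DatetimeIndex(['1901-11-13', 'NaT'], dtype='datetime64[ns]', freq=None)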
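A sketch of the same conversion in Python 3 syntax, using the errors='coerce' spelling. The try/except around the default call is added here only to show both behaviors in a single run.

```python
import pandas as pd

raw = ["1901-11-13", "not a date"]

# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw, errors="coerce")

# With the default errors="raise", the same input raises an exception
try:
    pd.to_datetime(raw)
    raised = False
except (ValueError, TypeError):
    raised = True

print(parsed, raised)
```

NaT entries can then be found with pd.isna(), in the same way NaN values are found in numeric columns.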
10. Pivot tables
a) A pivot table aggregates data from selected rows and columns of a flat table. The aggregation can be a sum, a mean, a standard deviation, and so on.
Code:
import pandas as pd
from numpy.random import seed
from numpy.random import rand
from numpy.random import randint
import numpy as np

seed(42)
N = 7
df = pd.DataFrame({'Weather': ['cold', 'hot', 'cold', 'hot', 'cold', 'hot', 'cold'],
                   'Food': ['soup', 'soup', 'icecream', 'chocolate', 'icecream', 'icecream', 'soup'],
                   'Price': 10 * rand(7),
                   'Number': randint(1, 9, size=(7,))})
print "DataFrame:\n", df
print pd.pivot_table(df, index='Food', aggfunc=np.sum)  # sum the numeric columns for each kind of Food
Output:
DataFrame:
        Food  Number     Price Weather
0       soup       8  3.745401    cold
1       soup       5  9.507143     hot
2   icecream       4  7.319939    cold
3  chocolate       8  5.986585     hot
4   icecream       8  1.560186    cold
5   icecream       3  1.559945     hot
6       soup       6  0.580836    cold
           Number      Price
Food
chocolate       8   5.986585
icecream       15  10.440071
soup           19  13.833380
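The pivot table can be reproduced without random numbers. This sketch, in Python 3 syntax, types in the rows from the output above, names the aggregation as the string "sum", and lists the numeric columns explicitly via values. The second table, which spreads Weather across the columns, is an addition for illustration.

```python
import pandas as pd

# Rows typed in from the example output above
df = pd.DataFrame({
    "Weather": ["cold", "hot", "cold", "hot", "cold", "hot", "cold"],
    "Food": ["soup", "soup", "icecream", "chocolate", "icecream", "icecream", "soup"],
    "Price": [3.745401, 9.507143, 7.319939, 5.986585, 1.560186, 1.559945, 0.580836],
    "Number": [8, 5, 4, 8, 8, 3, 6],
})

# Restricting values to the numeric columns keeps Weather out of the sum
totals = pd.pivot_table(df, index="Food", values=["Number", "Price"], aggfunc="sum")

# Adding columns= spreads a second key across the columns;
# combinations that never occur (such as cold chocolate) become NaN
by_weather = pd.pivot_table(df, index="Food", columns="Weather",
                            values="Number", aggfunc="sum")

print(totals)
print(by_weather)
```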