python数据分析实践(四)

来源:互联网 发布:网络维护外包费用 编辑:程序博客网 时间:2024/06/03 18:18

任务一:将三个Dataframe进行合并

  • Dataframe1: energy supply and renewable electricity production from the United Nations for the year 2013
    导入”Energy Indicators.xls”去除不需要的数据,修改column名,更改列数据修改index以及更改index名
  • Dataframe2:GDP from 1960 to 2015 from World Bank
    导入”world_bank.csv”去除不需要数据,变更index进行个别修改
  • Dataframe3:Sciamgo Journal and Country Rank data for Energy Engineering and Power Technology
    导入”scimagojr-3.xlsx”
    最后返回只有[‘Rank’, ‘Documents’, ‘Citable documents’, ‘Citations’, ‘Self-citations’, ‘Citations per document’, ‘H index’, ‘Energy Supply’, ‘Energy Supply per Capita’, ‘% Renewable’, ‘2006’, ‘2007’, ‘2008’, ‘2009’, ‘2010’, ‘2011’, ‘2012’, ‘2013’, ‘2014’, ‘2015’]这些column的Top15数据框
import pandas as pdimport numpy as np### energyenergy = pd.read_excel('Energy Indicators.xls',skiprows=17,skip_footer = 38) \           .rename(columns={'Unnamed: 2':'Country',                            'Petajoules':'Energy Supply',                            'Gigajoules':'Energy Supply per Capita',                            '%':'% Renewable'})\           [['Country','Energy Supply','Energy Supply per Capita','% Renewable']] \           .set_index('Country')# 将缺失数据'...',填充为np.nanindex = energy[energy['Energy Supply']=='...'].indexenergy.loc[index,'Energy Supply']= np.nanenergy.loc[index,'Energy Supply per Capita']= np.nanenergy['Energy Supply'] *= 1000000energy.astype('float64')# 修改indexenergy.index = energy.index.str.split('\s\(').str[0]energy.index = energy.index.str.split('\d').str[0]energy.rename(index={'Republic of Korea': 'South Korea',                     'United States of America': 'United States',                     'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',                     'China, Hong Kong Special Administrative Region': 'Hong Kong'},inplace=True)### GDPGDP = pd.read_csv('world_bank.csv',skiprows=4)\        .set_index('Country Name')\        .rename(index={'Korea, Rep.': 'South Korea',                        'Iran, Islamic Rep.': 'Iran',                       'Hong Kong SAR, China': 'Hong Kong'})### ScimEnScimEn = pd.read_excel('scimagojr-3.xlsx').set_index('Country')def answer_one():    return pd.merge(pd.merge(ScimEn[:15], energy,left_index=True,right_index=True),                    GDP[['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']],                    left_index=True,                    right_index=True)answer_one()

由于采用python pandorable的编码方式,导致很多代码行过长,对于过长的代码行,在末尾加上 “\” 进行换行,并不影响代码的执行。一般一个操作我换一行,这样增加代码的可读性。对于修改缺失数据’…’为np.nan,做了好久,由于很多情况下我们得到的都是数据的copy而不是dataframe本身,所以我们提取数据的一部分出来修改是达不到目的的,最终还是屈服于运用loc函数进行修改。

任务二:对任务一返回的Dataframe建立新的列

新的列’HighRenew’要求根据列’% Renewable’的中位数进行判定,大于等于中位数就等于1,小于就等于0。

def answer_two():    Top15 = answer_one()    median = Top15['% Renewable'].median()    Top15['HighRenew'] = Top15['% Renewable'].map(lambda x: 1 if x >= median else 0)    return Top15.sort_values(by = 'Rank').HighRenewanswer_two()

任务三:得出’% Renewable’最大的国家及其百分比值构成一个tuple

def answer_three():    Top15 = answer_one()    Rank15 = Top15['% Renewable'].sort_values(ascending=False)    return tuple((Rank15.index[0],Rank15[0]))answer_three()

对于Dataframe和Series排序最好不采用sort和order, 使用sort_values进行升降序调整。

任务四:增加新的column ‘Continent’国家所属大洲,然后对洲进行groupby求取[‘size’, ‘sum’, ‘mean’, ‘std’]

ContinentDict  = {'China':'Asia',                   'United States':'North America',                   'Japan':'Asia',                   'United Kingdom':'Europe',                   'Russian Federation':'Europe',                   'Canada':'North America',                   'Germany':'Europe',                   'India':'Asia',                  'France':'Europe',                   'South Korea':'Asia',                   'Italy':'Europe',                   'Spain':'Europe',                   'Iran':'Asia',                  'Australia':'Australia',                   'Brazil':'South America'}def answer_four():    Top15 = answer_one()    Top15['PopEst'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita']    Top15 = pd.concat([Top15,pd.Series(ContinentDict, name = 'Continent')],axis = 1)[['PopEst','Continent']]    return Top15.groupby('Continent').agg(['size','sum','mean','std'])['PopEst']answer_four()

由于dataframe和Series是无法merge的,所以采用纵向的concat进行合并(axis=1)

任务五:对列’% Renewable’进行等距分割,然后对Top15按照Continent和新的Renewable bins进行group,求每组的个数

def answer_five():    Top15 = answer_one()    Top15 = pd.concat([Top15,pd.Series(ContinentDict, name = 'Continent')],axis = 1)    return pd.concat([Top15['Continent'],pd.cut(Top15['% Renewable'],5)],axis=1).groupby(['Continent','% Renewable']).size()answer_five()

pd.cut完成等距变量求取,将Renewable列的数据段等距分为5段

任务六:对Popluation Estimate Series的数据进行格式化千分位

def answer_six():    Top15 = answer_one()    Top15['PopEst'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita']    return Top15['PopEst'].map(lambda x: '{:,}'.format(x))answer_six()

运用python自带format函数直接完成千分位格式化。
map(),对象是Series,将Series中的每一项导入进map的函数进行处理。
apply(),对象是Dataframe,将Dataframe的每一列导入进apply的函数进行处理。

最后贴上一个作图

def plot_optional():    import matplotlib as plt    %matplotlib inline    Top15 = answer_one()    ax = Top15.plot(x='Rank', y='% Renewable', kind='scatter',                     c=['#e41a1c','#377eb8','#e41a1c','#4daf4a','#4daf4a','#377eb8','#4daf4a','#e41a1c',                       '#4daf4a','#e41a1c','#4daf4a','#4daf4a','#e41a1c','#dede00','#ff7f00'],                     xticks=range(1,16), s=6*Top15['2014']/10**10, alpha=.75, figsize=[16,6]);    for i, txt in enumerate(Top15.index):        ax.annotate(txt, [Top15['Rank'][i], Top15['% Renewable'][i]], ha='center')    print("This is an example of a visualization that can be created to help understand the data. \This is a bubble chart showing % Renewable vs. Rank. The size of the bubble corresponds to the countries' \2014 GDP, and the color corresponds to the continent.")

这里写图片描述

原创粉丝点击