5.6 应用实例
来源:互联网 发布:mysql创建序列sql语句 编辑:程序博客网 时间:2024/06/18 05:35
本节引入两个实际案例,介绍Pandas工具对数据集进行预处理的操作方法。
5.6.1 分析titanic数据
首先使用read_csv()读取CSV文件,并观察前5行数据。示例代码:
import pandas as pdimport numpy as nptitanic_survival = pd.read_csv("titanic_train.csv")print(titanic_survival.head())
运行结果:
PassengerId Survived Pclass \0 1 0 3 1 2 1 1 2 3 1 3 3 4 1 1 4 5 0 3 Name Sex Age SibSp \0 Braund, Mr. Owen Harris male 22.0 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 2 Heikkinen, Miss. Laina female 26.0 0 3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 4 Allen, Mr. William Henry male 35.0 0 Parch Ticket Fare Cabin Embarked 0 0 A/5 21171 7.2500 NaN S 1 0 PC 17599 71.2833 C85 C 2 0 STON/O2. 3101282 7.9250 NaN S 3 0 113803 53.1000 C123 S 4 0 373450 8.0500 NaN S
统计Age列有多少值为空,示例代码:
age = titanic_survival["Age"]age_is_null =pd.isnull(age)age_null_true = age[age_is_null]age_null_count = len(age_null_true)print(age_null_count)
运行结果:
177
求Age列均值有两种方法,第一种方法首先取得Age列不为空的值,然后求解平均值,示例代码:
good_ages = titanic_survival["Age"][age_is_null==False] #把Age列不为空的值赋值给good_agescorrect_mean_age =sum(good_ages)/len(good_ages)print(correct_mean_age)
运行结果:
29.6991176471
方法二直接引入mean()方法,mean函数会自动取出Age列中为空的值,然后求解平均值,示例代码:
correct_mean_age = titanic_survival["Age"].mean() print(correct_mean_age)
运行结果:
29.69911764705882
统计每种等级船舱平均票价,示例代码:
passenger_classes = [1, 2, 3]fares_by_class = {}for this_class in passenger_classes: pclass_rows = titanic_survival[titanic_survival["Pclass"]==this_class] pclass_fares = pclass_rows["Fare"] fare_for_class = pclass_fares.mean() fares_by_class[this_class] = fare_for_classprint(fares_by_class)
运行结果:
{1: 84.15468749999992, 2: 20.66218315217391, 3: 13.675550101832997}
数据透视表(Pivot Table)是一种交互式的表,可以进行某些计算,如求和与计数等。所进行的计算与数据跟数据透视表中的排列有关。之所以称为数据透视表,是因为可以动态地改变它们的版面布置,以便按照不同方式分析数据,也可以重新安排行号、列标和页字段。每一次改变版面布置时,数据透视表会立即按照新的布置重新计算数据。另外,如果原始数据发生更改,则可以更新数据透视表。
Pandas通过透视表函数pivot_table(),找出每种Pclass所对应Survived的平均值,示例代码:
passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)print(passenger_survival)
运行结果:
Pclass1 0.6296302 0.4728263 0.242363Name: Survived, dtype: float64
每种Pclass所对应Age的平均值,示例代码:
passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age", aggfunc=np.mean)print(passenger_age)
运行结果
Pclass1 38.2334412 29.8776303 25.140620Name: Age, dtype: float64
每种Embarked与Fare和Survived列的和值,示例代码:
port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc=np.sum)print(port_stats)
运行结果:
Fare SurvivedEmbarked C 10072.2962 93Q 1022.2543 30S 17439.3988 217
5.6.2 分析贷款风险数据
从www.lendingclub.com/info/download-data.action网站获取2007-2011年贷款申请相关信息数据集(LoanStats3a.csv),如图所示:
然后这些历史数据建立模型去预测新申请人是否有贷款资格,为保证拟合数据模型有效,要清洗数据过滤无用特征。首先利用Pandas工具导入数据集,示例代码:
import pandas as pdloans_2007 = pd.read_csv("LoanStats3a.csv", skiprows = 1) #导入csv文件,并且忽略第一行数据half_count = len(loans_2007)/2loans_2007 = loans_2007.dropna(thresh = half_count, axis = 1) # 删除缺失值loans_2007.drop_duplicates() #删除重复的行print(loans_2007.iloc[0]) #输出第一行数据print(loans_2007.shape[1]) #输出一共有多少特征列
运行结果:
id 1077501member_id 1.2966e+06loan_amnt 5000funded_amnt 5000funded_amnt_inv 4975term 36 monthsint_rate 10.65%installment 162.87grade Bsub_grade B2emp_title NaNemp_length 10+ yearshome_ownership RENTannual_inc 24000verification_status Verifiedissue_d Dec-2011loan_status Fully Paidpymnt_plan npurpose credit_cardtitle Computerzip_code 860xxaddr_state AZdti 27.65delinq_2yrs 0earliest_cr_line Jan-1985inq_last_6mths 1open_acc 3pub_rec 0revol_bal 13648revol_util 83.7%total_acc 9initial_list_status fout_prncp 0out_prncp_inv 0total_pymnt 5863.16total_pymnt_inv 5833.84total_rec_prncp 5000total_rec_int 863.16total_rec_late_fee 0recoveries 0collection_recovery_fee 0last_pymnt_d Jan-2015last_pymnt_amnt 171.62last_credit_pull_d Nov-2016collections_12_mths_ex_med 0policy_code 1application_type INDIVIDUALacc_now_delinq 0chargeoff_within_12_mths 0delinq_amnt 0pub_rec_bankruptcies 0tax_liens 0Name: 0, dtype: object52
如结果所示,与预测模型无关的特征包括如下几个方面:
- 一、明显与申请贷款无任何影响的特征,如各种编号和名称包括id(编号)、member_id(会员号)、emp_title(公司名称)、zip_code(编码)等;
- 二、已预测后的特征对预测信息无实质影响,如funded_amnt(放款金额)、funded_amnt_inv(首轮放款金额)等;
- 三、高度重复的特征如grade(会员分值)、sub_grade(二级分值);
通过常识也可以筛选出对贷款人风险预测非常重要的特征如home_owership(住房情况,自购房还是租住房,涉及担保抵押问题),anual_inc(工资收入)等。特征工程的建立是一个非常复杂的过程,需要相关行业的专业人员进行评估进行筛查,以达到更出色的效果,鉴于学习案例的演示,作者只能按自己对贷款行业粗浅的认知对特征进行选取,示例代码:
运行结果:loans_2007 = loans_2007.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)loans_2007 = loans_2007.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)loans_2007 = loans_2007.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)print(loans_2007.iloc[0])print(loans_2007.shape[1])
如结果所示,经过几轮过滤,从原先的54列筛选出32列作为比较有价值的特征属性列。然而,目前数据集并没有明显的标注出结果特征标签列,即是否借出的指标(以True/False或者0/1标注)。loan_amnt 5000term 36 monthsint_rate 10.65%installment 162.87emp_length 10+ yearshome_ownership RENTannual_inc 24000verification_status Verifiedloan_status Fully Paidpymnt_plan npurpose credit_cardtitle Computeraddr_state AZdti 27.65delinq_2yrs 0earliest_cr_line Jan-1985inq_last_6mths 1open_acc 3pub_rec 0revol_bal 13648revol_util 83.7%total_acc 9initial_list_status flast_credit_pull_d Nov-2016collections_12_mths_ex_med 0policy_code 1application_type INDIVIDUALacc_now_delinq 0chargeoff_within_12_mths 0delinq_amnt 0pub_rec_bankruptcies 0tax_liens 0Name: 0, dtype: object32
经过对数据集的再次观察,发现loan_status表示的是当前贷款状态,通过value_counts()统计每个值出现的个数,示例代码:
运行结果:print(loans_2007['loan_status'].value_counts())
由结果可知,loan_status属性有几个候选值,如Fully Paid,即全额放款(是);Charged Off,即没有被批准(否);而其他属性没有前两项意义明确,且数据量比较小对数据集影像不大,可舍弃。因此,我们可以取loan_status中Fully Paid和Charged Off的值作为结果特征量,且映射为1/0二分类标签,示例代码:Fully Paid 33902Charged Off 5658Does not meet the credit policy. Status:Fully Paid 1988Does not meet the credit policy. Status:Charged Off 761Current 201Late (31-120 days) 10In Grace Period 9Late (16-30 days) 5Default 1
再次观察数据集,发现有的属性列,所有值都统一相同,如pymnt_plan其所有值都为n,这样的指标对预测模型没有贡献。因此,我们在做数据预处理的时候,要把所有值相同的列名提取出来,并弃之。示例代码:loans_2007 = loans_2007[(loans_2007['loan_status'] == "Fully Paid") | (loans_2007['loan_status'] == "Charged Off")]status_replace = { "loan_status" : { "Fully Paid": 1, "Charged Off": 0, }}loans_2007 = loans_2007.replace(status_replace) #将Full Paid和Charged Off映射为1和0属性值
运行结果:orig_columns = loans_2007.columnsdrop_columns = []for col in orig_columns: col_series = loans_2007[col].dropna().unique() if len(col_series) == 1: drop_columns.append(col)loans_2007 = loans_2007.drop(drop_columns, axis=1)print(drop_columns)print(loans_2007.shape)
经过滤剩下的24列我们认为有意义的属性列,其中loan_status是结果特征,其值为0或1,即是否发放贷款。最后将过滤清洗好的数据保存为csv文件,以备下一步建模使用,示例代码:['initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens'](39560, 24)
经过以上步骤,我们将实际案例中复杂的数据集进行了简单的清洗过滤,对于列值比较多的数据集,首先要明确每列指标代表的意义,然后对指标进行筛选,要对特征价值低、噪音高的属性列进行舍弃,过多的特征值,会导致生成的模型过拟合的现象,这应当注意和防范。loans_2007.to_csv('filtered_loans_2007.csv', index=False)
- 5.6 应用实例
- CFTREE数据库应用实例
- 模式应用和实例
- 模式应用和实例
- XSL应用实例
- 串口中断应用实例
- Debug应用实例
- 模式应用和实例
- XSL应用实例
- DBGridEH应用实例(摘抄)
- Portlet应用开发实例
- DBGridEH应用实例
- 专家系统的应用实例
- Debug应用实例
- Ajax 之实例应用
- Jmail组件应用实例
- ibatis + dbunit 应用实例
- Apache Digester应用实例
- 5.1 Pandas库介绍及导入
- 5.2 Series对象
- 5.3 DataFrame对象
- 5.4 数据操作
- 5.5 数据预处理
- 5.6 应用实例
- 6.1 Matplotlib库介绍及快速绘图
- 6.2 折线图及绘制属性
- 6.3 图形、子图和坐标
- C++静态链接库与动态链接库的优缺点
- 6.4 条形图
- 有return的情况下try catch finally的执行顺序
- 6.5 散点图
- java将毫秒转成时间格式的日期