5.6 应用实例

来源：互联网发布：mysql创建序列sql语句编辑：程序博客网时间：2024/06/18 05:35

本节引入两个实际案例，介绍Pandas工具对数据集进行预处理的操作方法。

5.6.1 分析titanic数据

首先使用read_csv()读取CSV文件，并观察前5行数据。示例代码：

import pandas as pdimport numpy as nptitanic_survival = pd.read_csv("titanic_train.csv")print(titanic_survival.head())

运行结果：

   PassengerId  Survived  Pclass  \0            1         0       3   1            2         1       1   2            3         1       3   3            4         1       1   4            5         0       3                                                   Name     Sex   Age  SibSp  \0                            Braund, Mr. Owen Harris    male  22.0      1   1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   2                             Heikkinen, Miss. Laina  female  26.0      0   3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   4                           Allen, Mr. William Henry    male  35.0      0      Parch            Ticket     Fare Cabin Embarked  0      0         A/5 21171   7.2500   NaN        S  1      0          PC 17599  71.2833   C85        C  2      0  STON/O2. 3101282   7.9250   NaN        S  3      0            113803  53.1000  C123        S  4      0            373450   8.0500   NaN        S

统计Age列有多少值为空，示例代码：

age = titanic_survival["Age"]age_is_null =pd.isnull(age)age_null_true = age[age_is_null]age_null_count = len(age_null_true)print(age_null_count)

运行结果：

求Age列均值有两种方法，第一种方法首先取得Age列不为空的值，然后求解平均值，示例代码：

good_ages = titanic_survival["Age"][age_is_null==False]      #把Age列不为空的值赋值给good_agescorrect_mean_age =sum(good_ages)/len(good_ages)print(correct_mean_age)

运行结果：

29.6991176471

方法二直接引入mean()方法，mean函数会自动取出Age列中为空的值，然后求解平均值，示例代码：

correct_mean_age = titanic_survival["Age"].mean()      print(correct_mean_age)

运行结果：

29.69911764705882

统计每种等级船舱平均票价，示例代码：

passenger_classes = [1, 2, 3]fares_by_class = {}for this_class in passenger_classes:        pclass_rows = titanic_survival[titanic_survival["Pclass"]==this_class]        pclass_fares = pclass_rows["Fare"]        fare_for_class = pclass_fares.mean()        fares_by_class[this_class] = fare_for_classprint(fares_by_class)

运行结果：

{1: 84.15468749999992, 2: 20.66218315217391, 3: 13.675550101832997}

数据透视表（Pivot Table）是一种交互式的表，可以进行某些计算，如求和与计数等。所进行的计算与数据跟数据透视表中的排列有关。之所以称为数据透视表，是因为可以动态地改变它们的版面布置，以便按照不同方式分析数据，也可以重新安排行号、列标和页字段。每一次改变版面布置时，数据透视表会立即按照新的布置重新计算数据。另外，如果原始数据发生更改，则可以更新数据透视表。
Pandas通过透视表函数pivot_table()，找出每种Pclass所对应Survived的平均值，示例代码：

passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)print(passenger_survival)

运行结果：

Pclass1    0.6296302    0.4728263    0.242363Name: Survived, dtype: float64

每种Pclass所对应Age的平均值，示例代码：

passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age", aggfunc=np.mean)print(passenger_age)

运行结果

Pclass1    38.2334412    29.8776303    25.140620Name: Age, dtype: float64

每种Embarked与Fare和Survived列的和值，示例代码：

port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc=np.sum)print(port_stats)

运行结果：

                Fare  SurvivedEmbarked                      C         10072.2962        93Q          1022.2543        30S         17439.3988       217

5.6.2 分析贷款风险数据

从www.lendingclub.com/info/download-data.action网站获取2007-2011年贷款申请相关信息数据集（LoanStats3a.csv），如图所示：

www.lendingclub.com

然后这些历史数据建立模型去预测新申请人是否有贷款资格，为保证拟合数据模型有效，要清洗数据过滤无用特征。首先利用Pandas工具导入数据集，示例代码：

import pandas as pdloans_2007 = pd.read_csv("LoanStats3a.csv", skiprows = 1)    #导入csv文件，并且忽略第一行数据half_count = len(loans_2007)/2loans_2007 = loans_2007.dropna(thresh = half_count, axis = 1)   # 删除缺失值loans_2007.drop_duplicates()    #删除重复的行print(loans_2007.iloc[0])      #输出第一行数据print(loans_2007.shape[1])      #输出一共有多少特征列

运行结果：

id                                1077501member_id                      1.2966e+06loan_amnt                            5000funded_amnt                          5000funded_amnt_inv                      4975term                            36 monthsint_rate                           10.65%installment                        162.87grade                                   Bsub_grade                              B2emp_title                             NaNemp_length                      10+ yearshome_ownership                       RENTannual_inc                          24000verification_status              Verifiedissue_d                          Dec-2011loan_status                    Fully Paidpymnt_plan                              npurpose                       credit_cardtitle                            Computerzip_code                            860xxaddr_state                             AZdti                                 27.65delinq_2yrs                             0earliest_cr_line                 Jan-1985inq_last_6mths                          1open_acc                                3pub_rec                                 0revol_bal                           13648revol_util                          83.7%total_acc                               9initial_list_status                     fout_prncp                               0out_prncp_inv                           0total_pymnt                       5863.16total_pymnt_inv                   5833.84total_rec_prncp                      5000total_rec_int                      863.16total_rec_late_fee                      0recoveries                              0collection_recovery_fee                 0last_pymnt_d                     Jan-2015last_pymnt_amnt                    171.62last_credit_pull_d               Nov-2016collections_12_mths_ex_med              0policy_code                             1application_type               INDIVIDUALacc_now_delinq                          0chargeoff_within_12_mths                0delinq_amnt                             0pub_rec_bankruptcies                    0tax_liens                               0Name: 0, dtype: object52

如结果所示，与预测模型无关的特征包括如下几个方面：

一、明显与申请贷款无任何影响的特征，如各种编号和名称包括id（编号）、member_id（会员号）、emp_title（公司名称）、zip_code（编码）等；
二、已预测后的特征对预测信息无实质影响，如funded_amnt（放款金额）、funded_amnt_inv（首轮放款金额）等；

三、高度重复的特征如grade（会员分值）、sub_grade（二级分值）；
通过常识也可以筛选出对贷款人风险预测非常重要的特征如home_owership（住房情况，自购房还是租住房，涉及担保抵押问题），anual_inc（工资收入）等。特征工程的建立是一个非常复杂的过程，需要相关行业的专业人员进行评估进行筛查，以达到更出色的效果，鉴于学习案例的演示，作者只能按自己对贷款行业粗浅的认知对特征进行选取，示例代码：

loans_2007 = loans_2007.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)loans_2007 = loans_2007.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)loans_2007 = loans_2007.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)print(loans_2007.iloc[0])print(loans_2007.shape[1])

运行结果：

loan_amnt                            5000term                            36 monthsint_rate                           10.65%installment                        162.87emp_length                      10+ yearshome_ownership                       RENTannual_inc                          24000verification_status              Verifiedloan_status                    Fully Paidpymnt_plan                              npurpose                       credit_cardtitle                            Computeraddr_state                             AZdti                                 27.65delinq_2yrs                             0earliest_cr_line                 Jan-1985inq_last_6mths                          1open_acc                                3pub_rec                                 0revol_bal                           13648revol_util                          83.7%total_acc                               9initial_list_status                     flast_credit_pull_d               Nov-2016collections_12_mths_ex_med              0policy_code                             1application_type               INDIVIDUALacc_now_delinq                          0chargeoff_within_12_mths                0delinq_amnt                             0pub_rec_bankruptcies                    0tax_liens                               0Name: 0, dtype: object32

如结果所示，经过几轮过滤，从原先的54列筛选出32列作为比较有价值的特征属性列。然而，目前数据集并没有明显的标注出结果特征标签列，即是否借出的指标（以True/False或者0/1标注）。
经过对数据集的再次观察，发现loan_status表示的是当前贷款状态，通过value_counts()统计每个值出现的个数，示例代码：

print(loans_2007['loan_status'].value_counts())

运行结果：

Fully Paid                                             33902Charged Off                                             5658Does not meet the credit policy. Status:Fully Paid      1988Does not meet the credit policy. Status:Charged Off      761Current                                                  201Late (31-120 days)                                        10In Grace Period                                            9Late (16-30 days)                                          5Default                                                    1

由结果可知，loan_status属性有几个候选值，如Fully Paid，即全额放款（是）；Charged Off，即没有被批准（否）；而其他属性没有前两项意义明确，且数据量比较小对数据集影像不大，可舍弃。因此，我们可以取loan_status中Fully Paid和Charged Off的值作为结果特征量，且映射为1/0二分类标签，示例代码：

loans_2007 = loans_2007[(loans_2007['loan_status'] == "Fully Paid") | (loans_2007['loan_status'] == "Charged Off")]status_replace = {          "loan_status" : {      "Fully Paid": 1,      "Charged Off": 0,  }}loans_2007 = loans_2007.replace(status_replace)          #将Full Paid和Charged Off映射为1和0属性值

再次观察数据集，发现有的属性列，所有值都统一相同，如pymnt_plan其所有值都为n，这样的指标对预测模型没有贡献。因此，我们在做数据预处理的时候，要把所有值相同的列名提取出来，并弃之。示例代码：

orig_columns = loans_2007.columnsdrop_columns = []for col in orig_columns:  col_series = loans_2007[col].dropna().unique()  if len(col_series) == 1:      drop_columns.append(col)loans_2007 = loans_2007.drop(drop_columns, axis=1)print(drop_columns)print(loans_2007.shape)

运行结果：

['initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens'](39560, 24)

经过滤剩下的24列我们认为有意义的属性列，其中loan_status是结果特征，其值为0或1，即是否发放贷款。最后将过滤清洗好的数据保存为csv文件，以备下一步建模使用，示例代码：

loans_2007.to_csv('filtered_loans_2007.csv', index=False)

经过以上步骤，我们将实际案例中复杂的数据集进行了简单的清洗过滤，对于列值比较多的数据集，首先要明确每列指标代表的意义，然后对指标进行筛选，要对特征价值低、噪音高的属性列进行舍弃，过多的特征值，会导致生成的模型过拟合的现象，这应当注意和防范。

阅读全文

0 0