特征工程

来源:互联网 发布:windows路径长度 编辑:程序博客网 时间:2024/05/08 02:19

特征工程

数据集来源于Data Hackathon 3.x,所有的特征处理也只做最基本的参考,可自行尝试更多的特征工程工作,参考github里Feature engineering和Kaggle Titanic的案例。

加载需要的库:

In [1]:
import pandas as pdimport numpy as np%matplotlib inline
In [2]:
#载入数据:train = pd.read_csv('Train.csv')test = pd.read_csv('Test.csv')
In [3]:
train.shape, test.shape
Out[3]:
((87020, 26), (37717, 24))

看看数据的基本情况

In [4]:
train.dtypes
Out[4]:
ID                        objectGender                    objectCity                      objectMonthly_Income             int64DOB                       objectLead_Creation_Date        objectLoan_Amount_Applied      float64Loan_Tenure_Applied      float64Existing_EMI             float64Employer_Name             objectSalary_Account            objectMobile_Verified           objectVar5                       int64Var1                      objectLoan_Amount_Submitted    float64Loan_Tenure_Submitted    float64Interest_Rate            float64Processing_Fee           float64EMI_Loan_Submitted       float64Filled_Form               objectDevice_Type               objectVar2                      objectSource                    objectVar4                       int64LoggedIn                   int64Disbursed                  int64dtype: object

拿前5条出来看看

In [5]:
train.head(5)
Out[5]:
 IDGenderCityMonthly_IncomeDOBLead_Creation_DateLoan_Amount_AppliedLoan_Tenure_AppliedExisting_EMIEmployer_Name...Interest_RateProcessing_FeeEMI_Loan_SubmittedFilled_FormDevice_TypeVar2SourceVar4LoggedInDisbursed0ID000002C20FemaleDelhi2000023-May-7815-May-15300000.05.00.0CYBOSOL...NaNNaNNaNNWeb-browserGS1221001ID000004E40MaleMumbai3500007-Oct-8504-May-15200000.02.00.0TATA CONSULTANCY SERVICES LTD (TCS)...13.25NaN6762.9NWeb-browserGS1223002ID000007H20MalePanchkula2250010-Oct-8119-May-15600000.04.00.0ALCHEMIST HOSPITALS LTD...NaNNaNNaNNWeb-browserBS1431003ID000008I30MaleSaharsa3500030-Nov-8709-May-151000000.05.00.0BIHAR GOVERNMENT...NaNNaNNaNNWeb-browserBS1433004ID000009J40MaleBengaluru10000017-Feb-8420-May-15500000.02.025000.0GLOBAL EDGE SOFTWARE...NaNNaNNaNNWeb-browserBS134310

5 rows × 26 columns

In [6]:
#合成一个总的datatrain['source']= 'train'test['source'] = 'test'data=pd.concat([train, test],ignore_index=True)data.shape
Out[6]:
(124737, 27)

数据应用/建模一个很重要的工作是,你要看看异常点,比如说缺省值

In [7]:
data.apply(lambda x: sum(x.isnull()))
Out[7]:
City                      1401DOB                          0Device_Type                  0Disbursed                37717EMI_Loan_Submitted       84901Employer_Name              113Existing_EMI               111Filled_Form                  0Gender                       0ID                           0Interest_Rate            84901Lead_Creation_Date           0Loan_Amount_Applied        111Loan_Amount_Submitted    49535Loan_Tenure_Applied        111Loan_Tenure_Submitted    49535LoggedIn                 37717Mobile_Verified              0Monthly_Income               0Processing_Fee           85346Salary_Account           16801Source                       0Var1                         0Var2                         0Var4                         0Var5                         0source                       0dtype: int64

要对数据有更深的认识,比如说,咱们看看这些字段,分别有多少种取值(甚至你可以看看分布)

In [8]:
var = ['Gender','Salary_Account','Mobile_Verified','Var1','Filled_Form','Device_Type','Var2','Source']for v in var:    print '\n%s这一列数据的不同取值和出现的次数\n'%v    print data[v].value_counts()
Gender这一列数据的不同取值和出现的次数Male      71398Female    53339Name: Gender, dtype: int64Salary_Account这一列数据的不同取值和出现的次数HDFC Bank                                          25180ICICI Bank                                         19547State Bank of India                                17110Axis Bank                                          12590Citibank                                            3398Kotak Bank                                          2955IDBI Bank                                           2213Punjab National Bank                                1747Bank of India                                       1713Bank of Baroda                                      1675Standard Chartered Bank                             1434Canara Bank                                         1385Union Bank of India                                 1330Yes Bank                                            1120ING Vysya                                            996Corporation bank                                     948Indian Overseas Bank                                 901State Bank of Hyderabad                              854Indian Bank                                          773Oriental Bank of Commerce                            761IndusInd Bank                                        711Andhra Bank                                          706Central Bank of India                                648Syndicate Bank                                       614Bank of Maharasthra                                  576HSBC                                                 474State Bank of Bikaner & Jaipur                       448Karur Vysya Bank                                     435State Bank of Mysore                                 385Federal Bank                                         377Vijaya Bank                                          354Allahabad Bank                                       345UCO Bank                                             344State Bank of Travancore                             333Karnataka Bank                                       279United Bank of India                                 276Dena Bank                                            268Saraswat Bank                                        265State Bank of Patiala                                263South Indian Bank                                    223Deutsche Bank                                        176Abhyuday Co-op Bank Ltd                              161The Ratnakar Bank Ltd                                113Tamil Nadu Mercantile Bank                           103Punjab & Sind bank                                    84J&K Bank                                              78Lakshmi Vilas bank                                    69Dhanalakshmi Bank Ltd                                 66State Bank of Indore                                  32Catholic Syrian Bank                                  27India Bulls                                           21B N P Paribas                                         15Firstrand Bank Limited                                11GIC Housing Finance Ltd                               10Bank of Rajasthan                                      8Kerala Gramin Bank                                     4Industrial And Commercial Bank Of China Limited        3Ahmedabad Mercantile Cooperative Bank                  1Name: Salary_Account, dtype: int64Mobile_Verified这一列数据的不同取值和出现的次数Y    80928N    43809Name: Mobile_Verified, dtype: int64Var1这一列数据的不同取值和出现的次数HBXX    84901HBXC    12952HBXB     6502HAXA     4214HBXA     3042HAXB     2879HBXD     2818HAXC     2171HBXH     1387HCXF      990HAYT      710HAVC      570HAXM      386HCXD      348HCYS      318HVYS      252HAZD      161HCXG      114HAXF       22Name: Var1, dtype: int64Filled_Form这一列数据的不同取值和出现的次数N    96740Y    27997Name: Filled_Form, dtype: int64Device_Type这一列数据的不同取值和出现的次数Web-browser    92105Mobile         32632Name: Device_Type, dtype: int64Var2这一列数据的不同取值和出现的次数B    53481G    47338C    20366E     1855D      918F      770A        9Name: Var2, dtype: int64Source这一列数据的不同取值和出现的次数S122    55249S133    42900S159     7999S143     6140S127     2804S137     2450S134     1900S161     1109S151     1018S157      929S153      705S144      447S156      432S158      294S123      112S141       83S162       60S124       43S150       19S160       11S136        5S138        5S155        5S139        4S129        4S135        2S131        1S130        1S132        1S125        1S140        1S142        1S126        1S154        1Name: Source, dtype: int64

紧接着你就可以开始处理你的字段(特征)了

我这里只做了一些简单的处理,你大可在我的基础上做更复杂的特征处理

City字段处理

In [9]:
len(data['City'].unique())
Out[9]:
724

好像city的类型好多,粗暴一点,这个字段咱们不要了

In [10]:
data.drop('City',axis=1,inplace=True)

DOB字段处理

DOB是出生的具体日期,咱们要具体日期作用没那么大,年龄段可能对我们有用,所有算一下年龄好了

In [11]:
data['DOB'].head()
Out[11]:
0    23-May-781    07-Oct-852    10-Oct-813    30-Nov-874    17-Feb-84Name: DOB, dtype: object
In [12]:
#创建一个年龄的字段Agedata['Age'] = data['DOB'].apply(lambda x: 115 - int(x[-2:]))data['Age'].head()
Out[12]:
0    371    302    343    284    31Name: Age, dtype: int64
In [13]:
#把原始的DOB字段去掉:data.drop('DOB',axis=1,inplace=True)

EMI_Load_Submitted字段处理

In [14]:
data.boxplot(column=['EMI_Loan_Submitted'],return_type='axes')
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a00d550>
In [15]:
#好像缺失值比较多,干脆就开一个新的字段,表明是缺失值还是不是缺失值data['EMI_Loan_Submitted_Missing'] = data['EMI_Loan_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)data[['EMI_Loan_Submitted','EMI_Loan_Submitted_Missing']].head(10)
Out[15]:
 EMI_Loan_SubmittedEMI_Loan_Submitted_Missing0NaN116762.9002NaN13NaN14NaN156978.9206NaN17NaN1830824.650910883.380
In [16]:
#原始那一列就可以不要了data.drop('EMI_Loan_Submitted',axis=1,inplace=True)

Employer Name字段处理

看看个数

In [17]:
len(data['Employer_Name'].value_counts())
Out[17]:
57193

不看也知道,每个人都有一个名字,太多了,懒癌晚期的同学直接drop掉了

In [18]:
#丢掉data.drop('Employer_Name',axis=1,inplace=True)

Existing_EMI字段

In [19]:
data.boxplot(column='Existing_EMI',return_type='axes')
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x109a2c310>
In [21]:
data['Existing_EMI'].describe()
Out[21]:
count    1.246260e+05mean     3.636342e+03std      3.369124e+04min      0.000000e+0025%               NaN50%               NaN75%               NaNmax      1.000000e+07Name: Existing_EMI, dtype: float64
In [22]:
#缺省值不多,用均值代替data['Existing_EMI'].fillna(0, inplace=True)

Interest_Rate字段:

In [23]:
data.boxplot(column=['Interest_Rate'],return_type='axes')
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a05f250>
In [24]:
#缺省值太多,也造一个字段,表示有无data['Interest_Rate_Missing'] = data['Interest_Rate'].apply(lambda x: 1 if pd.isnull(x) else 0)print data[['Interest_Rate','Interest_Rate_Missing']].head(10)
   Interest_Rate  Interest_Rate_Missing0            NaN                      11          13.25                      02            NaN                      13            NaN                      14            NaN                      15          13.99                      06            NaN                      17            NaN                      18          14.85                      09          18.25                      0
In [25]:
data.drop('Interest_Rate',axis=1,inplace=True)

Lead Creation Date字段

In [26]:
#不!要!了!,是的,不要了!!!data.drop('Lead_Creation_Date',axis=1,inplace=True)data.head()
Out[26]:
 Device_TypeDisbursedExisting_EMIFilled_FormGenderIDLoan_Amount_AppliedLoan_Amount_SubmittedLoan_Tenure_AppliedLoan_Tenure_Submitted...Salary_AccountSourceVar1Var2Var4Var5sourceAgeEMI_Loan_Submitted_MissingInterest_Rate_Missing0Web-browser0.00.0NFemaleID000002C20300000.0NaN5.0NaN...HDFC BankS122HBXXG10train37111Web-browser0.00.0NMaleID000004E40200000.0200000.02.02.0...ICICI BankS122HBXAG313train30002Web-browser0.00.0NMaleID000007H20600000.0450000.04.04.0...State Bank of IndiaS143HBXXB10train34113Web-browser0.00.0NMaleID000008I301000000.0920000.05.05.0...State Bank of IndiaS143HBXXB310train28114Web-browser0.025000.0NMaleID000009J40500000.0500000.02.02.0...HDFC BankS134HBXXB317train3111

5 rows × 24 columns

Loan Amount and Tenure applied字段

In [27]:
#找中位数去填补缺省值(因为缺省的不多)data['Loan_Amount_Applied'].fillna(data['Loan_Amount_Applied'].median(),inplace=True)data['Loan_Tenure_Applied'].fillna(data['Loan_Tenure_Applied'].median(),inplace=True)
In [28]:
data.head()
Out[28]:
 Device_TypeDisbursedExisting_EMIFilled_FormGenderIDLoan_Amount_AppliedLoan_Amount_SubmittedLoan_Tenure_AppliedLoan_Tenure_Submitted...Salary_AccountSourceVar1Var2Var4Var5sourceAgeEMI_Loan_Submitted_MissingInterest_Rate_Missing0Web-browser0.00.0NFemaleID000002C20300000.0NaN5.0NaN...HDFC BankS122HBXXG10train37111Web-browser0.00.0NMaleID000004E40200000.0200000.02.02.0...ICICI BankS122HBXAG313train30002Web-browser0.00.0NMaleID000007H20600000.0450000.04.04.0...State Bank of IndiaS143HBXXB10train34113Web-browser0.00.0NMaleID000008I301000000.0920000.05.05.0...State Bank of IndiaS143HBXXB310train28114Web-browser0.025000.0NMaleID000009J40500000.0500000.02.02.0...HDFC BankS134HBXXB317train3111

5 rows × 24 columns

Loan Amount and Tenure selected

In [29]:
# 缺省值太多。。。是否缺省。。。data['Loan_Amount_Submitted_Missing'] = data['Loan_Amount_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)data['Loan_Tenure_Submitted_Missing'] = data['Loan_Tenure_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)
In [45]:
data.head()
Out[45]:
 Device_TypeDisbursedEmployer_NameExisting_EMIFilled_FormGenderIDLoan_Amount_AppliedLoan_Amount_SubmittedLoan_Tenure_Applied...Var1Var2Var4Var5sourceAgeEMI_Loan_Submitted_MissingInterest_Rate_MissingLoan_Amount_Submitted_MissingLoan_Tenure_Submitted_Missing0Web-browser0.0CYBOSOL0.0NFemaleID000002C20300000.0NaN5.0...HBXXG10train3711111Web-browser0.0TATA CONSULTANCY SERVICES LTD (TCS)0.0NMaleID000004E40200000.0200000.02.0...HBXAG313train3000002Web-browser0.0ALCHEMIST HOSPITALS LTD0.0NMaleID000007H20600000.0450000.04.0...HBXXB10train3411003Web-browser0.0BIHAR GOVERNMENT0.0NMaleID000008I301000000.0920000.05.0...HBXXB310train2811004Web-browser0.0GLOBAL EDGE SOFTWARE25000.0NMaleID000009J40500000.0500000.02.0...HBXXB317train311100

5 rows × 27 columns

In [30]:
#原来的字段就没用了data.drop(['Loan_Amount_Submitted','Loan_Tenure_Submitted'],axis=1,inplace=True)

LoggedIn

In [31]:
#没想好怎么用。。。不要了。。。data.drop('LoggedIn',axis=1,inplace=True)

salary account

In [32]:
# 可能对接多个银行,所以也不要了data.drop('Salary_Account',axis=1,inplace=True)

Processing_Fee

In [33]:
#和之前一样的处理,有或者没有data['Processing_Fee_Missing'] = data['Processing_Fee'].apply(lambda x: 1 if pd.isnull(x) else 0)#旧的字段不要了data.drop('Processing_Fee',axis=1,inplace=True)

Source

In [34]:
data['Source'] = data['Source'].apply(lambda x: 'others' if x not in ['S122','S133'] else x)data['Source'].value_counts()
Out[34]:
S122      55249S133      42900others    26588Name: Source, dtype: int64

最终的数据样式

In [35]:
data.head()
Out[35]:
 Device_TypeDisbursedExisting_EMIFilled_FormGenderIDLoan_Amount_AppliedLoan_Tenure_AppliedMobile_VerifiedMonthly_Income...Var2Var4Var5sourceAgeEMI_Loan_Submitted_MissingInterest_Rate_MissingLoan_Amount_Submitted_MissingLoan_Tenure_Submitted_MissingProcessing_Fee_Missing0Web-browser0.00.0NFemaleID000002C20300000.05.0N20000...G10train37111111Web-browser0.00.0NMaleID000004E40200000.02.0Y35000...G313train30000012Web-browser0.00.0NMaleID000007H20600000.04.0Y22500...B10train34110013Web-browser0.00.0NMaleID000008I301000000.05.0Y35000...B310train28110014Web-browser0.025000.0NMaleID000009J40500000.02.0Y100000...B317train3111001

5 rows × 22 columns

In [36]:
data.describe()
Out[36]:
 DisbursedExisting_EMILoan_Amount_AppliedLoan_Tenure_AppliedMonthly_IncomeVar4Var5AgeEMI_Loan_Submitted_MissingInterest_Rate_MissingLoan_Amount_Submitted_MissingLoan_Tenure_Submitted_MissingProcessing_Fee_Missingcount87020.0000001.247370e+051.247370e+05124737.0000001.247370e+05124737.000000124737.000000124737.000000124737.000000124737.000000124737.000000124737.000000124737.000000mean0.0146293.633107e+032.298744e+052.1380755.309073e+042.9505604.96477430.9069960.6806400.6806400.3971160.3971160.684208std0.1200623.367642e+043.539938e+052.0148741.823394e+061.6952615.6697847.1378600.4662310.4662310.4893020.4893020.464833min0.0000000.000000e+000.000000e+000.0000000.000000e+000.0000000.00000018.0000000.0000000.0000000.0000000.0000000.00000025%NaN0.000000e+000.000000e+000.0000001.650000e+041.0000000.00000026.0000000.0000000.0000000.0000000.0000000.00000050%NaN0.000000e+001.000000e+052.0000002.500000e+043.0000002.00000029.0000001.0000001.0000000.0000000.0000001.00000075%NaN3.500000e+033.000000e+054.0000004.000000e+045.00000011.00000034.0000001.0000001.0000001.0000001.0000001.000000max1.0000001.000000e+071.500000e+0710.0000004.445544e+087.00000018.000000100.0000001.0000001.0000001.0000001.0000001.000000
In [37]:
data.apply(lambda x: sum(x.isnull()))
Out[37]:
Device_Type                          0Disbursed                        37717Existing_EMI                         0Filled_Form                          0Gender                               0ID                                   0Loan_Amount_Applied                  0Loan_Tenure_Applied                  0Mobile_Verified                      0Monthly_Income                       0Source                               0Var1                                 0Var2                                 0Var4                                 0Var5                                 0source                               0Age                                  0EMI_Loan_Submitted_Missing           0Interest_Rate_Missing                0Loan_Amount_Submitted_Missing        0Loan_Tenure_Submitted_Missing        0Processing_Fee_Missing               0dtype: int64
In [38]:
data.dtypes
Out[38]:
Device_Type                       objectDisbursed                        float64Existing_EMI                     float64Filled_Form                       objectGender                            objectID                                objectLoan_Amount_Applied              float64Loan_Tenure_Applied              float64Mobile_Verified                   objectMonthly_Income                     int64Source                            objectVar1                              objectVar2                              objectVar4                               int64Var5                               int64source                            objectAge                                int64EMI_Loan_Submitted_Missing         int64Interest_Rate_Missing              int64Loan_Amount_Submitted_Missing      int64Loan_Tenure_Submitted_Missing      int64Processing_Fee_Missing             int64dtype: object

数值编码

In [39]:
from sklearn.preprocessing import LabelEncoderle = LabelEncoder()var_to_encode = ['Device_Type','Filled_Form','Gender','Var1','Var2','Mobile_Verified','Source']for col in var_to_encode:    data[col] = le.fit_transform(data[col])
In [40]:
data.head()
Out[40]:
 Device_TypeDisbursedExisting_EMIFilled_FormGenderIDLoan_Amount_AppliedLoan_Tenure_AppliedMobile_VerifiedMonthly_Income...Var2Var4Var5sourceAgeEMI_Loan_Submitted_MissingInterest_Rate_MissingLoan_Amount_Submitted_MissingLoan_Tenure_Submitted_MissingProcessing_Fee_Missing010.00.000ID000002C20300000.05.0020000...610train3711111110.00.001ID000004E40200000.02.0135000...6313train3000001210.00.001ID000007H20600000.04.0122500...110train3411001310.00.001ID000008I301000000.05.0135000...1310train2811001410.025000.001ID000009J40500000.02.01100000...1317train3111001

5 rows × 22 columns

In [41]:
data.dtypes
Out[41]:
Device_Type                        int64Disbursed                        float64Existing_EMI                     float64Filled_Form                        int64Gender                             int64ID                                objectLoan_Amount_Applied              float64Loan_Tenure_Applied              float64Mobile_Verified                    int64Monthly_Income                     int64Source                             int64Var1                               int64Var2                               int64Var4                               int64Var5                               int64source                            objectAge                                int64EMI_Loan_Submitted_Missing         int64Interest_Rate_Missing              int64Loan_Amount_Submitted_Missing      int64Loan_Tenure_Submitted_Missing      int64Processing_Fee_Missing             int64dtype: object

类别型的One-Hot 编码

In [42]:
data = pd.get_dummies(data, columns=var_to_encode)data.columns
Out[42]:
Index([u'Disbursed', u'Existing_EMI', u'ID', u'Loan_Amount_Applied',       u'Loan_Tenure_Applied', u'Monthly_Income', u'Var4', u'Var5', u'source',       u'Age', u'EMI_Loan_Submitted_Missing', u'Interest_Rate_Missing',       u'Loan_Amount_Submitted_Missing', u'Loan_Tenure_Submitted_Missing',       u'Processing_Fee_Missing', u'Device_Type_0', u'Device_Type_1',       u'Filled_Form_0', u'Filled_Form_1', u'Gender_0', u'Gender_1', u'Var1_0',       u'Var1_1', u'Var1_2', u'Var1_3', u'Var1_4', u'Var1_5', u'Var1_6',       u'Var1_7', u'Var1_8', u'Var1_9', u'Var1_10', u'Var1_11', u'Var1_12',       u'Var1_13', u'Var1_14', u'Var1_15', u'Var1_16', u'Var1_17', u'Var1_18',       u'Var2_0', u'Var2_1', u'Var2_2', u'Var2_3', u'Var2_4', u'Var2_5',       u'Var2_6', u'Mobile_Verified_0', u'Mobile_Verified_1', u'Source_0',       u'Source_1', u'Source_2'],      dtype='object')

区分训练和测试数据

In [43]:
train = data.loc[data['source']=='train']test = data.loc[data['source']=='test']
In [44]:
train.drop('source',axis=1,inplace=True)test.drop(['source','Disbursed'],axis=1,inplace=True)
/Library/Python/2.7/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrameSee the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy  if __name__ == '__main__':/Library/Python/2.7/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrameSee the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy  from ipykernel import kernelapp as app
In [45]:
train.to_csv('train_modified.csv',index=False)test.to_csv('test_modified.csv',index=False)
In [ ]:
 
0 0
原创粉丝点击