kaggle系列(一、Titanic入门比赛)

来源:互联网 发布:c语言二分法求根精度 编辑:程序博客网 时间:2024/04/29 13:17

Table of Contents

  • 1  背景介绍
  • 2  数据导入与分析
    • 2.1  导入有用的包
    • 2.2  导入数据
    • 2.3  去除离群点
    • 2.4  连接训练数据和测试数据
    • 2.5  查看缺失值
  • 3  特征分析与数据前处理
    • 3.1  数值变量
      • 3.1.1  Explore SibSp feature vs Survived
      • 3.1.2  Explore Parch feature vs Survived
      • 3.1.3  Explore Age distibution
        • 3.1.3.1  Age缺失值填补
      • 3.1.4  Explore Fare distribution
    • 3.2  分类变量
      • 3.2.1  sex
      • 3.2.2  Explore Pclass vs Survived
      • 3.2.3  Explore Pclass vs Survived by Sex
      • 3.2.4  Explore Embarked vs Survived
      • 3.2.5  Explore Pclass vs Embarked
  • 4  特征工程
    • 4.1  name
    • 4.2  SibSp,Parch
    • 4.3  Embarked
    • 4.4  Cabin
    • 4.5  Ticket
    • 4.6  Pclass
    • 4.7  Age
    • 4.8  Fare
    • 4.9  PassengerId
  • 5  基本模型求解
    • 5.1  简单模型
      • 5.1.1  KNN
      • 5.1.2  Logistic regression
      • 5.1.3  Naive Bayes
      • 5.1.4  SVC
    • 5.2  单一集成模型
      • 5.2.1  Random Forest
      • 5.2.2  ExtraTrees
      • 5.2.3  Gradient boosting
      • 5.2.4  xgboost
      • 5.2.5  Plot learning curves
      • 5.2.6  Feature importance of tree based classifiers
    • 5.3  多种模型集成方法
      • 5.3.1  四个集成模型投票法
      • 5.3.2  stacking法
  • 6  总结
  • 7  参考资料

背景介绍

发生在1912年的泰坦尼克事件,导致船上2224名游客阵亡1502(我们的男主角也牺牲了),我们掌握船上乘客的一些数据以及一部分乘客是否获救的信息。我们希望能通过探索这些数据,发现一些不为人知的秘密,顺便预测下另外一部分乘客是否能够获救!

数据导入与分析

导入有用的包

In [1]:
import pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport seaborn as sns%matplotlib inlineimport warningswarnings.filterwarnings('ignore')from collections import Counterfrom sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier,\      ExtraTreesClassifier,VotingClassifierfrom sklearn.linear_model import LogisticRegressionfrom sklearn.neighbors import KNeighborsClassifierfrom sklearn.naive_bayes import GaussianNB from sklearn.svm import SVCfrom sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve

导入数据

In [2]:
train = pd.read_csv("C:/Code/Kaggle/Titanic/train.csv")test = pd.read_csv("C:/Code/Kaggle/Titanic/test.csv")IDtest = test["PassengerId"]

去除离群点

In [3]:
def detect_outliers(df,n,features):    outlier_indices = []    for col in features:        Q1 = np.percentile(df[col],25)        Q3 = np.percentile(df[col],75)        IQR = Q3 - Q1        outlier_step = 1.5 * IQR        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step)].index        outlier_indices.extend(outlier_list_col)    outlier_indices = Counter(outlier_indices)    multiple_outliers = list(k for k, v in outlier_indices.items() if v>n)    return multiple_outliersOutliers_to_drop = detect_outliers(train,2,["Age","SibSp","Parch","Fare"])
In [4]:
train.loc[Outliers_to_drop]
Out[4]:
 PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked272801Fortune, Mr. Charles Alexandermale19.03219950263.00C23 C25 C27S888911Fortune, Miss. Mabel Helenfemale23.03219950263.00C23 C25 C27S15916003Sage, Master. Thomas HenrymaleNaN82CA. 234369.55NaNS18018103Sage, Miss. Constance GladysfemaleNaN82CA. 234369.55NaNS20120203Sage, Mr. FrederickmaleNaN82CA. 234369.55NaNS79279303Sage, Miss. Stella AnnafemaleNaN82CA. 234369.55NaNS32432503Sage, Mr. George John JrmaleNaN82CA. 234369.55NaNS84684703Sage, Mr. Douglas BullenmaleNaN82CA. 234369.55NaNS34134211Fortune, Miss. Alice Elizabethfemale24.03219950263.00C23 C25 C27S86386403Sage, Miss. Dorothy Edith "Dolly"femaleNaN82CA. 234369.55NaNS
In [5]:
train = train.drop(Outliers_to_drop,axis=0).reset_index(drop=True)

连接训练数据和测试数据

In [6]:
train_len = len(train)dataset = pd.concat([train,test], axis=0).reset_index(drop=True)dataset.tail()
Out[6]:
 AgeCabinEmbarkedFareNameParchPassengerIdPclassSexSibSpSurvivedTicket1294NaNNaNS8.0500Spector, Mr. Woolf013053male0NaNA.5. 3236129539.0C105C108.9000Oliva y Ocana, Dona. Fermina013061female0NaNPC 17758129638.5NaNS7.2500Saether, Mr. Simon Sivertsen013073male0NaNSOTON/O.Q. 31012621297NaNNaNS8.0500Ware, Mr. Frederick013083male0NaN3593091298NaNNaNC22.3583Peter, Master. Michael J113093male1NaN2668

查看缺失值

In [7]:
#dataset = dataset.fillna(np.nan)dataset.isnull().sum()
Out[7]:
Age             256Cabin          1007Embarked          2Fare              1Name              0Parch             0PassengerId       0Pclass            0Sex               0SibSp             0Survived        418Ticket            0dtype: int64
In [8]:
train.info()train.isnull().sum()
<class 'pandas.core.frame.DataFrame'>RangeIndex: 881 entries, 0 to 880Data columns (total 12 columns):PassengerId    881 non-null int64Survived       881 non-null int64Pclass         881 non-null int64Name           881 non-null objectSex            881 non-null objectAge            711 non-null float64SibSp          881 non-null int64Parch          881 non-null int64Ticket         881 non-null objectFare           881 non-null float64Cabin          201 non-null objectEmbarked       879 non-null objectdtypes: float64(2), int64(5), object(5)memory usage: 82.7+ KB
Out[8]:
PassengerId      0Survived         0Pclass           0Name             0Sex              0Age            170SibSp            0Parch            0Ticket           0Fare             0Cabin          680Embarked         2dtype: int64
In [9]:
train.describe()
Out[9]:
 PassengerIdSurvivedPclassAgeSibSpParchFarecount881.000000881.000000881.000000711.000000881.000000881.000000881.000000mean446.7139610.3859252.30760529.7316030.4551650.36322431.121566std256.6170210.4870900.83505514.5478350.8715710.79183947.996249min1.0000000.0000001.0000000.4200000.0000000.0000000.00000025%226.0000000.0000002.00000020.2500000.0000000.0000007.89580050%448.0000000.0000003.00000028.0000000.0000000.00000014.45420075%668.0000001.0000003.00000038.0000001.0000000.00000030.500000max891.0000001.0000003.00000080.0000005.0000006.000000512.329200

特征分析与数据前处理

数值变量

In [10]:
g = sns.heatmap(train[["Survived","SibSp","Parch","Age","Fare"]].corr(), annot=True, fmt=".2f", cmap = "coolwarm")

Explore SibSp feature vs Survived

In [11]:
g = sns.factorplot(x="SibSp",y="Survived",data=train,kind="bar")g = g.set_ylabels("survival probability")

Explore Parch feature vs Survived

In [12]:
g  = sns.factorplot(x="Parch",y="Survived",data=train,kind="bar")g = g.set_ylabels("survival probability")

Explore Age distibution

In [13]:
g = sns.kdeplot(train["Age"][(train["Survived"] == 0) & (train["Age"].notnull())], color="Red", shade = True)g = sns.kdeplot(train["Age"][(train["Survived"] == 1) & (train["Age"].notnull())], ax =g, color="Blue", shade= True)g.set_xlabel("Age")g.set_ylabel("Frequency")g = g.legend(["Not Survived","Survived"])

Age缺失值填补

我们发现数据集中总共有256个缺失值,并且年龄对存活率有着不小的影响(小孩更容易存活),所以我们需要保留这个特征,并且将缺失值进行填充

(As we see, Age column contains 256 missing values in the whole dataset. Since there is subpopulations that have more chance to survive (children for example), it is preferable to keep the age feature and to impute the missing values. To adress this problem, i looked at the most correlated features with Age (Sex, Parch , Pclass and SibSP).)

填充缺失值的三种方法

Completing a numerical continuous feature

Now we should start estimating and completing features with missing or null values. We will first do this for the Age feature.

We can consider three methods to complete a numerical continuous feature.

A simple way is to generate random numbers between mean and standard deviation.

More accurate way of guessing missing values is to use other correlated features. In our case we note correlation among Age, Gender, and Pclass. Guess Age values using median values for Age across sets of Pclass and Gender feature combinations. So, median Age for Pclass=1 and Gender=0, Pclass=1 and Gender=1, and so on...

Combine methods 1 and 2. So instead of guessing age values based on median, use random numbers between mean and standard deviation, based on sets of Pclass and Gender combinations.

Method 1 and 3 will introduce random noise into our models. The results from multiple executions might vary. We will prefer method 2.

In [14]:
# Explore Age vs Sex, Parch , Pclass and SibSPg = sns.factorplot(y="Age",x="Sex",data=dataset,kind="box")g = sns.factorplot(y="Age",x="Sex",hue="Pclass", data=dataset,kind="box")g = sns.factorplot(y="Age",x="Parch", data=dataset,kind="box")g = sns.factorplot(y="Age",x="SibSp", data=dataset,kind="box")
In [15]:
# convert Sex into categorical value 0 for male and 1 for femaledataset["Sex"] = dataset["Sex"].map({"male": 0, "female":1})g = sns.heatmap(dataset[["Age","Sex","SibSp","Parch","Pclass"]].corr(),cmap="coolwarm",annot=True)
In [16]:
# Filling missing value of Age ## Fill Age with the median age of similar rows according to Pclass, Parch and SibSp# Index of NaN age rowsindex_NaN_age = list(dataset["Age"][dataset["Age"].isnull()].index)for i in index_NaN_age :    age_med = dataset["Age"].median()    age_pred = dataset["Age"][((dataset['SibSp'] == dataset.iloc[i]["SibSp"]) & (dataset['Parch'] == dataset.iloc[i]["Parch"]) & (dataset['Pclass'] == dataset.iloc[i]["Pclass"]))].median()    if not np.isnan(age_pred) :        dataset['Age'].iloc[i] = age_pred    else :        dataset['Age'].iloc[i] = age_meddataset.tail()
Out[16]:
 AgeCabinEmbarkedFareNameParchPassengerIdPclassSexSibSpSurvivedTicket129425.0NaNS8.0500Spector, Mr. Woolf01305300NaNA.5. 3236129539.0C105C108.9000Oliva y Ocana, Dona. Fermina01306110NaNPC 17758129638.5NaNS7.2500Saether, Mr. Simon Sivertsen01307300NaNSOTON/O.Q. 3101262129725.0NaNS8.0500Ware, Mr. Frederick01308300NaN359309129816.0NaNC22.3583Peter, Master. Michael J11309301NaN2668
In [17]:
g = sns.factorplot(x="Survived", y = "Age",data = train, kind="box")g = sns.factorplot(x="Survived", y = "Age",data = train, kind="violin")

Explore Fare distribution

In [18]:
#Fill Fare missing values with the median valuedataset["Fare"] = dataset["Fare"].fillna(dataset["Fare"].median())g = sns.distplot(dataset["Fare"], color="m")g = g.legend(loc="best")
In [19]:
# Apply log to Fare to reduce skewness distributiondataset["Fare"] = dataset["Fare"].map(lambda i: np.log(i) if i > 0 else 0)g = sns.distplot(dataset["Fare"], color="b")g = g.legend(loc="best")

分类变量

sex

In [20]:
g = sns.factorplot(x="Sex",y="Survived",data=train,kind="bar")g = g.set_ylabels("Survival Probability")
In [21]:
train[["Sex","Survived"]].groupby('Sex').mean()
Out[21]:
 SurvivedSex female0.747573male0.190559

Explore Pclass vs Survived

In [22]:
g = sns.factorplot(x="Pclass",y="Survived",data=train,kind="bar", size = 6 , palette = "muted")g = g.set_ylabels("survival probability")

Explore Pclass vs Survived by Sex

In [23]:
g = sns.factorplot(x="Pclass", y="Survived", hue="Sex", data=train,                  size=6, kind="bar", palette="muted")g = g.set_ylabels("survival probability")

Explore Embarked vs Survived

In [24]:
#Fill Embarked nan values of dataset set with 'S' most frequent valuedataset["Embarked"] = dataset["Embarked"].fillna("S")g = sns.factorplot(x="Embarked", y="Survived",  data=train,                   size=6, kind="bar", palette="muted")g = g.set_ylabels("survival probability")

Explore Pclass vs Embarked

In [25]:
# Explore Pclass vs Embarked g = sns.factorplot("Pclass", col="Embarked",  data=train,                   size=6, kind="count", palette="muted")g = g.set_ylabels("Count")

特征工程

name

In [26]:
dataset["Name"].head()
Out[26]:
0                              Braund, Mr. Owen Harris1    Cumings, Mrs. John Bradley (Florence Briggs Th...2                               Heikkinen, Miss. Laina3         Futrelle, Mrs. Jacques Heath (Lily May Peel)4                             Allen, Mr. William HenryName: Name, dtype: object
In [27]:
# Get Title from Namedataset_title = [i.split(",")[1].split(".")[0].strip() for i in dataset["Name"]]dataset["Title"] = pd.Series(dataset_title)dataset["Title"].head()
Out[27]:
0      Mr1     Mrs2    Miss3     Mrs4      MrName: Title, dtype: object
In [28]:
g = sns.countplot(x="Title",data=dataset)g = plt.setp(g.get_xticklabels(), rotation=45)
In [29]:
# Convert to categorical values Title dataset["Title"] = dataset["Title"].replace(['Lady', 'the Countess','Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')dataset["Title"] = dataset["Title"].map({"Master":0, "Miss":1, "Ms" : 1 , "Mme":1, "Mlle":1, "Mrs":1, "Mr":2, "Rare":3})dataset["Title"] = dataset["Title"].astype(int)
In [30]:
g = sns.countplot(dataset["Title"])g = g.set_xticklabels(["Master","Miss/Ms/Mme/Mlle/Mrs","Mr","Rare"])
In [31]:
g = sns.factorplot(x="Title",y="Survived",data=dataset[:train_len],kind="bar")g = g.set_xticklabels(["Master","Miss-Mrs","Mr","Rare"])g = g.set_ylabels("survival probability")
In [32]:
# Drop Name variabledataset.drop(labels = ["Name"], axis = 1, inplace = True)   #inplace为True时返回None,为默认False时返回dataset
In [33]:
# convert to indicator values Title and Embarked Title_dummies = pd.get_dummies(dataset['Title'],prefix='Title')dataset = dataset.join(Title_dummies).drop(['Title'],axis=1)#dataset = pd.get_dummies(dataset, columns = ["Title"])dataset.drop(['Title_3'],axis=1,inplace=True) #这里去掉存活率最低的一列(冗余特征)

SibSp,Parch

In [34]:
# Create a family size descriptor from SibSp and Parchdataset["Fsize"] = dataset["SibSp"] + dataset["Parch"] + 1g = sns.factorplot(x="Fsize",y="Survived",data = dataset)g = g.set_ylabels("Survival Probability")
In [35]:
# Create new feature of family sizedataset['Single'] = dataset['Fsize'].map(lambda s: 1 if s == 1 else 0)dataset['SmallF'] = dataset['Fsize'].map(lambda s: 1 if  s == 2  else 0)dataset['MedF'] = dataset['Fsize'].map(lambda s: 1 if 3 <= s <= 4 else 0)dataset['LargeF'] = dataset['Fsize'].map(lambda s: 1 if s >= 5 else 0)
In [36]:
dataset.drop(['Fsize','SibSp','Parch'],axis=1,inplace=True)
In [37]:
dataset.columns
Out[37]:
Index([u'Age', u'Cabin', u'Embarked', u'Fare', u'PassengerId', u'Pclass',       u'Sex', u'Survived', u'Ticket', u'Title_0', u'Title_1', u'Title_2',       u'Single', u'SmallF', u'MedF', u'LargeF'],      dtype='object')

Embarked

In [38]:
dataset[:train_len][['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Embarked', ascending=True)
Out[38]:
 EmbarkedSurvived0C0.5535711Q0.3896102S0.341195
In [39]:
#dataset = pd.get_dummies(dataset, columns = ["Embarked"], prefix="Em")Embarked_dummies = pd.get_dummies(dataset['Embarked'], prefix='Em')dataset = dataset.join(Embarked_dummies).drop(['Embarked'],axis=1)dataset.drop(['Em_S'],axis=1,inplace=True)
In [40]:
dataset.columns
Out[40]:
Index([u'Age', u'Cabin', u'Fare', u'PassengerId', u'Pclass', u'Sex',       u'Survived', u'Ticket', u'Title_0', u'Title_1', u'Title_2', u'Single',       u'SmallF', u'MedF', u'LargeF', u'Em_C', u'Em_Q'],      dtype='object')

Cabin

In [41]:
dataset["Cabin"].head()
Out[41]:
0     NaN1     C852     NaN3    C1234     NaNName: Cabin, dtype: object
In [42]:
dataset["Cabin"].describe()
Out[42]:
count                 292unique                186top       B57 B59 B63 B66freq                    5Name: Cabin, dtype: object
In [43]:
dataset["Cabin"].isnull().sum()
Out[43]:
1007
In [44]:
dataset["Cabin"][dataset["Cabin"].notnull()].head()
Out[44]:
1      C853     C1236      E4610      G611    C103Name: Cabin, dtype: object
In [45]:
# Replace the Cabin number by the type of cabin 'X' if notdataset["Cabin"] = pd.Series([i[0] if not pd.isnull(i) else 'X' for i in dataset['Cabin'] ])
In [46]:
g = sns.countplot(dataset["Cabin"],order=['A','B','C','D','E','F','G','T','X'])
In [47]:
g = sns.factorplot(y="Survived",x="Cabin",data=dataset[:train_len],kind="bar",order=['A','B','C','D','E','F','G','T','X'])g = g.set_ylabels("Survival Probability")
In [48]:
dataset = pd.get_dummies(dataset, columns = ["Cabin"],prefix="Cabin")dataset.drop(['Cabin_T'],axis=1,inplace=True)

Ticket

In [49]:
dataset["Ticket"].head()
Out[49]:
0           A/5 211711            PC 175992    STON/O2. 31012823              1138034              373450Name: Ticket, dtype: object
In [50]:
## Treat Ticket by extracting the ticket prefix. When there is no prefix it returns X. Ticket = []for i in list(dataset.Ticket):    if not i.isdigit() :        Ticket.append(i.replace(".","").replace("/","").strip().split(' ')[0]) #Take prefix    else:        Ticket.append("X")        dataset["Ticket"] = Ticketdataset["Ticket"].head()
Out[50]:
0        A51        PC2    STONO23         X4         XName: Ticket, dtype: object
In [51]:
dataset[:train_len][['Ticket', 'Survived']].groupby(['Ticket'], as_index=False).mean().sort_values(by='Ticket', ascending=True)
Out[51]:
 TicketSurvived0A40.0000001A50.0952382AS0.0000003C0.4000004CA0.4117655CASOTON0.0000006FC0.0000007FCC0.8000008Fa0.0000009LINE0.25000010PC0.65000011PP0.66666712PPP0.50000013SC1.00000014SCA40.00000015SCAH0.66666716SCOW0.00000017SCPARIS0.42857118SCParis0.50000019SOC0.16666720SOP0.00000021SOPP0.00000022SOTONO20.00000023SOTONOQ0.13333324SP0.00000025STONO0.41666726STONO20.50000027SWPP1.00000028WC0.10000029WEP0.33333330X0.382979
In [52]:
dataset = pd.get_dummies(dataset, columns = ["Ticket"], prefix="T")dataset.drop(['T_A4'],axis=1,inplace=True)

Pclass

In [53]:
dataset[:train_len][['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Pclass', ascending=True)
Out[53]:
 PclassSurvived010.629108120.472826230.245868
In [54]:
# Create categorical values for Pclassdataset["Pclass"] = dataset["Pclass"].astype("category")dataset = pd.get_dummies(dataset, columns = ["Pclass"],prefix="Pc")dataset.drop(['Pc_3'],axis=1,inplace=True)

Age

In [55]:
dataset['Age']=dataset['Age'].astype(int)dataset['AgeBand'] = pd.cut(dataset['Age'], 5)dataset[:train_len][['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)
Out[55]:
 AgeBandSurvived0(-0.08, 16.0]0.5321101(16.0, 32.0]0.3403362(32.0, 48.0]0.4120373(48.0, 64.0]0.4347834(64.0, 80.0]0.090909
In [56]:
dataset.tail()
Out[56]:
 AgeFarePassengerIdSexSurvivedTitle_0Title_1Title_2SingleSmallF...T_STONOT_STONO2T_STONOQT_SWPPT_WCT_WEPT_XPc_1Pc_2AgeBand1294252.08567213050NaN00110...000000000(16.0, 32.0]1295394.69043013061NaN00010...000000010(32.0, 48.0]1296381.98100113070NaN00110...000000000(32.0, 48.0]1297252.08567213080NaN00110...000000100(16.0, 32.0]1298163.10719813090NaN10000...000000100(-0.08, 16.0]

5 rows × 61 columns

In [57]:
dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3dataset.loc[ dataset['Age'] > 64, 'Age'] = 4 Age_dummies = pd.get_dummies(dataset['Age'], prefix='Age')dataset=dataset.join(Age_dummies).drop(['Age','AgeBand'],axis=1)dataset.drop(['Age_4'],axis=1,inplace=True)

Fare

In [58]:
dataset['FareBand'] = pd.cut(dataset['Fare'], 4)dataset[:train_len][['FareBand','Survived']].groupby(['FareBand'],as_index=False).mean().sort_values(by='FareBand', ascending=True)
Out[58]:
 FareBandSurvived0(-0.00624, 1.56]0.0625001(1.56, 3.119]0.2887192(3.119, 4.679]0.5170073(4.679, 6.239]0.750000
In [59]:
dataset.columns
Out[59]:
Index([u'Fare', u'PassengerId', u'Sex', u'Survived', u'Title_0', u'Title_1',       u'Title_2', u'Single', u'SmallF', u'MedF', u'LargeF', u'Em_C', u'Em_Q',       u'Cabin_A', u'Cabin_B', u'Cabin_C', u'Cabin_D', u'Cabin_E', u'Cabin_F',       u'Cabin_G', u'Cabin_X', u'T_A', u'T_A5', u'T_AQ3', u'T_AQ4', u'T_AS',       u'T_C', u'T_CA', u'T_CASOTON', u'T_FC', u'T_FCC', u'T_Fa', u'T_LINE',       u'T_LP', u'T_PC', u'T_PP', u'T_PPP', u'T_SC', u'T_SCA3', u'T_SCA4',       u'T_SCAH', u'T_SCOW', u'T_SCPARIS', u'T_SCParis', u'T_SOC', u'T_SOP',       u'T_SOPP', u'T_SOTONO2', u'T_SOTONOQ', u'T_SP', u'T_STONO', u'T_STONO2',       u'T_STONOQ', u'T_SWPP', u'T_WC', u'T_WEP', u'T_X', u'Pc_1', u'Pc_2',       u'Age_0', u'Age_1', u'Age_2', u'Age_3', u'FareBand'],      dtype='object')
In [60]:
dataset.loc[ dataset['Fare'] <= 1.56, 'Fare'] = 0dataset.loc[(dataset['Fare'] > 1.56) & (dataset['Fare'] <= 3.119), 'Fare'] = 1dataset.loc[(dataset['Fare'] > 3.119) & (dataset['Fare'] <= 4.679), 'Fare']   = 2dataset.loc[ dataset['Fare'] > 4.679, 'Fare'] = 3Fare_dummies = pd.get_dummies(dataset['Fare'], prefix='Fare')dataset = dataset.join(Fare_dummies).drop(['Fare','FareBand'],axis=1)
In [61]:
dataset.columns
Out[61]:
Index([u'PassengerId', u'Sex', u'Survived', u'Title_0', u'Title_1', u'Title_2',       u'Single', u'SmallF', u'MedF', u'LargeF', u'Em_C', u'Em_Q', u'Cabin_A',       u'Cabin_B', u'Cabin_C', u'Cabin_D', u'Cabin_E', u'Cabin_F', u'Cabin_G',       u'Cabin_X', u'T_A', u'T_A5', u'T_AQ3', u'T_AQ4', u'T_AS', u'T_C',       u'T_CA', u'T_CASOTON', u'T_FC', u'T_FCC', u'T_Fa', u'T_LINE', u'T_LP',       u'T_PC', u'T_PP', u'T_PPP', u'T_SC', u'T_SCA3', u'T_SCA4', u'T_SCAH',       u'T_SCOW', u'T_SCPARIS', u'T_SCParis', u'T_SOC', u'T_SOP', u'T_SOPP',       u'T_SOTONO2', u'T_SOTONOQ', u'T_SP', u'T_STONO', u'T_STONO2',       u'T_STONOQ', u'T_SWPP', u'T_WC', u'T_WEP', u'T_X', u'Pc_1', u'Pc_2',       u'Age_0', u'Age_1', u'Age_2', u'Age_3', u'Fare_0.0', u'Fare_1.0',       u'Fare_2.0', u'Fare_3.0'],      dtype='object')
In [62]:
dataset.drop(['Fare_0.0'],axis=1,inplace=True)

PassengerId

In [63]:
# Drop useless variables dataset.drop(labels = ["PassengerId"], axis = 1, inplace = True)
In [64]:
dataset.head()
Out[64]:
 SexSurvivedTitle_0Title_1Title_2SingleSmallFMedFLargeFEm_C...T_XPc_1Pc_2Age_0Age_1Age_2Age_3Fare_1.0Fare_2.0Fare_3.0000.000101000...0000100100111.001001001...0100010010211.001010000...0000100100311.001001000...1100010010400.000110000...1000010100

5 rows × 64 columns

In [65]:
dataset.columns
Out[65]:
Index([u'Sex', u'Survived', u'Title_0', u'Title_1', u'Title_2', u'Single',       u'SmallF', u'MedF', u'LargeF', u'Em_C', u'Em_Q', u'Cabin_A', u'Cabin_B',       u'Cabin_C', u'Cabin_D', u'Cabin_E', u'Cabin_F', u'Cabin_G', u'Cabin_X',       u'T_A', u'T_A5', u'T_AQ3', u'T_AQ4', u'T_AS', u'T_C', u'T_CA',       u'T_CASOTON', u'T_FC', u'T_FCC', u'T_Fa', u'T_LINE', u'T_LP', u'T_PC',       u'T_PP', u'T_PPP', u'T_SC', u'T_SCA3', u'T_SCA4', u'T_SCAH', u'T_SCOW',       u'T_SCPARIS', u'T_SCParis', u'T_SOC', u'T_SOP', u'T_SOPP', u'T_SOTONO2',       u'T_SOTONOQ', u'T_SP', u'T_STONO', u'T_STONO2', u'T_STONOQ', u'T_SWPP',       u'T_WC', u'T_WEP', u'T_X', u'Pc_1', u'Pc_2', u'Age_0', u'Age_1',       u'Age_2', u'Age_3', u'Fare_1.0', u'Fare_2.0', u'Fare_3.0'],      dtype='object')

基本模型求解

In [66]:
## Separate train dataset and test datasettrain = dataset[:train_len]test = dataset[train_len:]test.drop(labels=["Survived"],axis = 1,inplace=True)
In [67]:
## Separate train features and label train["Survived"] = train["Survived"].astype(int)Y_train = train["Survived"]X_train = train.drop(labels = ["Survived"],axis = 1)

简单模型

1.KNN

2.Logistic regression

3.Naive_bayes

4.SVC

In [68]:
# Cross validate model with Kfold stratified cross valkfold = StratifiedKFold(n_splits=10)

KNN

In [69]:
k_range=list([16,18])knn_param_grid={'n_neighbors' : k_range}gridKNN = GridSearchCV(KNeighborsClassifier(),param_grid = knn_param_grid, cv=kfold, scoring="accuracy", n_jobs= -1, verbose = 1)gridKNN.fit(X_train,Y_train)print(gridKNN.best_estimator_)print(gridKNN.best_score_)
Fitting 10 folds for each of 2 candidates, totalling 20 fitsKNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',           metric_params=None, n_jobs=1, n_neighbors=16, p=2,           weights='uniform')0.821793416572
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    4.0s finished

Logistic regression

In [70]:
LR_param_grid={'penalty' : ['l1', 'l2'], 'C' : [0.001,0.01,0.1,1,10,100]}gridLR = GridSearchCV(LogisticRegression(),param_grid = LR_param_grid, cv=kfold, scoring="accuracy", n_jobs= -1, verbose = 1)gridLR.fit(X_train,Y_train)print(gridLR.best_estimator_)print(gridLR.best_score_)
Fitting 10 folds for each of 12 candidates, totalling 120 fits
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    4.6s
LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,          verbose=0, warm_start=False)0.822928490352
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:    7.6s finished

Naive Bayes

In [71]:
from sklearn.naive_bayes import GaussianNB GaussianNB=GaussianNB()GaussianNB.fit(X_train, Y_train)NB_score=cross_val_score(GaussianNB,X_train,Y_train, cv = kfold,scoring = "accuracy").mean()print(NB_score)#不知怎么回事,计算有问题
0.431307456588

SVC

In [72]:
C=[0.05,0.1,0.2,0.3,0.25,0.4,0.5,0.6,0.7,0.8,0.9,1]gamma=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]kernel=['rbf','linear']SVC_param_grid={'kernel':kernel,'C':C,'gamma':gamma}gridSVC = GridSearchCV(SVC(),param_grid = SVC_param_grid, cv=kfold, scoring="accuracy", n_jobs= -1, verbose = 1)gridSVC.fit(X_train,Y_train)print(gridSVC.best_estimator_)print(gridSVC.best_score_)
Fitting 10 folds for each of 240 candidates, totalling 2400 fits
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    4.3s[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    6.5s[Parallel(n_jobs=-1)]: Done 615 tasks      | elapsed:   12.2s[Parallel(n_jobs=-1)]: Done 1315 tasks      | elapsed:   20.5s[Parallel(n_jobs=-1)]: Done 2215 tasks      | elapsed:   31.1s[Parallel(n_jobs=-1)]: Done 2385 out of 2400 | elapsed:   33.1s remaining:    0.1s[Parallel(n_jobs=-1)]: Done 2400 out of 2400 | elapsed:   33.2s finished
SVC(C=0.6, cache_size=200, class_weight=None, coef0=0.0,  decision_function_shape='ovr', degree=3, gamma=0.3, kernel='rbf',  max_iter=-1, probability=False, random_state=None, shrinking=True,  tol=0.001, verbose=False)0.83314415437

经过简单的调参处理,支持向量机的交叉验证预测效果最好!最终调参后的本地正确率为0.83314415437!

In [73]:
test_Survived = pd.Series(gridSVC.best_estimator_.predict(test), name="Survived")results_SVC = pd.concat([IDtest,test_Survived],axis=1)results_SVC.to_csv("SVC_predict.csv",index=False)

单一集成模型

1 Random Forest

2 Extra Trees

3 Gradient Boosting

4 xgboost

Random Forest

In [74]:
# RFC Parameters tunning RFC = RandomForestClassifier()## Search grid for optimal parametersrf_param_grid = {"n_estimators" :[300, 500],                 "max_depth": [8, 15],              "min_samples_split": [2, 5, 10],              "min_samples_leaf": [1, 2, 5],              "max_features": ['log2', 'sqrt']}gsRFC = GridSearchCV(RFC,param_grid = rf_param_grid, cv=5, scoring="accuracy", n_jobs= -1, verbose = 1)gsRFC.fit(X_train,Y_train)RFC_best = gsRFC.best_estimator_# Best scoregsRFC.best_score_ ,RFC_best
Fitting 5 folds for each of 72 candidates, totalling 360 fits
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   14.3s[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   59.0s[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:  2.1min finished
Out[74]:
(0.83200908059023837, RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',             max_depth=8, max_features='log2', max_leaf_nodes=None,             min_impurity_decrease=0.0, min_impurity_split=None,             min_samples_leaf=1, min_samples_split=2,             min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,             oob_score=False, random_state=None, verbose=0,             warm_start=False))
In [93]:
test_Survived = pd.Series(RFC_best.predict(test), name="Survived")results_RFC_best = pd.concat([IDtest,test_Survived],axis=1)results_RFC_best.to_csv("RFC_best.csv",index=False)

ExtraTrees

In [75]:
#ExtraTrees ExtC = ExtraTreesClassifier()## Search grid for optimal parametersex_param_grid = {"max_depth": [8, 15],              "max_features": ['log2', 'sqrt'],              "min_samples_split": [2,5, 10],              "min_samples_leaf": [1, 2, 5],              "n_estimators" :[300, 500]}gsExtC = GridSearchCV(ExtC,param_grid = ex_param_grid, cv=5, scoring="accuracy", n_jobs= -1, verbose = 1)gsExtC.fit(X_train,Y_train)ExtC_best = gsExtC.best_estimator_# Best scoregsExtC.best_score_, ExtC_best
Fitting 5 folds for each of 72 candidates, totalling 360 fits
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   13.7s[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   54.9s[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:  1.7min finished
Out[75]:
(0.82973893303064694, ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',            max_depth=8, max_features='log2', max_leaf_nodes=None,            min_impurity_decrease=0.0, min_impurity_split=None,            min_samples_leaf=1, min_samples_split=2,            min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=1,            oob_score=False, random_state=None, verbose=0, warm_start=False))

Gradient boosting

In [76]:
# Gradient boosting tunningGBC = GradientBoostingClassifier()gb_param_grid = {              'learning_rate': [0.1, 0.05, 0.01],              'max_depth': [3, 5, 10],              'min_samples_leaf': [50,100,150],             'max_features' :['sqrt','log2']              }gsGBC = GridSearchCV(GBC,param_grid = gb_param_grid,cv=5, scoring="accuracy", n_jobs= 4, verbose = 1)gsGBC.fit(X_train,Y_train)GBC_best = gsGBC.best_estimator_# Best scoregsGBC.best_score_,GBC_best
Fitting 5 folds for each of 54 candidates, totalling 270 fits
[Parallel(n_jobs=4)]: Done  52 tasks      | elapsed:    3.1s[Parallel(n_jobs=4)]: Done 270 out of 270 | elapsed:    7.2s finished
Out[76]:
(0.8161180476730987, GradientBoostingClassifier(criterion='friedman_mse', init=None,               learning_rate=0.1, loss='deviance', max_depth=10,               max_features='sqrt', max_leaf_nodes=None,               min_impurity_decrease=0.0, min_impurity_split=None,               min_samples_leaf=50, min_samples_split=2,               min_weight_fraction_leaf=0.0, n_estimators=100,               presort='auto', random_state=None, subsample=1.0, verbose=0,               warm_start=False))

xgboost

In [77]:
import xgboost as xgbfrom xgboost.sklearn import XGBClassifier## Search grid for optimal parametersxgb_param_grid = {"learning_rate": [0.01,0.5,1.0],                  "n_estimators" : [300,500],              "gamma": [0.1, 0.5,1.0],              "max_depth": [3, 5, 10],              "min_child_weight": [1, 3],              "subsample" : [0.8,1.0],                 "colsample_bytree" : [0.8,1.0]}gridxgb = GridSearchCV(XGBClassifier(),param_grid = xgb_param_grid, cv=5, scoring="accuracy", n_jobs= -1, verbose = 1)gridxgb.fit(X_train,Y_train)gridxgb_best = gridxgb.best_estimator_# Best scoregridxgb.best_score_
Fitting 5 folds for each of 432 candidates, totalling 2160 fits
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    6.8s[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   35.8s[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  1.4min[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  2.2min[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:  3.6min[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed:  5.3min[Parallel(n_jobs=-1)]: Done 2160 out of 2160 | elapsed:  6.7min finished
Out[77]:
0.82973893303064694
In [78]:
print(gridxgb_best)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,       colsample_bytree=0.8, gamma=1.0, learning_rate=0.01,       max_delta_step=0, max_depth=3, min_child_weight=3, missing=None,       n_estimators=300, n_jobs=1, nthread=None,       objective='binary:logistic', random_state=0, reg_alpha=0,       reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,       subsample=1.0)
In [92]:
test_Survived = pd.Series(gridxgb_best.predict(test), name="Survived")results_gridxgb = pd.concat([IDtest,test_Survived],axis=1)results_gridxgb.to_csv("gridxgb.csv",index=False)

Plot learning curves

In [79]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):    """Generate a simple plot of the test and training learning curve"""    plt.figure()    plt.title(title)    if ylim is not None:        plt.ylim(*ylim)    plt.xlabel("Training examples")    plt.ylabel("Score")    train_sizes, train_scores, test_scores = learning_curve(        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)    train_scores_mean = np.mean(train_scores, axis=1)    train_scores_std = np.std(train_scores, axis=1)    test_scores_mean = np.mean(test_scores, axis=1)    test_scores_std = np.std(test_scores, axis=1)    plt.grid()    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,                     train_scores_mean + train_scores_std, alpha=0.1,                     color="r")    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,                     test_scores_mean + test_scores_std, alpha=0.1, color="g")    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",             label="Training score")    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",             label="Cross-validation score")    plt.legend(loc="best")    return pltg = plot_learning_curve(gsRFC.best_estimator_,"RF mearning curves",X_train,Y_train,cv=kfold)g = plot_learning_curve(gsExtC.best_estimator_,"ExtraTrees learning curves",X_train,Y_train,cv=kfold)g = plot_learning_curve(gsGBC.best_estimator_,"GradientBoosting learning curves",X_train,Y_train,cv=kfold)g = plot_learning_curve(gridxgb.best_estimator_,"XGBoost learning curves",X_train,Y_train,cv=kfold)

Feature importance of tree based classifiers

In [80]:
nrows = ncols = 2fig, axes = plt.subplots(nrows = nrows, ncols = ncols, sharex="all", figsize=(15,15))names_classifiers = [("ExtraTrees",ExtC_best),("RandomForest",RFC_best),("GradientBoosting",GBC_best),("XGBoost", gridxgb_best)]nclassifier = 0for row in range(nrows):    for col in range(ncols):        name = names_classifiers[nclassifier][0]        classifier = names_classifiers[nclassifier][1]        indices = np.argsort(classifier.feature_importances_)[::-1][:40]        g = sns.barplot(y=X_train.columns[indices][:40],x = classifier.feature_importances_[indices][:40] , orient='h',ax=axes[row][col])        g.set_xlabel("Relative importance",fontsize=12)        g.set_ylabel("Features",fontsize=12)        g.tick_params(labelsize=9)        g.set_title(name + " feature importance")        nclassifier += 1
In [81]:
test_Survived_RFC = pd.Series(RFC_best.predict(test), name="RFC")test_Survived_ExtC = pd.Series(ExtC_best.predict(test), name="ExtC")test_Survived_GBC = pd.Series(GBC_best.predict(test), name="GBC")test_Survived_xgb = pd.Series(gridxgb_best.predict(test), name="xgb")# Concatenate all classifier resultsensemble_results = pd.concat([test_Survived_RFC,test_Survived_ExtC,test_Survived_GBC, test_Survived_xgb],axis=1)g= sns.heatmap(ensemble_results.corr(),annot=True)

多种模型集成方法

四个集成模型投票法

In [82]:
votingC = VotingClassifier(estimators=[('rfc', RFC_best), ('extc', ExtC_best), ('gbc',GBC_best), ('xgb', gridxgb_best)], voting='soft', n_jobs=-1)votingC = votingC.fit(X_train, Y_train)
In [83]:
test_Survived = pd.Series(votingC.predict(test), name="Survived")results_votingC = pd.concat([IDtest,test_Survived],axis=1)results_votingC.to_csv("ensemble_python_voting.csv",index=False)

stacking法

In [90]:
#第一层class Ensemble_stacking1(object):    def __init__(self, n_folds, base_models):        self.n_folds = n_folds        self.base_models = base_models    def get_data_to2(self, X, y, T):        X = np.array(X)        y = np.array(y)        T = np.array(T)        folds = StratifiedKFold(n_splits=self.n_folds, shuffle=True, random_state=2016).split(X,y)        S_train = np.zeros((X.shape[0], len(self.base_models)))        S_test = np.zeros((T.shape[0], len(self.base_models)))        for i, clf in enumerate(self.base_models):            S_test_i = np.zeros((T.shape[0], self.n_folds))            for j, (train_idx, test_idx) in enumerate(folds):                X_train = X[train_idx]                y_train = y[train_idx]                X_holdout = X[test_idx]                # y_holdout = y[test_idx]                clf.fit(X_train, y_train)                y_pred = clf.predict(X_holdout)[:]                S_train[test_idx, i] = y_pred                S_test_i[:, j] = clf.predict(T)[:]            S_test[:, i] = S_test_i.mean(1)        return S_train, S_test#第二层xgb2_param_grid = {"learning_rate": [0.01,0.5],                  "n_estimators" : [300,500],              "gamma": [0.1, 0.5,1.0],              "max_depth": [3, 5, 10],              "min_child_weight": [1, 3 , 5, 7],              "subsample" : [0.8,1.0],                 "colsample_bytree" : [0.6,0.8]}gridxgb2 = GridSearchCV(XGBClassifier(),param_grid = xgb2_param_grid, cv=5, scoring="accuracy", n_jobs= -1, verbose = 1)S_train, S_test = Ensemble_stacking1(5, [RFC_best,  ExtC_best, GBC_best, gridxgb]).get_data_to2(X_train, Y_train, test)gridxgb2.fit(S_train,Y_train)gridxgb2_best = gridxgb2.best_estimator_print(gridxgb2.best_score_)
Fitting 5 folds for each of 576 candidates, totalling 2880 fits
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    5.4s[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    8.3s[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   13.1s[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   19.8s[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:   28.4s[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed:   39.7s[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed:   54.5s
0.829738933031
[Parallel(n_jobs=-1)]: Done 2880 out of 2880 | elapsed:  1.1min finished
In [91]:
test_Survived = pd.Series(gridxgb2.predict(S_test), name="Survived")results_stacking = pd.concat([IDtest,test_Survived],axis=1)results_stacking.to_csv("ensemble_python_stacking.csv",index=False)

总结

  • 学完基本的机器学习理论知识后,在kaggle上找了Titanic比赛练练手,看了不少大神在kernel上了分享,学到了不少知识。本着尽快熟悉数据处理操作和各种机器学习算法应用的原则,自己重新实现了一遍比赛的基本流程,并且加了一些自己的想法进去。许多细节处理的特别粗糙,特征工程处理和模型调参由于时间的原因还不够完善,也是接下来要去认真学习的地方。刚接触Python没多长时间,代码写的也比较乱,还要多学习学习别人的代码风格,尝试总结出一套规范的流程来。
  • 最后将结果提交到kaggle上,分数没有达到0.8,这说明模型有点过拟合了。一方面特征工程做的还不够好,另一方面模型调参比较粗糙。如果再花时间进行模型调参,应该可以获得不错的成绩。本想着进一步优化模型,但是本次练习的目的也主要是熟悉和入门kaggle比赛,更多细节还需要学习优胜者们分享的解决方案,所以还是等以后有时间再来深究吧

参考资料

https://www.kaggle.com/helgejo/an-interactive-data-science-tutorial/notebook

https://www.kaggle.com/omarelgabry/a-journey-through-titanic

https://www.kaggle.com/startupsci/titanic-data-science-solutions

https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python

https://www.kaggle.com/ash316/eda-to-prediction-dietanic

https://www.kaggle.com/yassineghouzam/titanic-top-4-with-ensemble-modeling

http://prozhuchen.com/2016/12/28/CCF%E5%A4%A7%E8%B5%9B%E6%90%9C%E7%8B%97%E7%94%A8%E6%88%B7%E7%94%BB%E5%83%8F%E6%80%BB%E7%BB%93/

  • 记录科研点滴,丰富研究生生活!

原创粉丝点击