Titanic: Machine Learning from Disaster


1. Posing the Problem

•The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
•One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
•In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

(Quoted from https://www.kaggle.com/c/titanic)

In short: using the site's train data, which includes a Survived label, train a model that predicts whether each passenger in the test data (where survival is unknown) survived, and finally write the results out as a (passenger, survived) CSV file.


2. Charting the Data

import pandas as pd

data = pd.read_csv("titanictrain.csv")
data.describe()

This is describe()'s summary of the given train data: PassengerId has a count of 891, so 891 is the total number of passengers, while Age has a count of only 714, so some ages are missing.
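The gap between the two counts can also be confirmed directly with isnull().sum(). A minimal sketch on toy data standing in for the real train file:

```python
import pandas as pd

# A tiny stand-in for the Titanic train data (the real file has 891 rows);
# isnull().sum() reports how many values each column is missing.
data = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Age": [22.0, None, 26.0, None],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})
missing = data.isnull().sum()
print(missing["Age"])  # 2 of the 4 ages are missing
```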





import matplotlib.pyplot as plt

Survived_0 = data.Pclass[data.Survived == 0].value_counts()
Survived_1 = data.Pclass[data.Survived == 1].value_counts()
df = pd.DataFrame({'Survived': Survived_1, 'unSurvived': Survived_0})
df.plot(kind='bar', stacked=True)
plt.title("survived in pclass")
plt.xlabel("pclass")
plt.ylabel("persons")
plt.show()

This shows, for each class, the proportion rescued (blue) versus not rescued (orange). Clearly, upper-class passengers were rescued at a much higher rate than lower-class passengers, so Pclass will be an important feature for prediction.
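The per-class rates behind the bars can also be computed numerically: since Survived is a 0/1 flag, a groupby mean is the survival rate. A sketch on toy rows mimicking the (Pclass, Survived) columns:

```python
import pandas as pd

# Toy rows standing in for the real (Pclass, Survived) columns;
# groupby().mean() turns the 0/1 Survived flag into a per-class survival rate.
data = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 1],
})
rate = data.groupby("Pclass")["Survived"].mean()
print(rate[1])  # 1.0 - both first-class passengers in the toy data survived
```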





Survived_0 = data.Sex[data.Survived == 0].value_counts()
Survived_1 = data.Sex[data.Survived == 1].value_counts()
df = pd.DataFrame({'Survived': Survived_1, 'unSurvived': Survived_0})
df.plot(kind='bar', stacked=True)
plt.title("survived in Sex")
plt.xlabel("Sex")
plt.ylabel("persons")
plt.show()

This shows, for each sex, the proportion rescued (blue) versus not rescued (orange). Women were rescued at a far higher rate than men; the British "ladies first" ethic evidently held on the Titanic. So Sex is also a very important feature.
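The same picture can be read off in table form with pd.crosstab; normalize='index' turns the counts into per-sex proportions. A sketch on toy data, not the real 891 rows:

```python
import pandas as pd

# Toy rows standing in for the real (Sex, Survived) columns.
data = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male"],
    "Survived": [1,        1,        0,        0,      0,      1],
})
# Each row of the crosstab sums to 1: the rescued/not-rescued split per sex.
table = pd.crosstab(data["Sex"], data["Survived"], normalize="index")
print(table)
```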



Survived_0 = data.Parch[data.Survived == 0].value_counts()
Survived_1 = data.Parch[data.Survived == 1].value_counts()
df = pd.DataFrame({'Survived': Survived_1, 'unSurvived': Survived_0})
df.plot(kind='bar', stacked=True)
plt.title("survived in Parch")
plt.xlabel("Parch")
plt.ylabel("persons")
plt.show()

Parch, per the Kaggle data dictionary, is the number of parents and children a passenger had aboard; the chart shows it also has some influence on survival.



Survived_0 = data.Embarked[data.Survived == 0].value_counts()
Survived_1 = data.Embarked[data.Survived == 1].value_counts()
df = pd.DataFrame({'Survived': Survived_1, 'unSurvived': Survived_0})
df.plot(kind='bar', stacked=True)
plt.title("survived in Embarked")
plt.xlabel("Embarked")
plt.ylabel("persons")
plt.show()

Passengers who boarded at different ports were rescued at different rates. Perhaps passengers from different ports had different backgrounds, or were placed in different cabins; either way, embarkation port has some influence on survival.



(partial code)

age_young = []
age_middle = []
age_old = []
for i in data.Age:
    if i <= 15:
        age_young.append(i)
    elif i <= 45:
        age_middle.append(i)
    elif i <= 100:
        age_old.append(i)

Age_Y = pd.DataFrame(age_young)
Age_M = pd.DataFrame(age_middle)
Age_O = pd.DataFrame(age_old)



Because age takes so many distinct values (it is nearly continuous), I split it into three groups, young (15 and under), middle-aged (15 to 45), and old (over 45), and looked at the rescue rate of each. The young and old groups were rescued at lower rates (their rescued and died densities are close), while the middle-aged group's rate is higher. When disaster strikes, the middle-aged are evidently better able to save themselves, so age is an important factor in survival.
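The three-way split above can be done declaratively with pd.cut. The bin edges (15 and 45) follow the text; the labels are my own naming:

```python
import pandas as pd

# pd.cut assigns each age to a bin; bins=[0, 15, 45, 100] gives the
# three intervals (0, 15], (15, 45], (45, 100] from the text.
ages = pd.Series([4.0, 22.0, 38.0, 58.0, 71.0])
groups = pd.cut(ages, bins=[0, 15, 45, 100], labels=["young", "middle", "old"])
print(groups.value_counts()["middle"])  # 2 of the toy ages fall in (15, 45]
```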



Survived_0 = data.Fare[data.Survived == 0].value_counts()
Survived_1 = data.Fare[data.Survived == 1].value_counts()
df = pd.DataFrame({'Survived': Survived_1, 'unSurvived': Survived_0})
df.plot(kind='kde', stacked=True)
plt.title("survived in Fare")
plt.xlabel("Fare")
plt.ylabel("persons density")
plt.show()

Looking at the fare density, it may simply be that the 0 to 10 range held the most passengers, which is why it holds the most rescued passengers; on its own I don't think this variable matters much. But a more expensive ticket likely indicates a higher class, so fare may still be positively correlated with rescue rate.
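The suspected positive relation can be checked numerically with a correlation coefficient. A sketch on toy rows standing in for the real (Fare, Survived) columns:

```python
import pandas as pd

# Toy data: higher fares paired with survival, mimicking the suspected trend.
data = pd.DataFrame({
    "Fare":     [7.25, 8.05, 26.0, 71.28, 512.33],
    "Survived": [0,    0,    1,    1,     1],
})
corr = data["Fare"].corr(data["Survived"])
print(corr > 0)  # positive in this toy sample
```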



3. Processing the Training Data

from sklearn.ensemble import RandomForestRegressor

def set_missing_ages(df):
    # Use Age, Fare, Parch, SibSp, and Pclass to predict missing ages
    age_df = df[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
    known_age = age_df[age_df.Age.notnull()].values
    unknown_age = age_df[age_df.Age.isnull()].values
    y = known_age[:, 0]
    X = known_age[:, 1:]
    # Fit a RandomForestRegressor on the rows with known ages
    rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
    rfr.fit(X, y)
    # Predict the unknown ages with the fitted model
    predictedAges = rfr.predict(unknown_age[:, 1:])
    # Fill the missing entries with the predictions
    df.loc[(df.Age.isnull()), 'Age'] = predictedAges
    return df, rfr

data, rfr = set_missing_ages(data)

As noted earlier, only 714 ages are present, i.e. roughly 19% are missing; if Age is to be used as a feature, that gap could badly distort the predictions. The code above builds a random forest that, from the existing data, predicts a plausible age for each missing entry and fills it in. After imputation the Age count is also 891.
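A simpler, common alternative to the random-forest fill (not the method used in this post) is median imputation. A sketch on a toy column:

```python
import pandas as pd

# Replace each missing age with the median of the known ages.
data = pd.DataFrame({"Age": [22.0, None, 26.0, None, 30.0]})
data["Age"] = data["Age"].fillna(data["Age"].median())
print(data["Age"].isnull().sum())  # 0 - every age is now filled in
```

This loses the per-passenger signal the random forest exploits, but it is a useful baseline to compare against.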


import sklearn.preprocessing as preprocessing

dummies_Embarked = pd.get_dummies(data['Embarked'], prefix='Embarked')
dummies_Sex = pd.get_dummies(data['Sex'], prefix='Sex')
dummies_Pclass = pd.get_dummies(data['Pclass'], prefix='Pclass')
df = pd.concat([data, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)
df.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'SibSp'],
        axis=1, inplace=True)
# Fit one scaler per column so each can be reused on the test data later
age_scale_param = preprocessing.StandardScaler().fit(df[['Age']])
df['Age_scaled'] = age_scale_param.transform(df[['Age']])
fare_scale_param = preprocessing.StandardScaler().fit(df[['Fare']])
df['Fare_scaled'] = fare_scale_param.transform(df[['Fare']])

First convert the categorical columns to one-hot encodings (a format the logistic regression can consume; see the references), then drop the columns that can't be used as-is (name, sibling count, and so on) along with the originals of the one-hot-encoded columns. Finally, standardize Age and Fare (subtract the mean and divide by the standard deviation); otherwise features on wildly different scales can keep the logistic regression from converging.
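To make the encoding concrete, here is what get_dummies produces for a small categorical column: one 0/1 column per category, with the same prefix convention as the Embarked_* columns above:

```python
import pandas as pd

# One-hot encoding a toy Embarked column: each port becomes its own
# 0/1 indicator column.
emb = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
dummies = pd.get_dummies(emb["Embarked"], prefix="Embarked")
print(list(dummies.columns))  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
```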



from sklearn import linear_model

df.drop(['PassengerId', 'Age', 'Fare'], axis=1, inplace=True)
train_np = df.values
y = train_np[:, 0]   # Survived
X = train_np[:, 1:]  # the remaining features
# The liblinear solver is needed for the l1 penalty
clf = linear_model.LogisticRegression(C=1.0, penalty='l1', tol=1e-6,
                                      solver='liblinear')
clf.fit(X, y)  # train

Drop the columns the model shouldn't see (PassengerId and the unscaled originals), then pass X (the feature columns) and y (Survived) into the logistic regression, tuning the parameters as needed, to obtain the model clf.
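Before submitting, it's worth estimating how well a classifier like clf generalises with k-fold cross-validation on the training matrix. A sketch with synthetic X and y standing in for the real feature matrix:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: the label depends on the first feature, so the
# model has a real signal to learn.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = (X[:, 0] > 0.5).astype(int)

clf = LogisticRegression(C=1.0)
# 5-fold cross-validation: fit on 4/5 of the rows, score on the held-out 1/5.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```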



4. Processing the Test Data

data_test = pd.read_csv("titanicttest.csv")
# Shape the test data so the age-filling model from earlier can be applied
data_test.loc[(data_test.Fare.isnull()), 'Fare'] = 0
tmp_df = data_test[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
null_age = tmp_df[data_test.Age.isnull()].values
# Fill in the missing ages in the test data
X = null_age[:, 1:]
predictedAges = rfr.predict(X)
data_test.loc[(data_test.Age.isnull()), 'Age'] = predictedAges
# Convert the categorical columns to one-hot encodings the model understands
dummies_Embarked = pd.get_dummies(data_test['Embarked'], prefix='Embarked')
dummies_Sex = pd.get_dummies(data_test['Sex'], prefix='Sex')
dummies_Pclass = pd.get_dummies(data_test['Pclass'], prefix='Pclass')
df_test = pd.concat([data_test, dummies_Embarked, dummies_Sex, dummies_Pclass],
                    axis=1)
# Drop the unused columns, and standardize Age and Fare with the scalers
# that were fitted on the training data
df_test.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'],
             axis=1, inplace=True)
df_test['Age_scaled'] = age_scale_param.transform(df_test[['Age']])
df_test['Fare_scaled'] = fare_scale_param.transform(df_test[['Fare']])

Whatever transformations were applied to the training data must be applied identically to the test data, so that the columns and data types match; see the comments in the code for details.
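One way to make that train/test consistency automatic (not used in this post) is to bundle the preprocessing and the model in a sklearn Pipeline, so the scaler fitted on the train data is the one that transforms the test data. A sketch with synthetic numbers, not the post's exact features:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy (age, fare)-like features and 0/1 labels.
X_train = np.array([[20.0, 7.0], [40.0, 80.0], [60.0, 30.0], [35.0, 15.0]])
y_train = np.array([0, 1, 1, 0])
X_test = np.array([[25.0, 10.0]])

pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LogisticRegression())])
pipe.fit(X_train, y_train)   # the scaler is fitted on the train data only
pred = pipe.predict(X_test)  # the same fitted scaler transforms the test data
print(pred.shape)  # (1,)
```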



df_test.drop(['PassengerId', 'Age', 'Fare', 'SibSp'], axis=1, inplace=True)
predictions = clf.predict(df_test)
result = pd.DataFrame({'PassengerId': data_test['PassengerId'],
                       'Survived': predictions})
# index=False: don't write the row index as an extra first column
result.to_csv("predictions.csv", index=False)

Finally, feed the processed df_test into the clf model, store the predictions in the predictions variable, and save them to predictions.csv, completing the task.




References:

1. http://www.cnblogs.com/zhizhan/p/5238908.html

2. http://blog.csdn.net/wy250229163/article/details/52983760 (one-hot encoding)



