A Beginner's Walkthrough of Titanic: Machine Learning from Disaster
1. Previewing the Data
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
from pandas import DataFrame
import matplotlib.pyplot as plt
import seaborn as sns
#sns.set_style('whitegrid')
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv(r'C:\Users\0011\Desktop\kaggle\train.csv')  # 891 rows * 12 cols
test = pd.read_csv(r'C:\Users\0011\Desktop\kaggle\test.csv')    # 418 rows * 11 cols
train.info()
train_dms = pd.get_dummies(train['Sex'])        # 891 * 2
train1 = pd.concat([train, train_dms], axis=1)  # 891 * (12 + 2)
Age, Cabin, and Embarked turn out to have missing values.
Now look at the distribution of age versus survival after filling in the missing values:
# continuing from the previous snippet
nan_num = train['Age'].isnull().sum()  # 177 missing values; fill with random ints
age_mean = train['Age'].mean()
age_std = train['Age'].std()
filling = np.random.randint(age_mean - age_std, age_mean + age_std, size=nan_num)
train['Age'][train['Age'].isnull()] = filling
nan_num = train['Age'].isnull().sum()

# deal with the missing values in test
nan_num = test['Age'].isnull().sum()  # 86 nulls
age_mean = test['Age'].mean()
age_std = test['Age'].std()
filling = np.random.randint(age_mean - age_std, age_mean + age_std, size=nan_num)
test['Age'][test['Age'].isnull()] = filling
nan_num = test['Age'].isnull().sum()

# look into the age distribution
s = sns.FacetGrid(train, hue='Survived', aspect=3)  # aspect is the width/height ratio
s.map(sns.kdeplot, 'Age', shade=True)  # kernel density estimate
s.set(xlim=(0, train['Age'].max()))
s.add_legend()
Because the fill values are drawn at random, the plot varies slightly from run to run:
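One way to tame this run-to-run variation is to seed NumPy's generator before filling. This is an addition, not part of the original code; a minimal sketch on a toy Series standing in for `train['Age']`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for train['Age']; the real column has 177 NaNs.
age = pd.Series([22.0, np.nan, 38.0, np.nan, 26.0])

np.random.seed(42)  # fix the seed so repeated runs fill identical values
age_mean, age_std = age.mean(), age.std()
filling = np.random.randint(int(age_mean - age_std), int(age_mean + age_std),
                            size=age.isnull().sum())
age.loc[age.isnull()] = filling
print(age.isnull().sum())  # → 0
```

With the seed fixed, the KDE plot (and the features derived from Age below) become reproducible across runs.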
def under15(row):
    result = 0.0
    if row < 15:
        result = 1.0
    return result

def young(row):
    result = 0.0
    if row >= 15 and row < 30:
        result = 1.0
    return result

train['under15'] = train['Age'].apply(under15)
test['under15'] = test['Age'].apply(under15)
train['young'] = train['Age'].apply(young)
test['young'] = test['Age'].apply(young)
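The two helper functions above can also be written as vectorized boolean masks, which is the more idiomatic pandas style; a sketch on a toy Series (not the original data):

```python
import pandas as pd

ages = pd.Series([5.0, 16.0, 29.0, 40.0])  # toy ages for illustration

# Same logic as under15() and young(), without row-wise apply()
under15 = (ages < 15).astype(float)
young = ((ages >= 15) & (ages < 30)).astype(float)

print(under15.tolist())  # → [1.0, 0.0, 0.0, 0.0]
print(young.tolist())    # → [0.0, 1.0, 1.0, 0.0]
```

On 891 rows the difference is negligible, but the mask form avoids a Python-level loop and reads closer to the condition it encodes.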
# family
# check
print(train['SibSp'].value_counts(dropna=False))
print(train['Parch'].value_counts(dropna=False))
sns.factorplot('SibSp', 'Survived', data=train, size=5)
sns.factorplot('Parch', 'Survived', data=train, size=5)
'''
The plots suggest that survival rate drops as the number of family members grows.
Create a new column adding up Parch and SibSp to check this theory.
'''
train['family'] = train['SibSp'] + train['Parch']
test['family'] = test['SibSp'] + test['Parch']
sns.factorplot('family', 'Survived', data=train, size=5)
train.drop(['SibSp', 'Parch'], axis=1, inplace=True)
test.drop(['SibSp', 'Parch'], axis=1, inplace=True)
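The claim that survival drops with family size can be checked numerically as well as visually, with a groupby on the combined column. A sketch on made-up rows (column names follow the original dataset):

```python
import pandas as pd

# Toy frame with the same column names as the Titanic data
df = pd.DataFrame({
    'SibSp':    [0, 1, 1, 3, 4],
    'Parch':    [0, 0, 2, 2, 1],
    'Survived': [1, 1, 0, 0, 0],
})
df['family'] = df['SibSp'] + df['Parch']

# Mean survival rate per family size; large families score lower here
rate = df.groupby('family')['Survived'].mean()
print(rate)
```

On the real data this produces the same trend the factorplot shows, but as numbers that can be compared directly.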
# fare
# checking nulls: found one in the test set; leave it alone until we decide how to fill it
print(train.Fare.isnull().sum())
print(test.Fare.isnull().sum())

Result:

0
1
sns.factorplot('Survived', 'Fare', data=train, size=5)
i = train.Embarked.value_counts()
print(i)

Result:

S    644
C    168
Q     77
Name: Embarked, dtype: int64
2. Model Prediction
# test 1, using all the features
train_ft = train.drop('Survived', axis=1)
train_y = train['Survived']
# set up k-fold cross-validation (shuffle=True is required when random_state is set in newer scikit-learn)
kf = KFold(n_splits=3, shuffle=True, random_state=1)
acc_lst = []
ml(train_ft, train_y, 'test_1')
# test 2, drop young
train_ft_2 = train.drop(['Survived', 'young'], axis=1)
test_2 = test.drop('young', axis=1)
acc_lst = []
ml(train_ft_2, train_y, 'test_2')
# test 3, drop young and C
train_ft_3 = train.drop(['Survived', 'young', 'C'], axis=1)
test_3 = test.drop(['young', 'C'], axis=1)
acc_lst = []
ml(train_ft_3, train_y, 'test_3')
# test 4, drop Fare
train_ft_4 = train.drop(['Survived', 'Fare'], axis=1)
test_4 = test.drop(['Fare'], axis=1)
acc_lst = []
ml(train_ft_4, train_y, 'test_4')
# test 5, drop C
train_ft_5 = train.drop(['Survived', 'C'], axis=1)
test_5 = test.drop('C', axis=1)
acc_lst = []
ml(train_ft_5, train_y, 'test_5')
# test 6, drop Fare and young
train_ft_6 = train.drop(['Survived', 'Fare', 'young'], axis=1)
test_6 = test.drop(['Fare', 'young'], axis=1)
acc_lst = []
ml(train_ft_6, train_y, 'test_6')
accuracy_df = pd.DataFrame(data=accuracy,
                           index=['test1', 'test2', 'test3', 'test4', 'test5', 'test6'],
                           columns=['logistic', 'rf', 'svc', 'knn'])
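With the accuracy table built, the best feature-set/model pair can be read off programmatically instead of by eye. A sketch on hypothetical scores (the real numbers come from the six runs above):

```python
import pandas as pd

# Hypothetical scores standing in for the real cross-validation results
accuracy_df = pd.DataFrame(
    {'logistic': [0.79, 0.80], 'rf': [0.81, 0.80],
     'svc': [0.82, 0.83], 'knn': [0.70, 0.71]},
    index=['test1', 'test4'])

best_model = accuracy_df.max().idxmax()       # column with the overall best score
best_test = accuracy_df[best_model].idxmax()  # row (feature set) achieving it
print(best_model, best_test)  # → svc test4
```

This is how a "test4 + svc is best" conclusion can be pulled straight from the table rather than scanned manually.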
The original author's best configuration was test4:
svc 0.832727, but the score after submission:
The complete source code:
# -*- coding: utf-8 -*-
import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np
from pandas import DataFrame
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, KFold

# Part 1. Exploring the data
train = pd.read_csv(r'C:\Users\0011\Desktop\kaggle\train.csv')
test = pd.read_csv(r'C:\Users\0011\Desktop\kaggle\test.csv')
#print(train.describe())
#train.info()
#sns.factorplot("Pclass", "Survived", data=train)
#plt.show()

# Part 2. Data cleaning and feature choosing
def dummies(col, train, test):
    # one-hot encode a categorical column, then drop the original
    train_dms = pd.get_dummies(train[col])  # Sex -> 2 cols, Pclass -> 3 cols
    test_dms = pd.get_dummies(test[col])
    train = pd.concat([train, train_dms], axis=1)  # axis=1: concatenate column-wise
    test = pd.concat([test, test_dms], axis=1)
    train.drop(col, axis=1, inplace=True)  # equivalent to train = train.drop(col, axis=1)
    test.drop(col, axis=1, inplace=True)
    return train, test

# get rid of useless columns
dropping = ['PassengerId', 'Name', 'Ticket']
train.drop(dropping, axis=1, inplace=True)
test.drop(dropping, axis=1, inplace=True)
#print(train.Pclass.value_counts())  # counts of unique values

# Pclass
train, test = dummies('Pclass', train, test)

# Sex vs Survived: female survival rate is markedly higher than male
#print(train.Sex.value_counts(dropna=False))
#sns.factorplot('Sex', 'Survived', data=train)
#plt.show()
train, test = dummies('Sex', train, test)
train.drop('male', axis=1, inplace=True)
test.drop('male', axis=1, inplace=True)

# Age: deal with the missing data
nan_num = train['Age'].isnull().sum()  # 177 missing values; fill with random ints
age_mean = train['Age'].mean()
age_std = train['Age'].std()
filling = np.random.randint(age_mean - age_std, age_mean + age_std, size=nan_num)
train['Age'][train['Age'].isnull()] = filling
nan_num = train['Age'].isnull().sum()

# deal with the missing values in test
nan_num = test['Age'].isnull().sum()
age_mean = test['Age'].mean()
age_std = test['Age'].std()
filling = np.random.randint(age_mean - age_std, age_mean + age_std, size=nan_num)
test['Age'][test['Age'].isnull()] = filling
nan_num = test['Age'].isnull().sum()

# From the graph, children's survival rate is higher than the rest,
# and the 15-30 group's is lower.
def under15(row):
    result = 0.0
    if row < 15:
        result = 1.0
    return result

def young(row):
    result = 0.0
    if row >= 15 and row < 30:
        result = 1.0
    return result

train['under15'] = train['Age'].apply(under15)
test['under15'] = test['Age'].apply(under15)
train['young'] = train['Age'].apply(young)
test['young'] = test['Age'].apply(young)
train.drop('Age', axis=1, inplace=True)
test.drop('Age', axis=1, inplace=True)

# family (the bins could be adjusted further)
train['family'] = train['SibSp'] + train['Parch']
test['family'] = test['SibSp'] + test['Parch']
train.drop(['SibSp', 'Parch'], axis=1, inplace=True)
test.drop(['SibSp', 'Parch'], axis=1, inplace=True)

# Fare: one null in the test set, fill with the median
test['Fare'].fillna(test['Fare'].median(), inplace=True)

# Cabin: 687 out of 891 values are missing, drop the column
train.drop('Cabin', axis=1, inplace=True)
test.drop('Cabin', axis=1, inplace=True)

# Embarked: 2 missing values, fill with the majority value 'S'
train['Embarked'].fillna('S', inplace=True)
# C has a higher survival rate, drop the other two dummies
train, test = dummies('Embarked', train, test)
train.drop(['S', 'Q'], axis=1, inplace=True)
test.drop(['S', 'Q'], axis=1, inplace=True)

# Part 3. Modeling
# shuffle=True is required when random_state is set in newer scikit-learn
kf = KFold(n_splits=3, shuffle=True, random_state=1)

def modeling(clf, ft, target):
    acc = cross_val_score(clf, ft, target, cv=kf)
    acc_lst.append(acc.mean())
    return

accuracy = []

def ml(ft, target, time):
    accuracy.append(acc_lst)
    # logistic regression
    logreg = LogisticRegression()
    modeling(logreg, ft, target)
    # random forest
    rf = RandomForestClassifier(n_estimators=50, min_samples_split=4, min_samples_leaf=2)
    modeling(rf, ft, target)
    # svc
    svc = SVC()
    modeling(svc, ft, target)
    # knn
    knn = KNeighborsClassifier(n_neighbors=3)
    modeling(knn, ft, target)
    # see the coefficients
    logreg.fit(ft, target)
    feature = pd.DataFrame(ft.columns)
    feature.columns = ['Features']
    feature["Coefficient Estimate"] = pd.Series(logreg.coef_[0])
    #print(feature)
    return

# test 1, using all the features
train_ft = train.drop('Survived', axis=1)
train_y = train['Survived']
acc_lst = []
ml(train_ft, train_y, 'test_1')

# test 2, drop young
train_ft_2 = train.drop(['Survived', 'young'], axis=1)
test_2 = test.drop('young', axis=1)
acc_lst = []
ml(train_ft_2, train_y, 'test_2')

# test 3, drop young and C
train_ft_3 = train.drop(['Survived', 'young', 'C'], axis=1)
test_3 = test.drop(['young', 'C'], axis=1)
acc_lst = []
ml(train_ft_3, train_y, 'test_3')

# test 4, drop Fare
train_ft_4 = train.drop(['Survived', 'Fare'], axis=1)
test_4 = test.drop(['Fare'], axis=1)
acc_lst = []
ml(train_ft_4, train_y, 'test_4')

# test 5, drop C
train_ft_5 = train.drop(['Survived', 'C'], axis=1)
test_5 = test.drop('C', axis=1)
acc_lst = []
ml(train_ft_5, train_y, 'test_5')

# test 6, drop Fare and young
train_ft_6 = train.drop(['Survived', 'Fare', 'young'], axis=1)
test_6 = test.drop(['Fare', 'young'], axis=1)
acc_lst = []
ml(train_ft_6, train_y, 'test_6')

accuracy_df = pd.DataFrame(data=accuracy,
                           index=['test1', 'test2', 'test3', 'test4', 'test5', 'test6'],
                           columns=['logistic', 'rf', 'svc', 'knn'])

# test4 svc as the submission
svc = SVC()
svc.fit(train_ft_4, train_y)
svc_pred = svc.predict(test_4)
print(svc.score(train_ft_4, train_y))
test = pd.read_csv(r'C:\Users\0011\Desktop\kaggle\test.csv')
submission = pd.DataFrame({
    'PassengerId': test['PassengerId'],
    'Survived': svc_pred,
})
#submission.to_csv("kaggle.csv", index=False)
3. Trying Some Improvements of My Own
# Some extra age bins I added myself; various combinations gave no improvement.
def up60(row):
    result = 0.0
    if row > 60:
        result = 1.0
    return result

def f38t48(row):
    result = 0.0
    if row >= 38 and row < 48:
        result = 1.0
    return result

def f48t60(row):
    result = 0.0
    if row >= 48 and row < 60:
        result = 1.0
    return result

train['up60'] = train['Age'].apply(up60)
test['up60'] = test['Age'].apply(up60)
train['f38t48'] = train['Age'].apply(f38t48)
test['f38t48'] = test['Age'].apply(f38t48)
train['f48t60'] = train['Age'].apply(f48t60)
test['f48t60'] = test['Age'].apply(f48t60)
Next I tried the relationship between the title in each name and age; the earlier submission had turned out to overfit.
# Age: use the title extracted from Name to impute missing values
train['Initial'] = train.Name.str.extract('([A-Za-z]+)\.')
test['Initial'] = test.Name.str.extract('([A-Za-z]+)\.')
train['Initial'].replace(['Mlle', 'Mme', 'Ms', 'Dr', 'Major', 'Lady', 'Countess',
                          'Jonkheer', 'Col', 'Rev', 'Capt', 'Sir', 'Don'],
                         ['Miss', 'Miss', 'Miss', 'Mr', 'Mr', 'Mrs', 'Mrs',
                          'Other', 'Other', 'Other', 'Mr', 'Mr', 'Mr'],
                         inplace=True)
test['Initial'].replace(['Mlle', 'Mme', 'Ms', 'Dr', 'Major', 'Lady', 'Countess',
                         'Jonkheer', 'Col', 'Rev', 'Capt', 'Sir', 'Don'],
                        ['Miss', 'Miss', 'Miss', 'Mr', 'Mr', 'Mrs', 'Mrs',
                         'Other', 'Other', 'Other', 'Mr', 'Mr', 'Mr'],
                        inplace=True)
# check the average age for each title
#print(train.groupby('Initial')['Age'].mean())

# assign the NaN values with the ceiling of the mean age per title
train.loc[(train.Age.isnull()) & (train.Initial == 'Mr'), 'Age'] = 33
train.loc[(train.Age.isnull()) & (train.Initial == 'Mrs'), 'Age'] = 36
train.loc[(train.Age.isnull()) & (train.Initial == 'Master'), 'Age'] = 5
train.loc[(train.Age.isnull()) & (train.Initial == 'Miss'), 'Age'] = 22
train.loc[(train.Age.isnull()) & (train.Initial == 'Other'), 'Age'] = 46
test.loc[(test.Age.isnull()) & (test.Initial == 'Mr'), 'Age'] = 33
test.loc[(test.Age.isnull()) & (test.Initial == 'Mrs'), 'Age'] = 36
test.loc[(test.Age.isnull()) & (test.Initial == 'Master'), 'Age'] = 5
test.loc[(test.Age.isnull()) & (test.Initial == 'Miss'), 'Age'] = 22
test.loc[(test.Age.isnull()) & (test.Initial == 'Other'), 'Age'] = 46

train.drop(['Name', 'Initial'], axis=1, inplace=True)
test.drop(['Name', 'Initial'], axis=1, inplace=True)
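How `str.extract` pulls the title out of a name can be seen on a couple of sample rows in the Titanic name format:

```python
import pandas as pd

names = pd.Series(['Braund, Mr. Owen Harris',
                   'Heikkinen, Miss. Laina',
                   'Smith, Dr. John'])  # last name is made up for illustration

# The regex captures the word immediately before a period, i.e. the title
titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)
print(titles.tolist())  # → ['Mr', 'Miss', 'Dr']
```

The replace step above then collapses rare titles like 'Dr' into the common buckets before the per-title mean ages are used to fill the NaNs.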
Not bad, this time it improved: