kaggle之Predict survival on the Titanic

来源：互联网发布：红色警戒2mac版下载编辑：程序博客网时间：2024/04/26 22:31

Predict survival on the Titanic

“The aim of our Titanic Tutorial was to show you an easy way into more difficult problems, so don't be too disheartened if your super-complicated random forest doesn't beat the gender based model!”

一、数据导入

train <- read.csv("train.csv", header=TRUE)test <- read.csv("test.csv", header=TRUE)  #默认情况下，R会将数据从string型转换成factor型str(train) #查看数据结构.如果数据类型有误，经常会被当做缺失值处理，所以要用该命令查看!str(test)

二、变量描述
Passenger: 890位乘客
Survived: 0,1
Pclass: 1,2,3 代表社会经济水平,因此是一个非常重要的变量.
Sex: male,female 性别也是一个很重要的变量
Age: 年龄也是一个重要变量 (最大值为80，最小值为0.42)

Cabin:Cabin 船舱位置,缺失值很多
Fare:Passenger Fare 船费
Ticket:Ticket Number 船票号码

Embarked:Port of Embarkation(C=Cherbourg;Q=Queenstown;S=Southampton) 上船地点

SibSp: Number of Siblings/Spouses Aboard (最大值8，最小值0)
Parch: Number of Parents/Children Aboard (最大值6，最小值0)
其中Sibling\Spouse\Parent\Child分别表示:
Sibling:Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:Mother or Father of Passenger Aboard Titanic
Child:Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

美国新泽西州州立大学教授、著名社会学家戴维·波普诺认为泰坦尼克号的幸存几率主要取决于船舱等级，而不是性别和年龄，即妇女儿童优先的惯例,因此更准确的说法应该是头等舱和二等舱的妇女和儿童优先。————戴维·波普诺《社会学》

让我们就此探究一下，如果数据足够真实！

三、寻找变量之间的关联模式

为此提出以下假设:

假设1头等舱的生存几率更大,或头等舱距离救生艇较近，生存几率更大
假设2女性的生存几率更大
假设3儿童的生存几率更大
假设4发达地区的生存几率更大

假设5兄弟姐妹\夫妻\父母\儿童人数越多，生存几率越大

... ...

prop.table(table(train$Pclass,train$Survived),1)

prop.table(table(train$Sex, train$Survived),1)

train$Child <- 0

train$Child[train$Age < 18] <- 1 #此时，0表示成年人，1表示儿童
aggregate(Survived ~ Child + Sex, data=train, FUN=sum)
aggregate(Survived ~ Child + Sex, data=train, FUN=length)
aggregate(Survived ~ Child + Sex, data=train, FUN=function(x) {sum(x)/length(x)})
... ...

四、处理变量的缺失值

首先分析train数据集，其次是test数据集:

Fare变量：

summary(train$Fare)train$Fare2 <- '45+'train$Fare2[train$Fare < 45 & train$Fare >= 30] <- '30-45'train$Fare2[train$Fare < 30 & train$Fare >= 15] <- '15-30'train$Fare2[train$Fare < 15 & train$Fare >= 5] <- '5-15'train$Fare2[train$Fare < 5] <- '<5'aggregate(Survived ~ Fare2 + Pclass + Sex, data=train, FUN=function(x) {sum(x)/length(x)})

可得出以下结论:

1,三等舱女性生存率很低，其中票价高于45+的生存率为0.

2,无论几等舱,男性生存率都比较低!

首先看最小值

subset(train, Fare < 7)[order(subset(train, Fare < 7)$Fare, subset(train, Fare < 7)$Pclass),                         c("Survived","Pclass","Sex","Age", "Fare")]

结合上面的结论，我们可以将Fare为0的值替换为所在舱的Fare平均值

imputeMedian <- function(impute.var, filter.var, var.levels) {  for (v in var.levels) {    impute.var[ which( filter.var == v)] <- impute(impute.var[       which( filter.var == v)])  }  return (impute.var)}train$Fare[which(train$Fare == 0)] <- NArequire(Hmisc)bystats(train$Fare, train$Pclass, fun=function(x)c(Mean=mean(x),Median=median(x)))Pclass.na.train <- c("1", "2", "3")train$Fare <- imputeMedian(train$Fare, train$Pclass, Pclass.na.train)summary(train$Fare)

其次看最大值

bystats(train$Fare, train$Pclass, fun=function(x)c(Mean=mean(x),Median=median(x)))subset(train, Fare > 45)[order(subset(train, Fare > 45)$Fare, subset(train, Fare > 45)$Pclass),                          c("Survived","Pclass","Sex","Age", "FamilySize","Fare")]

票价还算正常反应了平均值，除了极个别的很高之外。

options(digits=2)summary(train$Age)head(train$Name, n=10L)getTitle <- function(data) {  title.dot.start <- regexpr("\\, [A-Z ]{1,20}\\.", data$Name, TRUE)  #用于字符串的提取操作，返回匹配的具体位置和字符串长度信息.[]中的空格是匹配the Countess.  title.comma.end <- title.dot.start+                  attr(title.dot.start, "match.length")-1   #注意:代码换行时，操作符应写在上一行末尾，不能作为下一行的开头!!!  data$Title <- substr(data$Name, title.dot.start+2, title.comma.end-1)  #第1个参数为要拆分的字符串向量，第2个参数为截取的起始位置向量，第3个参数为截取的终止位置向量  return (data$Title)}  train$Title <- getTitle(train)unique(train$Title)

Age变量:

subset(train, Age > 18)[order(subset(train, Age > 18)$Age, subset(train, Age > 18)$Title),                         c("Survived","Pclass","Sex","Age","Title")]

Age和Title比较关联，所以用中值替代也合理

bystats(train$Age, train$Title, fun=function(x)c(Mean=mean(x),Median=median(x)))titles.na.train <- c("Dr", "Master", "Mrs", "Miss", "Mr")train$Age <- imputeMedian(train$Age, train$Title, titles.na.train)summary(train$Age)

Title变量:

## function for assigning a new title value to old title(s) changeTitles <- function(data, old.titles, new.title) {  for (honorific in old.titles) {    data$Title[ which( data$Title == honorific)] <- new.title  }  return (data$Title)}## Title consolidation 生成Title变量！train$Title <- changeTitles(train, c("Capt", "Col", "Don", "Dr", "Jonkheer", "Lady", "Major", "Rev", "Sir"), "Noble")train$Title <- changeTitles(train, c("the Countess", "Ms"), "Mrs") train$Title <- changeTitles(train, c("Mlle", "Mme"), "Miss")train$Title <- as.factor(train$Title)

Name分析:

Duff Gordon, Sir. Cosmo Edmund ("Mr Morgan") 和 Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan")，其中Mr Morgan和Mrs Morgan,有可能是别名或昵称.

Harmer, Mr. Abraham (Daivid Lishin)

1,David Livshin purchased his ticket under the name Abraham Harmer, is that right?
2,这种形式的名字大部分是女性，如下
Baclini, Mrs. Solomon (Latifa Qurban) and Baclini, Miss. Marie Catherine
Mrs. indicates that she is married, Miss indicates that she is unmarried
Solomon is the name of her husband. This is a (old-ish) custom where wives can be referred to by their husbands name.
Latifa is her first name, Marie is her first name
Catherine is her middle name.
Qurban is her "maiden" name. This is the family/last name that she had before getting married.
Baclini is her (married) family/last name, i.e. the last name of her husband Solomon.

... ...

Jonkheer is actually male (jong = young and "heer" = male) but is also a title of nobility.

"To get a more exact answer at what age that might be requires some research in the customs of the British in that era."

Embarked变量:

summary(train$Embarked)which(train$Embarked == '') #此处为空白，不同于NA!train$Embarked[c(62,830)] = "S"train$Embarked <- factor(train$Embarked)

注意:as.factor和factor的区别，前者赋值后仍然保留原值的水平(level)，后者则合并到赋值的水平之中.

FamilySize变量:

train$FamilySize <- train$SibSp + train$Parch + 1

test数据集的相应处理:

test$Title <- getTitle(test)unique(test$Title) bystats(test$Age, test$Title,         fun=function(x)c(Mean=mean(x),Median=median(x)))titles.na.test <- c("Master","Miss", "Mr","Mrs","Ms") test$Age <- imputeMedian(test$Age, test$Title, titles.na.test)test$Age[which(is.na(test$Age))] <- 28 summary(test$Age)#由于Ms仅此一个值，无平均值替代，所以仍然缺失，此处用训练集的均值替代#生成Title变量！test$Title <- changeTitles(test, c("Dona", "Ms"), "Mrs")test$Title <- changeTitles(test, c("Col", "Dr", "Rev"), "Noble")test$Title <- as.factor(test$Title)summary(test$Fare)test$Fare[which(test$Fare == 0)] <- NAbystats(test$Fare, test$Pclass, fun=function(x)c(Mean=mean(x),Median=median(x)))Pclass.na.test <- c("1", "2", "3")test$Fare <- imputeMedian(test$Fare, test$Pclass, Pclass.na.test)which(is.na(test$Fare))test$Fare[153] <- 7.9summary(test$Fare)test$FamilySize <- test$SibSp + test$Parch + 1

五、模型估计

这里仅选取效果较好的模型，因此有些变量也没用上，如fare等。

library(randomForest)set.seed(1234)fit <- randomForest(as.factor(Survived) ~ Pclass+Sex+Age+Embarked+SibSp+Parch+FamilySize,                  data=train,importance=TRUE,proximity=TRUE)  prediction <- predict(fit,test)submit <- data.frame(PassengerId = test$PassengerId, Survived = prediction)write.csv(submit, file = "firstforest.csv", row.names = FALSE)

得分:0.79426！

其他提高预测率的方法，主要是:
1,Revisit your assumptions about how you cleaned and filled the data.
2,Be creative with additional feature engineering, so that your chosen model has more columns to train from.
3,Use the sklearn documentation to experiment with different parameters for your random forest.(R似乎还没有包支持scikit-learn)
4,Consider a different model approach. For example, a logistic regression model is often used to predict binary outcomes like 0/1.
具体有:
1,变量的处理方式(factor or character? numeric转换成factor等)
Having PClass as an integer gives slightly better results.
As there is a natural ordering.
2,不同Title的分组
'Miss', ' Mlle' are the same. 'Mme', ' Mrs' are the same, …(Mme in Mrs?)
Just different languages. If the group is too small the algorithm will ignore it.
It is worth grouping military titles and aristocratic (贵族) titles.
3,不同Age缺失值的插入方法
Using the Title, SibSp, Parch will give better estimates.
Combine train.csv and test.csv data to determine missing values, but could be considered border line cheating.
4,fare
Ticket groups passengers and the fare is always the total family price for each passenger.
Some fares include travel from the home city to the Embarked port. 不一定都是到纽约去!
A lot of work can be done to calculate a passengers fare.
5,增加变量
Separate SibSp and Parch into Siblings, Spouse, Parents, Children
These columns are overloaded.
6,家庭关系
Ticket, surname, salutation, … can be used to determine relationships.
i.e. is a nanny (保姆) looking after the children
7,新的特征
New features such as, did other family members survive or die.
8,其他
考虑船票号码与生存率的关联，Age缺失值模式与生存率的关联，fare为0是否应该用其他值替代，cabin的缺失值处理，模型估计方法（利用caret包选择最佳参数）等

Reference:
https://www.kaggle.com/c/titanic-gettingStarted
https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests
http://trevorstephens.com/post/72916401642/titanic-getting-started-with-r 特点:变量选取部分很详细
https://github.com/wehrley/wehrley.github.io/blob/master/SOUPTONUTS.md 特点:"深入全面"
https://www.kaggle.com/c/titanic-gettingStarted/forums/t/6699/sharing-experiences-about-data-munging-and-classification-steps-with-python

1 0