Kaggle Competition: A Simple Case Study in Titanic Disaster Data Analysis
Source: Internet  Editor: 程序博客网  Date: 2024/05/01 20:45
https://www.kaggle.com/c/titanic
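- Define the goal: predict whether passengers with unknown outcomes survived, based on the available data
- Data preparation:
  - Data acquisition: load the training-set and test-set CSV files
  - Data cleaning: fill in or drop missing values; convert data types (strings to numbers)
  - Data restructuring: rebuild the data as needed (regroup columns, construct new features)
- Data analysis:
  - Descriptive analysis: plots and visual inspection
  - Exploratory analysis: machine learning models
- Output: upload a CSV file to get an accuracy score and a ranking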
Load libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Load the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head()  # show the first few rows
test.head()  # show the first few rows
Data overview
train.shape, test.shape  # number of rows and columns
((891, 12), (418, 11))
train.info()  # inspect each column in detail
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
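Column format of train.csv
- PassengerId: passenger ID
- Survived: whether the passenger survived; 0 = died, 1 = survived
- Pclass: ticket class; 1 = Upper, 2 = Middle, 3 = Lower
- Name: name, object
- Sex: sex, object
- Age: age, 177 values missing
- SibSp: number of siblings and spouses
- Parch: number of parents and children
- Ticket: ticket number, object
- Fare: ticket fare
- Cabin: cabin, object, 687 values missing
- Embarked: port of embarkation, object, 2 values missing
train.head()  # head() shows the first few rows; evaluating train on its own returns the whole table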
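The missing-value counts in the list above can be read off the info() output by subtraction (for example 891 - 714 = 177 for Age), but they can also be computed directly. The following is a minimal sketch on a small made-up frame; `toy` and its values are illustrative assumptions, not rows from train.csv.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for train.csv (values are illustrative only).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() marks missing cells; summing the booleans counts them per column.
missing_counts = toy.isnull().sum()
print(missing_counts)
```

Running the same expression as `train.isnull().sum()` should give one count per column of the real training set.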
Data cleaning
Drop columns that have too many missing values or are irrelevant
# .loc selects data by label; in .loc[rows, columns], the part before the comma selects rows and the part after it selects columns
train2 = train.loc[:, ['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
test2 = test.loc[:, ['PassengerId', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
train2.head()
test2.head()
train2.info(), test2.info()
test2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Fare           417 non-null float64
dtypes: float64(2), int64(4), object(1)
memory usage: 22.9+ KB
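To make the row/column convention of .loc concrete, here is a small sketch on a made-up frame. The frame `toy` and the threshold 25 are assumptions chosen only for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {"PassengerId": [1, 2, 3], "Name": ["a", "b", "c"], "Age": [22.0, 38.0, 26.0]}
)

# All rows (":"), two columns selected by label.
subset = toy.loc[:, ["PassengerId", "Age"]]

# Rows selected by a boolean condition, one column selected by label.
older = toy.loc[toy["Age"] > 25, "PassengerId"]
```

The same boolean-mask form of .loc is what the later cleaning steps rely on when they assign to only the rows with missing values.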
Fill missing ages
age = train2['Age'].median()  # median age
age
28.0
train2['Age'].isnull()  # boolean mask: True where Age is missing
0      False
1      False
2      False
3      False
4      False
5       True
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17      True
18     False
19      True
20     False
21     False
22     False
23     False
24     False
25     False
26      True
27     False
28      True
29      True
       ...
861    False
862    False
863     True
864    False
865    False
866    False
867    False
868     True
869    False
870    False
871    False
872    False
873    False
874    False
875    False
876    False
877    False
878     True
879    False
880    False
881    False
882    False
883    False
884    False
885    False
886    False
887    False
888     True
889    False
890    False
Name: Age, Length: 891, dtype: bool
train2.loc[train2['Age'].isnull(), 'Age'] = age  # fill missing ages in train2 with the median age
train2.info()
test2.loc[test2['Age'].isnull(), 'Age'] = age  # fill missing ages in test2 with the same median (computed from the training set)
test2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Sex            418 non-null object
Age            418 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Fare           417 non-null float64
dtypes: float64(2), int64(4), object(1)
memory usage: 22.9+ KB
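The mask-and-assign pattern above can also be written with fillna. The sketch below is an equivalent alternative rather than the method used in this post; `toy_train` and `toy_test` are made-up frames. As in the post, the median is computed on the training data only and reused for the test data.

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({"Age": [20.0, np.nan, 30.0, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0]})

# median() skips NaN by default, so only the three known ages are used.
median_age = toy_train["Age"].median()

# Fill both sets with the training-set median.
toy_train["Age"] = toy_train["Age"].fillna(median_age)
toy_test["Age"] = toy_test["Age"].fillna(median_age)
```

Reusing the training median for the test set keeps the two sets on the same scale and avoids deriving any statistic from the data being predicted.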
Fill the missing fare
# fill the missing fare with the mode (most frequent value)
Fare = test2['Fare'].mode()
Fare
test2.loc[test2['Fare'].isnull(), 'Fare'] = Fare[0]
train2.info(), test2.info()
train2.head()
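The code above indexes the result with `[0]` because mode() returns a Series rather than a single number: when several values tie for most frequent, all of them are returned in ascending order. A minimal sketch with made-up fares:

```python
import numpy as np
import pandas as pd

# Made-up fares with a two-way tie for the most frequent value.
fares = pd.Series([8.05, 8.05, 13.0, 13.0, 26.0, np.nan])

modes = fares.mode()          # every most-frequent value, sorted ascending; NaN ignored
filled = fares.fillna(modes[0])  # take the first (smallest) mode as the fill value
```

With a tie, `modes[0]` silently picks the smallest candidate, so the median may be a more stable fill value for a continuous column such as Fare; that is a design suggestion, not something the post evaluates.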
Data type conversion
train2.dtypes, test2.dtypes  # column data types
(PassengerId      int64
 Survived         int64
 Pclass           int64
 Sex             object
 Age            float64
 SibSp            int64
 Parch            int64
 Fare           float64
 dtype: object,
 PassengerId      int64
 Pclass           int64
 Sex             object
 Age            float64
 SibSp            int64
 Parch            int64
 Fare           float64
 dtype: object)
Convert sex to integers
train2['Sex'] = train2['Sex'].map({'female': 0, 'male': 1}).astype(int)
test2['Sex'] = test2['Sex'].map({'female': 0, 'male': 1}).astype(int)
train2.head()
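One thing to watch with map followed by astype(int): any label missing from the dictionary becomes NaN, and the integer cast then fails. The sketch below shows a quick check before casting; the misspelled value "Male" is a deliberately bad input invented for the example.

```python
import pandas as pd

sex = pd.Series(["female", "male", "Male"])  # "Male" is a deliberately bad value
codes = sex.map({"female": 0, "male": 1})

# Labels that are not keys of the dictionary come back as NaN,
# so list them before attempting the integer cast.
unmapped = sex[codes.isnull()].unique().tolist()
print(unmapped)
```

In the Titanic data the Sex column has no missing values and, judging by the results in this post, only the two expected labels, so the cast goes through.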
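Data restructuring
Build two new features from SibSp and Parch
- Family size: familysize
- Whether the passenger is travelling alone: isalone
train2.loc[:, 'SibSp']  # number of siblings and spouses
train2.loc[:, 'Parch']  # number of parents and children
train2['familysize'] = train2.loc[:, 'SibSp'] + train2.loc[:, 'Parch'] + 1
test2['familysize'] = test2.loc[:, 'SibSp'] + test2.loc[:, 'Parch'] + 1
train2.head()
train2['isalone'] = 0
train2.loc[train2['familysize'] == 1, 'isalone'] = 1
# test2 needs the same column, because the final test table selects 'isalone' too
test2['isalone'] = 0
test2.loc[test2['familysize'] == 1, 'isalone'] = 1
train2.head()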
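Because the same two features have to exist in both the training and the test table, it can help to build them in one function and apply it to each table. This is a sketch of that idea; the helper name `add_family_features` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def add_family_features(df):
    """Return a copy of df with familysize and isalone columns added."""
    out = df.copy()
    # The passenger plus siblings/spouses plus parents/children.
    out["familysize"] = out["SibSp"] + out["Parch"] + 1
    # 1 when the passenger has no relatives in the table, else 0.
    out["isalone"] = (out["familysize"] == 1).astype(int)
    return out

toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})
toy_features = add_family_features(toy)
```

Calling the helper once for the training table and once for the test table guarantees that both end up with identical feature columns.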
Final data after restructuring
train3 = train2.loc[:, ['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'familysize', 'isalone']]
train3.head()
test3 = test2.loc[:, ['PassengerId', 'Pclass', 'Sex', 'Age', 'Fare', 'familysize', 'isalone']]
test3.head()
Data analysis
Descriptive analysis
# survival rate by isalone
d = train3[['isalone', 'Survived']].groupby(['isalone']).mean()
d
# d.loc[0, 'Survived']
# death rate: not alone vs. alone
plt.bar([0, 1], [1 - d.loc[0, 'Survived'], 1 - d.loc[1, 'Survived']], 0.5, color='r', alpha=0.5)
plt.xticks([0, 1], ['notalone', 'alone'])
plt.show()
# survival rate by sex
n = train3[['Sex', 'Survived']].groupby(['Sex']).mean()
n
# death rate by sex (bar chart)
plt.bar([0, 1], [1 - n.loc[0, 'Survived'], 1 - n.loc[1, 'Survived']], 0.5, color='g', alpha=0.7)
plt.xticks([0, 1], ['female', 'male'])
plt.show()
# survival rate by ticket class
c = train3[['Pclass', 'Survived']].groupby(['Pclass']).mean()
c
# death rate for the three ticket classes (bar chart)
plt.bar([0, 1, 2], [1 - c.loc[1, 'Survived'], 1 - c.loc[2, 'Survived'], 1 - c.loc[3, 'Survived']], 0.5, color='b', alpha=0.7)
plt.xticks([0, 1, 2], [1, 2, 3])
plt.show()
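Every chart above follows the same recipe: because Survived is coded 0/1, its mean within a group is that group's survival rate, and one minus that is the death rate. A minimal sketch on made-up rows (the six rows in `toy` are illustrative, not real passengers):

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": [0, 0, 1, 1, 1, 1],        # 0 = female, 1 = male
    "Survived": [1, 1, 0, 0, 0, 1],
})

# Mean of a 0/1 column per group = fraction of survivors in that group.
survival_rate = toy.groupby("Sex")["Survived"].mean()
death_rate = 1 - survival_rate
```

Selecting the single column before calling mean() returns a Series, which can be passed to plotting functions directly as bar heights.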
# survival rate by age
age = train3[['Age', 'Survived']].groupby(['Age']).mean()
age
88 rows × 1 columns
# survival rate for each distinct age
plt.figure(2, figsize=(20, 5))
plt.bar(age.index, age['Survived'], 0.5, color='r', alpha=0.7)  # pass the column so the heights are 1-D
# plt.axis([0, 80, 0, 20])
plt.xticks(age.index, rotation=90)
plt.show()
# survival rate by fare
fare = train3[['Fare', 'Survived']].groupby(['Fare']).mean()
fare
248 rows × 1 columns
# survival rate for each distinct fare
plt.figure(2, figsize=(20, 5))
plt.bar(fare.index, fare['Survived'], 0.5, color='r', alpha=0.7)  # pass the column so the heights are 1-D
# plt.axis([0, 80, 0, 20])
plt.xticks(fare.index, rotation=90)
plt.show()
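Grouping by every distinct age or fare gives many groups (88 and 248 above) with only a few passengers in each, so the bars are noisy. One common alternative, not used in the original post, is to bin the continuous value first with pd.cut. The band edges and labels below are hypothetical choices for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [4, 9, 25, 31, 45, 70],
    "Survived": [1, 1, 0, 1, 0, 0],
})

bins = [0, 12, 40, 80]                 # hypothetical band edges (right-inclusive)
labels = ["child", "adult", "senior"]  # hypothetical band names
toy["age_band"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Survival rate per age band instead of per exact age.
band_rate = toy.groupby("age_band", observed=False)["Survived"].mean()
```

The same approach works for Fare; `pd.qcut` can be used instead when bands with roughly equal passenger counts are preferred.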
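Drawing conclusions
# passengers travelling alone have a death rate of about 70%
jieguo = pd.DataFrame(np.arange(0, 418), index=test3.loc[:, 'PassengerId'])
jieguo.loc[:, 0] = 1  # default: everyone survives
jieguo.head()
jieguo.loc[test3[test3.loc[:, 'isalone'] == 1].loc[:, 'PassengerId'].values] = 0  # travelling alone: predict died
jieguo.head()
Output the prediction
jieguo.to_csv('isalone.csv')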
# rule: all men die, all women survive, everyone in third class dies
new3 = pd.DataFrame(np.arange(0, 418), index=test3.loc[:, 'PassengerId'].values)
new3[0] = 0  # default: everyone dies
new3.head()
new3.loc[test3[test3.loc[:, 'Sex'] == 0].loc[:, 'PassengerId'].values] = 1  # women survive
new3.head()
new3.loc[test2[test2.loc[:, 'Pclass'] == 3].loc[:, 'PassengerId'].values] = 0  # third class dies
new3.head()
# write the CSV for upload
new3.to_csv('cangwei-xingbie.csv')
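The rule above (default died, women survive, third class dies) reduces to one boolean expression: a passenger is predicted to survive only if female and not in third class. The sketch below expresses that on a made-up test frame and writes the two named columns, PassengerId and Survived, that the competition's sample submission file uses; `toy_test` and `submission` are illustrative names.

```python
import numpy as np
import pandas as pd

toy_test = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895],
    "Sex": [1, 0, 1, 0],      # 0 = female, 1 = male, as encoded in this post
    "Pclass": [3, 3, 2, 1],
})

# Survive only if female AND not in third class; everyone else is predicted to die.
survives = (toy_test["Sex"] == 0) & (toy_test["Pclass"] != 3)

submission = pd.DataFrame({
    "PassengerId": toy_test["PassengerId"],
    "Survived": np.where(survives, 1, 0),
})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = submission.to_csv(index=False)
```

Passing a file name to `to_csv` instead writes the file; `index=False` keeps the row index out of the output so that only the two named columns appear.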
Machine learning modelling
train3.head()
from sklearn import neighbors
x = train3.loc[:, ['Pclass', 'Sex', 'familysize']]
y = train3.loc[:, 'Survived']  # survival label
clf = neighbors.KNeighborsClassifier(n_neighbors=20)
clf.fit(x, y)  # train the KNN model
clf
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=20, p=2, weights='uniform')
# KNN prediction
z = clf.predict(test3.loc[:, ['Pclass', 'Sex', 'familysize']])
z
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0], dtype=int64)
# build the result table, indexed by the test-set passenger IDs (892 to 1309)
s = np.arange(892, 1310)
s
results = pd.DataFrame(z, index=s)
results.head()
# write the CSV for upload
results.to_csv('Titanic_knn.csv')
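The post gets its accuracy only after uploading to Kaggle. A local estimate is possible with cross-validation on the training set, which the original does not do. The sketch below uses scikit-learn's cross_val_score with the same classifier settings. To stay self-contained, the data is synthetic and the label rule is invented, so the resulting score says nothing about the real Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for train3 (random values, for illustration only).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),      # 1..3
    "Sex": rng.integers(0, 2, n),         # 0 or 1
    "familysize": rng.integers(1, 6, n),  # 1..5
})
# Invented label rule: women outside third class survive.
y = ((X["Sex"] == 0) & (X["Pclass"] != 3)).astype(int)

clf = KNeighborsClassifier(n_neighbors=20)
scores = cross_val_score(clf, X, y, cv=5)  # accuracy on each of 5 held-out folds
mean_accuracy = scores.mean()
print(scores, mean_accuracy)
```

On the real data, the same call with `x` and `y` from the training step would allow comparing different values of `n_neighbors` or different feature sets before spending a submission.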