Kaggle Worked Example: Titanic (1)
Source: Internet. Editor: 程序博客网. Date: 2024/04/20 14:52
Competition page: https://www.kaggle.com/c/titanic/data?train.csv
Parts of this post, especially the code, are adapted from http://blog.csdn.net/han_xiaoyang/article/details/49797143
import pandas as pd
import numpy as np
from pandas import Series, DataFrame

data_train = pd.read_csv("F:/Machine Learning/kaggle/Titanic/train.csv")
data_train
The result is as follows:
data_train.info()  # DataFrame.info() prints a summary of the data (column dtypes and non-null counts)
data_train.describe()
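Because info() reports non-null counts, missing values have to be read off by subtraction. A more direct view is to count the nulls per column. The following is only a minimal sketch on a tiny hand-made frame: the values are made up for illustration and are not rows from train.csv, and the names toy and missing_per_column are hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the values are invented, not taken from train.csv.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# isnull() marks missing cells as True; sum() counts them column by column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Running the same call on data_train should show which columns need attention before modelling.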
This gives us a somewhat better picture of the data.
Next, we visualize the data:
import matplotlib.pyplot as plt

fig = plt.figure()
fig.set(alpha=0.2)  # set the figure's alpha (transparency)
fig.set_size_inches(18.5, 10.5)

plt.subplot2grid((2, 3), (0, 0))  # lay out several small plots on a 2x3 grid
data_train.Survived.value_counts().plot(kind='bar')  # bar chart of Survived
plt.title(u"rescue")
plt.ylabel(u"numbers")

plt.subplot2grid((2, 3), (0, 1))  # same treatment for Pclass
data_train.Pclass.value_counts().plot(kind='bar')
plt.title(u"Pclass distribution")
plt.ylabel(u"numbers")

plt.subplot2grid((2, 3), (0, 2))  # relationship between Survived and Age
plt.scatter(data_train.Survived, data_train.Age)
plt.grid(True, which='major', axis='y')  # passed positionally; newer matplotlib no longer accepts b=
plt.title(u"Age and rescue")
plt.ylabel(u"Age")

plt.subplot2grid((2, 3), (1, 0), colspan=2)  # age density for each passenger class
data_train.Age[data_train.Pclass == 1].plot(kind='kde')
data_train.Age[data_train.Pclass == 2].plot(kind='kde')
data_train.Age[data_train.Pclass == 3].plot(kind='kde')
plt.xlabel(u"Age")
plt.ylabel(u"density")
plt.title(u"Pclass and Age distribution")
plt.legend((u'first class', u'second class', u'third class'), loc='best')

plt.subplot2grid((2, 3), (1, 2))  # passenger counts per port of embarkation
data_train.Embarked.value_counts().plot(kind='bar')
plt.title(u"passengers per port of embarkation")
plt.ylabel(u"numbers")
plt.show()
Resulting figure:
Now let's check whether passenger class is related to survival:
fig = plt.figure()
fig.set(alpha=0.2)

# Count passengers per class, separately for those who died and those who survived
Survived_0 = data_train.Pclass[data_train.Survived == 0].value_counts()
Survived_1 = data_train.Pclass[data_train.Survived == 1].value_counts()
df = pd.DataFrame({u'rescued': Survived_1, u'unrescued': Survived_0})
df.plot(kind='bar', stacked=True, ax=fig.gca())  # draw on fig instead of opening a second, empty figure
plt.title(u"Pclass and the rescued distribution")
plt.xlabel(u"Pclass")
plt.ylabel(u"numbers")
plt.show()
There is indeed a relationship: clearly more of the Pclass == 3 passengers were not rescued.
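The stacked bars show counts; the same comparison can be expressed as a survival rate per class. This is only a sketch on made-up rows (not values from train.csv), with the hypothetical names toy and survival_rate, relying on the fact that the mean of a 0/1 column is the fraction of ones.

```python
import pandas as pd

# Invented rows for illustration only; not taken from train.csv.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
survival_rate = toy.groupby("Pclass")["Survived"].mean()
print(survival_rate)
```

Applied to data_train, the same groupby gives one number per class that can be compared directly, without reading bar heights.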
Similarly, let's look at how gender affects survival:
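fig = plt.figure()
fig.set(alpha=0.2)

# Count the Survived values (0/1) separately for men and women
Survived_male = data_train.Survived[data_train.Sex == 'male'].value_counts()
Survived_female = data_train.Survived[data_train.Sex == 'female'].value_counts()
df = pd.DataFrame({u'male': Survived_male, u'female': Survived_female})
df.plot(kind='bar', stacked=True, ax=fig.gca())  # draw on fig instead of opening a second, empty figure
plt.title(u"gender and rescue")
plt.xlabel(u"Survived (0 = no, 1 = yes)")  # the x axis is the Survived value; gender is the stacked colour
plt.ylabel(u"numbers")
plt.show()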
As the chart shows, clearly more women than men were rescued.
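A contingency table is a compact numeric counterpart to this chart. The sketch below uses pd.crosstab on invented rows (not taken from train.csv); toy, counts and rates are hypothetical names. With normalize="index", each row is divided by its own total, so the column for 1 reads as the survival rate for that gender.

```python
import pandas as pd

# Invented rows for illustration only; not taken from train.csv.
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# Raw counts: rows are genders, columns are the Survived values 0 and 1.
counts = pd.crosstab(toy["Sex"], toy["Survived"])

# Row-normalized: each row sums to 1.
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")

print(counts)
print(rates)
```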
Next, a more comprehensive chart:
fig = plt.figure()
fig.set(alpha=0.45)
fig.set_size_inches(16.5, 8.5)
fig.suptitle(u"Pclass,gender and Survive")  # figure-level title; plt.title here would create an extra axes

# sort_index() fixes the bar order as 0, 1 so that the tick labels below always match the bars

ax1 = fig.add_subplot(141)
female_high = data_train.Survived[(data_train.Sex == 'female') & (data_train.Pclass != 3)]
female_high.value_counts().sort_index().plot(kind='bar', ax=ax1, label='female,high class', color='#FA2479')
ax1.set_xticklabels([u"unsurvived", u"survived"], rotation=0)
ax1.legend([u"female/high class"], loc='best')

ax2 = fig.add_subplot(142)
female_low = data_train.Survived[(data_train.Sex == 'female') & (data_train.Pclass == 3)]
female_low.value_counts().sort_index().plot(kind='bar', ax=ax2, label='female,low class', color='#FA2479')
ax2.set_xticklabels([u"unsurvived", u"survived"], rotation=0)
ax2.legend([u"female/low class"], loc='best')

ax3 = fig.add_subplot(143, sharey=ax1)
male_high = data_train.Survived[(data_train.Sex == 'male') & (data_train.Pclass != 3)]
male_high.value_counts().sort_index().plot(kind='bar', ax=ax3, label='male,high class', color='lightblue')
ax3.set_xticklabels([u"unsurvived", u"survived"], rotation=0)
ax3.legend([u"male/high class"], loc='best')

ax4 = fig.add_subplot(144, sharey=ax1)
male_low = data_train.Survived[(data_train.Sex == 'male') & (data_train.Pclass == 3)]
male_low.value_counts().sort_index().plot(kind='bar', ax=ax4, label='male,low class', color='lightblue')
ax4.set_xticklabels([u"unsurvived", u"survived"], rotation=0)
ax4.legend([u"male/low class"], loc='best')

plt.show()
Resulting figure:
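One thing to watch when attaching fixed tick labels to a bar chart built from value_counts(): the result is ordered by frequency, not by value, so a hard-coded label list can end up on the wrong bars in a group where survivors outnumber non-survivors. The sketch below illustrates this on a made-up Series (the names survived, by_frequency and by_value are hypothetical); sorting by index gives a stable 0, 1 order.

```python
import pandas as pd

# Made-up outcomes: three survivors (1) and one non-survivor (0).
survived = pd.Series([1, 1, 1, 0])

# Default order: most frequent value first, so 1 comes before 0 here.
by_frequency = survived.value_counts()

# Sorted by the value itself: always 0 first, then 1.
by_value = survived.value_counts().sort_index()

print(list(by_frequency.index))
print(list(by_value.index))
```

With the second form, a label list written in the order 0, 1 lines up with the bars regardless of which outcome is more common.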
Next, let's inspect the Cabin attribute:
data_train.Cabin.value_counts()
C23 C25 C27 4
B96 B98 4
G6 4
E101 3
C22 C26 3
F2 3
D 3
F33 3
C124 2
C65 2
C93 2
D20 2
C83 2
B35 2
D35 2
B77 2
E33 2
D33 2
E121 2
B28 2
B51 B53 B55 2
D26 2
E25 2
B58 B60 2
C2 2
E24 2
C126 2
C68 2
D17 2
D36 2
..
D49 1
E31 1
A34 1
C70 1
C45 1
C104 1
C7 1
D9 1
C110 1
C50 1
B4 1
C46 1
D30 1
A6 1
D21 1
E34 1
D7 1
B71 1
T 1
B38 1
C111 1
E50 1
B69 1
A36 1
B79 1
D45 1
A10 1
A32 1
C49 1
C103 1
Name: Cabin, dtype: int64
It is not very clear what these values mean, so let's check whether having a Cabin value at all affects survival:
fig = plt.figure()
fig.set(alpha=0.2)

# Split passengers by whether a Cabin value is recorded, then count Survived in each group
Survived_cabin = data_train.Survived[pd.notnull(data_train.Cabin)].value_counts()
Survived_nocabin = data_train.Survived[pd.isnull(data_train.Cabin)].value_counts()
df = pd.DataFrame({u'isCabin': Survived_cabin, u'noCabin': Survived_nocabin}).transpose()
df.plot(kind='bar', stacked=True, ax=fig.gca())  # draw on fig instead of opening a second, empty figure
plt.title(u'Cabin and Survived')
plt.xlabel(u"Cabin")
plt.ylabel(u"numbers")
plt.show()
The result is as follows:
That concludes this overview of the data.
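As a numeric complement to the chart, the presence of a Cabin value can be turned into an indicator column and compared by survival rate. This is a sketch on invented rows (not taken from train.csv); the column name CabinKnown and the variables toy and rate_by_cabin are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only; not taken from train.csv.
toy = pd.DataFrame({
    "Cabin":    ["C85", np.nan, "E46", np.nan, np.nan, "B28"],
    "Survived": [1, 0, 1, 0, 1, 0],
})

# Indicator: "yes" if a cabin is recorded, "no" if the value is missing.
toy["CabinKnown"] = toy["Cabin"].notnull().map({True: "yes", False: "no"})

# Survival rate within each group (mean of the 0/1 Survived column).
rate_by_cabin = toy.groupby("CabinKnown")["Survived"].mean()
print(rate_by_cabin)
```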