Kaggle Beginner Tutorial (Part 1)
Source: Internet. Date: 2024/03/29 15:26
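While working through the Kaggle tutorial on DATAQUEST, I found some of the data preprocessing code very practical. It is also written with pandas, which I had not used before, so I am recording it here. Original link: https://www.dataquest.io/mission/74/getting-started-with-kaggle
The problem this tutorial tackles is Titanic, at https://www.kaggle.com/c/titanic. The problem is fairly simple, and I may come back to it later to learn more coding techniques.
For basic pandas usage, see http://pandas.pydata.org/pandas-docs/stable/10min.html
The first step is to read the .csv file and then use .describe() to compute some basic statistics.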
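import pandas

# We can use the pandas library in Python to read in the csv file.
# This creates a pandas dataframe and assigns it to the titanic variable.
titanic = pandas.read_csv("titanic_train.csv")

# Print the first 5 rows of the dataframe.
print(titanic.head(5))
print(titanic.describe())

The statistics show that some values are missing and that some columns are of little use. At this point we need to decide which data to use, which to fill in, and which to discard, based on a common-sense understanding of the problem. For example, in this problem Name has little influence on survival and is hard to process, so it is dropped.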
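To see exactly which columns have gaps, one option is to count the missing entries per column. The sketch below uses a tiny made-up frame (the values and the `sample` name are assumptions for illustration, not the real Titanic data):

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the Titanic training data.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Number of missing entries in each column.
missing_counts = sample.isnull().sum()
print(missing_counts)

# Names of the columns that need filling before modelling.
columns_to_fill = missing_counts[missing_counts > 0].index.tolist()
print(columns_to_fill)
```

Running the same two lines on the real `titanic` dataframe would list the columns that need attention before training.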
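Age, on the other hand, is missing some values, so we need to fill them in. The missing values are filled with the median of the column.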
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
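As a small illustration of why the median can be a reasonable fill value, the sketch below compares it with the mean on a made-up series containing one large value (the numbers are assumptions chosen for the example):

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing entry and one unusually large value.
ages = pd.Series([20.0, 30.0, np.nan, 80.0, 22.0])

# Both statistics skip NaN by default.
median_age = ages.median()  # middle of the known values 20, 22, 30, 80
mean_age = ages.mean()      # pulled upward by the value 80

# Fill the gap with the median, as the tutorial code does.
filled = ages.fillna(median_age)
print(median_age, mean_age)
print(filled.tolist())
```

The median is less affected by the single large value, which is one common reason to prefer it for skewed columns.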
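Next, the non-numeric columns have to be converted to numbers so they can be used for machine learning. Printing .unique() shows every distinct text value in a column, so that none is overlooked. Here male is mapped to 0 and female to 1.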
# Find all the unique genders -- the column appears to contain only male and female.
print(titanic["Sex"].unique())

# Replace all the occurrences of male with the number 0.
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
# Replace all the occurrences of female with the number 1.
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
# Find all the unique values for "Embarked".
print(titanic["Embarked"].unique())

# Fill missing ports with "S", then encode S, C and Q as 0, 1 and 2.
titanic["Embarked"] = titanic["Embarked"].fillna("S")
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
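The same encoding can also be written with a dictionary and `Series.map`, which keeps the whole mapping in one place. This is an alternative sketch on made-up rows, not the approach used in the original tutorial; `sex_codes` and `embarked_codes` are names introduced here for illustration:

```python
import pandas as pd

# Made-up rows standing in for the Titanic data.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", None, "Q"],
})

# Hypothetical lookup tables using the same codes as the text above.
sex_codes = {"male": 0, "female": 1}
embarked_codes = {"S": 0, "C": 1, "Q": 2}

# map() replaces each value by its dictionary entry.
sample["Sex"] = sample["Sex"].map(sex_codes)
# Fill the missing port with "S" first, then encode.
sample["Embarked"] = sample["Embarked"].fillna("S").map(embarked_codes)
print(sample)
```

One thing to watch for: any value that is not a key in the dictionary becomes NaN after `map`, so checking `.unique()` beforehand is still worthwhile.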
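At this point the preprocessing is essentially done and the modelling part begins. The original article covers some basics of linear regression and cross validation, which I will not repeat here. We use the scikit-learn library to make predictions and generate a prediction file.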
# Import the linear regression class
from sklearn.linear_model import LinearRegression
# Sklearn also has a helper that makes it easy to do cross validation.
# KFold lives in sklearn.model_selection; the older sklearn.cross_validation module was removed in scikit-learn 0.20.
from sklearn.model_selection import KFold

# The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm class
alg = LinearRegression()
# Generate cross validation folds for the titanic dataset. Each split returns the row indices corresponding to train and test.
# Without shuffling, the folds are consecutive blocks of rows, so we get the same splits every time we run this.
kf = KFold(n_splits=3)

predictions = []
for train, test in kf.split(titanic):
    # The predictors we're using to train the algorithm. Note how we only take the rows in the train folds.
    train_predictors = titanic[predictors].iloc[train, :]
    # The target we're using to train the algorithm.
    train_target = titanic["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_predictions)
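To make the fold indices concrete, here is a minimal sketch of what a 3-fold split yields on six made-up rows. It assumes a recent scikit-learn in which `KFold` is imported from `sklearn.model_selection` and folds are produced by `split()`:

```python
import numpy as np
from sklearn.model_selection import KFold

# Six made-up rows; only the number of rows matters for the split.
X = np.arange(6).reshape(6, 1)

# Three folds without shuffling: each test fold is a consecutive block.
kf = KFold(n_splits=3)
folds = [(train.tolist(), test.tolist()) for train, test in kf.split(X)]

for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every row lands in exactly one test fold, which is why concatenating the per-fold predictions gives one prediction per row of the training data.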
Next, compute the accuracy of these predictions.
import numpy as np

# The predictions are in three separate numpy arrays. Concatenate them into one.
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)

# Map predictions to outcomes (only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0

# Count the predictions that match the true labels, then divide by the total to get the accuracy.
result = sum(np.array(titanic["Survived"] == predictions))
accuracy = result / len(predictions)
print(accuracy)
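The thresholding and accuracy steps can be checked by hand on a few made-up numbers. In this sketch `raw` and `actual` are hypothetical arrays, not output from the model:

```python
import numpy as np

# Made-up regression outputs and the matching true labels.
raw = np.array([0.2, 0.7, 0.5, 0.9])
actual = np.array([0, 1, 1, 1])

# Values strictly above 0.5 become 1, everything else becomes 0.
labels = (raw > 0.5).astype(int)

# Accuracy is the fraction of positions where prediction and truth agree.
accuracy = (labels == actual).mean()
print(labels, accuracy)
```

Here three of the four labels match, so the accuracy is 0.75. Note that a raw value of exactly 0.5 is mapped to 0, as in the code above.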
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Initialize our algorithm
alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds. (much simpler than what we did before!)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
Finally, apply the same data preprocessing to the test set, following the steps above.
titanic_test = pandas.read_csv("titanic_test.csv")

# Fill missing ages with the median age from the training set.
titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())

titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1

titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")
titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2

# Fill missing fares with the median fare from the training set.
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic["Fare"].median())
Generate our submission file:
# Initialize the algorithm class
alg = LogisticRegression(random_state=1)

# Train the algorithm using all the training data
alg.fit(titanic[predictors], titanic["Survived"])

# Make predictions using the test set.
predictions = alg.predict(titanic_test[predictors])

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pandas.DataFrame({
    "PassengerId": titanic_test["PassengerId"],
    "Survived": predictions
})
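The code above builds the dataframe but does not write it out. A minimal sketch of the last step, using made-up passenger IDs and predictions, might look like this (the file name mentioned in the comment is a hypothetical choice, not something from the original tutorial):

```python
import pandas as pd

# Made-up stand-in for the submission dataframe built above.
submission = pd.DataFrame({
    "PassengerId": [892, 893],
    "Survived": [0, 1],
})

# With no path, to_csv returns the CSV text; index=False leaves out the row index.
csv_text = submission.to_csv(index=False)
print(csv_text)

# To write a file instead, pass a path, for example:
# submission.to_csv("submission.csv", index=False)  # hypothetical file name
```

Leaving out the row index matters because the uploaded file should contain only the two columns in the dataframe.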