[Practice] Bike Rental Prediction
Source: Internet  Editor: 程序博客网  Time: 2024/04/27 23:50
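Understanding the data
This is a city bike rental system. The data covers two years of hourly bike rental records from Washington. The training set consists of the first 19 days of each month, and the test set consists of the days from the 20th onward (which we have to predict ourselves). Data source: the Kaggle bike rental prediction competition.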
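The split rule described above can be sketched with the standard library alone. The month used here (January 2011) and the helper name `is_training_hour` are assumptions chosen for illustration; the real files may also be missing some hours, which this sketch ignores.

```python
from datetime import datetime, timedelta


def is_training_hour(ts):
    """Return True if the timestamp falls in the first 19 days of its month."""
    return ts.day <= 19


# Hypothetical example: every hour of a 31-day month.
start = datetime(2011, 1, 1)
hours = [start + timedelta(hours=h) for h in range(31 * 24)]

train_hours = [ts for ts in hours if is_training_hour(ts)]
test_hours = [ts for ts in hours if not is_training_hour(ts)]

print(len(train_hours), len(test_hours))  # 456 288
```

Because the test rows always come from later days of the month than the training rows, the day of the month itself carries no usable signal at prediction time.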
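The fields are described as follows:
(1) datetime: the date, given as year-month-day hour.
(2) season: the season. 1 is spring, 2 is summer, 3 is fall, 4 is winter.
(3) holiday: whether the day is a holiday. 1 means yes, 0 means no.
(4) workingday: whether the day is a working day. 1 means yes, 0 means no.
(5) weather: the weather.
    1: clear, few clouds, or partly cloudy.
    2: mist with clouds, wind, etc.
    3: light snow or light rain, lightning, and cloudy.
    4: heavy rain, hail, lightning with heavy fog, or heavy snow.
(6) temp: temperature in degrees Celsius.
(7) atemp: the "feels like" temperature.
(8) humidity: humidity.
(9) windspeed: wind speed.
(10) casual: number of rentals by casual (non-registered) users.
(11) registered: number of rentals by registered users.
(12) count: total number of rentals, i.e. casual + registered.
Items 10 to 12 are not features; item 12 is the value we need to predict.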
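To make the relationship between items (10) to (12) concrete, here is a minimal sketch on a few hand-made rows. The numbers are invented for illustration and are not taken from the real file.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only.
sample = pd.DataFrame({
    "temp": [9.84, 9.02, 9.02],
    "casual": [3, 8, 5],
    "registered": [13, 32, 27],
    "count": [16, 40, 32],
})

# The target is the sum of the two user groups, so keeping either column
# as a feature would leak the answer into the model.
sums_match = bool((sample["casual"] + sample["registered"] == sample["count"]).all())

features = sample.drop(columns=["casual", "registered", "count"])
target = sample["count"]

print(sums_match)              # True
print(list(features.columns))  # ['temp']
```

This is why casual and registered are excluded from the feature set even though they are present in the training file.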
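Data preprocessing
Import the data analysis packages, embed matplotlib charts directly in the Notebook, read the training data, look at the first ten rows of the training set, and check the column types and the size of the dataset.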
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Read the training set and preview the first ten rows.
df_train = pd.read_csv('kaggle_bike_competition_train.csv', header=0)
df_train.head(10)
print("Column names and types:", '\n', df_train.dtypes)
print("Dataset size:", '\n', df_train.shape)
print("Non-null counts per column:", '\n', df_train.count())
Column names and types:
datetime       object
season          int64
holiday         int64
workingday      int64
weather         int64
temp          float64
atemp         float64
humidity        int64
windspeed     float64
casual          int64
registered      int64
count           int64
dtype: object
Dataset size:
(10886, 12)
Non-null counts per column:
datetime      10886
season        10886
holiday       10886
workingday    10886
weather       10886
temp          10886
atemp         10886
humidity      10886
windspeed     10886
casual        10886
registered    10886
count         10886
dtype: int64
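The training set has 10886 samples and 12 columns, with no missing values.
The datetime column packs in a lot of information, so we pull the month, the day of the week, and the hour out into three separate columns, then drop the columns that are of no use for training the model.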
# Split the timestamp into month, day of the week (0 = Monday), and hour.
dt_index = pd.DatetimeIndex(df_train.datetime)
df_train['month'] = dt_index.month
df_train['day'] = dt_index.dayofweek
df_train['hour'] = dt_index.hour

df_train_origin = df_train  # keep the original dataset
# Drop the raw timestamp and the two columns that sum to the target.
df_train = df_train.drop(['datetime', 'casual', 'registered'], axis=1)
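As a small check of what these three columns contain, the sketch below applies the same extraction to two hand-written timestamps using the pandas `.dt` accessor, which is an alternative to `pd.DatetimeIndex`. The timestamps are assumptions chosen for illustration. Note that the `day` column holds the day of the week, not the day of the month.

```python
import pandas as pd

# Two illustrative timestamps: a Saturday at midnight and a Monday at 17:00.
sample = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2011-01-03 17:00:00"]})

parsed = pd.to_datetime(sample["datetime"])
sample["month"] = parsed.dt.month
sample["day"] = parsed.dt.dayofweek  # 0 = Monday ... 6 = Sunday
sample["hour"] = parsed.dt.hour

extracted = sample[["month", "day", "hour"]].values.tolist()
print(extracted)  # [[1, 5, 0], [1, 0, 17]]
```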
Split the dataset into two parts:
1. df_train_target: the target, i.e. the count column.
2. df_train_data: the data used to produce the features.
df_train_target = df_train['count'].values
df_train_data = df_train.drop(['count'], axis=1).values
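Feature engineering
Much of the work in applying machine learning algorithms is parameter tuning; different settings give different results (for example the regularization coefficient, the depth and number of trees for decision-tree-based algorithms, the distance metric, and so on).
We use cross-validation (the validation split holds about 20% of the data) to see how the models perform. We try Support Vector Regression, Ridge Regression, and Random Forest Regressor. Each model is run 3 times so that we can look at the average result.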
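The splitting scheme described above (repeated random splits with 20% held out each time) can be illustrated on a tiny array. This is a minimal sketch that assumes the `ShuffleSplit` class from `sklearn.model_selection`; the ten-row array `X_demo` is made up for the example.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Ten dummy samples stand in for the real feature matrix.
X_demo = np.arange(10).reshape(-1, 1)

# Three independent random splits, each holding out 20% of the rows.
splitter = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

split_sizes = []
for train_idx, test_idx in splitter.split(X_demo):
    # Within one split, no row is in both the training and the held-out part.
    assert set(train_idx).isdisjoint(test_idx)
    split_sizes.append((len(train_idx), len(test_idx)))

print(split_sizes)  # [(8, 2), (8, 2), (8, 2)]
```

Unlike k-fold cross-validation, the held-out parts of different splits are drawn independently, so the same row may be held out in more than one run.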
from sklearn import linear_model
from sklearn import svm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ShuffleSplit, learning_curve, train_test_split
from sklearn.metrics import explained_variance_score

# Three random splits of the data, each holding out 20% for testing.
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

# For regressors, score() returns the R^2 coefficient of determination.
print("Ridge regression")
for train, test in cv.split(df_train_data):
    model = linear_model.Ridge().fit(df_train_data[train], df_train_target[train])
    print("train score: {0:.3f}, test score: {1:.3f}\n".format(
        model.score(df_train_data[train], df_train_target[train]),
        model.score(df_train_data[test], df_train_target[test])))

print("Support vector regression/SVR(kernel='rbf',C=10,gamma=.001)")
for train, test in cv.split(df_train_data):
    model = svm.SVR(kernel='rbf', C=10, gamma=.001).fit(df_train_data[train], df_train_target[train])
    print("train score: {0:.3f}, test score: {1:.3f}\n".format(
        model.score(df_train_data[train], df_train_target[train]),
        model.score(df_train_data[test], df_train_target[test])))

print("Random forest regression/Random Forest(n_estimators = 100)")
for train, test in cv.split(df_train_data):
    model = RandomForestRegressor(n_estimators=100).fit(df_train_data[train], df_train_target[train])
    print("train score: {0:.3f}, test score: {1:.3f}\n".format(
        model.score(df_train_data[train], df_train_target[train]),
        model.score(df_train_data[test], df_train_target[test])))
Ridge regression
train score: 0.339, test score: 0.332
train score: 0.330, test score: 0.370
train score: 0.342, test score: 0.320
Support vector regression/SVR(kernel='rbf',C=10,gamma=.001)
train score: 0.417, test score: 0.408
train score: 0.406, test score: 0.452
train score: 0.419, test score: 0.390
Random forest regression/Random Forest(n_estimators = 100)
train score: 0.981, test score: 0.866
train score: 0.981, test score: 0.880
train score: 0.981, test score: 0.870
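The scores above come from the `score` method of scikit-learn regressors, which returns the R² coefficient of determination. As a rough guide to reading them, here is a minimal pure-Python sketch of that formula; the function name `r2_score_manual` and the sample values are made up for illustration.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


actual = [10, 20, 30, 40]

perfect = r2_score_manual(actual, actual)             # every prediction exact
baseline = r2_score_manual(actual, [25, 25, 25, 25])  # always predict the mean
close = r2_score_manual(actual, [12, 18, 33, 37])     # small errors

print(perfect, baseline, round(close, 3))  # 1.0 0.0 0.948
```

On this scale, a score of about 0.33 for ridge regression means the model explains only about a third of the variance in the rental counts, while the random forest's test score of about 0.87 explains most of it.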
Model tuning
Random forest regression gave the best results, so we use GridSearchCV to look for the best parameters. This takes roughly 2 minutes.
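Most of the running time comes from the number of models the search has to fit. The sketch below counts them for a one-parameter grid with three values and 5-fold cross-validation, matching the search in this section. It assumes `ParameterGrid` from `sklearn.model_selection`; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

tuned_parameters = [{'n_estimators': [10, 100, 500]}]
n_folds = 5

# Every parameter combination is fitted once per fold.
candidates = list(ParameterGrid(tuned_parameters))
n_search_fits = len(candidates) * n_folds

print(candidates)     # [{'n_estimators': 10}, {'n_estimators': 100}, {'n_estimators': 500}]
print(n_search_fits)  # 15
```

With the default `refit=True`, GridSearchCV fits the best candidate one more time on the whole training portion after the search finishes.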
from sklearn.model_selection import GridSearchCV, train_test_split

X = df_train_data
y = df_train_target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

tuned_parameters = [{'n_estimators': [10, 100, 500]}]

scores = ['r2']

for score in scores:
    print(score)
    clf = GridSearchCV(RandomForestRegressor(), tuned_parameters, cv=5, scoring=score)
    clf.fit(X_train, y_train)
    # Best model found by the search
    print(clf.best_estimator_)
    print("")
    print("Scores:")
    results = clf.cv_results_
    for params, mean_score, std_score in zip(
            results['params'], results['mean_test_score'], results['std_test_score']):
        print("%0.3f (+/-%0.03f) for %r" % (mean_score, std_score / 2, params))
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=500, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
Scores:
0.846 (+/-0.008) for {'n_estimators': 10}
0.862 (+/-0.006) for {'n_estimators': 100}
0.863 (+/-0.005) for {'n_estimators': 500}
Next, look at the model's learning curve to check whether it is overfitting or underfitting.
from sklearn.model_selection import ShuffleSplit, learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    # Fit the estimator on growing subsets of the data and score each one.
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    # Shaded bands show one standard deviation around each mean curve.
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt


title = "Learning Curves (Random Forest, n_estimators = 100)"
# Ten random splits, each holding out 20% for validation.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
estimator = RandomForestRegressor(n_estimators=100)
plot_learning_curve(estimator, title, X, y, (0.0, 1.01), cv=cv, n_jobs=4)

plt.show()
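Random forests have a strong capacity to learn. The plot shows a fairly large gap between the training score and the cross-validation score, so the overfitting is quite noticeable. We tried to reduce it by limiting max_features and max_depth, but the effect was modest: the training score drops only from about 0.98 to about 0.97, and the test score stays near 0.87.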
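from sklearn.model_selection import ShuffleSplit

# Same ten random 80/20 splits as used for the learning curve.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

print("Random forest regression/Random Forest(n_estimators=200, max_features=0.6, max_depth=15)")
for train, test in cv.split(df_train_data):
    model = RandomForestRegressor(n_estimators=200, max_features=0.6, max_depth=15).fit(
        df_train_data[train], df_train_target[train])
    print("train score: {0:.3f}, test score: {1:.3f}\n".format(
        model.score(df_train_data[train], df_train_target[train]),
        model.score(df_train_data[test], df_train_target[test])))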
Random forest regression/Random Forest(n_estimators=200, max_features=0.6, max_depth=15)
train score: 0.965, test score: 0.870
train score: 0.966, test score: 0.885
train score: 0.965, test score: 0.872
train score: 0.965, test score: 0.877
train score: 0.967, test score: 0.870
train score: 0.965, test score: 0.872
train score: 0.966, test score: 0.864
train score: 0.966, test score: 0.873
train score: 0.965, test score: 0.873
train score: 0.966, test score: 0.870
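To see how the gap between training and test scores responds to tree depth, the sketch below repeats the same kind of comparison on synthetic data. It is not the bike dataset: `make_friedman1`, the sample size, the noise level, and the depth values are all assumptions chosen for illustration, so the numbers it prints say nothing about the results above.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic, noisy regression data (illustrative only).
X_demo, y_demo = make_friedman1(n_samples=500, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

gaps = {}
for depth in (None, 8, 4):  # None means the trees grow until the leaves are pure
    forest = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=0)
    forest.fit(X_tr, y_tr)
    train_score = forest.score(X_tr, y_tr)
    test_score = forest.score(X_te, y_te)
    gaps[depth] = train_score - test_score
    print("max_depth={0}: train {1:.3f}, test {2:.3f}, gap {3:.3f}".format(
        depth, train_score, test_score, gaps[depth]))
```

A smaller gap is only useful if the test score holds up. Limiting depth usually lowers the training score first, and that is the pattern in the results above.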