Kaggle Case Studies with R: Learning Notes (5)


Drug Store Sales Forecasting

Outline of this case study:

1. Introduction to xgboost theory

2. Parameters of the xgboost functions in R

3. Case background

4. Data preprocessing

5. R implementation of the xgboost model

1. Introduction to xgboost theory

For this part I will simply point to some excellent write-ups on xgboost. The blog posts linked below cover both the underlying theory and the function parameters:

       http://blog.csdn.net/a819825294/article/details/51206410

      http://blog.csdn.net/sb19931201/article/details/52557382

      http://blog.csdn.net/sb19931201/article/details/52577592

2. Parameters of the xgboost functions in R

The parameters of the R xgboost package fall into three groups: general parameters, booster parameters, and task parameters. General parameters select the type of base learner, tree-based or linear; booster parameters depend on which base learner was chosen; task parameters depend on the learning scenario.
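As a rough sketch, the three groups can be thought of as one flat parameter list in which each key belongs to one of the groups. The values below are illustrative only, not taken from the case study:

```python
# Illustrative grouping of xgboost parameters (example values only).
general_params = {
    "booster": "gbtree",   # choose a tree or linear base learner
    "silent": 0,           # print running messages
    "nthread": 4,          # number of threads
}
booster_params = {         # depend on the booster chosen above
    "eta": 0.1,
    "max_depth": 6,
}
task_params = {            # depend on the learning scenario
    "objective": "reg:linear",
    "eval_metric": "rmse",
}

# xgboost accepts them merged into a single parameter list:
params = {**general_params, **booster_params, **task_params}
print(sorted(params))
```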

General parameters:

booster [default=gbtree]
Selects the base learner (gbtree for trees, gblinear for linear models).
silent [default=0]
Set to 1 to suppress running messages; keeping it at 0 is usually preferable.
nthread [defaults to the maximum number of threads available if not set]
Number of threads.
num_pbuffer [set automatically by xgboost, no need to be set by the user]
Size of the prediction buffer.
num_feature [set automatically by xgboost, no need to be set by the user]
Feature dimension.

Booster parameters:

(1) Tree booster parameters

eta [default=0.3]
Learning rate; usually set to a small value.
range: [0,1]
gamma [default=0]
Minimum loss reduction required to keep a split (controls post-pruning); the larger the value, the more conservative the algorithm.
range: [0,∞]
max_depth [default=6]
Maximum depth of a tree.
range: [1,∞]
min_child_weight [default=1]
Minimum sum of the second derivatives (h) required in a leaf. For 0-1 classification with imbalanced classes, if h is around 0.01, min_child_weight = 1 means a leaf must contain at least about 100 samples. This parameter strongly affects the result: the smaller its value, the easier it is to overfit.
range: [0,∞]
max_delta_step [default=0]
Acts during the update step: 0 means no constraint, while a positive value makes the update more conservative by capping the step size.
range: [0,∞]
subsample [default=1]
Row subsampling ratio. Lower values make the algorithm more conservative and help prevent overfitting, but values that are too small cause underfitting.
range: (0,1]
colsample_bytree [default=1]
Column subsampling ratio for the features used to grow each tree; typically set to 0.5-1.
range: (0,1]
lambda [default=1]
L2 regularization on the weights.
alpha [default=0]
L1 regularization on the weights.
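The min_child_weight arithmetic above can be checked with a quick pure-Python sketch (not xgboost itself): for the logistic loss, each instance contributes a hessian of p(1 - p), so with predictions near p = 0.01 a leaf needs roughly 1/0.0099, about 100 instances, to reach min_child_weight = 1:

```python
# Hessian of the logistic loss for a single instance: h = p * (1 - p).
def logistic_hessian(p):
    return p * (1.0 - p)

# With predicted probabilities around 0.01 (imbalanced 0-1 classification),
# each instance contributes roughly h = 0.0099 to the leaf's hessian sum.
h = logistic_hessian(0.01)

# Minimum number of such instances a leaf needs so that the hessian sum
# reaches min_child_weight = 1:
min_child_weight = 1.0
min_leaf_size = min_child_weight / h
print(round(min_leaf_size))  # on the order of 100 instances
```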

(2) Linear booster parameters

lambda [default=0]
L2 regularization on the weights.
alpha [default=0]
L1 regularization on the weights.
lambda_bias [default=0]
L2 regularization on the bias term (there is no L1 regularization on the bias because it is not important).

Task parameters:

objective [default=reg:linear]   Specifies the loss function to minimize. Common options:
"reg:linear": linear regression
"reg:logistic": logistic regression
"binary:logistic": logistic regression for binary classification, outputs a probability
"binary:logitraw": logistic regression for binary classification, outputs the score before the logistic transformation
"count:poisson": Poisson regression for count data, outputs the mean of the Poisson distribution (max_delta_step is set to 0.7 by default in Poisson regression, to safeguard optimization)
"multi:softmax": multiclass classification using the softmax objective; num_class (the number of classes) must also be set
"multi:softprob": same as softmax, but outputs a vector of ndata * nclass values, which can be reshaped to an ndata x nclass matrix containing the predicted probability of each data point belonging to each class
"rank:pairwise": ranking by minimizing the pairwise loss
base_score [default=0.5]
Initial prediction score of all instances (global bias).
eval_metric [default according to objective]   Evaluation metric for validation data; a default is assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking). Multiple evaluation metrics can be added; Python users should pass them as a list of parameter pairs rather than a map, so that a later 'eval_metric' does not override an earlier one. The choices are:
"rmse": root mean square error
"logloss": negative log-likelihood
"error": binary classification error rate, computed as #(wrong cases)/#(all cases); predictions greater than 0.5 are treated as positive instances, the rest as negative
"merror": multiclass classification error rate, computed as #(wrong cases)/#(all cases)
"mlogloss": multiclass logloss
"auc": area under the curve, for ranking evaluation
"ndcg": normalized discounted cumulative gain
"map": mean average precision
"ndcg@n", "map@n": n can be set to an integer to cut off the top positions in the lists for evaluation
"ndcg-", "map-", "ndcg@n-", "map@n-": by default NDCG and MAP score a list without any positive samples as 1; adding "-" makes xgboost score such lists as 0, for consistency under some conditions
seed [default=0]
Random number seed.
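Three of the metrics above (rmse, error, logloss) are simple enough to compute by hand. A small pure-Python check on a toy binary-classification example, with hypothetical labels and predicted probabilities:

```python
import math

# Toy binary-classification data (hypothetical values).
labels = [1, 0, 1, 1]
preds  = [0.9, 0.4, 0.3, 0.8]  # predicted probabilities

# "rmse": root mean square error
rmse = math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels))

# "error": predictions above 0.5 count as positive; rate of wrong cases
wrong = sum((p > 0.5) != (y == 1) for p, y in zip(preds, labels))
error = wrong / len(labels)

# "logloss": negative log-likelihood
logloss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
               for p, y in zip(preds, labels)) / len(labels)

print(round(rmse, 4), error, round(logloss, 4))
```

Here only the third prediction (0.3 for a positive instance) is misclassified, so the error rate is 1/4.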

These are almost identical to the Python xgboost parameters listed by the blogger in the second link of the theory section; in other words, the R xgboost parameters are the same as Python's.

3. Case background

Rossmann operates over 3,000 drug stores across 7 European countries. Rossmann store managers are currently tasked with predicting their daily sales up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and location. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of the results can vary widely.

The fields of the data provided by Kaggle are as follows:

Id: an Id that represents a (Store, Date) duple within the test set

Store: a unique Id for each store

Sales: the turnover for any given day (this is the target to predict)

Customers: the number of customers on a given day

Open: an indicator for whether the store was open: 0 = closed, 1 = open

StateHoliday: indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = none

SchoolHoliday: indicates if the (Store, Date) was affected by the closure of public schools

StoreType: differentiates between 4 different store models: a, b, c, d

Assortment: describes an assortment level: a = basic, b = extra, c = extended

CompetitionDistance: distance in meters to the nearest competitor store

CompetitionOpenSince[Month/Year]: gives the approximate year and month when the nearest competitor was opened

Promo: indicates whether a store is running a promo on that day

Promo2: a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating

Promo2Since[Year/Week]: describes the year and calendar week when the store started participating in Promo2

PromoInterval: describes the consecutive intervals in which Promo2 is started, naming the months the promotion starts anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, and November of any given year for that store

4. Data preprocessing

Since this case study focuses on the xgboost model itself, relatively little preprocessing and feature engineering is done. Only two steps are performed. First, Kaggle provides the store attribute data in a separate file from the training and test sets, so the store data must be merged with the train and test sets by column. Second, the data must be converted into the format xgboost requires, using xgb.DMatrix from the R xgboost package.
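The merge step can be sketched in miniature: each sales row is joined with the attribute row of the same Store. A pure-Python toy illustration (the row values are hypothetical, only the field names follow the dataset):

```python
# Toy sketch of the merge step: attach store attributes to each sales row
# by the shared Store key, mirroring merge(train, store) in R.
train_rows = [
    {"Store": 1, "Sales": 5263},
    {"Store": 2, "Sales": 6064},
    {"Store": 1, "Sales": 4545},
]
store_rows = [
    {"Store": 1, "StoreType": "c", "CompetitionDistance": 1270},
    {"Store": 2, "StoreType": "a", "CompetitionDistance": 570},
]

# Index the store attributes by Store id, then join row by row.
store_by_id = {s["Store"]: s for s in store_rows}
merged = [{**row, **store_by_id[row["Store"]]} for row in train_rows]
print(merged[0]["StoreType"], merged[2]["CompetitionDistance"])
```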

5. Code implementation

Data download: https://www.kaggle.com/c/rossmann-store-sales/data

library(readr)
library(xgboost)
library(lubridate)

train <- read.csv('D:/R语言kaggle案例实战/Kaggle第五节课/train.csv')
test  <- read.csv('D:/R语言kaggle案例实战/Kaggle第五节课/test.csv')
store <- read.csv('D:/R语言kaggle案例实战/Kaggle第五节课/store.csv')  # store holds additional store attributes

train <- merge(train, store)  # join store attributes onto the training set
test  <- merge(test, store)   # join store attributes onto the test set

train$Date <- as.POSIXct(train$Date)  # convert the date strings to date-time format
test$Date  <- as.POSIXct(test$Date)

train[is.na(train)] <- 0  # replace missing values with zero
test[is.na(test)]   <- 0

train <- train[which(train$Open == '1'), ]   # keep only days when the store was open
train <- train[which(train$Sales != '0'), ]  # and with non-zero sales

train$month <- month(train$Date)  # extract the month
train$year  <- year(train$Date)   # extract the year
train$day   <- day(train$Date)    # extract the day

train <- train[, -c(3, 8)]  # drop the date column and a column with many missing values
test  <- test[, -c(4, 7)]

feature.names <- names(train)[c(1, 2, 5:19)]  # keep the columns shared by train and test so their structures match

# encode character columns as integers, using a level set shared between train and test
for (f in feature.names) {
  if (class(train[[f]]) == "character") {
    levels <- unique(c(train[[f]], test[[f]]))
    train[[f]] <- as.integer(factor(train[[f]], levels = levels))
    test[[f]]  <- as.integer(factor(test[[f]], levels = levels))
  }
}

tra <- train[, feature.names]

# custom evaluation function: the competition's official RMSPE metric, used as feval in xgboost
RMPSE <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  elab   <- exp(as.numeric(labels)) - 1  # labels are log(Sales + 1); map back to the sales scale
  epreds <- exp(as.numeric(preds)) - 1
  err    <- sqrt(mean((epreds / elab - 1)^2))
  return(list(metric = "RMPSE", value = err))
}

h <- sample(nrow(train), 10000)  # hold out 10,000 rows for validation

dval   <- xgb.DMatrix(data = data.matrix(tra[h, ]),  label = log(train$Sales + 1)[h])   # validation set for the watchlist
dtrain <- xgb.DMatrix(data = data.matrix(tra[-h, ]), label = log(train$Sales + 1)[-h])  # xgboost-specific matrix format

watchlist <- list(val = dval, train = dtrain)  # monitors model performance on each boosting round

param <- list(objective = "reg:linear",
              booster = "gbtree",
              eta = 0.02,
              max_depth = 12,
              subsample = 0.9,
              colsample_bytree = 0.7,
              num_parallel_tree = 2,
              alpha = 0.0001,
              lambda = 1)

clf <- xgb.train(params = param,
                 data = dtrain,
                 nrounds = 3000,
                 verbose = 0,
                 early.stop.round = 100,
                 watchlist = watchlist,
                 maximize = FALSE,
                 feval = RMPSE)

ptest <- predict(clf, xgb.DMatrix(data.matrix(test[, feature.names])), outputmargin = TRUE)
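The RMPSE evaluation function above expects both labels and predictions on the log(Sales + 1) scale and maps them back with exp(x) - 1 before computing the root mean square percentage error. A pure-Python cross-check of that logic, using a few hypothetical sales figures:

```python
import math

# Cross-check of the RMPSE feval: labels and predictions live on the
# log(Sales + 1) scale; both are mapped back with exp(x) - 1 before
# computing the root mean square percentage error.
def rmspe(preds_log, labels_log):
    preds = [math.exp(p) - 1 for p in preds_log]
    labels = [math.exp(l) - 1 for l in labels_log]
    return math.sqrt(sum((p / l - 1) ** 2
                         for p, l in zip(preds, labels)) / len(labels))

sales = [5263.0, 6064.0, 8314.0]             # toy sales figures
labels_log = [math.log(s + 1) for s in sales]

# Hypothetical imperfect predictions, given on the sales scale and
# transformed to the log scale the model would actually output:
preds_log = [math.log(p + 1) for p in [5000.0, 6100.0, 8000.0]]
print(round(rmspe(preds_log, labels_log), 4))
```

A perfect prediction gives an error of (numerically) zero, which is why smaller RMPSE is better and `maximize = FALSE` is set in xgb.train.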
