Kaggle Case Study Notes in R (Part 5)
Drugstore Sales Forecasting
Outline of this case study:
1. Introduction to xgboost theory
2. Parameters of the xgboost-related functions in R
3. Case background
4. Data preprocessing
5. R implementation of the xgb model
1. Introduction to xgboost theory
For this part I simply link to some excellent existing write-ups on xgboost. The following blog posts cover both the underlying theory and the function parameters:
http://blog.csdn.net/a819825294/article/details/51206410
http://blog.csdn.net/sb19931201/article/details/52557382
http://blog.csdn.net/sb19931201/article/details/52577592
2. Parameters of the xgboost-related functions in R
The parameters of the R xgboost package fall into three groups: general parameters, model parameters, and task parameters. The general parameters select the type of booster (tree model or linear model); the model parameters depend on which booster the general parameters select; the task parameters depend on the learning scenario.
General parameters:
booster [default=gbtree]
Selects the type of base learner (tree or linear model).
silent [default=0]
Set to 1 to suppress running messages; leaving it at 0 is usually better.
nthread [default to maximum number of threads available if not set]
Number of threads.
num_pbuffer [set automatically by xgboost, no need to be set by user]
Size of the prediction buffer.
num_feature [set automatically by xgboost, no need to be set by user]
Feature dimension.
Model parameters:
(1) Tree booster parameters
eta [default=0.3]
Learning rate (shrinkage); usually set to a small value.
range: [0,1]
gamma [default=0]
Minimum loss reduction required to make a further split (used in post-pruning); the larger the value, the more conservative the algorithm.
range: [0,∞]
max_depth [default=6]
Maximum depth of a tree.
range: [1,∞]
min_child_weight [default=1]
Minimum sum of instance hessian (h) required in a leaf. For 0-1 classification with imbalanced classes, if h is around 0.01, then min_child_weight = 1 means a leaf must contain at least roughly 100 samples. This parameter strongly affects the result: it controls the minimum sum of second derivatives in a leaf, and the smaller its value, the easier it is to overfit.
range: [0,∞]
max_delta_step [default=0]
Acts on the update step: 0 means no constraint, while a positive value makes the update more conservative, preventing overly large steps and smoothing the updates.
range: [0,∞]
subsample [default=1]
Row subsampling ratio. Lower values make the algorithm more conservative and help prevent overfitting, but too small a value causes underfitting.
range: (0,1]
colsample_bytree [default=1]
Column subsampling ratio for the features used to grow each tree; typically set between 0.5 and 1.
range: (0,1]
lambda [default=1]
L2 regularization term on weights.
alpha [default=0]
L1 regularization term on weights.
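To make the min_child_weight rule of thumb above concrete: for logistic loss, the hessian contributed by a single instance is p(1 - p), so when predicted probabilities cluster near 0.01, each instance adds roughly 0.01 to a leaf's hessian sum. A quick base-R sketch with toy numbers (not from the case data):

```r
# Hessian of the logistic loss for one instance is p * (1 - p)
hess <- function(p) p * (1 - p)

p <- 0.01                  # predictions clustered near 0.01
h_per_instance <- hess(p)  # about 0.0099 per instance

# Number of such instances a leaf needs before its hessian sum reaches
# min_child_weight = 1
n_needed <- ceiling(1 / h_per_instance)
n_needed  # roughly 100 samples, matching the rule of thumb above
```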
(2) Linear booster parameters
lambda [default=0]
L2 regularization term on weights.
alpha [default=0]
L1 regularization term on weights.
lambda_bias [default=0]
L2 regularization term on the bias (there is no L1 term on the bias because it is not important).
Task parameters:
objective [default=reg:linear]
Defines the loss function to minimize. Common options:
"reg:linear" – linear regression
"reg:logistic" – logistic regression
"binary:logistic" – logistic regression for binary classification, output probability
"binary:logitraw" – logistic regression for binary classification, output score before logistic transformation
"count:poisson" – poisson regression for count data, output mean of poisson distribution
max_delta_step is set to 0.7 by default in poisson regression (used to safeguard optimization)
"multi:softmax" – set XGBoost to do multiclass classification using the softmax objective; you also need to set num_class (the number of classes)
"multi:softprob" – same as softmax, but output a vector of ndata * nclass, which can be further reshaped to an ndata, nclass matrix. The result contains the predicted probability of each data point belonging to each class.
"rank:pairwise" – set XGBoost to do ranking tasks by minimizing the pairwise loss
base_score [default=0.5]
The initial prediction score of all instances (global bias).
eval_metric [default according to objective]
Evaluation metric for validation data; a default metric is assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking).
Users can add multiple evaluation metrics. Python users should pass the metrics as a list of parameter pairs instead of a map, so that a later 'eval_metric' does not override a previous one.
The available metrics are listed below:
"rmse": root mean square error
"logloss": negative log-likelihood
"error": binary classification error rate, calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation regards instances with a prediction value larger than 0.5 as positive instances, and the others as negative instances.
"merror": multiclass classification error rate, calculated as #(wrong cases)/#(all cases)
"mlogloss": multiclass logloss
"auc": area under the curve, for ranking evaluation
"ndcg": normalized discounted cumulative gain
"map": mean average precision
"ndcg@n", "map@n": n can be assigned as an integer to cut off the top positions in the lists for evaluation
"ndcg-", "map-", "ndcg@n-", "map@n-": in XGBoost, NDCG and MAP evaluate the score of a list without any positive samples as 1. Adding "-" to the metric name makes XGBoost evaluate such scores as 0, to be consistent under some conditions.
seed [default=0]
Random number seed.
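Putting the three parameter groups together: in the R package they are all passed to xgb.train as a single named list, and a custom eval_metric is just a function returning a named list. A minimal sketch (the numeric values are illustrative, not tuned; the toy rmse function here only mimics the required return shape):

```r
# General, model, and task parameters all go into one named list
param <- list(booster = "gbtree",        # general: tree base learner
              eta = 0.1,                 # model: learning rate
              max_depth = 6,             # model: tree depth
              subsample = 0.9,           # model: row sampling
              objective = "reg:linear")  # task: squared-error regression

# A custom evaluation function must return list(metric = <name>, value = <number>);
# here, plain RMSE computed on two toy vectors
rmse <- function(preds, labels) {
  list(metric = "rmse", value = sqrt(mean((preds - labels)^2)))
}
rmse(c(1, 2, 3), c(1, 2, 5))$value  # sqrt(mean(c(0, 0, 4))) = sqrt(4/3)
```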
3. Case background
Rossmann operates 3,000 drugstores in 7 European countries. Rossmann store managers are currently tasked with predicting their daily sales up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of the results can vary widely.
The fields in the data provided by Kaggle are as follows:
Id – an Id that represents a (Store, Date) duple within the test set
Store – a unique Id for each store
Sales – the turnover for any given day (this is the variable to predict)
Customers – the number of customers on a given day
Open – an indicator for whether the store was open: 0 = closed, 1 = open
StateHoliday – indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
SchoolHoliday – indicates if the (Store, Date) was affected by the closure of public schools
StoreType – differentiates between 4 different store models: a, b, c, d
Assortment – describes an assortment level: a = basic, b = extra, c = extended
CompetitionDistance – distance in meters to the nearest competitor store
CompetitionOpenSince[Month/Year] – gives the approximate year and month of the time the nearest competitor was opened
Promo – indicates whether a store is running a promo on that day
Promo2 – a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
Promo2Since[Year/Week] – describes the year and calendar week when the store started participating in Promo2
PromoInterval – describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
4. Data preprocessing
Since this case study focuses on the xgboost model itself, only minimal preprocessing and feature engineering is done, in two steps. First, Kaggle supplies the stores' attribute data separately from the training and test sets, so the store data set has to be merged with the train and test sets on their shared column. Second, the data must be converted into the format xgboost expects; in the R xgboost package, the xgb.DMatrix function performs this conversion.
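The two steps can be sketched on a toy data frame (hypothetical store IDs and dates, base R only; the full script below uses lubridate and xgb.DMatrix instead):

```r
# Step 1: merge the store attributes onto the sales records via the shared Store column
train <- data.frame(Store = c(1, 2, 1),
                    Date  = c("2015-07-01", "2015-07-01", "2015-07-02"),
                    Sales = c(5263, 6064, 5020))
store <- data.frame(Store = c(1, 2), StoreType = c("c", "a"))
train <- merge(train, store)  # adds StoreType to every matching row

# Step 2: turn the date string into numeric features a model can use
d <- as.POSIXlt(train$Date)
train$year  <- d$year + 1900
train$month <- d$mon + 1
train$day   <- d$mday
```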
5. Code implementation
Data download: https://www.kaggle.com/c/rossmann-store-sales/data
library(readr)
library(xgboost)
library(lubridate)

train <- read.csv('D:/R语言kaggle案例实战/Kaggle第五节课/train.csv')
test  <- read.csv('D:/R语言kaggle案例实战/Kaggle第五节课/test.csv')
store <- read.csv('D:/R语言kaggle案例实战/Kaggle第五节课/store.csv')  # store holds extra attributes for each shop

train <- merge(train, store)  # join the store attributes onto the training set
test  <- merge(test, store)   # join the store attributes onto the test set

train$Date <- as.POSIXct(train$Date)  # convert the date strings to date-time format
test$Date  <- as.POSIXct(test$Date)

train[is.na(train)] <- 0  # replace missing values with zero
test[is.na(test)]   <- 0

train <- train[which(train$Open == '1'), ]   # keep only days the store was open...
train <- train[which(train$Sales != '0'), ]  # ...and had non-zero sales

train$month <- month(train$Date)  # extract month
train$year  <- year(train$Date)   # extract year
train$day   <- day(train$Date)    # extract day of month
test$month  <- month(test$Date)   # the test set needs the same date features,
test$year   <- year(test$Date)    # otherwise selecting feature.names below fails
test$day    <- day(test$Date)

train <- train[, -c(3, 8)]  # drop the date column and a column with many missing values
test  <- test[, -c(4, 7)]

feature.names <- names(train)[c(1, 2, 5:19)]  # keeps the test and training sets structurally consistent

# Encode character columns as integers, using levels shared by train and test
for (f in feature.names) {
  if (class(train[[f]]) == "character") {
    levels <- unique(c(train[[f]], test[[f]]))
    train[[f]] <- as.integer(factor(train[[f]], levels = levels))
    test[[f]]  <- as.integer(factor(test[[f]], levels = levels))
  }
}

tra <- train[, feature.names]

# Custom evaluation function: the RMSPE metric defined by the Kaggle competition,
# used as the evaluation function inside xgboost
RMPSE <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  elab   <- exp(as.numeric(labels)) - 1
  epreds <- exp(as.numeric(preds)) - 1
  err    <- sqrt(mean((epreds / elab - 1)^2))
  return(list(metric = "RMPSE", value = err))
}

h <- sample(nrow(train), 10000)  # hold out 10,000 rows for validation

dval   <- xgb.DMatrix(data = data.matrix(tra[h, ]),  label = log(train$Sales + 1)[h])   # used to build the watchlist below
dtrain <- xgb.DMatrix(data = data.matrix(tra[-h, ]), label = log(train$Sales + 1)[-h])  # xgboost's dedicated matrix format

watchlist <- list(val = dval, train = dtrain)  # monitors model performance on each round

param <- list(objective = "reg:linear",
              booster = "gbtree",
              eta = 0.02,
              max_depth = 12,
              subsample = 0.9,
              colsample_bytree = 0.7,
              num_parallel_tree = 2,
              alpha = 0.0001,
              lambda = 1)

clf <- xgb.train(params = param,
                 data = dtrain,
                 nrounds = 3000,
                 verbose = 0,
                 early.stop.round = 100,
                 watchlist = watchlist,
                 maximize = FALSE,
                 feval = RMPSE)

# predict on the test set; xgboost needs a matrix, not a data frame, and the
# predictions are on the log scale, so recover sales with exp(ptest) - 1
ptest <- predict(clf, data.matrix(test[, feature.names]))
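One detail of the script worth calling out: the model is trained on log(Sales + 1), so raw sales must be recovered with exp(prediction) - 1, which is also why the RMPSE function exponentiates both labels and predictions before computing the error. A small check of that round trip on toy values:

```r
sales <- c(0, 5263, 6064)    # toy daily sales figures
label <- log(sales + 1)      # transform used when building the DMatrix labels
recovered <- exp(label) - 1  # inverse transform applied to predictions
all.equal(recovered, sales)  # TRUE within floating-point tolerance
```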