vtreat cross frames
John Mount, Nina Zumel
2016-05-05
As a follow-up to “On Nested Models”, we work through R examples demonstrating “cross validated training frames” (or “cross frames”) in vtreat.
Consider the following data frame. The outcome only depends on the “good” variables, not on the (high degree of freedom) “bad” variables. Modeling such a data set runs a high risk of overfit.
set.seed(22626)

mkData <- function(n) {
  d <- data.frame(xBad1=sample(paste('level',1:1000,sep=''),n,replace=TRUE),
                  xBad2=sample(paste('level',1:1000,sep=''),n,replace=TRUE),
                  xBad3=sample(paste('level',1:1000,sep=''),n,replace=TRUE),
                  xGood1=rnorm(n),
                  xGood2=rnorm(n))

  # outcome only depends on "good" variables
  d$y <- rnorm(nrow(d))+0.2*d$xGood1 + 0.3*d$xGood2>0.5
  # the random group used for splitting the data set, not a variable.
  d$rgroup <- sample(c("cal","train","test"),nrow(d),replace=TRUE)
  d
}

d <- mkData(2000)

# devtools::install_github("WinVector/WVPlots")
# library('WVPlots')
plotRes <- function(d,predName,yName,title) {
  print(title)
  tab <- table(truth=d[[yName]],pred=d[[predName]]>0.5)
  print(tab)
  diag <- sum(vapply(seq_len(min(dim(tab))),
                     function(i) tab[i,i],numeric(1)))
  acc <- diag/sum(tab)
  # if(requireNamespace("WVPlots",quietly=TRUE)) {
  #   print(WVPlots::ROCPlot(d,predName,yName,title))
  # }
  print(paste('accuracy',acc))
}
The Wrong Way
Bad practice: use the same set of data to prepare variable encoding and train a model.
dTrain <- d[d$rgroup!='test',,drop=FALSE]
dTest <- d[d$rgroup=='test',,drop=FALSE]
treatments <- vtreat::designTreatmentsC(dTrain,c('xBad1','xBad2','xBad3','xGood1','xGood2'),
                                        'y',TRUE,
  rareCount=0 # Note: usually want rareCount>0, setting to zero to illustrate problem
)
## [1] "desigining treatments Thu May 5 07:17:01 2016"
## [1] "design var xBad1 Thu May 5 07:17:01 2016"
## [1] "design var xBad2 Thu May 5 07:17:01 2016"
## [1] "design var xBad3 Thu May 5 07:17:01 2016"
## [1] "design var xGood1 Thu May 5 07:17:01 2016"
## [1] "design var xGood2 Thu May 5 07:17:01 2016"
## [1] "scoring treatments Thu May 5 07:17:01 2016"
## [1] "have treatment plan Thu May 5 07:17:01 2016"
## [1] "rescoring complex variables Thu May 5 07:17:01 2016"
## [1] "done rescoring complex variables Thu May 5 07:17:01 2016"
dTrainTreated <- vtreat::prepare(treatments,dTrain,
  pruneSig=c() # Note: usually want pruneSig to be a small fraction, setting to null to illustrate problem
)
m1 <- glm(y~xBad1_catB + xBad2_catB + xBad3_catB + xGood1_clean + xGood2_clean,
          data=dTrainTreated,family=binomial(link='logit'))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
print(summary(m1)) # notice low residual deviance
##
## Call:
## glm(formula = y ~ xBad1_catB + xBad2_catB + xBad3_catB + xGood1_clean +
##     xGood2_clean, family = binomial(link = "logit"), data = dTrainTreated)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -2.70438   0.00000   0.00000   0.03995   2.61063
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -0.5074     0.3350  -1.515  0.12983
## xBad1_catB     2.9432     0.5549   5.304 1.13e-07 ***
## xBad2_catB     2.5338     0.5857   4.326 1.52e-05 ***
## xBad3_catB     3.4172     0.6092   5.610 2.03e-08 ***
## xGood1_clean   0.7288     0.2429   3.001  0.00269 **
## xGood2_clean   0.7788     0.2585   3.012  0.00259 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1724.55  on 1331  degrees of freedom
## Residual deviance:  132.59  on 1326  degrees of freedom
## AIC: 144.59
##
## Number of Fisher Scoring iterations: 12
dTrain$predM1 <- predict(m1,newdata=dTrainTreated,type='response')
plotRes(dTrain,'predM1','y','model1 on train')
## [1] "model1 on train"
##        pred
## truth   FALSE TRUE
##   FALSE   848   18
##   TRUE      6  460
## [1] "accuracy 0.981981981981982"
dTestTreated <- vtreat::prepare(treatments,dTest,pruneSig=c())
dTest$predM1 <- predict(m1,newdata=dTestTreated,type='response')
plotRes(dTest,'predM1','y','model1 on test')
## [1] "model1 on test"
##        pred
## truth   FALSE TRUE
##   FALSE   360  114
##   TRUE    153   41
## [1] "accuracy 0.600299401197605"
Notice above that we see a training accuracy of 98% and a test accuracy of 60%.
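The gap between training and test accuracy can be reproduced without vtreat. The following is a minimal base R sketch under stated assumptions: it uses a naive per-level mean of the outcome as a simplified stand-in for a catB-style encoding (it is not vtreat's actual implementation), and the names `impactCode`, `levelMeans`, and the toy data are hypothetical. The encoded noise variable looks strongly related to the outcome on the rows used to build the encoding, and unrelated on held-out rows.

```r
set.seed(2016)
n <- 1000
# hypothetical toy data: a pure-noise categorical with many levels
x <- sample(paste0('level', 1:500), n, replace = TRUE)
y <- rnorm(n) > 0.5            # outcome is independent of x
isTrain <- rep(c(TRUE, FALSE), length.out = n)

# naive impact code: per-level mean of y minus the grand mean,
# estimated on the training rows only
grandMean <- mean(y[isTrain])
levelMeans <- vapply(split(y[isTrain], x[isTrain]), mean, numeric(1))
impactCode <- function(v) {
  code <- levelMeans[match(v, names(levelMeans))] - grandMean
  code[is.na(code)] <- 0       # levels unseen in training get a neutral code
  as.numeric(code)
}

# correlation of the encoded variable with the outcome
corTrain <- cor(impactCode(x[isTrain]), as.numeric(y[isTrain]))
corTest <- cor(impactCode(x[!isTrain]), as.numeric(y[!isTrain]))
print(c(train = corTrain, test = corTest))
```

Because most levels occur only a handful of times, each training row's code is largely a copy of its own outcome, which is exactly the information a downstream model should not see.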
The Right Way: A Calibration Set
Now try a proper calibration/train/test split:
dCal <- d[d$rgroup=='cal',,drop=FALSE]
dTrain <- d[d$rgroup=='train',,drop=FALSE]
dTest <- d[d$rgroup=='test',,drop=FALSE]
treatments <- vtreat::designTreatmentsC(dCal,c('xBad1','xBad2','xBad3','xGood1','xGood2'),
                                        'y',TRUE,
  rareCount=0 # Note: usually want rareCount>0, setting to zero to illustrate problem
)
## [1] "desigining treatments Thu May 5 07:17:01 2016"
## [1] "design var xBad1 Thu May 5 07:17:01 2016"
## [1] "design var xBad2 Thu May 5 07:17:01 2016"
## [1] "design var xBad3 Thu May 5 07:17:01 2016"
## [1] "design var xGood1 Thu May 5 07:17:01 2016"
## [1] "design var xGood2 Thu May 5 07:17:01 2016"
## [1] "scoring treatments Thu May 5 07:17:01 2016"
## [1] "have treatment plan Thu May 5 07:17:01 2016"
## [1] "rescoring complex variables Thu May 5 07:17:01 2016"
## [1] "done rescoring complex variables Thu May 5 07:17:02 2016"
dTrainTreated <- vtreat::prepare(treatments,dTrain,
  pruneSig=c() # Note: usually want pruneSig to be a small fraction, setting to null to illustrate problem
)
m1 <- glm(y~xBad1_catB + xBad2_catB + xBad3_catB + xGood1_clean + xGood2_clean,
          data=dTrainTreated,family=binomial(link='logit'))
print(summary(m1))
##
## Call:
## glm(formula = y ~ xBad1_catB + xBad2_catB + xBad3_catB + xGood1_clean +
##     xGood2_clean, family = binomial(link = "logit"), data = dTrainTreated)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.5853  -0.9177  -0.6876   1.1651   2.3241
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.73798    0.12720  -5.802 6.56e-09 ***
## xBad1_catB   -0.02380    0.02637  -0.903    0.367
## xBad2_catB   -0.02495    0.02608  -0.957    0.339
## xBad3_catB    0.02058    0.02508   0.821    0.412
## xGood1_clean  0.39234    0.08632   4.545 5.49e-06 ***
## xGood2_clean  0.56252    0.09673   5.816 6.04e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 832.55  on 642  degrees of freedom
## Residual deviance: 769.28  on 637  degrees of freedom
## AIC: 781.28
##
## Number of Fisher Scoring iterations: 4
dTrain$predM1 <- predict(m1,newdata=dTrainTreated,type='response')
plotRes(dTrain,'predM1','y','model1 on train')
## [1] "model1 on train"
##        pred
## truth   FALSE TRUE
##   FALSE   378   40
##   TRUE    157   68
## [1] "accuracy 0.693623639191291"
dTestTreated <- vtreat::prepare(treatments,dTest,pruneSig=c())
dTest$predM1 <- predict(m1,newdata=dTestTreated,type='response')
plotRes(dTest,'predM1','y','model1 on test')
## [1] "model1 on test"
##        pred
## truth   FALSE TRUE
##   FALSE   422   52
##   TRUE    149   45
## [1] "accuracy 0.699101796407186"
Notice above that we now see training and test accuracies of 70%. We have defeated overfit in two ways: training performance is closer to test performance, and test performance is better. Also we see that the model now properly considers the “bad” variables to be insignificant.
Another Right Way: Cross-Validation
Below is a more statistically efficient practice: building a cross training frame.
The intuition
Consider any trained statistical model (in this case our treatment plan and variable selection plan) as a two-argument function f(A,B). The first argument is the training data and the second argument is the application data. In our case f(A,B) is designTreatmentsC(A) %>% prepare(B), and it produces a treated data frame.
When we use the same data in both places to build our training frame, as in
TrainTreated = f(TrainData,TrainData),
we are not doing a good job simulating the future application of f(,), which will be f(TrainData,FutureData).
To improve the quality of our simulation we can call
TrainTreated = f(CalibrationData,TrainData)
where CalibrationData and TrainData are disjoint datasets (as we did in the earlier example) and expect this to be a good imitation of future f(CalibrationData,FutureData).
Cross-Validation and vtreat: The cross-frame.
Another approach is to build a “cross validated” version of f. We split TrainData into a list of 3 disjoint row intervals: Train1,Train2,Train3. Instead of computing f(TrainData,TrainData) compute:
TrainTreated = f(Train2+Train3,Train1) + f(Train1+Train3,Train2) + f(Train1+Train2,Train3)
(where + denotes rbind()).
The idea is that this looks a lot like f(TrainData,TrainData), except it has the important property that no row on the right-hand side is ever worked on by a model built using that row (a key characteristic that future data will have), so we have a good imitation of f(TrainData,FutureData).
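The three-way construction above can be sketched in base R. This is a minimal illustration under stated assumptions, not vtreat's implementation: `f` is a hypothetical stand-in for designTreatmentsC(A) %>% prepare(B) that fits a naive per-level outcome mean on A and applies it to B, the toy data contain a single pure-noise categorical, and the folds are assigned at random rather than as row intervals. Results are written back in the original row order, which is equivalent to the rbind() form up to row ordering.

```r
set.seed(2016)
n <- 900
# hypothetical toy data: a noise categorical and an unrelated binary outcome
d <- data.frame(x = sample(paste0('level', 1:300), n, replace = TRUE),
                y = rnorm(n) > 0.5,
                stringsAsFactors = FALSE)

# f(A, B): build a naive impact code on A, apply it to B
f <- function(A, B) {
  levelMeans <- vapply(split(A$y, A$x), mean, numeric(1))
  code <- levelMeans[match(B$x, names(levelMeans))] - mean(A$y)
  code[is.na(code)] <- 0       # levels unseen in A get a neutral code
  data.frame(xCode = as.numeric(code), y = B$y)
}

# split the rows into 3 disjoint folds
fold <- sample(rep(1:3, length.out = n))

# cross-frame: each fold is encoded by a plan built from the other two folds
xCode <- numeric(n)
for (k in 1:3) {
  held <- fold == k
  xCode[held] <- f(d[!held, ], d[held, ])$xCode
}
crossFrame <- data.frame(xCode = xCode, y = d$y)

# naive frame: the same rows build and receive the encoding
naiveFrame <- f(d, d)

corNaive <- cor(naiveFrame$xCode, as.numeric(naiveFrame$y))
corCross <- cor(crossFrame$xCode, as.numeric(crossFrame$y))
print(c(naive = corNaive, cross = corCross))
```

In the naive frame the noise variable appears predictive because each row contributes to its own code; in the cross-frame that apparent relationship should largely vanish, as it would on future data.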
In other words: we use cross validation to simulate future data. The main thing we are doing differently is remembering that we can apply cross validation to any two-argument function f(A,B), and not only to functions of the form f(A,B) = buildModel(A) %>% scoreData(B). We can use this formulation in stacking or super-learning with f(A,B) of the form buildSubModels(A) %>% combineModels(B) (to produce a stacked or ensemble model); the idea applies to improving ensemble methods in general.
See:
- “General oracle inequalities for model selection” Charles Mitchell and Sara van de Geer
- “On Cross-Validation and Stacking: Building seemingly predictive models on random data” Claudia Perlich and Grzegorz Swirszcz
- “Super Learner” Mark J. van der Laan, Eric C. Polley, and Alan E. Hubbard
In fact (though it was developed independently) you can think of vtreat as a superlearner.
In super learning, cross validation techniques are used to simulate having built sub-model predictions on novel data. The simulated out-of-sample applications of these sub-models (and not the sub-models themselves) are then used as input data for the next stage learner. In future application the actual sub-models are applied and their immediate outputs are used by the super model.
In vtreat the sub-models are single variable treatments and the outer model construction is left to the practitioner (using the cross-frames for simulation and not the treatment plan). In application the treatment plan is used.
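The super learning pattern described above can be sketched in base R as follows. This is a rough illustration under stated assumptions rather than any particular package's method: the toy data, the two single-variable logistic sub-models in `subFormulas`, and the helper `scoreNew` are all hypothetical. The next-stage model is fit on out-of-fold sub-model predictions, while future data is scored by sub-models refit on all of the training data.

```r
set.seed(2016)
n <- 600
# hypothetical toy data with two informative numeric variables
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rnorm(n) + 0.5 * d$x1 + 0.5 * d$x2 > 0
fold <- sample(rep(1:3, length.out = n))

# two hypothetical sub-models, each using a single variable
subFormulas <- list(s1 = y ~ x1, s2 = y ~ x2)

# out-of-fold predictions: each row is scored by sub-models that never saw it
oof <- data.frame(s1 = numeric(n), s2 = numeric(n))
for (k in 1:3) {
  held <- fold == k
  for (nm in names(subFormulas)) {
    m <- glm(subFormulas[[nm]], data = d[!held, ],
             family = binomial(link = 'logit'))
    oof[[nm]][held] <- predict(m, newdata = d[held, ], type = 'response')
  }
}
oof$y <- d$y

# the next-stage learner is fit on the simulated out-of-sample predictions
superModel <- glm(y ~ s1 + s2, data = oof, family = binomial(link = 'logit'))

# for future application, the sub-models are refit on all training data
finalSubs <- lapply(subFormulas, function(fm) {
  glm(fm, data = d, family = binomial(link = 'logit'))
})
scoreNew <- function(newData) {
  subPreds <- as.data.frame(lapply(finalSubs, function(m) {
    predict(m, newdata = newData, type = 'response')
  }))
  predict(superModel, newdata = subPreds, type = 'response')
}

newPreds <- scoreNew(d[1:5, ])
print(newPreds)
```

The correspondence to vtreat is loose but useful: `oof` plays the role of the cross-frame, and `finalSubs` plays the role of the treatment plan.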
Example
Below is the example cross-run. The function mkCrossFrameCExperiment returns a treatment plan for use in preparing future data, and a cross-frame for use in fitting a model.
dTrain <- d[d$rgroup!='test',,drop=FALSE]
dTest <- d[d$rgroup=='test',,drop=FALSE]
prep <- vtreat::mkCrossFrameCExperiment(dTrain,
                                        c('xBad1','xBad2','xBad3','xGood1','xGood2'),
                                        'y',TRUE,
  rareCount=0 # Note: usually want rareCount>0, setting to zero to illustrate problem
)
dTrainTreated <- prep$crossFrame
treatments <- prep$treatments
print(treatments$scoreFrame[,c('varName','lsig','csig')])
##        varName         lsig         csig
## 1   xBad1_catP 8.172932e-01 8.172186e-01
## 2   xBad1_catB 5.675615e-01 5.676541e-01
## 3   xBad2_catP 7.446537e-01 7.441869e-01
## 4   xBad2_catB 5.792325e-01 5.793585e-01
## 5   xBad3_catP 4.356331e-01 4.342227e-01
## 6   xBad3_catB 1.786048e-01 1.770493e-01
## 7 xGood1_clean 6.529637e-12 6.072599e-12
## 8 xGood2_clean 8.584085e-21 8.286789e-21
Now fit the model to the cross-frame rather than to prepare(treatments, dTrain) (the treated training data).
m1 <- glm(y~xBad1_catB + xBad2_catB + xBad3_catB + xGood1_clean + xGood2_clean,
          data=dTrainTreated,family=binomial(link='logit'))
print(summary(m1))
##
## Call:
## glm(formula = y ~ xBad1_catB + xBad2_catB + xBad3_catB + xGood1_clean +
##     xGood2_clean, family = binomial(link = "logit"), data = dTrainTreated)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.6842  -0.9236  -0.6573   1.1824   2.3257
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.690824   0.091313  -7.565 3.87e-14 ***
## xBad1_catB    0.001813   0.017218   0.105    0.916
## xBad2_catB   -0.023835   0.017128  -1.392    0.164
## xBad3_catB    0.024460   0.016978   1.441    0.150
## xGood1_clean  0.404827   0.061885   6.542 6.09e-11 ***
## xGood2_clean  0.570083   0.064988   8.772  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1724.6  on 1331  degrees of freedom
## Residual deviance: 1587.7  on 1326  degrees of freedom
## AIC: 1599.7
##
## Number of Fisher Scoring iterations: 4
dTrain$predM1 <- predict(m1,newdata=dTrainTreated,type='response')
plotRes(dTrain,'predM1','y','model1 on train')
## [1] "model1 on train"
##        pred
## truth   FALSE TRUE
##   FALSE   776   90
##   TRUE    335  131
## [1] "accuracy 0.680930930930931"
dTestTreated <- vtreat::prepare(treatments,dTest,pruneSig=c())
dTest$predM1 <- predict(m1,newdata=dTestTreated,type='response')
plotRes(dTest,'predM1','y','model1 on test')
## [1] "model1 on test"
##        pred
## truth   FALSE TRUE
##   FALSE   421   53
##   TRUE    145   49
## [1] "accuracy 0.703592814371258"
The model fit to the cross-frame behaves similarly to the model produced via the process f(CalibrationData, TrainData).