K-fold Cross-Validation


K-fold cross-validation splits the sample set into k parts: k-1 parts are used as the training set and the remaining part is used as the validation set. The validation set is used to measure the error rate of the resulting classifier or regression model. The procedure is typically repeated k times, until each of the k parts has served as the validation set once.
The cross-validation approach is described in more detail below.
Cross Validation
Cross validation is a model evaluation method that is better than residuals. The problem with residual evaluations is that they do not give an indication of how well the learner will do when it is asked to make new predictions for data it has not already seen. One way to overcome this problem is to not use the entire data set when training a learner. Some of the data is removed before training begins. Then when training is done, the data that was removed can be used to test the performance of the learned model on "new" data. This is the basic idea for a whole class of model evaluation methods called cross validation.

The holdout method is the simplest kind of cross validation. The data set is separated into two sets, called the training set and the testing set. The function approximator fits a function using the training set only. Then the function approximator is asked to predict the output values for the data in the testing set (it has never seen these output values before). The errors it makes are accumulated as before to give the mean absolute test set error, which is used to evaluate the model. The advantage of this method is that it is usually preferable to the residual method and takes no longer to compute. However, its evaluation can have a high variance. The evaluation may depend heavily on which data points end up in the training set and which end up in the test set, and thus the evaluation may be significantly different depending on how the division is made.
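A minimal sketch of the holdout idea in Python, using hypothetical data and a simple model (assumes NumPy and scikit-learn are available):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    # Hypothetical noisy 1-D regression data.
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(200, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

    # Hold out 30% of the points; the model never sees them during fitting.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = LinearRegression().fit(X_train, y_train)

    # Mean absolute test-set error, as described above.
    mae = np.mean(np.abs(model.predict(X_test) - y_test))
    print("holdout mean absolute error:", mae)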

K-fold cross validation is one way to improve over the holdout method. The data set is divided into k subsets, and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k-1 subsets are put together to form a training set. Then the average error across all k trials is computed. The advantage of this method is that it matters less how the data gets divided. Every data point gets to be in a test set exactly once, and gets to be in a training set k-1 times. The variance of the resulting estimate is reduced as k is increased. The disadvantage of this method is that the training algorithm has to be rerun from scratch k times, which means it takes k times as much computation to make an evaluation. A variant of this method is to randomly divide the data into a test and training set k different times. The advantage of doing this is that you can independently choose how large each test set is and how many trials you average over.
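A sketch of the k-fold loop, continuing the hypothetical X, y and LinearRegression setup from the holdout example above (scikit-learn's KFold handles the splitting; ShuffleSplit would give the randomized variant mentioned at the end of the paragraph):

    from sklearn.model_selection import KFold

    k = 5
    fold_errors = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True,
                                     random_state=0).split(X):
        # Train on k-1 subsets, test on the held-out subset.
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        fold_errors.append(
            np.mean(np.abs(model.predict(X[test_idx]) - y[test_idx])))

    # Average error across all k trials.
    print("k-fold mean absolute error:", np.mean(fold_errors))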

Leave-one-out cross validation is K-fold cross validation taken to its logical extreme, with K equal to N, the number of data points in the set. That means that N separate times, the function approximator is trained on all the data except for one point and a prediction is made for that point. As before, the average error is computed and used to evaluate the model. The evaluation given by leave-one-out cross validation error (LOO-XVE) is good, but at first pass it seems very expensive to compute. Fortunately, locally weighted learners can make LOO predictions just as easily as they make regular predictions. That means computing the LOO-XVE takes no more time than computing the residual error and it is a much better way to evaluate models. We will see shortly that Vizier relies heavily on LOO-XVE to choose its metacodes.
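The same loop with K equal to N, again using the hypothetical data from the examples above (this brute-force version retrains N times and does not exploit the locally weighted shortcut mentioned in the paragraph):

    from sklearn.model_selection import LeaveOneOut

    loo_errors = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        # Train on all points except one, then predict that single point.
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        loo_errors.append(abs(model.predict(X[test_idx])[0] - y[test_idx][0]))

    print("mean absolute LOO-XVE:", np.mean(loo_errors))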



    
Figure 26: Cross validation checks how well a model generalizes to new data



Fig. 26 shows an example of cross validation performing better than residual error. The data set in the top two graphs is a simple underlying function with significant noise. Cross validation tells us that broad smoothing is best. The data set in the bottom two graphs is a complex underlying function with no noise. Cross validation tells us that very little smoothing is best for this data set.

Now we return to the question of choosing a good metacode for dataset a1.mbl:


File -> Open -> a1.mbl
Edit -> Metacode -> A90:9
Model -> LOOPredict
Edit -> Metacode -> L90:9
Model -> LOOPredict
Edit -> Metacode -> L10:9
Model -> LOOPredict

LOOPredict goes through the entire data set and makes LOO predictions for each point. At the bottom of the page it shows the summary statistics, including mean LOO error, RMS LOO error, and information about the data point with the largest error. The mean absolute LOO-XVEs for the three metacodes given above (the same three used to generate the graphs in fig. 25) are 2.98, 1.23, and 1.80. Those values show that global linear regression is the best metacode of those three, which agrees with our intuitive feeling from looking at the plots in fig. 25. If you repeat the above operation on data set b1.mbl you'll get the values 4.83, 4.45, and 0.39, which also agrees with our observations.




What are cross-validation and bootstrapping?

--------------------------------------------------------------------------------


Cross-validation and bootstrapping are both methods for estimating generalization error based on "resampling" (Weiss and Kulikowski 1991; Efron and Tibshirani 1993; Hjorth 1994; Plutowski, Sakata, and White 1994; Shao and Tu 1995). The resulting estimates of generalization error are often used for choosing among various models, such as different network architectures.

Cross-validation
++++++++++++++++

In k-fold cross-validation, you divide the data into k subsets of (approximately) equal size. You train the net k times, each time leaving out one of the subsets from training, but using only the omitted subset to compute whatever error criterion interests you. If k equals the sample size, this is called "leave-one-out" cross-validation. "Leave-v-out" is a more elaborate and expensive version of cross-validation that involves leaving out all possible subsets of v cases.
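To illustrate why leave-v-out is expensive, a small sketch using scikit-learn's LeavePOut on hypothetical data: the number of splits is the number of ways to choose v cases out of n, which grows very quickly with n.

    import numpy as np
    from sklearn.model_selection import LeavePOut

    # Leave-v-out with v = 2 on a tiny data set: every pair of cases is
    # held out once, giving C(6, 2) = 15 train/test splits.
    X_small = np.arange(6).reshape(-1, 1)
    splits = list(LeavePOut(p=2).split(X_small))
    print(len(splits))  # 15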

Note that cross-validation is quite different from the "split-sample" or "hold-out" method that is commonly used for early stopping in NNs. In the split-sample method, only a single subset (the validation set) is used to estimate the generalization error, instead of k different subsets; i.e., there is no "crossing". While various people have suggested that cross-validation be applied to early stopping, the proper way of doing so is not obvious.

The distinction between cross-validation and split-sample validation is extremely important because cross-validation is markedly superior for small data sets; this fact is demonstrated dramatically by Goutte (1997) in a reply to Zhu and Rohwer (1996). For an insightful discussion of the limitations of cross-validatory choice among several learning methods, see Stone (1977).

Jackknifing
+++++++++++

Leave-one-out cross-validation is also easily confused with jackknifing. Both involve omitting each training case in turn and retraining the network on the remaining subset. But cross-validation is used to estimate generalization error, while the jackknife is used to estimate the bias of a statistic. In the jackknife, you compute some statistic of interest in each subset of the data. The average of these subset statistics is compared with the corresponding statistic computed from the entire sample in order to estimate the bias of the latter. You can also get a jackknife estimate of the standard error of a statistic. Jackknifing can be used to estimate the bias of the training error and hence to estimate the generalization error, but this process is more complicated than leave-one-out cross-validation (Efron, 1982; Ripley, 1996, p. 73).
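A small numeric sketch of the jackknife bias estimate described above, using NumPy, hypothetical data, and an intentionally biased statistic (the plug-in variance with divisor n):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=50)          # hypothetical sample
    n = len(x)

    def stat(sample):
        # Plug-in variance (divisor n), which is biased downward.
        return np.var(sample)

    full = stat(x)
    # Omit each case in turn and recompute the statistic on the rest.
    loo_stats = np.array([stat(np.delete(x, i)) for i in range(n)])

    # Jackknife bias estimate: compare the average of the leave-one-out
    # statistics with the full-sample statistic.
    bias = (n - 1) * (loo_stats.mean() - full)
    print("jackknife bias estimate:", bias)
    print("bias-corrected estimate:", full - bias)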

Choice of cross-validation method
+++++++++++++++++++++++++++++++++

Cross-validation can be used simply to estimate the generalization error of a given model, or it can be used for model selection by choosing one of several models that has the smallest estimated generalization error. For example, you might use cross-validation to choose the number of hidden units, or you could use cross-validation to choose a subset of the inputs (subset selection). A subset that contains all relevant inputs will be called a "good" subset, while the subset that contains all relevant inputs but no others will be called the "best" subset. Note that subsets are "good" and "best" in an asymptotic sense (as the number of training cases goes to infinity). With a small training set, it is possible that a subset that is smaller than the "best" subset may provide better generalization error.

Leave-one-out cross-validation often works well for estimating generalization error for continuous error functions such as the mean squared error, but it may perform poorly for discontinuous error functions such as the number of misclassified cases. In the latter case, k-fold cross-validation is preferred. But if k gets too small, the error estimate is pessimistically biased because of the difference in training-set size between the full-sample analysis and the cross-validation analyses. (For model-selection purposes, this bias can actually help; see the discussion below of Shao, 1993.) A value of 10 for k is popular for estimating generalization error.
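A sketch of using 10-fold cross-validation for model selection, with hypothetical data and two arbitrary candidate models (cross_val_score from scikit-learn returns one score per fold; the candidate with the smallest mean error would be chosen):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import Ridge
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(2)
    X = rng.uniform(-3, 3, size=(300, 2))            # hypothetical inputs
    y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=300)

    # Estimate generalization error of each candidate with k = 10 folds.
    candidates = {"ridge": Ridge(),
                  "tree": DecisionTreeRegressor(max_depth=4)}
    for name, model in candidates.items():
        mse = -cross_val_score(model, X, y, cv=10,
                               scoring="neg_mean_squared_error")
        print(name, "10-fold CV MSE:", mse.mean())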
