5.4 OTHER ESTIMATES
Source: Internet · Published: 运单软件合成 · Editor: 程序博客网 · Date: 2024/06/07 06:40
(1)Leave-One-Out Cross-Validation
Each instance in turn is left out, and the learning
scheme is trained on all the remaining instances.
It is judged by its correctness on the held-out
instance—one or zero for success or failure, respectively.
The results of all n judgments, one for each member of the dataset,
are averaged, and that average represents the final error estimate.
Suppose the dataset contains n samples. In LOOCV,
each sample in turn serves as the test set while the remaining n - 1 samples form the training set,
so a complete LOOCV run builds n models.
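The loop described above can be sketched as follows. The toy data and the 1-nearest-neighbour rule are my own illustration, not from the text; any learning scheme could be plugged in.

```python
# Leave-one-out cross-validation: hold out each instance in turn,
# train on the remaining n - 1, and average the n 0/1 judgments.

def loo_cv(X, y, train_and_predict):
    """Return the average 0/1 error over all n leave-one-out folds."""
    n = len(X)
    errors = 0
    for i in range(n):
        X_train = X[:i] + X[i + 1:]   # all instances except instance i
        y_train = y[:i] + y[i + 1:]
        pred = train_and_predict(X_train, y_train, X[i])
        errors += (pred != y[i])      # one or zero for failure/success
    return errors / n

def nn_predict(X_train, y_train, x):
    """Toy learner: 1-nearest neighbour on 1-D features."""
    j = min(range(len(X_train)), key=lambda k: abs(X_train[k] - x))
    return y_train[j]

X = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2]
y = ['a', 'a', 'a', 'b', 'b', 'b']
print(loo_cv(X, y, nn_predict))   # builds 6 models, one per instance
```

Because no random sampling is involved, running this twice on the same data gives the same estimate.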
Advantages:
First, the greatest possible amount of data is used for training
in each case, which presumably increases the chance that
the classifier is an accurate one.
In each round nearly all of the samples are used to train the model, so the estimate is as close as possible to what the full population would give.
Second, the procedure is deterministic:
No random sampling is involved. There is no point in repeating it
10 times, or repeating it at all: The same result will be obtained
each time.
No random factor influences the experimental results, which ensures the procedure is
reproducible.
Nevertheless, leave-one-out seems to offer a chance of
squeezing the maximum out of a small dataset and getting
as accurate an estimate as possible.
Disadvantages:
First, high computational cost: the entire learning
procedure must be executed n times, which is usually infeasible
for large datasets.
(The drawback is the high computational cost, since n models must be built;
when the total number of samples is large, LOOCV is impractical.)
Second, by its very nature, it cannot be stratified—worse than that,
it guarantees a nonstratified sample. Stratification involves getting the correct
proportion of examples in each class into the test set, and this is impossible when the
test set contains only a single example. A dramatic, although highly artificial,
illustration of the problems this might cause is to imagine a completely random dataset
that contains exactly the same number of instances of each of two classes. The best
that an inducer can do with random data is to predict the majority class, giving a
true error rate of 50%. But in each fold of leave-one-out, the opposite class to the
test instance is in the majority—and therefore the predictions will always be
incorrect, leading to an estimated error rate of 100%.
(This method cannot be stratified: stratification requires that each class appear in the
test set in the same proportion as in the whole dataset, and with a single-sample
test set this condition cannot be met. For example, suppose a completely random
dataset whose instances fall into two classes of equal size. Since the data are random,
the best an inducer can do is predict the majority class, giving an error rate of 50%.
In leave-one-out, however, each time one sample is held out as the test set, the
opposite class holds the majority in the training set, so by the majority rule every
prediction is wrong.)
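This pathological case can be reproduced in a few lines. The toy setup (50 balanced instances, a majority-class predictor) is my own, mirroring the book's thought experiment:

```python
# On a perfectly balanced two-class dataset, leaving one instance out
# always puts the OPPOSITE class in the majority of the training fold,
# so a majority-class predictor is wrong on every single fold.

def majority(labels):
    """Most frequent label in the training fold."""
    return max(set(labels), key=labels.count)

y = ['a'] * 25 + ['b'] * 25          # 50 instances, exactly balanced

errors = 0
for i in range(len(y)):
    fold = y[:i] + y[i + 1:]         # leave instance i out: 24 vs 25
    errors += (majority(fold) != y[i])

print(errors / len(y))               # estimated error rate: 1.0 (100%)
```

The true error rate of the majority rule on random data is 50%, yet leave-one-out reports 100%, exactly as the text describes.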
(2) 0.632 Bootstrap
A dataset of n instances is sampled n times, with replacement, to give another
dataset of n instances. Because some elements in this second dataset will
(almost certainly) be repeated, there must be some instances in the original dataset
that have not been picked—we will use these as test instances.
(The dataset contains n samples; drawing from it n times with replacement yields
another dataset of n samples, which serves as the training set. Because some samples
in this second dataset are repeated, some samples in the original dataset are never
drawn; these form the test set.)
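The sampling step can be sketched as below (the setup is my own illustration). On average about 36.8% of the instances, a fraction of roughly 1/e, are never drawn and end up in the out-of-bag test set:

```python
import random

# Bootstrap sampling: draw n indices with replacement to form the
# training set; indices never drawn form the out-of-bag test set.

random.seed(0)                                    # for reproducibility
n = 10_000
train = [random.randrange(n) for _ in range(n)]   # n draws with replacement
oob = set(range(n)) - set(train)                  # instances never picked

print(len(oob) / n)   # close to (1 - 1/n)**n, i.e. about 0.368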
Advantage:
The bootstrap procedure may be the best way of estimating the error ratefor
very small datasets.
Disadvantage:
estimate of the true error:e= 0. 6320× e training- instances+0.368×etest -instances
e:error
etraining- instances:the error rate of training instances
e test–instances:the error rate of test –instances
In fact, the very dataset we consideredabove will do: a completely random
dataset with two classes of equal size. Thetrue error rate is 50% for any prediction
rule. But a scheme that memorized thetraining set would give a perfect resubstitution
score of 100%, so that e training-instances = 0, and the 0.632 bootstrap will mix this in with
a weight of 0.368to give an overall error rate of only 31.6% (0.632 × 50% + 0.368 × 0%),
which ismisleadingly optimistic.
假设有一个数据集,其中的样本是完全随机的,并且数据可以分为两类,
每一个类中所含的样本是一样多的,因此错误率是50%。假设我们由
bootstrap得到了一个“完美的”的分类器,对训练集分类的正确率是100%,
即错误率是0%。那么对检验集的正确率或者错误率只能是50%,
有以上公式(0.632× 50% + 0.368 × 0%=31.6%<50%).
因此所得到的结果是过分乐观的
- 5.4 OTHER ESTIMATES
- other
- other
- Other
- Other
- other
- other
- other
- other
- Other
- other
- Other
- Paul Glen: Truth and project time estimates
- other note
- other progress
- Other ref
- other way
- vim other
- 开发日志:hibernate映射,使表中的一个字段对应实体中的多个属性
- 收录一个RGB颜色转化的库
- hdu 4089 概率dp
- 黑马程序员_java基础-网络编程
- 5.iOS 自定义后台
- 5.4 OTHER ESTIMATES
- INVALID_SOCKET的值
- JavaWeb_session_防表单重复提交
- 高效编辑器vim之文件加密
- 黑马程序员_java基础-正则表达式
- 新浪微博客户端开发之adapter
- Introducing Document Management in SharePoint 2010 介绍SharePoint 2010中的文档管理
- Android API教程:人脸检测(上)
- Intent;Bundle--Activity之间跳转;Activity和Server之间跳转