Understanding the AdaBoost Algorithm, Based on Machine Learning in Action (机器学习实战)

Source: Internet. Editor: 程序博客网. Time: 2024/05/22 11:30

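AdaBoost trains a weak classifier repeatedly on the same data set, and each round focuses on the samples that the earlier classifiers got wrong. It raises the weights of misclassified samples and lowers the weights of correctly classified ones. Finally, all the weak classifiers are combined into one classifier, which is then used for testing.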
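One round of the reweighting described above can be sketched as follows. This is a minimal illustration, not the article's code: the labels and predictions are made-up values in which only the first sample is misclassified.

```python
import numpy as np

# Hypothetical toy case: true labels and one weak classifier's predictions.
labels = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
predictions = np.array([-1.0, 1.0, -1.0, -1.0, 1.0])  # only sample 0 is wrong

# Start from uniform sample weights.
D = np.full(len(labels), 1.0 / len(labels))

# Weighted error: total weight of the misclassified samples.
error = D[predictions != labels].sum()

# Classifier weight: alpha = 0.5 * ln((1 - error) / error).
alpha = 0.5 * np.log((1.0 - error) / max(error, 1e-16))

# Misclassified samples are scaled by exp(+alpha), correct ones by exp(-alpha),
# then the weights are renormalized so they sum to 1.
D_new = D * np.exp(-alpha * labels * predictions)
D_new /= D_new.sum()

print(error, alpha)
print(D_new)
```

With a weighted error of 0.2, alpha is 0.5 * ln(4), about 0.693. After renormalization the misclassified sample carries weight 0.5 and each of the four correct samples carries 0.125, so the next weak classifier is pushed to get that sample right.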

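Bagging: a technique that draws from the original data set S times to produce S new data sets. Each new data set is the same size as the original. It is built by repeatedly picking a random sample from the original data set with replacement, so a new data set can contain duplicate values. Once the data sets exist, the same learning algorithm is applied to each of them, which gives S classifiers. To classify a new sample, all S classifiers are applied, and the class with the most votes is the final result.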
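A minimal sketch of the procedure, assuming a one-feature data set and a made-up weak learner (`fit_stump`, which splits at the midpoint between the two class means). The function names, the data, and the choice of S = 11 are illustrative assumptions, not taken from the book.

```python
import numpy as np

def fit_stump(x, y):
    """Hypothetical weak learner: split at the midpoint between the class means."""
    pos, neg = x[y == 1], x[y == -1]
    if len(pos) == 0:
        return np.inf       # no positives in this resample: always predict -1
    if len(neg) == 0:
        return -np.inf      # no negatives in this resample: always predict +1
    return (pos.mean() + neg.mean()) / 2.0

def predict_stump(threshold, x):
    return np.where(x > threshold, 1, -1)

def bagging_fit(x, y, S, seed=0):
    """Train S stumps, each on a bootstrap resample of the same size as x."""
    rng = np.random.default_rng(seed)
    thresholds = []
    for _ in range(S):
        idx = rng.integers(0, len(x), size=len(x))  # sampling with replacement
        thresholds.append(fit_stump(x[idx], y[idx]))
    return thresholds

def bagging_predict(thresholds, x):
    """Equal-weight majority vote over the S classifiers."""
    votes = np.array([predict_stump(t, x) for t in thresholds])  # shape (S, n)
    return np.where(votes.sum(axis=0) >= 0, 1, -1)

x = np.array([0.0, 1.0, 2.0, 3.0, 7.0, 8.0, 9.0, 10.0])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])
models = bagging_fit(x, y, S=11)
print(bagging_predict(models, np.array([1.5, 8.5])))
```

Using an odd S avoids tied votes. Every classifier has the same say in the vote, which is the main contrast with boosting.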

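Boosting is similar to bagging: both use the same type of classifier throughout. The difference is that in boosting each new classifier is trained according to the performance of the classifiers already trained, whereas bagging trains its classifiers independently and gives them equal weight. Boosting concentrates on the data that the earlier classifiers misclassified, and its classifiers are weighted by how well they perform.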

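Ensemble: unlike boosting and bagging, this approach (called stacking in the English explanation further down) can build its classifiers with different algorithms, and then uses another classifier to classify the outputs of those classifiers.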
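A rough sketch of that idea, under several assumptions: the toy data, the two base models (a fixed threshold rule and a nearest-centroid classifier), and the least-squares meta-model are all illustrative choices. A careful implementation would train the meta-model on held-out predictions rather than on the training data itself.

```python
import numpy as np

# Hypothetical toy data: the label is +1 when x0 + x1 > 1, otherwise -1.
X = np.array([[0.0, 0.0], [0.2, 0.9], [0.9, 0.3], [1.0, 1.0],
              [0.1, 0.4], [0.8, 0.9], [0.6, 0.1], [0.7, 0.7]])
y = np.array([-1.0, 1.0, 1.0, 1.0, -1.0, 1.0, -1.0, 1.0])

def threshold_model(X):
    """Base model A: a fixed rule on feature 0."""
    return np.where(X[:, 0] > 0.5, 1.0, -1.0)

def fit_centroids(X, y):
    """Base model B (a different algorithm): nearest class centroid."""
    return X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)

def centroid_model(centroids, X):
    pos, neg = centroids
    d_pos = ((X - pos) ** 2).sum(axis=1)
    d_neg = ((X - neg) ** 2).sum(axis=1)
    return np.where(d_pos < d_neg, 1.0, -1.0)

centroids = fit_centroids(X, y)

def base_outputs(X):
    """Base predictions plus a bias column: the inputs of the meta-model."""
    return np.column_stack([threshold_model(X),
                            centroid_model(centroids, X),
                            np.ones(len(X))])

# Meta-model: least-squares weights over the base models' outputs.
F = base_outputs(X)
w, *_ = np.linalg.lstsq(F, y, rcond=None)

def stacked_predict(X):
    return np.where(base_outputs(X) @ w >= 0, 1.0, -1.0)

print(w)
print(stacked_predict(X))
```

The learned weights `w` play the role of the second-level classifier: they decide how much to trust each base model instead of using a fixed voting rule.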

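# First build a weak classifier (a decision stump), then use it to build the AdaBoost classifier. The strong classifier raises the weight of every misclassified point, lowers the weight of every correctly classified point, and classifies again.
# Each round appends the parameters of its best stump to a list: the threshold, the side of the split ('lt' or 'gt'), and alpha (the weight of that classifier). The weighted error is the sum of the weights of the misclassified points.
# The loop keeps running and saving these parameters. When it ends, the list holds all the weak classifiers and their parameters, and that list is used for prediction.
# Why does accuracy improve? For each test sample, the strong classifier adds up the alpha-weighted outputs of all the weak classifiers and takes the sign of the sum.
# If a sample's true label is -1, the sum should come out negative. If the sample was misclassified early on, the later rounds keep raising its weight, which pushes the sum toward the label it belongs to.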

# -*- coding: utf-8 -*-
import numpy as np


def loadSimpData():
    # Toy data set: five 2-D points with labels +1 / -1.
    datMat = np.matrix([[1. , 2.1],
                        [2. , 1.1],
                        [1.3, 1. ],
                        [1. , 1. ],
                        [2. , 1. ]])
    classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
    return datMat, classLabels

# Base (weak) classifier: a decision stump. With 'lt', samples whose feature
# value is <= the threshold are labeled -1; with 'gt', samples whose value is
# > the threshold are labeled -1. All other samples are labeled +1.
def stumClassify(dataMatrix, dimen, treshVal, threshIneq):
    maxLabel = np.ones((np.shape(dataMatrix)[0], 1))
    if threshIneq == 'lt':
        maxLabel[dataMatrix[:, dimen] <= treshVal] = -1.0
    else:
        maxLabel[dataMatrix[:, dimen] > treshVal] = -1.0
    return maxLabel


# For every feature, try a range of thresholds and both split directions.
# Keep the feature, threshold, and direction that give the lowest weighted
# error under the sample weights D.
def buildStumy(dataArr, classLabel, D):
    maxtriArr = np.asmatrix(dataArr)
    labelMat = np.asmatrix(classLabel).T
    rows, cols = np.shape(maxtriArr)
    steps = 10
    minError = np.inf
    bestStum = {}
    bestClassEst = np.asmatrix(np.zeros((rows, 1)))
    for dimen in range(cols):
        minArr = maxtriArr[:, dimen].min()
        maxArr = maxtriArr[:, dimen].max()
        # The step size spans this feature's range (max - min).
        stepsize = (maxArr - minArr) / float(steps)
        for step in range(-1, steps + 1):
            for inequal in ['lt', 'gt']:
                threshold = minArr + stepsize * float(step)
                prediction = stumClassify(maxtriArr, dimen, threshold, inequal)
                # 1 where the stump is wrong, 0 where it is right.
                errorArr = np.asmatrix(np.ones((rows, 1)))
                errorArr[prediction == labelMat] = 0
                # Weighted error: total weight of the misclassified samples.
                weightError = float((D.T * errorArr)[0, 0])
                # print("split: dim %d, thresh %.2f, thresh ineqal: %s, the weighted error is %.3f" % (dimen, threshold, inequal, weightError))
                if weightError < minError:
                    minError = weightError
                    bestClassEst = prediction.copy()
                    bestStum['dim'] = dimen
                    bestStum['ineq'] = inequal
                    bestStum['thresh'] = threshold
    return bestStum, minError, bestClassEst

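# numIt sets the maximum number of rounds, and each round builds one weak
# classifier. After every round the weights of the misclassified samples go up
# and the weights of the correctly classified samples go down, which changes
# the stump chosen in the next round.
# The weighted error is the sum of the weights of the misclassified samples.
def adaBoostTrainDS(matArr, matLabel, numIt=40):
    weekaClassArr = []
    rows = np.shape(matArr)[0]
    labelMat = np.asmatrix(matLabel).T
    D = np.asmatrix(np.ones((rows, 1)) / rows)
    aggClassEst = np.asmatrix(np.zeros((rows, 1)))
    for i in range(numIt):
        bestStum, errorArr, bestClassEst = buildStumy(matArr, matLabel, D)
        # Classifier weight; max(...) guards against division by zero.
        alpha = float(0.5 * np.log((1.0 - errorArr) / max(errorArr, 1e-16)))
        bestStum['alpha'] = alpha
        weekaClassArr.append(bestStum)
        # exp(-alpha) for correct samples, exp(+alpha) for wrong ones.
        expon = np.multiply(-1 * alpha * labelMat, bestClassEst)
        D = np.multiply(D, np.exp(expon))
        D = D / D.sum()
        # Running alpha-weighted sum of all weak classifiers so far.
        aggClassEst += alpha * bestClassEst
        aggerror = np.multiply(np.sign(aggClassEst) != labelMat, np.ones((rows, 1)))
        errorRate = aggerror.sum() / rows
        # print("total error: ", errorRate)
        if errorRate == 0:
            break
    return weekaClassArr, aggClassEst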

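# Combine the weak classifiers into one strong classifier. The training
# function above returns the list of weak classifiers. Each one classifies the
# test data, and each result is multiplied by that classifier's alpha and summed.
# The result is an array as long as the number of test samples; sign() turns
# each sum into +1 or -1.
# The idea: most of each weak classifier's outputs are correct, and every round
# raises the weights of the misclassified points so that the next classifier
# tries to get them right.
# This moves misclassified points toward their correct label. A point that most
# classifiers get right and a few get wrong still ends up with a sum close to
# its label (+1 or -1). The weights differ too: classifiers that misclassify
# get a small weight, and those that correct the mistakes get a large weight.
# So after the weighted sum, most samples land near their own label. It works
# like voting: everyone votes, each with their own criteria.
# Nobody has complete knowledge of a person or a thing, but everyone knows part
# of it, so together the group gets close to the full picture.
def addClassifier(test, classifer):
    dataMatrix = np.asmatrix(test)
    m = np.shape(dataMatrix)[0]
    aggEst = np.asmatrix(np.zeros((m, 1)))
    numClassifer = len(classifer)
    print(numClassifer)
    for index in range(numClassifer):
        classError = stumClassify(dataMatrix, classifer[index]['dim'],
                                  classifer[index]['thresh'], classifer[index]['ineq'])
        aggEst += classifer[index]['alpha'] * classError
    return np.sign(aggEst)


if __name__ == '__main__':
    maxtrixArr, maxtrixLabel = loadSimpData()
    weekaClassArr, aggClassEst = adaBoostTrainDS(np.asmatrix(maxtrixArr), np.asmatrix(maxtrixLabel), numIt=40)
    print(addClassifier([[5, 5], [0, 0]], weekaClassArr))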
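The weighted vote in the comments above can be checked by hand on a small case. The alpha values and the individual votes below are made up for illustration and are not output of the script.

```python
import numpy as np

# Hypothetical classifier weights for three weak classifiers.
alphas = np.array([0.69, 0.97, 0.90])

# Each row holds one weak classifier's predictions for two test samples.
# The first classifier disagrees with the other two on both samples.
votes = np.array([[-1.0,  1.0],
                  [ 1.0, -1.0],
                  [ 1.0, -1.0]])

# Alpha-weighted sum per sample, then the sign gives the final label.
agg = alphas @ votes
final = np.sign(agg)

print(agg)
print(final)
```

The sums are 1.18 and -1.18, so the final labels are +1 and -1. The dissenting classifier is outvoted because the combined weight of the other two is larger, which is the same mechanism `addClassifier` relies on.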

Explanation(1)

The idea behind bagging is that when you OVERFIT with a nonparametric regression method (usually regression or classification trees, but can be just about any nonparametric method), you tend to go to the high variance, no (or low) bias part of the bias/variance tradeoff. This is because an overfitting model is very flexible (so low bias over many resamples from the same population, if those were available) but has high variability (if I collect a sample and overfit it, and you collect a sample and overfit it, our results will differ because the non-parametric regression tracks noise in the data). What can we do? We can take many resamples (from bootstrapping), each overfitting, and average them together. This should lead to the same bias (low) but cancel out some of the variance, at least in theory.
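The "cancel out some of the variance" claim can be made concrete with a standard identity. The symbols are assumptions introduced here, not part of the quoted answer: suppose the B bootstrap predictors are identically distributed, each with variance σ² and pairwise correlation ρ.

```latex
\[
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
\]
```

As B grows, the second term shrinks toward zero while the first term stays. Averaging therefore removes only the part of the variance that is not shared between the resampled models, which is why the reduction is partial.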

Gradient boosting at its heart works with UNDERFIT nonparametric regressions, which are too simple and thus aren't flexible enough to describe the real relationship in the data (i.e. biased) but, because they are underfitting, have low variance (you'd tend to get the same result if you collected new data sets). How do you correct for this? Basically, if you underfit, the RESIDUALS of your model still contain useful structure (information about the population), so you augment the tree you have (or whatever nonparametric predictor) with a tree built on the residuals. This should be more flexible than the original tree. You repeatedly generate more and more trees, each at step k augmented by a weighted tree based on a tree fitted to the residuals from step k-1. One of these trees should be optimal, so you either end up by weighting all these trees together or selecting one that appears to be the best fit. Thus gradient boosting is a way to build a bunch of more flexible candidate trees.
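A minimal sketch of fitting to residuals, assuming squared-error loss, one-split regression stumps as the underfit model, and a made-up shrinkage factor `learning_rate`. The data (a sine curve) and all names are illustrative.

```python
import numpy as np

def fit_stump(x, r):
    """Best single-split regression stump for targets r (squared error)."""
    best = None
    for t in np.unique(x)[:-1]:                 # keeps both sides non-empty
        left, right = r[x <= t], r[x > t]
        pred = np.where(x <= t, left.mean(), right.mean())
        sse = ((r - pred) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]                             # (threshold, left value, right value)

def predict_stump(stump, x):
    t, left_val, right_val = stump
    return np.where(x <= t, left_val, right_val)

x = np.linspace(0.0, 6.0, 30)
y = np.sin(x)

learning_rate = 0.5                     # hypothetical shrinkage factor
pred = np.full_like(y, y.mean())        # step 0: a constant model
mse_history = [((y - pred) ** 2).mean()]
for _ in range(20):
    residual = y - pred                 # structure the current model still misses
    stump = fit_stump(x, residual)      # underfit model of the residuals
    pred += learning_rate * predict_stump(stump, x)
    mse_history.append(((y - pred) ** 2).mean())

print(mse_history[0], mse_history[-1])
```

Each stump alone is far too simple for a sine curve, but the sum of many stumps fitted to successive residuals tracks it increasingly well, so the training error in `mse_history` never goes up from one step to the next.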

Like all nonparametric regression or classification approaches, sometimes bagging or boosting works great, sometimes one or the other approach is mediocre, and sometimes one or the other approach (or both) will crash and burn.

Also, both of these techniques can be applied to regression approaches other than trees, but they are most commonly associated with trees, perhaps because it is difficult to set parameters so as to avoid underfitting or overfitting.

In short: overfitting means high variance, underfitting means high bias.

Explanation(2)

All three are so-called "meta-algorithms": approaches that combine several machine learning techniques into one predictive model in order to decrease the variance (bagging), decrease the bias (boosting), or improve the predictive force (stacking, alias ensemble).

Every algorithm consists of two steps:

1. Producing a distribution of simple ML models on subsets of the original data.
2. Combining the distribution into one "aggregated" model.

Here is a short description of all three methods:

Bagging (stands for Bootstrap Aggregation) is a way to decrease the variance of your prediction by generating additional data for training from your original dataset, using combinations with repetitions to produce multisets of the same cardinality/size as your original data. By increasing the size of your training set you can't improve the model's predictive force, but only decrease the variance, narrowly tuning the prediction to the expected outcome.
Boosting is a two-step approach, where one first uses subsets of the original data to produce a series of averagely performing models and then "boosts" their performance by combining them together using a particular cost function (=majority vote). Unlike bagging, in classical boosting the subset creation is not random and depends upon the performance of the previous models: every new subset contains the elements that were (likely to be) misclassified by previous models.
Stacking is similar to boosting: you also apply several models to your original data. The difference here is, however, that you don't have just an empirical formula for your weight function; rather, you introduce a meta-level and use another model/approach that takes the input together with the outputs of every model to estimate the weights or, in other words, to determine which models perform well and which badly given these input data.

Wiki

http://stats.stackexchange.com/questions/18891/bagging-boosting-and-stacking-in-machine-learning

Machine Learning in Action (机器学习实战)

0 0