AdaBoost Learning

Source: Internet. Editor: 程序博客网. Time: 2024/06/05 10:02

I have been reading about AdaBoost recently and found a good set of slides; the key points are summarized here:

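In 1984 Valiant proposed the PAC (Probably Approximately Correct) learning model. Within this framework, Kearns and Valiant later formulated the two concepts of strong learning and weak learning.

Put simply:

Strong learning: learning with high classification accuracy; a classifier obtained by strong learning is called a strong classifier.

Weak learning: learning with low classification accuracy (only slightly better than random guessing); a classifier obtained by weak learning is called a weak classifier.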
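
Stated a little more formally, the two notions differ in how small an error they must reach. The following is only a sketch of the usual PAC-style definitions: the symbols $\varepsilon$, $\delta$, $\gamma$ and $\operatorname{err}(h)$ (the error of the learned classifier $h$) are notation introduced here, not taken from the slides, and the requirements on running time and sample size are left out.

$\text{strong: } \forall\, \varepsilon, \delta \in (0,1): \; \Pr\left[\operatorname{err}(h) \le \varepsilon\right] \ge 1 - \delta$

$\text{weak: } \exists\, \gamma > 0 \;\; \forall\, \delta \in (0,1): \; \Pr\left[\operatorname{err}(h) \le \tfrac{1}{2} - \gamma\right] \ge 1 - \delta$

A strong learner can be pushed to any target error $\varepsilon$, while a weak learner only has to beat random guessing by some fixed margin $\gamma$.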

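Freund's main contribution:

Freund proved that, given enough data, weak learners can be turned into a strong learner by combining them into an ensemble.

This is a very important result. In practice, people can often find a weak learning method easily from domain experience, while finding a strong learning method directly is difficult in many cases. It is therefore natural to find a weak learner first and then convert it into a strong one, and Freund proved that this approach is feasible.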

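The basic idea of AdaBoost

Linearly combine a large number of weak classifiers, each "good at a different area", into one strong classifier with high classification ability.

Weak classifier → expert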

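So the two key questions are:

(1) How to find experts in different areas

AdaBoost adjusts the weight distribution over the training samples, which automatically creates the different "areas", and then finds an expert (weak classifier) for each of them.

(2) How to find the linear combination

AdaBoost decides how much to trust each expert's opinion (the weak classifier's weight) from that expert's error rate. The process of linearly combining the weak classifiers is called boosting.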
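
To make point (2) concrete, here is a minimal Python sketch of how an error rate can be mapped to a vote weight. It assumes the commonly used choice $\alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$; the function name `classifier_weight` is hypothetical and exists only for this illustration.

```python
import math

def classifier_weight(error_rate: float) -> float:
    """Vote weight (alpha) of a weak classifier with the given weighted error rate.

    Assumes the common choice alpha = 0.5 * ln((1 - eps) / eps).
    """
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must lie strictly between 0 and 1")
    return 0.5 * math.log((1.0 - error_rate) / error_rate)

# Print the weight for a few illustrative error rates.
for eps in (0.1, 0.3, 0.5, 0.7):
    print(f"error rate {eps:.1f} -> weight {classifier_weight(eps):+.3f}")
```

An expert that errs less often gets a larger weight, an expert at exactly 0.5 (pure guessing) gets weight zero, and an expert that is wrong more often than right gets a negative weight, so its vote is effectively inverted.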

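A "universal" conversion framework

If there were a "universal" framework for turning weak learning into strong learning, nobody would have to struggle to find a strong learner!

AdaBoost is exactly such a framework!

The algorithm as described on Wikipedia (copied from http://en.wikipedia.org/wiki/AdaBoost ):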

Given:

training set: $(x_{1},y_{1}),\ldots,(x_{m},y_{m})$ where $x_{i} \in X,\, y_{i} \in Y = \{-1, +1\}$ // m training samples
number of iterations $T$ // T boosting rounds

For $i=1,\ldots,m$ :

Initialize $D_{1}(i) = \frac{1}{m}$ // initialize the sample weights uniformly

For $t = 1,\ldots,T$ :

From the family of weak classifiers $\mathcal{H}$, find the classifier $h_{t}$ that maximizes the absolute value of the difference between the corresponding weighted error rate $\epsilon_{t}$ and 0.5 with respect to the distribution $D_{t}$ :

$h_{t} = \underset{h_{t} \in \mathcal{H}}{\operatorname{argmax}} \; \left\vert 0.5 - \epsilon_{t}\right\vert$

where $\epsilon_{t} = \sum_{i=1}^{m} D_{t}(i)I(y_i \ne h_{t}(x_{i}))$ . (I is the indicator function)

If $\left\vert 0.5 - \epsilon_{t}\right\vert \leq \beta$ , where $\beta$ is a previously chosen threshold, then stop.
Choose $\alpha_{t} \in \mathbb{R}$ , typically $\alpha_{t}=\frac{1}{2}\ln\frac{1-\epsilon_{t}}{\epsilon_{t}}$ . // the weight of hypothesis $h_{t}$ in the final vote
For $i = 1,\ldots,m$ :

Update $D_{t+1}(i) = \frac{ D_t(i) \exp(\alpha_t (2 I(y_i \ne h_{t}(x_{i})) - 1 )) }{ Denom },$

where the denominator, $Denom$ , is the normalization factor ensuring that $D_{t+1}$ will be a probability distribution.
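
The error, weight, and update steps of a single round can be sketched in Python as follows. This is an illustration rather than Wikipedia's code: the function name `reweight` and the four-sample toy data are made up, and the code assumes $0 < \epsilon_{t} < 1$ so that the logarithm is defined.

```python
import math

def reweight(weights, labels, predictions):
    """One AdaBoost round: return (epsilon, alpha, new_weights).

    Assumes 0 < epsilon < 1; hypothetical helper for illustration only.
    """
    miss = [y != p for y, p in zip(labels, predictions)]
    # Weighted error rate: total weight of the misclassified samples.
    epsilon = sum(w for w, m in zip(weights, miss) if m)
    alpha = 0.5 * math.log((1.0 - epsilon) / epsilon)
    # The exponent 2*I(miss) - 1 is +1 for a miss and -1 for a hit.
    unnormalized = [w * math.exp(alpha if m else -alpha)
                    for w, m in zip(weights, miss)]
    denom = sum(unnormalized)  # normalization factor "Denom"
    return epsilon, alpha, [u / denom for u in unnormalized]

labels      = [+1, +1, -1, -1]
predictions = [+1, +1, -1, +1]   # only the last sample is misclassified
weights     = [0.25] * 4         # uniform D_1
eps, alpha, new_weights = reweight(weights, labels, predictions)
print(eps, alpha, new_weights)
```

With this choice of $\alpha_{t}$, the misclassified samples together hold exactly half of the total weight after the update (here the single missed sample goes from 0.25 to 0.5), so the classifier just chosen is no better than guessing on $D_{t+1}$.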

Output the final classifier:

$H(x) = \textrm{sign}\left( \sum_{t=1}^{T} \alpha_{t}h_{t}(x)\right)$

Thus, after selecting an optimal classifier $h_{t}$ for the distribution $D_{t}$, the examples $x_{i}$ that the classifier $h_{t}$ identified correctly are weighted less and those that it identified incorrectly are weighted more. Therefore, when the algorithm is testing the classifiers on the distribution $D_{t+1}$, it will select a classifier that better identifies those examples that the previous classifier missed.
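
Putting the steps together, the whole procedure can be sketched in pure Python. Everything specific here is an assumption made for illustration: the weak classifiers are one-dimensional threshold "stumps", the data set is a made-up toy example, the names `stump_predict`, `train_adaboost` and `predict` are hypothetical, and the clipping of $\epsilon_{t}$ is an added guard that is not part of the description above.

```python
import math

def stump_predict(x, threshold, polarity):
    """Decision stump: predict `polarity` left of the threshold, `-polarity` otherwise."""
    return polarity if x < threshold else -polarity

def train_adaboost(xs, ys, thresholds, rounds, beta=1e-9):
    """Return the ensemble as a list of (alpha, threshold, polarity) tuples."""
    m = len(xs)
    dist = [1.0 / m] * m                      # D_1(i) = 1/m
    ensemble = []
    for _ in range(rounds):
        best, best_gap, best_eps = None, -1.0, None
        for thr in thresholds:
            for pol in (+1, -1):
                # Weighted error of this stump under the current distribution.
                eps = sum(d for d, x, y in zip(dist, xs, ys)
                          if stump_predict(x, thr, pol) != y)
                gap = abs(0.5 - eps)
                if gap > best_gap + 1e-12:    # tolerance: the first of tied stumps wins
                    best, best_gap, best_eps = (thr, pol), gap, eps
        if best_gap <= beta:                  # no stump beats guessing: stop
            break
        # Assumption: clip epsilon so the logarithm stays finite.
        eps = min(max(best_eps, 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        thr, pol = best
        ensemble.append((alpha, thr, pol))
        # Raise the weight of misclassified samples, lower the rest, normalize.
        dist = [d * math.exp(alpha if stump_predict(x, thr, pol) != y else -alpha)
                for d, x, y in zip(dist, xs, ys)]
        denom = sum(dist)
        dist = [d / denom for d in dist]
    return ensemble

def predict(ensemble, x):
    """Final classifier H(x): sign of the alpha-weighted vote (0 counts as +1)."""
    score = sum(alpha * stump_predict(x, thr, pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Toy data: no single stump separates these labels.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
thresholds = [v - 0.5 for v in range(11)]
for rounds in (1, 2, 3):
    model = train_adaboost(xs, ys, thresholds, rounds)
    errors = sum(predict(model, x) != y for x, y in zip(xs, ys))
    print(f"rounds={rounds}: training errors={errors}")
```

On this toy data the best single stump still misclassifies three of the ten samples, while the weighted vote of three stumps classifies all of them correctly. Because the selection rule maximizes $\left\vert 0.5 - \epsilon_{t}\right\vert$, a stump with error above 0.5 may be picked; it then receives a negative $\alpha_{t}$, which is equivalent to using the flipped stump.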