(I) Naive Bayes Learning (1): Understanding Entropy in Machine Learning

Source: Internet. Editor: 程序博客网. Time: 2024/06/05 14:07

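Background:

The naive Bayes classifier (NBC) assumes that the features are conditionally independent given the class label, yet not every feature contributes to classification, so feature selection is needed. NB has the following problems:

1. NB fits a joint distribution over many features;

2. Not all features contribute to classification;

3. The computational cost is O(D).

Solutions:

* Remove irrelevant features and keep only the most important ones;

* The simplest approach is to find the K most relevant features;

* Measure of feature relevance: the mutual information between the feature and the target Y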
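For reference, the standard definition of mutual information between a discrete feature and the label can be written as follows. The symbols $X_j$, $x_j$, $y$ and $p$ are notation assumed here, not taken from the notes:

```latex
I(X_j, Y) = \sum_{x_j} \sum_{y} p(x_j, y) \log \frac{p(x_j, y)}{p(x_j)\, p(y)}
```

The sum runs over every value $x_j$ of feature j and every class label $y$. The value is 0 when the feature and the label are independent, and it grows as the feature becomes more informative about the label.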

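This formula is where our main character shows up: MI (mutual information) can be thought of as the reduction in entropy on the label distribution once we observe the value of feature j.

In other words, once we observe the value of feature j, the entropy of the label distribution goes down. (Hold on, entropy has not even appeared anywhere yet!)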

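I. Understanding entropy:

The word "reduction" appears here, which suggests that the more information we have, the lower the entropy should be.

Entropy is a concept from information theory (it is enough to know that; the key points come next), and it is tied to uncertainty:

Here is its formula first (just take a look; it is fine if it does not make sense yet).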
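For a discrete random variable $X$ with $K$ possible states, the standard textbook definition is the following. The symbols $K$, $k$ and $p$ are notation assumed here:

```latex
H(X) = -\sum_{k=1}^{K} p(X = k) \log p(X = k)
```

With a base-2 logarithm the unit is bits; with the natural logarithm it is nats. Terms with $p(X = k) = 0$ are taken to contribute 0.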

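Key point: entropy is a measure of uncertainty. To make it easier to grasp: 1) Suppose we are lost and reach a junction with N roads to choose from. If we have no clue at all about any of them, the probability of picking each road is 1/N. We are then at our most confused, and the uncertainty is at its maximum (that is, the entropy is at its maximum), because we have no basis for choosing.

2) Conversely, suppose we checked the weather forecast today and know the wind direction, and the wind along one of the roads matches the forecast. Then we know with 90% confidence which road leads home, so our uncertainty (entropy) drops, and our prediction will usually be right.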
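The two situations can be compared numerically. This is a minimal sketch: the number of roads (4) and the even split of the remaining 10% over the other roads are assumptions made for illustration, and `entropy` is a hypothetical helper name.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

n_roads = 4  # hypothetical number of roads at the junction

# Case 1: no clue at all, so every road is equally likely (1/N each).
uniform = [1.0 / n_roads] * n_roads

# Case 2: the forecast makes one road 90% likely; the remaining 10%
# is split evenly over the other roads (an assumption for illustration).
informed = [0.9] + [0.1 / (n_roads - 1)] * (n_roads - 1)

h_uniform = entropy(uniform)
h_informed = entropy(informed)
print(f"no information: {h_uniform:.3f} bits")
print(f"with forecast:  {h_informed:.3f} bits")
```

With no information the entropy is log2(4) = 2 bits; with the forecast it falls to roughly 0.63 bits, which matches the intuition that extra information reduces uncertainty.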

Two remarks:

1) The entropy of a uniform distribution over n outcomes is log n.

2) The entropy of a deterministic distribution (in which one of the outcomes has probability 1 and the others have probability 0) is 0.

The entropy may be shown to reach a maximum for a uniform distribution.

For example:
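A minimal Python sketch can check the two remarks and the maximum claim numerically. The helper name `entropy` and the distributions chosen here are illustrative assumptions, not taken from the cited sources.

```python
import math

def entropy(probs):
    """Shannon entropy in nats (natural logarithm); 0 * log 0 is treated as 0."""
    return sum(-p * math.log(p) for p in probs if p > 0)

# Remark 1: a uniform distribution over n outcomes has entropy log n.
for n in (2, 4, 8):
    uniform = [1.0 / n] * n
    print(n, entropy(uniform), math.log(n))

# Remark 2: a deterministic distribution has entropy 0.
deterministic = [1.0, 0.0, 0.0]
print(entropy(deterministic))

# Among two-outcome distributions [theta, 1 - theta] on a coarse grid,
# the entropy is largest at theta = 0.5, i.e. for the uniform case.
thetas = [i / 10 for i in range(11)]
best_theta = max(thetas, key=lambda t: entropy([t, 1.0 - t]))
print(best_theta)
```

The grid search only covers the two-outcome case, so it illustrates the maximum property rather than proving it.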

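II. Entropy in machine learning

There is another kind of entropy that is relevant to classification in machine learning: the entropy that links a feature to the class label. In classification, entropy is often used to rank features by how much they reduce the uncertainty about the label, so that instances can be predicted from the most informative features.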

We define the class entropy of an attribute a as follows:

Hc(a) = sum of #(a=v)/#(all a-values)* H(class distribution | a=v)

where:

1 the sum is calculated over all values v of a,

2 #(a=v) is the number of instances for which a has the value v,

3 #(all a-values) is the total number of instances being considered,

4 H(class distribution | a=v) is the entropy of the class distribution, that is, of the probabilities of the various classes, over the instances that satisfy a=v.

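The value of a determines the probability distribution over the classes. The class entropy is therefore a weighted average of the entropies of these distributions: each possible value v of the chosen attribute is weighted by the fraction of the instances under consideration that take that value.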
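The definition of Hc(a) can be sketched in a few lines of Python. The toy data (an `outlook` attribute and a `play` label) and the function names are hypothetical, chosen only to make the weighted average concrete.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy (in bits) of the empirical class distribution of `labels`."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def class_entropy(values, labels):
    """Hc(a): weighted average of H(class distribution | a=v) over all values v of a."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)  # collect the class labels of instances with a=v
    total = len(labels)
    # weight each group's entropy by #(a=v) / #(all a-values)
    return sum(len(ys) / total * entropy(ys) for ys in groups.values())

# Hypothetical toy data: attribute "outlook" and class label "play".
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
play    = ["no",    "no",    "yes",  "no",   "yes",      "yes"]

h_class = entropy(play)                 # uncertainty about the label alone
h_given = class_entropy(outlook, play)  # Hc(outlook)
print(f"H(class)    = {h_class:.3f} bits")
print(f"Hc(outlook) = {h_given:.3f} bits")
print(f"reduction   = {h_class - h_given:.3f} bits")
```

On this toy data only the `rain` group is still mixed, so Hc(outlook) is 1/3 bit against 1 bit for the label alone. The difference H(class) - Hc(a) is the reduction in label entropy from observing a, which is one way to rank attributes.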

Purpose: these are my own study notes; if I have misunderstood anything, corrections are welcome.

References:

1. Entropy-Based Decision Tree Induction (paper)

2. Machine Learning: A Probabilistic Perspective (textbook)

0 0