Naive Bayes


  • Naive Bayes
    • Bayes' Theorem
    • Basic Method
    • Laplace Smoothing
    • Models
      • Gaussian Naive Bayes
      • Multinomial Model (MultinomialNB)
    • References


Bayes' Theorem

Bayes' theorem is a theorem about the conditional probabilities of random events A and B.

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

In Bayes' theorem, each term has a conventional name: $P(A \mid B)$ is the conditional probability of A given that B has occurred; because it is derived from the observed value of B, it is also called the posterior probability of A. $P(A)$ is the prior probability (or marginal probability) of A, and $P(B)$ is the prior or marginal probability of B.

A prior probability is a probability obtained from past experience and analysis, before any other evidence is considered. A posterior probability is the conditional probability obtained after the other evidence has been taken into account.

Wiki: In Bayesian statistics, a prior probability distribution, often simply called the prior, of an uncertain quantity is the probability distribution that would express one's beliefs about this quantity before some evidence is taken into account. The posterior probability of a random event or an uncertain proposition is the conditional probability that is assigned after the relevant evidence or background is taken into account.
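
As a quick numeric sketch of the theorem (all figures below are hypothetical), let event A be "has a disease" and event B be "tests positive":

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B), with made-up numbers.
p_a = 0.01              # prior P(A): 1% of the population has the disease
p_b_given_a = 0.99      # likelihood P(B|A): test sensitivity
p_b_given_not_a = 0.05  # false-positive rate P(B|not A)

# Marginal P(B) via the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior P(A|B): belief in A after seeing evidence B.
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # ~0.167: far below the test's sensitivity
```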

Basic Method

Let the input space $\mathcal{X} \subseteq \mathbb{R}^n$ be a set of $n$-dimensional vectors $x = (x_1, x_2, x_3, \ldots, x_n)$, let the output space $\mathcal{Y}$ be the set of $K$ class labels, i.e. $y \in \{y_1, y_2, \ldots, y_K\}$, and let the training data be $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$.

Naive Bayes learns the prior probability distribution

$$P(Y = y_k)$$

and the conditional probability distribution
$$P(X = x \mid Y = y_k) = P(X = (x_1, x_2, x_3, \ldots, x_n) \mid Y = y_k)$$

Suppose feature $x_i$ can take $S_i$ distinct values, $i = 1, 2, \ldots, n$. Then the conditional distribution $P(X = x \mid Y = y_k)$ has $K \prod_{i=1}^{n} S_i$ parameters, i.e. an exponential number of parameters.

The naive Bayes algorithm makes an independence assumption on the conditional probability distribution: the features of the different dimensions are assumed to be conditionally independent given the class.

$$P(x \mid y_k) = P(x_1, x_2, x_3, \ldots, x_n \mid y_k) = \prod_{i=1}^{n} P(x_i \mid y_k)$$

This assumption makes naive Bayes simple, at the cost of some accuracy.

The posterior probability is therefore

$$P(y_k \mid x) = \frac{P(x \mid y_k)\,P(y_k)}{P(x)} = \frac{P(x \mid y_k)\,P(y_k)}{\sum_{k=1}^{K} P(x \mid y_k)\,P(y_k)} = \frac{P(y_k) \prod_{i=1}^{n} P(x_i \mid y_k)}{\sum_{k=1}^{K} P(y_k) \prod_{i=1}^{n} P(x_i \mid y_k)}, \quad k = 1, \ldots, K$$

This is the basic formula of naive Bayes. The naive Bayes classifier is then expressed as

$$y = \arg\max_{y_k} P(y_k \mid x) = \arg\max_{y_k} \frac{P(y_k) \prod_{i=1}^{n} P(x_i \mid y_k)}{\sum_{k=1}^{K} P(y_k) \prod_{i=1}^{n} P(x_i \mid y_k)}, \quad k = 1, \ldots, K$$

Since the denominator is the same for every $k$, this simplifies to

$$y = \arg\max_{y_k} P(y_k) \prod_{i=1}^{n} P(x_i \mid y_k), \quad k = 1, \ldots, K$$
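
A minimal from-scratch sketch of this decision rule, estimating the probabilities by counting on a toy dataset (all data and names below are invented; zero counts are left unsmoothed here, which the next section addresses):

```python
from collections import Counter, defaultdict

# Toy data: two categorical features (weather, temperature) and a label.
X = [("sunny", "hot"), ("sunny", "cool"), ("rainy", "cool"), ("rainy", "hot")]
y = ["no", "yes", "yes", "no"]

class_counts = Counter(y)                 # N_{y_k}
feature_counts = defaultdict(Counter)     # N_{(x_i, y_k)}, keyed by (i, y_k)
for xs, label in zip(X, y):
    for i, v in enumerate(xs):
        feature_counts[(i, label)][v] += 1

def predict(xs):
    """Return argmax_k P(y_k) * prod_i P(x_i | y_k), estimated by counting."""
    best_label, best_score = None, -1.0
    for label, n_label in class_counts.items():
        score = n_label / len(y)                              # P(y_k)
        for i, v in enumerate(xs):
            score *= feature_counts[(i, label)][v] / n_label  # P(x_i | y_k)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict(("sunny", "cool")))  # "yes" on this toy data
```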

Laplace Smoothing

The estimated probabilities $P(x_i \mid y_k)$ and $P(y_k)$ may come out as zero. To avoid this, a positive constant $\lambda > 0$ is added to the counts; the case $\lambda = 1$ is called Laplace smoothing.

$$P(y_k) = \frac{N_{y_k} + \lambda}{N + K\lambda}$$

$$P(x_i \mid y_k) = \frac{N_{(x_i, y_k)} + \lambda}{N_{y_k} + S_i \lambda}$$

For $k = 1, \ldots, K$ and $j = 1, \ldots, S_i$, the smoothed estimates still form valid probability distributions:

$$\sum_{k=1}^{K} P(y_k) = 1, \qquad \sum_{j=1}^{S_i} P(x_{ij} \mid y_k) = 1$$

where $S_i$ is the number of distinct values that feature $x_i$ can take.
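
A small numeric sketch of the smoothed estimates with $\lambda = 1$ (all counts invented for illustration):

```python
import numpy as np

lam = 1.0
K = 2                        # number of classes
S_i = 3                      # number of values feature x_i can take
N_yk = np.array([6, 4])      # class counts N_{y_k}
N = N_yk.sum()               # total sample count N = 10

# Counts N_{(x_i = j, y_k)}: rows are classes, columns are feature values.
N_xi_yk = np.array([[3, 3, 0],    # note the zero count in class 0
                    [1, 2, 1]])

prior = (N_yk + lam) / (N + K * lam)                  # P(y_k)
cond = (N_xi_yk + lam) / (N_yk[:, None] + S_i * lam)  # P(x_i = j | y_k)

print(prior, prior.sum())        # [0.583 0.417], sums to 1
print(cond, cond.sum(axis=1))    # no zeros; each row sums to 1
```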

Models

Gaussian Naive Bayes

When dealing with continuous data, a typical assumption is that the continuous values associated with each class are distributed according to a Gaussian distribution. For example, suppose the training data contains a continuous attribute, $x$. We first segment the data by the class, and then compute the mean and variance of $x$ in each class. Let $\mu_c$ be the mean of the values in $x$ associated with class $c$, and let $\sigma_c^2$ be the variance of the values in $x$ associated with class $c$. Suppose we have collected some observation value $v$. Then, the probability distribution of $v$ given a class $c$, $p(x = v \mid c)$, can be computed by plugging $v$ into the equation for a Normal distribution parameterized by $\mu_c$ and $\sigma_c^2$. That is,

$$p(x = v \mid c) = \frac{1}{\sqrt{2\pi\sigma_c^2}}\, e^{-\frac{(v - \mu_c)^2}{2\sigma_c^2}}$$
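
A minimal sketch of this computation for a single feature, with invented training values:

```python
import numpy as np

# Per-class Gaussian likelihood p(x = v | c), using the class mean and
# variance as in the formula above. The training values are made up.
x_c = np.array([5.0, 5.9, 6.1, 6.8])   # values of feature x in class c
mu_c = x_c.mean()                      # class mean mu_c
var_c = x_c.var()                      # class variance sigma_c^2

def gaussian_likelihood(v, mu, var):
    return np.exp(-(v - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

print(gaussian_likelihood(6.0, mu_c, var_c))  # density of v = 6.0 under class c
```

scikit-learn implements this model as sklearn.naive_bayes.GaussianNB, with the same fit/predict interface as MultinomialNB below.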

Multinomial Model (MultinomialNB)

This model assumes that the probability of, say, an email follows a multinomial distribution determined by the frequency with which each word appears in it.

With a multinomial event model, samples (feature vectors) represent the frequencies with which certain events have been generated by a multinomial $(p_1, \ldots, p_n)$ where $p_i$ is the probability that event $i$ occurs (or $K$ such multinomials in the multiclass case). A feature vector $x = (x_1, \ldots, x_n)$ is then a histogram, with $x_i$ counting the number of times event $i$ was observed in a particular instance. This is the event model typically used for document classification, with events representing the occurrence of a word in a single document (see bag of words assumption). The likelihood of observing a histogram $x$ is given by

$$p(x \mid C_k) = \frac{\left(\sum_i x_i\right)!}{\prod_i x_i!} \prod_i p_{ki}^{x_i}$$
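
Evaluating this likelihood directly for one class, with made-up counts and event probabilities:

```python
import math

x = [3, 1, 0]             # histogram: event i observed x_i times
p_k = [0.5, 0.3, 0.2]     # p_ki for class C_k (made up)

coef = math.factorial(sum(x))      # (sum_i x_i)!
for xi in x:
    coef //= math.factorial(xi)    # divide by prod_i x_i!

likelihood = coef * math.prod(p ** xi for p, xi in zip(p_k, x))
print(likelihood)  # 4 * 0.5**3 * 0.3**1 * 0.2**0 = 0.15
```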

The multinomial naive Bayes classifier becomes a linear classifier when expressed in log-space:

$$\log p(C_k \mid x) \propto \log\left( p(C_k) \prod_{i=1}^{n} p_{ki}^{x_i} \right) = \log p(C_k) + \sum_{i=1}^{n} x_i \log p_{ki} = b + w_k^\top x$$

where $b = \log p(C_k)$ and $w_{ki} = \log p_{ki}$.
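
Continuing the made-up numbers above, now with two classes, the linear form in log-space:

```python
import numpy as np

# Multinomial naive Bayes as a linear classifier:
# score_k = b_k + w_k . x, with b_k = log p(C_k) and w_ki = log p_ki.
priors = np.array([0.6, 0.4])      # class priors p(C_k), made up
p = np.array([[0.5, 0.3, 0.2],     # p_ki for class 0
              [0.2, 0.2, 0.6]])    # p_ki for class 1

b = np.log(priors)                 # b_k = log p(C_k)
W = np.log(p)                      # w_ki = log p_ki

x = np.array([3, 1, 0])            # histogram of event counts
scores = b + W @ x                 # one linear score per class
print(scores.argmax())             # 0: class 0 wins on these numbers
```

The multinomial coefficient is omitted here because it is the same for every class and so does not affect the argmax.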

If a given class and feature value never occur together in the training data, then the frequency-based probability estimate will be zero. This is problematic because it will wipe out all information in the other probabilities when they are multiplied. Therefore, it is often desirable to incorporate a small-sample correction, called pseudocount, in all probability estimates such that no probability is ever set to be exactly zero. This way of regularizing naive Bayes is called Laplace smoothing when the pseudocount is one, and Lidstone smoothing in the general case.

class sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)

Parameters:
alpha : float, optional (default=1.0). The smoothing parameter $\lambda$ above.
fit_prior : boolean, optional (default=True). Whether to learn class prior probabilities or not; if False, a uniform prior is used.
class_prior : array-like, size (n_classes,), optional (default=None). Prior probabilities of the classes; if specified, the priors are not adjusted according to the data.

```python
>>> import numpy as np
>>> X = np.random.randint(5, size=(6, 100))  # X.shape = [n_samples, n_features]
>>> y = np.array([1, 2, 3, 4, 5, 6])         # y.shape = [n_samples]
>>> from sklearn.naive_bayes import MultinomialNB
>>> clf = MultinomialNB()
>>> clf.fit(X, y)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
>>> print(clf.predict(X[2:3]))
[3]
```

References

[1]: 李航, 统计学习方法 (Statistical Learning Methods)
[2]: https://en.wikipedia.org/wiki/Prior_probability Prior probability
[3]: http://www.letiantian.me/2014-10-12-three-models-of-naive-nayes/ Three common naive Bayes models: Gaussian, multinomial, Bernoulli
[4]: http://blog.csdn.net/u012162613/article/details/48323777 Naive Bayes: derivation and three common models
[5]: https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes Multinomial naive Bayes
[6]: http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf A Comparison of Event Models for Naive Bayes Text Classification
[7]: http://cs229.stanford.edu/materials.html Stanford CS229 lecture notes 2
