(1) Naive Bayes Learning (2) - Binary Features


(1) Understanding the Formulas - Maximum Likelihood

Assume the data points X_1, X_2, ..., X_N are i.i.d., with probability density p(x_i | θ).

The likelihood function is therefore defined as:

L(θ) = ∏_{i=1..N} p(x_i | θ)

The maximum likelihood estimate is the θ that makes this largest:

θ̂_MLE = argmax_θ L(θ)

Below are a few likelihood examples; deriving them yourself will make the later material clearer (lecture slides, available for download later).
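Since the slide examples are not reproduced here, one standard worked example (a Bernoulli / coin-flip likelihood; my reconstruction, not taken from the original slides) goes as follows. Suppose x_1, ..., x_N ∈ {0, 1} are i.i.d. Bernoulli(θ) and N_1 = Σ_i x_i is the number of ones. Then

L(θ) = ∏_i θ^(x_i) (1 - θ)^(1 - x_i) = θ^(N_1) (1 - θ)^(N - N_1)

log L(θ) = N_1 log θ + (N - N_1) log(1 - θ)

Setting d log L / dθ = N_1/θ - (N - N_1)/(1 - θ) = 0 gives

θ̂_MLE = N_1 / N

which is exactly the counting estimate used in the training code below.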







(2) Understanding the Formulas - Naive Bayes
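The formula images from the original post are missing; reconstructed to match the code in section (4), the Bernoulli naive Bayes model used here is:

p(y = c | x) ∝ π_c ∏_{j=1..D} θ_jc^(x_j) (1 - θ_jc)^(1 - x_j)

where π_c = p(y = c) is the class prior and θ_jc = p(x_j = 1 | y = c). The "naive" assumption is that the D binary features are conditionally independent given the class. The maximum likelihood estimates are simple counts,

π̂_c = N_c / N        (fraction of training samples in class c)
θ̂_jc = N_jc / N_c    (fraction of class-c samples with feature j switched on)

and prediction picks the class with the largest log posterior:

ŷ = argmax_c [ log π_c + Σ_j ( x_j log θ_jc + (1 - x_j) log(1 - θ_jc) ) ]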


(3) Algorithm Implementation

In outline (the original pseudocode figure is missing; this summary follows the code in the next section): training counts how often each class occurs and how often each feature is on within each class, then turns those counts into the estimates π̂_c and θ̂_jc; prediction evaluates the log posterior of every class for each test point, normalizes with logsumexp, and takes the argmax.

(4) C++ Implementation

Because the features are binary, the implementation is fairly simple; it consists mainly of a train step and a predict step.

Training function: NaiveBayesTrain

// binary features, two classes (C = 2)
// xtrain     - N*D, N is the number of samples, D is the feature dimension
// ytrain     - N*1, labels in {0, 1}
// pic_of_one - the estimate of the class prior pi_c
// theta      - D*C, theta[j][c] is the estimate of p(x_j = 1 | y = c)
void NaiveBayesTrain(vector<vector<int> > &xtrain, vector<int> &ytrain,
                     double pic_of_one[], double **theta)
{
    double number_c = 0.0;                                      // number of samples with label 1
    double number_jc[DIMENSION_NUMBER][CLASS_NUMBER] = {0.0};   // count of x_j = 1 within each class

    for (int i = 0; i < DATA_NUMBER; i++)
    {
        int c = ytrain[i];                  // class of this sample (0 or 1)
        if (c == 1)
        {
            number_c = number_c + 1.0;
        }
        for (int j = 0; j < DIMENSION_NUMBER; j++)
        {
            if (xtrain[i][j] == 1)
            {
                number_jc[j][c] = number_jc[j][c] + 1.0;
            }
        }
    }

    // class priors: pi_0 = N_0 / N, pi_1 = N_1 / N
    pic_of_one[0] = (DATA_NUMBER - number_c) / DATA_NUMBER;
    pic_of_one[1] = number_c / DATA_NUMBER;

    // theta[j][c] = N_jc / N_c, the MLE of p(x_j = 1 | y = c)
    for (int emi_theta_row = 0; emi_theta_row < DIMENSION_NUMBER; emi_theta_row++)
    {
        for (int emi_theta_col = 0; emi_theta_col < CLASS_NUMBER; emi_theta_col++)
        {
            if (emi_theta_col == 1)
                theta[emi_theta_row][emi_theta_col] = number_jc[emi_theta_row][emi_theta_col] / number_c;
            else
                theta[emi_theta_row][emi_theta_col] = number_jc[emi_theta_row][emi_theta_col] / (DATA_NUMBER - number_c);
        }
    }
}
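One design note (not in the original post): with pure MLE, the estimates theta[j][c] can come out exactly 0 or 1, and NaiveBayesPredict then evaluates log(0). A common remedy is add-one (Laplace) smoothing, which coincides with the Beta(2,2) MAP estimate discussed in section (7). A possible variant of the final loop above, offered as a sketch rather than part of the original code:

// optional add-one (Laplace) smoothing: theta[j][c] = (N_jc + 1) / (N_c + 2)
// keeps every estimate strictly inside (0, 1), so log(theta) and
// log(1 - theta) in NaiveBayesPredict stay finite
for (int j = 0; j < DIMENSION_NUMBER; j++)
{
    theta[j][1] = (number_jc[j][1] + 1.0) / (number_c + 2.0);
    theta[j][0] = (number_jc[j][0] + 1.0) / ((DATA_NUMBER - number_c) + 2.0);
}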

Prediction function: NaiveBayesPredict

// use the trained parameters to predict the test data
void NaiveBayesPredict(vector<vector<int> > &xtest, vector<int> &ytest,
                       double pic_of_one[], double **theta)
{
    // likehood_pic[i][c] holds log pi_c + log p(x_i | y = c)
    double likehood_pic[DATA_TEST_NUMBER][CLASS_NUMBER];
    // probability_ic[i][c] holds the normalized posterior p(y = c | x_i)
    double **probability_ic = new double *[DATA_TEST_NUMBER];
    for (int i = 0; i < DATA_TEST_NUMBER; i++)
    {
        probability_ic[i] = new double[CLASS_NUMBER];
    }

    for (int i = 0; i < DATA_TEST_NUMBER; i++)
    {
        for (int c = 0; c < CLASS_NUMBER; c++)
        {
            likehood_pic[i][c] = log(pic_of_one[c]);
            for (int j = 0; j < DIMENSION_NUMBER; j++)
            {
                if (xtest[i][j] == 1)
                {
                    likehood_pic[i][c] = likehood_pic[i][c] + log(theta[j][c]);
                }
                else
                {
                    likehood_pic[i][c] = likehood_pic[i][c] + log(1.0 - theta[j][c]);
                }
            }
        }
        ytest[i] = 0;
        // normalize in log space: p(y = c | x_i) = exp(b_ic - logsumexp(b_i))
        for (int c = 0; c < CLASS_NUMBER; c++)
        {
            probability_ic[i][c] = exp(likehood_pic[i][c] - LogSumExp(likehood_pic[i]));
        }
    }
    // pick the argmax class for every test sample
    ArgMaxClass(ytest, probability_ic, DATA_TEST_NUMBER);

    for (int i = 0; i < DATA_TEST_NUMBER; i++)
    {
        delete[] probability_ic[i];
    }
    delete[] probability_ic;
}
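The functions above rely on compile-time constants (DATA_NUMBER, DATA_TEST_NUMBER, DIMENSION_NUMBER, CLASS_NUMBER) and on an ArgMaxClass helper that are not shown in this post. A minimal driver sketch under those assumptions (the tiny hand-made dataset and the assumed ArgMaxClass signature are mine, purely for illustration) could look like this:

#include <vector>
#include <cmath>
#include <cstdio>
using namespace std;

// assumed sizes, matching the toy data below (not from the original post);
// in a single file these #defines must appear before the function definitions above
#define DATA_NUMBER       4
#define DATA_TEST_NUMBER  2
#define DIMENSION_NUMBER  3
#define CLASS_NUMBER      2

// defined earlier in this post
void NaiveBayesTrain(vector<vector<int> > &xtrain, vector<int> &ytrain,
                     double pic_of_one[], double **theta);
void NaiveBayesPredict(vector<vector<int> > &xtest, vector<int> &ytest,
                       double pic_of_one[], double **theta);
// assumed helper: sets ytest[i] to argmax_c prob[i][c]
void ArgMaxClass(vector<int> &ytest, double **prob, int n);

int main()
{
    // toy binary data (C++11 brace initialization): rows are samples, columns are features
    vector<vector<int> > xtrain = { {1,0,1}, {1,1,1}, {0,0,1}, {0,1,0} };
    vector<int> ytrain = { 1, 1, 0, 0 };

    double pic_of_one[CLASS_NUMBER];
    double **theta = new double *[DIMENSION_NUMBER];
    for (int j = 0; j < DIMENSION_NUMBER; j++)
        theta[j] = new double[CLASS_NUMBER];

    NaiveBayesTrain(xtrain, ytrain, pic_of_one, theta);

    vector<vector<int> > xtest = { {1,0,1}, {0,1,0} };
    vector<int> ytest(DATA_TEST_NUMBER, 0);
    NaiveBayesPredict(xtest, ytest, pic_of_one, theta);

    for (int i = 0; i < DATA_TEST_NUMBER; i++)
        printf("test sample %d -> predicted class %d\n", i, ytest[i]);

    for (int j = 0; j < DIMENSION_NUMBER; j++)
        delete[] theta[j];
    delete[] theta;
    return 0;
}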

(5) Handling Numerical Underflow and Overflow: logsumexp

The numerical underflow/overflow problem also comes up when working with HMMs, whenever we need to compute an expression of the form:

log Σ_c exp(b_c)

When the b_c are extremely small (very negative), the direct computation easily underflows.

When the b_c are very large, logsumexp likewise prevents overflow.

For example, when computing the classification posterior (or, in an HMM, the posterior over the hidden state h_t):

p(y = c | x_i) = exp(b_ic) / Σ_{c'} exp(b_ic'),   where b_ic = log p(x_i | y = c) + log π_c

These quantities (the alphas, in the HMM case) can be extremely small, so we solve the problem by working in log space.

The trick is usually written as:

log Σ_c e^(b_c) = B + log Σ_c e^(b_c - B),   where B = max_c b_c

So the operation that keeps the computation numerically accurate is just a max operation plus a sum of scaled-down exponentials.

The implementation is as follows:

// When p(x|y = c) is very small, the computation can fail due to numerical
// underflow (and, for very large values, overflow); the log-sum-exp trick
// is the standard solution.
// It effectively divides all of the arguments by the largest one, then adds
// its log back in at the end, so it is well behaved when adding a large
// number of similarly-scaled values; errors only creep in if some arguments
// are many orders of magnitude larger than others.
double LogSumExp(double *likehood_pic_each_row)
{
    // start from the first element rather than 0.0, since all the
    // log-likelihoods are typically negative
    double max_data = likehood_pic_each_row[0];
    double sum_exp = 0.0;
    for (int i = 1; i < CLASS_NUMBER; i++)
    {
        if (likehood_pic_each_row[i] > max_data)
            max_data = likehood_pic_each_row[i];
    }
    for (int i = 0; i < CLASS_NUMBER; i++)
    {
        sum_exp = exp(likehood_pic_each_row[i] - max_data) + sum_exp;
    }
    return (log(sum_exp) + max_data);
}
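As a quick numeric sanity check of LogSumExp (the numbers are hypothetical, chosen to force underflow): with b = (-1000, -1001), exp(-1000) and exp(-1001) both underflow to 0 in double precision, so the naive log(sum of exps) returns log(0) = -inf, while the trick gives

log( e^(-1000) + e^(-1001) ) = -1000 + log( 1 + e^(-1) ) ≈ -999.687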

(6) When Features Are Not Independent: Feature Selection and Mutual Information

Training with many features easily leads to overfitting, so here we perform feature selection to remove "irrelevant" features and classify better.

The simplest approach is to evaluate the relevance of each feature separately and then take the top K most relevant ones.

One way to do this is with mutual information, here measured between feature Xj and the class label Y:

I(X_j; Y) = Σ_{x_j} Σ_y p(x_j, y) log [ p(x_j, y) / ( p(x_j) p(y) ) ]

Mutual information can be thought of as the reduction in entropy of the label distribution once we observe the value of feature j.

For binary features it can be computed as:

I_j = Σ_c [ θ_jc π_c log(θ_jc / θ_j) + (1 - θ_jc) π_c log( (1 - θ_jc) / (1 - θ_j) ) ]

where θ_j = Σ_c π_c θ_jc is the overall probability that feature j is on, θ_jc = p(x_j = 1 | y = c), and π_c = p(y = c).

In a classification problem, mutual information measures how much a feature contributes toward making the correct class decision.

And if a feature's distribution is the same in every class as in the collection as a whole (θ_jc = θ_j for all c), both log terms vanish and the mutual information I(U, C) = 0.

The implementation is as follows:

double SumThetajcForEachClass(double *theta_jc, double *pic_of_one);

// Training with many features may overfit, so we perform feature selection
// to remove "irrelevant" features that do not help much with the class problem:
// evaluate the relevance of each feature separately, then take the top K,
// where K is chosen by the user.
// One way to measure relevance is the mutual information between feature Xj
// and the class label Y; it can be thought of as the reduction in entropy of
// the label distribution once we observe the value of feature j.
// MI measures how much the presence/absence of a term contributes to making
// the correct classification decision on C. If a term's distribution is the
// same in the class as in the collection as a whole, then I(U;C) = 0.
void MutualInformation(double pic_of_one[], double **theta, double mutu_info[])
{
    // for each dimension, compute the mutual information used to rank the features
    for (int j = 0; j < DIMENSION_NUMBER; j++)
    {
        // theta_j = sum_c pi_c * theta_jc, the overall probability that feature j is on
        double theta_j = SumThetajcForEachClass(theta[j], pic_of_one);
        if (theta_j == 0)
            theta_j = theta_j + 0.01;
        else if (theta_j == 1)
            theta_j = theta_j - 0.01;

        double sum_per_dimen = 0.0;
        for (int c = 0; c < CLASS_NUMBER; c++)
        {
            sum_per_dimen = sum_per_dimen + theta[j][c] * pic_of_one[c] * log(theta[j][c] / theta_j);
            sum_per_dimen = sum_per_dimen + (1.0 - theta[j][c]) * pic_of_one[c] * log((1.0 - theta[j][c]) / (1 - theta_j));
        }
        mutu_info[j] = sum_per_dimen;
    }
}

double SumThetajcForEachClass(double *theta_jc, double *pic_of_one)
{
    double theta_j = 0;
    for (int c = 0; c < CLASS_NUMBER; c++)
    {
        theta_j = theta_jc[c] * pic_of_one[c] + theta_j;
    }
    return theta_j;
}
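The post takes the top K most relevant features but does not show that step; a small sketch of it (SelectTopKFeatures is my own helper name, not from the original code) could be:

#include <vector>
#include <algorithm>
using namespace std;

// Rank feature indices by their mutual information score (descending)
// and return the indices of the K most relevant features.
// mutu_info is the array filled by MutualInformation above; K is user-chosen.
vector<int> SelectTopKFeatures(const double mutu_info[], int dimension_number, int K)
{
    vector<int> indices(dimension_number);
    for (int j = 0; j < dimension_number; j++)
        indices[j] = j;

    sort(indices.begin(), indices.end(),
         [&](int a, int b) { return mutu_info[a] > mutu_info[b]; });

    if (K > dimension_number)
        K = dimension_number;
    return vector<int>(indices.begin(), indices.begin() + K);
}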

(7) MLE vs MAP (Maximum A Posteriori)

Bayes' rule:

p(θ | X) = p(X | θ) p(θ) / p(X)

The idea is to treat the parameter θ as a random variable: through Bayes' rule, a prior p(θ) on the parameter is turned into a posterior p(θ | X) by way of the likelihood function p(X | θ).

The maximum a posteriori (MAP) estimate is defined as:

θ_MAP = argmax_θ p(θ | X)

Since p(X) does not depend on θ, this is the same as maximizing p(X | θ) p(θ).

The idea is simple: just add a prior on top of the MLE. Its real usefulness, however, only shows up in practical problems.

Example:

Suppose we want to predict the next US election.

The data are collected by polling a golf club on Wall Street: we ask 100 people, 7 say they will vote Democrat and 93 say Republican (a Bernoulli model, just like flipping a coin). Using maximum likelihood to estimate the proportion voting Democrat gives:

θ̂_MLE = 7 / 100 = 0.07

This clearly does not match reality: our prior experience is that the Democrat share is around 50% (0.5), roughly half the votes. How do we make the estimate incorporate that experience (the prior)? This is where MAP estimation lets us combine the prior with the data.

For the Bernoulli likelihood, the Beta distribution is the appropriate prior. The Beta distribution is a family of continuous distributions on [0, 1] whose shape is adjusted by two positive parameters a and b:

Beta(θ | a, b) ∝ θ^(a - 1) (1 - θ)^(b - 1)


Returning to MAP, the estimate we want is:

θ_MAP = argmax_θ p(X | θ) p(θ)

The terms correspond as follows: the likelihood is the Bernoulli term p(X | θ) = θ^(N1) (1 - θ)^(N - N1), where N1 is the number of Democrat answers, and the prior is the Beta term above.

As a function of θ this gives:

p(X | θ) p(θ) ∝ θ^(N1 + a - 1) (1 - θ)^(N - N1 + b - 1)

For the detailed derivation, see the reference "MLE vs MAP" below.

The final estimate is:

θ_MAP = (N1 + a - 1) / (N + a + b - 2)

MAP summary: a and b here act as important pseudo-counts, and the larger their values, the more the prior influences the final estimate.


Looking again at the 100-person election poll, we have N = 100 and 93 Republican answers (so N1 = 7 Democrat answers).

We assume a prior of 50% support for the Democrats and 50% for the Republicans, i.e. a symmetric prior with a = b (the specific values used in the original post are not shown). Different choices of a and b express different strengths of the prior.
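To make the effect of the pseudo-counts concrete (the values of a and b below are hypothetical, chosen only for illustration):

a = b = 2:    θ_MAP = (7 + 1)   / (100 + 2)    = 8/102    ≈ 0.078   (almost pure MLE)
a = b = 100:  θ_MAP = (7 + 99)  / (100 + 198)  = 106/298  ≈ 0.356
a = b = 500:  θ_MAP = (7 + 499) / (100 + 998)  = 506/1098 ≈ 0.461   (close to the 0.5 prior)

The stronger the prior (larger a = b), the more the estimate is pulled from the data's 0.07 toward the prior belief of 0.5.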




MAP estimation has many applications in later algorithms: HMM, EM, LDA, and so on.


(8) Parts Not Yet Implemented

1. The case where the features are not binary and Gaussians are used for parameter estimation (a rough sketch of the idea follows below);

2. Multi-class, multi-feature problems.
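As a starting point for item 1 (only a sketch of the idea, since the original post leaves it unimplemented): for real-valued features, each p(x_j | y = c) is modeled as a Gaussian, and training reduces to estimating a per-class mean and variance for every feature. The function name and signature below are mine.

#include <vector>
#include <cmath>
using namespace std;

// Per-class Gaussian parameters for one real-valued feature dimension.
struct GaussianParam {
    double mean;
    double var;
};

// Estimate, for each feature j and class c, the mean and variance of the
// feature values observed in that class (the MLE for a Gaussian likelihood).
// xtrain: N*D real-valued features; ytrain: N labels in [0, class_number).
vector<vector<GaussianParam> > GaussianNaiveBayesTrain(
        const vector<vector<double> > &xtrain,
        const vector<int> &ytrain,
        int class_number)
{
    int n = xtrain.size();
    int d = xtrain[0].size();
    vector<vector<GaussianParam> > params(d, vector<GaussianParam>(class_number));
    vector<double> count(class_number, 0.0);

    // accumulate sums and squared sums per (feature, class)
    vector<vector<double> > sum(d, vector<double>(class_number, 0.0));
    vector<vector<double> > sq(d, vector<double>(class_number, 0.0));
    for (int i = 0; i < n; i++) {
        int c = ytrain[i];
        count[c] += 1.0;
        for (int j = 0; j < d; j++) {
            sum[j][c] += xtrain[i][j];
            sq[j][c]  += xtrain[i][j] * xtrain[i][j];
        }
    }
    for (int j = 0; j < d; j++) {
        for (int c = 0; c < class_number; c++) {
            double mean = sum[j][c] / count[c];
            double var  = sq[j][c] / count[c] - mean * mean;  // MLE variance
            if (var < 1e-9) var = 1e-9;                       // avoid a zero variance
            params[j][c].mean = mean;
            params[j][c].var  = var;
        }
    }
    return params;
}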




References:

1> Machine Learning: A Probabilistic Perspective (textbook);

2> Lecture slides on maximum likelihood - MLE, naive Bayes (available for download);

3> logsumexp - The log-sum-exp trick: http://machineintelligence.tumblr.com/post/4998477107/the-log-sum-exp-trick;

4> Mutual information: http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html;

5> MAP - MLE vs MAP; Example of maximum a posteriori estimation: http://stats.stackexchange.com/questions/65212/example-of-maximum-a-posteriori-estimation.



The complete code for the naive Bayes project implemented above: NaiveBayes














