Naive Bayes Learning (2): Binary Features

(1) Understanding the formulas: maximum likelihood

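Assume the data points $X_1, X_2, \dots, X_n$ are i.i.d. with PDF $p(x \mid \theta)$, so that the likelihood is

$$L(\theta) = \prod_{i=1}^{n} p(X_i \mid \theta)$$

The maximum likelihood estimate is the value of $\theta$ that maximizes $L(\theta)$:

$$\hat{\theta}_{MLE} = \arg\max_{\theta} L(\theta)$$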

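Below are some likelihood examples. Deriving them yourself makes the later reasoning clearer (they come from course slides, which will be available for download later).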
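One such example, written out as a sketch: the Bernoulli (coin-flip) likelihood. This derivation is standard and is not copied from the slides; the symbols $x_i \in \{0, 1\}$ and $N_1$ (the number of ones) are introduced here for illustration.

```latex
L(\theta) = \prod_{i=1}^{n} \theta^{x_i} (1-\theta)^{1-x_i}
          = \theta^{N_1} (1-\theta)^{n-N_1},
\qquad N_1 = \sum_{i=1}^{n} x_i

\log L(\theta) = N_1 \log\theta + (n - N_1)\log(1-\theta)

\frac{d}{d\theta}\log L(\theta)
  = \frac{N_1}{\theta} - \frac{n - N_1}{1-\theta} = 0
\quad\Longrightarrow\quad
\hat{\theta}_{MLE} = \frac{N_1}{n}
```

So the maximum likelihood estimate is simply the observed fraction of ones.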


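(3) Algorithm implementation

(4) C++ implementation

Because the features are binary, the implementation is fairly simple. It consists mainly of a train step and a predict step.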
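The two steps can be summarized by the following formulas. This is a sketch of the standard binary-feature naive Bayes model (as presented in the textbook listed in the references), with $\pi_c$ and $\theta_{jc}$ named to match `pic_of_one` and `theta` in the code. The counts $N$, $N_c$ and $N_{jc}$ are notation assumed here.

```latex
% Training: maximum likelihood estimates from counts
\hat{\pi}_c = \frac{N_c}{N},
\qquad
\hat{\theta}_{jc} = \frac{N_{jc}}{N_c}
% N     = number of training samples
% N_c   = number of samples with y_i = c
% N_jc  = number of samples with y_i = c and x_{ij} = 1

% Prediction: unnormalized log posterior of class c for a test point x
L_c = \log \pi_c + \sum_{j=1}^{D}
      \left[ x_j \log\theta_{jc} + (1 - x_j)\log(1-\theta_{jc}) \right]

% Normalized posterior, computed in log space
p(y = c \mid x) = \exp\left( L_c - \log\sum_{c'} \exp(L_{c'}) \right)
```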


#include <vector>
using std::vector;

// Binary features, two classes (C = 2).
// DATA_NUMBER, DIMENSION_NUMBER and CLASS_NUMBER are compile-time constants defined elsewhere.
// xtrain     - N*D, N is the number of samples, D is the dimension
// ytrain     - N*1, class labels in {0, 1}
// pic_of_one - the estimate of the class priors pi_c
// theta      - D*C, theta[j][c] is the estimate of p(x_j = 1 | y = c)
void NaiveBayesTrain(vector<vector<int> > &xtrain, vector<int> &ytrain, double pic_of_one[], double **theta)
{
    double number_c[CLASS_NUMBER] = {0.0};                       // samples per class
    double number_jc[DIMENSION_NUMBER][CLASS_NUMBER] = {{0.0}};  // samples of class c with x_j == 1

    for (int i = 0; i < DATA_NUMBER; i++)
    {
        int c = ytrain[i];
        number_c[c] = number_c[c] + 1.0;
        for (int j = 0; j < DIMENSION_NUMBER; j++)
        {
            if (xtrain[i][j] == 1)
            {
                number_jc[j][c] = number_jc[j][c] + 1.0;
            }
        }
    }

    // Class priors: pi_c = N_c / N
    for (int c = 0; c < CLASS_NUMBER; c++)
    {
        pic_of_one[c] = number_c[c] / DATA_NUMBER;
    }

    // MLE: theta[j][c] = N_jc / N_c. Assumes every class occurs in the training set.
    // theta can be exactly 0 or 1, which leads to log(0) at prediction time.
    for (int emi_theta_row = 0; emi_theta_row < DIMENSION_NUMBER; emi_theta_row++)
    {
        for (int emi_theta_col = 0; emi_theta_col < CLASS_NUMBER; emi_theta_col++)
        {
            theta[emi_theta_row][emi_theta_col] = number_jc[emi_theta_row][emi_theta_col] / number_c[emi_theta_col];
        }
    }
}


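#include <cmath>

double LogSumExp(double *likehood_pic_each_row);
// ArgMaxClass is defined elsewhere in the project; it is called here with the
// same arguments as in the original listing.

// Use the trained parameters to predict the labels of the test data.
void NaiveBayesPredict(vector<vector<int> > &xtest, vector<int> &ytest, double pic_of_one[], double **theta)
{
    double likehood_pic[DATA_TEST_NUMBER][CLASS_NUMBER];
    double **probability_ic = new double *[DATA_TEST_NUMBER];
    for (int i = 0; i < DATA_TEST_NUMBER; i++)
    {
        probability_ic[i] = new double[CLASS_NUMBER];
    }

    for (int i = 0; i < DATA_TEST_NUMBER; i++)
    {
        for (int c = 0; c < CLASS_NUMBER; c++)
        {
            // log p(y = c) + sum_j log p(x_j | y = c)
            likehood_pic[i][c] = log(pic_of_one[c]);
            for (int j = 0; j < DIMENSION_NUMBER; j++)
            {
                if (xtest[i][j] == 1)
                {
                    likehood_pic[i][c] = likehood_pic[i][c] + log(theta[j][c]);
                }
                else
                {
                    likehood_pic[i][c] = likehood_pic[i][c] + log(1.0 - theta[j][c]);
                }
            }
        }

        ytest[i] = 0;
        // Normalize in log space to obtain the posterior p(y = c | x_i).
        double log_norm = LogSumExp(likehood_pic[i]);
        for (int c = 0; c < CLASS_NUMBER; c++)
        {
            probability_ic[i][c] = exp(likehood_pic[i][c] - log_norm);
        }
    }

    // Pick the most probable class once every row of posteriors is filled in.
    ArgMaxClass(ytest, probability_ic, DATA_TEST_NUMBER);

    for (int i = 0; i < DATA_TEST_NUMBER; i++)
    {
        delete[] probability_ic[i];
    }
    delete[] probability_ic;
}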
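The listings above depend on constants and helpers defined elsewhere, so here is a compact, self-contained sketch of the same train and predict steps using `std::vector`. The names `BernoulliNB`, `Train` and `Predict` are hypothetical, and the add-one smoothing of `theta` is an assumption added here (the original code uses the plain MLE) so that no estimate is exactly 0 or 1.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the fitted parameters (not from the original post).
struct BernoulliNB {
    std::vector<double> prior;               // prior[c]    ~ p(y = c)
    std::vector<std::vector<double>> theta;  // theta[j][c] ~ p(x_j = 1 | y = c)
};

// Fit by counting. Add-one smoothing is an assumption made in this sketch;
// it keeps every theta strictly between 0 and 1.
BernoulliNB Train(const std::vector<std::vector<int>>& x,
                  const std::vector<int>& y, int num_classes) {
    assert(!x.empty() && x.size() == y.size());
    const std::size_t n = x.size();
    const std::size_t d = x[0].size();
    std::vector<double> count_c(num_classes, 0.0);
    std::vector<std::vector<double>> count_jc(d, std::vector<double>(num_classes, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        count_c[y[i]] += 1.0;
        for (std::size_t j = 0; j < d; ++j) {
            if (x[i][j] == 1) count_jc[j][y[i]] += 1.0;
        }
    }
    BernoulliNB model;
    model.prior.resize(num_classes);
    model.theta.assign(d, std::vector<double>(num_classes, 0.0));
    for (int c = 0; c < num_classes; ++c) {
        model.prior[c] = count_c[c] / static_cast<double>(n);
        for (std::size_t j = 0; j < d; ++j) {
            model.theta[j][c] = (count_jc[j][c] + 1.0) / (count_c[c] + 2.0);
        }
    }
    return model;
}

// Return the class with the highest log joint score
// log p(y = c) + sum_j log p(x_j | y = c).
int Predict(const BernoulliNB& model, const std::vector<int>& x) {
    int best = 0;
    double best_score = -INFINITY;
    for (std::size_t c = 0; c < model.prior.size(); ++c) {
        double score = std::log(model.prior[c]);
        for (std::size_t j = 0; j < x.size(); ++j) {
            score += std::log(x[j] == 1 ? model.theta[j][c] : 1.0 - model.theta[j][c]);
        }
        if (score > best_score) {
            best_score = score;
            best = static_cast<int>(c);
        }
    }
    return best;
}
```

Since only the arg max is needed, the sketch skips normalization. Normalizing with log-sum-exp, as the original predict function does, is required only when the posterior probabilities themselves are wanted.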


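Numerical overflow and underflow also come up when working with HMMs. They arise whenever we need to compute an expression of the form:

$$\log \sum_{i} \exp(a_i)$$

When the $a_i$ are extremely small (large negative log-probabilities), each $\exp(a_i)$ rounds to zero and the computation easily underflows.

When the $a_i$ are very large, the log-sum-exp trick also prevents overflow: subtract the maximum $m = \max_i a_i$ before exponentiating and add it back afterwards,

$$\log \sum_{i} \exp(a_i) = m + \log \sum_{i} \exp(a_i - m)$$

For example, when computing the classification posterior over $h_t$, the alphas can be extremely small, so the computation is carried out in log space.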






#include <cmath>

// When p(x | y = c) is very small, computing with raw probabilities runs into
// numerical underflow (and overflow for large values); this is the remedy.
// If you want it to run without crashing when given 0 arguments, you'll have to
// add a case for that :)
// How it works: this effectively divides all of the arguments by the largest,
// then adds its log back in at the end to avoid overflow, so it is well-behaved
// for adding a large number of similarly-scaled values, with errors creeping in
// if some arguments are many orders of magnitude larger than others.
double LogSumExp(double *likehood_pic_each_row)
{
    // Start from the first element: log-probabilities are negative, so 0.0
    // would never be replaced and nothing would be subtracted.
    double max_data = likehood_pic_each_row[0];
    double sum_exp = 0.0;
    for (int i = 1; i < CLASS_NUMBER; i++)
    {
        if (likehood_pic_each_row[i] > max_data)
            max_data = likehood_pic_each_row[i];
    }
    for (int i = 0; i < CLASS_NUMBER; i++)
    {
        sum_exp = exp(likehood_pic_each_row[i] - max_data) + sum_exp;
    }
    return (log(sum_exp) + max_data);
}
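To see why the shift by the maximum matters, the following sketch compares a direct evaluation with the shifted one. The function names `NaiveLogSumExp` and `StableLogSumExp` are hypothetical, and the inputs mentioned afterwards are illustrative values rather than outputs of the classifier.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Direct evaluation of log(sum_i exp(a_i)); exp() underflows to 0 for very
// negative inputs and overflows to infinity for very large ones.
double NaiveLogSumExp(const std::vector<double>& a) {
    double sum = 0.0;
    for (double v : a) sum += std::exp(v);
    return std::log(sum);
}

// Shifted evaluation: subtract the maximum before exponentiating and add it
// back afterwards, so the largest term is always exp(0) = 1.
double StableLogSumExp(const std::vector<double>& a) {
    assert(!a.empty());
    const double m = *std::max_element(a.begin(), a.end());
    double sum = 0.0;
    for (double v : a) sum += std::exp(v - m);
    return m + std::log(sum);
}
```

With IEEE double precision, `NaiveLogSumExp({-1000.0, -1001.0})` evaluates `log(0)` and returns negative infinity, whereas `StableLogSumExp` returns roughly -999.687. For `{1000.0, 1000.0}` the direct version overflows to infinity, while the shifted one returns `1000 + log(2)`.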


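Training uses many features, so the model can easily overfit. Feature selection removes "irrelevant" features and leads to better classification.

The simplest approach is to evaluate the relevance of each feature separately and then keep the top K most relevant ones.

One measure of relevance is the mutual information between feature $X_j$ and the class label $Y$:

$$I(X_j, Y) = \sum_{x_j} \sum_{y} p(x_j, y) \log \frac{p(x_j, y)}{p(x_j)\, p(y)}$$

Mutual information can be thought of as a reduction in entropy: the reduction in entropy on the label distribution once we observe the value of feature j.

For binary features, it can be computed as follows:

$$I_j = \sum_{c} \left[ \theta_{jc}\, \pi_c \log \frac{\theta_{jc}}{\theta_j} + (1 - \theta_{jc})\, \pi_c \log \frac{1 - \theta_{jc}}{1 - \theta_j} \right]$$

where $\pi_c = p(y = c)$, $\theta_{jc} = p(x_j = 1 \mid y = c)$ and $\theta_j = p(x_j = 1) = \sum_c \pi_c \theta_{jc}$.

If a term's distribution is the same in each class as it is in the collection as a whole, then its mutual information is I(U; C) = 0.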


#include <cmath>

double SumThetajcForEachClass(double *theta_jc, double *pic_of_one);

// The model may be overfitting, so we perform feature selection to remove
// "irrelevant" features that do not help much with the classification problem:
// evaluate the relevance of each feature separately, and then take the top K,
// where K needs to be chosen.
// One way to measure relevance is the mutual information between feature Xj and
// the class label Y. It can be thought of as the reduction in entropy on the
// label distribution once we observe the value of feature j.
// MI measures how much information the presence/absence of a term contributes to
// making the correct classification decision on C. If a term's distribution is
// the same in the class as it is in the collection as a whole, then I(U;C) = 0.
void MutualInformation(double pic_of_one[], double **theta, double mutu_info[])
{
    // For each dimension, compute the mutual information used to rank the features.
    for (int j = 0; j < DIMENSION_NUMBER; j++)
    {
        // theta_j = p(x_j = 1), marginalized over the classes
        double theta_j = SumThetajcForEachClass(theta[j], pic_of_one);
        if (theta_j == 0)
        {
            theta_j = theta_j + 0.01;
        }
        else if (theta_j == 1)
        {
            theta_j = theta_j - 0.01;
        }

        double sum_per_dimen = 0.0;
        for (int c = 0; c < CLASS_NUMBER; c++)
        {
            double p_on = theta[j][c];
            double p_off = 1.0 - theta[j][c];
            // By convention 0 * log(0) = 0, so zero-probability terms are skipped.
            if (p_on > 0.0)
                sum_per_dimen = sum_per_dimen + p_on * pic_of_one[c] * log(p_on / theta_j);
            if (p_off > 0.0)
                sum_per_dimen = sum_per_dimen + p_off * pic_of_one[c] * log(p_off / (1.0 - theta_j));
        }
        mutu_info[j] = sum_per_dimen;
    }
}

// theta_j = sum_c pi_c * theta_jc
double SumThetajcForEachClass(double *theta_jc, double *pic_of_one)
{
    double theta_j = 0;
    for (int c = 0; c < CLASS_NUMBER; c++)
    {
        theta_j = theta_jc[c] * pic_of_one[c] + theta_j;
    }
    return theta_j;
}
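Once `mutu_info` has been filled in, the remaining step is to keep the K highest-scoring features. A minimal sketch of that selection follows. The function name `TopKFeatures` is hypothetical, and it takes the scores as a `std::vector<double>` rather than the raw array used above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Return the indices of the k features with the largest mutual information,
// ordered from most to least relevant.
std::vector<int> TopKFeatures(const std::vector<double>& mutu_info, std::size_t k) {
    assert(k <= mutu_info.size());
    std::vector<int> index(mutu_info.size());
    std::iota(index.begin(), index.end(), 0);  // 0, 1, ..., D-1
    std::partial_sort(index.begin(),
                      index.begin() + static_cast<std::ptrdiff_t>(k),
                      index.end(),
                      [&mutu_info](int a, int b) { return mutu_info[a] > mutu_info[b]; });
    index.resize(k);
    return index;
}
```

`std::partial_sort` orders only the first k positions, which is enough here and avoids sorting all D features. The order among features with identical scores is unspecified.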

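(7) MLE vs MAP (maximum a posteriori)

MAP treats $\theta$ as a random variable. By Bayes' rule, the prior $p(\theta)$ over the parameter is turned into a posterior probability through the likelihood function:

$$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)}$$

Since $p(X)$ does not depend on $\theta$, this amounts to maximizing

$$\hat{\theta}_{MAP} = \arg\max_{\theta} p(X \mid \theta)\, p(\theta)$$

The idea is simple: MAP just adds a prior on top of MLE. Its usefulness only becomes apparent on real problems.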



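The data come from a poll taken at a Wall Street golf club: 100 people were asked whom they support, 7 answered Democrat and 93 answered Republican (this is a Bernoulli model, equivalent to flipping a coin). Estimating the share of Democratic voters by maximum likelihood gives:

$$\hat{\theta}_{MLE} = \frac{7}{100} = 0.07$$

For a Bernoulli likelihood, the Beta distribution is a suitable prior. The Beta distribution is a family of continuous distributions defined on [0, 1], whose shape is controlled by two positive parameters, $\alpha$ and $\beta$.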

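Returning to MAP, the estimate to compute is:

$$\hat{\theta}_{MAP} = \arg\max_{\theta} p(X \mid \theta)\, p(\theta) = \frac{n_d + \alpha - 1}{n + \alpha + \beta - 2}$$

where $n$ is the number of people polled, $n_d$ is the number who support the Democrats, and $\alpha$, $\beta$ are the parameters of the Beta prior.

For the detailed derivation, see the reference "MLE vs MAP" listed below.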


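MAP summary: $\alpha$ and $\beta$ act as pseudo-counts, and the larger their values, the more weight the prior carries in the final estimate.

Looking again at the poll of 100 voters for the US presidential election, we have $n = 100$ and $n_r = 93$ (so $n_d = n - n_r = 7$).

Suppose the prior says that 50% support the Democrats and 50% support the Republicans, which gives $\alpha = \beta$. Different magnitudes of $\alpha$ and $\beta$ then express different strengths of that prior.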
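To make the effect of the pseudo-counts concrete, here is a small sketch that evaluates the MAP estimate for a Bernoulli likelihood with a Beta($\alpha$, $\beta$) prior, using the standard posterior-mode formula $(k + \alpha - 1) / (n + \alpha + \beta - 2)$ for $k$ successes in $n$ trials. The function names `BernoulliMle` and `BetaBernoulliMap` are hypothetical, and the prior strengths used afterwards are illustrative choices, not values from the referenced article.

```cpp
#include <cassert>

// Maximum likelihood estimate: observed fraction of successes.
double BernoulliMle(double k, double n) {
    assert(n > 0.0);
    return k / n;
}

// MAP estimate under a Beta(alpha, beta) prior: the mode of the posterior
// Beta(alpha + k, beta + n - k). The mode formula is used here only when
// both posterior parameters are at least 1.
double BetaBernoulliMap(double k, double n, double alpha, double beta) {
    assert(alpha + k >= 1.0 && beta + (n - k) >= 1.0);
    assert(n + alpha + beta > 2.0);
    return (k + alpha - 1.0) / (n + alpha + beta - 2.0);
}
```

For the poll above (k = 7 Democratic answers out of n = 100), a flat prior with $\alpha = \beta = 1$ reproduces the MLE of 0.07. A symmetric prior with $\alpha = \beta = 50$ gives 56/198, about 0.283, and larger symmetric values pull the estimate further toward 0.5.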

MAP estimation is used in many of the algorithms covered later, such as HMM, EM and LDA.

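(8) Parts not yet implemented

1. Classification with non-binary features, where the parameters are estimated with a Gaussian;

2. Problems with multiple classes and multiple feature types.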


1> Machine Learning: A Probabilistic Perspective (textbook);

2> Maximum likelihood course slides: MLE, naive Bayes (available for download);

3> logsumexp: The log-sum-exp trick: http://machineintelligence.tumblr.com/post/4998477107/the-log-sum-exp-trick;

4> Mutual information: http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html;

5> MAP: MLE vs MAP; Example of maximum a posteriori estimation: http://stats.stackexchange.com/questions/65212/example-of-maximum-a-posteriori-estimation .


