Naive Bayesian Classification


This implementation follows the naive Bayesian classifier described in Data Mining: Concepts and Techniques, 3rd edition (Jiawei Han). The classification process can be broken into four steps (only a brief outline is given here; see the book for the detailed formulas, and see the worked example after the list below):

(1) Build the matrix of training tuples and the corresponding class labels, keeping the two aligned.

(2) Compute the prior probability P(Ci) of each class (the maximum a posteriori comparison itself happens in step 4).

(3) Compute the conditional probability of each attribute value given each class.

(4) Predict the class label by choosing the class with the largest posterior probability.
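As a concrete illustration of steps (2)-(4): take the training data from TrainingSet.txt below (14 tuples) and a new tuple X = (age=youth, income=medium, student=yes, creditrating=fair). Reading the counts off the training set by hand gives the short sketch below; the variable names are only for illustration and do not appear in the code.

% Class priors from the 14 training tuples
P_yes = 9/14;                        % P(buys-computer = yes)
P_no  = 5/14;                        % P(buys-computer = no)
% Class-conditional probabilities of X, read off the training counts
P_X_yes = (2/9)*(4/9)*(6/9)*(6/9);   % P(X | yes), about 0.044
P_X_no  = (3/5)*(2/5)*(1/5)*(2/5);   % P(X | no),  about 0.019
% Unnormalized posteriors; the larger one gives the predicted label
post_yes = P_X_yes * P_yes           % about 0.028
post_no  = P_X_no  * P_no            % about 0.007, so predict buys-computer = yes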


Main program of the naive Bayesian classification algorithm:
clc;clear;
%%%% First step: construct the probability tree. The training tuple data is
%%%% loaded inside ConstructProbability.m; note that if you want to classify
%%%% a different case, you must train on the corresponding data first.
% 'result' contains the probability tree, the attribute list and the class attribute name
result=ConstructProbability();
PT=result{1,1};
attributeList=result{1,2};
classAttr=result{1,3};

%%%% Second step: load the tuples to classify
% read tuple file
fileID = fopen('D:\matlabFile\NaiveBayesian\NaiveBaysian.txt');
% read as strings
D=textscan(fileID,'%s %s %s %s');
fclose(fileID);

%%%% Third step: make the decision
conclusion=cell(1,1);
% get attribute names from the attribute list (header row of the output)
for i=1:size(attributeList,1)
    conclusion{1,i}=attributeList{i,1};
end
% classify every tuple after the header row of the input file
if size(D{1,1},1)>1
    for i=2:size(D{1,1},1)
        tuple=conclusion(1,:);
        for j=1:size(D,2)
            tuple{2,j}=D{1,j}{i,1};
        end
        decision=ErgodicPT(PT,attributeList,tuple);
        tuple{2,j+1}=decision;
        conclusion(size(conclusion,1)+1,:)=tuple(2,:);
    end
end
% write the header plus the predictions to conclusion.txt
FID=fopen('conclusion.txt','wt');
for i=1:size(conclusion,1)
    for j=1:size(conclusion,2)
        fprintf(FID, '%s ', conclusion{i,j});
    end
    fprintf(FID,'\n');
end
fclose(FID);
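Once the hard-coded file paths above match your machine, running the script writes the header row plus one prediction row per test tuple to conclusion.txt in the current folder. A minimal check might look like the sketch below (assuming the script finished without errors):

% Read back conclusion.txt: header row plus one prediction row per tuple
fid = fopen('conclusion.txt');
C = textscan(fid, '%s %s %s %s %s');
fclose(fid);
% The fifth column holds the predicted buys-computer label; skip the header
predictions = C{1,5}(2:end)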
Implementation of the ConstructProbability function:
% construct the probability tree from the training set
function result=ConstructProbability()
    % read training tuple file
    fileID = fopen('D:\matlabFile\NaiveBayesian\TrainingSet.txt');
    % read as strings
    Dataset=textscan(fileID,'%s %s %s %s %s');
    fclose(fileID);
    % appoint the class attribute
    classA='buys-computer';
    attrs={0,0};
    % remember the class attribute id
    id=0;
    % build the attribute list and encode the string values as integers
    for i=1:size(Dataset,2)
        % find the class attribute
        if strcmp(classA,Dataset{1,i}{1,1})==1
            id=i;
        end
        attrs{i,1}=Dataset{1,i}{1,1};
        % initialize the value dictionary of this attribute
        attr=cell(1,1);
        for j=2:size(Dataset{1,i},1)
            % check whether this attribute value has been seen before
            flag_attr=0;
            for k=1:size(attr,1)
                if strcmp(attr{k,1},Dataset{1,i}{j,1})
                    Dataset{1,i}{j,1}=k-1;
                    flag_attr=1;
                    break;
                end
            end
            % if the value is new, add it to the dictionary
            if flag_attr==0
                attr{k+1,1}=Dataset{1,i}{j,1};
                Dataset{1,i}{j,1}=k;
            end
        end
        attr(1,:)=[];
        % add the value dictionary to the attribute list
        attrs{i,2}=attr;
        Dataset{1,i}(1,:)=[];
    end
    % create a new matrix
    DS=zeros(size(Dataset{1,1},1),1);
    % convert cell to matrix
    for i=1:size(Dataset,2)
        DataTemp=cell2mat(Dataset{1,i});
        DS=cat(2,DS,DataTemp);
    end
    DS(:,1)=[];
    % shift the columns so that the last column is the class attribute
    DS=circshift(DS,[0,size(DS,2)-id]);
    % adjust the attribute list so that the class attribute is at the last position
    p_temp=attrs(id,:);
    attrs(id,:)=[];
    attrs(size(attrs,1)+1,:)=p_temp;
    % compute the probabilities of all attribute values conditioned on the class
    rows=unique(DS(:,size(DS,2)),'rows');
    % sort the values so that they map to the attribute list order
    rows=sortrows(rows);
    ProbabilityTree=cell(1,2);
    for i=1:size(rows,1)
        D=DS;
        r=find(DS(:,size(DS,2))~=rows(i,1));
        D(r,:)=[];
        % prior probability of this class value
        ProbabilityTree{i,1}=size(D,1)/size(DS,1);
        % add a node to the probability tree
        node=cell(1,1);
        % compute the probability of every value of every attribute
        for j=1:size(D,2)-1
            % use a separate variable here so that the outer 'rows'
            % (the class values) is not overwritten inside this loop
            vals=unique(D(:,j),'rows');
            subNode=cell(1,2);
            % sort the values
            vals=sortrows(vals);
            for k=1:size(vals,1)
                subD=D;
                subNode{k,1}=vals(k,1);
                r=find(D(:,j)~=vals(k,1));
                subD(r,:)=[];
                subNode{k,2}=size(subD,1)/size(D,1);
            end
            node{j,1}=subNode;
        end
        ProbabilityTree{i,2}=node;
    end
    result={ProbabilityTree,attrs,classA};
end
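To make the nested cell structure returned above easier to follow, the fragment below is a quick inspection sketch (assuming TrainingSet.txt has been saved at the hard-coded path). With the training data below, 'no' is the first class value encountered, so it should land in the first row of the tree.

result = ConstructProbability();
PT = result{1,1};              % probability "tree": one row per class value
attributeList = result{1,2};   % attribute names plus their value dictionaries
PT{1,1}                        % prior of the first class value ('no' here), should be 5/14
PT{2,1}                        % prior of the second class value ('yes'), should be 9/14
% Conditional table of attribute 1 (age) given the first class:
% each row is {integer value code, P(age = value | class)}
PT{1,2}{1,1}
attributeList{1,2}             % decodes the integer codes back to 'youth','middleaged','senior'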
The ErgodicPT function is implemented as follows:
function result=ErgodicPT(PT,attributeList,tuple)
    % translate the tuple attribute values into their integer codes
    t=zeros(1,1);
    for i=1:size(tuple,2)
        for j=1:size(attributeList{i,2},1)
            if strcmp(attributeList{i,2}{j,1},tuple{2,i})
                t(1,i)=j;
                break;
            end
        end
    end
    % compute the (unnormalized) posterior probability of each class
    r=zeros(1,2);
    for i=1:size(PT,1)
        r(i,1)=i;
        R=1;
        for j=1:size(t,2)
            flag=0;
            for k=1:size(PT{i,2}{j,1},1)
                if PT{i,2}{j,1}{k,1}==t(1,j)
                    R=R*PT{i,2}{j,1}{k,2};
                    flag=1;
                    break;
                end
            end
            % value never seen with this class: probability is zero
            if flag==0
                R=0;
            end
        end
        % multiply by the class prior
        R=R*PT{i,1};
        r(i,2)=R;
    end
    % pick the class with the largest posterior
    r=sortrows(r,-2);
    result=attributeList{size(attributeList,1),2}{r(1,1),1};
end
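A minimal single-tuple call might look like the sketch below. The tuple is a 2-row cell with attribute names on the first row and string values on the second, matching what the main script builds; the example values are the same tuple as in the worked example near the top of the post.

result = ConstructProbability();
PT = result{1,1};
attributeList = result{1,2};
% 2-row cell: attribute names on top, string values underneath
tuple = {'age','income','student','creditrating';
         'youth','medium','yes','fair'};
decision = ErgodicPT(PT, attributeList, tuple)
% With the training data below this should come out as 'yes'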
Training data format for TrainingSet.txt; copy it and save it as a .txt file:
age income student creditrating buys-computer
youth high no fair no
youth high no excellent no
middleaged high no fair yes
senior medium no fair yes
senior low yes fair yes
senior low yes excellent no
middleaged low yes excellent yes
youth medium no fair no
youth low yes fair yes
senior medium yes fair yes
youth medium yes excellent yes
middleaged medium no excellent yes
middleaged high yes fair yes
senior medium no excellent no
Data to be classified, NaiveBaysian.txt; copy it and save it as a .txt file:
age income student creditrating
youth high no fair
youth high no excellent
middleaged high no fair
senior medium no fair
senior low yes fair
senior low yes excellent
middleaged low yes excellent
youth medium no fair
youth low yes fair
senior medium yes fair
youth medium yes excellent
middleaged medium no excellent
middleaged high yes fair
senior medium no excellent
Classification results, for reference:
age income student creditrating buys-computer
youth high no fair no
youth high no excellent no
middleaged high no fair yes
senior medium no fair yes
senior low yes fair yes
senior low yes excellent yes
middleaged low yes excellent yes
youth medium no fair no
youth low yes fair yes
senior medium yes fair yes
youth medium yes excellent yes
middleaged medium no excellent yes
middleaged high yes fair yes
senior medium no excellent no






