Implementing the Chi-square Feature Selection Algorithm in C++

Source: Internet | Published by: stc8f单片机 | Editor: 程序博客网 | Time: 2024/05/19 18:45

Author: finallyliuyu (please credit the original author and source when reposting)

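Text classification cannot do without a feature (term) selection module. Feature selection is a key step in reducing the dimensionality of the feature space.

First, here is pseudocode for a general feature selection module:

[Figure: general framework of a feature selection algorithm]

(This figure is taken from C.D. Manning, Introduction to Information Retrieval, p. 251 of the original edition or p. 188 of Wang Bin's translation.)

Only two points are noted here; for everything else, please consult the book.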

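1. The pseudocode above works on one class at a time: it selects the top k feature terms for that class according to some utility measure (such as IG or chi-square). ComputeFeatureUtility(D,t,c) in the pseudocode is what computes that measure.

2. For a given classification problem, how do we select the complete set of feature terms?

There are many ways; here is just one. Suppose there are N classes and K feature terms are needed in total; then K/N feature terms are selected for each class.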
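
The framework described in the two points above can be sketched in C++ as follows. This is only an illustration of the general pattern, not code from this post: SelectFeatures, UtilityFn and the lambda-style utility callback are hypothetical names, and a real utility measure would also need access to the document collection D. The per-class quota K/N from point 2 would simply be passed in as k.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical signature for the utility measure A(t, c).
using UtilityFn = std::function<double(const std::string& term,
                                       const std::string& classLabel)>;

// Returns the k terms with the largest utility for one class.
std::vector<std::string> SelectFeatures(const std::vector<std::string>& vocabulary,
                                        const std::string& classLabel,
                                        std::size_t k,
                                        const UtilityFn& computeFeatureUtility)
{
    // Score every term in the vocabulary for this class.
    std::vector<std::pair<double, std::string> > scored;
    scored.reserve(vocabulary.size());
    for (const std::string& term : vocabulary)
        scored.emplace_back(computeFeatureUtility(term, classLabel), term);

    // Never ask for more terms than exist.
    k = std::min(k, scored.size());

    // Only the first k positions need to be in order.
    std::partial_sort(scored.begin(), scored.begin() + k, scored.end(),
                      [](const std::pair<double, std::string>& a,
                         const std::pair<double, std::string>& b) {
                          return a.first > b.first;
                      });

    std::vector<std::string> selected;
    for (std::size_t i = 0; i < k; ++i)
        selected.push_back(scored[i].second);
    return selected;
}
```

Calling SelectFeatures once per class and taking the union of the results gives the full feature set; because the same term can rank highly for several classes, the union may contain fewer than K terms.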

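The chi-square formula follows (same source: p. 256 of the original edition, p. 192 of Wang Bin's translation):

The formula above and the formula below are equivalent: the one above can be derived from the one below. In a computer implementation we normally use the one above.

Both formulas construct a test statistic that follows a chi-square distribution. (In mathematical statistics, chi-square is commonly used to test whether two events are independent: when the observed counts exactly match what independence predicts, chi-square = 0, and larger values are stronger evidence of dependence. For background, see the chapters on hypothesis testing in a mathematical statistics textbook.)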
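
For reference, the two equivalent forms can be written out as below. This is a reconstruction from the standard definition of the statistic and from the expression evaluated in CalChiSquareValue later in this post, using the N11, N10, N01, N00 notation defined in the next section; the first line is the definition in terms of observed and expected counts, the second is the closed form.

```latex
X^2(D,t,c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}}
             \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}

X^2(D,t,c) = \frac{(N_{11}+N_{10}+N_{01}+N_{00}) \, (N_{11}N_{00} - N_{10}N_{01})^2}
                  {(N_{11}+N_{01}) \, (N_{11}+N_{10}) \, (N_{10}+N_{00}) \, (N_{01}+N_{00})}
```

Here E is the count expected if the term and the class were independent; for example, E_{11} = (N_{11}+N_{10})(N_{11}+N_{01})/N, where N is the total number of documents.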

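Now for the concrete implementation of chi-square feature selection.

The mainstream definition of the contingency table.

For a given term t and class c:

N11: the number of articles in class c that contain t;

N10: the number of articles that contain t but do not belong to class c;

N01: the number of articles in class c that do not contain t;

N00: the number of articles in the training corpus that neither contain t nor belong to class c.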
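
To make the four cells concrete, here is a minimal sketch that counts them directly from a tiny in-memory corpus. The Doc and Contingency types and the CountCells function are hypothetical and exist only for this illustration; the actual implementation later in the post works from a dictionary of postings instead.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical minimal document type: a class label plus the set of terms it contains.
struct Doc {
    std::string classLabel;
    std::set<std::string> terms;
};

// The four cells of the contingency table for one (term, class) pair.
struct Contingency {
    int n11 = 0;  // in class, contains term
    int n10 = 0;  // not in class, contains term
    int n01 = 0;  // in class, does not contain term
    int n00 = 0;  // not in class, does not contain term
};

Contingency CountCells(const std::vector<Doc>& docs,
                       const std::string& term,
                       const std::string& classLabel)
{
    Contingency t;
    for (const Doc& d : docs) {
        const bool hasTerm = d.terms.count(term) > 0;
        const bool inClass = (d.classLabel == classLabel);
        if (hasTerm && inClass)        ++t.n11;
        else if (hasTerm && !inClass)  ++t.n10;
        else if (!hasTerm && inClass)  ++t.n01;
        else                           ++t.n00;
    }
    return t;
}
```

Every document falls into exactly one cell, so the four counts always sum to the size of the corpus.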

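Before giving the implementation, here is a passage that is helpful when thinking about the program:

[Image: contingencyTable]

(Same source, p. 257)

This passage suggests a data structure that records, for one term, how many articles in each class do and do not contain it. With n classes, the structure holds one row per class for that term, and each row stores N11 and N01. In my code I also call this structure a contingency table. It differs slightly from the mainstream definition, but once N11 and N01 are known, the mainstream contingency table is easy to recover from the other data structures in the program.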
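
As a sketch of that last remark: given the stored pair (N11, N01), the number of articles that contain the term, and the total number of articles, the remaining two cells follow by subtraction. FullTable and ExpandCompactEntry are hypothetical names used only for this illustration.

```cpp
#include <utility>

// The mainstream four-cell table for one (term, class) pair.
struct FullTable {
    double n11, n10, n01, n00;
};

// compact   : (N11, N01) as stored per (term, class)
// docFreq   : number of training articles that contain the term
// totalDocs : number of training articles overall
FullTable ExpandCompactEntry(const std::pair<int, int>& compact,
                             int docFreq, int totalDocs)
{
    FullTable t;
    t.n11 = compact.first;
    t.n01 = compact.second;
    t.n10 = docFreq - compact.first;                 // contain the term, outside the class
    t.n00 = totalDocs - docFreq - compact.second;    // everything that is left
    return t;
}
```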

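The implementation follows. (If the code for a function used here is not shown, please see the corresponding function in "K-means文本聚类系列（已经完成）", my completed K-means text clustering series.)

The main data structures:

1. Dictionary: stores how many times each term occurs in each article of the training corpus. Type: map<string,vector<pair<int,int>> >. Each pair holds an article id and the term's count in that article.

2. Contingency table (purpose described above). Type: map<pair<string,string>,pair<int,int> >

The map key has two parts: the first string is the term and the second string is the class. In the value, the first int is N11 and the second int is N01.

The function that builds the contingency table: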

/************************************************************************/
/* Build the contingency table for every term.
 * Key:   (term text, class label)
 * Value: pair<int,int>; the first int is the number of articles in class c
 *        that contain term t, the second int is the number of articles in
 *        class c that do not contain term t
 */
/************************************************************************/
map<pair<string,string>,pair<int,int> > Preprocess::GetContingencyTable(map<string,vector<pair<int,int> > > &mymap, vector<string> classLabels)
{
    clock_t start,finish;
    double totaltime;
    start=clock();
    map<string,vector<int> > articleIdsEachClass=GetArticleIdinEachClass(classLabels);
    map<pair<string,string>,pair<int,int> > EntireContigencytable;
    // for every term in the bag-of-words model
    for(map<string,vector<pair<int,int> > >::iterator it=mymap.begin();it!=mymap.end();++it)
    {
        // skip empty and single-space terms; both tests must hold, hence &&
        if(it->first!=""&&it->first!=" ")
        {
            // for every class
            for(map<string,vector<int> >::iterator it1=articleIdsEachClass.begin();it1!=articleIdsEachClass.end();++it1)
            {
                int cntTheClass=(it1->second).size(); // total number of articles in this class
                int termInTheClass=0; // number of articles in this class that contain the term
                for(vector<pair<int,int> >::iterator it2=(it->second).begin();it2!=(it->second).end();++it2)
                {
                    // it2->first is an article id; count() scans the class's whole id list
                    termInTheClass+=count((it1->second).begin(),(it1->second).end(),it2->first);
                }
                int termAbsentInTheClass=cntTheClass-termInTheClass;
                pair<string,string> compoundKey=make_pair(it->first,it1->first);
                pair<int,int> valueInfo=make_pair(termInTheClass,termAbsentInTheClass);
                EntireContigencytable[compoundKey]=valueInfo;
            }
        }
    }
    finish=clock();
    totaltime=(double)(finish-start)/CLOCKS_PER_SEC;
    cout<<"Time to build the contingency table: "<<totaltime<<endl;

    return EntireContigencytable;
}
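
The function above calls count() over a class's article-id list once for every posting of every term, so its cost grows with the class size on each lookup. A possible alternative, sketched below, inverts the class-to-articles map once and then touches each posting exactly once. This is not the author's code: BuildContingencyTableFast and the two typedefs are hypothetical, and the sketch assumes that every article belongs to at most one class and appears at most once in a term's posting list.

```cpp
#include <map>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// term -> list of (article id, count in that article)
typedef std::map<std::string, std::vector<std::pair<int, int> > > Dictionary;
// (term, class label) -> (N11, N01)
typedef std::map<std::pair<std::string, std::string>, std::pair<int, int> > ContingencyTable;

ContingencyTable BuildContingencyTableFast(
    const Dictionary& dict,
    const std::map<std::string, std::vector<int> >& articleIdsEachClass)
{
    // Invert class -> article ids into article id -> class, once.
    std::unordered_map<int, std::string> classOfArticle;
    for (const auto& cls : articleIdsEachClass)
        for (int id : cls.second)
            classOfArticle[id] = cls.first;

    ContingencyTable table;
    for (const auto& entry : dict) {
        // Start every class at N11 = 0 and N01 = class size.
        for (const auto& cls : articleIdsEachClass)
            table[std::make_pair(entry.first, cls.first)] =
                std::make_pair(0, static_cast<int>(cls.second.size()));

        // Each posting moves one article from "absent" to "present".
        for (const auto& posting : entry.second) {
            auto found = classOfArticle.find(posting.first);
            if (found == classOfArticle.end())
                continue;  // article does not belong to any listed class
            std::pair<int, int>& cell = table[std::make_pair(entry.first, found->second)];
            ++cell.first;
            --cell.second;
        }
    }
    return table;
}
```

The result has the same key and value layout as the table built above, so it could be passed to the same downstream functions.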

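Building the contingency table takes far longer than saving a finished table to disk and reading it back into memory when needed (on my machine, building the contingency table took 233.41 sec, while loading it from disk into memory took 0.954 sec). So here are functions for serializing and deserializing the contingency table: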

/************************************************************************/
/* Save the contingency table to local disk                             */
/************************************************************************/
void Preprocess::SaveContingencyTable(map<pair<string,string>,pair<int,int> >&contingencyTable)
{
    // one record per line: term, class label, N11, N01, separated by spaces
    ofstream outfile("F:\\Cluster\\contingency.dat",ios::binary);
    for(map<pair<string,string>, pair<int,int> >::iterator it=contingencyTable.begin();it!=contingencyTable.end();++it)
    {
        outfile<<(it->first).first<<" "<<(it->first).second<<" "<<(it->second).first<<" "<<(it->second).second<<endl;
    }
    outfile.close();
}
/************************************************************************/
/* Load the contingency table from disk into memory                     */
/************************************************************************/
void Preprocess::LoadContingencyTable(map<pair<string,string>,pair<int,int> >&contingencyTable)
{
    clock_t start,finish;
    double totaltime;
    start=clock();
    ifstream infile("F:\\Cluster\\contingency.dat",ios::binary);
    string termtext="";
    string classLabel="";
    int presentNum=0; // number of articles under classLabel that contain the term (repeats within an article are not counted)
    int absentNum=0;  // number of articles under classLabel that do not contain the term
    // test the extraction itself; a loop on !infile.eof() runs its body once more after the last record
    while(infile>>termtext>>classLabel>>presentNum>>absentNum)
    {
        pair<string, string> compoundKey=make_pair(termtext,classLabel);
        pair<int,int> valinfo=make_pair(presentNum,absentNum);
        contingencyTable[compoundKey]=valinfo;
    }
    infile.close();
    finish=clock();
    totaltime=(double)(finish-start)/CLOCKS_PER_SEC;
    cout<<"Time to load the contingency table into memory: "<<totaltime<<endl;
}

The function that computes the chi-square value:

/************************************************************************/
/* Compute the chi-square value                                         */
/************************************************************************/
double Preprocess::CalChiSquareValue(double N11,double N10,double N01,double N00)
{
    double denominator=(N11+N01)*(N11+N10)*(N10+N00)*(N01+N00);
    // an empty row or column (for example a term found in no article) would divide by zero;
    // return 0 so that such a term ranks last
    if(denominator==0)
    {
        return 0;
    }
    double chiSquare=(N11+N10+N01+N00)*pow((N11*N00-N10*N01),2)/denominator;
    return chiSquare;
}

For one class, compute the chi-square value of every term and sort the terms by chi-square value from high to low:

/************************************************************************/
/* Compute the chi-square value of every term in the bag of words with respect to one class */
/************************************************************************/
vector<pair<string,double> > Preprocess::ChiSquareFeatureSelectionForPerclass(map<string,vector<pair<int,int> > >&mymap,map<pair<string,string>,pair<int,int> > &contingencyTable,string classLabel)
{
    int N=endIndex-beginIndex+1; // total number of articles
    vector<string> tempvector; // all terms in the bag of words
    vector<pair<string,double> > chisquareInfo;
    for(map<string,vector<pair<int,int> > >::iterator it=mymap.begin();it!=mymap.end();++it)
    {
        tempvector.push_back(it->first);
    }
    // compute the chi-square values
    for(vector<string>::iterator ittmp=tempvector.begin();ittmp!=tempvector.end();++ittmp)
    {
        int N1=mymap[*ittmp].size(); // number of articles that contain the term
        pair<string,string> compoundKey=make_pair(*ittmp,classLabel);
        // look the entry up with find() so that a missing key is skipped instead of being inserted as (0,0)
        map<pair<string,string>,pair<int,int> >::iterator entry=contingencyTable.find(compoundKey);
        if(entry==contingencyTable.end())
        {
            continue;
        }
        double N11=double(entry->second.first);
        double N01=double(entry->second.second);
        double N10=double(N1-N11);   // articles outside the class that contain the term
        double N00=double(N-N1-N01); // articles outside the class that do not contain the term
        double chiValue=CalChiSquareValue(N11,N10,N01,N00);
        chisquareInfo.push_back(make_pair(*ittmp,chiValue));
    }
    // sort the terms by chi-square value, largest first
    stable_sort(chisquareInfo.begin(),chisquareInfo.end(),isLarger);
    /*ofstream outfile("F:\\Cluster\\other.dat");
    int finalKeyWordsCount=0;
    for(vector<pair<string,double> >::size_type j=0;j<chisquareInfo.size();j++)
    {
        outfile<<chisquareInfo[j].first<<";"<<chisquareInfo[j].second<<endl;
        finalKeyWordsCount++;
    }
    outfile.close();*/

    return chisquareInfo;
}
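
The comparator isLarger passed to stable_sort is not shown in this post. A plausible definition, given how it is used, is sketched below; this is an assumption about its behavior, not the author's actual code, and RankByScore is a hypothetical helper added only to make the sketch easy to try.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical comparator: orders (term, score) pairs by score, largest first.
// Using strict '>' (not '>=') keeps it a valid strict weak ordering for std::stable_sort.
bool isLarger(const std::pair<std::string, double>& a,
              const std::pair<std::string, double>& b)
{
    return a.second > b.second;
}

// Returns a copy of the scores sorted from highest to lowest.
std::vector<std::pair<std::string, double> > RankByScore(
    std::vector<std::pair<std::string, double> > scores)
{
    std::stable_sort(scores.begin(), scores.end(), isLarger);
    return scores;
}
```

Because stable_sort preserves the input order of equal elements, terms with the same chi-square value stay in the alphabetical order in which the map produced them.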

Chi-square feature selection for the whole classification problem. In this example there are three classes.

/************************************************************************/
/* Chi-square feature selection                                         */
/************************************************************************/
void Preprocess::ChiSquareFeatureSelection(map<string,vector<pair<int,int> > > &mymap,map<pair<string,string>,pair<int,int> > &contingencyTable,int N)
{
    clock_t start,finish;
    double totaltime;
    start=clock();
    // per-class weights, hard-coded for this corpus
    int N1=18693;
    int N2=23822;
    int N3=15717;
    // N is the total number of feature terms wanted; each class gets a share proportional to its weight
    int threshold1=N1*N/(N1+N2+N3);
    int threshold2=N2*N/(N1+N2+N3);
    int threshold3=N3*N/(N1+N2+N3);
    string classlabel1="xxxx";
    string classlabel2="yyyy";
    string classlabel3="zzzz";
    vector<string> classLabels;
    classLabels.push_back("xxxx");
    classLabels.push_back("yyyy");
    classLabels.push_back("zzzz");
    vector<pair<string,double> > chisquareInfo1;
    vector<pair<string,double> > chisquareInfo2;
    vector<pair<string,double> > chisquareInfo3;
    // each result is already sorted by chi-square value, largest first
    chisquareInfo1=ChiSquareFeatureSelectionForPerclass(mymap,contingencyTable,classlabel1);
    chisquareInfo2=ChiSquareFeatureSelectionForPerclass(mymap,contingencyTable,classlabel2);
    chisquareInfo3=ChiSquareFeatureSelectionForPerclass(mymap,contingencyTable,classlabel3);

    //stable_sort(chisquareInfo2.begin(),chisquareInfo2.end(),isLarger);
    //stable_sort(chisquareInfo3.begin(),chisquareInfo3.end(),isLarger);
    cout<<"finish ChiSquare Calculation"<<endl;
    // a set removes terms that are selected for more than one class
    set<string> finalKeywords;
    // take the top-ranked terms of each class, never reading past the end of a ranking
    for(vector<pair<string,double> >::size_type j=0;j<threshold1&&j<chisquareInfo1.size();j++)
    {
        finalKeywords.insert(chisquareInfo1[j].first);
    }
    for(vector<pair<string,double> >::size_type j=0;j<threshold2&&j<chisquareInfo2.size();j++)
    {
        finalKeywords.insert(chisquareInfo2[j].first);
    }
    // the third class uses its own quota, threshold3
    for(vector<pair<string,double> >::size_type j=0;j<threshold3&&j<chisquareInfo3.size();j++)
    {
        finalKeywords.insert(chisquareInfo3[j].first);
    }
    ofstream outfile(featurewordsAddress);
    int finalKeyWordsCount=finalKeywords.size();
    for(set<string>::iterator it=finalKeywords.begin();it!=finalKeywords.end();++it)
    {
        outfile<<*it<<endl;
    }
    outfile.close();
    cout<<"Number of feature terms finally selected: "<<finalKeyWordsCount<<endl;
    finish=clock();
    totaltime=(double)(finish-start)/CLOCKS_PER_SEC;
    cout<<"Time spent selecting feature terms: "<<totaltime<<endl;
}
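
The function above is written for exactly three classes with hard-coded weights. The same idea can be sketched for any number of classes, as below. This is an illustration rather than the author's code: SelectAcrossClasses and RankedTerms are hypothetical names, and the sketch assumes that the hard-coded numbers N1, N2, N3 play the role of per-class document counts and that each ranking is already sorted with the best term first.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// (term, chi-square value) pairs for one class, sorted with the best term first.
typedef std::vector<std::pair<std::string, double> > RankedTerms;

// rankedPerClass : class label -> ranked terms for that class
// classSizes     : class label -> weight of the class (assumed: its document count)
// totalFeatures  : overall feature budget, shared in proportion to the weights
std::set<std::string> SelectAcrossClasses(
    const std::map<std::string, RankedTerms>& rankedPerClass,
    const std::map<std::string, std::size_t>& classSizes,
    std::size_t totalFeatures)
{
    std::size_t totalDocs = 0;
    for (const auto& c : classSizes)
        totalDocs += c.second;

    std::set<std::string> selected;
    if (totalDocs == 0)
        return selected;

    for (const auto& c : rankedPerClass) {
        auto size = classSizes.find(c.first);
        if (size == classSizes.end())
            continue;  // no weight known for this class
        // Quota proportional to the class weight, clamped to the terms available.
        std::size_t quota = size->second * totalFeatures / totalDocs;
        quota = std::min(quota, c.second.size());
        for (std::size_t j = 0; j < quota; ++j)
            selected.insert(c.second[j].first);
    }
    return selected;
}
```

Because of integer rounding in the quotas and because the same term can be selected for several classes, the resulting set may contain fewer than totalFeatures terms.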