[Programming Pearls] Chapter 15, Strings: Counting the Words in a Text (C++ map, a C-style struct hash table, and POJ 2418)


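Problem: build a list of the words contained in a document and count how many times each word occurs. Two approaches are used:

Solution 1: use the STL map in C++ to record each word and its number of occurrences.

Solution 2: use a C-style struct: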

typedef struct Node {
    char* word;
    int count;
    struct Node* next;
} tNode;

The input file, SelfIntroduction.txt, is a simple self-introduction written in English:

Hello  every one  My name is wahaha  I'm a 15 years old boy  I live in the beautiful city of Rizhao  I'm an active  lovely and clever boy  In the school   my favourite subject is maths  Perhaps someone thinks it's difficult to study well  But I like it  I belive that if you try your best  everything can be done well  I also like sports very much  Such as running volleyball and so on  I'm kind-hearted  If you need help  please come to me  I hope we can be good friends  OK This is me  A sunny boy  Hello everybody  My name is Stone  I come from Guangdong province in China  I am very happy to come here to study with you  When I arrived at this school three days ago  I fell in love with it  It is so beautiful and exciting here  and everyone is kind to me especially Kim  This class feels just like one big family to me  I'm interested in sports  music and mountain climbing  I hope I can become your friend soon  Thank you very much

The code, word.cc:

#include <algorithm>
#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;

typedef pair<string, int> PAIR;

/* Orders (word, count) pairs by count, highest first. */
struct CmpByValue {
    bool operator()(const PAIR& lhs, const PAIR& rhs) const {
        return lhs.second > rhs.second;
    }
};

int main() {
    ifstream fin("SelfIntroduction.txt");
    ofstream fout("result.txt");
    if (!fin || !fout) {
        cout << "fail to open the file" << endl;
        return 1;
    }

    string word;
    map<string, int> map_word;

    /* Insert each word into map_word and add 1 to its counter
       (operator[] initializes a new counter to 0). */
    while (fin >> word) {
        map_word[word]++;
    }

    /* Copy the key/value pairs from map_word into vec_word. */
    vector<PAIR> vec_word(map_word.begin(), map_word.end());

    /* Sort vec_word with CmpByValue. */
    sort(vec_word.begin(), vec_word.end(), CmpByValue());

    /* Print only the 20 most frequent words (fewer if the text is short). */
    for (size_t i = 0; i < vec_word.size() && i < 20; i++) {
        fout << vec_word[i].first << " " << vec_word[i].second << endl;
    }

    /* Print the number of distinct words. */
    fout << "Number of Words : " << vec_word.size() << endl;

    fin.close();
    fout.close();
    return 0;
}

Output, result.txt:

I 11
is 6
to 6
and 5
you 4
in 4
I'm 4
me 4
boy 3
like 3
can 3
very 3
come 3
beautiful 2
be 2
school 2
so 2
sports 2
one 2
name 2
Number of Words : 117

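This code is straightforward, concise, and runs fast. To reduce the processing time further, we can build our own hash table, in which each node holds a pointer to the word, the word's occurrence count, and a pointer to the next node in the table.

Implementation with a C-style struct:

Code: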

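#include <cstring>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

typedef struct HashNode {
    char* word;
    int count;
    struct HashNode* next;
} tHashNode;

const int NHASH = 127;  // table size: the prime closest to the number of distinct words
const int MULT = 31;
tHashNode* bin[NHASH];  // an array of pointers (not a pointer to an array)

/* Hash function: maps each string to a non-negative integer less than NHASH.
   It is named hash_word because a plain "hash" can clash with std::hash
   once "using namespace std" is in effect. */
unsigned int hash_word(const char* p) {
    unsigned int h = 0;  // unsigned, so h never becomes negative
    for (; *p; p++) {
        h = MULT * h + (*p);
    }
    return h % NHASH;
}

void incword(const char* s) {
    unsigned int pos = hash_word(s);
    tHashNode* p;
    /* Look at every node with the same hash value. If the word is found,
       add 1 to its count and return. */
    for (p = bin[pos]; p != NULL; p = p->next) {
        if (strcmp(s, p->word) == 0) {
            (p->count)++;
            return;
        }
    }
    /* Otherwise create a new node, allocate space for it, and copy the string. */
    p = new tHashNode();
    p->word = new char[strlen(s) + 1];
    strcpy(p->word, s);
    p->count = 1;
    /* Insert the new node at the front of the chain. */
    p->next = bin[pos];
    bin[pos] = p;
}

int main() {
    ifstream fin("SelfIntroduction.txt");
    ofstream fout("result.txt");
    if (!fin || !fout) {
        cerr << "can not open the file" << endl;
        return 1;
    }

    /* Initialize every bucket to NULL. */
    for (int i = 0; i < NHASH; i++) {
        bin[i] = NULL;
    }

    /* Read the words and update the counts. A std::string replaces the
       original char buf[100], so a long token cannot overflow the buffer. */
    string buf;
    while (fin >> buf) {
        incword(buf.c_str());
    }

    /* Print every bucket and count the distinct words. */
    tHashNode* p;
    int j = 0;
    for (int i = 0; i < NHASH; i++) {
        fout << "Bucket " << i << " : ";
        for (p = bin[i]; p != NULL; p = p->next) {
            fout << "| " << p->word << " " << p->count << " |";
            j++;
        }
        fout << endl;
    }
    fout << "Number of Words : " << j << endl;

    fin.close();
    fout.close();
    return 0;
}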

Output:

Bucket 0 : | big 1 |
Bucket 1 : | Kim 1 || ago 1 |
Bucket 2 : | good 1 |
Bucket 3 : | Such 1 |
Bucket 4 : | beautiful 2 |
Bucket 5 : | friends 1 |
Bucket 6 : | Perhaps 1 |
Bucket 7 : | done 1 |
Bucket 8 : | with 2 |
Bucket 9 : | maths 1 |
Bucket 10 : | old 1 |
Bucket 11 :
Bucket 12 : | school 2 |
Bucket 13 : | fell 1 |
Bucket 14 : | feels 1 || subject 1 |
Bucket 15 : | music 1 || class 1 |
Bucket 16 : | running 1 |
Bucket 17 :
Bucket 18 : | clever 1 |
Bucket 19 : | mountain 1 |
Bucket 20 :
Bucket 21 : | someone 1 |
Bucket 22 : | this 1 |
Bucket 23 :
Bucket 24 : | to 6 |
Bucket 25 :
Bucket 26 : | live 1 |
Bucket 27 : | from 1 |
Bucket 28 : | Stone 1 |
Bucket 29 : | that 1 |
Bucket 30 : | come 3 |
Bucket 31 : | just 1 |
Bucket 32 : | Thank 1 |
Bucket 33 : | Rizhao 1 |
Bucket 34 : | wahaha 1 |
Bucket 35 :
Bucket 36 : | family 1 || help 1 |
Bucket 37 :
Bucket 38 : | kind-hearted 1 || active 1 |
Bucket 39 :
Bucket 40 :
Bucket 41 :
Bucket 42 :
Bucket 43 :
Bucket 44 :
Bucket 45 :
Bucket 46 :
Bucket 47 :
Bucket 48 : | 15 1 |
Bucket 49 :
Bucket 50 : | exciting 1 |
Bucket 51 : | me 4 |
Bucket 52 :
Bucket 53 :
Bucket 54 :
Bucket 55 : | if 1 |
Bucket 56 : | climbing 1 || favourite 1 |
Bucket 57 :
Bucket 58 :
Bucket 59 :
Bucket 60 :
Bucket 61 : | Hello 2 |
Bucket 62 : | When 1 |
Bucket 63 : | in 4 |
Bucket 64 :
Bucket 65 : | A 1 || volleyball 1 |
Bucket 66 : | like 3 |
Bucket 67 :
Bucket 68 : | am 1 || is 6 |
Bucket 69 : | interested 1 || try 1 || it 2 || study 2 || it's 1 || an 1 |
Bucket 70 :
Bucket 71 : | my 1 |
Bucket 72 : | please 1 || everything 1 |
Bucket 73 : | best 1 || I 11 || one 2 || every 1 |
Bucket 74 : | as 1 |
Bucket 75 : | at 1 || sunny 1 |
Bucket 76 : | province 1 |
Bucket 77 : | love 1 || boy 3 |
Bucket 78 : | you 4 |
Bucket 79 : | If 1 || name 2 |
Bucket 80 : | and 5 |
Bucket 81 : | especially 1 |
Bucket 82 : | everybody 1 |
Bucket 83 :
Bucket 84 : | here 2 || thinks 1 |
Bucket 85 : | kind 1 || can 3 |
Bucket 86 :
Bucket 87 : | In 1 |
Bucket 88 : | arrived 1 |
Bucket 89 : | city 1 |
Bucket 90 : | happy 1 |
Bucket 91 : | be 2 |
Bucket 92 : | Guangdong 1 || belive 1 |
Bucket 93 : | It 1 |
Bucket 94 : | sports 2 |
Bucket 95 : | My 2 |
Bucket 96 : | become 1 |
Bucket 97 : | a 1 || I'm 4 |
Bucket 98 :
Bucket 99 : | This 2 |
Bucket 100 : | China 1 |
Bucket 101 :
Bucket 102 :
Bucket 103 :
Bucket 104 :
Bucket 105 :
Bucket 106 :
Bucket 107 : | we 1 || hope 2 |
Bucket 108 : | three 1 |
Bucket 109 :
Bucket 110 : | very 3 || years 1 |
Bucket 111 : | everyone 1 || OK 1 || well 2 |
Bucket 112 :
Bucket 113 : | But 1 |
Bucket 114 : | of 1 |
Bucket 115 :
Bucket 116 : | days 1 |
Bucket 117 :
Bucket 118 : | need 1 || also 1 |
Bucket 119 : | your 2 |
Bucket 120 : | so 2 || the 2 |
Bucket 121 :
Bucket 122 : | friend 1 || on 1 |
Bucket 123 : | much 2 || lovely 1 |
Bucket 124 :
Bucket 125 :
Bucket 126 : | soon 1 || difficult 1 |
Number of Words : 117
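To see why a word lands in a particular bucket, the hash can be worked by hand. For "to", the characters are 't' = 116 and 'o' = 111 in ASCII, so h = 31 * 116 + 111 = 3707 and 3707 mod 127 = 24, which matches Bucket 24 in the output. The sketch below repeats that calculation in isolation. The name bucket_of is made up for this illustration, and the cast to unsigned char is an added assumption that only matters for non-ASCII bytes.

```cpp
#include <cassert>
#include <cstddef>

// Same constants as the hash table listing.
const unsigned int NHASH = 127;
const unsigned int MULT = 31;

// Returns the bucket index of a NUL-terminated word.
// bucket_of is a hypothetical name used only in this sketch.
unsigned int bucket_of(const char* word) {
    assert(word != NULL);
    unsigned int h = 0;
    for (; *word; ++word) {
        // Multiply the running value by 31, then add the next character code.
        h = MULT * h + static_cast<unsigned char>(*word);
    }
    return h % NHASH;
}
```

Calling bucket_of("I") gives 73 and bucket_of("me") gives 51, again in agreement with the bucket listing.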

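In total running time, the hash table implemented with C-style structs is faster than the map from the C++ Standard Template Library.

A balanced search tree treats strings as indivisible objects, and most implementations of the STL set and map use this structure. The elements of a balanced search tree are always kept in sorted order, which makes operations such as finding a predecessor or printing the elements in order easy. Hashing, on the other hand, has to look inside the string to compute the hash function and scatter the keys across a larger table. Hashing is very fast on average, but it lacks the worst-case performance guarantee that a balanced tree provides, and it cannot support other operations that involve order.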
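As a small illustration of the ordered operations mentioned here, the sketch below finds the predecessor of a word in a std::map by stepping back from lower_bound. The function name predecessor and the convention of returning an empty string when there is no predecessor are assumptions made for this example. A hash container such as std::unordered_map has no lower_bound, so the same query there would need a scan over all keys.

```cpp
#include <map>
#include <string>

// Returns the largest key strictly smaller than `word`,
// or an empty string if no such key exists.
// predecessor is a hypothetical helper for this sketch.
std::string predecessor(const std::map<std::string, int>& counts,
                        const std::string& word) {
    // lower_bound finds the first key that is not less than `word`.
    std::map<std::string, int>::const_iterator it = counts.lower_bound(word);
    if (it == counts.begin()) {
        return "";  // nothing sorts before `word`
    }
    --it;  // step back to the key just before that position
    return it->first;
}
```

The lookup takes logarithmic time in the number of keys, and `word` itself does not have to be present in the map.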

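Next, let's solve POJ problem 2418 (http://poj.org/problem?id=2418):

Problem summary: count how many times each tree species occurs, then compute the frequency of each species as a percentage and print the species in alphabetical order. It is implemented here with map. (A trie or a binary search tree may be more efficient.)

Code: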

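#include <algorithm>
#include <iomanip>
#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;

typedef pair<string, int> PAIR;

/* Orders (name, count) pairs alphabetically by name. */
struct CmpByName {
    bool operator()(const PAIR& lhs, const PAIR& rhs) const {
        return lhs.first < rhs.first;
    }
};

int main() {
    string tree;
    map<string, int> map_tree;
    int count = 0;

    /* One species name per line; count each name and the total. */
    while (getline(cin, tree)) {
        map_tree[tree]++;
        count++;
    }

    /* map already iterates in key order, so this sort does not change
       the order; it is kept from the submitted solution. */
    vector<PAIR> vec_tree(map_tree.begin(), map_tree.end());
    sort(vec_tree.begin(), vec_tree.end(), CmpByName());

    /* Print each species with its percentage to four decimal places. */
    for (size_t i = 0; i < vec_tree.size(); i++) {
        cout << vec_tree[i].first << " "
             << setiosflags(ios::fixed) << setprecision(4)
             << vec_tree[i].second * 100.0 / count << endl;
    }
    return 0;
}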

Submission record:

Run ID: 12298700
User: niuliguo
Problem: 2418
Result: Accepted
Memory: 956K
Time: 8829MS
Language: G++
Code Length: 959B
Submit Time: 2013-11-13 22:35:42
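The remark that a trie could be used instead of map can be sketched roughly as follows. This is only an illustration under assumptions, not the submitted solution: TrieNode, trie_insert, trie_collect, and percentage are made-up names, each node keeps its children in a small std::map keyed by byte value, and no timing was measured, so whether it actually beats map on the judge is untested.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One node per distinct prefix. All names here are hypothetical.
struct TrieNode {
    int count;  // number of input lines that end exactly at this node
    std::map<unsigned char, TrieNode*> children;

    TrieNode() : count(0) {}
    ~TrieNode() {
        std::map<unsigned char, TrieNode*>::iterator it;
        for (it = children.begin(); it != children.end(); ++it) {
            delete it->second;
        }
    }

private:
    // Copying would lead to double deletion, so it is disabled.
    TrieNode(const TrieNode&);
    TrieNode& operator=(const TrieNode&);
};

// Walks down the trie one character at a time, creating nodes as needed,
// and increments the count at the node where the name ends.
void trie_insert(TrieNode& root, const std::string& name) {
    TrieNode* node = &root;
    for (std::string::size_type i = 0; i < name.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(name[i]);
        TrieNode*& child = node->children[c];  // NULL if the edge is new
        if (child == NULL) {
            child = new TrieNode();
        }
        node = child;
    }
    ++node->count;
}

// Depth-first walk. Children are visited in byte order, so the names
// are appended to `out` in the same order std::string comparison gives.
void trie_collect(const TrieNode& node, std::string& prefix,
                  std::vector<std::pair<std::string, int> >& out) {
    if (node.count > 0) {
        out.push_back(std::make_pair(prefix, node.count));
    }
    std::map<unsigned char, TrieNode*>::const_iterator it;
    for (it = node.children.begin(); it != node.children.end(); ++it) {
        prefix.push_back(static_cast<char>(it->first));
        trie_collect(*it->second, prefix, out);
        prefix.erase(prefix.size() - 1);
    }
}

// Share of `count` in `total`, as a percentage.
double percentage(int count, int total) {
    assert(total > 0);
    return count * 100.0 / total;
}
```

A driver would call trie_insert once per input line, then trie_collect on the root with an empty prefix, and print each name with percentage(count, total) to four decimal places.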

Source: http://blog.csdn.net/lavorange/article/details/15951063