[Programming Pearls] Chapter 15, Strings: Counting the Words in a Text (C++ map, C struct implementation, and POJ 2418)
Source: Internet. Editor: 程序博客网. Time: 2024/06/11 05:37
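Problem: build a list of the words contained in a document and count how many times each word occurs. Two approaches:
Solution 1: use the STL map in C++ to record each word and its number of occurrences.
Solution 2: use a struct in C: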
typedef struct Node{ char* word; int count; struct Node* next; }tNode;
The input file, SelfIntroduction.txt, is a simple self-introduction written in English:
Hello every one My name is wahaha I'm a 15 years old boy I live in the beautiful city of Rizhao I'm an active lovely and clever boy In the school my favourite subject is maths Perhaps someone thinks it's difficult to study well But I like it I belive that if you try your best everything can be done well I also like sports very much Such as running volleyball and so on I'm kind-hearted If you need help please come to me I hope we can be good friends OK This is me A sunny boy Hello everybody My name is Stone I come from Guangdong province in China I am very happy to come here to study with you When I arrived at this school three days ago I fell in love with it It is so beautiful and exciting here and everyone is kind to me especially Kim This class feels just like one big family to me I'm interested in sports music and mountain climbing I hope I can become your friend soon Thank you very much
The code is as follows (word.cc):
#include <algorithm>
#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;

typedef pair<string, int> PAIR;

/* Orders pairs by count, highest first. */
struct CmpByValue {
    bool operator()(const PAIR& lhs, const PAIR& rhs) const {
        return lhs.second > rhs.second;
    }
};

int main() {
    ifstream fin("SelfIntroduction.txt");
    ofstream fout("result.txt");
    if (!fin || !fout) {
        cerr << "fail to open the file" << endl;
        return 1;
    }

    string word;
    map<string, int> map_word;
    /* Insert each word into map_word and add 1 to its counter
       (a new counter starts at 0). */
    while (fin >> word) {
        map_word[word]++;
    }

    /* Copy the key/value pairs from map_word into vec_word. */
    vector<PAIR> vec_word(map_word.begin(), map_word.end());
    /* Sort vec_word with CmpByValue. */
    sort(vec_word.begin(), vec_word.end(), CmpByValue());

    /* Output only the 20 most frequent words (fewer if the text has fewer). */
    for (size_t i = 0; i < vec_word.size() && i < 20; i++) {
        fout << vec_word[i].first << " " << vec_word[i].second << endl;
    }
    /* Output the number of distinct words. */
    fout << "Number of Words : " << vec_word.size() << endl;
    return 0;
}
Result:
I 11
is 6
to 6
and 5
you 4
in 4
I'm 4
me 4
boy 3
like 3
can 3
very 3
come 3
beautiful 2
be 2
school 2
so 2
sports 2
one 2
name 2
Number of Words : 117
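This code is straightforward, concise, and runs quickly. To reduce the processing time further, we can build our own hash table, in which each node holds a pointer to the word, the word's occurrence count, and a pointer to the next node in the table.
Implementation with a C struct:
Code: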
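#include <cstring>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

typedef struct HashNode {
    char* word;
    int count;
    struct HashNode* next;
} tHashNode;

const int NHASH = 127;  // table size: a prime close to the number of distinct words (117)
const int MULT = 31;
tHashNode* bin[NHASH];  // an array of pointers (not a pointer to an array)

/* Hash function: maps each string to a non-negative integer less than NHASH.
   It is named hash_word rather than hash to avoid a clash with std::hash. */
unsigned int hash_word(const char* p) {
    unsigned int h = 0;  // unsigned, so h can never become negative
    for (; *p; p++) {
        h = MULT * h + (unsigned char)*p;
    }
    return h % NHASH;
}

void incword(const char* s) {
    unsigned int pos = hash_word(s);
    tHashNode* p;
    /* Look at every node with the same hash value. If the word is found,
       increment its count and return. */
    for (p = bin[pos]; p != NULL; p = p->next) {
        if (strcmp(s, p->word) == 0) {
            (p->count)++;
            return;
        }
    }
    /* Create a new node, allocate space for it, and copy the string. */
    p = new tHashNode();
    p->word = new char[strlen(s) + 1];
    strcpy(p->word, s);
    p->count = 1;
    /* Insert the new node at the front of the list. */
    p->next = bin[pos];
    bin[pos] = p;
}

int main() {
    ifstream fin("SelfIntroduction.txt");
    ofstream fout("result.txt");
    if (!fin || !fout) {
        cerr << "can not open the file" << endl;
        return 1;
    }
    /* Initialize every bucket to NULL. */
    for (int i = 0; i < NHASH; i++) {
        bin[i] = NULL;
    }
    /* Read the data and increment the counts. A std::string is used
       instead of a fixed char[100] so a long word cannot overflow it. */
    string buf;
    while (fin >> buf) {
        incword(buf.c_str());
    }
    tHashNode* p;
    int j = 0;  // number of distinct words
    for (int i = 0; i < NHASH; i++) {
        fout << "Bucket " << i << " : ";
        for (p = bin[i]; p != NULL; p = p->next) {
            fout << "| " << p->word << " " << p->count << " |";
            j++;
        }
        fout << endl;
    }
    fout << "Number of Words : " << j << endl;
    return 0;
}
Result: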
Bucket 0 : | big 1 |
Bucket 1 : | Kim 1 || ago 1 |
Bucket 2 : | good 1 |
Bucket 3 : | Such 1 |
Bucket 4 : | beautiful 2 |
Bucket 5 : | friends 1 |
Bucket 6 : | Perhaps 1 |
Bucket 7 : | done 1 |
Bucket 8 : | with 2 |
Bucket 9 : | maths 1 |
Bucket 10 : | old 1 |
Bucket 11 :
Bucket 12 : | school 2 |
Bucket 13 : | fell 1 |
Bucket 14 : | feels 1 || subject 1 |
Bucket 15 : | music 1 || class 1 |
Bucket 16 : | running 1 |
Bucket 17 :
Bucket 18 : | clever 1 |
Bucket 19 : | mountain 1 |
Bucket 20 :
Bucket 21 : | someone 1 |
Bucket 22 : | this 1 |
Bucket 23 :
Bucket 24 : | to 6 |
Bucket 25 :
Bucket 26 : | live 1 |
Bucket 27 : | from 1 |
Bucket 28 : | Stone 1 |
Bucket 29 : | that 1 |
Bucket 30 : | come 3 |
Bucket 31 : | just 1 |
Bucket 32 : | Thank 1 |
Bucket 33 : | Rizhao 1 |
Bucket 34 : | wahaha 1 |
Bucket 35 :
Bucket 36 : | family 1 || help 1 |
Bucket 37 :
Bucket 38 : | kind-hearted 1 || active 1 |
Bucket 39 :
Bucket 40 :
Bucket 41 :
Bucket 42 :
Bucket 43 :
Bucket 44 :
Bucket 45 :
Bucket 46 :
Bucket 47 :
Bucket 48 : | 15 1 |
Bucket 49 :
Bucket 50 : | exciting 1 |
Bucket 51 : | me 4 |
Bucket 52 :
Bucket 53 :
Bucket 54 :
Bucket 55 : | if 1 |
Bucket 56 : | climbing 1 || favourite 1 |
Bucket 57 :
Bucket 58 :
Bucket 59 :
Bucket 60 :
Bucket 61 : | Hello 2 |
Bucket 62 : | When 1 |
Bucket 63 : | in 4 |
Bucket 64 :
Bucket 65 : | A 1 || volleyball 1 |
Bucket 66 : | like 3 |
Bucket 67 :
Bucket 68 : | am 1 || is 6 |
Bucket 69 : | interested 1 || try 1 || it 2 || study 2 || it's 1 || an 1 |
Bucket 70 :
Bucket 71 : | my 1 |
Bucket 72 : | please 1 || everything 1 |
Bucket 73 : | best 1 || I 11 || one 2 || every 1 |
Bucket 74 : | as 1 |
Bucket 75 : | at 1 || sunny 1 |
Bucket 76 : | province 1 |
Bucket 77 : | love 1 || boy 3 |
Bucket 78 : | you 4 |
Bucket 79 : | If 1 || name 2 |
Bucket 80 : | and 5 |
Bucket 81 : | especially 1 |
Bucket 82 : | everybody 1 |
Bucket 83 :
Bucket 84 : | here 2 || thinks 1 |
Bucket 85 : | kind 1 || can 3 |
Bucket 86 :
Bucket 87 : | In 1 |
Bucket 88 : | arrived 1 |
Bucket 89 : | city 1 |
Bucket 90 : | happy 1 |
Bucket 91 : | be 2 |
Bucket 92 : | Guangdong 1 || belive 1 |
Bucket 93 : | It 1 |
Bucket 94 : | sports 2 |
Bucket 95 : | My 2 |
Bucket 96 : | become 1 |
Bucket 97 : | a 1 || I'm 4 |
Bucket 98 :
Bucket 99 : | This 2 |
Bucket 100 : | China 1 |
Bucket 101 :
Bucket 102 :
Bucket 103 :
Bucket 104 :
Bucket 105 :
Bucket 106 :
Bucket 107 : | we 1 || hope 2 |
Bucket 108 : | three 1 |
Bucket 109 :
Bucket 110 : | very 3 || years 1 |
Bucket 111 : | everyone 1 || OK 1 || well 2 |
Bucket 112 :
Bucket 113 : | But 1 |
Bucket 114 : | of 1 |
Bucket 115 :
Bucket 116 : | days 1 |
Bucket 117 :
Bucket 118 : | need 1 || also 1 |
Bucket 119 : | your 2 |
Bucket 120 : | so 2 || the 2 |
Bucket 121 :
Bucket 122 : | friend 1 || on 1 |
Bucket 123 : | much 2 || lovely 1 |
Bucket 124 :
Bucket 125 :
Bucket 126 : | soon 1 || difficult 1 |
Number of Words : 117
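As a quick sanity check on the bucket listing, the sketch below recomputes the bucket index of a word with the same multiplier (31) and table size (127) as the program above. The helper name bucket_of is made up for this illustration and is not part of the original code.

```cpp
#include <string>

// Same constants as the hash table program above.
const unsigned int NHASH = 127;
const unsigned int MULT = 31;

// Hypothetical helper: folds the characters with h = MULT * h + c,
// then reduces the result modulo NHASH to get the bucket index.
unsigned int bucket_of(const std::string& word) {
    unsigned int h = 0;
    for (unsigned char c : word) {
        h = MULT * h + c;
    }
    return h % NHASH;
}
```

Worked by hand for "to": 't' is 116 and 'o' is 111, so h = 31 * 116 + 111 = 3707, and 3707 mod 127 = 24, which agrees with Bucket 24 in the listing. Likewise "I" (73) lands in Bucket 73, and "am" and "is" both reduce to 68, which is why they share a bucket.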
In total running time, the hash table implemented with the C struct is faster than the map from the C++ Standard Template Library.
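Balanced search trees operate on strings as indivisible objects; most implementations of set and map in the Standard Template Library use this structure. The elements of a balanced search tree are always kept in sorted order, so operations such as finding a predecessor or outputting the elements in order are easy. Hashing, on the other hand, has to look inside the string to compute the hash function, and it scatters the keys across a large table. Hashing is very fast on average, but it lacks the worst-case performance guarantees of balanced trees and cannot support other operations that involve order.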
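To make the contrast concrete, here is a minimal sketch of a predecessor query on std::map, which keeps its keys sorted. The helper name predecessor and the convention of returning an empty string when no predecessor exists are assumptions made for this illustration. std::unordered_map offers no comparable operation, because a hash table does not keep its keys in order.

```cpp
#include <map>
#include <string>

// Hypothetical helper: returns the largest key strictly less than `key`,
// or an empty string if no such key exists.
std::string predecessor(const std::map<std::string, int>& m,
                        const std::string& key) {
    // lower_bound finds the first element whose key is >= `key`.
    std::map<std::string, int>::const_iterator it = m.lower_bound(key);
    if (it == m.begin()) {
        return "";  // nothing sorts before `key`
    }
    --it;  // step back to the element just before that position
    return it->first;
}
```

The lookup takes time logarithmic in the number of keys, and it also works when `key` itself is not stored in the map.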
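Next, let's solve POJ problem 2418 (http://poj.org/problem?id=2418):
Problem: count how many times each tree species occurs, then compute the frequency of each species as a percentage of all trees. It is implemented here with map. (A trie or a hand-written binary search tree may be more efficient.)
Code: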
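#include <algorithm>
#include <iomanip>
#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;

typedef pair<string, int> PAIR;

/* Orders pairs alphabetically by species name (the key). */
struct CmpByKey {
    bool operator()(const PAIR& lhs, const PAIR& rhs) const {
        return lhs.first < rhs.first;
    }
};

int main() {
    string tree;
    map<string, int> map_tree;
    int count = 0;  // total number of trees read
    /* Each input line is one tree; a species name may contain spaces. */
    while (getline(cin, tree)) {
        map_tree[tree]++;
        count++;
    }
    vector<PAIR> vec_tree(map_tree.begin(), map_tree.end());
    /* map already iterates in key order; the sort just makes the ordering explicit. */
    sort(vec_tree.begin(), vec_tree.end(), CmpByKey());
    /* Print each species and its percentage with 4 decimal places. */
    for (size_t i = 0; i < vec_tree.size(); i++) {
        cout << vec_tree[i].first << " " << fixed << setprecision(4)
             << vec_tree[i].second * 100.0 / count << endl;
    }
    return 0;
}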
Run ID: 12298700
User: niuliguo
Problem: 2418
Result: Accepted
Memory: 956K
Time: 8829MS
Language: G++
Code Length: 959B
Submit Time: 2013-11-13 22:35:42
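The problem notes mention a trie as a possible alternative to map. The sketch below shows one way such a counter could look. It is written for clarity rather than speed, it is not the submission measured above, and the names TrieNode, trie_insert and trie_collect are invented for this illustration. Children are stored in a std::map keyed by character, which is an assumption of this sketch; a fixed-size child array is the more common choice when speed matters.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// One trie node: a counter plus links to child nodes, keyed by the next character.
struct TrieNode {
    int count = 0;                                   // names that end at this node
    std::map<char, std::unique_ptr<TrieNode>> next;  // children in character order
};

// Walk down the trie along `name`, creating nodes as needed,
// and bump the counter on the final node.
void trie_insert(TrieNode& root, const std::string& name) {
    TrieNode* node = &root;
    for (char c : name) {
        std::unique_ptr<TrieNode>& child = node->next[c];
        if (!child) {
            child.reset(new TrieNode());
        }
        node = child.get();
    }
    ++node->count;
}

// Depth-first walk that appends (name, count) pairs to `out`.
// For ASCII names the pairs come out in alphabetical order.
void trie_collect(const TrieNode& node, std::string& prefix,
                  std::vector<std::pair<std::string, int>>& out) {
    if (node.count > 0) {
        out.emplace_back(prefix, node.count);
    }
    for (const auto& kv : node.next) {
        prefix.push_back(kv.first);
        trie_collect(*kv.second, prefix, out);
        prefix.pop_back();
    }
}
```

To use it, call trie_insert once per input line, then call trie_collect with an empty prefix string and divide each count by the total number of lines to get the percentage.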
Original article: http://blog.csdn.net/lavorange/article/details/15951063