Word Frequency Statistics for an English Text


Software engineering assignment:


       Before the National Day holiday, our teacher assigned a task: do a word-frequency count over an English text and output the ten most frequent words along with their counts. We also had to do a performance analysis, which drove me crazy since I had never done one before.

       After the teacher's feedback I made some revisions, and here is the final version.


       I stuck with C, using a linked list plus character arrays, and wrote my own stop-word dictionary for filtering, which implements the functionality the teacher asked for.


       The main program design follows these steps:

       1. Read the text file word by word and filter out common words with the stop-word dictionary.
       2. Count the frequency of each remaining word in a linked list.
       3. Select the ten highest counts and print those words with their frequencies.

      Concretely, the program centers on the following two functions:

       List *Statistic() // reads the file and filters words through the dictionary

List *Statistic() {
    struct List *Head = NULL;
    FILE *fp;
    char infilename[1024];
    char p[30];
    char *b[] = {
        "I","We","He","She","Her","His","is","a","an","the",
        "It","it","they","They","he","she","his","her","we",
        "are","to","in","that","and","the","he","all","was","to","would","and","of",
        "that","i","for","could","had","when","as","on","not","us","him","this","so","out",
        "our","never","up","how","at","few","often","get",
        "after","have","their","there","around","be","if","were",
        "again","didn't","your","take","you","it's","toward","with",
        "yourself","than","rather","an","what","don't","you're","or",
        "--","you've","my","but","from","more","no","its","do",
        "which","them","go","are","just","by","will",
        "me","can","then","s","all","now","even","into","And","But","who","Then","So","may","thou","thee","Odysseus","thy"
    };
    printf("Please enter the file path:\n");
    scanf("%1023s", infilename);
    fp = fopen(infilename, "r");                 /* open the text file */
    if (fp == NULL)
        return NULL;
    while (fscanf(fp, "%29s", p) == 1) {         /* read one word at a time */
        int filtered = 0;                        /* dictionary filtering */
        for (int i = 0; i < (int)(sizeof(b) / sizeof(b[0])); i++) {
            if (strcmp(b[i], p) == 0) { filtered = 1; break; }
        }
        if (filtered)
            continue;
        struct List *L = Head;                   /* look the word up in the list */
        while (L != NULL) {
            if (strcmp(L->word, p) == 0) { L->m++; break; }
            L = L->next;
        }
        if (L == NULL) {                         /* first occurrence: insert at the head */
            struct List *temp = (struct List *)malloc(sizeof(struct List));
            strcpy(temp->word, p);
            temp->m = 1;
            temp->next = Head;
            Head = temp;
        }
    }
    fclose(fp);
    return Head;
}


       void Printf(List *Head) // prints the word-frequency statistics


void Printf(List *Head) {
    struct List *q;
    int i, a[10];
    for (i = 0; i < 10; i++)
        a[i] = 0;
    printf("Word frequencies in the text, from high to low:\n");
    for (i = 0; i < 10; i++) {                   /* selection-style top 10 */
        for (q = Head; q != NULL; q = q->next)   /* find the current maximum count */
            if (q->m > a[i])
                a[i] = q->m;
        if (a[i] == 0)                           /* fewer than 10 distinct words */
            break;
        for (q = Head; q != NULL; q = q->next) { /* print that word and clear it */
            if (q->m == a[i]) {
                q->m = 0;                        /* so it is not selected again */
                printf("%s\tcount: %d\n", q->word, a[i]);
                break;
            }
        }
    }
}


    The dictionary the filtering stage uses is one I wrote myself:


char *b[] = {
    "I","We","He","She","Her","His","is","a","an","the",
    "It","it","they","They","he","she","his","her","we",
    "are","to","in","that","and","the","he","all","was","to","would","and","of",
    "that","i","for","could","had","when","as","on","not","us","him","this","so","out",
    "our","never","up","how","at","few","often","get",
    "after","have","their","there","around","be","if","were",
    "again","didn't","your","take","you","it's","toward","with",
    "yourself","than","rather","an","what","don't","you're","or",
    "--","you've","my","but","from","more","no","its","do",
    "which","them","go","are","just","by","will",
    "me","can","then","s","all","now","even","into","And","But","who","Then","So","may","thou","thee","Odysseus","thy"
};



The three steps above implement the required functionality. The code shown so far covers only the user-defined functions; the full source follows.


Full source code

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct List {
    char word[30];          /* the word itself */
    int m;                  /* how many times it has appeared */
    struct List *next;
} List;

List *Statistic();
void Printf(List *Head);

int main()                  /* main function */
{
    List *Head = Statistic();
    if (Head != NULL)
        Printf(Head);
    return 0;
}

List *Statistic() {
    struct List *Head = NULL;
    FILE *fp;
    char infilename[1024];
    char p[30];
    char *b[] = {
        "I","We","He","She","Her","His","is","a","an","the",
        "It","it","they","They","he","she","his","her","we",
        "are","to","in","that","and","the","he","all","was","to","would","and","of",
        "that","i","for","could","had","when","as","on","not","us","him","this","so","out",
        "our","never","up","how","at","few","often","get",
        "after","have","their","there","around","be","if","were",
        "again","didn't","your","take","you","it's","toward","with",
        "yourself","than","rather","an","what","don't","you're","or",
        "--","you've","my","but","from","more","no","its","do",
        "which","them","go","are","just","by","will",
        "me","can","then","s","all","now","even","into","And","But","who","Then","So","may","thou","thee","Odysseus","thy"
    };
    printf("Please enter the file path:\n");
    scanf("%1023s", infilename);
    fp = fopen(infilename, "r");                 /* open the text file */
    if (fp == NULL)
        return NULL;
    while (fscanf(fp, "%29s", p) == 1) {         /* read one word at a time */
        int filtered = 0;                        /* dictionary filtering */
        for (int i = 0; i < (int)(sizeof(b) / sizeof(b[0])); i++) {
            if (strcmp(b[i], p) == 0) { filtered = 1; break; }
        }
        if (filtered)
            continue;
        struct List *L = Head;                   /* look the word up in the list */
        while (L != NULL) {
            if (strcmp(L->word, p) == 0) { L->m++; break; }
            L = L->next;
        }
        if (L == NULL) {                         /* first occurrence: insert at the head */
            struct List *temp = (struct List *)malloc(sizeof(struct List));
            strcpy(temp->word, p);
            temp->m = 1;
            temp->next = Head;
            Head = temp;
        }
    }
    fclose(fp);
    return Head;
}

void Printf(List *Head) {
    struct List *q;
    int i, a[10];
    for (i = 0; i < 10; i++)
        a[i] = 0;
    printf("Word frequencies in the text, from high to low:\n");
    for (i = 0; i < 10; i++) {                   /* selection-style top 10 */
        for (q = Head; q != NULL; q = q->next)   /* find the current maximum count */
            if (q->m > a[i])
                a[i] = q->m;
        if (a[i] == 0)                           /* fewer than 10 distinct words */
            break;
        for (q = Head; q != NULL; q = q->next) { /* print that word and clear it */
            if (q->m == a[i]) {
                q->m = 0;                        /* so it is not selected again */
                printf("%s\tcount: %d\n", q->word, a[i]);
                break;
            }
        }
    }
}


After running the program on a text file named test1.txt, the word-frequency results were as follows.




Next comes the performance analysis. I used the profiling tool built into VS2012.










     The profiling showed that the dictionary-filtering module can still be optimized. I used the most traditional approach: build the dictionary, then compare against every entry in a for loop. After searching online, I found that there are better approaches, such as filtering with a hash table or regular expressions, which would perform better; I will try them when I have time.
