Analyzing word frequency in an English txt file
Source: Internet  Editor: 程序博客网  Date: 2024/05/01 04:11
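Requirements:
Write a program that counts word frequencies in an English text file in txt format and prints the ten most frequent words. The text is 30 KB to 300 KB in size.
Steps:
1. Read a txt text file;
2. Count the words that occur in the text and how many times each one occurs;
3. Define an array of English words that carry little meaning on their own, such as adverbs, pronouns, articles and prepositions;
4. Sort the words that were read by frequency and print the 10 most frequent ones.
Programming language: Java
Test file: D:\\test1.txt, 419K
Profiling tool: VisualVM 1.3.8
Program code: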
Array of English words that carry little meaning on their own (adverbs, pronouns, articles, prepositions and so on):
String strA [] = {"your","had","I","their","not","ago","him","men","day","eighty","able","only","still","In","man","The","will",
"you","years","year","whose","what","with","yours","yes","a","an","are","all","any","been","both","each","either","one","two",
"three","four","five","six","seven","eight","nine","ten","none","little","few","many","much","other","another","some","no",
"every","nobody","anybody","somebody","everybody","when","where","how","who","there","where","is","was","were","do","did",
"this","that","in","on","at","as","first","second","third","fourth","fifth","sixth","ninth","above","over","below","under",
"beside","behind","of","the","after","from","since","for","which","by","next","last","tomorrow","yesterday","before","because",
"against","except","beyond","along","among","but","so","towards","to","it","me","i","he","she","his","they","them","her","its",
"and","has","have","my","would","then","too","or","our","off","we","be","into","well","can","having","being","even","us","these",
"those","if","ours"};
Full code:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class wordCount {
    public static void main(String[] args) throws Exception {
        long time1 = System.currentTimeMillis();

        // Stop words to exclude from the ranking (matching is case-sensitive)
        String strA [] = {"your","had","I","their","not","ago","him","men","day","eighty","able","only","still","In","man","The","will","you",
            "years","year","whose","what","with","yours","yes","a","an","are","all","any","been","both","each","either","one","two","three","four","five",
            "six","seven","eight","nine","ten","none","little","few","many","much","other","another","some","no","every","nobody","anybody","somebody",
            "everybody","when","where","how","who","there","where","is","was","were","do","did","this","that","in","on","at","as","first","second","third",
            "fourth","fifth","sixth","ninth","above","over","below","under","beside","behind","of","the","after","from","since","for","which","by","next",
            "last","tomorrow","yesterday","before","because","against","except","beyond","along","among","but","so","towards","to","it","me","i","he","she",
            "his","they","them","her","its","and","has","have","my","would","then","too","or","our","off","we","be","into","well","can","having","being",
            "even","us","these","those","if","ours"};

        // Read the whole file into a buffer. readLine() strips the line terminator,
        // so a space is appended to keep the last word of one line apart from the first word of the next.
        StringBuffer buffer = new StringBuffer();
        try (BufferedReader reader = new BufferedReader(new FileReader("D:\\test1.txt"))) {
            String line = null;
            while ((line = reader.readLine()) != null) {
                buffer.append(line).append(' ');
            }
        }

        Pattern expression = Pattern.compile("[a-zA-Z]+"); // regular expression that matches one word
        String string = buffer.toString();
        Matcher matcher = expression.matcher(string);
        Map<String, Integer> map = new TreeMap<String, Integer>();
        String word = "";
        int times = 0;
        while (matcher.find()) { // while another word is found
            word = matcher.group(); // the word, used as the key of the tree map
            // Linear scan of the stop-word array
            boolean isStopWord = false;
            for (int i = 0; i < strA.length; i++) {
                if (word.equals(strA[i])) {
                    isStopWord = true;
                    break;
                }
            }
            if (isStopWord) {
                continue; // skip stop words instead of counting them under an empty key
            }
            if (map.containsKey(word)) { // the key is present, so the word has been seen before
                times = map.get(word); // current count for the word
                map.put(word, times + 1);
            } else {
                map.put(word, 1); // first occurrence: add the word to the map
            }
        }

        /*
         * Core idea: a TreeMap is ordered by key, not by value. To sort by value, copy the
         * Map.Entry objects into a list, supply a comparator, and sort the list with
         * Collections.sort(list, comparator).
         */
        List<Map.Entry<String, Integer>> list = new ArrayList<Map.Entry<String, Integer>>(map.entrySet());
        // Comparator that compares entries by word count (the value), ascending
        Comparator<Map.Entry<String, Integer>> comparator = new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> left, Map.Entry<String, Integer> right) {
                return (left.getValue()).compareTo(right.getValue());
            }
        };
        Collections.sort(list, comparator);

        // Print the ten most frequent words. The list is sorted ascending, so walk it from the end;
        // the i >= 0 check covers texts with fewer than ten distinct words.
        int last = list.size() - 1;
        String[] strB = new String[last + 1];
        for (int i = last; i > last - 10 && i >= 0; i--) {
            strB[i] = list.get(i).getKey();
            Integer value = list.get(i).getValue();
            System.out.print("Top" + (last - i + 1) + " : ");
            System.out.println("strB[" + i + "]=" + strB[i] + " \t " + value);
        }

        long time2 = System.currentTimeMillis();
        System.out.println("Elapsed time:");
        System.out.println(time2 - time1 + "ms");
    }
}
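The value-sorting step that the code comments call the core of the program can also be written more compactly. The following is only a sketch, assuming Java 8 or later (the post does not say which Java version was used); the class name TopEntries and the method topN are made up for illustration and are not part of the original program. It sorts in descending order directly and breaks ties alphabetically, so the result does not have to be read from the end of the list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; the name is not from the original program.
final class TopEntries {

    /** Returns up to n entries with the highest counts, most frequent first. */
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; words with equal counts fall back to alphabetical order.
        entries.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        // Copy the sub-list so the caller gets an independent list of at most n entries.
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("apple", 3);
        counts.put("pear", 7);
        counts.put("fig", 3);
        for (Map.Entry<String, Integer> entry : topN(counts, 2)) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Because the size is clamped with Math.min, the same call also works for texts that contain fewer than n distinct words.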
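Performance test:
Analysis and shortcomings:
The program stores the text read from the file in a StringBuffer and checks every word against the array strA[ ], discarding the words that also appear in the array.
This check loops over the whole array for each word, which adds considerably to the running time. In addition, strA[ ] does not list all of the adverbs, pronouns, prepositions, articles and other low-information words in English.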
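One way to avoid scanning the array for every word is a hash-based lookup. The sketch below is an assumption about how that could look, not the author's code: the class StopWords and the method isStopWord are hypothetical names, and the word list is shortened for the example (the full contents of strA[ ] could be passed to Arrays.asList in the same way).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; class and method names are illustrative only.
final class StopWords {

    // Abbreviated list for the example, not the complete strA[] array.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "The", "of", "in", "In", "and", "to", "is"));

    /** Hash lookup (constant time on average) instead of a scan over the whole array. */
    static boolean isStopWord(String word) {
        return STOP_WORDS.contains(word);
    }

    public static void main(String[] args) {
        System.out.println(isStopWord("the"));
        System.out.println(isStopWord("whale"));
    }
}
```

Like the array version, this lookup is case-sensitive, which is why the capitalized forms "The" and "In" still appear in the set.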
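Directions for improvement and extension:
The program can also count word frequencies for larger files, although the running time will probably increase. Optimizing the data structure that stores the words, or improving the method used to look up stop words, could reduce the running time. The contents of strA[ ] should also be completed so that the final result contains only the words we are actually interested in.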
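For larger files, one possible direction is to count words line by line instead of first copying the whole file into a single buffer, and to keep only the current top entries in a small heap instead of sorting every entry. The following is a rough sketch under those assumptions, written for Java 8 or later; StreamingWordCount, count and topN are hypothetical names, and no timing claims are made for it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names come from the original program.
final class StreamingWordCount {

    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    /** Counts words one line at a time, skipping the given stop words. */
    static Map<String, Integer> count(Reader source, Set<String> stopWords) {
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher matcher = WORD.matcher(line);
                while (matcher.find()) {
                    String word = matcher.group();
                    if (!stopWords.contains(word)) {
                        counts.merge(word, 1, Integer::sum); // insert 1 or add 1 to the old count
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    /** Keeps the n most frequent entries in a min-heap, then returns them most frequent first. */
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.<String, Integer>comparingByValue());
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll(); // drop the entry with the smallest count
            }
        }
        result.addAll(heap);
        result.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return result;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a"));
        Reader text = new StringReader("the cat saw a cat\nthe dog saw the cat");
        for (Map.Entry<String, Integer> entry : topN(count(text, stopWords), 2)) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

To run this against a real file, a FileReader for the file path could be passed as the source argument in place of the StringReader used in main.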