Personal project: counting word frequencies in a text

Source: Internet. Editor: 程序博客网. Time: 2024/06/04 09:01

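Project requirements:

Write a program that analyzes how often each word occurs in a text file and prints the 10 most frequent words. The text file is roughly 30 KB to 300 KB in size.

I wrote the word-frequency counter in Java. The approach is as follows:

Words are separated from one another by spaces, punctuation marks, line breaks, and the like. So the basic idea is to read the text into a string, split it into words with Java's java.util.StringTokenizer class, store the per-word counts in a HashMap, and finally sort the entries and print the ten most frequent words. The code is as follows: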

package com.wordscount.main;

// class WordsCount
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.StringTokenizer;

public class WordsCount {

    public static void main(String[] args) {
        String fileName = "D://Desktop/poem.txt";
        readFile(fileName);
    }

    private static void readFile(String fileName) {
        int wordCount = 0; // total number of words
        Map<String, Integer> map = new HashMap<String, Integer>(); // per-word counts, sorted later
        File myFile = new File(fileName);
        if (!myFile.exists()) {
            System.err.println("Can't Find " + fileName);
            return; // nothing to read, so stop here
        }
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(new FileReader(myFile))) {
            StringBuilder text = new StringBuilder();
            String str;
            while ((str = in.readLine()) != null) {
                // readLine() strips the line terminator; put one back so the last word
                // of a line does not merge with the first word of the next line
                text.append(str).append('\n');
            }
            // StringTokenizer breaks the string into tokens.
            // Split on , space ? . ! : " ' and newline
            StringTokenizer token = new StringTokenizer(text.toString(), ", ?.!:\"\"''\n");
            while (token.hasMoreTokens()) { // loop over every token
                wordCount++;
                String word = token.nextToken();
                if (map.containsKey(word)) { // HashMap keys are unique
                    int count = map.get(word);
                    map.put(word, count + 1); // word already seen: increment its count
                } else {
                    map.put(word, 1); // new word: start its count at 1
                }
            }
            System.out.println("Total number of words: " + wordCount);
            sort(map); // sort and print
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void sort(Map<String, Integer> map) {
        List<Map.Entry<String, Integer>> infoIds =
                new ArrayList<Map.Entry<String, Integer>>(map.entrySet());
        Collections.sort(infoIds, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
                // descending by count; Integer.compare avoids subtraction overflow
                return Integer.compare(o2.getValue(), o1.getValue());
            }
        });
        // print at most 10 entries, fewer if the text has fewer distinct words
        isPrint(Math.min(10, infoIds.size()), infoIds);
    }

    private static void isPrint(int wordsNum, List<Map.Entry<String, Integer>> inList) {
        for (int i = 0; i < wordsNum; i++) { // output
            Entry<String, Integer> id = inList.get(i);
            System.out.println(id.getKey() + ":" + id.getValue());
        }
    }
}
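The splitting step can be tried in isolation. The following is a minimal sketch, not part of the original program: TokenizerDemo and split are hypothetical names, and the delimiter string is copied from the listing above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerDemo {
    // Delimiter set copied from the WordsCount listing:
    // comma, space, ?, ., !, :, double quote, single quote, newline
    static final String DELIMITERS = ", ?.!:\"\"''\n";

    // Splits text into tokens; runs of consecutive delimiters yield no empty tokens.
    static List<String> split(String text) {
        List<String> words = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, DELIMITERS);
        while (tokenizer.hasMoreTokens()) {
            words.add(tokenizer.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(split("Go, sleep thou with them."));
        // prints [Go, sleep, thou, with, them]
        System.out.println(split("I'll not leave thee, thou lone one!"));
        // prints [I, ll, not, leave, thee, thou, lone, one]
        System.out.println(split("Left blooming alone;"));
        // prints [Left, blooming, alone;]
    }
}
```

Two things stand out with this delimiter set: the apostrophe splits "I'll" into "I" and "ll", and because ';' is not a delimiter, "alone;" and "alone" would be counted as different words.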

After debugging, I added a DoFilter class that filters out function words. The code is as follows:

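package com.wordscount.main;

import java.util.regex.Pattern;

/**
 * Function-word filter.
 *
 * @version 1.0
 * @author wangyuan
 */
public class DoFilter {

    // Matches strings made up only of digits (and the empty string)
    private static final Pattern DIGITS = Pattern.compile("[0-9]*");

    // Function words to exclude from the count
    private static final String[] FUNCTION_WORDS = {
        "a", "an", "of", "the", "as", "and", "but",
        "in", "on", "at", "to", "oh", "well", "hi"
    };

    /** Returns false if str is a function word or purely numeric, true otherwise. */
    public boolean search(String str) {
        if (DIGITS.matcher(str).matches()) {
            return false;
        }
        for (String functionWord : FUNCTION_WORDS) {
            if (functionWord.equals(str)) {
                return false;
            }
        }
        return true;
    }
}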
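The WordsCount listing does not show where DoFilter is called. Below is a minimal sketch of one way to apply it, assuming the filter runs on each token before it is counted. FilteredCountDemo, keep, and count are hypothetical names, and keep re-implements the DoFilter.search logic inline only so that the example is self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

class FilteredCountDemo {
    private static final Pattern DIGITS = Pattern.compile("[0-9]*");
    private static final String[] FUNCTION_WORDS = {
        "a", "an", "of", "the", "as", "and", "but",
        "in", "on", "at", "to", "oh", "well", "hi"
    };

    // Same rule as DoFilter.search: true when the word should be counted.
    static boolean keep(String word) {
        if (DIGITS.matcher(word).matches()) {
            return false;
        }
        for (String functionWord : FUNCTION_WORDS) {
            if (functionWord.equals(word)) {
                return false;
            }
        }
        return true;
    }

    // Counts only the tokens that pass the filter (assumed integration point).
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        StringTokenizer tokenizer = new StringTokenizer(text, ", ?.!:\"\"''\n");
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            if (keep(word)) {
                counts.merge(word, 1, Integer::sum); // insert 1 or add 1 to the existing count
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // "the", "of", "and" and "1805" are dropped; rose is counted twice, summer once
        System.out.println(count("the rose of summer and the rose of 1805"));
    }
}
```

Because the comparison uses equals, the filter is case-sensitive: a capitalized "The" at the start of a sentence would still be counted unless tokens are lowercased first.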

Final result:

Performance profiling with VisualVM gave the following results:

Appendix: the text used as input:

This last rose of summer Left blooming alone; All her lovely companions Are faded and gone; No flower of her kindred, No rose-bud is nigh, to reflect back her blushes, Or give sigh for sigh. I'll not leave thee, thou lone one! To pine on the stem; Since the lovely are sleeping, Go, sleep thou with them. thus kindly I scatter Thy leaves o'er the bed Where thy mates of the garden Lie scentless and dead. Soon may I follow, When friendships decay, And from Love's shining circle The gems drop away. When true hearts lie withered, And fond ones are flown, O! who would inhabit This bleak world alone?

0 0