Counting the high-frequency words of an article in Java

Source: Internet  Editor: 程序博客网  Time: 2024/05/17 06:09

1. Requirements analysis:

Given a document, count how often each word occurs and output the 10 most frequent words.

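2. Approach:

Counting high-frequency words comes down to splitting the text into words, then storing and tallying how often each one occurs. A HashMap is a natural structure for the counts, with the word as key and its frequency as value. A HashMap has no defined iteration order, however, so the entries have to be sorted by frequency, from high to low, with a custom comparator.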
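The two steps described above, counting and then sorting, can be sketched on an in-memory string before any file handling is involved. This is a minimal sketch, not the post's own listing: the class `FrequencySketch` and its methods `countWords` and `sortByCount` are hypothetical names chosen for illustration, and lower-casing plus splitting on `\W+` is an assumption about what counts as a word.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration; not part of the original post.
class FrequencySketch {

    // Count how often each word occurs in the given text.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumption: words are lower-cased and separated by runs of non-word characters
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                // Insert 1 for a new word, otherwise add 1 to the stored count
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Copy the entries into a list and order them by count, highest first.
    static List<Map.Entry<String, Integer>> sortByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("The cat and the hat and the bat");
        for (Map.Entry<String, Integer> e : sortByCount(counts)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Words with equal counts come out in whatever order the HashMap yields them, so only the relative order of different counts is guaranteed.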

3. Implementation:

package com.zhuke.countWord;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Counts how often each word occurs in a file and prints the top 10.
 *
 * @author ZHUKE
 */
public class CountWord {
    // HashMap that stores the frequency of each word
    Map<String, Integer> wordCount = new HashMap<>();

    public static void main(String[] args) {
        Map<String, Integer> map = new CountWord().wordCount("test.txt");

        // Custom sort: copy the entries into a list and order them by count
        List<Map.Entry<String, Integer>> list = new ArrayList<>(map.entrySet());
        Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> obj1, Map.Entry<String, Integer> obj2) {
                // Sort from high to low
                return Integer.compare(obj2.getValue(), obj1.getValue());
            }
        });

        // Print the ten most frequent words (fewer if the file has fewer distinct words)
        for (int i = 0; i < 10 && i < list.size(); i++) {
            Map.Entry<String, Integer> entry = list.get(i);
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
        // To print every word instead, loop over the whole list.
    }

    /**
     * Counts word frequencies.
     *
     * @param fileName
     *            name of the file to read
     * @return the map holding each word's frequency
     */
    public Map<String, Integer> wordCount(String fileName) {
        // Open the file; try-with-resources closes the reader when done
        try (BufferedReader bufr = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName)))) {
            String s;
            while ((s = bufr.readLine()) != null) {
                // Strip leading and trailing whitespace and convert to lower case
                s = s.trim().toLowerCase();
                // Regular expression: split on runs of non-word characters
                String[] str = s.split("\\W+");
                for (int i = 0; i < str.length; i++) {
                    if (wordCount.containsKey(str[i])) {
                        // The word is already in the map: add 1 to its count
                        wordCount.put(str[i], wordCount.get(str[i]) + 1);
                    } else {
                        // First occurrence: initialise the count to 1
                        wordCount.put(str[i], 1);
                    }
                }
            }
        } catch (FileNotFoundException e) {
            System.out.println("File not found: " + fileName);
        } catch (IOException e) {
            e.printStackTrace();
        }
        // Remove the empty string "" that splitting can produce
        wordCount.remove("");
        return wordCount;
    }
}
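The listing above sorts every entry even though only ten are printed. When the number of distinct words is large, one alternative is to keep just the current best k entries in a bounded min-heap. The sketch below shows that swapped-in technique and is not what the original author does. The class `TopKSketch` and its method `topK` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper class for illustration; not part of the original post.
class TopKSketch {

    // Return the k entries with the highest counts, highest first.
    static List<Map.Entry<String, Integer>> topK(Map<String, Integer> counts, int k) {
        // Min-heap ordered by count: the smallest entry kept so far sits at the head
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > k) {
                heap.poll(); // evict the entry with the smallest count
            }
        }
        // Only the k survivors need sorting, from high to low
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 9);
        counts.put("and", 5);
        counts.put("cat", 3);
        counts.put("hat", 1);
        for (Map.Entry<String, Integer> e : topK(counts, 2)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The heap never holds more than k + 1 entries, so the selection costs O(n log k) instead of the O(n log n) of a full sort. For a single novel-sized input the difference is unlikely to matter much.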

4. Results:

A full-length English novel was used for testing (a txt file of about 0.9 MB). The output is shown in the figure below:

5. Performance analysis:

0 0