《程序员》第9期智慧擂台题目——高频词汇提取
来源:互联网 发布:排球扣球 知乎 编辑:程序博客网 时间:2024/05/16 17:57
面对浩瀚的信息海洋,找到想要的资源有时真的是不容易。在大量文字中搜索高频词汇是信息搜索和数据压缩的共通课题。
这次智慧擂台请大家在一个比较庞大的英文文本中找出M个数量最多的短语(由N个单词组成)。统一处理相同的文本文件,该文本只包含英文单词、空格和回行符,比较谁的程序效率最高,这个文本近日发布。
程序输入:M,N,文本文件路径(M不超过20,N不超过8)
程序输出:高频短语及其数量清单
擂台规则:提交符合以上要求的可执行程序,语言招式不限,点到为止;
我们将在统一的环境下对每位选手的作品进行公平的测试,
比较出综合用时最少的程序。
源程序
import java.io.*;
import java.util.*;
class tt
{
public String phrase;
public int count;
}
public class searchphrase {
private static LinkedHashMap phrase = new LinkedHashMap();
static tt[] max_phrase;
private static Vector SeparateString(String s)
{
Vector v = new Vector();
String temp = "";
for(int i = 0; i < s.length(); i++)
{
if(s.charAt(i) != ' ')
{
temp += s.charAt(i);
}
else
{
if(temp != "")
v.add(temp);
temp = "";
}
}
if(temp != "")
v.add(temp);
return v;
}
private static void swap(int pos, int count, String phrase)
{
int i;
if(max_phrase[pos - 1].count < count)
{
for(i = pos - 1; i > 0; i--)
{
if(max_phrase[i - 1].count > max_phrase[i].count)
break;
}
max_phrase[pos].count = max_phrase[i].count;
max_phrase[pos].phrase = max_phrase[i].phrase;
max_phrase[i].count = count;
max_phrase[i].phrase = phrase;
}
}
private static void adjust_max(int count, String phrase)
{
int i, j;
if(count <= max_phrase[max_phrase.length - 1].count) return;
for(i = max_phrase.length - 1; i >= 0; i--)
{
if(max_phrase[i].phrase.equals(phrase))
{
max_phrase[i].count = count;
if(i > 0)
{
swap(i, count, phrase);
}
return;
}
}
max_phrase[max_phrase.length - 1].count = count;
max_phrase[max_phrase.length - 1].phrase = phrase;
if(i > 0)
{
swap(max_phrase.length - 1, count, phrase);
}
}
private static void js(Vector v, int n)
{
String s;
for(int i = 0; i < v.size() - n + 1; i++)
{
s = "";
for(int j = i; j < i + n; j++)
{
s += v.get(j) + " ";
}
int count = 1;
if(phrase.containsKey(s.hashCode()))
{
count = Integer.parseInt(phrase.get(s.hashCode()).toString());
count++;
}
phrase.put(s.hashCode(), count);
adjust_max(count, s);
}
}
public static void main(String[] args) {
try
{
long t;
int m, n;
String path;
m = Integer.parseInt(args[0]);
n = Integer.parseInt(args[1]);
path = args[2];
max_phrase = new tt[m];
for(int i = 0; i < m; i++)
{
max_phrase[i] = new tt();
max_phrase[i].count = 0;
max_phrase[i].phrase = "";
}
t = (new java.util.Date()).getTime();
java.io.FileReader fr = new java.io.FileReader(path);
java.io.BufferedReader br = new BufferedReader(fr);
String s;
Vector v = null;
while((s = br.readLine()) != null)
{
v = SeparateString(s);
js(v, n);
}
for(int i = 0; i < m; i++)
{
System.out.println(max_phrase[i].phrase);
System.out.println(max_phrase[i].count);
System.out.println();
}
t = (new java.util.Date()).getTime() - t;
System.out.print(t);
System.out.println(" ms");
}
catch(Exception e)
{
System.out.println(e.getMessage());
}
}
}
测试结果1:m = 20 n = 8
under games played won drawn lost goals for
71
tabulated under games played won drawn lost goals
70
games played won drawn lost goals for against
70
May Xinhua Following are the results from the
69
played won drawn lost goals for against and
59
won drawn lost goals for against and points
59
Jan Xinhua Following are the results from the
48
Chinas economic efficiency indicators of the sector of
39
The industrial statistics include all stateowned enterprises and
39
industrial statistics include all stateowned enterprises and the
39
statistics include all stateowned enterprises and the nonstateowned
39
include all stateowned enterprises and the nonstateowned ones
39
all stateowned enterprises and the nonstateowned ones with
39
stateowned enterprises and the nonstateowned ones with annual
39
enterprises and the nonstateowned ones with annual sales
39
and the nonstateowned ones with annual sales income
39
Xinhua Chinas economic efficiency indicators of the sector
39
the nonstateowned ones with annual sales income over
39
nonstateowned ones with annual sales income over million
39
up percent over the same period last year
35
13594 ms
测试结果2 m = 10 n = 5
Xinhua Following are the results
295
May Xinhua Following are the
209
Following are the results from
183
are the results from the
176
April Xinhua Following are the
141
Jan Xinhua Following are the
122
billion yuan billion US dollars
120
won drawn lost goals for
88
played won drawn lost goals
88
Dec Xinhua Following are the
87
12437 ms
以上源程序是采用的是最简单的方法,谁有更好,效率更高的方法,请跟贴!!
- 《程序员》第9期智慧擂台题目——高频词汇提取
- 《程序员》第9期智慧擂台题目——高频词汇提取
- 《程序员》06年9期智慧擂台 :高频词汇提取
- 《程序员》杂志06年第7期智慧擂台题
- 高频词汇提取
- 程序员12期算法擂台
- 《程序员》2007年第九期算法擂台——骑士聚会
- 08年第3期算法擂台,网友聚会题目的另一种解答
- 高频词汇提取的Java实现
- 9期算法擂台源代码
- 《程序员》2008年第1期"算法擂台" - Cantor表解答
- 《程序员》2008年第1期"算法擂台" - Cantor表解答
- 《程序员》2008年第2期"算法擂台" - 微积分习题解答
- 《程序员》2008年第1期"算法擂台" - Cantor表解答
- 高频词汇
- 《程序员》“智慧擂台”栏目诚征出题专家启事
- 高频词汇提取的Java实现(转载)
- Pandas结合Jieba提取xlsx表格文件高频词汇
- 新版Eclipse中JFace需要的运行库配置
- 获得元素的内容
- 验证码的三种做法[纯数字][纯汉字][字母/数字/汉字混合]C#
- C#字符串的使用笔记
- 我对XNA发布的一些评论和看法
- 《程序员》第9期智慧擂台题目——高频词汇提取
- ASP.NET中为DataGrid添加单选框
- Windows2000/XP系统服务列表及说明
- CSDN style blue
- 利用Lucene制作中文搜尋應用[摘]
- Asp.net(C#)实现验证码功能
- asp.net 性能---asp.net performance
- JFreeChart 开发
- 常用单词