Counting the n most frequent words in a text file on Linux
Source: Internet  Editor: 程序博客网  Time: 2024/05/16 15:28
The key script:
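[root@bogon tmp]# cat stat.sh
#!/bin/bash
end=$1 # $1, the first argument: n, how many of the most frequent words to report
cat "$2" | # $2, the second argument: the name of the file to analyze
# tr is like a stripped-down sed: it replaces one character with another, or deletes or squeezes repeated characters
# -c takes the complement of the set, i.e. everything outside A-Za-z (every non-letter character)
# -s squeezes each run of repeated output characters into a single one
# \n is the newline character (octal \012), so each run of non-letter characters becomes a single newline
tr -cs 'A-Za-z' '\n' | # this guarantees that each line holds at most one word
tr A-Z a-z | # convert uppercase letters to lowercase
sort | # sort the lines so that identical words become adjacent
uniq -c | # collapse adjacent identical lines and prefix each with its count
sort -k1,1nr -k2 | # sort by field 1 numerically (n) in descending order (r), then by field 2
head -n "$end" # show the first n lines
[root@bogon tmp]#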
The script takes two arguments: the first is n, the number of most frequent words to report;
the second is the text file to analyze.
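As a minimal sketch of how the two arguments might be validated before the pipeline runs, the function below wraps the same commands. The name top_words, the usage message, and the return codes are assumptions made for illustration; they are not part of the original stat.sh.

```shell
#!/bin/bash
# Hypothetical variant of stat.sh: same pipeline, plus basic argument checks.
top_words() {
    local n=$1 file=$2
    # n must be non-empty and consist of digits only
    case $n in
        ''|*[!0-9]*) echo "usage: top_words N FILE" >&2; return 2 ;;
    esac
    # the file must exist and be readable
    [ -r "$file" ] || { echo "cannot read: $file" >&2; return 1; }
    tr -cs 'A-Za-z' '\n' < "$file" |   # one word per line
        tr 'A-Z' 'a-z' |               # lowercase
        sort |                         # group identical words together
        uniq -c |                      # count each group
        sort -k1,1nr -k2 |             # highest count first, ties alphabetical
        head -n "$n"                   # keep the first n lines
}
```

Called as top_words 10 words.txt, it should behave like ./stat.sh 10 words.txt, while rejecting a non-numeric n or an unreadable file instead of passing them on to head and cat.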
[root@bogon tmp]# cat words.txt
Occasionally, Dad would get out his mandolin and play for the family. We three children: Trisha, Monte and I, George Jr., would often sing along. Songs such as the Tennessee Waltz, Harbor Lights and around Christmas time, the well-known rendition of Silver Bells. "Silver Bells, Silver Bells, its Christmas time in the city" would ring throughout the house. One of Dad's favorite hymns was "The Old Rugged Cross". We learned the words to the hymn when we were very young, and would sing it with Dad when he would play and sing. Another song that was often shared in our house was a song that accompanied the Walt Disney series: Davey Crockett. Dad only had to hear the song twice before he learned it well enough to play it. "Davey, Davey Crockett, King of the Wild Frontier" was a favorite song for the family. He knew we enjoyed the song and the program and would often get out the mandolin after the program was over. I could never get over how he could play the songs so well after only hearing them a few times. I loved to sing, but I never learned how to play the mandolin. This is something I regret to this day.
[root@bogon tmp]#
Result:
[root@bogon tmp]# ./stat.sh 10 words.txt
18 the
7 and
6 to
6 would
5 i
5 play
5 song
5 was
4 dad
4 he
[root@bogon tmp]#
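One detail worth noting in this result: because only letters count as word characters, the apostrophe in "Dad's" acts as a separator, so that token contributes one dad and one stray s. The sketch below contrasts the letters-only set with a set that also keeps the apostrophe. Keeping the apostrophe is an assumption about what one might want, not something the original script does, and the variable names are illustrative.

```shell
#!/bin/bash
# A fragment taken from words.txt.
text="One of Dad's favorite hymns"

# Letters only: the apostrophe is a separator, so "Dad's" becomes "Dad" and "s".
split=$(printf '%s\n' "$text" | tr -cs 'A-Za-z' '\n')

# Letters plus apostrophe: "Dad's" stays on a single line.
kept=$(printf '%s\n' "$text" | tr -cs "A-Za-z'" '\n')

printf 'letters only:\n%s\n\n' "$split"
printf 'with apostrophe:\n%s\n' "$kept"
```

Which behavior is preferable depends on the text; keeping apostrophes would also keep single quotes that wrap a word, so the letters-only set is the simpler default.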
Without a script, the same thing can be done with a single command:
cat words.txt | tr -cs 'A-Za-z' '\n' | tr A-Z a-z | sort | uniq -c | sort -k1,1nr -k2 | head -n 10
The effect is exactly the same.
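The key commands are tr, sort and uniq -c:
tr replaces every non-letter character with a newline and squeezes each run of newlines into one
sort orders the words so that identical ones are adjacent
uniq -c counts how many times each word occurs
the second sort then orders the result by those counts
head -n 10 (the older shorthand head -10 is equivalent) shows the first 10 lines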
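To see what each of these commands contributes, the stages can be run one at a time on a tiny input. This is only an illustrative sketch: the sample sentence and the variable names are made up for the demonstration.

```shell
#!/bin/bash
# Walk a small sample through the pipeline one stage at a time.
sample='The cat; the DOG, the cat.'

# Stage 1: one word per line, lowercased.
words=$(printf '%s\n' "$sample" | tr -cs 'A-Za-z' '\n' | tr 'A-Z' 'a-z')

# Stage 2: sort so that identical words sit next to each other.
sorted=$(printf '%s\n' "$words" | sort)

# Stage 3: collapse adjacent duplicates, prefixing each word with its count.
counted=$(printf '%s\n' "$sorted" | uniq -c)

# Stage 4: highest count first; ties are broken alphabetically.
ranked=$(printf '%s\n' "$counted" | sort -k1,1nr -k2)

printf 'words:\n%s\n\nsorted:\n%s\n\n' "$words" "$sorted"
printf 'counted:\n%s\n\nranked:\n%s\n' "$counted" "$ranked"
# ranked lists "3 the", "2 cat", "1 dog" (uniq -c pads the counts with leading spaces)
```

The first sort is what makes uniq -c work, since uniq only merges lines that are adjacent.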