Counting the n most frequent words in a text file on Linux

Source: Internet  Editor: 程序博客网  Time: 2024/05/16 15:28


The key script:


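[root@bogon tmp]# cat stat.sh
#!/bin/bash
end=$1 # $1, the first argument: n, the number of most frequent words to report

# $2, the second argument: the name of the file to analyze
# tr works like a stripped-down sed: it translates one character into another, deletes characters, or squeezes repeats
# -c takes the complement of the set, here every character that is not a letter
# -s squeezes each run of repeated output characters into a single one
# \n is a newline, so every run of non-letter characters becomes exactly one newline
tr -cs 'A-Za-z' '\n' < "$2" | # this leaves one word per line
tr 'A-Z' 'a-z' | # convert uppercase letters to lowercase
sort | # sort the lines so that identical words are adjacent
uniq -c | # collapse identical lines and prefix each with its count
sort -k1,1nr -k2,2 | # sort by field 1 numerically (n) in descending order (r), then by field 2
head -n "$end" # show the first n lines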
 




[root@bogon tmp]# 

The script takes two arguments. The first is n, the number of most frequent words to display.

The second is the text file to analyze.
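
The script uses $1 and $2 without checking them, so a missing or non-numeric argument only shows up as an error from tr or head. A minimal sketch of a guard that could sit at the top of stat.sh is shown here. The function name check_args, the messages, and the return codes are assumptions for illustration and are not part of the original script.

```shell
#!/bin/bash
# check_args is a hypothetical helper, not part of the original stat.sh.
# It returns 0 when called with a non-negative integer n and a readable file.
check_args() {
    if [ "$#" -ne 2 ]; then
        echo "usage: stat.sh n file" >&2
        return 2
    fi
    case $1 in
        # reject an empty string or anything containing a non-digit
        ''|*[!0-9]*)
            echo "n must be a non-negative integer: $1" >&2
            return 2
            ;;
    esac
    if [ ! -r "$2" ]; then
        echo "cannot read file: $2" >&2
        return 1
    fi
    return 0
}
```

In stat.sh the guard would be invoked as check_args "$@" || exit before the pipeline starts.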


[root@bogon tmp]# cat words.txt 
Occasionally, Dad would get out his mandolin and play for the family. We three children: Trisha, Monte and I, George Jr., would often sing along. Songs such as the Tennessee Waltz, Harbor Lights and around Christmas time, the well-known rendition of Silver Bells. "Silver Bells, Silver Bells, its Christmas time in the city" would ring throughout the house. One of Dad's favorite hymns was "The Old Rugged Cross". We learned the words to the hymn when we were very young, and would sing it with Dad when he would play and sing. Another song that was often shared in our house was a song that accompanied the Walt Disney series: Davey Crockett. Dad only had to hear the song twice before he learned it well enough to play it. "Davey, Davey Crockett, King of the Wild Frontier" was a favorite song for the family. He knew we enjoyed the song and the program and would often get out the mandolin after the program was over. I could never get over how he could play the songs so well after only hearing them a few times. I loved to sing, but I never learned how to play the mandolin. This is something I regret to this day.
[root@bogon tmp]# 


Result:


[root@bogon tmp]# ./stat.sh 10 words.txt 
     18 the
      7 and
      6 to
      6 would
      5 i
      5 play
      5 song
      5 was
      4 dad
      4 he
[root@bogon tmp]# 
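
One way to sanity-check a single line of this output is to count that word independently with grep. The sketch below assumes a grep that supports -o, -i, -w and -F (GNU and BSD grep both do); the function name count_word is hypothetical. Because -w treats digits and underscores as word characters while the tr pipeline treats only letters as such, the two counts can differ on text that contains those characters.

```shell
#!/bin/bash
# count_word is a hypothetical helper for cross-checking one word's count.
# -o prints each match on its own line, -i ignores case,
# -w matches whole words only, -F treats the word as a literal string.
count_word() {
    grep -oiwF -- "$1" "$2" | wc -l | tr -d ' '
}
```

For example, count_word the words.txt should agree with the count reported for "the" above.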


Without the script, the same job can be done with a single command line:


tr -cs 'A-Za-z' '\n' < words.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -k1,1nr -k2,2 | head -n 10


The result is the same.


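The key commands are tr, sort, and uniq -c.

tr replaces every non-letter character with a newline and squeezes consecutive newlines into one, so each word ends up on its own line.

The first sort orders the words so that identical words sit on adjacent lines.

uniq -c counts how many times each word occurs.

The second sort orders the lines by those counts, highest first.

head -n 10 (written head -10 in the older syntax) shows the first 10 lines.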
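
To see what each stage contributes, the pipeline can be split into small functions and run on a one-line sample. This is only an illustrative sketch: the function names split_words, count_words and rank_words and the sample sentence are made up here and do not appear in the original script.

```shell
#!/bin/bash
# split_words, count_words and rank_words are hypothetical names for illustration.

# Stage 1: turn every run of non-letters into one newline, then lowercase.
split_words() {
    tr -cs 'A-Za-z' '\n' | tr 'A-Z' 'a-z'
}

# Stage 2: sort groups identical words; uniq -c prefixes each with its count.
count_words() {
    sort | uniq -c
}

# Stage 3: order by count (descending), break ties alphabetically, keep $1 lines.
rank_words() {
    sort -k1,1nr -k2,2 | head -n "$1"
}

sample='The cat and the dog. The end!'
printf '%s\n' "$sample" | split_words
printf '%s\n' "$sample" | split_words | count_words | rank_words 2
```

The first command prints the, cat, and, the, dog, the, end, one word per line. The second prints "the" with a count of 3 followed by "and" with a count of 1; the width of the count column depends on the uniq implementation.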





