【Leetcode Shell】Word Frequency

Source: Internet | Editor: 程序博客网 | Time: 2024/06/06 10:02

Problem:

Write a bash script to calculate the frequency of each word in a text file words.txt.

For simplicity's sake, you may assume:

words.txt contains only lowercase characters and space ' ' characters.
Each word must consist of lowercase characters only.
Words are separated by one or more whitespace characters.

For example, assume that words.txt has the following content:

the day is sunny the the
the sunny is is

Your script should output the following, sorted by descending frequency:

the 4
is 3
sunny 2
day 1

Note:
Don't worry about handling ties; it is guaranteed that each word's frequency count is unique.

First attempt:

# Read from the file words.txt and output the word frequency list to stdout.

sed 's/ /\n/g' words.txt | sort | uniq -c | sort -r | awk '{print $2 " " $1}'

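Idea:

(1) Use sed to turn each space into a newline.
(2) Sort the result with sort.
(3) Use uniq -c to count how many times each word occurs.
(4) Sort that result in reverse order with sort -r.
(5) Use awk to rearrange the columns.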

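Error:

Cause of the error:

I overlooked the effect of multiple consecutive spaces (or tabs). When two words are separated by several spaces, sed turns every single space into its own newline, so the extra spaces leave empty lines that uniq -c then counts as if they were a word.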
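The failure is easy to reproduce on a tiny input. The sketch below runs only the first pipeline stage on a line with a double space and counts the empty lines it leaves behind. The sample text and the function name split_on_single_spaces are made up for illustration, and \n in the replacement assumes GNU sed.

```shell
# Hypothetical helper: the first pipeline stage on its own (GNU sed assumed).
split_on_single_spaces() {
    sed 's/ /\n/g'
}

# Two spaces between "is" and "sunny": each space becomes its own newline,
# so one empty line ends up between the two words.
printf 'is  sunny\n' | split_on_single_spaces | grep -c '^$'
# prints 1 (one empty line)
```

Fed into sort | uniq -c, that empty line would be counted like any other word, which is why the output gains a bogus entry.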

Second attempt:

# Read from the file words.txt and output the word frequency list to stdout.

sed 's/ /\n/g' words.txt | sed '/^\s*$/d' | sort | uniq -c | sort -r | awk '{print $2 " " $1}'

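Since the input for this problem contains only spaces and no tabs, I added a step between (1) and (2) that deletes blank lines.

Still an error:

Cause of the error:

"can 13" should have been at the very top, but it ended up last. That points to the sort step: sort -r compares lines as text rather than as numbers, so a two-digit count is not guaranteed to rank above a one-digit one.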

Third attempt:

# Read from the file words.txt and output the word frequency list to stdout.

sed 's/ /\n/g' words.txt | sed '/^\s*$/d' | sort | uniq -c | sort -rn | awk '{print $2 " " $1}'

Accepted

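So the sort command was indeed the problem. Adding the -n option sorts by the occurrence count (note: each line of uniq -c output has the form "count word"), which avoids the earlier misordering.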
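The difference between the two sort modes can be seen in isolation. The sketch below uses bare numbers rather than real uniq -c output, because with the padded counts that uniq -c prints, the result of a text comparison also depends on the locale. The function names by_text_desc and by_count_desc are hypothetical.

```shell
# Hypothetical helpers wrapping the two sort invocations being compared.
by_text_desc() {
    sort -r     # reverse order, lines compared as text
}
by_count_desc() {
    sort -rn    # reverse order, lines compared by numeric value
}

# As text, "9" > "2" > "13" because the first characters are compared first.
printf '2\n13\n9\n' | by_text_desc
# prints 9, 2, 13 (one per line)

# As numbers, 13 > 9 > 2.
printf '2\n13\n9\n' | by_count_desc
# prints 13, 9, 2 (one per line)
```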

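Key points for this problem:

1. Converting with sed

(1) Turn spaces into newlines: sed 's/ /\n/g'

(2) Delete blank lines: sed '/^\s*$/d'. Alternatives are awk NF, awk '!/^$/' (which drops only completely empty lines), or tr -s '\n'.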
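The two sed steps can also be folded into one command with tr, which is a different tool from the sed approach used in the attempts above. A minimal sketch, assuming the input contains only spaces as separators; the function name split_words is made up for illustration.

```shell
# Hypothetical helper: translate spaces to newlines and squeeze runs of
# newlines into one, so repeated spaces do not leave empty lines.
split_words() {
    tr -s ' ' '\n'
}

printf 'the  day   is\n' | split_words
# prints the, day, is (one per line)
```

One caveat: if a line starts with spaces, the squeezed run still leaves a single empty line at the front, so a blank-line filter may still be needed for such input.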

2. Sorting with sort

(1) sort -r sorts in reverse order

  -r, --reverse               reverse the result of comparisons
      --sort=WORD             sort according to WORD:
                                general-numeric -g, human-numeric -h, month -M,
                                numeric -n, random -R, version -V

(2) sort -n sorts by the numeric value of the string. The help text says: "compare according to string numerical value"

3. Finding duplicates with uniq

Running uniq --help shows the following note in the help text:

Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use `sort -u' without `uniq'.

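As the note says, uniq only detects duplicates that are adjacent, so we run sort before uniq to bring identical words together, which lets uniq -c count them. Note that sort -u, which the help text mentions, only removes duplicate lines and does not report counts, so it cannot replace sort | uniq -c for this problem.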
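The adjacency rule is easy to check on three lines of input. In the sketch below, the helper count_adjacent is hypothetical; it only strips the padding that uniq -c puts in front of each count so the output is easier to read.

```shell
# Hypothetical helper: run uniq -c and print "count word" without padding.
count_adjacent() {
    uniq -c | awk '{print $1, $2}'
}

# Unsorted: the two "the" lines are not adjacent, so each is counted alone.
printf 'the\nis\nthe\n' | count_adjacent
# prints: 1 the / 1 is / 1 the

# Sorted first: the duplicates are adjacent and are counted together.
printf 'the\nis\nthe\n' | sort | count_adjacent
# prints: 1 is / 2 the

# sort -u removes the duplicate but gives no count at all.
printf 'the\nis\nthe\n' | sort -u
# prints: is / the
```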

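4. Formatting with awk

The output of sort -rn has the form "count word", so the columns need rearranging: awk (whose default field separator is whitespace) simply prints the second column followed by the first.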
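As a small illustration of that last stage, the sketch below feeds two hand-written lines in the padded "count word" layout through the same awk program. The function name swap_columns and the sample lines are made up for illustration.

```shell
# Hypothetical helper: print the word first, then its count.
swap_columns() {
    awk '{print $2 " " $1}'
}

# Leading padding is ignored by awk's default field splitting.
printf '      4 the\n      3 is\n' | swap_columns
# prints: the 4 / is 3
```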

Extension:

How would you write the script if the file also contains tab characters?

# Read from the file words.txt and output the word frequency list to stdout.

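sed -e 's/ /\n/g' -e 's/\t/\n/g' words.txt | sed '/^\s*$/d' \
| sort | uniq -c | sort -rn | awk '{print $2 " " $1}'

Tabs separate words just like spaces do, so sed should turn each tab into a newline as well; simply deleting the tabs would glue neighbouring words together. Note that to run several sed commands in one invocation, each one is passed with the -e option.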
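For comparison, mixed spaces and tabs can also be handled without sed at all, because awk already splits fields on any run of whitespace. This is a different technique from the sed pipeline above, sketched under the assumption that counts are unique as the problem guarantees; the function name word_frequency is hypothetical.

```shell
# Hypothetical helper: count words separated by any mix of spaces and tabs.
# Reads the files given as arguments, or standard input if there are none.
word_frequency() {
    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
         END { for (w in count) print w, count[w] }' "$@" |
        sort -k2,2nr    # order by the count column, highest first
}

printf 'the\tday  is\nthe is\tthe\n' | word_frequency
# prints: the 3 / is 2 / day 1
```

With the problem's file, the call would be word_frequency words.txt. The counting happens in an awk array, so no blank-line cleanup is needed.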

0 0