Python: Word Frequency Counting in 10 Lines of Code

Source: Internet | Editor: 程序博客网 | Date: 2024/05/20 20:19

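Problem: we have two English e-books (which also contain some Chinese lines). Count how often each word occurs in each book, add the two sets of counts together, and present the result as a dictionary: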

{'the': 2154, 'and': 1394, 'to': 1080, 'of': 871, 'a': 861, 'his': 639, 'The': 637, 'in': 515, 'he': 461, 'with': 310, 'that': 308, 'you': 295, 'for': 280, 'A': 269, 'was': 258, 'him': 246, 'I': 234, 'had': 220, 'as': 217, 'not': 215, 'by': 196, 'on': 189, 'it': 178, 'be': 164, 'at': 153, 'from': 149, 'they': 149, 'but': 149, 'is': 144, 'her': 144, 'their': 143, 'who': 131, 'all': 121, 'one': 119, 'which': 119, ...}  # partial output

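Thanks to Python's powerful standard library, the solution takes only 10 lines of code (the two documents used in this article can be downloaded from http://pan.baidu.com/s/1pKuO7fP):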

import re, collections

def get_words(file):
    with open(file) as f:
        words_box = []
        for line in f:
            # Keep only lines that start with an ASCII letter or digit,
            # so the Chinese lines do not affect the counts.
            if re.match(r'[a-zA-Z0-9]', line):
                words_box.extend(line.strip().split())
    return collections.Counter(words_box)

print(get_words('emma.txt') + get_words('伊索寓言.txt'))
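To see what the line filter does in isolation, here is a minimal sketch on a few made-up lines. The sample strings and the helper name `keeps_line` are illustrative assumptions and are not part of the original script. Because `re.match` only tests the beginning of the string, a line is kept only when its first character is an ASCII letter or digit.

```python
import re

def keeps_line(line):
    # Hypothetical helper: True when the line starts with an ASCII letter or digit.
    return re.match(r'[a-zA-Z0-9]', line) is not None

samples = [
    "Emma Woodhouse, handsome, clever, and rich",  # starts with a letter
    "第一章",                                       # starts with a Chinese character
    "  indented text",                             # starts with whitespace
    "",                                            # empty line
]
for s in samples:
    print(repr(s), keeps_line(s))

# A variant of the pattern with a trailing `*` can match zero characters,
# so it succeeds on every line and filters nothing.
print(re.match(r'[a-zA-Z0-9]*', "第一章") is not None)
```

Note that lines beginning with whitespace or a quotation mark are skipped as well, which may or may not be what you want for a given text.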

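How does it work? First we build a list of all the words. The list may contain duplicates, and that is exactly what makes frequency counting possible. Using the Counter class from the collections module, we can then easily obtain a dictionary of words and their frequencies.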

>>> from collections import Counter
>>> words = [
...     'look', 'into', 'my', 'eyes', 'look', 'into', 'my', 'eyes',
...     'the', 'eyes', 'the', 'eyes', 'the', 'eyes', 'not', 'around', 'the',
...     'eyes', "don't", 'look', 'around', 'the', 'eyes', 'look', 'into',
...     'my', 'eyes', "you're", 'under'
... ]
>>> a = Counter(words)
>>> a
Counter({'eyes': 8, 'the': 5, 'look': 4, 'into': 3, 'my': 3, 'around': 2, "you're": 1, "don't": 1, 'under': 1, 'not': 1})

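The coolest feature of Counter instances is that they support arithmetic such as addition and subtraction. Merging the word counts of the two books at the start of this article relies on exactly this feature.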

>>> b = Counter(['why', 'are', 'you', 'not', 'looking', 'in', 'my', 'eyes'])
>>> b
Counter({'eyes': 1, 'looking': 1, 'are': 1, 'in': 1, 'not': 1, 'you': 1, 'my': 1, 'why': 1})
>>> # Combine counts
>>> c = a + b
>>> c
Counter({'eyes': 9, 'the': 5, 'look': 4, 'my': 4, 'into': 3, 'not': 2, 'around': 2, "you're": 1, "don't": 1, 'in': 1, 'why': 1, 'looking': 1, 'are': 1, 'under': 1, 'you': 1})
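Subtraction works in a similar way. The following is a small sketch with made-up word lists (the values are illustrative, not taken from the two books) showing both operators side by side:

```python
from collections import Counter

a = Counter(['eyes', 'eyes', 'look', 'my'])
b = Counter(['eyes', 'my', 'why'])

# Addition sums the counts key by key.
total = a + b
print(total)

# Subtraction subtracts the counts key by key and drops any key
# whose resulting count is zero or negative.
diff = a - b
print(diff)
```

Here 'my' disappears from the difference because its count drops to zero, and 'why' never appears because its count would be negative.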

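In addition, Counter instances provide a most_common method. Going back to the example at the start of the article, what if we want the ten most frequent words? We only need to change the last lines and call most_common: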

import re, collections

def get_words(file):
    with open(file) as f:
        words_box = []
        for line in f:
            # Keep only lines that start with an ASCII letter or digit.
            if re.match(r'[a-zA-Z0-9]', line):
                words_box.extend(line.strip().split())
    return collections.Counter(words_box)

a = get_words('emma.txt') + get_words('伊索寓言.txt')
print(a.most_common(10))

The output is:

[('the', 2154), ('and', 1394), ('to', 1080), ('of', 871), ('a', 861), ('his', 639), ('The', 637), ('in', 515), ('he', 461), ('with', 310)]

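Of course, the resulting counts still have quite a few problems (for example, 'the' and 'The' are counted as different words). They will be addressed in the next article.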
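As a rough illustration of the kind of cleanup that might help, the sketch below lowercases the text and extracts words with a regular expression instead of splitting on whitespace. This is only one possible approach and is not necessarily what the follow-up article does. The helper name `count_words` and the sample sentence are assumptions made for this example.

```python
import re
from collections import Counter

def count_words(text):
    # Illustrative helper (not from the original article): lowercase the text,
    # then extract runs of ASCII letters and apostrophes as words, which also
    # strips surrounding punctuation such as commas and periods.
    return Counter(re.findall(r"[a-z']+", text.lower()))

sample = "The fox saw the crow. The crow, flattered, dropped the cheese!"
counts = count_words(sample)
print(counts.most_common(2))  # [('the', 4), ('crow', 2)]
```

With this approach, capitalized and lowercase forms share a single count, and "crow." and "crow," are both counted as "crow".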

1 0