Word Frequency Count

Source: Internet. Editor: 程序博客网. Time: 2024/05/16 08:46
#count_hamlet.py
# -*- coding: utf-8 -*-
def get_txt(filename):
    # Read the file, lowercase it, replace punctuation with spaces, and return the text
    with open(filename, "r") as f:
        txt = f.read()
        txt = txt.lower()
        # Raw string so that the backslash is a literal character, not an escape
        for ch in r"!^~@#$%&()-_+-*\=[]{}|;:'<>,./?":
            txt = txt.replace(ch, " ")
        # The body must be indented under "with"; no f.close() is needed
    return txt

# Raw string so the backslashes in the Windows path are not treated as escapes
hamlet_txt = get_txt(r"D:\python_work\mytxt\hamlet.txt")

# Get the word list and the set of words to exclude
words = hamlet_txt.split()
excludes = {'i','you','he','she','we','my','your','his','her','our',
    'they','their','me','him','them','it','its','this','that',
    'be','been','is','are','was','were','no','not',
    'the','a','an','there','here',
    'in','on','of','for','with','to','as','so',
    'will','can','shall','may','would','could','should','might',
    'must','need','ought','have','had',
    'and','but','or',
    'what','when','who','where','which','hamlet'}

# Build a dict with each word as the key and its count as the value
counts = {}
for word in words:
    if word in excludes:
        continue
    counts[word] = counts.get(word, 0) + 1

# Convert the dict to a list whose elements are (word, count) tuples
items = list(counts.items())
# Sort by count from high to low, using an anonymous function as the key
items.sort(key=lambda x: x[1], reverse=True)
# Print the 30 most frequent words (fewer if the text has fewer distinct words)
for word, count in items[:30]:
    print("{0:<10s}{1:>4d}".format(word, count))
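As a point of comparison, the counting and sorting steps can also be handed to `collections.Counter` from the standard library. This is only a sketch of an alternative, not the approach used in the script above; the helper name `top_words` and the sample sentence are made up for illustration, so it runs without `hamlet.txt`.

```python
from collections import Counter


def top_words(text, excludes, n):
    """Return the n most frequent words in text, skipping words in excludes.

    Hypothetical helper for illustration; it assumes punctuation has
    already been removed from text.
    """
    words = (w for w in text.lower().split() if w not in excludes)
    # Counter tallies each word; most_common(n) returns (word, count)
    # pairs sorted from the highest count down
    return Counter(words).most_common(n)


sample = "to be or not to be that is the question"
result = top_words(sample, {"to", "the", "is"}, 2)
for word, count in result:
    print("{0:<10s}{1:>4d}".format(word, count))
```

The manual `counts.get(word, 0) + 1` loop in the script does the same tallying; `Counter` mainly saves the explicit conversion to a list and the `sort()` call.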

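Points to note here:
1. Creating a set: write the elements inside braces, e.g. {'a', 'b', 'c'}. Use set() to create an empty set, because {} creates an empty dict. set(list) converts a list to a set and drops duplicates.
2. The list() function: converts an iterable, such as counts.items(), into a list so that it can be sorted in place.
3. The list sort() method: pass reverse=True to sort in descending order.
4. lambda expressions: lambda is a keyword that defines an anonymous function, not a function itself; here it picks the count (x[1]) as the sort key.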
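The four points above can be tried out on a small in-memory example. This is a minimal sketch; the words and counts in it are made up for illustration and are not taken from the Hamlet text.

```python
# Note 1: a set literal, an empty set, and a set built from a list
excludes = {"the", "a", "an"}
empty = set()                       # {} would create an empty dict instead
from_list = set(["a", "b", "a"])    # the duplicate "a" is dropped

# Note 2: list() turns the dict's items view into a sortable list of tuples
counts = {"ghost": 3, "king": 5, "queen": 4}
items = list(counts.items())

# Notes 3 and 4: sort in descending order, using a lambda that selects
# the count (the second element of each tuple) as the sort key
items.sort(key=lambda pair: pair[1], reverse=True)

print(from_list == {"a", "b"})  # True
print(items)                    # [('king', 5), ('queen', 4), ('ghost', 3)]
```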