Computing with Language: Simple Statistics

Source: Internet. Editor: 程序博客网. Time: 2024/05/15 23:48

Frequency Distributions

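# text1 comes from the NLTK book data (from nltk.book import *)
# Build a frequency distribution over text1
fdist1 = FreqDist(text1)
# Display it
fdist1
# The 50 most frequent tokens
fdist1.most_common(50)
# Number of times 'whale' occurs
fdist1['whale']
# Cumulative frequency plot of the 50 most frequent tokens
fdist1.plot(50, cumulative=True)
# Hapaxes: words that occur only once
fdist1.hapaxes()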
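To see what these calls compute without loading the corpus, here is a minimal sketch on a made-up token list. It relies on the fact that in NLTK 3 FreqDist is built on Python's collections.Counter; the token list and variable names are assumptions for illustration only.

```python
from collections import Counter

# Toy token list, made up for illustration (not taken from text1)
tokens = ["the", "whale", "and", "the", "sea", "and", "the", "ship"]

counts = Counter(tokens)

# Analogue of fdist1.most_common(2): the two most frequent tokens with counts
top_two = counts.most_common(2)

# Analogue of fdist1['whale']: how many times 'whale' occurs
whale_count = counts["whale"]

# Analogue of fdist1.hapaxes(): tokens that occur exactly once
hapaxes = [w for w, c in counts.items() if c == 1]

print(top_two)      # [('the', 3), ('and', 2)]
print(whale_count)  # 1
print(hapaxes)      # ['whale', 'sea', 'ship']
```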

# Define V, the vocabulary of text1; V is a set, not a list
V = set(text1)
# Words in V that are longer than 15 characters
long_words = [w for w in V if len(w) > 15]
# Sort them
sorted(long_words)

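Python here reads much like mathematical notation: the list comprehension mirrors the set-builder form {w | w ∈ V and len(w) > 15}. Compared with the Java I am currently using, it is closer to the language of mathematics.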
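A small sketch of that comparison, using a made-up vocabulary: the comprehension states the condition directly, while the explicit loop is closer to how the same filter is often written in Java. The data and variable names here are assumptions for illustration.

```python
# Toy vocabulary, made up for illustration
V = {"whale", "uncomfortableness", "sea", "characteristically", "ship"}

# Mathematical form: {w | w in V and len(w) > 15}
long_words = sorted(w for w in V if len(w) > 15)

# The same filter as an explicit loop
long_words_loop = []
for w in V:
    if len(w) > 15:
        long_words_loop.append(w)
long_words_loop.sort()

print(long_words)  # ['characteristically', 'uncomfortableness']
```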

# Words longer than 7 characters that occur more than 7 times
# (frequent words that reflect the content of the text)
fdist5 = FreqDist(text5)
sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)

Collocations and Bigrams

Bigrams (pairs of adjacent words)

bigrams(['more','is','said','than','done'])

Running the code above directly raises an error:

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    bigrams(['more','is','said','than','done'])
NameError: name 'nltk' is not defined

NLTK has to be imported first:

from nltk import *

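After the import the call runs, but it does not display the pairs: in NLTK 3, bigrams() returns a generator rather than a list. Wrap it in list() to see the result: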

list(bigrams(['more','is','said','than','done']))
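The same behavior can be sketched in plain Python. The helper name simple_bigrams is hypothetical and is not NLTK's implementation; it only illustrates why a lazy result has to be passed to list() before the pairs become visible.

```python
def simple_bigrams(tokens):
    """Return a lazy iterator over each pair of adjacent items."""
    return zip(tokens, tokens[1:])

pairs = simple_bigrams(['more', 'is', 'said', 'than', 'done'])

# pairs is a lazy iterator; list() forces it to produce the tuples
result = list(pairs)
print(result)
# [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
```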

The collocations() method finds the collocations in a text, that is, bigrams that occur together unusually often:

text4.collocations()
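To make "unusually often" concrete, here is a rough sketch that ranks adjacent pairs by pointwise mutual information (PMI), which measures how much more often a pair occurs than chance would predict. This is a swapped-in illustration: NLTK's own scoring and filtering inside collocations() differ, and the function name, the min_count parameter, and the toy tokens are all assumptions.

```python
import math
from collections import Counter

def rank_pairs_by_pmi(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = len(tokens) - 1
    scored = []
    for (a, b), count in pair_counts.items():
        if count < min_count:
            continue  # rare pairs give unreliable scores
        p_pair = count / n_pairs
        p_a = word_counts[a] / n_words
        p_b = word_counts[b] / n_words
        # PMI = log2( P(a, b) / (P(a) * P(b)) )
        scored.append(((a, b), math.log2(p_pair / (p_a * p_b))))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy tokens, made up for illustration
tokens = ["new", "york", "is", "big", "and", "new", "york",
          "is", "old", "and", "the", "sea", "is", "big"]
ranked = rank_pairs_by_pmi(tokens)
print(ranked[0][0])  # ('new', 'york')
```

The pair ('new', 'york') ranks first because both words are rare on their own yet always appear together, whereas 'is' also occurs next to other words.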

Counting Other Things

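# Frequency distribution of word lengths
fdist = FreqDist([len(w) for w in text1])
# The distinct word lengths
fdist.keys()
# The (length, count) pairs in the distribution
fdist.items()
# The most frequent word length
fdist.max()
# Number of words of length 3
fdist[3]
# Proportion of all words that have length 3
fdist.freq(3)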
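The relationship between a count and a frequency can be checked by hand on a short, made-up token list. This sketch again uses collections.Counter as a stand-in for FreqDist; the variable names are assumptions for illustration.

```python
from collections import Counter

# Toy token list, made up for illustration
tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago"]

length_counts = Counter(len(w) for w in tokens)

# Analogue of fdist.N(): total number of samples counted
total = sum(length_counts.values())

# Analogue of fdist.max(): the most frequent word length
most_common_length = length_counts.most_common(1)[0][0]

# Analogue of fdist[4] and fdist.freq(4): count, and count divided by total
count_of_4 = length_counts[4]
freq_of_4 = count_of_4 / total

print(total)                # 7
print(most_common_length)   # 4
print(count_of_4)           # 2
print(round(freq_of_4, 3))  # 0.286
```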

Functions defined for NLTK's frequency distribution class

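Example                        Description
fdist = FreqDist(samples)      Create a frequency distribution containing the given samples
fdist.inc(sample)              Increment the count for this sample (NLTK 2 only; in NLTK 3 write fdist[sample] += 1)
fdist['monstrous']             Number of times the given sample occurred
fdist.freq('monstrous')        Frequency of the given sample
fdist.N()                      Total number of samples
fdist.keys()                   The samples, sorted in order of decreasing frequency (NLTK 2 behavior; in NLTK 3 use fdist.most_common())
for sample in fdist:           Iterate over the samples in order of decreasing frequency (NLTK 2 behavior)
fdist.max()                    Sample with the greatest count
fdist.tabulate()               Tabulate the frequency distribution
fdist.plot()                   Plot the frequency distribution
fdist.plot(cumulative=True)    Plot the cumulative frequency distribution
fdist1 < fdist2                Test whether samples in fdist1 occur less frequently than in fdist2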
