Python Natural Language Processing Study Notes 3: Word Frequency Statistics

Source: Internet | Published by: 公众号大数据 | Editor: 程序博客网 | Time: 2024/05/21 10:13

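Frequency distributions
Counting how often each token occurs in a text

This is how the book Natural Language Processing with Python does it:

FreqDist()  # word frequencies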

Methods

>>> fdist1 = FreqDist(text1)
>>> fdist1
FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024, 'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})
>>> vocabulary1 = fdist1.keys()
>>> vocabulary1[:50]

However, this raises an error. (In Python 3, keys() returns a view object, which does not support slicing.)

After checking the documentation, I corrected it to:

>>> vocabulary1 = list(fdist1.keys())
>>> vocabulary1[:100]
['head', 'supposition', 'Rig', 'commence', 'inspection', 'swim', 'mansion', 'strained', 'bowsman', 'strangers', 'investigators', 'OCTAVO', 'bare', 'observest', 'adorned', 'maintains', 'Gone', 'monstrous', 'unread', 'bedsteads', 'wriggles', 'rears', 'compacted', 'thump', 'LASHINGS', 'Prodigies', 'useful', 'dubiously', 'ticklish', 'flour', 'yes', 'mackerel', 'rate', 'knit', 'occasions', 'imperative', 'abating', 'neutral', 'reading', 'stalk', 'prosecution', 'complimentary', 'hearse', 'Canada', 'unobstructed', 'Capting', 'impatience', 'layers', 'CHORUS', 'Scripture', 'caudam', 'ineffably', 'RESPECTABLE', 'naturae', 'clue', 'NANTUCKET', 'pike', 'steps', 'without', 'students', 'tore', 'hides', 'slave', 'oaths', 'incognita', 'darts', 'unmistakable', '"', 'stronger', 'Imprimis', 'aromatic', 'mists', 'piers', 'everlasting', 'Sway', 'temporarily', 'shirts', 'chivalric', 'unwillingness', 'Coffins', 'merchants', 'mallet', 'rounding', 'soliloquizer', 'suicide', 'smack', 'ruling', 'inexpressible', 'Fates', 'etherial', 'giant', 'obstructed', 'wharf', 'fuel', 'grounded', 'graceful', 'Lowering', 'correspondence', 'resent', 'pagans']
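Two things are worth noting about the session above: a keys() view cannot be sliced in Python 3, and the list built from it is not ordered by frequency. The sketch below shows both using only the standard library. It assumes a made-up token list in place of text1, and uses collections.Counter as a stand-in, since NLTK 3's FreqDist is a subclass of Counter and should behave the same way for these calls.

```python
from collections import Counter

# Hypothetical toy tokens standing in for text1, so the sketch runs without NLTK data.
tokens = "the whale , the sea , and the ship".split()
counts = Counter(tokens)  # stand-in for FreqDist(tokens)

# In Python 3, keys() returns a view object that cannot be sliced.
try:
    counts.keys()[:3]
    sliced_ok = True
except TypeError:
    sliced_ok = False

# Materialize the view as a list first, then slice it.
vocabulary = list(counts.keys())
first_three = vocabulary[:3]

# To get the vocabulary ordered by frequency, use most_common() instead.
top_two = counts.most_common(2)

print(sliced_ok, first_three, top_two)
```

If the goal is the most frequent words, as in the book, most_common(50) on the FreqDist is probably closer to what is wanted than slicing the key list.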

Counting the occurrences of a specific word

>>> fdist1['giant']
2
>>> fdist1['reading']
8
>>> fdist1['whale']
906
>>> fdist1['head']
335
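Indexing returns a raw count. To turn that into a relative frequency, divide by the total number of tokens. A minimal sketch, again assuming a toy token list and Counter as a stand-in for FreqDist (FreqDist itself also offers N() and freq() for the same purpose):

```python
from collections import Counter

# Hypothetical toy tokens; with NLTK this would be FreqDist(text1).
tokens = "the whale saw the ship and the whale dived".split()
counts = Counter(tokens)

whale_count = counts["whale"]    # raw count of one word
missing_count = counts["giant"]  # an unseen word gives 0 instead of a KeyError
total = sum(counts.values())     # total number of tokens
whale_share = whale_count / total

print(whale_count, missing_count, total, round(whale_share, 3))
```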

Cumulative frequency plot of the vocabulary

>>> fdist1.plot(50,cumulative=True)
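With cumulative=True, the plot shows running totals over the most frequent tokens rather than individual counts. The numbers behind such a curve can be sketched without any plotting library. The token list here is made up, and Counter stands in for FreqDist:

```python
from collections import Counter
from itertools import accumulate

# Hypothetical toy tokens with distinct counts: a=4, b=3, c=2, d=1.
tokens = "a a a a b b b c c d".split()
counts = Counter(tokens)

top = counts.most_common(3)                        # the 3 most frequent tokens
cumulative = list(accumulate(c for _, c in top))   # running total of their counts
coverage = cumulative[-1] / sum(counts.values())   # share of the text they cover

print(top, cumulative, coverage)
```

The last value of the running total, divided by the text length, tells you what fraction of the text the top words account for.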


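Fine-grained selection of words (that is, a list of words filtered by a condition)

Set notation

Math: {w | w ∈ V ∧ p(w)}
Python: [w for w in V if p(w)]

Note that the Python version produces a list rather than a set, so its elements are not guaranteed to be unique.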
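The difference between a list and a set only shows up when the input itself contains repeats. A small sketch, where the token list and the property is_long are both made up for illustration:

```python
# Hypothetical tokens with repeats, and a hypothetical property p(w).
tokens = ["whale", "sea", "whale", "ship", "sea"]

def is_long(w):
    """Stand-in for p(w): true for words longer than 3 characters."""
    return len(w) > 3

# List comprehension over the raw tokens keeps duplicates.
from_list = [w for w in tokens if is_long(w)]

# A set comprehension (sorted for a stable order) keeps each word once.
from_set = sorted({w for w in tokens if is_long(w)})

print(from_list, from_set)
```

This is why the vocabulary is built with set() first: iterating over a set guarantees each word is tested only once.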

>>> V = set(text1)  # build the vocabulary
>>> long_words = [w for w in V if len(w) > 15]
>>> sorted(long_words)  # sort
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', 'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations', 'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness', 'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities', 'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness', 'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
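Length alone tends to pick out rare oddities. Conditions can be combined, for example requiring a word to be both long and frequent. A minimal sketch under the same assumptions as before (toy tokens, Counter standing in for FreqDist, thresholds chosen arbitrarily):

```python
from collections import Counter

# Hypothetical toy tokens; with NLTK this would be text1 and FreqDist(text1).
tokens = ("circumnavigation whale circumnavigation "
          "uncomfortableness whale whale").split()
counts = Counter(tokens)

# Keep words longer than 15 characters that also occur more than once.
frequent_long = sorted(
    w for w in set(tokens) if len(w) > 15 and counts[w] > 1
)

print(frequent_long)
```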

Collocations and bigrams

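The book defines a collocation as "a sequence of words that occur together unusually often". For example, red wine is a collocation, whereas the wine is not. In addition, the words in a collocation cannot be swapped for words of similar meaning: for example, grey wine sounds odd.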

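# bigrams() returns the bigrams, but in newer versions it can no longer be used on its own like this
# list(bigrams(...)) is how to get the bigrams of a given word sequence (new-version usage!)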

>>> bigrams(['more','is','said'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'bigrams' is not defined
>>> from nltk import *
>>> bigrams(['more','is','said'])
<generator object bigrams at 0x105302fc0>
>>> list(bigrams(['more','is','said']))
[('more', 'is'), ('is', 'said')]
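A bigram is just a pair of adjacent tokens, so the result above can be reproduced with plain Python. The helper name simple_bigrams is made up for this sketch. It is not NLTK's implementation, only an illustration of what the generator yields:

```python
def simple_bigrams(words):
    """Yield each pair of adjacent items, lazily, like a generator."""
    return zip(words, words[1:])

pairs = simple_bigrams(["more", "is", "said"])  # lazy: nothing is listed yet
pair_list = list(pairs)                          # wrap in list() to see the pairs

print(pair_list)
```

Like the NLTK generator, the zip object has to be wrapped in list() before it shows its contents. With NLTK installed, importing only the name that is needed (from nltk import bigrams) avoids the wildcard import.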

>>> text1.collocations()  # list the bigrams that occur unusually often
Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand
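The counting step behind a list like this can be sketched by tallying adjacent pairs. This is a simplification: it ranks bigrams by raw frequency on a made-up token list, whereas NLTK's collocations() also scores and filters the candidates, so the two should not be expected to give the same ranking on real text.

```python
from collections import Counter

# Hypothetical toy tokens standing in for text1.
tokens = "sperm whale sperm whale sperm whale white whale".split()

# Count every pair of adjacent tokens.
bigram_counts = Counter(zip(tokens, tokens[1:]))

# The most frequent pair, by raw count only (no statistical scoring).
top_pair, top_count = bigram_counts.most_common(1)[0]

print(" ".join(top_pair), top_count)
```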
