NLTK Study Notes

Source: Internet | Published by: 福禄克网络中国 | Editor: 程序博客网 | Time: 2024/05/22 10:56

Reference book: http://nltk.googlecode.com/svn/trunk/doc/book/

1. Downloading data through a proxy

import nltk

nltk.set_proxy("**.com:80") # send the downloader's requests through this proxy

nltk.download()

2. Calling sents(fileid) fails with: Resource 'tokenizers/punkt/english.pickle' not found. Please use the NLTK Downloader to obtain the resource:

import nltk

nltk.download()

In the downloader window, select the 'Models' tab, find 'punkt' in the 'Identifier' column, and click download to install that data package.
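On a machine without a GUI, the same fix can be scripted. The following is a minimal sketch, not the only way to do it: the helper name ensure_nltk_resource is made up for these notes, while nltk.data.find() and nltk.download() are the real NLTK calls. It assumes the resource path and package identifier are the ones named in the error message; newer NLTK releases may ask for a differently named package, in which case use whatever identifier the message gives.

```python
def ensure_nltk_resource(resource_path, package_id):
    """Download package_id only when resource_path is not installed yet.

    ensure_nltk_resource is a hypothetical helper name, not part of NLTK.
    """
    # Imported inside the function so that defining the helper
    # does not itself require NLTK to be importable.
    import nltk

    try:
        # nltk.data.find raises LookupError when the resource is missing.
        nltk.data.find(resource_path)
    except LookupError:
        # Non-interactive download of a single package by its identifier.
        nltk.download(package_id)
```

For the error above, the call would be ensure_nltk_resource('tokenizers/punkt/english.pickle', 'punkt'), placed before the first use of sents().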

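3. Functions for accessing corpus elements

from nltk.corpus import webtext

webtext.fileids() # ids of all files in the corpus

webtext.raw(fileid) # raw text (all characters) of the given file; fileid is one of the ids returned by fileids()

webtext.words(fileid) # all words of the given file

webtext.sents(fileid) # all sentences of the given file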

Example                        Description
fileids()                      the files of the corpus
fileids([categories])          the files of the corpus corresponding to these categories
categories()                   the categories of the corpus
categories([fileids])          the categories of the corpus corresponding to these files
raw()                          the raw content of the corpus
raw(fileids=[f1,f2,f3])        the raw content of the specified files
raw(categories=[c1,c2])        the raw content of the specified categories
words()                        the words of the whole corpus
words(fileids=[f1,f2,f3])      the words of the specified fileids
words(categories=[c1,c2])      the words of the specified categories
sents()                        the sentences of the whole corpus
sents(fileids=[f1,f2,f3])      the sentences of the specified fileids
sents(categories=[c1,c2])      the sentences of the specified categories
abspath(fileid)                the location of the given file on disk
encoding(fileid)               the encoding of the file (if known)
open(fileid)                   open a stream for reading the given corpus file
root()                         the path to the root of locally installed corpus
readme()                       the contents of the README file of the corpus
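As a small illustration of how these accessors combine, here is a sketch that walks every file of a corpus and records its size. The function name corpus_summary is made up for these notes; it only assumes the object passed in offers fileids(), raw(fileid) and words(fileid) as listed in the table.

```python
def corpus_summary(corpus):
    """Return {fileid: (number of characters, number of words)}.

    corpus_summary is a hypothetical helper; `corpus` can be any object
    exposing fileids(), raw(fileid) and words(fileid).
    """
    summary = {}
    for fileid in corpus.fileids():
        n_chars = len(corpus.raw(fileid))    # raw() gives the file as one string
        n_words = len(corpus.words(fileid))  # words() gives a sequence of tokens
        summary[fileid] = (n_chars, n_words)
    return summary
```

With the corpus data installed, corpus_summary(webtext) should give one entry per id returned by webtext.fileids().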

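4. Some common text-processing functions

Assume text is a list of words. The methods called on text itself from collocations() onward need an nltk.Text object, which wraps such a list: text = nltk.Text(words).

from nltk import FreqDist, bigrams

len(text) # number of words

set(text) # remove duplicates

sorted(text) # sort

text.count('a') # number of occurrences of the given word

text.index('a') # position of the first occurrence of the given word

FreqDist(text) # words and their frequencies; keys() gives the words, FreqDist(text)[key] gives the count

FreqDist(text).plot(50,cumulative=True) # cumulative frequency plot of the 50 most frequent words

bigrams(text) # all adjacent word pairs

text.collocations() # word pairs that frequently occur next to each other in the text

text.concordance("word") # occurrences of the given word with their surrounding context

text.similar("word") # words that appear in contexts similar to the given word

text.common_contexts(["a", "b"]) # contexts shared by the two words

text.dispersion_plot(['a','b','c',...]) # plot comparing where the words occur across the text

text.generate() # generate a passage of random text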
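To see what the list-level calls above actually compute, here is a sketch that uses only the Python standard library on a made-up eight-word sample. It is not NLTK's implementation: Counter stands in for FreqDist, and zip stands in for bigrams().

```python
from collections import Counter

# Hypothetical sample; in practice this would be corpus.words(fileid).
text = "the cat sat on the mat the end".split()

n_words = len(text)               # total number of tokens
vocab = sorted(set(text))         # distinct words in alphabetical order
the_count = text.count("the")     # how often "the" occurs
first_mat = text.index("mat")     # position of the first "mat"

freq = Counter(text)              # word -> count, the role FreqDist plays
top_word = freq.most_common(1)    # the single most frequent word with its count

pairs = list(zip(text, text[1:])) # adjacent word pairs, the role bigrams() plays
```

Here freq["the"] is 3 and pairs starts with ("the", "cat"). The Text methods (concordance, similar and so on) have no such one-line equivalent and do need NLTK.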

NLTK's Conditional Frequency Distributions: commonly used methods and idioms for defining, accessing, and visualizing a conditional frequency distribution of counters.

Example                                  Description
cfdist = ConditionalFreqDist(pairs)      create a conditional frequency distribution from a list of pairs
cfdist.conditions()                      alphabetically sorted list of conditions
cfdist[condition]                        the frequency distribution for this condition
cfdist[condition][sample]                frequency for the given sample for this condition
cfdist.tabulate()                        tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)     tabulation limited to the specified samples and conditions
cfdist.plot()                            graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions)         graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2                        test if samples in cfdist1 occur less frequently than in cfdist2
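Conceptually, a conditional frequency distribution is one counter per condition. The sketch below shows that idea with the standard library only, on made-up (condition, sample) pairs; it illustrates the data structure and is not a substitute for ConditionalFreqDist, which adds tabulate() and plot().

```python
from collections import Counter, defaultdict

# Hypothetical (condition, sample) pairs, e.g. (genre, word).
pairs = [
    ("news", "the"),
    ("news", "said"),
    ("romance", "the"),
    ("romance", "love"),
    ("news", "the"),
]

# One Counter per condition, created on first access.
cfd = defaultdict(Counter)
for condition, sample in pairs:
    cfd[condition][sample] += 1

conditions = sorted(cfd)        # the conditions, sorted alphabetically
news_the = cfd["news"]["the"]   # count of sample "the" under condition "news"
```

With NLTK installed, ConditionalFreqDist(pairs) built from the same pairs is indexed the same way: cfdist["news"]["the"].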
to be continued
