The NLP Road: Viewing and Accessing Text Corpora

Source: Internet · Editor: 程序博客网 · Time: 2024/05/22 04:39

Continuing to learn NLP in Python.

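# coding=UTF-8
# The line above fixes encoding errors caused by Chinese comments
import nltk

# List the file IDs of the text corpora that are available
print(nltk.corpus.gutenberg.fileids())

# Give the book a short name, emma
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
print(len(emma))  # 192427

# Use concordance from the previous chapter here as well
from nltk.corpus import gutenberg
emma = nltk.Text(gutenberg.words('austen-emma.txt'))
# Without the import, the statement has to be written out in full:
# emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
emma.concordance("surprize")

# Three statistics for each text: average word length, average sentence length,
# and the average number of times each word appears in the text
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    # print() with explicit int() truncation, so this also runs on Python 3
    print(int(num_chars / num_words), int(num_words / num_sents), int(num_words / num_vocab), fileid)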
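To see what the three statistics measure without downloading any corpus, here is a minimal sketch that applies the same arithmetic to a tiny hand-made sample. The helper name `corpus_stats` and the toy sentences are assumptions made up for illustration; they are not part of NLTK. The `raw`, `words`, and `sents` arguments stand in for what `gutenberg.raw()`, `gutenberg.words()`, and `gutenberg.sents()` return.

```python
def corpus_stats(raw, words, sents):
    """Return (average word length, average sentence length,
    average number of uses per distinct lowercased word)."""
    num_chars = len(raw)                         # counts spaces and punctuation too
    num_words = len(words)                       # tokens, punctuation included
    num_sents = len(sents)
    num_vocab = len({w.lower() for w in words})  # distinct case-folded tokens
    return (
        num_chars / num_words,
        num_words / num_sents,
        num_words / num_vocab,
    )


# Hypothetical toy data; a real run would pass the gutenberg readers' output.
sents = [["The", "cat", "sat", "."], ["The", "dog", "sat", "too", "."]]
words = [w for s in sents for w in s]
raw = "The cat sat. The dog sat too."

avg_word_len, avg_sent_len, avg_uses = corpus_stats(raw, words, sents)
print(round(avg_word_len, 2), round(avg_sent_len, 2), round(avg_uses, 2), "toy-sample")
```

Note that `num_chars` is taken from the raw text, so it includes whitespace. The first figure therefore slightly overstates the true average word length, which is worth keeping in mind when reading the per-file numbers printed by the loop above.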

0 0