NLTK's built-in stemmers

Source: Internet. Editor: 程序博客网. Time: 2024/06/08 03:36

The code is taken from "Natural Language Processing with Python" (《Python自然语言处理》), p. 116.

(python2.7) appleyuchi@ubuntu:~/.virtualenvs/python2.7/bin$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> raw="""DENNIS:Listen,strange women lying in ponds distributing swords is... is no basis for a system of goverment. Supreme executive power derives from... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> import nltk
>>> tokens=nltk.word_tokenize(raw)
>>> porter = nltk.PorterStemmer()
>>> lancaster=nltk.LancasterStemmer()
>>> [porter.stem(t) for t in tokens]
[u'denni', ':', 'listen', ',', u'strang', 'women', u'lie', 'in', u'pond', u'distribut', u'sword', 'is', '...', 'is', 'no', u'basi', 'for', 'a', 'system', 'of', u'gover', '.', u'suprem', u'execut', 'power', u'deriv', 'from', '...', 'a', u'mandat', 'from', 'the', u'mass', ',', 'not', 'from', 'some', u'farcic', u'aquat', u'ceremoni', '.']
>>> [lancaster.stem(t) for t in tokens]
['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut', 'sword', 'is', '...', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'gov', '.', 'suprem', 'execut', 'pow', 'der', 'from', '...', 'a', 'mand', 'from', 'the', 'mass', ',', 'not', 'from', 'som', 'farc', 'aqu', 'ceremony', '.']

In the code above, raw is the original corpus text, and the two lists at the end are the stemming results.

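The code uses two stemmers in total: Porter and Lancaster. Comparing the two outputs, Lancaster generally truncates more aggressively (for example 'wom' for "women" and 'pow' for "power", both of which Porter leaves unchanged), while Porter reduces "lying" to 'lie' and Lancaster keeps it as 'lying'.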
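To make the difference between the two stemmers easier to see, the comparison can be sketched as a small script that prints each token next to both stems. This is only an illustrative sketch: the helper names `simple_tokenize` and `compare_stemmers` are hypothetical, and the regex tokenizer is a stand-in for `nltk.word_tokenize` chosen so that no tokenizer data has to be downloaded. It assumes NLTK is installed.

```python
import re

from nltk.stem import LancasterStemmer, PorterStemmer


def simple_tokenize(text):
    # Hypothetical stand-in for nltk.word_tokenize: runs of word characters,
    # or runs of punctuation, each become one token.
    return re.findall(r"\w+|[^\w\s]+", text)


def compare_stemmers(tokens):
    # Return (token, porter_stem, lancaster_stem) for every token.
    porter = PorterStemmer()
    lancaster = LancasterStemmer()
    return [(t, porter.stem(t), lancaster.stem(t)) for t in tokens]


sample = "strange women lying in ponds distributing swords"
rows = compare_stemmers(simple_tokenize(sample))

# Print the three columns side by side for visual comparison.
for token, porter_stem, lancaster_stem in rows:
    print("%-14s %-12s %s" % (token, porter_stem, lancaster_stem))
```

Rows where the last two columns differ are the cases where the choice of stemmer matters, for instance when the stems are used as index keys for text search.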
