Python NLTK Study Notes 1


Let's start with a few simple experiments with NLTK.
Define a string:

>>> text = 'why can\'t I print anything? I am confused. :-('
>>> text
"why can't I print anything? I am confused. :-("

Note that the apostrophe inside a single-quoted string has to be escaped with a backslash ("\").
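
As a small aside on the escaping rule, the sketch below shows three ways to write the same literal. The variable names are made up for illustration. Only the single-quoted form needs the backslash.

```python
# Three spellings of the same string literal.
escaped = 'why can\'t I print anything?'           # backslash-escaped apostrophe
double_quoted = "why can't I print anything?"      # no escape needed inside double quotes
triple_quoted = '''why can't I print anything?'''  # no escape needed inside triple quotes

# All three produce an identical str object value.
print(escaped == double_quoted == triple_quoted)

# repr() picks double quotes because the text contains an apostrophe,
# which is why the interpreter echoes the string that way.
print(repr(escaped))
```
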
Split a passage into a list of sentences:

>>> import nltk
>>> sens = nltk.sent_tokenize(text)
>>> sens
["why can't I print anything?", 'I am confused.', ':-(']

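Tokenize each sentence into words. It splits "can't" into two tokens, "ca" and "n't" (the Penn Treebank convention for contractions), which is a little annoying...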

>>> words = []
>>> for sent in sens:
...     words.append(nltk.word_tokenize(sent))
...
>>> words
[['why', 'ca', "n't", 'I', 'print', 'anything', '?'], ['I', 'am', 'confused', '.'], [':', '-', '(']]
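
If the split contraction is unwanted for display purposes, the pieces can be glued back together afterwards. The helper below is a hypothetical sketch, not part of NLTK: it assumes a clitic is either "n't" or a token that starts with an apostrophe followed only by letters. It would also wrongly merge an opening single quote that is attached to a word. The tagger is normally fed the split tokens, so this is best applied only after tagging.

```python
def merge_contractions(tokens):
    """Glue Treebank-style clitics ("n't", "'s", "'re", ...) back onto the previous token.

    Hypothetical helper for illustration; not an NLTK function.
    """
    merged = []
    for tok in tokens:
        # Assumption: a clitic is "n't" or an apostrophe followed only by letters.
        is_clitic = tok == "n't" or (tok.startswith("'") and tok[1:].isalpha())
        if is_clitic and merged:
            merged[-1] += tok      # attach to the preceding token
        else:
            merged.append(tok)
    return merged


# Token lists copied from the session above.
words = [['why', 'ca', "n't", 'I', 'print', 'anything', '?'],
         ['I', 'am', 'confused', '.'],
         [':', '-', '(']]

print([merge_contractions(sent) for sent in words])
```
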

Part-of-speech (POS) tagging:

>>> tags = [nltk.pos_tag(sent) for sent in words]
>>> tags
[[('why', 'WRB'), ('ca', 'MD'), ("n't", 'RB'), ('I', 'PRP'), ('print', 'VB'), ('anything', 'NN'), ('?', '.')], [('I', 'PRP'), ('am', 'VBP'), ('confused', 'VBN'), ('.', '.')], [(':', ':'), ('-', ':'), ('(', '(')]]

Even "ca" and "n't" get POS tags of their own (MD and RB). Impressive.
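
Because the tagged result is just a nested list of (word, tag) pairs, it is easy to post-process with plain Python. The sketch below hard-codes the output shown above, so it runs without NLTK. It counts how often each tag occurs and looks up the tags assigned to the two halves of "can't". The names `tag_counts` and `tag_of` are illustrative.

```python
from collections import Counter

# Tagged sentences copied from the session above.
tags = [
    [('why', 'WRB'), ('ca', 'MD'), ("n't", 'RB'), ('I', 'PRP'),
     ('print', 'VB'), ('anything', 'NN'), ('?', '.')],
    [('I', 'PRP'), ('am', 'VBP'), ('confused', 'VBN'), ('.', '.')],
    [(':', ':'), ('-', ':'), ('(', '(')],
]

# Flatten the nested list and count the tags.
tag_counts = Counter(tag for sent in tags for _, tag in sent)
print(tag_counts)

# Word -> tag lookup for the first sentence only.
tag_of = dict(tags[0])
print(tag_of['ca'], tag_of["n't"])
```
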
Extracting named entities:

>>> text = "Xi is the chairman of China in the year 2013."
>>> tokens = nltk.word_tokenize(text)
>>> tags = nltk.pos_tag(tokens)
>>> tags
[('Xi', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('chairman', 'NN'), ('of', 'IN'), ('China', 'NNP'), ('in', 'IN'), ('the', 'DT'), ('year', 'NN'), ('2013', 'CD'), ('.', '.')]
>>> ners = nltk.ne_chunk(tags)
>>> ners.draw()

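Note: ne_chunk only works on a single tagged sentence, that is, a flat list of (word, tag) tuples. It cannot take a nested structure like [ [],[],[] ].
For a list of tagged sentences, use ne_chunk_sents, which here returns a generator.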

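>>> # tags here must be the nested per-sentence list from the POS-tagging step, not the flat list from the "Xi" example
>>> ners = nltk.ne_chunk_sents(tags)
>>> ners
<generator object <genexpr> at 0x000000000C4D15E8>
>>> for chunk in ners:
...     print(chunk)
...
(S why/WRB ca/MD n't/RB I/PRP print/VB anything/NN ?/.)
(S I/PRP am/VBP confused/VBN ./.)
(S :/: -/: (/()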
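
Since ne_chunk expects one tagged sentence and ne_chunk_sents expects a list of them, it is easy to pass the wrong shape. The sketch below is a pair of hypothetical helpers, not NLTK functions, that check the shape in plain Python and normalize either form to a list of sentences.

```python
def is_tagged_sentence(obj):
    """True if obj looks like one tagged sentence: a flat list of (word, tag) string pairs."""
    return isinstance(obj, list) and all(
        isinstance(item, tuple)
        and len(item) == 2
        and all(isinstance(part, str) for part in item)
        for item in obj
    )


def as_tagged_sentences(obj):
    """Normalize a tagged sentence, or a list of them, to a list of tagged sentences."""
    if is_tagged_sentence(obj):
        return [obj]                      # wrap the single sentence
    if isinstance(obj, list) and all(is_tagged_sentence(s) for s in obj):
        return obj                        # already a list of sentences
    raise TypeError("expected a tagged sentence or a list of tagged sentences")


one = [('Xi', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('chairman', 'NN')]
many = [one, [('I', 'PRP'), ('am', 'VBP'), ('confused', 'VBN'), ('.', '.')]]

print(len(as_tagged_sentences(one)))   # one sentence, wrapped in a list
print(len(as_tagged_sentences(many)))  # two sentences, returned unchanged
```

Each element of the normalized list can then be handed to nltk.ne_chunk, or the whole list to nltk.ne_chunk_sents.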

Appendix: the POS tags used by NLTK (the Penn Treebank tag set):

  1. CC Coordinating conjunction
  2. CD Cardinal number
  3. DT Determiner (e.g. this, that, these, those, such; indefinite determiners: no, some, any, each, every, enough, either, neither, all, both, half, several, many, much, (a) few, (a) little, other, another)
  4. EX Existential there
  5. FW Foreign word
  6. IN Preposition or subordinating conjunction
  7. JJ Adjective (including ordinal numerals)
  8. JJR Adjective, comparative
  9. JJS Adjective, superlative
  10. LS List item marker
  11. MD Modal (modal auxiliary verb)
  12. NN Noun, singular or mass (common noun)
  13. NNS Noun, plural (common noun)
  14. NNP Proper noun, singular
  15. NNPS Proper noun, plural
  16. PDT Predeterminer
  17. POS Possessive ending
  18. PRP Personal pronoun
  19. PRP$ Possessive pronoun
  20. RB Adverb
  21. RBR Adverb, comparative
  22. RBS Adverb, superlative
  23. RP Particle
  24. SYM Symbol
  25. TO to (as a preposition or infinitive marker)
  26. UH Interjection
  27. VB Verb, base form
  28. VBD Verb, past tense
  29. VBG Verb, gerund or present participle
  30. VBN Verb, past participle
  31. VBP Verb, non-3rd person singular present
  32. VBZ Verb, 3rd person singular present
  33. WDT Wh-determiner (e.g. which, what)
  34. WP Wh-pronoun (e.g. who, whom, what)
  35. WP$ Possessive wh-pronoun (whose)
  36. WRB Wh-adverb (e.g. how, where, when, why)
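
To make the list above usable from code, a small lookup table can translate tags into their names. The sketch below is only an illustration. `TAG_NAMES` is deliberately partial, and `describe` is a hypothetical helper rather than an NLTK function. The tagged sentence is copied from the "Xi" example.

```python
# Partial lookup table: only a handful of the 36 tags, for illustration.
TAG_NAMES = {
    "CD": "Cardinal number",
    "DT": "Determiner",
    "IN": "Preposition or subordinating conjunction",
    "NN": "Noun, singular or mass",
    "NNP": "Proper noun, singular",
    "VBZ": "Verb, 3rd person singular present",
}


def describe(tagged_sentence):
    """Attach a readable name to every (word, tag) pair; unknown tags get a placeholder."""
    return [
        (word, tag, TAG_NAMES.get(tag, "(not in this partial table)"))
        for word, tag in tagged_sentence
    ]


tagged = [('Xi', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('chairman', 'NN'),
          ('of', 'IN'), ('China', 'NNP'), ('in', 'IN'), ('the', 'DT'),
          ('year', 'NN'), ('2013', 'CD'), ('.', '.')]

for word, tag, name in describe(tagged):
    print(f"{word:10}{tag:5}{name}")
```

NLTK also ships its own tag documentation through nltk.help.upenn_tagset(), which should print the official description for a tag once the corresponding data package has been downloaded.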
