NLTK Usage Summary

Source: Internet  Editor: 程序博客网  Date: 2024/06/05 20:30
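  1. nltk.tokenize.punkt
    The PunktSentenceTokenizer class in this module splits text into sentences. Punctuation, such as parentheses, is kept in the output rather than stripped.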
    import nltk.data

    text = '''
    Punkt knows that the periods in Mr. Smith and Johann S. Bach
    do not mark sentence boundaries.  And sometimes sentences
    can start with non-capitalized words.  i is a good variable
    name.
    '''

    # Load the pre-trained English Punkt model (the data must already be
    # installed, e.g. via nltk.download('punkt')).
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

    # Print the detected sentences, separated by a line of dashes.
    print('\n-----\n'.join(sent_detector.tokenize(text.strip())))

    # Output:
    # Punkt knows that the periods in Mr. Smith and Johann S. Bach
    # do not mark sentence boundaries.
    # -----
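
The point about punctuation being kept can be checked with a small sketch. It assumes an untrained PunktSentenceTokenizer() is good enough for illustration: it needs no model download, but it knows no abbreviations, so its sentence boundaries may differ from those of the pre-trained English model. The sample text is made up for this example.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Untrained tokenizer (an assumption made for this sketch): no data download
# is required, but it has not learned any abbreviations or collocations.
tokenizer = PunktSentenceTokenizer()

# Hypothetical sample text containing parentheses and mixed end punctuation.
text = "The parser failed (again). We restarted it! Did it work?"

sentences = tokenizer.tokenize(text)
spans = list(tokenizer.span_tokenize(text))

for (start, end), sentence in zip(spans, sentences):
    # Each sentence is an exact slice of the input, punctuation included.
    print(start, end, repr(sentence))
```

Because every sentence is a slice of the original string, parentheses, periods, and question marks all survive. Any further cleanup, such as stripping punctuation, has to be done as a separate step.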