Porting your code to NLTK 3.0


Original link: https://github.com/nltk/nltk/wiki/Porting-your-code-to-NLTK-3.0

NLTK 3.0 contains a number of interface changes. These are being incorporated into a new version of the NLTK book, updated for Python 3 and NLTK 3.

NLTK's handling of unicode has changed: NLTK 3 requires all text input to be unicode and always returns text as unicode. Previously, some functions and classes worked on unicode while others required encoded bytestrings. Make sure you pass unicode to NLTK and expect unicode output from it; existing code that assumes bytestrings may start to fail.
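
One way to follow this advice is to decode bytes once, at the point where data enters the program, and pass only unicode strings on to NLTK. The sketch below assumes Python 3 and UTF-8 input; the helper name `to_text` is hypothetical and not part of NLTK, and `str.split()` merely stands in for a real tokenizer call.

```python
def to_text(data, encoding="utf-8"):
    """Return `data` as a unicode string, decoding it if it is bytes.

    Illustrative helper only (not an NLTK function); the UTF-8
    default is an assumption about the input.
    """
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


# UTF-8 encoded bytes, e.g. read from a file opened in binary mode
raw = b"na\xc3\xafve caf\xc3\xa9"
text = to_text(raw)      # unicode string, safe to hand to NLTK
tokens = text.split()    # stand-in for an NLTK tokenizer call
```

Decoding at the boundary keeps the rest of the code free of encoding concerns, which matches the "unicode in, unicode out" contract described above.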

Here are some changes you may need to make:

  • grammar: ContextFreeGrammar → CFG, WeightedGrammar → PCFG, StatisticalDependencyGrammar → ProbabilisticDependencyGrammar, WeightedProduction → ProbabilisticProduction
  • draw.tree: TreeSegmentWidget.node() → TreeSegmentWidget.label(), TreeSegmentWidget.set_node() → TreeSegmentWidget.set_label()
  • parsers: nbest_parse() → parse()
  • ccg.parse.chart: EdgeI.next() → EdgeI.nextsym()
  • Chunk parser: top_node → root_label, chunk_node → chunk_label
  • WordNet properties are now access methods, e.g. Synset.definition → Synset.definition()
  • sem.relextract: mk_pairs() → _tree2semi_rel(), mk_reldicts() → semi_rel2reldict(), show_clause() → clause(), show_raw_rtuple() → rtuple()
  • corpusname.tagged_words(simplify_tags=True) → corpusname.tagged_words(tagset='universal')
  • util.clean_html() → BeautifulSoup.get_text(). clean_html() has been dropped; install and use BeautifulSoup or another HTML parser instead.
  • util.ibigrams() → util.bigrams()
  • util.ingrams() → util.ngrams()
  • util.itrigrams() → util.trigrams()
  • metrics.windowdiff → metrics.segmentation.windowdiff(); metrics.windowdiff.demo() was removed.
  • parse.generate2 was re-written and merged into parse.generate
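
For code that has to run against both NLTK 2 and NLTK 3 during a migration, the WordNet change from properties to methods can be absorbed by a small shim. The sketch below uses two stand-in classes rather than real Synset objects so that it runs without WordNet data; `get_attr` and both classes are hypothetical and exist only for illustration.

```python
def get_attr(obj, name):
    """Return obj.<name>, calling it first if it is a method (NLTK 3 style)."""
    value = getattr(obj, name)
    return value() if callable(value) else value


class OldStyleSynset:
    """Stand-in for an NLTK 2 Synset: `definition` is a plain attribute."""
    definition = "a domesticated carnivorous mammal"


class NewStyleSynset:
    """Stand-in for an NLTK 3 Synset: `definition()` is a method."""
    def definition(self):
        return "a domesticated carnivorous mammal"


# Both calls yield the same string regardless of the access style.
old_def = get_attr(OldStyleSynset(), "definition")
new_def = get_attr(NewStyleSynset(), "definition")
```

Once all callers run on NLTK 3 only, the shim can be dropped in favour of calling the method directly.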

Creating objects from strings:

  • Many objects now support a fromstring() method
  • tree.Tree.parse() → tree.Tree.fromstring()
  • tree.Tree() → tree.Tree.fromstring()
  • chunk.RegexpChunkRule.parse() → chunk.RegexpChunkRule.fromstring()
  • grammar.parse_cfg() → CFG.fromstring() (same for other types of grammar)
  • sem.LogicParser.parse() → sem.Expression.fromstring()
  • sem.DrtParser.parse() → sem.DrtExpression.fromstring()
  • sem.parse_valuation() → sem.Valuation.fromstring()
  • sem.parse_type() → sem.Type.fromstring()
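
As a rough illustration of the fromstring() pattern, together with the nbest_parse() → parse() rename listed earlier, the sketch below builds a tree and a toy grammar from strings. The grammar and sentence are made up for the example, and the function name `fromstring_demo` is hypothetical; it assumes NLTK 3 is installed.

```python
def fromstring_demo():
    # Imported inside the function so that this snippet can be loaded
    # even where NLTK is not installed.
    from nltk import CFG, ChartParser, Tree

    # NLTK 2: Tree.parse(...) or Tree(...)   NLTK 3: Tree.fromstring(...)
    tree = Tree.fromstring("(S (NP I) (VP run))")

    # NLTK 2: grammar.parse_cfg(...)         NLTK 3: CFG.fromstring(...)
    grammar = CFG.fromstring("""
        S -> NP VP
        NP -> 'I'
        VP -> 'run'
    """)

    # NLTK 2: parser.nbest_parse(tokens)     NLTK 3: parser.parse(tokens),
    # which returns an iterator over trees, so it is wrapped in list().
    parses = list(ChartParser(grammar).parse(["I", "run"]))
    return tree, grammar, parses
```

Calling `fromstring_demo()` returns the hand-written tree, the grammar object, and the list of parses found for the toy sentence.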

Operations on lists of sentences or other items:

  • tokenize.batch_tokenize() → tokenize.tokenize_sents()
  • tag.batch_tag() → tag.tag_sents()
  • parse.batch_parse() → parse.parse_sents()
  • classify.batch_classify() → classify.classify_many()
  • sem.batch_interpret() → sem.interpret_sents()
  • sem.batch_evaluate() → sem.evaluate_sents()
  • chunk.batch_ne_chunk() → chunk.ne_chunk_sents()
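
The renamed methods are called the same way as their batch_* predecessors: they take a list of items and return one result per item. A minimal sketch using WhitespaceTokenizer, chosen here only because it needs no downloaded data; the function name `sents_demo` and the sample sentences are made up for the example.

```python
def sents_demo():
    # Imported inside the function so that this snippet can be loaded
    # even where NLTK is not installed.
    from nltk.tokenize import WhitespaceTokenizer

    tokenizer = WhitespaceTokenizer()
    sentences = ["NLTK 3 is here", "Port your code"]

    # NLTK 2: tokenizer.batch_tokenize(sentences)
    # NLTK 3: tokenizer.tokenize_sents(sentences)
    return list(tokenizer.tokenize_sents(sentences))
```

Each element of the result is the token list for the sentence at the same position in the input.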

Changes in probability.FreqDist:

  • fdist.keys() → sorted(fdist)
  • fdist.inc(x) → fdist[x] += 1
  • fdist.samples() → sorted(fdist.keys())
  • fdist.Nr(r) → fdist.Nr()[r]
  • fdist.Nr_nonzero() → fdist.Nr().items()
  • cfdist.conditions() → sorted(cfdist.conditions())
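
A short sketch of the first two replacements above. In NLTK 3, FreqDist is a subclass of collections.Counter, so the example falls back to Counter when NLTK is not installed; the sample sentence is made up. Note that sorted(fdist) orders samples alphabetically; for frequency order, Counter's most_common() is available.

```python
try:
    from nltk import FreqDist
except ImportError:
    # Fallback so the sketch runs without NLTK; in NLTK 3, FreqDist
    # is a subclass of collections.Counter.
    from collections import Counter as FreqDist

fdist = FreqDist()
for word in "the cat sat on the mat the end".split():
    fdist[word] += 1                 # NLTK 2: fdist.inc(word)

samples = sorted(fdist)              # samples in alphabetical order
top_word, top_count = fdist.most_common(1)[0]   # most frequent sample
```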

Porter stemmer changes:

  • adjust_case(), cons(), cvc(), doublec(), m(), step1ab(), step1c(), step2(), step3(), step4(), step5(), vowelinstem() were made private
  • ends(), r(), setto() were removed

Removed modules, classes and functions:

  • classify.svm was removed. For classification based on support vector machines (SVMs), use classify.scikitlearn or scikit-learn directly. See https://github.com/nltk/nltk/issues/450.
  • probability.GoodTuringProbDist class was removed. See https://github.com/nltk/nltk/issues/381.
  • HiddenMarkovModelTaggerTransformI and its subclasses were removed. See https://github.com/nltk/nltk/issues/374.
  • classify.maxent no longer supports algorithms backed by scipy.maxentropy. See https://github.com/nltk/nltk/issues/321.
  • misc.babelfish was removed. See https://github.com/nltk/nltk/issues/265.
  • sourcedstring was removed. See https://github.com/nltk/nltk/issues/322.
  • yamltags was removed. JSON is now preferred instead. See https://github.com/nltk/nltk/issues/540.
  • mallet was removed, including the tag.crf module. See https://github.com/nltk/nltk/issues/104.
  • tag.simplify was removed. See https://github.com/nltk/nltk/issues/483
  • model was removed. See https://github.com/nltk/nltk/issues?labels=model
  • corpus.reader.wordnet._lcs_by_depth was removed. See https://github.com/nltk/nltk/issues/422.

Miscellaneous changes:

  • probability.ConditionalProbDist.default_factory now inherits from dict instead of defaultdict
  • probability.ConditionalProbDistI.default_factory now inherits from dict instead of defaultdict
  • probability.DictionaryConditionalProbDist.default_factory now inherits from dict instead of defaultdict

Environment variables for third-party software:

  • These have been normalised; please see Installing Third Party Software

More background on Python 3 and NLTK 3:

  • http://docs.python.org/2/library/2to3.html
  • http://docs.python.org/dev/whatsnew/3.0.html
  • http://nltk.org/dev/python3porting.html