Computing word similarity with NLTK
Source: Internet · Editor: 程序博客网 · Date: 2024/06/05 04:00
5 Similarity
>>> from nltk.corpus import wordnet as wn
>>> dog = wn.synset('dog.n.01')
>>> cat = wn.synset('cat.n.01')
synset1.path_similarity(synset2): Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (will only be true for verbs as there are many distinct verb taxonomies), in which case -1 is returned (newer NLTK versions return None instead). A score of 1 represents identity, i.e. comparing a sense with itself will return 1.
>>> dog.path_similarity(cat)
0.20000000000000001
synset1.lch_similarity(synset2): Leacock-Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d the taxonomy depth.
>>> dog.lch_similarity(cat)
2.0281482472922856
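The two scores above can be cross-checked by hand. NLTK computes path_similarity as 1/(d+1) for shortest-path distance d, and its lch_similarity as -log((d+1)/(2*D)), where D is the maximum depth of the taxonomy. A minimal sketch: the distance d=4 and noun-taxonomy depth D=19 are values inferred from the printed scores, not queried from WordNet here:

```python
import math

d = 4   # shortest is-a path distance between dog.n.01 and cat.n.01 (inferred from the 0.2 score)
D = 19  # maximum depth of the WordNet noun taxonomy as used by NLTK

path_sim = 1 / (d + 1)                  # 0.2, matching dog.path_similarity(cat)
lch_sim = -math.log((d + 1) / (2 * D))  # ~2.0281, matching dog.lch_similarity(cat)
print(path_sim, lch_sim)
```

Note that NLTK's implementation uses d+1 (path length in nodes rather than edges) inside the -log(p/2d) formula quoted above, which is why -log(5/38) reproduces the printed 2.0281.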
synset1.wup_similarity(synset2): Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Note that at this time the scores given do _not_ always agree with those given by Pedersen's Perl implementation of Wordnet Similarity.
The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.
>>> dog.wup_similarity(cat)
0.8571428571428571
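The Wu-Palmer definition can likewise be checked arithmetically: wup = 2 * depth(LCS) / (depth(s1) + depth(s2)). A minimal sketch; the depths 12, 13, and 15 are illustrative assumptions chosen to be consistent with the printed 6/7 score, not values queried from WordNet:

```python
def wup(depth_lcs: int, depth_s1: int, depth_s2: int) -> float:
    # Wu-Palmer: twice the depth of the Least Common Subsumer,
    # divided by the sum of the depths of the two senses
    return 2 * depth_lcs / (depth_s1 + depth_s2)

# Illustrative depths reproducing dog/cat's score of 6/7 = 0.857142...
print(wup(12, 13, 15))
```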
wordnet_ic Information Content: Load an information content file from the wordnet_ic corpus.
>>> from nltk.corpus import wordnet_ic
>>> brown_ic = wordnet_ic.ic('ic-brown.dat')
>>> semcor_ic = wordnet_ic.ic('ic-semcor.dat')
Or you can create an information content dictionary from a corpus (or anything that has a words() method).
>>> from nltk.corpus import genesis
>>> genesis_ic = wn.ic(genesis, False, 0.0)
synset1.res_similarity(synset2, ic): Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node). Note that for any similarity measure that uses information content, the result is dependent on the corpus used to generate the information content and the specifics of how the information content was created.
>>> dog.res_similarity(cat, brown_ic)
7.9116665090365768
>>> dog.res_similarity(cat, genesis_ic)
7.1388833044805002
synset1.jcn_similarity(synset2, ic): Jiang-Conrath Similarity Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).
>>> dog.jcn_similarity(cat, brown_ic)
0.44977552855167391
>>> dog.jcn_similarity(cat, genesis_ic)
0.28539390848096979
synset1.lin_similarity(synset2, ic): Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).
>>> dog.lin_similarity(cat, semcor_ic)
0.88632886280862277
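Because res, jcn, and lin are all built from the same three IC values, the printed brown_ic scores can be combined: res gives IC(lcs) directly, the jcn score then pins down IC(s1) + IC(s2), and from those the (unprinted) Lin score under brown_ic follows. A sketch of that derivation, using only the numbers shown above:

```python
# Scores printed above for dog/cat under brown_ic
res = 7.9116665090365768   # = IC(lcs)
jcn = 0.44977552855167391  # = 1 / (IC(s1) + IC(s2) - 2 * IC(lcs))

# Invert the jcn definition to recover the summed IC of the two senses
ic_sum = 1 / jcn + 2 * res

# Predict the Lin score under brown_ic (not printed in the session above):
# lin = 2 * IC(lcs) / (IC(s1) + IC(s2)), which comes out around 0.88
lin = 2 * res / ic_sum
print(lin)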