Gensim-Similarity Queries

Source: Internet  Editor: 程序博客网  Time: 2024/06/06 01:14

Introduction

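The following example shows how to run a similarity query in gensim: given a query document, rank the documents of a corpus by how similar they are to it.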

The method comes from the paper "Indexing by Latent Semantic Analysis"; the example comes from the official gensim website.
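
As a rough illustration of what such a query computes: each indexed document is scored by the cosine of the angle between its vector and the query vector, and the results are then sorted by score. The sketch below does this in plain Python on made-up 2-D vectors. The names `cosine_similarity` and `rank_by_similarity`, and the vectors themselves, are illustrative assumptions and are not part of gensim or of the original example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, doc_vectors):
    """Return (doc_index, score) pairs sorted by descending score."""
    scores = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(doc_vectors)]
    return sorted(scores, key=lambda item: -item[1])


# Hypothetical 2-D document vectors (not taken from the LSI model in this post).
doc_vectors = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
query = (2.0, 0.1)

ranking = rank_by_similarity(query, doc_vectors)
print(ranking)
```

The scores fall in the range [-1, 1], which is why negative values can appear in a result list.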

Code

from gensim import corpora, models, similarities


def GenDictandCorpus():
    documents = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]
    # Tokenize: lowercase each document and split on whitespace
    texts = [[word for word in document.lower().split()] for document in documents]
    # Dictionary: maps each token to an integer id
    dictionary = corpora.Dictionary(texts)
    # Corpus: each document stored as a list of (token id, count) pairs
    corpus = [dictionary.doc2bow(text) for text in texts]
    # print(dictionary)
    # print(corpus)
    return dictionary, corpus


def SimQuery(doc):
    dictionary, corpus = GenDictandCorpus()
    lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
    vec_bow = dictionary.doc2bow(doc.lower().split())
    vec_lsi = lsi[vec_bow]  # convert the query to LSI space
    # To prepare for similarity queries, we need to enter all documents that
    # we want to compare against subsequent queries. Here they are the same
    # nine documents used to train LSI, converted to the 2-D LSA space.
    # transform corpus to LSI space and index it
    index_corpus = similarities.MatrixSimilarity(lsi[corpus])
    # Saving and loading the index
    # index_corpus.save('/tmp/deerwester.index')
    # index_corpus = similarities.MatrixSimilarity.load('/tmp/deerwester.index')
    # Run the similarity query against the corpus
    sims = index_corpus[vec_lsi]
    # print(list(enumerate(sims)))
    # Sort the similarities in descending order
    sims_sorted = sorted(enumerate(sims), key=lambda item: -item[1])
    print(sims_sorted)


SimQuery("Human computer interaction")

Result:

[(0, 0.9768815), (2, 0.96618712), (4, 0.93288612), (3, 0.89150834), (1, 0.87645805), (5, 0.032106727), (8, -0.002741307), (6, -0.07901895), (7, -0.2151109)]
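
The printed result is a list of (document index, similarity) pairs, so the first five entries are the human-computer-interaction documents and the last four are the graph-theory ones. A small sketch of how those pairs could be mapped back to the document texts follows. It reuses the documents and the scores printed above rather than recomputing them, and the helper name `top_matches` is hypothetical.

```python
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# Scores copied from the output printed above (already sorted, best first).
sims_sorted = [(0, 0.9768815), (2, 0.96618712), (4, 0.93288612),
               (3, 0.89150834), (1, 0.87645805), (5, 0.032106727),
               (8, -0.002741307), (6, -0.07901895), (7, -0.2151109)]


def top_matches(sims_sorted, documents, n=3):
    """Return the n best (score, text) pairs from an already sorted result."""
    return [(score, documents[doc_id]) for doc_id, score in sims_sorted[:n]]


for score, text in top_matches(sims_sorted, documents):
    print(f"{score:.3f}  {text}")
```

In a real script, `SimQuery` could return `sims_sorted` instead of printing it, so that a helper like this one can consume the result directly.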

Reference: http://radimrehurek.com/gensim/tut3.html
