[Translation] Using LDA for Text Processing in Python

Posted 2016-02-17 16:10

Contents

  • Installation
  • Example
  • Notes:

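    Original article: http://chrisstrelioff.ws/sandbox/2014/11/13/getting_started_with_latent_dirichlet_allocation_in_python.html

    This post covers the main points of the original article.

    Background on LDA: see "LDA漫游指南".

    The Python library used here, lda, comes from https://github.com/ariddell/lda .

    The gensim library also provides LDA functions.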

    Installation

    $ pip install lda --user

    Example

    from __future__ import division, print_function

    import numpy as np
    import lda
    import lda.datasets

    # document-term matrix
    X = lda.datasets.load_reuters()
    print("type(X): {}".format(type(X)))
    print("shape: {}\n".format(X.shape))
    print(X[:5, :5])

    '''Output:
    type(X): <type 'numpy.ndarray'>
    shape: (395L, 4258L)

    [[ 1  0  1  0  0]
     [ 7  0  2  0  0]
     [ 0  0  0  1 10]
     [ 6  0  1  0  0]
     [ 0  0  0  2 14]]
    '''

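    X is a 395 x 4258 matrix: 395 documents and a vocabulary of 4258 words. Each entry is the number of times a given word occurs in a given document.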
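To make the layout of a document-term matrix concrete, here is a minimal sketch that builds one by hand from a toy corpus. The three documents and the alphabetical column order are assumptions made for illustration; they are not the Reuters data, and this is not how the lda package builds its dataset.

```python
from collections import Counter

# Toy corpus (hypothetical; not the Reuters data used in this post).
docs = [
    "church pope church",
    "pope years",
    "people church people people",
]

# Vocabulary: every distinct word, sorted, one matrix column per word.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, entries are counts.
X = []
for doc in docs:
    counts = Counter(doc.split())
    X.append([counts[word] for word in vocab])

print(vocab)
for row in X:
    print(row)
```

The result has 3 rows (documents) and 4 columns (words), the same orientation as the 395 x 4258 Reuters matrix.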

    Let's see which words these are:

    # the vocab
    vocab = lda.datasets.load_reuters_vocab()
    print("type(vocab): {}".format(type(vocab)))
    print("len(vocab): {}\n".format(len(vocab)))
    print(vocab[:6])

    '''Output:
    type(vocab): <type 'tuple'>
    len(vocab): 4258

    ('church', 'pope', 'years', 'people', 'mother', 'last')
    '''

    Column 0 of X corresponds to the word church, and column 1 to the word pope.
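Because column j of X and entry j of vocab refer to the same word, a row of counts can be read back as words. The sketch below does this for a single made-up row; toy_vocab and toy_row are hypothetical stand-ins for vocab and one row of X.

```python
import numpy as np

# Hypothetical stand-ins for the real vocab tuple and one row of X.
toy_vocab = ("church", "pope", "years", "people", "mother", "last")
toy_row = np.array([7, 0, 2, 0, 0, 1])

# Column indices of the words that actually occur in this document.
present = np.nonzero(toy_row)[0]

# Pair each occurring word with its count, most frequent first.
word_counts = sorted(
    ((toy_vocab[j], int(toy_row[j])) for j in present),
    key=lambda pair: pair[1],
    reverse=True,
)
print(word_counts)
```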

    Next, look at the story titles:

    # titles for each story
    titles = lda.datasets.load_reuters_titles()
    print("type(titles): {}".format(type(titles)))
    print("len(titles): {}\n".format(len(titles)))
    print(titles[:2])  # titles of the first two stories

    '''Output:
    type(titles): <type 'tuple'>
    len(titles): 395

    ('0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20',
     '1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21')
    '''

    Fit the model, specifying 20 topics and 500 iterations:

    model = lda.LDA(n_topics=20, n_iter=500, random_state=1)
    model.fit(X)

    Topic-word distribution:

    topic_word = model.topic_word_
    print("type(topic_word): {}".format(type(topic_word)))
    print("shape: {}".format(topic_word.shape))

    '''Output:
    type(topic_word): <type 'numpy.ndarray'>
    shape: (20L, 4258L)
    '''

    Each row of topic_word corresponds to one topic, and each row sums to 1. Here are the weights of the three words 'church', 'pope' and 'years' in each topic:

    print(topic_word[:, :3])

    '''Output:
    [[  2.72436509e-06   2.72436509e-06   2.72708945e-03]
     [  2.29518860e-02   1.08771556e-06   7.83263973e-03]
     [  3.97404221e-03   4.96135108e-06   2.98177200e-03]
     [  3.27374625e-03   2.72585033e-06   2.72585033e-06]
     [  8.26262882e-03   8.56893407e-02   1.61980569e-06]
     [  1.30107788e-02   2.95632328e-06   2.95632328e-06]
     [  2.80145003e-06   2.80145003e-06   2.80145003e-06]
     [  2.42858077e-02   4.66944966e-06   4.66944966e-06]
     [  6.84655429e-03   1.90129250e-06   6.84655429e-03]
     [  3.48361655e-06   3.48361655e-06   3.48361655e-06]
     [  2.98781661e-03   3.31611166e-06   3.31611166e-06]
     [  4.27062069e-06   4.27062069e-06   4.27062069e-06]
     [  1.50994982e-02   1.64107142e-06   1.64107142e-06]
     [  7.73480150e-07   7.73480150e-07   1.70946848e-02]
     [  2.82280146e-06   2.82280146e-06   2.82280146e-06]
     [  5.15309856e-06   5.15309856e-06   4.64294180e-03]
     [  3.41695768e-06   3.41695768e-06   3.41695768e-06]
     [  3.90980357e-02   1.70316633e-03   4.42279319e-03]
     [  2.39373034e-06   2.39373034e-06   2.39373034e-06]
     [  3.32493234e-06   3.32493234e-06   3.32493234e-06]]
    '''
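The statement that each row sums to 1 can be checked directly. The sketch below uses a small made-up count matrix (toy_counts is hypothetical, not taken from the fitted model) to show how dividing each row by its total gives one probability distribution per topic, while a column carries no such constraint.

```python
import numpy as np

# Hypothetical topic-by-word counts: 2 topics, 4 words.
toy_counts = np.array([[8.0, 1.0, 1.0, 0.0],
                       [0.0, 2.0, 2.0, 4.0]])

# Dividing each row by its own total yields one distribution per topic.
toy_topic_word = toy_counts / toy_counts.sum(axis=1, keepdims=True)

# Every row now sums to 1 (up to floating-point error).
row_sums = toy_topic_word.sum(axis=1)
print(row_sums)

# A column compares one word across topics and need not sum to 1.
first_word_column = toy_topic_word[:, 0]
print(first_word_column)
```

With the real model, the same check would be `np.allclose(topic_word.sum(axis=1), 1.0)`.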

    Get the 5 highest-weighted words for each topic:

    n = 5
    for i, topic_dist in enumerate(topic_word):
        topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n+1):-1]
        print('*Topic {}\n- {}'.format(i, ' '.join(topic_words)))

    '''Output:
    *Topic 0
    - government british minister west group
    *Topic 1
    - church first during people political
    *Topic 2
    - elvis king wright fans presley
    *Topic 3
    - yeltsin russian russia president kremlin
    *Topic 4
    - pope vatican paul surgery pontiff
    *Topic 5
    - family police miami versace cunanan
    *Topic 6
    - south simpson born york white
    *Topic 7
    - order church mother successor since
    *Topic 8
    - charles prince diana royal queen
    *Topic 9
    - film france french against actor
    *Topic 10
    - germany german war nazi christian
    *Topic 11
    - east prize peace timor quebec
    *Topic 12
    - n't told life people church
    *Topic 13
    - years world time year last
    *Topic 14
    - mother teresa heart charity calcutta
    *Topic 15
    - city salonika exhibition buddhist byzantine
    *Topic 16
    - music first people tour including
    *Topic 17
    - church catholic bernardin cardinal bishop
    *Topic 18
    - harriman clinton u.s churchill paris
    *Topic 19
    - century art million museum city
    '''
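The slice `[:-(n+1):-1]` in the loop above is compact but easy to misread. The sketch below walks through it on a made-up five-word distribution (toy_vocab and toy_dist are hypothetical) and compares it with an equivalent, arguably more readable, form.

```python
import numpy as np

# Hypothetical vocabulary and one topic's word weights.
toy_vocab = np.array(["church", "pope", "years", "people", "mother"])
toy_dist = np.array([0.10, 0.40, 0.05, 0.30, 0.15])
n = 2

# argsort returns indices ordered from the smallest to the largest weight.
order = np.argsort(toy_dist)

# Walk the sorted words backwards from the end and stop after n items.
top_slice = toy_vocab[order][:-(n + 1):-1]

# Equivalent form: reverse the index order first, then take the first n.
top_alt = toy_vocab[order[::-1][:n]]

print(order)
print(top_slice)
print(top_alt)
```

Both forms return the n words with the largest weights, highest first.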

    Document-topic distribution:

    doc_topic = model.doc_topic_
    print("type(doc_topic): {}".format(type(doc_topic)))
    print("shape: {}".format(doc_topic.shape))

    '''Output:
    type(doc_topic): <type 'numpy.ndarray'>
    shape: (395, 20)
    '''

    Each row corresponds to one document, and each row sums to 1.

    Print the most probable topic for each of the first 10 documents:

    for n in range(10):
        topic_most_pr = doc_topic[n].argmax()
        print("doc: {} topic: {}".format(n, topic_most_pr))

    '''Output:
    doc: 0 topic: 8
    doc: 1 topic: 1
    doc: 2 topic: 14
    doc: 3 topic: 8
    doc: 4 topic: 14
    doc: 5 topic: 14
    doc: 6 topic: 14
    doc: 7 topic: 14
    doc: 8 topic: 14
    doc: 9 topic: 8
    '''
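The per-document loop above can also be written as a single vectorized call. The sketch below shows this on a made-up 3 x 3 matrix (toy_doc_topic is hypothetical, not output from the fitted model) and also reports how much probability the winning topic received, which helps show whether the assignment is clear-cut.

```python
import numpy as np

# Hypothetical document-topic matrix: 3 documents, 3 topics, rows sum to 1.
toy_doc_topic = np.array([[0.1, 0.7, 0.2],
                          [0.5, 0.3, 0.2],
                          [0.2, 0.2, 0.6]])

# argmax along axis 1 picks the most probable topic for every row at once.
best_topic = toy_doc_topic.argmax(axis=1)

# max along axis 1 gives the probability of that winning topic.
best_prob = toy_doc_topic.max(axis=1)

for doc_id, (topic, prob) in enumerate(zip(best_topic, best_prob)):
    print("doc: {} topic: {} (p = {:.2f})".format(doc_id, topic, prob))
```

Applied to the real model, `doc_topic.argmax(axis=1)[:10]` should give the same topics as the loop above.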