LDA


I spent a day reading about LDA:

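LDA is an unsupervised learning method. The theory behind it includes the multinomial distribution and the Dirichlet distribution. The model has three parts: the distribution of topics within a document, the distribution of words within a topic, and the distribution of words within a document.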
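To make the three distributions concrete, here is a minimal sketch of the generative story LDA assumes, written with NumPy. The toy sizes, the prior values, and the seed are all assumptions chosen for illustration; this is not the lda package's fitting code, which goes the other way and infers these distributions from word counts.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is repeatable

# Toy sizes and symmetric Dirichlet priors (all assumed for illustration).
n_topics, n_vocab, n_docs, doc_len = 3, 6, 4, 20
alpha, eta = 0.5, 0.5

# Words within a topic: one distribution over the vocabulary per topic.
topic_word = rng.dirichlet(np.full(n_vocab, eta), size=n_topics)   # (n_topics, n_vocab)

# Topics within a document: one distribution over topics per document.
doc_topic = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)   # (n_docs, n_topics)

# Generate each document word by word: pick a topic, then pick a word from it.
X = np.zeros((n_docs, n_vocab), dtype=int)                         # word-count matrix
for d in range(n_docs):
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=doc_topic[d])    # multinomial draw of a topic
        w = rng.choice(n_vocab, p=topic_word[z])    # multinomial draw of a word
        X[d, w] += 1

# Words within a document: the topic mixture applied to the topic-word rows.
doc_word = doc_topic @ topic_word                                  # (n_docs, n_vocab)
print(X)
print(doc_word.sum(axis=1))  # each row is a probability distribution
```

The count matrix X here plays the same role as the document-term matrix passed to the model: rows are documents, columns are vocabulary words.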

I am still using the packages that come with Anaconda.

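I have only just started with Python, so there is a lot I don't understand yet. One problem I ran into deepened my understanding of the language: importing a package from inside a project's own directory. I use Python on Windows, from the command line. When I started python outside the package directory and ran the import, it worked; once I had changed into the package directory, the same import failed. A classmate suggested that Windows may apply a priority when reading environment variables: if the current directory contains a file with the name you are importing, that file is imported whether or not it is the one you actually want, and only when the current directory has no such file is the path from the environment variables used. This does explain the error I got on import lda, so I'll go with that understanding for now. Strictly speaking, this is Python's module search order rather than a Windows policy: the interpreter walks sys.path in order, and in an interactive session the first entry is the current directory, so a local directory named lda shadows the installed package.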
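The shadowing effect can be reproduced without touching a real installation. The sketch below is only an illustration: the module name shadow_demo, the two temporary directories, and the helper functions are all hypothetical. One directory stands in for the installed location, the other for the current directory that an interactive session puts first on sys.path.

```python
import importlib
import os
import sys
import tempfile


def make_module(directory, name, marker):
    """Write a one-line module that records where it was loaded from."""
    with open(os.path.join(directory, name + ".py"), "w") as f:
        f.write("WHERE = {!r}\n".format(marker))


def fresh_import(name):
    """Import `name` from scratch, ignoring any previously loaded copy."""
    sys.modules.pop(name, None)
    importlib.invalidate_caches()
    return importlib.import_module(name)


installed_dir = tempfile.mkdtemp()   # stands in for site-packages
local_dir = tempfile.mkdtemp()       # stands in for the current directory
make_module(installed_dir, "shadow_demo", "installed")
make_module(local_dir, "shadow_demo", "local")

sys.path.append(installed_dir)       # late on the search path, like site-packages

# Case 1: the "current directory" is searched first, so its copy wins.
sys.path.insert(0, local_dir)
found = fresh_import("shadow_demo").WHERE
sys.path.remove(local_dir)

# Case 2: without it, the search falls through to the "installed" copy.
fallback = fresh_import("shadow_demo").WHERE

print(found, fallback)
```

When a real import misbehaves, printing the module's __file__ attribute (for example lda.__file__) shows which copy was actually picked up.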

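The data is again the Reuters news dataset. I work through the data while reading the source code, and look up any Python syntax questions separately as they come up.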

>>> import numpy as np
>>> import lda
>>> import lda.datasets
>>> X = lda.datasets.load_reuters()
>>> vocab = lda.datasets.load_reuters_vocab()
>>> titles = lda.datasets.load_reuters_titles()
>>> X.shape
(395, 4258)
>>> X.sum()
84010
>>> model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
>>> model.fit(X)  # model.fit_transform(X) is also available
>>> topic_word = model.topic_word_  # model.components_ also works
>>> n_top_words = 8
>>> for i, topic_dist in enumerate(topic_word):
...     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
...     print('Topic {}: {}'.format(i, ' '.join(topic_words)))

For each of the 20 topics, the 8 words with the highest probability in that topic:

Topic 0: british churchill sale million major letters west britain
Topic 1: church government political country state people party against
Topic 2: elvis king fans presley life concert young death
Topic 3: yeltsin russian russia president kremlin moscow michael operation
Topic 4: pope vatican paul john surgery hospital pontiff rome
Topic 5: family funeral police miami versace cunanan city service
Topic 6: simpson former years court president wife south church
Topic 7: order mother successor election nuns church nirmala head
Topic 8: charles prince diana royal king queen parker bowles
Topic 9: film french france against bardot paris poster animal
Topic 10: germany german war nazi letter christian book jews
Topic 11: east peace prize award timor quebec belo leader
Topic 12: n't life show told very love television father
Topic 13: years year time last church world people say
Topic 14: mother teresa heart calcutta charity nun hospital missionaries
Topic 15: city salonika capital buddhist cultural vietnam byzantine show
Topic 16: music tour opera singer israel people film israeli
Topic 17: church catholic bernardin cardinal bishop wright death cancer
Topic 18: harriman clinton u.s ambassador paris president churchill france
Topic 19: city museum art exhibition century million churches set
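The indexing expression used to pick the top words is compact and was one of the pieces of syntax I had to look up. Broken into steps on a toy example, it looks roughly like this; the five-word vocabulary and the probabilities are made up for illustration and do not come from the Reuters data.

```python
import numpy as np

vocab = ("church", "pope", "film", "music", "city")       # hypothetical vocabulary
topic_dist = np.array([0.10, 0.40, 0.05, 0.30, 0.15])     # hypothetical row of topic_word
n_top_words = 3

order = np.argsort(topic_dist)        # word indices, least probable first
by_prob = np.array(vocab)[order]      # the words in ascending order of probability

# Negative step: start at the last element and walk backwards,
# stopping before position -(n_top_words + 1), i.e. after n_top_words items.
top_words = by_prob[:-(n_top_words + 1):-1]

# An equivalent spelling that may be easier to read: reverse, then truncate.
top_words_alt = np.array(vocab)[order[::-1][:n_top_words]]

print(' '.join(top_words))            # most probable word first
```

Both forms sort the whole vocabulary. For a 4258-word vocabulary that cost is negligible, so readability is the only reason to prefer one over the other.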
>>> doc_topic = model.doc_topic_
>>> for i in range(10):
...     print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))

Each of the first 10 titles is assigned to the most probable of the 20 topics above:

0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20 (top topic: 8)
1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21 (top topic: 13)
2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23 (top topic: 14)
3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25 (top topic: 8)
4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25 (top topic: 14)
5 INDIA: Mother Teresa's condition unchanged, thousands pray. CALCUTTA 1996-08-25 (top topic: 14)
6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26 (top topic: 14)
7 INDIA: Mother Teresa's condition improves, many pray. CALCUTTA, India 1996-08-25 (top topic: 14)
8 INDIA: Mother Teresa improves, nuns pray for "miracle". CALCUTTA 1996-08-26 (top topic: 14)
9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26 (top topic: 8)
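Taking the argmax of a row keeps only the single strongest topic and discards how mixed the document is. The sketch below uses a small made-up doc_topic matrix (three documents, four topics, each row assumed to sum to 1) to show the top topic next to its probability and its margin over the runner-up; the numbers are illustrative and are not output from the fitted model.

```python
import numpy as np

# Hypothetical document-topic matrix: 3 documents, 4 topics, rows sum to 1.
doc_topic = np.array([
    [0.70, 0.10, 0.15, 0.05],
    [0.05, 0.45, 0.40, 0.10],
    [0.25, 0.25, 0.20, 0.30],
])

top_topic = doc_topic.argmax(axis=1)      # same result as doc_topic[i].argmax() per row
top_prob = doc_topic.max(axis=1)          # probability of that top topic
runner_up = np.sort(doc_topic, axis=1)[:, -2]
margin = top_prob - runner_up             # small margin = the assignment is a close call

for i in range(len(doc_topic)):
    print("doc {}: top topic {} (p={:.2f}, margin={:.2f})".format(
        i, top_topic[i], top_prob[i], margin[i]))
```

In this toy data the first document is clearly about topic 0, while the other two are near ties, which the bare argmax would hide.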


