Gensim and LDA: a quick tour
In [1]:
import logging
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.WARNING)
logging.root.level = logging.WARNING
In [2]:
from sklearn import datasets
news_dataset = datasets.fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
In [3]:
# A list of text documents is contained in the data attribute
documents = news_dataset.data
print "In the dataset there are", len(documents), "textual documents"
print "And this is the first one:\n", documents[0]
In [4]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
In [5]:
def tokenize(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]

print "After the tokenizer, the previous document becomes:\n", tokenize(documents[0])
In [6]:
processed_docs = [tokenize(doc) for doc in documents]
word_count_dict = gensim.corpora.Dictionary(processed_docs)
print "In the corpus there are", len(word_count_dict), "unique tokens"
In [7]:
# keep only words that appear in at least 20 documents and in no more than 10% of the documents
word_count_dict.filter_extremes(no_below=20, no_above=0.1)
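The semantics of `filter_extremes` can be sketched in plain Python (a toy re-implementation for illustration, not gensim's actual code; the document frequencies below are made up, not measured on this corpus):

```python
# Toy sketch of filter_extremes: drop tokens present in fewer than
# `no_below` documents, or in more than the fraction `no_above` of all
# documents. Returns the set of surviving tokens.
def filter_extremes(doc_freqs, num_docs, no_below=20, no_above=0.1):
    cap = no_above * num_docs
    return {token for token, df in doc_freqs.items()
            if no_below <= df <= cap}

# Hypothetical document frequencies over 1000 documents:
doc_freqs = {"the": 950, "car": 60, "badger": 3}
kept = filter_extremes(doc_freqs, 1000)
# "the" exceeds the 10% cap, "badger" is below no_below; only "car" survives.
```

This is why the filter shrinks the vocabulary so drastically in the next cell: both very common and very rare tokens are discarded.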
In [8]:
print "After filtering, in the corpus there are only", len(word_count_dict), "unique tokens"
In [9]:
bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) for pdoc in processed_docs]
In [10]:
bow_doc1 = bag_of_words_corpus[0]
print "Bag of words representation of the first document (tuples are composed by token_id and multiplicity):\n", bow_doc1
print
for i in range(5):
    print "In the document, token_id {} (word \"{}\") appears {} time[s]".format(bow_doc1[i][0], word_count_dict[bow_doc1[i][0]], bow_doc1[i][1])
print "..."
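Under the hood, `doc2bow` just counts occurrences of known tokens and maps them to ids. A minimal stand-in, using a plain dict as a hypothetical vocabulary (illustration only, not gensim's implementation):

```python
from collections import Counter

# Minimal stand-in for Dictionary.doc2bow: count each token that exists
# in the vocabulary and return (token_id, count) pairs sorted by id.
# Tokens outside the vocabulary are silently ignored.
def doc2bow(tokens, token2id):
    counts = Counter(t for t in tokens if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

token2id = {"car": 0, "drive": 1, "spare": 2}  # hypothetical vocabulary
bow = doc2bow(["drive", "car", "drive", "unknown"], token2id)
# "unknown" is dropped; "drive" is counted twice.
```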
In [11]:
# LDA mono-core
lda_model = gensim.models.LdaModel(bag_of_words_corpus, num_topics=10, id2word=word_count_dict, passes=5)
# LDA multicore (in this configuration, by default, uses n_cores - 1 workers)
# lda_model = gensim.models.LdaMulticore(bag_of_words_corpus, num_topics=10, id2word=word_count_dict, passes=5)
In [12]:
_ = lda_model.print_topics(-1)
In [13]:
for index, score in sorted(lda_model[bag_of_words_corpus[0]], key=lambda tup: -1*tup[1]):
    print "Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 10))
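Indexing the trained model with a bag-of-words vector yields `(topic_id, probability)` pairs, and the sort above orders them most-probable first. A tiny illustration of that ranking with made-up scores (not real model output):

```python
# Toy (topic_id, probability) pairs, shaped like lda_model[bow] output.
topic_scores = [(3, 0.12), (7, 0.55), (1, 0.33)]

# Negating the probability sorts in descending order of score.
ranked = sorted(topic_scores, key=lambda tup: -1 * tup[1])
# Topic 7 comes first, as it has the highest probability.
```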
In [14]:
news_dataset.target_names[news_dataset.target[0]]
Out[14]:
In [16]:
unseen_document = "In my spare time I either play badminton or drive my car"
print "The unseen document is composed by the following text:", unseen_document
print
bow_vector = word_count_dict.doc2bow(tokenize(unseen_document))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print "Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5))
In [17]:
print "Log perplexity of the model is", lda_model.log_perplexity(bag_of_words_corpus)
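Note that `log_perplexity` returns a per-word likelihood bound, not the perplexity itself; gensim derives the perplexity as 2 raised to the negated bound. A quick conversion (the bound value below is a hypothetical placeholder, not a result from this corpus):

```python
# Convert gensim's per-word likelihood bound into a perplexity figure.
# perplexity = 2 ** (-bound); lower perplexity indicates a better fit.
def perplexity_from_bound(per_word_bound):
    return 2 ** (-per_word_bound)

# Hypothetical bound of -9.5, for illustration only:
ppl = perplexity_from_bound(-9.5)  # 2 ** 9.5
```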