Using the LDA Topic Model
Source: Internet  Editor: 程序博客网  Date: 2024/06/07 08:24
import pymysql
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import jieba
import lda  # only needed for the commented-out lda.LDA example below
import numpy as np


def mysql_run_sql(sql):
    db = pymysql.connect(
        host='***.18.***.51',
        port=3306,
        user='root',
        password='****',
        database='****',
        charset='utf8',
    )
    cursor = db.cursor()
    cursor.execute(sql)
    data = cursor.fetchall()
    # close the database connection
    db.close()
    return data


# Load the stop-word list
with open('D:\****\stop_words.txt', 'r', encoding='utf-8') as f:
    stopwords = f.readlines()
stops = [stopword.strip() for stopword in stopwords]

content = mysql_run_sql("SELECT c_content FROM math_compute.news_result_02")
# atl_list = json_to_list(content)


def text_to_words(text):
    words = jieba.lcut(str(text).strip())
    meaningful_words = [w for w in words if w not in stops]
    return ' '.join(meaningful_words)


# fetchall() returns one tuple per row, so take the first column of every row
# (the original `content[1]` only looked at the second row)
clean_content = [text_to_words(row[0]) for row in content]

# Extract the 1000 most important feature words from the text
# vectorizer = CountVectorizer(analyzer="word", tokenizer=None,
#                              preprocessor=None, stop_words=None,
#                              token_pattern=r"(?u)\b\w+\b",
#                              ngram_range=(1, 1), max_features=None)
# tf = vectorizer.fit_transform(clean_content)
n_features = 1000
# Note: stop_words='english' has no effect on Chinese tokens; the Chinese
# stop words were already removed by text_to_words() above
tf_vectorizer = TfidfVectorizer(strip_accents='unicode',
                                max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(clean_content)

# Set the number of topics (1 here; in practice usually several)
n_topics = 1
# Named lda_model so it does not shadow the imported `lda` package;
# n_components was called n_topics in older scikit-learn releases
lda_model = LatentDirichletAllocation(n_components=n_topics,
                                      max_iter=50,
                                      learning_method='online',
                                      learning_offset=50,
                                      random_state=0)
lda_model.fit(tf)

# n_topics = 5
# model = lda.LDA(n_topics=n_topics, n_iter=500, random_state=1)
# model.fit(tf)
'''
# topic-word distribution
topic_word = model.topic_word_
print("type(topic_word): {}".format(type(topic_word)))
print("shape: {}".format(topic_word.shape))
print(clean_content[:3])
print(topic_word[:, :3])
for n in range(5):
    sum_pr = sum(topic_word[n, :])
    print("topic: {} sum: {}".format(n, sum_pr))

# Top-N words for each topic
n = 5
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(tf)[np.argsort(topic_dist)][:-(n+1):-1]
    print('*Topic {}\n- {}'.format(i, ' '.join(topic_words)))
'''


# Display the top keywords of each topic
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
        print()


n_top_words = 20
# get_feature_names_out() replaced get_feature_names() in scikit-learn 1.0
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda_model, tf_feature_names, n_top_words)