Using TF-IDF to Extract Article Keywords and Search Articles
Using TF-IDF
Our goal is either to extract the keywords of an article, or, given a set of keywords, to find the articles in a corpus that best match them. The two problems are essentially the same. Here we solve them with a simple but very effective method: TF-IDF. In this post we take the second formulation: given a set of keywords, find the articles in the corpus that are closest to them.
TF (Term Frequency) is how often a word occurs in an article. The larger the TF value, the more frequently the word appears in that article. But frequency alone is not enough to decide whether a word is a keyword. For example [1], words such as "的" and "是" occur very often in an article yet are not the keywords we want; such words are called stop words. To address this, we introduce IDF.
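The normalized term frequency described above can be sketched on a toy document (the words and counts here are invented purely for illustration):

```python
from collections import Counter

# Hypothetical tokenized document (illustration only)
doc = ["的", "机器", "学习", "的", "模型", "学习"]

counts = Counter(doc)
# Normalized TF: raw count divided by the document's token count
tf = {w: c / len(doc) for w, c in counts.items()}

print(tf["的"])    # 2 occurrences out of 6 tokens
print(tf["模型"])  # 1 occurrence out of 6 tokens
```

Note that the stop word "的" gets the highest TF here, which is exactly why TF alone is not enough.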
IDF (Inverse Document Frequency) measures how discriminative a word is: the larger a word's IDF value, the more informative the word. This post does not derive the formula; readers who want it can consult Ruan Yifeng's article in the references.
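Although the post skips the derivation, the variant used by the code below is idf(w) = log(N / (df(w) + 1)), where N is the number of documents and df(w) is the number of documents containing w. A minimal sketch on a toy corpus (all words invented for illustration):

```python
import math

# Toy corpus of tokenized documents (illustration only)
docs = [
    ["的", "苹果", "好吃"],
    ["的", "香蕉", "好吃"],
    ["的", "天气", "不错"],
]
N = len(docs)

def idf(word):
    # Document frequency: how many documents contain the word
    df = sum(1 for d in docs if word in d)
    # The +1 in the denominator avoids division by zero for unseen words
    return math.log(N / (df + 1))

print(idf("的"))    # appears in every document: low (here negative) IDF
print(idf("苹果"))  # appears in one document: higher IDF, more discriminative
```

The stop word "的" gets the lowest IDF, so multiplying by IDF suppresses exactly the words that TF over-counts.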
The main goal of this post is a working demo. With TF (frequency) and IDF (weight) in hand, multiplying the two gives a reasonably good measure of a word's importance: TF-IDF.
```python
import numpy as np
import math

file_dir = 'input/tf_idf_data.txt'  # data is given at the end of the post
docid2content = {}  # int -> list
word2id = {}        # str -> int
id2word = {}        # int -> str
word_id = 0

with open(file_dir, 'r') as f:
    doc_id = 0
    for line in f.readlines():
        seg = line.strip('\n').split(' ')
        docid2content[doc_id] = seg
        doc_id += 1
        for word in seg:  # build the vocabulary
            if word not in word2id:
                word2id[word] = word_id
                id2word[word_id] = word
                word_id += 1

n_doc = len(docid2content)
n_word = len(word2id)
print('Document length = %d' % n_doc)
print('Unique word number = %d' % n_word)
```
```
Document length = 148
Unique word number = 20035
```
```python
# V = vocabulary size, M = number of documents
# Count term frequency
word_tf_VM = np.zeros(shape=[n_word, n_doc])
for doc_id in range(n_doc):
    for word in docid2content[doc_id]:
        word_tf_VM[word2id[word]][doc_id] += (1.0 / len(docid2content[doc_id]))  # normalized
print('==========> term-frequency preview')
for i in range(5):
    print(word_tf_VM[i])
```
```
==========> term-frequency preview
[ 0.01611279  0.          0.          ...  0.          0.          0.00239808]
...
```

(The output prints the first five rows of the 20035 x 148 TF matrix; the rows are mostly zeros and are omitted here.)
```python
# Inverse document frequency
word_idf_V = np.zeros(shape=[n_word])
for i in range(n_word):
    word = id2word[i]
    for doc_id in range(n_doc):
        if word in docid2content[doc_id]:
            word_idf_V[i] += 1
for i in range(n_word):
    word_idf_V[i] = math.log(n_doc / (word_idf_V[i] + 1))
print('==========> inverse-document-frequency preview')
for i in range(5):
    print(word_idf_V[i])
```
```
==========> inverse-document-frequency preview
1.73911573574
2.22462355152
2.16399892971
3.8985999851
3.61091791264
```
```python
# Given a keyword, use TF-IDF to return the top-3 closest document ids
input_word = [2, 5, 10, 34, 100]
for word_id in input_word:
    word = id2word[word_id]
    tf_idf = list()  # elements: (doc_id, tf_idf)
    for doc_id in range(n_doc):
        tf_idf.append((doc_id, word_tf_VM[word_id][doc_id] * word_idf_V[word_id]))
    sort_tf_idf = sorted(tf_idf, key=lambda x: x[1], reverse=True)
    print(word, '==>', sort_tf_idf[0], sort_tf_idf[1], sort_tf_idf[2])
```
```
你们好 ==> (106, 0.0058250307663738872) (30, 0.0045130321787443146) (52, 0.0034679470027370175)
二周目 ==> (0, 0.062589924690941171) (20, 0.0063331879234507513) (101, 0.0056082711158862925)
微信 ==> (127, 0.0083196841455246261) (126, 0.0069068013846811339) (109, 0.006832832962890706)
弹幕 ==> (23, 0.0017124917032905211) (0, 0.0012503086454034534) (3, 0.0010561444578422166)
快乐 ==> (125, 0.009259887219141132) (121, 0.0070725122854857457) (88, 0.0037244328690671309)
```
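The loop above ranks documents by one keyword at a time. For a multi-word query, one common extension (not in the original code; the table, words, and numbers below are invented for illustration) is to sum the TF-IDF scores over the query words:

```python
# Toy TF-IDF lookup table: tfidf[word][doc_id] (numbers invented for illustration)
tfidf = {
    "微信": {0: 0.008, 1: 0.001, 2: 0.000},
    "弹幕": {0: 0.000, 1: 0.002, 2: 0.001},
}

def rank(query, n_doc=3):
    # Score each document by the sum of TF-IDF over the query words it knows
    scores = [(d, sum(tfidf[w].get(d, 0.0) for w in query if w in tfidf))
              for d in range(n_doc)]
    return sorted(scores, key=lambda x: x[1], reverse=True)

print(rank(["微信", "弹幕"]))  # document 0 ranks first with score 0.008
```

Summing is the simplest aggregation; Ruan Yifeng's article (see references) discusses cosine similarity for the same purpose.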
References & Recommendations

- Ruan Yifeng's blog post
- Test data