Prerequisites:
Ruan Yifeng: Applications of TF-IDF and Cosine Similarity (1): Automatically Extracting Keywords
Applications of TF-IDF and Cosine Similarity (2): Finding Similar Articles
This post follows 在路上吗's translation of the official gensim tutorial and uses TF-IDF to compute text similarity.
Translated tutorial: http://blog.csdn.net/questionfish/article/category/5610303
First install gensim (for example with pip install gensim). Then import gensim and set up logging:
- from gensim import corpora, models, similarities
- import logging
- from collections import defaultdict
- logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
Prepare the data: we have 9 documents, and we put them into a list.
- # the documents
- documents = ["Human machine interface for lab abc computer applications",
- "A survey of user opinion of computer system response time",
- "The EPS user interface management system",
- "System and human system engineering testing of EPS",
- "Relation of user perceived response time to error measurement",
- "The generation of random binary unordered trees",
- "The intersection graph of paths in trees",
- "Graph minors IV Widths of trees and well quasi ordering",
- "Graph minors A survey"]
1. Tokenize the documents. The documents here are in English, so we can simply split on whitespace; for Chinese text you would use a Chinese word segmentation tool (a small jieba sketch follows the tokenized output below).
(1) First set up the stop words. This is only a test, so we keep it simple and treat for, a, of, the, and, to, in as stop words.
(2) Iterate over the documents, tokenize each one, and filter the tokens, discarding any token that is a stop word.
-
- stoplist=set('for a of the and to in'.split())
- texts=[[word for word in document.lower().split() if word not in stoplist] for document in documents]
If, like me, you are new to Python, the expression
- [[word for word in document.lower().split() if word not in stoplist] for document in documents]
may look unfamiliar. Here is a simpler example first; for details, look up Python list comprehensions.
- num=[1,2,3]
- myvec=[[x,x*2] for x in num]
-
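For reference, the small comprehension above builds one [x, x*2] pair per element of num, so printing the result gives:
- print(myvec)  # [[1, 2], [2, 4], [3, 6]]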
Further reading on list comprehensions: http://fortianwei.iteye.com/blog/356367
Print the tokenized result; every document has now been split into tokens:
- print(texts)
- [['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
- ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
- ['eps', 'user', 'interface', 'management', 'system'],
- ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
- ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
- ['generation', 'random', 'binary', 'unordered', 'trees'],
- ['intersection', 'graph', 'paths', 'trees'],
- ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
- ['graph', 'minors', 'survey']]
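As mentioned in step 1, Chinese text cannot simply be split on whitespace. A minimal sketch of the same tokenization step with the jieba segmentation library (jieba and the sample sentence are my own assumptions, not part of the original tutorial):
- import jieba  # install with: pip install jieba
-
- doc_cn = "人机交互界面的计算机应用"  # a hypothetical Chinese document
- # jieba.cut returns a generator of tokens; in practice you would filter a Chinese stop-word list here
- tokens_cn = list(jieba.cut(doc_cn))
- print(tokens_cn)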
2. Count how often each word occurs.
(1) Iterate over the tokenized result texts from the previous step and count each word's frequency.
(2) Keep the words whose frequency is greater than 1; words that appear only once are discarded (in practice the threshold depends on your needs).
-
- frequency = defaultdict(int)
-
- for text in texts:
-     for token in text:
-         frequency[token] += 1
-
- texts = [[token for token in text if frequency[token] > 1] for text in texts]
Print the result; only the words that occur more than once remain:
- print(texts)
- [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system',
- 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]
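The same frequency filtering can also be written with collections.Counter instead of defaultdict; this is just an equivalent alternative sketch, applied to the unfiltered token lists:
- from collections import Counter
-
- frequency = Counter(token for text in texts for token in text)
- texts = [[token for token in text if frequency[token] > 1] for text in texts]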
3. Create a dictionary with corpora. Taking the first document as an example:
[human, interface, computer] → build the dictionary →
{'human': 1, 'interface': 2, 'computer': 3}, where each key is a word and each value is the word's id (note: the actual ids are not necessarily 1, 2, 3; this is only an illustration).
- dictionary = corpora.Dictionary(texts)
- print(dictionary.token2id)
From the printed dictionary you can see that every word has been assigned an id; there are 12 words in total.
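To double-check the vocabulary size, the dictionary can be asked directly (gensim's Dictionary supports len()):
- print(len(dictionary))  # 12 unique tokens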
4. Process the document we want to compare.
(1) First tokenize this document in the same way.
(2) Then use the dictionary built in the previous step to convert the tokens into a vector, using the bag-of-words representation (doc2bow).
- new_doc = "Human computer interaction"
- new_vec = dictionary.doc2bow(new_doc.lower().split())
After tokenization the document becomes [human, computer, interaction]. In the dictionary from the previous step, human has id 0 and occurs once in this document, and computer has id 2 and occurs once; interaction does not appear in the dictionary at all, so it contributes nothing. The result is the document's bag-of-words (vector) representation.
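Printing the vector makes this concrete; assuming the ids described above (human → 0, computer → 2), the output is a list of (word id, count) pairs:
- print(new_vec)  # e.g. [(0, 1), (2, 1)] with the ids assumed above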
5. Build the corpus. In the same way, use the dictionary to convert the token lists of all 9 documents into vectors. The printed result shows that all 9 documents are now in vector form; this collection is the corpus.
- corpus = [dictionary.doc2bow(text) for text in texts]
- print(corpus)
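If the corpus needs to be reused later, gensim can also serialize it to disk; a small sketch (the /tmp path is just an example):
- corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)  # save in Matrix Market format
- corpus_loaded = corpora.MmCorpus('/tmp/corpus.mm')    # stream it back later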
6. Initialize the model.
(1) Use the corpus from the previous step to build a TF-IDF model; with this model, a document's bag-of-words vector can be converted into its TF-IDF representation.
- tfidf = models.TfidfModel(corpus)
-
- test_doc_bow = [(0, 1), (1, 1)]
- print(tfidf[test_doc_bow])
test_doc_bow is test data. Suppose a document's vector representation is [(0, 1), (1, 1)]: it contains two words, one with id 0 and one with id 1 in the dictionary, each occurring once. Converting it with the TF-IDF model gives:
- [(0, 0.7071067811865476), (1, 0.7071067811865476)]
Taking (0, 0.7071067811865476) as an example: the first number 0 is still the word's id, and the second number 0.7071067811865476 is that word's TF-IDF value.
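The value itself is easy to reconstruct: both words occur once and happen to share the same document frequency here, so their raw TF-IDF weights are equal, and gensim L2-normalizes each document vector by default, leaving every component at 1/sqrt(2):
- import math
- print(1 / math.sqrt(2))  # 0.7071067811865476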
(2) In the same way, convert the whole corpus to the TF-IDF representation.
-
- corpus_tfidf = tfidf[corpus]
- for doc in corpus_tfidf:
-     print(doc)
The converted corpus:
- [(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
- [(2, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.3244870206138555), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.44424552527467476)]
- [(1, 0.5710059809418182), (4, 0.4170757362022777), (5, 0.4170757362022777), (8, 0.5710059809418182)]
- [(0, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]
- [(4, 0.45889394536615247), (6, 0.6282580468670046), (7, 0.6282580468670046)]
- [(9, 1.0)]
- [(9, 0.7071067811865475), (10, 0.7071067811865475)]
- [(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
- [(3, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]
7. Create the index. Use the TF-IDF corpus obtained in the previous step to build a similarity index.
-
- index = similarities.MatrixSimilarity(corpus_tfidf)
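MatrixSimilarity keeps the whole index in memory; for larger corpora gensim offers similarities.Similarity instead. The index can also be saved and reloaded; a small sketch (the file path is just an example):
- index.save('/tmp/tfidf.index')
- index = similarities.MatrixSimilarity.load('/tmp/tfidf.index')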
8. Compute the similarities. Convert the query document's vector to TF-IDF and look it up in the index:
- new_vec_tfidf = tfidf[new_vec]
- print(new_vec_tfidf)
-
- sims = index[new_vec_tfidf]
- print(sims)
The final printed result contains the cosine similarity between the test document and each of the 9 documents in the corpus. The highest value, 0.81649655, is against the first document, so the test document is most similar to it.
Test document: Human computer interaction
First document: Human machine interface for lab abc computer applications
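To turn the raw similarity array into a ranking of the corpus documents, the (document index, score) pairs can be sorted; a small sketch:
- ranked = sorted(enumerate(sims), key=lambda item: -item[1])
- for doc_id, score in ranked:
-     print(doc_id, score)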
Complete code:
- from gensim import corpora, models, similarities
- import logging
- from collections import defaultdict
- logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
-
- # the 9 documents
- documents = ["Human machine interface for lab abc computer applications",
- "A survey of user opinion of computer system response time",
- "The EPS user interface management system",
- "System and human system engineering testing of EPS",
- "Relation of user perceived response time to error measurement",
- "The generation of random binary unordered trees",
- "The intersection graph of paths in trees",
- "Graph minors IV Widths of trees and well quasi ordering",
- "Graph minors A survey"]
-
- # 1. tokenize and remove stop words
- stoplist = set('for a of the and to in'.split())
- texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
- print('-----------1----------')
- print(texts)
-
- # 2. count word frequencies and keep only words that occur more than once
- frequency = defaultdict(int)
- for text in texts:
-     for token in text:
-         frequency[token] += 1
- texts = [[token for token in text if frequency[token] > 1] for text in texts]
- print('-----------2----------')
- print(texts)
-
- # 3. build the dictionary (word -> id)
- dictionary = corpora.Dictionary(texts)
- print('-----------3----------')
- print(dictionary.token2id)
-
- # 4. convert the document to compare into a bag-of-words vector
- new_doc = "Human computer interaction"
- new_vec = dictionary.doc2bow(new_doc.lower().split())
- print('-----------4----------')
- print(new_vec)
-
- # 5. build the corpus: bag-of-words vectors for all 9 documents
- corpus = [dictionary.doc2bow(text) for text in texts]
- print('-----------5----------')
- print(corpus)
-
- # 6. train the TF-IDF model and test it on a toy vector
- tfidf = models.TfidfModel(corpus)
- test_doc_bow = [(0, 1), (1, 1)]
- print('-----------6----------')
- print(tfidf[test_doc_bow])
-
- # convert the whole corpus to TF-IDF
- print('-----------7----------')
- corpus_tfidf = tfidf[corpus]
- for doc in corpus_tfidf:
-     print(doc)
-
- # 7. build the similarity index
- index = similarities.MatrixSimilarity(corpus_tfidf)
-
- # 8. convert the query to TF-IDF and compute cosine similarities
- print('-----------8----------')
- new_vec_tfidf = tfidf[new_vec]
- print(new_vec_tfidf)
-
- print('-----------9----------')
- sims = index[new_vec_tfidf]
- print(sims)