Python: Finding Similar Documents on Google+
Source: Internet  Editor: 程序博客网  Date: 2024/05/16 00:58
CODE:
#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
Created on 2014-9-10
@author: guaguastd
@name: find_similiar_document.py
'''

# Finding similar documents using cosine similarity
import json

import nltk
import nltk.cluster.util

# Load in human language data from wherever you've saved it
DATA = r'E:\eclipse\Google\dFile\107033731246200681024.json'
with open(DATA) as f:
    data = json.load(f)

# Only consider posts whose content is longer than 1000 characters
data = [post for post in data
        if len(post['object']['content']) > 1000]

# Tokenize each post into a list of lowercase terms
all_posts = [post['object']['content'].lower().split() for post in data]

# Provides tf, idf, and tf_idf abstractions for scoring
tc = nltk.TextCollection(all_posts)

# Compute a term-document matrix
td_matrix = {}
for idx in range(len(all_posts)):
    post = all_posts[idx]
    fdist = nltk.FreqDist(post)

    doc_title = data[idx]['title']
    url = data[idx]['url']
    td_matrix[(doc_title, url)] = {}

    for term in fdist:
        td_matrix[(doc_title, url)][term] = tc.tf_idf(term, post)

# Build vectors such that term scores are in the same positions...
distances = {}
for (title1, url1) in td_matrix.keys():

    distances[(title1, url1)] = {}
    (min_dist, most_similar) = (1.0, ('', ''))

    for (title2, url2) in td_matrix.keys():

        # Take care not to mutate the original data structures
        # since we're in a loop and need the originals multiple times
        terms1 = td_matrix[(title1, url1)].copy()
        terms2 = td_matrix[(title2, url2)].copy()

        # Fill in gaps in each map so vectors of the same length can be computed
        for term1 in terms1:
            if term1 not in terms2:
                terms2[term1] = 0

        for term2 in terms2:
            if term2 not in terms1:
                terms1[term2] = 0

        # Create vectors from term maps; sorting by term aligns the positions
        v1 = [score for (term, score) in sorted(terms1.items())]
        v2 = [score for (term, score) in sorted(terms2.items())]

        # Compute the cosine distance between the two documents
        distances[(title1, url1)][(title2, url2)] = \
            nltk.cluster.util.cosine_distance(v1, v2)

        # Skip comparing a document with itself
        if url1 == url2:
            continue

        if distances[(title1, url1)][(title2, url2)] < min_dist:
            (min_dist, most_similar) = (distances[(title1, url1)][(title2, url2)],
                                        (title2, url2))

    # The reported score is the cosine similarity, i.e. 1 - cosine distance
    print('''Most similar to %s (%s)
\t%s (%s)
\tscore %f
''' % (title1, url1, most_similar[0], most_similar[1], 1 - min_dist))
RESULT:
Most similar to Great talk by Maciej Ceglowski. Funny, smart, and with an important message. Just like Maciej all ... (https://plus.google.com/107033731246200681024/posts/b17bWhGfkH3)
	Journalism vs. Punditry: NPR's Kelly McEvers on Why Reporting MattersThere was a great segment on ... (https://plus.google.com/107033731246200681024/posts/NGZmQLE392X)
	score 0.056840

Most similar to Journalism vs. Punditry: NPR's Kelly McEvers on Why Reporting MattersThere was a great segment on ... (https://plus.google.com/107033731246200681024/posts/NGZmQLE392X)
	Great talk by Maciej Ceglowski. Funny, smart, and with an important message. Just like Maciej all ... (https://plus.google.com/107033731246200681024/posts/b17bWhGfkH3)
	score 0.056840

Most similar to The Myth of the Spoiled ChildThere is an interesting op-ed in the NY Times by Alfie Cohn, who has ... (https://plus.google.com/107033731246200681024/posts/c1f9KVXsivD)
	How to Raise Moral ChildrenI thought this article on child-raising had a lot of good ideas in it. ... (https://plus.google.com/107033731246200681024/posts/NVZVmG1ct6C)
	score 0.064629

Most similar to Why Common Core is Like Healthcare.govDraw a bold line between this piece on the failure of the Common... (https://plus.google.com/107033731246200681024/posts/XebEgwjhV35)
	"We don't need new policies. We need better implementation."Last night, I hosted Oakland City Councilor... (https://plus.google.com/107033731246200681024/posts/M1kH7bErNDm)
	score 0.067829

Most similar to "We don't need new policies. We need better implementation."Last night, I hosted Oakland City Councilor... (https://plus.google.com/107033731246200681024/posts/M1kH7bErNDm)
	Why Common Core is Like Healthcare.govDraw a bold line between this piece on the failure of the Common... (https://plus.google.com/107033731246200681024/posts/XebEgwjhV35)
	score 0.067829

Most similar to +Maria Konnikova's NY Times article about the role of time and attention scarcity in the cycle of poverty... (https://plus.google.com/107033731246200681024/posts/4qHJZJU6Dtb)
	How to Raise Moral ChildrenI thought this article on child-raising had a lot of good ideas in it. ... (https://plus.google.com/107033731246200681024/posts/NVZVmG1ct6C)
	score 0.046450

Most similar to How to Raise Moral ChildrenI thought this article on child-raising had a lot of good ideas in it. ... (https://plus.google.com/107033731246200681024/posts/NVZVmG1ct6C)
	The Myth of the Spoiled ChildThere is an interesting op-ed in the NY Times by Alfie Cohn, who has ... (https://plus.google.com/107033731246200681024/posts/c1f9KVXsivD)
	score 0.064629
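
SKETCH:
The script pads both term maps with zeros before handing them to NLTK. As a rough illustration of what the score measures, the same idea can be written in plain Python on sparse dictionaries, where missing terms simply contribute nothing to the dot product. This is only a minimal sketch and not the script's own method: the helper names tf_idf_vectors and cosine_similarity, the toy documents, and the weighting (tf = count / document length, idf = log(N / document frequency)) are all assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: tf-idf} dict per tokenized document.

    Assumed weighting for this sketch:
    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # Number of documents in which each term appears at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(v1, v2):
    """Cosine similarity of two sparse {term: score} vectors."""
    # A term missing from either vector has score 0, so only the
    # terms present in both vectors affect the dot product.
    dot = sum(score * v2[term] for term, score in v1.items() if term in v2)
    norm1 = math.sqrt(sum(s * s for s in v1.values()))
    norm2 = math.sqrt(sum(s * s for s in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical toy corpus, already lowercased and split into terms
docs = [
    "the cat sat on the mat".split(),
    "the cat lay on the rug".split(),
    "stock prices fell sharply today".split(),
]
vectors = tf_idf_vectors(docs)

for i, vi in enumerate(vectors):
    # Pick the best match among the other documents
    j = max((k for k in range(len(vectors)) if k != i),
            key=lambda k: cosine_similarity(vi, vectors[k]))
    print("doc %d is most similar to doc %d (score %f)"
          % (i, j, cosine_similarity(vi, vectors[j])))
```

Working on sparse dictionaries avoids copying and padding both term maps for every pair of documents, which matters once the vocabulary grows. A score near 0, like the values in the result above, means the two posts share very little weighted vocabulary.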