Python: Finding Similar Google+ Documents

Source: Internet  Editor: 程序博客网  Time: 2024/05/16 00:58

CODE:

#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
Created on 2014-9-10
@author: guaguastd
@name: find_similiar_document.py
'''

# Finding similar documents using cosine similarity
import json
import nltk
import nltk.cluster.util

# Load in human language data from wherever you've saved it
DATA = r'E:\eclipse\Google\dFile\107033731246200681024.json'
with open(DATA) as f:
    data = json.load(f)

# Only consider posts whose content is longer than 1000 characters
data = [post for post in data
        if len(post['object']['content']) > 1000]

# Tokenize each post naively: lowercase, then split on whitespace
all_posts = [post['object']['content'].lower().split()
             for post in data]

# Provides tf, idf, and tf_idf abstractions for scoring
tc = nltk.TextCollection(all_posts)

# Compute a term-document matrix
td_matrix = {}
for idx in range(len(all_posts)):
    post = all_posts[idx]
    fdist = nltk.FreqDist(post)

    doc_title = data[idx]['title']
    url = data[idx]['url']
    td_matrix[(doc_title, url)] = {}

    # Iterating over the FreqDist yields each distinct term once
    for term in fdist:
        td_matrix[(doc_title, url)][term] = tc.tf_idf(term, post)

# Build vectors such that term scores are in the same positions...
distances = {}
for (title1, url1) in td_matrix.keys():

    distances[(title1, url1)] = {}
    (min_dist, most_similar) = (1.0, ('', ''))

    for (title2, url2) in td_matrix.keys():

        # Take care not to mutate the original data structures
        # since we're in a loop and need the originals multiple times
        terms1 = td_matrix[(title1, url1)].copy()
        terms2 = td_matrix[(title2, url2)].copy()

        # Fill in gaps in each map so vectors of the same length can be computed
        for term1 in terms1:
            if term1 not in terms2:
                terms2[term1] = 0

        for term2 in terms2:
            if term2 not in terms1:
                terms1[term2] = 0

        # Create vectors from term maps
        v1 = [score for (term, score) in sorted(terms1.items())]
        v2 = [score for (term, score) in sorted(terms2.items())]

        # Compute similarity amongst documents
        distances[(title1, url1)][(title2, url2)] = \
            nltk.cluster.util.cosine_distance(v1, v2)

        # Skip the comparison of a document with itself
        if url1 == url2:
            continue

        if distances[(title1, url1)][(title2, url2)] < min_dist:
            (min_dist, most_similar) = (distances[(title1, url1)][(title2, url2)],
                                        (title2, url2))

    # The reported score is cosine similarity, i.e. 1 - cosine distance
    print('''Most similar to %s (%s)\t%s (%s)\tscore %f''' %
          (title1, url1, most_similar[0], most_similar[1], 1 - min_dist))
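
The script above pads both term maps with zeros and sorts them so that nltk.cluster.util.cosine_distance receives two aligned vectors. As a minimal sketch of the same idea without NLTK, the cosine similarity can be computed directly on the sparse maps, because only terms present in both documents contribute to the dot product. The function name cosine_similarity and the sample weights below are made up for illustration; they are not part of the original script.

```python
import math


def cosine_similarity(terms1, terms2):
    """Cosine similarity between two sparse {term: weight} maps."""
    # Terms missing from either map would be multiplied by zero,
    # so only the shared terms matter for the dot product.
    shared = set(terms1) & set(terms2)
    dot = sum(terms1[t] * terms2[t] for t in shared)
    norm1 = math.sqrt(sum(w * w for w in terms1.values()))
    norm2 = math.sqrt(sum(w * w for w in terms2.values()))
    if norm1 == 0 or norm2 == 0:
        # Convention chosen for this sketch: an empty vector matches nothing
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical tf-idf weights, invented for this example
doc_a = {'python': 0.5, 'nltk': 0.5}
doc_b = {'python': 0.5, 'json': 0.5}
doc_c = {'cooking': 1.0}

sim_ab = cosine_similarity(doc_a, doc_b)  # one of two terms shared
sim_ac = cosine_similarity(doc_a, doc_c)  # no terms shared
print(sim_ab, sim_ac)
```

For the same term maps this should agree with 1 minus the cosine distance used in the script, while skipping the copy, zero-fill, and sort steps inside the inner loop.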

RESULT:

Most similar to Great talk by Maciej Ceglowski.  Funny, smart, and with an important message.  Just like Maciej all ... (https://plus.google.com/107033731246200681024/posts/b17bWhGfkH3)	Journalism vs. Punditry: NPR's Kelly McEvers on Why Reporting Matters There was a great segment on ... (https://plus.google.com/107033731246200681024/posts/NGZmQLE392X)	score 0.056840
Most similar to Journalism vs. Punditry: NPR's Kelly McEvers on Why Reporting Matters There was a great segment on ... (https://plus.google.com/107033731246200681024/posts/NGZmQLE392X)	Great talk by Maciej Ceglowski.  Funny, smart, and with an important message.  Just like Maciej all ... (https://plus.google.com/107033731246200681024/posts/b17bWhGfkH3)	score 0.056840
Most similar to The Myth of the Spoiled Child There is an interesting op-ed in the NY Times by Alfie Cohn, who has ... (https://plus.google.com/107033731246200681024/posts/c1f9KVXsivD)	How to Raise Moral Children I thought this article on child-raising had a lot of good ideas in it. ... (https://plus.google.com/107033731246200681024/posts/NVZVmG1ct6C)	score 0.064629
Most similar to Why Common Core is Like Healthcare.gov Draw a bold line between this piece on the failure of the Common... (https://plus.google.com/107033731246200681024/posts/XebEgwjhV35)	"We don't need new policies. We need better implementation." Last night, I hosted Oakland City Councilor... (https://plus.google.com/107033731246200681024/posts/M1kH7bErNDm)	score 0.067829
Most similar to "We don't need new policies. We need better implementation." Last night, I hosted Oakland City Councilor... (https://plus.google.com/107033731246200681024/posts/M1kH7bErNDm)	Why Common Core is Like Healthcare.gov Draw a bold line between this piece on the failure of the Common... (https://plus.google.com/107033731246200681024/posts/XebEgwjhV35)	score 0.067829
Most similar to +Maria Konnikova's NY Times article about the role of time and attention scarcity in the cycle of poverty... (https://plus.google.com/107033731246200681024/posts/4qHJZJU6Dtb)	How to Raise Moral Children I thought this article on child-raising had a lot of good ideas in it. ... (https://plus.google.com/107033731246200681024/posts/NVZVmG1ct6C)	score 0.046450
Most similar to How to Raise Moral Children I thought this article on child-raising had a lot of good ideas in it. ... (https://plus.google.com/107033731246200681024/posts/NVZVmG1ct6C)	The Myth of the Spoiled Child There is an interesting op-ed in the NY Times by Alfie Cohn, who has ... (https://plus.google.com/107033731246200681024/posts/c1f9KVXsivD)	score 0.064629
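
The output above shows that the "most similar" relation is not always mutual: the +Maria Konnikova post points to "How to Raise Moral Children", while that post in turn points to "The Myth of the Spoiled Child". The sketch below reproduces this effect on a small table of pairwise scores. Two scores are copied from the output; the third pair does not appear there, so its value is a made-up placeholder. The short labels and the helper name nearest_neighbours are introduced here only for illustration.

```python
def nearest_neighbours(pair_scores):
    """Map each document to (best other document, score)."""
    best = {}
    for (a, b), score in pair_scores.items():
        # Each unordered pair is stored once, so update both directions.
        for doc, other in ((a, b), (b, a)):
            if doc not in best or score > best[doc][1]:
                best[doc] = (other, score)
    return best


pair_scores = {
    ('konnikova', 'moral_children'): 0.046450,      # from the output above
    ('moral_children', 'spoiled_child'): 0.064629,  # from the output above
    ('konnikova', 'spoiled_child'): 0.03,           # hypothetical value
}

best = nearest_neighbours(pair_scores)
for doc in sorted(best):
    print('%s -> %s (%f)' % (doc, best[doc][0], best[doc][1]))
```

Cosine similarity itself is symmetric, which is why the two mutual pairs in the output report identical scores in both directions; only the choice of the single best match can differ between the two documents.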

