PageRank Algorithm: A Single-Machine Python Implementation

Source: Internet  Published: 三分算法  Editor: 程序博客网  Time: 2024/05/16 14:40

This was a programming assignment for a course on mining massive datasets.

The task: implement PageRank and compute the final rank value of a given web page. The data set is provided by Google.
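As a rough sketch of the update that the code in this post performs on each iteration: every page first collects rank from the pages that link to it, and the rank that leaks away is then spread evenly over all pages. The symbols $\beta$, $d_i$ and $\varepsilon$ are notation chosen here for illustration; in the code they correspond to `tax_rate`, `out_degree` and `eps`.

```latex
r'_j = \sum_{i \to j} \beta \, \frac{r^{\mathrm{old}}_i}{d_i},
\qquad
r^{\mathrm{new}}_j = r'_j + \frac{1 - \sum_{k} r'_k}{N}
```

The iteration stops once $|r^{\mathrm{new}}_j - r^{\mathrm{old}}_j| \le \varepsilon$ holds for every page $j$.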

The assignment feedback confirmed that the code is correct. It took 26 iterations to converge, with a total running time of 83 s.

Data link: http://snap.stanford.edu/data/web-Google.txt.gz. The code reads the decompressed file web-Google.txt.
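The node ids in the file are not contiguous, so they have to be remapped to dense indices before they can index a list. Below is a minimal sketch of that step, assuming the header lines start with `#` and each remaining line holds two whitespace-separated integers. The helper name `load_edges` and the sample ids are made up for illustration and are not taken from the data set.

```python
def load_edges(lines):
    """Parse an edge list and remap raw node ids to dense indices 0..n-1."""
    index_of = {}   # raw node id -> dense index, in order of first appearance
    edges = []      # (source index, destination index) pairs
    for line in lines:
        line = line.strip()
        # Assumption: header/comment lines begin with '#'.
        if not line or line.startswith('#'):
            continue
        src, dst = (int(tok) for tok in line.split())
        for node in (src, dst):
            if node not in index_of:
                index_of[node] = len(index_of)
        edges.append((index_of[src], index_of[dst]))
    return edges, index_of


# Hypothetical sample lines, not real rows from web-Google.txt.
sample = [
    "# FromNodeId\tToNodeId",
    "10\t500",
    "10\t99",
    "500\t10",
]
edges, index_of = load_edges(sample)
print(edges)      # [(0, 1), (0, 2), (1, 0)]
print(index_of)   # {10: 0, 500: 1, 99: 2}
```

Since the function accepts any iterable of lines, the compressed download could probably be passed in directly via `gzip.open('web-Google.txt.gz', 'rt')` without decompressing it first.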

Code:

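from math import fabs
from time import time

N = 875713         # number of nodes in web-Google.txt
tax_rate = 0.8     # probability of following a link (the damping factor)
eps = 1e-6         # per-node convergence threshold

r = [1. / N for i in range(N)]     # ranks computed in the current iteration
r2 = [1. / N for i in range(N)]    # ranks from the previous iteration
out_degree = [0 for i in range(N)]
m = [[] for i in range(N * 2)]     # m[y] lists the nodes that link to y
hash_table = [-1 for i in range(N * 2)]   # raw node id -> dense index, -1 if unseen
idx = 0


def hash(x):
    # Map a raw node id to a dense index, assigned in order of first appearance.
    global idx
    if hash_table[x] == -1:
        hash_table[x] = idx
        idx += 1
    return hash_table[x]


data = open('web-Google.txt')
# Skip the four header lines.
data.readline()
data.readline()
data.readline()
data.readline()
for line in data:
    x, y = map(hash, map(int, line.split()))
    out_degree[x] += 1
    m[y].append(x)
print('data loaded')
print('start iterating...')

t = 0
begin = time()
while True:
    # Collect rank from incoming links.
    for i in range(N):
        r[i] = 0
        for in_id in m[i]:
            r[i] += tax_rate * r2[in_id] / out_degree[in_id]
    # Spread the leaked rank (teleportation and dead ends) evenly over all nodes.
    der = 1 - sum(r)
    for i in range(N):
        r[i] += der / N
    # Converged only if no node changed by more than eps.
    tag = 0
    for i in range(N):
        if fabs(r[i] - r2[i]) > eps:
            tag = 1
            break
    for i in range(N):
        r2[i] = r[i]
    t += 1
    if tag == 0:
        break
end = time()

print(r[hash(99)])   # final rank of the page with raw id 99
print('total iteration is %d' % t)
print('total time is %f' % (end - begin))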
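To check the iteration without downloading the full data set, the same update can be tried on a tiny graph. The following is only a sketch: the function name `pagerank`, its parameters, and the 3-page graph are assumptions made for illustration, and it expects edges that already use dense indices 0..n-1.

```python
def pagerank(edges, n, beta=0.8, eps=1e-6):
    """Iterate PageRank on n nodes until no rank changes by more than eps."""
    out_degree = [0] * n
    incoming = [[] for _ in range(n)]   # incoming[j] lists nodes linking to j
    for src, dst in edges:
        out_degree[src] += 1
        incoming[dst].append(src)

    old = [1.0 / n] * n
    iterations = 0
    while True:
        # Rank received over incoming links, scaled by beta.
        new = [sum(beta * old[s] / out_degree[s] for s in incoming[j])
               for j in range(n)]
        # Redistribute whatever rank leaked so that the ranks sum to 1 again.
        leaked = 1.0 - sum(new)
        new = [v + leaked / n for v in new]
        iterations += 1
        if all(abs(a - b) <= eps for a, b in zip(new, old)):
            return new, iterations
        old = new


# Hypothetical 3-page graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
ranks, iterations = pagerank([(0, 1), (0, 2), (1, 2), (2, 0)], n=3)
print(ranks, iterations)
```

In this toy graph page 2 has two incoming links and should end up with the highest rank, while page 1, which only receives half of page 0's rank, should end up with the lowest.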


