PageRank Spark implementation

Source: Internet  Editor: 程序博客网  Time: 2024/05/16 05:51

  As you know, PageRank is a very famous algorithm. For details of the PageRank definition and implementation, you can refer to


  There are many implementations, and I have written some programs that implemented it before. This time I try to use Spark.

 The basic idea is as follows:

Give pages ranks (or scores) based on the links to them.

>> Links from many pages -> high rank

>> Links from high-rank pages -> high rank


1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_of_p / |neighbors_of_p| to each of its neighbors
3. Set each page's rank to 0.15 + 0.85 * contribs, where contribs is the sum of the contributions the page received
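
To make the three steps concrete, here is a minimal sketch in plain Python, without Spark. The function name `pagerank`, the `damping` parameter and the three-page graph `toy_links` are made up for illustration and are not part of the original programs. The sketch assumes every page has at least one outgoing link and that every link target also appears as a key.

```python
def pagerank(links, iterations=10, damping=0.85):
    """links maps each page to the list of pages it links to.

    Assumption for this sketch: every page has at least one outgoing
    link and every link target is itself a key of `links`.
    """
    # Step 1: start each page at a rank of 1.
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        # Step 2: page p sends rank_of_p / |neighbors_of_p| to each neighbor.
        contribs = {page: 0.0 for page in links}
        for page, neighbors in links.items():
            share = ranks[page] / len(neighbors)
            for dest in neighbors:
                contribs[dest] += share
        # Step 3: new rank = 0.15 + 0.85 * (sum of received contributions).
        ranks = {page: (1 - damping) + damping * contribs[page] for page in links}
    return ranks


# Hypothetical three-page graph: a -> b, a -> c, b -> c, c -> a
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
toy_ranks = pagerank(toy_links)
one_step = pagerank(toy_links, iterations=1)
```

In this toy graph, page c is linked from both a and b, so it should end up ranked above b, which is linked only from a. Under the stated assumptions the ranks always sum to the number of pages.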

The Spark program is short and simple.

Scala code:

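    import org.apache.spark.{SparkConf, SparkContext}

    object PageRank {
      def main(args: Array[String]): Unit = {
        // Assumption: the original code used an existing `sc` (as in spark-shell);
        // here a SparkContext is created explicitly so the program can be submitted on its own.
        val sc = new SparkContext(new SparkConf().setAppName("PageRank"))

        // get link matrix
        val mat = sc.textFile("/home/will/myspace/mydev/mytest/sparktest/scalaspark/data/mat2.txt")

        // get the links of every page node
        val links = mat.map(line => {
          val parts = line.split("\\s+")
          (parts(0), parts(1))
        }).distinct().groupByKey()

        // initialize each page node's rank to 1.0
        var ranks = links.mapValues(v => 1.0)

        // set the number of iterations to 10
        val ITERATIONS = 10

        // compute the page rank of each page node
        for (i <- 0 until ITERATIONS) {
          // each page sends rank / (number of its links) to every page it links to
          val contributions = links.join(ranks).flatMap {
            case (pageId, (links, rank)) => links.map(dest => (dest, rank / links.size))
          }
          ranks = contributions.reduceByKey((x, y) => x + y).mapValues(v => 0.15 + 0.85 * v)
        }

        // print the result (up to 10 pages)
        ranks.take(10).foreach(println)

        sc.stop()
      }
    }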

Python code:

    # In[1]:
    filename1 = "/home/will/myspace/mydev/mytest/sparktest/scalaspark/data/mat2.txt"
    # get link matrix, read with at least 4 partitions
    # (`sc` is the SparkContext provided by the PySpark shell or notebook)
    mat = sc.textFile(filename1, 4)

    # In[3]:
    # the matrix is small, so it can be collected
    mat.collect()

    # In[4]:
    import re

    def parseNeighbors(urls):
        """Parses a urls pair string into urls pair."""
        parts = re.split(r'\s+', urls)
        return parts[0], parts[1]

    # In[5]:
    def computeContribs(urls, rank):
        """Calculates URL contributions to the rank of other URLs."""
        num_urls = len(urls)
        for url in urls:
            yield (url, rank / num_urls)

    # In[6]:
    # get the links of every page node
    links = mat.map(lambda urls: parseNeighbors(urls)).distinct().groupByKey().cache()
    # the ways of getting links below led to an out-of-memory error
    #links = mat.map(lambda line: line.split(" ")).map(lambda l: (l[0], l[1])).distinct().groupByKey().cache()#.mapValues(list).collect()
    #links = mat.map(lambda line: (line.split(" ")[0], line.split(" ")[1])).distinct().groupByKey().cache()

    # In[7]:
    links.count()

    # In[8]:
    # Loads all URLs that link to other URL(s) from the input file and initializes their ranks to one.
    ranks = links.map(lambda url_neighbors: (url_neighbors[0], 1.0))

    # In[12]:
    # compute pagerank
    from operator import add

    ITERATIONS = 10
    # Calculates and updates URL ranks continuously using the PageRank algorithm.
    for iteration in range(ITERATIONS):
        # Calculates URL contributions to the rank of other URLs.
        # Each joined record has the form (url, (urls, rank)).
        contribs = links.join(ranks).flatMap(
            lambda record: computeContribs(record[1][0], record[1][1]))
        # Re-calculates URL ranks based on neighbor contributions.
        ranks = contribs.reduceByKey(add).mapValues(lambda rank: rank * 0.85 + 0.15)

    # In[13]:
    ranks.count()

    # In[14]:
    # print the result
    for (link, rank) in ranks.collect():
        print("%s has rank: %s." % (link, rank))

The result of running the program is:

0 has rank: 0.772702281464.
6 has rank: 0.56251510134.
1 has rank: 1.72864431597.
7 has rank: 0.56251510134.
2 has rank: 1.14027517155.
8 has rank: 0.59949206817.
3 has rank: 0.970068542695.
9 has rank: 1.45593564966.
4 has rank: 1.23778322511.
5 has rank: 0.970068542695.

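The input file lists the links of the graph, one per line, as a pair of page IDs (source page, then destination page):

0 1
1 2
1 2
1 3
1 3
1 4
2 3
3 0
4 0
4 2
5 1
1 5
6 4
4 5
4 3
2 4
2 5
7 8
8 1
4 8
9 2
2 9
3 9
5 9
7 9
9 6
9 7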
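
As a rough illustration of what the programs do with this file before iterating, the sketch below parses a few edge lines into an adjacency map in plain Python. It plays the role of the split, distinct and groupByKey steps, which is why repeated lines such as "1 2" count only once. The helper name `parse_edges` and the variable `sample_lines` are hypothetical, and the sample holds only the first six edges of the input.

```python
from collections import defaultdict


def parse_edges(lines):
    """Turn 'src dst' lines into {src: sorted list of distinct dsts}."""
    neighbors = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # a set drops duplicate edges, like distinct() in the Spark code
        neighbors[parts[0]].add(parts[1])
    return {src: sorted(dsts) for src, dsts in neighbors.items()}


# First six edges of the input file; "1 2" and "1 3" each appear twice.
sample_lines = ["0 1", "1 2", "1 2", "1 3", "1 3", "1 4"]
links = parse_edges(sample_lines)
```

For this sample, page 1 ends up with three distinct neighbors (2, 3 and 4), so on each iteration it would pass one third of its rank to each of them.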
