1 PageRank

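The PageRank algorithm computes a PageRank value for every web page and then ranks the pages by importance according to that value. The idea is to simulate the behavior of a web surfer and estimate the probability that the surfer is on each page; that probability measures the importance of the corresponding page. Detailed introductions to the algorithm are easy to find online (for example, the Baidu Baike entry).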

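2 The simplest model of PageRank

The web pages on the Internet can be viewed as a directed graph in which each page is a node, and a directed edge AB exists whenever page A links to page B. The following is a simple example.

The example has four pages, A, B, C, and D (a deliberately small graph). A surfer currently on page A clicks through to B, C, or D with probability 1/3 each, where 3 is the number of out-links of A. In general, if a page has k out-links, the probability of following any one of them is 1/k. Likewise, D moves to B or C with probability 1/2 each, while the probability of moving from B to C is 0. The surfer's jump probabilities are usually written as a transition matrix. With n pages, the transition matrix M is an n×n square matrix: if page j has k out-links, then M[i][j] = 1/k for every page i that one of those links points to, and M[i][j] = 0 for all other pages. With rows and columns ordered A, B, C, D, the transition matrix of the example graph is:

         A     B     C     D
    A    0    1/2    1     0
    B   1/3    0     0    1/2
    C   1/3    0     0    1/2
    D   1/3   1/2    0     0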
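
The rule M[i][j] = 1/k can be sketched in a few lines of Java. This is only an illustration, not part of the MapReduce implementation described later. The class name TransitionMatrix, the index mapping A=0, B=1, C=2, D=3, and the edge list (inferred from the probabilities quoted in this section) are assumptions made for the example.

```java
import java.util.Arrays;

class TransitionMatrix {
    /**
     * Builds the column-stochastic matrix M from out-link lists.
     * outLinks[j] holds the indices of the pages that page j links to.
     */
    static double[][] build(int[][] outLinks) {
        int n = outLinks.length;
        double[][] m = new double[n][n];
        for (int j = 0; j < n; j++) {
            int k = outLinks[j].length;        // number of out-links of page j
            for (int i : outLinks[j]) {
                m[i][j] = 1.0 / k;             // each target i receives 1/k
            }
        }
        return m;
    }

    public static void main(String[] args) {
        // Assumed edges: A -> B, C, D; B -> A, D; C -> A; D -> B, C.
        int[][] outLinks = {
            {1, 2, 3},
            {0, 3},
            {0},
            {1, 2}
        };
        for (double[] row : build(outLinks)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

A page with no out-links simply leaves its column at zero, which is the situation discussed in the dead-end section.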


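Suppose the surfer is initially equally likely to be on any page, that is, with probability 1/n. The initial distribution is then an n-dimensional column vector V0 whose entries are all 1/n. Multiplying the transition matrix M on the right by V0 gives the surfer's distribution after the first step, V1 = MV0; the product (n×n)×(n×1) is again an n×1 matrix. For the example, V1 = MV0 = [9/24, 5/24, 5/24, 5/24]^T.

Note that a nonzero M[i][j] means there is a link from j to i. Multiplying the first row of M by V0 adds up the probability of arriving at page A from every page, which gives 9/24. Once V1 is known, multiplying M on the right by V1 gives V2, and so on. Eventually V converges, that is, Vn = MV(n-1). For the example graph above, repeated iteration gives V = [3/9, 2/9, 2/9, 2/9]^T.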
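
The iteration V <- MV is easy to try on the four-page example. The sketch below is a plain in-memory version. The class name PowerIteration and the fixed step count are choices made for illustration, and the matrix entries are inferred from the probabilities quoted in the text.

```java
import java.util.Arrays;

class PowerIteration {
    /** Returns M * v for an n x n matrix M and a length-n vector v. */
    static double[] multiply(double[][] m, double[] v) {
        int n = v.length;
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                out[i] += m[i][j] * v[j];
            }
        }
        return out;
    }

    /** Starts from the uniform vector and applies V <- M V the given number of times. */
    static double[] iterate(double[][] m, int steps) {
        double[] v = new double[m.length];
        Arrays.fill(v, 1.0 / m.length);
        for (int s = 0; s < steps; s++) {
            v = multiply(m, v);
        }
        return v;
    }

    public static void main(String[] args) {
        // Rows and columns ordered A, B, C, D.
        double[][] m = {
            {0,       0.5, 1, 0  },
            {1.0 / 3, 0,   0, 0.5},
            {1.0 / 3, 0,   0, 0.5},
            {1.0 / 3, 0.5, 0, 0  }
        };
        System.out.println(Arrays.toString(iterate(m, 1)));   // approximately [9/24, 5/24, 5/24, 5/24]
        System.out.println(Arrays.toString(iterate(m, 100))); // approximately [3/9, 2/9, 2/9, 2/9]
    }
}
```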

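3 The dead-end problem

The pages on the Internet do not form a strongly connected graph, because some pages link to no other page. Under the computation above, a surfer who reaches such a page stops there, so the transition probability accumulated up to that point is wiped out. As this continues, nearly every element of the resulting distribution vector ends up at 0. Suppose we drop the link from C to A in the graph above, so that C becomes a dead end; column C of the transition matrix then contains only zeros.

Iterating repeatedly, all elements eventually go to 0.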
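
The leak can be observed numerically. The sketch below reuses the four-page example with the C to A link removed and tracks how much total probability is left after a number of steps. The class name DeadEndDemo and the constant DEAD_END are hypothetical names chosen for the illustration.

```java
import java.util.Arrays;

class DeadEndDemo {
    // Same example graph, but C has no out-links, so column C is all zeros.
    static final double[][] DEAD_END = {
        {0,       0.5, 0, 0  },
        {1.0 / 3, 0,   0, 0.5},
        {1.0 / 3, 0,   0, 0.5},
        {1.0 / 3, 0.5, 0, 0  }
    };

    /** Total probability remaining after applying V <- M V the given number of times. */
    static double totalMassAfter(double[][] m, int steps) {
        int n = m.length;
        double[] v = new double[n];
        Arrays.fill(v, 1.0 / n);
        for (int s = 0; s < steps; s++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    next[i] += m[i][j] * v[j];
                }
            }
            v = next;
        }
        double total = 0.0;
        for (double x : v) {
            total += x;
        }
        return total;
    }

    public static void main(String[] args) {
        for (int steps : new int[] {1, 10, 100}) {
            System.out.println(steps + " steps: total mass " + totalMassAfter(DEAD_END, steps));
        }
    }
}
```

After one step the total is already 3/4, because the quarter of the mass that started on C has nowhere to go.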

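4 The spider-trap problem

Once the surfer jumps to page C, it is like falling into a trap or being caught in a whirlpool: the surfer can never get out of C again. Eventually all of the probability mass moves to C, the probabilities of all other pages become 0, and the page ranking loses its meaning. The transition matrix for the graph above is: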
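
The effect can be checked with a small experiment. The sketch below assumes that the trap is formed by giving C a single out-link that points back to C itself, with the rest of the four-page example unchanged. That edge, the class name SpiderTrapDemo, and the constant TRAP are assumptions made for illustration.

```java
import java.util.Arrays;

class SpiderTrapDemo {
    // Assumed trap: C links only to itself, so column C is (0, 0, 1, 0).
    static final double[][] TRAP = {
        {0,       0.5, 0, 0  },
        {1.0 / 3, 0,   0, 0.5},
        {1.0 / 3, 0,   1, 0.5},
        {1.0 / 3, 0.5, 0, 0  }
    };

    /** Starts from the uniform vector and applies V <- M V the given number of times. */
    static double[] iterate(double[][] m, int steps) {
        int n = m.length;
        double[] v = new double[n];
        Arrays.fill(v, 1.0 / n);
        for (int s = 0; s < steps; s++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    next[i] += m[i][j] * v[j];
                }
            }
            v = next;
        }
        return v;
    }

    public static void main(String[] args) {
        // The third entry (page C) approaches 1 while the others approach 0.
        System.out.println(Arrays.toString(iterate(TRAP, 100)));
    }
}
```

Unlike the dead-end case, the total probability stays at 1 here; it is merely concentrated on C.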



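6 Implementing PageRank with MapReduce

The computation above multiplies by the matrix again and again until the distribution vector changes very little from one iteration to the next; it generally converges after about 30 or more iterations. The transition matrix of the real web is enormous: there are now more than 10 billion web pages, so the matrix is 10 billion × 10 billion. We therefore turn to the distributed computation model of MapReduce.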
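
The stopping rule "iterate until the vector barely changes" can be made concrete as follows. This is a single-machine sketch; the L1 distance, the tolerance parameter, and the class name ConvergenceCheck are illustrative choices, not something the MapReduce code below is known to use.

```java
import java.util.Arrays;

class ConvergenceCheck {
    /**
     * Applies V <- M V from the uniform vector until the L1 change between
     * consecutive vectors drops below tol. Returns the number of iterations
     * performed, or maxIter if the tolerance was never reached.
     */
    static int iterationsToConverge(double[][] m, double tol, int maxIter) {
        int n = m.length;
        double[] v = new double[n];
        Arrays.fill(v, 1.0 / n);
        for (int iter = 1; iter <= maxIter; iter++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    next[i] += m[i][j] * v[j];
                }
            }
            double change = 0.0;
            for (int i = 0; i < n; i++) {
                change += Math.abs(next[i] - v[i]);
            }
            v = next;
            if (change < tol) {
                return iter;
            }
        }
        return maxIter;
    }

    public static void main(String[] args) {
        double[][] m = {
            {0,       0.5, 1, 0  },
            {1.0 / 3, 0,   0, 0.5},
            {1.0 / 3, 0,   0, 0.5},
            {1.0 / 3, 0.5, 0, 0  }
        };
        System.out.println("iterations: " + iterationsToConverge(m, 1e-6, 1000));
    }
}
```

In the MapReduce setting each pass of the outer loop corresponds to one full iteration over the graph, which is why the number of iterations matters so much.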

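6.1 The crawled graph data

We store each page of the web graph, together with the pages it links to, as one line. The web graph from Section 4 is then represented as follows: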
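
As a rough sketch of how one such line can be read, the snippet below splits a line on whitespace, takes the first field as the page ID and the remaining fields as its out-links, in the same spirit as the mapper shown in Section 6.2. The class name AdjacencyLine and the sample line "1 2 3 4" are hypothetical; the actual IDs in the data file are not shown here.

```java
import java.util.Arrays;

class AdjacencyLine {
    final int nodeId;
    final int[] neighbors;

    AdjacencyLine(int nodeId, int[] neighbors) {
        this.nodeId = nodeId;
        this.neighbors = neighbors;
    }

    /** Parses "id n1 n2 ..." where all fields are integers separated by whitespace. */
    static AdjacencyLine parse(String line) {
        String[] arr = line.trim().split("\\s+");
        int[] neighbors = new int[arr.length - 1];
        for (int i = 1; i < arr.length; i++) {
            neighbors[i - 1] = Integer.parseInt(arr[i]);
        }
        return new AdjacencyLine(Integer.parseInt(arr[0]), neighbors);
    }

    public static void main(String[] args) {
        AdjacencyLine a = parse("1 2 3 4");   // hypothetical line: page 1 links to 2, 3, 4
        System.out.println(a.nodeId + " -> " + Arrays.toString(a.neighbors));
    }
}
```

A line that contains only an ID yields an empty neighbor array, which corresponds to a page with no out-links.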




6.2 MapReduce implementation

(1) BuildPageRankRecords.java: converts the plain-text graph data into Hadoop Writable records

public class BuildPageRankRecords extends Configured implements Tool {
  private static final Logger LOG = Logger.getLogger(BuildPageRankRecords.class);
  private static final String NODE_CNT_FIELD = "node.cnt";

  private static class MyMapper extends Mapper<LongWritable, Text, IntWritable, PageRankNode> {
    private static final IntWritable nid = new IntWritable();
    private static final PageRankNode node = new PageRankNode();

    public void setup(Mapper<LongWritable, Text, IntWritable, PageRankNode>.Context context) {
      int n = context.getConfiguration().getInt(NODE_CNT_FIELD, 0);
      if (n == 0) {
        throw new RuntimeException(NODE_CNT_FIELD + " cannot be 0!");
      }
      node.setType(PageRankNode.Type.Complete);
      // PageRank is stored as a log probability; every node starts at log(1/n).
      node.setPageRank((float) -StrictMath.log(n));
    }

    public void map(LongWritable key, Text t, Context context) throws IOException,
        InterruptedException {
      String[] arr = t.toString().trim().split("\\s+");
      nid.set(Integer.parseInt(arr[0]));
      if (arr.length == 1) {
        node.setNodeId(Integer.parseInt(arr[0]));
        node.setAdjacencyList(new ArrayListOfIntsWritable());
      } else {
        node.setNodeId(Integer.parseInt(arr[0]));
        int[] neighbors = new int[arr.length - 1];
        for (int i = 1; i < arr.length; i++) {
          neighbors[i - 1] = Integer.parseInt(arr[i]);
        }
        node.setAdjacencyList(new ArrayListOfIntsWritable(neighbors));
      }
      context.getCounter("graph", "numNodes").increment(1);
      context.getCounter("graph", "numEdges").increment(arr.length - 1);
      if (arr.length > 1) {
        context.getCounter("graph", "numActiveNodes").increment(1);
      }
      context.write(nid, node);
    }
  }

  // ... (run(), which configures and submits the job, is omitted from this listing)

  public static void main(String[] args) throws Exception {
    ToolRunner.run(new BuildPageRankRecords(), args);
  }
}

(2) PartitionGraph.java: partitions the graph

public class PartitionGraph extends Configured implements Tool {
  private static final Logger LOG = Logger.getLogger(PartitionGraph.class);

  public static void main(String[] args) throws Exception {
    ToolRunner.run(new PartitionGraph(), args);
  }

  public PartitionGraph() {}

  private static final String INPUT = "input";
  private static final String OUTPUT = "output";
  private static final String NUM_NODES = "numNodes";
  private static final String NUM_PARTITIONS = "numPartitions";
  private static final String RANGE = "range";

  public int run(String[] args) throws Exception {
    // ... (option parsing and job configuration are omitted from this listing)
    if (useRange) {
      job.setPartitionerClass(RangePartitioner.class);
    }
    FileSystem.get(conf).delete(new Path(outPath), true);
    job.waitForCompletion(true);
    return 0;
  }
}


(3) RunPageRankBasic.java: runs the PageRank iterations

public class RunPageRankBasic extends Configured implements Tool {
  private static class MapClass extends
      Mapper<IntWritable, PageRankNode, IntWritable, PageRankNode> {
    private static final IntWritable neighbor = new IntWritable();
    private static final PageRankNode intermediateMass = new PageRankNode();
    private static final PageRankNode intermediateStructure = new PageRankNode();

    public void map(IntWritable nid, PageRankNode node, Context context)
        throws IOException, InterruptedException {
      // ... (body omitted from this listing)
    }
  }

  // Mapper with in-mapper combining
  private static class MapWithInMapperCombiningClass extends
      Mapper<IntWritable, PageRankNode, IntWritable, PageRankNode> {
    // Buffers PageRank mass, keyed by destination node
    private static final HMapIF map = new HMapIF();
    private static final PageRankNode intermediateStructure = new PageRankNode();

    public void setup(Context context) throws IOException {
      map.clear();
    }

    public void map(IntWritable nid, PageRankNode node, Context context)
        throws IOException, InterruptedException {
      context.write(nid, intermediateStructure);
      int massMessages = 0;
      int massMessagesSaved = 0;
      // Distribute PageRank mass to neighbors along the out-links
      if (node.getAdjacenyList().size() > 0) {
        // Each neighbor gets an equal share of PageRank mass.
        ArrayListOfIntsWritable list = node.getAdjacenyList();
        // Division by the out-degree becomes a subtraction in log space.
        float mass = node.getPageRank() - (float) StrictMath.log(list.size());
        context.getCounter(PageRank.edges).increment(list.size());
        // ... (omitted from this listing)
      }
      // ... (omitted from this listing)
    }
  }

  private static class CombineClass extends
      Reducer<IntWritable, PageRankNode, IntWritable, PageRankNode> {
    private static final PageRankNode intermediateMass = new PageRankNode();

    public void reduce(IntWritable nid, Iterable<PageRankNode> values, Context context)
        throws IOException, InterruptedException {
      int massMessages = 0;
      // ... (omitted from this listing)
    }
  }

  // Reduce phase
  private static class ReduceClass extends
      Reducer<IntWritable, PageRankNode, IntWritable, PageRankNode> {
    private float totalMass = Float.NEGATIVE_INFINITY;

    public void reduce(IntWritable nid, Iterable<PageRankNode> iterable, Context context)
        throws IOException, InterruptedException {
      Iterator<PageRankNode> values = iterable.iterator();
      // ... (omitted from this listing)
      // Update the node with the accumulated PageRank mass
      node.setPageRank(mass);
      context.getCounter(PageRank.massMessagesReceived).increment(massMessagesReceived);
      // ... (omitted from this listing)
    }

    public void cleanup(Context context) throws IOException {
      Configuration conf = context.getConfiguration();
      String taskId = conf.get("mapred.task.id");
      String path = conf.get("PageRankMassPath");
      Preconditions.checkNotNull(taskId);
      Preconditions.checkNotNull(path);
      // ... (omitted from this listing)
    }
  }

  // Map phase: redistributes the missing PageRank mass and applies the random-jump factor
  private static class MapPageRankMassDistributionClass extends
      Mapper<IntWritable, PageRankNode, IntWritable, PageRankNode> {
    private float missingMass = 0.0f;
    private int nodeCnt = 0;

    public void setup(Context context) throws IOException {
      Configuration conf = context.getConfiguration();
      missingMass = conf.getFloat("MissingMass", 0.0f);
      nodeCnt = conf.getInt("NodeCount", 0);
    }

    public void map(IntWritable nid, PageRankNode node, Context context)
        throws IOException, InterruptedException {
      float p = node.getPageRank();
      float jump = (float) (Math.log(ALPHA) - Math.log(nodeCnt));
      float link = (float) Math.log(1.0f - ALPHA)
          + sumLogProbs(p, (float) (Math.log(missingMass) - Math.log(nodeCnt)));
      p = sumLogProbs(jump, link);
      node.setPageRank(p);
      context.write(nid, node);
    }
  }

  // The PageRank iteration loop
  public int run(String[] args) throws Exception {
    // ... (option parsing omitted from this listing)
    for (int i = s; i < e; i++) {
      iteratePageRank(i, i + 1, basePath, n, useCombiner, useInmapCombiner);
    }
    return 0;
  }

  // Runs one iteration
  private void iteratePageRank(int i, int j, String basePath, int numNodes,
      boolean useCombiner, boolean useInMapperCombiner) throws Exception {
    // Each iteration consists of two phases (two MapReduce jobs).
    // Job 1: distribute PageRank mass along the out-links
    float mass = phase1(i, j, basePath, numNodes, useCombiner, useInMapperCombiner);
    float missing = 1.0f - (float) StrictMath.exp(mass);
    // Job 2: redistribute the missing mass and account for the random-jump factor
    phase2(i, j, missing, basePath, numNodes);
  }

  private void phase2(int i, int j, float missing, String basePath, int numNodes)
      throws Exception {
    Job job = Job.getInstance(getConf());
    job.setJobName("PageRank:Basic:iteration" + j + ":Phase2");
    job.setJarByClass(RunPageRankBasic.class);
    // ... (job configuration and execution omitted from this listing)
    System.out.println("Job Finished in " + (System.currentTimeMillis() - startTime) / 1000.0
        + " seconds");
  }

  // Adds two probabilities that are given as log probabilities
  private static float sumLogProbs(float a, float b) {
    if (a == Float.NEGATIVE_INFINITY) return b;
    if (b == Float.NEGATIVE_INFINITY) return a;
    if (a < b) {
      return (float) (b + StrictMath.log1p(StrictMath.exp(a - b)));
    }
    return (float) (a + StrictMath.log1p(StrictMath.exp(b - a)));
  }
}
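
The listing keeps every PageRank value as a log probability, so products become sums and sums go through sumLogProbs. The sketch below isolates that arithmetic so it can be run on its own. It suggests that the phase-2 update corresponds to p' = ALPHA/n + (1 - ALPHA) * (p + missing/n) in ordinary probabilities. The class name LogSpaceUpdate, the method name redistribute, and the value 0.15 for ALPHA are assumptions, since the constant's definition does not appear in the listing.

```java
class LogSpaceUpdate {
    static final float ALPHA = 0.15f;  // assumed value; not shown in the listing

    /** Returns log(exp(a) + exp(b)) without leaving log space. */
    static float sumLogProbs(float a, float b) {
        if (a == Float.NEGATIVE_INFINITY) return b;
        if (b == Float.NEGATIVE_INFINITY) return a;
        if (a < b) {
            return (float) (b + StrictMath.log1p(StrictMath.exp(a - b)));
        }
        return (float) (a + StrictMath.log1p(StrictMath.exp(b - a)));
    }

    /** Log of ALPHA/n + (1 - ALPHA) * (p + missingMass/n), where logP = log(p). */
    static float redistribute(float logP, float missingMass, int nodeCnt) {
        float jump = (float) (Math.log(ALPHA) - Math.log(nodeCnt));
        float link = (float) Math.log(1.0f - ALPHA)
            + sumLogProbs(logP, (float) (Math.log(missingMass) - Math.log(nodeCnt)));
        return sumLogProbs(jump, link);
    }

    public static void main(String[] args) {
        float logP = (float) Math.log(0.2);
        double viaLogs = Math.exp(redistribute(logP, 0.1f, 4));
        double direct = ALPHA / 4 + (1 - ALPHA) * (0.2 + 0.1 / 4);
        // The two values should agree up to float rounding.
        System.out.println(viaLogs + " vs " + direct);
    }
}
```

Working in log space avoids underflow when individual probabilities become very small, which is plausible on a graph with billions of nodes.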

(4) FindMaxPageRankNodes.java: sorts the nodes to obtain the final ranking by weight

public class FindMaxPageRankNodes extends Configured implements Tool {
  private static final Logger LOG = Logger.getLogger(FindMaxPageRankNodes.class);

  private static class MyMapper extends
      Mapper<IntWritable, PageRankNode, IntWritable, FloatWritable> {
    private TopScoredObjects<Integer> queue;

    public void setup(Context context) throws IOException {
      int k = context.getConfiguration().getInt("n", 100);
      queue = new TopScoredObjects<>(k);
    }

    public void map(IntWritable nid, PageRankNode node, Context context) throws IOException,
        InterruptedException {
      queue.add(node.getNodeId(), node.getPageRank());
    }
    // ... (remaining methods omitted from this listing)
  }

  private static class MyReducer extends
      Reducer<IntWritable, FloatWritable, IntWritable, Text> {
    private static TopScoredObjects<Integer> queue;

    public void setup(Context context) throws IOException {
      int k = context.getConfiguration().getInt("n", 100);
      queue = new TopScoredObjects<Integer>(k);
    }

    public void reduce(IntWritable nid, Iterable<FloatWritable> iterable, Context context)
        throws IOException {
      Iterator<FloatWritable> iter = iterable.iterator();
      queue.add(nid.get(), iter.next().get());
      // Each node ID is expected to arrive with exactly one score.
      if (iter.hasNext()) {
        throw new RuntimeException();
      }
    }
    // ... (remaining methods omitted from this listing)
  }

  public FindMaxPageRankNodes() {
  }

  private static final String INPUT = "input";
  private static final String OUTPUT = "output";
  private static final String TOP = "top";

  @SuppressWarnings({ "static-access" })
  public int run(String[] args) throws Exception {
    Options options = new Options();
    // ... (option definitions and command-line parsing omitted from this listing)
    if (!cmdline.hasOption(INPUT) || !cmdline.hasOption(OUTPUT) || !cmdline.hasOption(TOP)) {
      System.out.println("args: " + Arrays.toString(args));
      HelpFormatter formatter = new HelpFormatter();
      formatter.setWidth(120);
      formatter.printHelp(this.getClass().getName(), options);
      ToolRunner.printGenericCommandUsage(System.out);
      return -1;
    }
    // ... (job configuration and execution omitted from this listing)
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new FindMaxPageRankNodes(), args);
    System.exit(res);
  }
}
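
The listing relies on a TopScoredObjects queue whose source is not shown. As a rough stand-in for the idea of keeping only the k best-scoring nodes, the sketch below uses a bounded min-heap from the standard library. The class name TopK and its method are hypothetical and are not the API of the class used above.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopK {
    /** Returns the IDs of the k highest-scoring nodes, best first. */
    static List<Integer> topK(Map<Integer, Float> scores, int k) {
        // Min-heap on score: the weakest of the current top k sits at the head.
        PriorityQueue<Map.Entry<Integer, Float>> heap =
            new PriorityQueue<Map.Entry<Integer, Float>>(Math.max(1, k),
                new Comparator<Map.Entry<Integer, Float>>() {
                    public int compare(Map.Entry<Integer, Float> x, Map.Entry<Integer, Float> y) {
                        return Float.compare(x.getValue(), y.getValue());
                    }
                });
        for (Map.Entry<Integer, Float> e : scores.entrySet()) {
            heap.add(e);
            if (heap.size() > k) {
                heap.poll();               // drop the current minimum
            }
        }
        LinkedList<Integer> out = new LinkedList<Integer>();
        while (!heap.isEmpty()) {
            out.addFirst(heap.poll().getKey());   // smallest polled first, so prepend
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, Float> scores = new HashMap<Integer, Float>();
        scores.put(1, 0.1f);
        scores.put(2, 0.4f);
        scores.put(3, 0.3f);
        scores.put(4, 0.2f);
        System.out.println(topK(scores, 2));
    }
}
```

Because the logarithm is monotonic, ranking by log probability gives the same order as ranking by the probabilities themselves.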

7 Runtime environment

VMware Workstation 12.0 (64-bit)
CentOS 6.4 (64-bit)
JDK 1.7.0 (64-bit)
Hadoop 2.6.0 (64-bit, pseudo-distributed configuration)
Eclipse 3.8 (64-bit)

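8 PageRank results

We created the four main class (.java) files used by the algorithm, compiled them, and ran them on Hadoop, which produced the output of the PageRank MapReduce code in Eclipse (the top 20 nodes are shown here). A screenshot of the Eclipse window follows: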