Adding an Application to Giraph: the Weakly Connected Components Algorithm

Source: Internet; Editor: 程序博客网; Time: 2024/06/05 07:41

This is original work; please credit the source when reposting. My QQ: 530422429. Corrections and discussion are welcome.

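Purpose: to show by example how to add an application to Giraph. Using the WCC (Weakly Connected Components) algorithm as the example, this post describes how to add a subclass of Vertex, define custom input and output formats, and use a Combiner.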

Background: the Giraph source ships with a WCC algorithm in the class org.apache.giraph.examples.ConnectedComponentsVertex. The code is as follows:

package org.apache.giraph.examples;

import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;

import java.io.IOException;

/**
 * Implementation of the HCC algorithm that identifies connected components and
 * assigns each vertex its "component identifier" (the smallest vertex id
 * in the component)
 *
 * The idea behind the algorithm is very simple: propagate the smallest
 * vertex id along the edges to all vertices of a connected component. The
 * number of supersteps necessary is equal to the length of the maximum
 * diameter of all components + 1
 *
 * The original Hadoop-based variant of this algorithm was proposed by Kang,
 * Charalampos, Tsourakakis and Faloutsos in
 * "PEGASUS: Mining Peta-Scale Graphs", 2010
 *
 * http://www.cs.cmu.edu/~ukang/papers/PegasusKAIS.pdf
 */
@Algorithm(
    name = "Connected components",
    description = "Finds connected components of the graph")
public class ConnectedComponentsVertex extends Vertex<IntWritable,
    IntWritable, NullWritable, IntWritable> {
  /**
   * Propagates the smallest vertex id to all neighbors. Will always choose to
   * halt and only reactivate if a smaller id has been sent to it.
   *
   * @param messages Iterator of messages from the previous superstep.
   * @throws IOException
   */
  @Override
  public void compute(Iterable<IntWritable> messages) throws IOException {
    int currentComponent = getValue().get();

    // First superstep is special, because we can simply look at the neighbors
    if (getSuperstep() == 0) {
      for (Edge<IntWritable, NullWritable> edge : getEdges()) {
        int neighbor = edge.getTargetVertexId().get();
        if (neighbor < currentComponent) {
          currentComponent = neighbor;
        }
      }
      // Only need to send value if it is not the own id
      if (currentComponent != getValue().get()) {
        setValue(new IntWritable(currentComponent));
        for (Edge<IntWritable, NullWritable> edge : getEdges()) {
          IntWritable neighbor = edge.getTargetVertexId();
          if (neighbor.get() > currentComponent) {
            sendMessage(neighbor, getValue());
          }
        }
      }

      voteToHalt();
      return;
    }

    boolean changed = false;
    // did we get a smaller id ?
    for (IntWritable message : messages) {
      int candidateComponent = message.get();
      if (candidateComponent < currentComponent) {
        currentComponent = candidateComponent;
        changed = true;
      }
    }

    // propagate new component id to the neighbors
    if (changed) {
      setValue(new IntWritable(currentComponent));
      sendMessageToAllEdges(getValue());
    }
    voteToHalt();
  }
}
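Analysis: compute() optimizes superstep 0. Each vertex first finds the smallest ID among its own value and its neighbors' IDs. If that minimum is smaller than its own value, the vertex adopts it and sends it only to the neighbors whose ID is greater than the minimum. In every later superstep, the vertex takes the smallest value among the messages it received. If that value is smaller than its own, it sets its value to it and sends it to all neighbors; otherwise it neither updates its value nor sends any message. At the end of each superstep the vertex calls voteToHalt() and becomes inactive, and it is reactivated only when a message arrives.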
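The superstep-0 optimization can be isolated in plain Java. The following is only a sketch with no Giraph dependency: the class and method names are hypothetical, and it assumes that each vertex's initial value equals its own ID, as the comment "if it is not the own id" in the code above suggests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch of the superstep-0 logic; not Giraph API.
class SuperstepZeroSketch {
    /** Smallest id among the vertex itself and its neighbors. */
    static int initialComponent(int vertexId, int[] neighbors) {
        int min = vertexId;
        for (int n : neighbors) {
            if (n < min) {
                min = n;
            }
        }
        return min;
    }

    /** Neighbors that would receive a message in superstep 0. */
    static List<Integer> recipients(int vertexId, int[] neighbors) {
        int min = initialComponent(vertexId, neighbors);
        List<Integer> out = new ArrayList<>();
        // Nothing is sent when the vertex already holds the minimum.
        if (min != vertexId) {
            for (int n : neighbors) {
                // The neighbor that owns the minimum does not need to hear it.
                if (n > min) {
                    out.add(n);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] neighbors = {2, 7, 9};
        System.out.println(initialComponent(5, neighbors)); // 2
        System.out.println(recipients(5, neighbors));       // [7, 9]
    }
}
```

Vertex 5 adopts 2 and notifies only 7 and 9, so superstep 0 sends fewer messages than a version that broadcasts to every neighbor.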

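Why add another WCC: the goal is to write the simplest (unoptimized) WCC code. In the bundled WCC, the I, V and M types are all IntWritable, which cannot meet the needs of big-data graphs with tens of billions of vertices, since a 32-bit int tops out at 2,147,483,647. Below, these types are changed to LongWritable, which requires custom input and output formats. A Combiner is added as well. The changes are made in the five steps below.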
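To make the range limit concrete, here is a small plain-Java sketch. The class name and the sample ID of ten billion are made up for illustration; IntWritable and LongWritable wrap a Java int and long respectively, so the same limits apply to them.

```java
// Hypothetical sketch: why a 32-bit id type is too small for very large graphs.
class IdRangeSketch {
    /** True if the id can be stored in an int without loss. */
    static boolean fitsInInt(long id) {
        return id >= Integer.MIN_VALUE && id <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long bigId = 10_000_000_000L; // ten billion, an illustrative vertex id
        System.out.println(Integer.MAX_VALUE); // 2147483647
        System.out.println(fitsInInt(bigId));  // false
        // A narrowing cast does not fail; it silently wraps to another value.
        System.out.println((int) bigId);       // 1410065408
    }
}
```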

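1. First, define the custom input format by adding the class org.apache.giraph.examples.LongLongNullTextInputFormat. The I, V and E types are LongWritable, LongWritable and NullWritable (meaning edges carry no weight), in that order. The graph is read as an adjacency list: each line holds a vertex ID followed by its neighbors' IDs, separated by \t. The source is as follows: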

package org.apache.giraph.examples;

import java.io.IOException;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.giraph.conf.ImmutableClassesGiraphConfigurable;
import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration;
import org.apache.giraph.edge.Edge;
import org.apache.giraph.edge.EdgeFactory;
import org.apache.giraph.graph.Vertex;
import org.apache.giraph.io.formats.TextVertexInputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

import com.google.common.collect.Lists;

/**
 * Input format for unweighted graphs with long ids and long vertex values
 */
public class LongLongNullTextInputFormat
    extends TextVertexInputFormat<LongWritable, LongWritable, NullWritable>
    implements ImmutableClassesGiraphConfigurable<LongWritable, LongWritable,
    NullWritable, Writable> {
  /** Configuration. */
  private ImmutableClassesGiraphConfiguration<LongWritable, LongWritable,
      NullWritable, Writable> conf;

  @Override
  public TextVertexReader createVertexReader(InputSplit split,
                                             TaskAttemptContext context)
    throws IOException {
    return new LongLongNullLongVertexReader();
  }

  @Override
  public void setConf(ImmutableClassesGiraphConfiguration<LongWritable,
      LongWritable, NullWritable, Writable> configuration) {
    this.conf = configuration;
  }

  @Override
  public ImmutableClassesGiraphConfiguration<LongWritable, LongWritable,
      NullWritable, Writable> getConf() {
    return conf;
  }

  /**
   * Vertex reader associated with
   * {@link LongLongNullTextInputFormat}.
   */
  public class LongLongNullLongVertexReader extends
      TextVertexInputFormat<LongWritable, LongWritable,
          NullWritable>.TextVertexReader {
    /** Separator of the vertex and neighbors */
    private final Pattern separator = Pattern.compile("\t");

    @Override
    public Vertex<LongWritable, LongWritable, NullWritable, ?>
    getCurrentVertex() throws IOException, InterruptedException {
      Vertex<LongWritable, LongWritable, NullWritable, ?>
          vertex = conf.createVertex();

      // tokens[0] is the vertex id, tokens[1..] are the neighbor ids
      String[] tokens =
          separator.split(getRecordReader().getCurrentValue().toString());
      List<Edge<LongWritable, NullWritable>> edges =
          Lists.newArrayListWithCapacity(tokens.length - 1);
      for (int n = 1; n < tokens.length; n++) {
        edges.add(EdgeFactory.create(
            new LongWritable(Long.parseLong(tokens[n])),
            NullWritable.get()));
      }

      // The vertex value starts as an empty LongWritable (0); compute()
      // overwrites it in superstep 0
      LongWritable vertexId = new LongWritable(Long.parseLong(tokens[0]));
      vertex.initialize(vertexId, new LongWritable(), edges);

      return vertex;
    }

    @Override
    public boolean nextVertex() throws IOException, InterruptedException {
      return getRecordReader().nextKeyValue();
    }
  }
}
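The line parsing done by the vertex reader can be tried out without a cluster. The sketch below repeats the same tab-splitting logic in plain Java; the class name, method names and sample line are hypothetical and are not part of Giraph.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch of the adjacency-list parsing; not Giraph API.
class AdjacencyLineSketch {
    private static final Pattern SEPARATOR = Pattern.compile("\t");

    /** The first token of a line is the vertex id. */
    static long vertexId(String line) {
        return Long.parseLong(SEPARATOR.split(line)[0]);
    }

    /** All remaining tokens are the ids of the neighbors. */
    static List<Long> neighbors(String line) {
        String[] tokens = SEPARATOR.split(line);
        List<Long> result = new ArrayList<>(tokens.length - 1);
        for (int n = 1; n < tokens.length; n++) {
            result.add(Long.parseLong(tokens[n]));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "1\t2\t3"; // vertex 1 with edges to 2 and 3
        System.out.println(vertexId(line) + " -> " + neighbors(line)); // 1 -> [2, 3]
    }
}
```

A line that holds only a vertex ID yields an empty neighbor list, while an empty or non-numeric line makes Long.parseLong throw a NumberFormatException, so the input file should be clean.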

2. Define the custom output format by adding the class org.apache.giraph.examples.VertexWithLongValueNullEdgeTextOutputFormat. It writes only the vertex ID and its value. The source is as follows:

package org.apache.giraph.examples;

import java.io.IOException;

import org.apache.giraph.graph.Vertex;
import org.apache.giraph.io.formats.TextVertexOutputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

/**
 * Output format for vertices with a long as id, a long as value and
 * null edges
 */
public class VertexWithLongValueNullEdgeTextOutputFormat extends
    TextVertexOutputFormat<LongWritable, LongWritable, NullWritable> {
  @Override
  public TextVertexWriter createVertexWriter(TaskAttemptContext context)
    throws IOException, InterruptedException {
    return new VertexWithDoubleValueWriter();
  }

  /**
   * Vertex writer used with
   * {@link VertexWithLongValueNullEdgeTextOutputFormat}.
   * Despite its name, this writer emits long values.
   */
  public class VertexWithDoubleValueWriter extends TextVertexWriter {
    @Override
    public void writeVertex(
        Vertex<LongWritable, LongWritable, NullWritable, ?> vertex)
      throws IOException, InterruptedException {
      // One line per vertex: <vertex id> TAB <vertex value>
      StringBuilder output = new StringBuilder();
      output.append(vertex.getId().get());
      output.append('\t');
      output.append(vertex.getValue().get());
      getRecordWriter().write(new Text(output.toString()), null);
    }
  }
}
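Because each output line has the form "vertex ID, tab, value", and the value is the component ID once WCC has finished, the result is easy to post-process. The following plain-Java sketch counts the vertices in each component; the class name and the sample lines are made up for illustration and are not produced by a real run.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical post-processing sketch for lines of the form "id<TAB>component".
class ComponentSizesSketch {
    /** Maps each component id to the number of vertices that carry it. */
    static Map<Long, Integer> componentSizes(List<String> outputLines) {
        Map<Long, Integer> sizes = new TreeMap<>();
        for (String line : outputLines) {
            String[] fields = line.split("\t");
            long component = Long.parseLong(fields[1]);
            sizes.merge(component, 1, Integer::sum);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Illustrative lines only, in the format written by writeVertex().
        List<String> lines = Arrays.asList("1\t1", "2\t1", "3\t1", "4\t4", "5\t4");
        System.out.println(componentSizes(lines)); // {1=3, 4=2}
    }
}
```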

3. Extend the Vertex class by adding the class org.apache.giraph.examples.WeaklyConnectedComponentsVertex, and override compute() to implement the WCC algorithm. The source is as follows:

package org.apache.giraph.examples;

import java.io.IOException;

import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

/**
 * Weakly Connected Components Algorithm
 *
 * @author baisong
 *
 */
public class WeaklyConnectedComponentsVertex extends Vertex<LongWritable,
    LongWritable, NullWritable, LongWritable> {
  /**
   * Propagates the smallest vertex id to all neighbors. Will always choose to
   * halt and only reactivate if a smaller id has been sent to it.
   *
   * @param messages Iterator of messages from the previous superstep.
   * @throws IOException
   */
  @Override
  public void compute(Iterable<LongWritable> messages) throws IOException {
    // Superstep 0: every vertex starts with its own id as component id
    if (getSuperstep() == 0) {
      setValue(getId());
    }

    // Smallest value among the current value and all received messages
    long minValue = getValue().get();
    for (LongWritable msg : messages) {
      if (msg.get() < minValue) {
        minValue = msg.get();
      }
    }

    // Always broadcast in superstep 0; later only when the value shrinks
    if (getSuperstep() == 0 || minValue < getValue().get()) {
      setValue(new LongWritable(minValue));
      sendMessageToAllEdges(new LongWritable(minValue));
    }
    voteToHalt();
  }
}
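To see how this simple version converges, the superstep loop can be imitated in plain Java. This is only a sketch under stated assumptions: the class and method names are hypothetical, every vertex is assumed to have its own entry in the graph, and the sample graph lists each edge in both directions, because messages travel only along a vertex's outgoing edges.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine imitation of the compute() loop; not Giraph API.
class WccSimulationSketch {
    /** Runs min-id propagation until no message is in flight. */
    static Map<Long, Long> run(Map<Long, List<Long>> graph) {
        Map<Long, Long> value = new TreeMap<>();
        Map<Long, List<Long>> inbox = new HashMap<>();
        boolean firstSuperstep = true;
        while (firstSuperstep || !inbox.isEmpty()) {
            Map<Long, List<Long>> outbox = new HashMap<>();
            for (Map.Entry<Long, List<Long>> entry : graph.entrySet()) {
                long id = entry.getKey();
                List<Long> messages = inbox.getOrDefault(id, new ArrayList<>());
                // A halted vertex only wakes up when it receives a message.
                if (!firstSuperstep && messages.isEmpty()) {
                    continue;
                }
                long current = firstSuperstep ? id : value.get(id);
                long min = current;
                for (long m : messages) {
                    min = Math.min(min, m);
                }
                // Same condition as in compute(): always send in superstep 0,
                // afterwards only when a smaller id was received.
                if (firstSuperstep || min < current) {
                    value.put(id, min);
                    for (long target : entry.getValue()) {
                        outbox.computeIfAbsent(target, k -> new ArrayList<>()).add(min);
                    }
                }
            }
            inbox = outbox;
            firstSuperstep = false;
        }
        return value;
    }

    public static void main(String[] args) {
        // Two components: {1, 2, 3} and {4, 5}; edges listed in both directions.
        Map<Long, List<Long>> graph = new TreeMap<>();
        graph.put(1L, Arrays.asList(2L));
        graph.put(2L, Arrays.asList(1L, 3L));
        graph.put(3L, Arrays.asList(2L));
        graph.put(4L, Arrays.asList(5L));
        graph.put(5L, Arrays.asList(4L));
        System.out.println(run(graph)); // {1=1, 2=1, 3=1, 4=4, 5=4}
    }
}
```

Each vertex ends up holding the smallest ID in its component, which is what the output format writes next to the vertex ID.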

4. Define the custom Combiner by adding the class org.apache.giraph.combiner.MinimumLongCombiner. The source is as follows:

package org.apache.giraph.combiner;

import org.apache.hadoop.io.LongWritable;

/**
 * {@link Combiner} that finds the minimum {@link LongWritable}
 */
public class MinimumLongCombiner
    extends Combiner<LongWritable, LongWritable> {
  @Override
  public void combine(LongWritable vertexIndex, LongWritable originalMessage,
      LongWritable messageToCombine) {
    if (originalMessage.get() > messageToCombine.get()) {
      originalMessage.set(messageToCombine.get());
    }
  }

  @Override
  public LongWritable createInitialMessage() {
    return new LongWritable(Long.MAX_VALUE);
  }
}
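The effect of the combiner is that all messages addressed to one vertex are folded into a single message carrying the minimum. A plain-Java sketch of that fold follows; the class and method names are hypothetical, and the way Giraph itself schedules the combine() calls is not modeled here.

```java
// Hypothetical sketch of folding messages with a minimum combiner; not Giraph API.
class MinCombinerSketch {
    /** Folds all messages for one target vertex into a single value. */
    static long combineAll(long[] messages) {
        long combined = Long.MAX_VALUE; // mirrors createInitialMessage()
        for (long m : messages) {
            if (combined > m) {         // mirrors the test in combine()
                combined = m;
            }
        }
        return combined;
    }

    public static void main(String[] args) {
        // Three messages for the same vertex collapse into one.
        System.out.println(combineAll(new long[] {7L, 3L, 9L})); // 3
    }
}
```

Since compute() only looks for the smallest received value, replacing several messages with their minimum does not change the result; it only reduces the number of messages that have to be stored and transferred.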

5. The code is now complete. All new and modified class files need to be put into the giraph-examples-1.0.0-for-hadoop-0.20.203.0-jar-with-dependencies.jar package. The run commands are as follows:
hadoop jar giraph-examples-1.0.0-for-hadoop-0.20.203.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -Dgiraph.userPartitionCount=2 org.apache.giraph.examples.WeaklyConnectedComponentsVertex -vif org.apache.giraph.examples.LongLongNullTextInputFormat -vip WCC -of org.apache.giraph.examples.VertexWithLongValueNullEdgeTextOutputFormat -op WCC-Modify-1 -w 2

# With the Combiner
hadoop jar giraph-examples-1.0.0-for-hadoop-0.20.203.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -Dgiraph.userPartitionCount=2 org.apache.giraph.examples.WeaklyConnectedComponentsVertex -vif org.apache.giraph.examples.LongLongNullTextInputFormat -vip WCC -of org.apache.giraph.examples.VertexWithLongValueNullEdgeTextOutputFormat -op WCC-Modify-2 -c org.apache.giraph.combiner.MinimumLongCombiner -w 2


The end.
