On Giraph Data Partitioning in V1.2


giraph ../giraph-core-1.2.0.jar org.apache.giraph.benchmark.PageRankComputation -vif org.apache.giraph.io.formats.IntFloatNullTextInputFormat -vip /test/youTube.txt -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /output -w 2


The worker number after -w refers to the number of map tasks; in the actual run, three more tasks are added on top of that:

That is, task ID 0 is the master coordination task, IDs 1 and 2 are the compute tasks, ID 3 is the job setup task, and ID 4 is the job cleanup task.
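
A trivial sketch of that arithmetic (assuming, as described above, one master plus w compute workers plus setup and cleanup):

 // Total map tasks launched for -w 2: master + workers + setup + cleanup.
 public class TaskLayoutSketch {
   public static void main(String[] args) {
     int w = 2;                          // the -w argument
     int totalMapTasks = 1 + w + 1 + 1;  // = 5, task IDs 0..4 as listed above
     System.out.println(totalMapTasks);
   }
 }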



Suppose the HDFS block size is 16 MB and the actual data is 30 MB; the number of splits is then 30/16 rounded up, i.e. 2 splits.
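
A minimal sketch of that split arithmetic (assuming roughly one input split per HDFS block, as with the default FileInputFormat behavior):

 // Minimal sketch: splits = ceil(dataSize / blockSize).
 public class SplitCountSketch {
   public static void main(String[] args) {
     long blockSize = 16L * 1024 * 1024;  // 16 MB HDFS block size
     long dataSize  = 30L * 1024 * 1024;  // 30 MB of input data
     int splits = (int) Math.ceil((double) dataSize / blockSize);
     System.out.println(splits);          // 2
   }
 }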

When the relationship between splits and partitions is actually computed, the following function is used:

 PartitionUtils.computePartitionCount

 public static int computePartitionCount(int availableWorkerCount,
     ImmutableClassesGiraphConfiguration conf) {
   if (availableWorkerCount == 0) {
     throw new IllegalArgumentException(
         "computePartitionCount: No available workers");
   }
   int userPartitionCount = USER_PARTITION_COUNT.get(conf);
   int partitionCount;
   if (userPartitionCount == USER_PARTITION_COUNT.getDefaultValue()) {
     float multiplier = GiraphConstants.PARTITION_COUNT_MULTIPLIER.get(conf);
     partitionCount = Math.max(
         (int) (multiplier * availableWorkerCount * availableWorkerCount), 1);
     int minPartitionsPerComputeThread =
         MIN_PARTITIONS_PER_COMPUTE_THREAD.get(conf);
     int totalComputeThreads =
         NUM_COMPUTE_THREADS.get(conf) * availableWorkerCount;
     partitionCount = Math.max(partitionCount,
         minPartitionsPerComputeThread * totalComputeThreads);
   } else {
     partitionCount = userPartitionCount;
   }
   if (LOG.isInfoEnabled()) {
     LOG.info("computePartitionCount: Creating " +
         partitionCount + " partitions.");
   }
   return partitionCount;
 }

multiplier defaults to 1, and availableWorkerCount is 2, the worker number set at launch, so:

 partitionCount = Math.max(
          (int) (multiplier * availableWorkerCount * availableWorkerCount), 1); 

In effect, each split is cut evenly into worker-number pieces. Because only 2 workers were set when the test started, this produced 4 partitions: each task actually ran two partitions of data, with each partition holding 1/4 of the data.
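
Plugging the test's values into the default branch of computePartitionCount shows where the four partitions come from:

 // Worked example of the default branch of computePartitionCount,
 // using multiplier = 1 (the default) and the 2 workers from this test.
 public class PartitionCountSketch {
   public static void main(String[] args) {
     float multiplier = 1.0f;
     int availableWorkerCount = 2;
     int partitionCount = Math.max(
         (int) (multiplier * availableWorkerCount * availableWorkerCount), 1);
     // 4 partitions across 2 workers: 2 partitions per worker,
     // each holding about 1/4 of the data.
     System.out.println(partitionCount);
   }
 }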

This caused Netty to report out-of-memory while sending partition data:

2016-12-09 22:52:47,649 ERROR org.apache.giraph.comm.netty.NettyClient: Request failed
java.lang.OutOfMemoryError: Java heap space
    at io.netty.buffer.UnpooledHeapByteBuf.<init>(UnpooledHeapByteBuf.java:45)
    at io.netty.buffer.UnpooledByteBufAllocator.newHeapBuffer(UnpooledByteBufAllocator.java:43)
    at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:136)
    at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:127)
    at io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:85)
    at org.apache.giraph.comm.netty.handler.RequestEncoder.write(RequestEncoder.java:81)
    at io.netty.channel.DefaultChannelHandlerContext.invokeWrite(DefaultChannelHandlerContext.java:645)
    at io.netty.channel.DefaultChannelHandlerContext.access$2000(DefaultChannelHandlerContext.java:29)
    at io.netty.channel.DefaultChannelHandlerContext$WriteTask.run(DefaultChannelHandlerContext.java:906)
    at io.netty.util.concurrent.DefaultEventExecutor.run(DefaultEventExecutor.java:36)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
    at java.lang.Thread.run(Thread.java:745)


The worker number was later changed to 3, giving 3 × 3 = 9 partitions in total, an average of 3 partitions per task, with each partition holding 1/9 of the data, and everything ran fine~
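
Raising the worker count is not the only lever: the userPartitionCount branch of computePartitionCount above suggests the partition count can also be set directly. A hedged sketch, assuming giraph.userPartitionCount is the configuration key behind USER_PARTITION_COUNT:

 import org.apache.giraph.conf.GiraphConfiguration;

 // Sketch only: an explicit partition count takes the non-default
 // (userPartitionCount) branch of computePartitionCount, bypassing the
 // workers-squared formula. Key name is an assumption.
 public class UserPartitionCountSketch {
   public static void main(String[] args) {
     GiraphConfiguration conf = new GiraphConfiguration();
     conf.setInt("giraph.userPartitionCount", 9);  // 9 smaller partitions
   }
 }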



About DiskBackedMessageStore

In V1.2, DiskBackedMessageStoreFactory was removed. Previously, whether the graph and whether the messages were spilled to disk could each be configured separately; now, in the ServerData class, as long as GiraphConstants.USE_OUT_OF_CORE_GRAPH (giraph.useOutOfCoreGraph) is set to true, the oocEngine is enabled, and both the graph and the messages are stored on disk by default.
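
A minimal sketch of the new single switch, using the giraph.useOutOfCoreGraph key named above:

 import org.apache.giraph.conf.GiraphConfiguration;

 // Minimal sketch of the 1.2 behavior described above: one switch
 // enables the oocEngine, after which both graph partitions and
 // messages may be spilled to disk.
 public class OutOfCoreSketch {
   public static void main(String[] args) {
     GiraphConfiguration conf = new GiraphConfiguration();
     conf.setBoolean("giraph.useOutOfCoreGraph", true);
   }
 }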

This change is not much of an improvement! The old way was better!
