Graduation Project, Week 4 (PowerGraph)

Source: Internet; Editor: 程序博客网; Time: 2024/05/19 22:47

PowerGraph

Following on from last week: I saw that the Pregel implementation in GraphX borrows from PowerGraph, so this week I studied PowerGraph.
The main source is this paper: PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs

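The paper points out that Pregel and GraphLab do not handle natural graphs well. When the degree distribution of a graph follows a power law, a small fraction of the vertices are connected to a large part of the graph. As a formula: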

P(d) \propto d^{-\alpha}

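Here P(d) is the probability that a vertex has degree d, and α is a positive constant, typically around 2 for natural graphs.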
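To get a feel for how skewed such a distribution is, the following minimal sketch normalizes d^{-α} over a finite range of degrees and computes the share of all edge endpoints that belongs to the highest-degree vertices. The cutoff `max_degree` and the helper names are assumptions made for illustration; they do not come from the paper.

```python
def power_law_pmf(alpha, max_degree):
    """Return P(d) for d = 1..max_degree, with P(d) proportional to d ** -alpha."""
    weights = [d ** -alpha for d in range(1, max_degree + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def edge_share_of_top_vertices(alpha, max_degree, top_fraction):
    """Share of edge endpoints owned by the highest-degree `top_fraction` of vertices."""
    pmf = power_law_pmf(alpha, max_degree)
    # Expected degree of a vertex, i.e. total endpoints per vertex.
    total_degree = sum(d * p for d, p in enumerate(pmf, start=1))
    remaining = top_fraction  # probability mass still to be collected
    owned = 0.0
    for d in range(max_degree, 0, -1):  # walk from the highest degree downwards
        take = min(pmf[d - 1], remaining)
        owned += take * d
        remaining -= take
        if remaining <= 0:
            break
    return owned / total_degree


# Hypothetical setting: alpha = 2, degrees capped at 10,000, top 1% of vertices.
share = edge_share_of_top_vertices(alpha=2.0, max_degree=10_000, top_fraction=0.01)
print(f"top 1% of vertices hold {share:.0%} of edge endpoints")
```

Under these assumptions the top 1% of vertices account for roughly half of all edge endpoints, which is why a few vertices can dominate work, storage, and communication.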

Because the graph can no longer be assumed to have a random, evenly spread degree structure, several problems are exposed:

Work Balance
Partitioning
Communication
Storage
Computation

PowerGraph Abstraction

The PowerGraph Abstraction section of the paper sets out the core idea of PowerGraph:

[PowerGraph] eliminates the degree dependence of the vertex-program by directly exploiting the GAS decomposition to factor vertex-programs over edges. By lifting the Gather and Scatter phases into the abstraction, PowerGraph is able to retain the natural “think-like-a-vertex” philosophy [30] while distributing the computation of a single vertex-program over the entire cluster.

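Two graph-computation frameworks need to be mentioned here, Pregel and GraphLab. Both can be described in terms of the GAS abstraction, but they realize it in slightly different ways.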

GAS has three parts, Gather, Apply, and Scatter, which can be written as:

\Sigma \leftarrow \bigoplus_{v \in \mathrm{Nbr}[u]} g\left(D_u, D_{(u,v)}, D_v\right)

D_u^{\mathrm{new}} \leftarrow a\left(D_u, \Sigma\right)

\forall v \in \mathrm{Nbr}[u] : \left(D_{(u,v)} \leftarrow s\left(D_u^{\mathrm{new}}, D_{(u,v)}, D_v\right)\right)
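As a worked instance of these three equations, PageRank can be written in GAS form. This is a sketch under stated assumptions: it uses the common formulation with damping factor 0.85, and the symbols R(v) (the rank stored in D_v), out(v) (the out-neighbours of v), and ε (an activation threshold) are notation introduced here for illustration. Gather is taken over the in-neighbours of u, and the edge data is left unchanged by scatter.

```latex
% Gather: g(D_u, D_{(u,v)}, D_v) = R(v) / |out(v)|, combined with \oplus = +
\Sigma \leftarrow \sum_{v \in \mathrm{Nbr}[u]} \frac{R(v)}{|\mathrm{out}(v)|}

% Apply: a(D_u, \Sigma) = 0.15 + 0.85 \Sigma
R^{\mathrm{new}}(u) \leftarrow 0.15 + 0.85 \, \Sigma

% Scatter: edge data is unchanged; neighbour v is re-activated only if u moved enough
\forall v \in \mathrm{Nbr}[u] : \text{activate } v \text{ if } \left| R^{\mathrm{new}}(u) - R(u) \right| > \varepsilon
```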

In Pregel, the gather phase is implemented with message combiners, and the apply and scatter phases are expressed in the vertex program.

In GraphLab, the gather and scatter phases are defined implicitly, by guaranteeing that changes to vertex and edge state are visible to adjacent vertices.

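PowerGraph borrows from both. From GraphLab it takes the data-graph and shared-memory view of computation, which relieves users of having to manage the movement of information. From Pregel it takes the idea of a commutative, associative gather.
The figure below shows the PowerGraph programming abstraction.
[Figure: the PowerGraph vertex-program interface]
The abstraction explicitly defines the gather, sum, apply, and scatter functions. Each function is invoked in stages by the PowerGraph engine, following the semantics of Alg. 1.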
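The four functions can be sketched in plain Python. The following is a toy illustration, not PowerGraph's actual C++ interface: the class `MinLabelProgram`, the sequential `run` engine, and the convention that `scatter` returns whether to activate the neighbour are all assumptions made for this example. The program labels every vertex with the smallest vertex id in its connected component.

```python
class MinLabelProgram:
    """Hypothetical vertex program: propagate the smallest label in each component."""

    def gather(self, d_u, d_uv, d_v):
        # Contribution of one adjacent edge: the neighbour's current label.
        return d_v

    def sum(self, left, right):
        # Commutative, associative combiner for gathered values.
        return min(left, right)

    def apply(self, d_u, acc):
        # New vertex value computed from the old value and the accumulator.
        return min(d_u, acc)

    def scatter(self, d_u_new, d_uv, d_v):
        # Simplification: return True when neighbour v should be activated.
        return d_u_new < d_v


def run(program, neighbours, vertex_data):
    """Sequential toy engine: process active vertices until none remain."""
    data = dict(vertex_data)
    active = set(data)
    while active:
        u = active.pop()
        acc = None
        for v in neighbours[u]:  # gather phase (edge data unused here, so None)
            value = program.gather(data[u], None, data[v])
            acc = value if acc is None else program.sum(acc, value)
        if acc is not None:      # apply phase
            data[u] = program.apply(data[u], acc)
        for v in neighbours[u]:  # scatter phase
            if program.scatter(data[u], None, data[v]):
                active.add(v)
    return data


# Two components: {0, 1, 2} and {3, 4}; every vertex starts with its own id.
neighbours = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
labels = run(MinLabelProgram(), neighbours, {v: v for v in neighbours})
print(labels)
```

Because the engine only ever touches a vertex through these four callbacks, it is free to decide where and in what order the per-edge calls run, which is the property PowerGraph exploits.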

In the gather phase, the gather and sum functions play the roles of map and reduce. The Pregel implementation in GraphX contains this line:

 messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDirection))).cache()

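This is presumably borrowed from PowerGraph: sendMsg is the map function and mergeMsg is the reduce function.
The gather function is invoked in parallel on the edges adjacent to u.
(I still do not fully understand the order of execution under parallelism. It should affect the algorithm, so how is that effect avoided?)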
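One way to approach the ordering question is to look at what the reduce step requires. The sketch below is a plain-Python analogy of a map/reduce over edge triplets, not the GraphX API; `map_reduce_triplets`, `send_to_dst`, and the toy data are hypothetical. It shows that a commutative, associative merge gives the same result whatever order the edges are visited in, while a merge without those properties does not.

```python
def map_reduce_triplets(edges, vertex_data, send_msg, merge_msg):
    """Apply send_msg to every (src, dst) edge and merge the messages per target."""
    inbox = {}
    for src, dst in edges:
        for target, msg in send_msg(src, vertex_data[src], dst, vertex_data[dst]):
            if target in inbox:
                inbox[target] = merge_msg(inbox[target], msg)
            else:
                inbox[target] = msg
    return inbox


def send_to_dst(src, src_value, dst, dst_value):
    # Map step: each edge sends the source value to the destination vertex.
    return [(dst, src_value)]


edges = [(0, 2), (1, 2)]
values = {0: 10, 1: 20, 2: 0}
flipped = list(reversed(edges))

# Integer addition is commutative and associative: edge order does not matter.
add = lambda a, b: a + b
print(map_reduce_triplets(edges, values, send_to_dst, add))    # {2: 30}
print(map_reduce_triplets(flipped, values, send_to_dst, add))  # {2: 30}

# "Keep the first message" is neither: the result depends on edge order.
keep_first = lambda a, b: a
print(map_reduce_triplets(edges, values, send_to_dst, keep_first))    # {2: 10}
print(map_reduce_triplets(flipped, values, send_to_dst, keep_first))  # {2: 20}
```

This is consistent with the commutative, associative gather that PowerGraph takes from Pregel: if the sum has those properties, the order in which parallel gather calls finish cannot change the accumulator.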

The gather phase produces a final accumulator a_u, which the apply phase then uses to compute the new D_u. Note that D_u is written back to the graph atomically.

The scatter phase is similar to gather: the scatter function is invoked in parallel on each edge adjacent to u, and the new D(u,v) it produces is written back to the data-graph.

Initiating Future Computation

PowerGraph can execute in two modes:

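synchronously
Similar to Pregel, but limited by frequent barriers and the inability to operate on the most recent data.
asynchronously (I have not read the related material yet and need to do so next)
In this mode, active vertices are executed as processor and network resources become available. Updates to vertex and edge state are committed to the graph immediately and are visible to subsequent computation. To keep algorithms correct in this mode, PowerGraph enforces serializability as GraphLab does, and to cope with natural graphs it adopts a new locking scheme.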
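The difference between the two modes can be illustrated with a small sequential sketch. It is only an analogy under stated assumptions: `sync_sweeps` and `async_sweeps` are hypothetical helpers, the algorithm is minimum-label propagation, and the in-place loop runs one vertex at a time, so it is trivially serializable and needs none of the locking a real concurrent engine would.

```python
def sync_sweeps(neighbours, labels):
    """Barrier style: every vertex reads the previous superstep's snapshot."""
    labels = dict(labels)
    sweeps = 0
    while True:
        snapshot = dict(labels)
        updated = {
            u: min([snapshot[u]] + [snapshot[v] for v in neighbours[u]])
            for u in snapshot
        }
        if updated == labels:
            return labels, sweeps
        labels = updated
        sweeps += 1


def async_sweeps(neighbours, labels):
    """In-place style: each update is visible to the very next vertex processed."""
    labels = dict(labels)
    sweeps = 0
    while True:
        changed = False
        for u in sorted(labels):
            best = min([labels[u]] + [labels[v] for v in neighbours[u]])
            if best != labels[u]:
                labels[u] = best  # committed immediately
                changed = True
        if not changed:
            return labels, sweeps
        sweeps += 1


# Path graph 0 - 1 - 2 - 3 - 4, each vertex initially labelled with its own id.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
start = {v: v for v in path}
print(sync_sweeps(path, start))   # label 0 moves one hop per superstep
print(async_sweeps(path, start))  # label 0 moves along the whole path in one sweep
```

On this path the synchronous version needs four sweeps and the in-place version one, because each vertex sees its predecessor's fresh value. The gain depends on the processing order; visiting the vertices in the reverse order would remove the advantage.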

Distributed Graph Placement

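The placement of the data-graph structure and data is key to reducing communication and ensuring balance.
The paper compares several partitioning strategies, analyzing each one quantitatively in terms of the expected number of cut edges or the expected replication: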

p-way edge-cut: on power-law graphs, existing edge-cut tools cannot guarantee balance
randomized p-way vertex-cut
greedy vertex-cut: a heuristic algorithm
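The two vertex-cut strategies above can be sketched as follows. This is a single-process simplification with hypothetical function names: `expected_replicas` encodes the expected number of machines a vertex of a given degree spans when its edges are placed uniformly at random, and `greedy_vertex_cut` follows the greedy placement rules in spirit, with ties broken by machine load and then machine id, which is an assumption of this sketch.

```python
import random
from collections import defaultdict


def expected_replicas(p, degree):
    """Expected machines spanned by a vertex under random placement on p machines."""
    return p * (1.0 - (1.0 - 1.0 / p) ** degree)


def random_vertex_cut(edges, p, seed=0):
    """Assign every edge to a machine chosen uniformly at random."""
    rng = random.Random(seed)
    return [(edge, rng.randrange(p)) for edge in edges]


def greedy_vertex_cut(edges, p):
    """Place edges one by one, preferring machines that already hold an endpoint."""
    remaining = defaultdict(int)  # unplaced edges per vertex
    for u, v in edges:
        remaining[u] += 1
        remaining[v] += 1
    replicas = defaultdict(set)   # machines on which each vertex already appears
    load = [0] * p
    placement = []
    for u, v in edges:
        a_u, a_v = replicas[u], replicas[v]
        if a_u & a_v:             # both endpoints share a machine
            candidates = a_u & a_v
        elif a_u and a_v:         # disjoint: follow the vertex with more edges left
            candidates = a_u if remaining[u] >= remaining[v] else a_v
        elif a_u or a_v:          # only one endpoint has been placed so far
            candidates = a_u or a_v
        else:                     # neither placed: any machine is a candidate
            candidates = set(range(p))
        machine = min(candidates, key=lambda m: (load[m], m))
        placement.append(((u, v), machine))
        load[machine] += 1
        replicas[u].add(machine)
        replicas[v].add(machine)
        remaining[u] -= 1
        remaining[v] -= 1
    return placement


def replication_factor(placement):
    """Average number of machines spanned per vertex."""
    spans = defaultdict(set)
    for (u, v), machine in placement:
        spans[u].add(machine)
        spans[v].add(machine)
    return sum(len(machines) for machines in spans.values()) / len(spans)


# Two disjoint triangles on two machines.
triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
print(replication_factor(greedy_vertex_cut(triangles, p=2)))
print(replication_factor(random_vertex_cut(triangles, p=2)))
```

On this input the greedy rules keep each triangle on one machine, so no vertex is replicated, whereas random placement can spread a vertex over both machines. A lower replication factor means less storage and less communication when vertex replicas are synchronized.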
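
Summary

Only after reading about PowerGraph did I suddenly understand the edge-cut and vertex-cut mentioned in GraphX: A Resilient Distributed Graph System on Spark. The fundamental reason the Pregel implementation in GraphX differs from standard Pregel is the partitioning strategy, and with it the way the program is parallelized. Many papers are not very clear on a first reading and only make sense after reading related work. I should read more from now on. Next week I want to look at GraphLab and, ideally, read some of its source code.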

0 0

毕业设计第四周(PowerGraph)

PowerGraph

PowerGraph Abstraction

Distributed Graph Placement

总结