Spark Graph Computation (Part 2)

Source: Internet. Editor: 程序博客网. Time: 2024/05/17 07:13

Graph Operators

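Just as RDDs have basic operations such as map, filter, and reduceByKey, property graphs have a set of basic operators that take user-defined functions and produce new graphs with transformed properties and structure. The core operators, which have optimized implementations, are defined in Graph; the convenience operators, which are expressed as compositions of the core operators, are defined in GraphOps. Thanks to Scala implicits, the operators in GraphOps are automatically available as members of Graph. For example, we can compute the in-degree of each vertex (defined in GraphOps) as follows: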

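// `graph` is assumed to have been constructed elsewhere
val graph: Graph[(String, String), String]
// Use the implicit GraphOps.inDegrees operator
val inDegrees: VertexRDD[Int] = graph.inDegrees

Core graph operations are kept separate from GraphOps so that different graph representations can be supported in the future. Each graph representation must provide implementations of the core operations and can reuse the many useful operations defined in GraphOps.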
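The split between core and convenience operators can be illustrated with a minimal plain-Scala sketch. `MiniGraph`, `MiniGraphOps`, and `toOps` are hypothetical stand-ins invented for this illustration; they are not the GraphX classes, and the sketch only mirrors the general implicit-conversion pattern described above.

```scala
import scala.language.implicitConversions

// Hypothetical "core" class: it only exposes the raw edge list.
class MiniGraph(val edges: Seq[(Long, Long)])

// Hypothetical convenience layer, built purely on top of the core members.
class MiniGraphOps(g: MiniGraph) {
  // In-degree of every vertex that has at least one incoming edge.
  def inDegrees: Map[Long, Int] =
    g.edges.groupBy(_._2).map { case (dst, es) => dst -> es.size }
}

object MiniGraph {
  // Defined in the companion object, so the conversion is in implicit scope
  // wherever a MiniGraph is used; callers need no extra import.
  implicit def toOps(g: MiniGraph): MiniGraphOps = new MiniGraphOps(g)
}

object ImplicitOpsDemo {
  def main(args: Array[String]): Unit = {
    val g = new MiniGraph(Seq((1L, 2L), (3L, 2L), (2L, 3L)))
    // inDegrees is not a member of MiniGraph; the compiler inserts toOps(g).
    println(g.inDegrees)
  }
}
```

A different representation would only need to supply its own core class; everything in the convenience layer would keep working unchanged.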

Summary List of Operators

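The following is a quick summary of the functionality defined in Graph and GraphOps, presented as members of Graph for simplicity. Note that some function signatures have been simplified (default arguments and type constraints removed) and some of the more advanced functionality has been left out.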

/** Summary of the functionality in the property graph */
class Graph[VD, ED] {
  // Information about the Graph ===================================================================
  val numEdges: Long
  val numVertices: Long
  val inDegrees: VertexRDD[Int]
  val outDegrees: VertexRDD[Int]
  val degrees: VertexRDD[Int]
  // Views of the graph as collections =============================================================
  val vertices: VertexRDD[VD]
  val edges: EdgeRDD[ED]
  val triplets: RDD[EdgeTriplet[VD, ED]]
  // Functions for caching graphs ==================================================================
  def persist(newLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED]
  def cache(): Graph[VD, ED]
  def unpersistVertices(blocking: Boolean = true): Graph[VD, ED]
  // Change the partitioning heuristic  ============================================================
  def partitionBy(partitionStrategy: PartitionStrategy): Graph[VD, ED]
  // Transform vertex and edge attributes ==========================================================
  def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]
  def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
  def mapEdges[ED2](map: (PartitionID, Iterator[Edge[ED]]) => Iterator[ED2]): Graph[VD, ED2]
  def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
  def mapTriplets[ED2](map: (PartitionID, Iterator[EdgeTriplet[VD, ED]]) => Iterator[ED2])
    : Graph[VD, ED2]
  // Modify the graph structure ====================================================================
  def reverse: Graph[VD, ED]
  def subgraph(
      epred: EdgeTriplet[VD,ED] => Boolean = (x => true),
      vpred: (VertexId, VD) => Boolean = ((v, d) => true))
    : Graph[VD, ED]
  def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
  def groupEdges(merge: (ED, ED) => ED): Graph[VD, ED]
  // Join RDDs with the graph ======================================================================
  def joinVertices[U](table: RDD[(VertexId, U)])(mapFunc: (VertexId, VD, U) => VD): Graph[VD, ED]
  def outerJoinVertices[U, VD2](other: RDD[(VertexId, U)])
      (mapFunc: (VertexId, VD, Option[U]) => VD2)
    : Graph[VD2, ED]
  // Aggregate information about adjacent triplets =================================================
  def collectNeighborIds(edgeDirection: EdgeDirection): VertexRDD[Array[VertexId]]
  def collectNeighbors(edgeDirection: EdgeDirection): VertexRDD[Array[(VertexId, VD)]]
  def aggregateMessages[Msg: ClassTag](
      sendMsg: EdgeContext[VD, ED, Msg] => Unit,
      mergeMsg: (Msg, Msg) => Msg,
      tripletFields: TripletFields = TripletFields.All)
    : VertexRDD[Msg]
  // Iterative graph-parallel computation ==========================================================
  def pregel[A](initialMsg: A, maxIterations: Int, activeDirection: EdgeDirection)(
      vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId,A)],
      mergeMsg: (A, A) => A)
    : Graph[VD, ED]
  // Basic graph algorithms ========================================================================
  def pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double]
  def connectedComponents(): Graph[VertexId, ED]
  def triangleCount(): Graph[Int, ED]
  def stronglyConnectedComponents(numIter: Int): Graph[VertexId, ED]
}

Property Operators

Like the RDD map operator, the property graph provides the following:

class Graph[VD, ED] {
  def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]
  def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
  def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
}

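Each of these operators yields a new graph in which the vertex or edge properties have been modified by the user-defined map function.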

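Note that none of these operators changes the graph structure. This is their key feature: it allows the resulting graph to reuse the structural indices of the original graph. The following two snippets are logically equivalent, but the first does not preserve the structural indices and therefore cannot benefit from the GraphX system optimizations.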

val newVertices = graph.vertices.map { case (id, attr) => (id, mapUdf(id, attr)) }
val newGraph = Graph(newVertices, graph.edges)

Instead, use mapVertices to preserve the indices:

val newGraph = graph.mapVertices((id, attr) => mapUdf(id, attr))

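These operators are often used to initialize a graph for a particular computation. For example, given a graph whose vertex properties are the out-degrees, we can initialize it for PageRank: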

// Given a graph where the vertex property is the out degree
val inputGraph: Graph[Int, String] =
  graph.outerJoinVertices(graph.outDegrees)((vid, _, degOpt) => degOpt.getOrElse(0))
// Construct a graph where each edge contains the weight
// and each vertex is the initial PageRank
val outputGraph: Graph[Double, Double] =
  inputGraph.mapTriplets(triplet => 1.0 / triplet.srcAttr).mapVertices((id, _) => 1.0)

Structural Operators

Currently GraphX supports only a simple set of commonly used structural operators:

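class Graph[VD, ED] {
  def reverse: Graph[VD, ED]
  def subgraph(epred: EdgeTriplet[VD,ED] => Boolean,
               vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
  def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
  def groupEdges(merge: (ED, ED) => ED): Graph[VD,ED]
}

The reverse operator returns a new graph with all edge directions reversed. This is useful when, for example, computing the inverse PageRank. Because reverse does not modify vertex or edge properties and does not change the number of edges, it can be implemented efficiently without moving or copying data.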

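The subgraph operator takes vertex and edge predicates and returns the graph containing only the vertices that satisfy the vertex predicate (it evaluates to true) and the edges that satisfy the edge predicate and connect vertices that satisfy the vertex predicate. It can be used to restrict the graph to the vertices and edges of interest or to eliminate broken links. For example, the following code removes broken links: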

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// `sc` is assumed to be an existing SparkContext
// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof")),
                       (4L, ("peter", "student"))))
// Create an RDD for edges
val relationships: RDD[Edge[String]] =
  sc.parallelize(Array(Edge(3L, 7L, "collab"),    Edge(5L, 3L, "advisor"),
                       Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"),
                       Edge(4L, 0L, "student"),   Edge(5L, 0L, "colleague")))
// Define a default user in case there are relationships with a missing user
val defaultUser = ("John Doe", "Missing")
// Build the initial Graph
val graph = Graph(users, relationships, defaultUser)
// Notice that there is a user 0 (for which we have no information) connected to users
// 4 (peter) and 5 (franklin).
graph.triplets.map(
  triplet => triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1
).collect.foreach(println(_))
// Remove missing vertices as well as the edges connected to them
val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")
// The valid subgraph will disconnect users 4 and 5 by removing user 0
validGraph.vertices.collect.foreach(println(_))
validGraph.triplets.map(
  triplet => triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1
).collect.foreach(println(_))

Note that the example above supplies only the vertex predicate. A predicate that is not provided to subgraph defaults to true.
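For contrast with the vertex predicate used above, here is a minimal sketch that supplies only the edge predicate. The object name, the helper `dropEdgesLabeled`, the tiny three-vertex graph, and the local master setting are all assumptions made for this illustration; it needs Spark and GraphX on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object SubgraphEdgePredicateSketch {
  // Hypothetical helper: keeps every vertex (vpred is left at its default)
  // and drops the edges whose attribute equals `label`.
  def dropEdgesLabeled(graph: Graph[String, String], label: String): Graph[String, String] =
    graph.subgraph(epred = triplet => triplet.attr != label)

  def main(args: Array[String]): Unit = {
    // Local master chosen only so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("SubgraphSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
      val edges = sc.parallelize(Seq(Edge(1L, 2L, "keep"), Edge(2L, 3L, "drop")))
      val filtered = dropEdgesLabeled(Graph(vertices, edges), "drop")
      filtered.vertices.collect().foreach(println) // vertices are not filtered
      filtered.edges.collect().foreach(println)    // edges labeled "drop" are gone
    } finally {
      sc.stop()
    }
  }
}
```

Passing the predicate by name (`epred = ...`) is what lets the other predicate fall back to its default.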
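The mask operator constructs a subgraph by returning a graph that contains only the vertices and edges that are also found in the input graph. It can be used together with the subgraph operator to restrict a graph based on the properties of another, related graph. For example, we might run connected components on the graph with missing vertices and then restrict the answer to the valid subgraph.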

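// Run Connected Components
val ccGraph = graph.connectedComponents() // No longer contains missing field
// Remove missing vertices as well as the edges connected to them
val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")
// Restrict the answer to the valid subgraph
val validCCGraph = ccGraph.mask(validGraph)

The groupEdges operator merges parallel edges (that is, duplicate edges between the same pair of vertices) in a multigraph. In many numerical applications, parallel edges can be added together (their weights combined) into a single edge, which reduces the size of the graph.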
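A minimal sketch of merging parallel edges by summing their weights follows. The object name, the helper `mergeParallel`, the edge data, and the local master are assumptions made for this illustration. The call to partitionBy before groupEdges reflects the GraphX API documentation, which states that the graph must have been partitioned with partitionBy for groupEdges to give correct results; the choice of EdgePartition2D here is arbitrary.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}

object GroupEdgesSketch {
  // Hypothetical helper: builds a small multigraph and merges its parallel edges.
  def mergeParallel(sc: SparkContext): Array[Edge[Double]] = {
    // Two parallel edges 1 -> 2 and a single edge 2 -> 3; the Double is a weight.
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, 0.5),
      Edge(1L, 2L, 1.5),
      Edge(2L, 3L, 2.0)))
    val multigraph: Graph[Int, Double] = Graph.fromEdges(edges, defaultValue = 0)
    multigraph
      .partitionBy(PartitionStrategy.EdgePartition2D) // co-locate identical edges first
      .groupEdges((w1, w2) => w1 + w2)                // then combine their weights
      .edges
      .collect()
  }

  def main(args: Array[String]): Unit = {
    // Local master chosen only so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("GroupEdgesSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try mergeParallel(sc).foreach(println)
    finally sc.stop()
  }
}
```

Any associative and commutative merge function can take the place of the sum, for example `math.max` to keep the heaviest edge.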

Join Operators

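It is often necessary to join data from external collections (RDDs) with a graph. For example, we might have extra user properties that we want to merge into an existing graph, or we might want to pull vertex properties from one graph into another. These tasks can be accomplished with the join operators.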

class Graph[VD, ED] {
  def joinVertices[U](table: RDD[(VertexId, U)])(map: (VertexId, VD, U) => VD)
    : Graph[VD, ED]
  def outerJoinVertices[U, VD2](table: RDD[(VertexId, U)])(map: (VertexId, VD, Option[U]) => VD2)
    : Graph[VD2, ED]
}

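The joinVertices operator joins the vertices with the input RDD and returns a new graph whose vertex properties are obtained by applying the user-defined map function to each joined vertex. Vertices without a matching value in the RDD retain their original value.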

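Note that if the RDD contains more than one value for a given vertex, only one of them will be used. It is therefore recommended to make the input RDD unique using the code below, which also pre-indexes the resulting values to speed up the subsequent join.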

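// `nonUniqueCosts` is assumed to have been loaded elsewhere
val nonUniqueCosts: RDD[(VertexId, Double)]
// Sum the values that share a vertex ID, reusing the graph's vertex index
val uniqueCosts: VertexRDD[Double] =
  graph.vertices.aggregateUsingIndex(nonUniqueCosts, (a,b) => a + b)
val joinedGraph = graph.joinVertices(uniqueCosts)(
  (id, oldCost, extraCost) => oldCost + extraCost)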

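The more general outerJoinVertices behaves like joinVertices, except that the user-defined map function is applied to all vertices and can change the vertex property type. Because not every vertex has a matching value in the input RDD, the map function takes an Option. For example, we can set up a graph for PageRank by initializing the vertex properties with the outDegree.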

val outDegrees: VertexRDD[Int] = graph.outDegrees
val degreeGraph = graph.outerJoinVertices(outDegrees) { (id, oldAttr, outDegOpt) =>
  outDegOpt match {
    case Some(outDeg) => outDeg
    case None => 0 // No outDegree means zero outDegree
  }
}

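You may have noticed the multiple parameter list pattern (for example f(a)(b)) used in the examples above. We could equally have written f(a)(b) as f(a,b), but then type inference for b would not depend on a, so the user would need to annotate the types of the user-defined function: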

// Hypothetical single-parameter-list form, shown only to illustrate the required annotations
val joinedGraph = graph.joinVertices(uniqueCosts,
  (id: VertexId, oldCost: Double, extraCost: Double) => oldCost + extraCost)
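The effect of the two calling styles on type inference can be tried without Spark. The sketch below uses plain Scala maps in place of a graph; `joinCurried` and `joinFlat` are hypothetical helpers written for this illustration and are not part of GraphX. The remark about unannotated lambdas being rejected applies to the Scala 2 compiler; newer compilers may infer more.

```scala
object CurriedInference {
  // Two parameter lists: U is fixed by `table` before `f` is type-checked,
  // so the lambda passed as `f` needs no parameter type annotations.
  def joinCurried[U](attrs: Map[Long, Double], table: Map[Long, U])(
      f: (Long, Double, U) => Double): Map[Long, Double] =
    attrs.map { case (id, attr) =>
      // Keys without a match in `table` keep their original value.
      id -> table.get(id).map(u => f(id, attr, u)).getOrElse(attr)
    }

  // One parameter list: same behavior, but `f` is type-checked together
  // with `table`, so Scala 2 cannot yet use U to infer the lambda's types.
  def joinFlat[U](attrs: Map[Long, Double], table: Map[Long, U],
      f: (Long, Double, U) => Double): Map[Long, Double] =
    joinCurried(attrs, table)(f)

  def main(args: Array[String]): Unit = {
    val costs = Map(1L -> 1.0, 2L -> 2.0)
    val extra = Map(1L -> 10.0)
    // No annotations needed here.
    val a = joinCurried(costs, extra)((id, old, add) => old + add)
    // Annotations written out explicitly, as in the text above.
    val b = joinFlat(costs, extra, (id: Long, old: Double, add: Double) => old + add)
    println(a == b)
  }
}
```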


