Illustrating the Spark API mapPartitions and mapPartitionsWithIndex with code examples
Source: Internet | Editor: 程序博客网 | Date: 2024/05/29 15:43
Code snippet 1:
package com.oreilly.learningsparkexamples.scala

import org.apache.spark._
import org.eclipse.jetty.client.ContentExchange
import org.eclipse.jetty.client.HttpClient

object BasicMapPartitions {
  def main(args: Array[String]) {
    val master = args.length match {
      case x: Int if x > 0 => args(0)
      case _ => "local"
    }
    val sc = new SparkContext(master, "BasicMapPartitions", System.getenv("SPARK_HOME"))
    val input = sc.parallelize(List("KK6JKQ", "Ve3UoW", "kk6jlk", "W6BB"))
    val result = input.mapPartitions { signs =>
      // One HttpClient is created per partition and shared by all of its elements
      val client = new HttpClient()
      client.start()
      signs.map { sign =>
        val exchange = new ContentExchange(true)
        exchange.setURL(s"http://qrzcq.com/call/${sign}")
        client.send(exchange)
        exchange
      }.map { exchange =>
        exchange.waitForDone()
        exchange.getResponseContent()
      }
    }
    println(result.collect().mkString(","))
  }
}
In the code above:
The parameter signs of mapPartitions is an Iterator over all the elements of one partition of the input RDD.
The function passed to mapPartitions must itself return an Iterator, whose elements are the partition's elements after processing.
mapPartitions invokes the partition function once per partition, then concatenates the returned Iterators to form the new RDD. This is why the expensive HttpClient setup above runs once per partition rather than once per element.
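The per-partition contract described above can be sketched without a Spark cluster. This is a minimal, Spark-free simulation; the names `MapPartitionsSketch`, `fakeRDD`, and `mapPartitionsLocal` are illustrative, not Spark API:

```scala
// A Spark-free sketch of the mapPartitions contract: the framework calls the
// partition function once per partition with an Iterator over that partition's
// elements, and concatenates the returned Iterators into the result.
object MapPartitionsSketch {
  // Pretend the RDD's data is already split into two partitions.
  val fakeRDD: Seq[Seq[String]] = Seq(Seq("KK6JKQ", "Ve3UoW"), Seq("kk6jlk", "W6BB"))

  var clientsCreated = 0

  def mapPartitionsLocal[A, B](partitions: Seq[Seq[A]])(f: Iterator[A] => Iterator[B]): Seq[B] =
    partitions.flatMap(p => f(p.iterator))

  def main(args: Array[String]): Unit = {
    val results = mapPartitionsLocal(fakeRDD) { signs =>
      // Expensive setup (the HttpClient in the example above) happens once
      // per partition, not once per element.
      clientsCreated += 1
      signs.map(sign => s"looked up $sign")
    }
    println(results.mkString(","))
    println(s"clients created: $clientsCreated") // one per partition: 2
  }
}
```

Contrast this with map, whose function runs once per element: any setup placed inside it would be repeated for every record.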
Now consider the following code:
package com.oreilly.learningsparkexamples.scala

import org.apache.spark._

object BasicAvgMapPartitions {
  case class AvgCount(var total: Int = 0, var num: Int = 0) {
    def merge(other: AvgCount): AvgCount = {
      total += other.total
      num += other.num
      this
    }
    def merge(input: Iterator[Int]): AvgCount = {
      input.foreach { elem =>
        total += elem
        num += 1
      }
      this
    }
    def avg(): Float = total / num.toFloat
  }

  def main(args: Array[String]) {
    val master = args.length match {
      case x: Int if x > 0 => args(0)
      case _ => "local"
    }
    val sc = new SparkContext(master, "BasicAvgMapPartitions", System.getenv("SPARK_HOME"))
    val input = sc.parallelize(List(1, 2, 3, 4))
    val result = input.mapPartitions(partition =>
        // Here we only want to return a single element for each partition,
        // but mapPartitions requires that we wrap our return in an Iterator
        Iterator(AvgCount(0, 0).merge(partition)))
      .reduce((x, y) => x.merge(y))
    println(result)
  }
}
In this test code, the Iterator over each partition's elements is first folded into a single AvgCount value; that value is then wrapped with Iterator() and returned, because the partition function must return an Iterator even when it yields only one element. The per-partition AvgCounts are finally combined with reduce.
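The same aggregation pattern can be checked locally without Spark. The sketch below re-uses the AvgCount class from the snippet above; the surrounding `AvgSketch` object and its `partitions` stand-in are illustrative assumptions:

```scala
// Spark-free sketch of the per-partition aggregation pattern: each partition
// is reduced to one AvgCount, wrapped in an Iterator, and the per-partition
// results are then merged with reduce.
object AvgSketch {
  case class AvgCount(var total: Int = 0, var num: Int = 0) {
    def merge(other: AvgCount): AvgCount = { total += other.total; num += other.num; this }
    def merge(input: Iterator[Int]): AvgCount = { input.foreach { e => total += e; num += 1 }; this }
    def avg(): Float = total / num.toFloat
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for an RDD with two partitions: [1, 2] and [3, 4]
    val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3, 4))
    val perPartition: Seq[AvgCount] =
      partitions.flatMap(p => Iterator(AvgCount(0, 0).merge(p.iterator)))
    val result = perPartition.reduce((x, y) => x.merge(y))
    println(result.avg()) // (1 + 2 + 3 + 4) / 4 = 2.5
  }
}
```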
mapPartitionsWithIndex is essentially the same as mapPartitions, except that its processing function takes two parameters: the first is the index of the partition currently being processed, and the second is the Iterator over that partition's elements.
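The two-parameter shape can be sketched the same way. Again this is a Spark-free simulation; `MapPartitionsWithIndexSketch`, `fakePartitions`, and `mapPartitionsWithIndexLocal` are illustrative names, not Spark API:

```scala
// Spark-free sketch of mapPartitionsWithIndex semantics: the partition
// function receives the partition's index along with its Iterator.
object MapPartitionsWithIndexSketch {
  // Stand-in for an RDD with two partitions
  val fakePartitions: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3, 4))

  def mapPartitionsWithIndexLocal[A, B](partitions: Seq[Seq[A]])(
      f: (Int, Iterator[A]) => Iterator[B]): Seq[B] =
    partitions.zipWithIndex.flatMap { case (p, idx) => f(idx, p.iterator) }

  def main(args: Array[String]): Unit = {
    // Tag every element with the partition it came from, a common
    // debugging use of mapPartitionsWithIndex.
    val tagged = mapPartitionsWithIndexLocal(fakePartitions) { (idx, elems) =>
      elems.map(e => s"partition $idx: $e")
    }
    tagged.foreach(println)
  }
}
```

In real Spark code the processing function has the signature `(Int, Iterator[T]) => Iterator[U]`, and the rest behaves exactly as with mapPartitions.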