Spark Streaming (4): foreachRDD in Detail


Reposted from: http://blog.csdn.net/jiangpeng59/article/details/53318761

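foreachRDD is typically used to save the results computed by Spark Streaming to an external system such as HDFS, MySQL, or Redis. Understanding the points below helps avoid several common pitfalls.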

Pitfall 1: creating the external connection object in the wrong place, as in the following code:

dstream.foreachRDD { rdd =>  
  val connection = createNewConnection()  // executed at the driver  
  rdd.foreach { record =>  
    connection.send(record) // executed at the worker  
  }  
}

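Here the connection object is created on the driver and then has to be serialized and sent to each worker. In practice, connection objects are rarely serializable, so this usually fails with a serialization error (or an initialization error, because the connection has to be set up on the worker itself).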
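The serialization problem can be reproduced without a cluster. The following is a minimal sketch in plain Scala that uses Java serialization, the same mechanism Spark applies to task closures by default. `FakeConnection` and `Task` are hypothetical stand-ins invented for this illustration, not Spark classes.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a real connection: it does not extend Serializable
class FakeConnection {
  def send(record: String): Unit = println(s"sent $record")
}

// Hypothetical stand-in for a task shipped to a worker: the case class itself is
// serializable, but it holds a reference to the connection
final case class Task(connection: FakeConnection)

object SerializationCheck {
  // Returns true if Java serialization accepts the object, false if it is rejected
  def canSerialize(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    println(canSerialize(Task(new FakeConnection))) // prints false
    println(canSerialize("a plain record"))         // prints true
  }
}
```

Anything that captures the connection becomes unserializable along with it, which is why the connection has to be created inside the code that runs on the worker.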

Pitfall 2: creating a connection object for every record

dstream.foreachRDD { rdd =>  
  rdd.foreach { record =>  
    val connection = createNewConnection()  
    connection.send(record)  
    connection.close()  
  }  
}

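This fixes the problem in Pitfall 1, but connections to external systems such as MySQL are usually expensive to create. Requesting a new connection for every record makes performance very poor.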

A better approach is to create one connection object per partition:

dstream.foreachRDD { rdd =>  
  rdd.foreachPartition { partitionOfRecords =>  
    val connection = createNewConnection()  
    partitionOfRecords.foreach(record => connection.send(record))  
    connection.close()  
  }  
}
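One weakness of the per-partition version above is that the connection is not closed if `send` throws partway through the partition. The following is a minimal sketch of a helper that addresses this, written in plain Scala so it can be tried without Spark. The `Connection` trait, `PartitionWriter`, and the `open` parameter are assumptions made for this illustration.

```scala
// Hypothetical connection interface standing in for a JDBC, Redis, or similar client
trait Connection {
  def send(record: String): Unit
  def close(): Unit
}

object PartitionWriter {
  // Writes one partition through a single connection and closes it even if send throws
  def writePartition(records: Iterator[String], open: () => Connection): Unit = {
    if (records.hasNext) { // empty partition: do not open a connection at all
      val connection = open()
      try records.foreach(connection.send)
      finally connection.close()
    }
  }
}
```

Inside a streaming job this could be called as `rdd.foreachPartition(records => PartitionWriter.writePartition(records, () => createNewConnection()))`, so the connection is still opened on the worker and only once per non-empty partition.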

Finally, an even better approach is to maintain the connection objects in a connection pool:

dstream.foreachRDD { rdd =>  
  rdd.foreachPartition { partitionOfRecords =>  
    // ConnectionPool is a static, lazily initialized pool of connections  
    val connection = ConnectionPool.getConnection()  
    partitionOfRecords.foreach(record => connection.send(record))  
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse  
  }  
}

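As the comment in the code notes, the pool should be initialized lazily (in Scala, for example, as a singleton object or a lazy val), so that connections are created on the worker only when they are first needed rather than up front.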
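The `ConnectionPool` used above is not defined anywhere in the snippet. The following is a minimal sketch of what such a pool might look like, assuming a hypothetical `PooledConnection` class. It relies on the fact that a Scala `object` is initialized on first access in each JVM, so every executor builds its own pool the first time a task asks for a connection. A production pool would also cap its size, validate connections, and time out idle ones; none of that is shown here.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical connection type; a real one would wrap a JDBC or Redis client
final class PooledConnection(val id: Int) {
  def send(record: String): Unit = println(s"[conn $id] $record")
}

// Initialized lazily, once per JVM, on the first call to any of its methods
object ConnectionPool {
  private val idle = new ConcurrentLinkedQueue[PooledConnection]()
  private val created = new AtomicInteger(0)

  // Reuses an idle connection if one exists, otherwise creates a new one on demand
  def getConnection(): PooledConnection = {
    val reused = idle.poll() // null when no idle connection is available
    if (reused != null) reused
    else new PooledConnection(created.incrementAndGet())
  }

  // Puts the connection back so a later task on the same executor can reuse it
  def returnConnection(connection: PooledConnection): Unit = {
    idle.offer(connection)
    ()
  }

  // Number of connections created so far in this JVM
  def createdCount: Int = created.get()
}
```

With this shape, two consecutive partitions processed on the same executor share one connection instead of opening two.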

Below is a sample program found online that saves Spark Streaming results to MySQL:

package spark.examples.streaming  
  
import java.sql.{Connection, DriverManager, PreparedStatement}  
  
import org.apache.spark.SparkConf  
import org.apache.spark.streaming.{Seconds, StreamingContext}  
  
object SparkStreamingForPartition {  
  def main(args: Array[String]): Unit = {  
    val conf = new SparkConf().setAppName("NetCatWordCount")  
    conf.setMaster("local[3]")  
    val ssc = new StreamingContext(conf, Seconds(5))  
    // A DStream is a sequence of RDDs, one per batch interval, which is why foreachRDD makes sense  
    val dstream = ssc.socketTextStream("192.168.26.140", 9999)  
    dstream.foreachRDD(rdd => {  
      // Local function that runs once per partition on the worker  
      def func(records: Iterator[String]): Unit = {  
        var conn: Connection = null  
        var stmt: PreparedStatement = null  
        try {  
          val url = "jdbc:mysql://192.168.26.140:3306/person"  
          val user = "root"  
          val password = ""  
          conn = DriverManager.getConnection(url, user, password)  
          // Prepare the statement once per partition and reuse it for every word;  
          // preparing it inside the loop would leak all but the last statement  
          stmt = conn.prepareStatement("insert into TBL_WORDS(word) values (?)")  
          records.flatMap(_.split(" ")).foreach { word =>  
            stmt.setString(1, word)  
            stmt.executeUpdate()  
          }  
        } catch {  
          case e: Exception => e.printStackTrace()  
        } finally {  
          if (stmt != null) {  
            stmt.close()  
          }  
          if (conn != null) {  
            conn.close()  
          }  
        }  
      }  
      val repartitionedRDD = rdd.repartition(3)  
      repartitionedRDD.foreachPartition(func)  
    })  
    ssc.start()  
    ssc.awaitTermination()  
  }  
}

Details to note:

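Like RDDs, DStreams are evaluated lazily: computation only happens when an action is reached. The RDD inside a DStream output operation must therefore have an action applied to it for the received data to be processed. If the code calls foreachRDD but no RDD action runs inside it, Spark Streaming simply receives the data and discards it without processing.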
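The difference can be sketched as follows. The host `localhost`, port 9999, and the object name are assumptions for this illustration; the program needs Spark Streaming on the classpath and a text source on that socket (for example `nc -lk 9999`).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LazyForeachRDD {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LazyForeachRDD").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)

    // No action inside: map is a transformation and only describes a new RDD,
    // so the upper-casing is never actually executed for any batch
    lines.foreachRDD { rdd =>
      val upper = rdd.map(_.toUpperCase)
      ()
    }

    // count() is an action, so every batch is computed and its size printed
    lines.foreachRDD { rdd =>
      val n = rdd.map(_.toUpperCase).count()
      println(s"records in batch: $n")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Only the second foreachRDD produces visible work. The first one compiles and runs without error, which is what makes this mistake easy to miss.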
