A Case Study Combining Spark Streaming with Spark SQL


This post covers the following:

  • Technical analysis of the Spark Streaming + Spark SQL approach
  • A hands-on Spark Streaming + Spark SQL implementation

Part 1: Technical analysis of Spark Streaming + Spark SQL:

We use Spark Streaming + Spark SQL to compute, online, a ranking of the hottest items within each category of an e-commerce site: for example, the three hottest phones in the phone category, or the three hottest TVs in the TV category. A job like this is highly valuable in real production environments.
Implementation: Spark Streaming + Spark SQL. The reason Spark Streaming can use ML, SQL, GraphX and the rest of Spark is that interfaces such as foreachRDD and transform expose the underlying RDDs. With the RDD as the common foundation, all of Spark's other capabilities can be used directly, as simply as calling an API. Assume each input record has the format "user item category", for example "Rocky Samsung Android".
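Before looking at the full job, here is a minimal sketch of the record-parsing step described above: a "user item category" line is keyed as "category_item" so that clicks can be counted per (category, item) pair. The object name `ClickLogFormat` is my own, chosen for illustration.

```scala
object ClickLogFormat {
  // Input format assumed from the text: "user item category", e.g. "Rocky Samsung Android"
  def toPair(clickLog: String): (String, Int) = {
    val fields = clickLog.split(" ")
    // Key is "category_item" so the count key can later be split back into its parts
    (fields(2) + "_" + fields(1), 1)
  }

  def main(args: Array[String]): Unit = {
    println(toPair("Rocky Samsung Android")) // (Android_Samsung,1)
  }
}
```

This is exactly the shape the streaming job needs for `reduceByKeyAndWindow`, which sums the 1s per key over the window.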

Part 2: Spark Streaming + Spark SQL in practice:
1. The code:

import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OnlineTheTop3ItemForEachCategory2DB {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("OnlineForeachRDD2DB").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Checkpointing is required because reduceByKeyAndWindow is used with an inverse function
    ssc.checkpoint("/root/Documents/SparkApps/checkpoint")

    val userClickLogsDStream = ssc.socketTextStream("Master", 9999)
    // Each log line is "user item category"; key each click by "category_item"
    val formattedUserClickLogDStream = userClickLogsDStream.map(clickLog =>
      (clickLog.split(" ")(2) + "_" + clickLog.split(" ")(1), 1))
    // 60-second window sliding every 20 seconds; the inverse function (_ - _)
    // lets Spark update the window incrementally instead of recomputing it
    val categoryUserClickLogsDStream = formattedUserClickLogDStream.reduceByKeyAndWindow(
      _ + _, _ - _, Seconds(60), Seconds(20))

    categoryUserClickLogsDStream.foreachRDD { rdd =>
      if (rdd.isEmpty()) {
        println("No data inputted!!!")
      } else {
        // Split the "category_item" key back apart and build Rows for Spark SQL
        val categoryItemRow = rdd.map { reducedItem =>
          val category = reducedItem._1.split("_")(0)
          val item = reducedItem._1.split("_")(1)
          val click_count = reducedItem._2
          Row(category, item, click_count)
        }
        val structType = StructType(Array(
          StructField("category", StringType, true),
          StructField("item", StringType, true),
          StructField("click_count", IntegerType, true)
        ))
        val hiveContext = new HiveContext(rdd.context)
        val categoryItemDF = hiveContext.createDataFrame(categoryItemRow, structType)
        categoryItemDF.registerTempTable("categoryItemTable")

        // row_number() over each category partition gives the top 3 items per category
        val resultDataFrame = hiveContext.sql(
          "SELECT category, item, click_count FROM " +
            "(SELECT category, item, click_count, " +
            "row_number() OVER (PARTITION BY category ORDER BY click_count DESC) rank " +
            "FROM categoryItemTable) subquery " +
            "WHERE rank <= 3")
        resultDataFrame.show()

        val resultRowRDD = resultDataFrame.rdd
        resultRowRDD.foreachPartition { partitionOfRecords =>
          if (partitionOfRecords.isEmpty) {
            println("This RDD is not empty, but this partition is empty")
          } else {
            // One pooled connection per partition, returned when the partition is done
            val connection = ConnectionPool.getConnection()
            partitionOfRecords.foreach { record =>
              val sql = "insert into categorytop3(category,item,client_count) values('" +
                record.getAs("category") + "','" + record.getAs("item") + "'," +
                record.getAs("click_count") + ")"
              val stmt = connection.createStatement
              stmt.executeUpdate(sql)
              stmt.close()
            }
            ConnectionPool.returnConnection(connection)
          }
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
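The heart of the SQL above is the window query: `row_number() OVER (PARTITION BY category ORDER BY click_count DESC)` ranks items within each category, and the outer filter keeps ranks 1 to 3. The same ranking can be sketched in plain Scala collections (the object name `Top3PerCategory` and the sample data are mine, for illustration only):

```scala
object Top3PerCategory {
  // Plain-Scala equivalent of the row_number() window query:
  // within each category, keep the three items with the highest click counts.
  def top3(rows: Seq[(String, String, Int)]): Seq[(String, String, Int)] =
    rows.groupBy(_._1).toSeq.sortBy(_._1).flatMap { case (_, group) =>
      group.sortBy(-_._3).take(3)
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      ("phone", "Samsung", 30), ("phone", "Apple", 25), ("phone", "Xiaomi", 20),
      ("phone", "Nokia", 5), ("tv", "Sony", 12), ("tv", "TCL", 9)
    )
    top3(rows).foreach(println) // Nokia is dropped: only the top 3 phones survive
  }
}
```

Doing this ranking in SQL rather than by hand is precisely the payoff of bridging the DStream to Spark SQL through foreachRDD.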

2. Next, package the code, put it on the cluster, and write the submit script:

/usr/local/spark/bin/spark-submit --files /usr/local/hive/conf/hive-site.xml --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.35-bin.jar /root/Documents/SparkApps/SparkStreamingApps.jar

3. Start the cluster and the Hive metastore service:

hive --service metastore & 

4. In MySQL, create the target table:

create table categorytop3 (category varchar(500),item varchar(2000),client_count int);

5. Start the job, and open the socket source with nc -lk 9999

6. Observe the results:

This post is based on my summary notes from the DT大数据梦工厂 Spark course. The related course videos are available at:
Baidu Netdisk link: http://pan.baidu.com/s/1slvODe1 (if the link expires or you need further material, contact QQ 460507491 or WeChat DT1219477246).
