Kafka + Structured Streaming + S3 + DynamoDB
This post shows a demo that consumes data from Kafka, computes two metrics (PV and UV), and writes the results to DynamoDB and S3 respectively. The demo only illustrates the logic; it is not directly runnable as-is.
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.functions.approx_count_distinct
import org.apache.spark.sql.streaming.Trigger
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, PutItemRequest}

object KafkaDemoToS3AndDynamoDB {

  def main(args: Array[String]): Unit = {
    // placeholders: fill in your own bucket, checkpoint and output paths
    val bucket: String = "<your-bucket>"
    val pvCheckLocation: String = "<pv-checkpoint-path>"
    val uvCheckLocation: String = "<uv-checkpoint-path>"
    val uvPath: String = "<uv-output-path>"

    // create spark session
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.eventLog.enabled", "false")
      .config("spark.driver.memory", "2g")
      .config("spark.executor.memory", "2g")
      .config("spark.sql.shuffle.partitions", "4")
      .appName("kafkaDemoToS3AndDynamoDB")
      .getOrCreate()

    import spark.implicits._

    // read stream from kafka
    val kafkaStream = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
      .option("subscribe", "topic1,topic2")
      .option("startingOffsets", "latest")
      .option("failOnDataLoss", false)
      .option("maxOffsetsPerTrigger", 10L)
      .load()
      .selectExpr("CAST(value AS STRING)", "CAST(topic AS STRING)", "CAST(partition AS INT)")
      .as[(String, String, Int)]

    val cols = List("domain", "ip", "timestamp")

    // parse each value ("domain ip timestamp", space separated) into columns
    val df = kafkaStream.map { line =>
      val columns = line._1.split(" ")
      (columns(0), columns(1), columns(2).toLong)
    }.toDF(cols: _*)

    // all logs, de-duplicated
    val ds = df.distinct().select($"domain", $"ip", $"timestamp").as[ClickStream]

    // compute PV: total hits per domain
    // (the original post elided this aggregation; a plain count per domain
    // matches the (domain, pv) row shape the DynamoDB sink below expects)
    val dsPV = ds.groupBy($"domain").count()

    // compute UV: distinct visitors per domain
    // (also elided in the original; approx_count_distinct is the
    // streaming-friendly way to count distinct IPs)
    val dsUV = ds.groupBy($"domain").agg(approx_count_distinct($"ip").as("uv"))

    // save pv to dynamodb
    val queryPVCount = dsPV.writeStream
      .outputMode("update")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .option("checkpointLocation", pvCheckLocation)
      .foreach(new ForeachWriter[Row] {
        var ddbClient: AmazonDynamoDBClient = _

        override def open(partitionId: Long, version: Long): Boolean = {
          ddbClient = new AmazonDynamoDBClient()
          true
        }

        override def process(value: Row): Unit = {
          val put = new PutItemRequest()
          put.setTableName("table")
          var attr = new AttributeValue()
          attr.setS(value.getAs[String](0))
          put.addItemEntry("domain", attr)
          attr = new AttributeValue()
          attr.setN(value.getAs[Long](1).toString)
          put.addItemEntry("pv", attr)
          ddbClient.putItem(put)
        }

        override def close(errorOrNull: Throwable): Unit = {}
      })
      .start()

    // save uv to s3
    // (note: the file sink only supports append output mode, so this query is
    // illustrative as written; in practice add a watermark and use append,
    // or write through a ForeachWriter as above)
    val queryUVCount = dsUV.writeStream
      .outputMode("update")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .option("checkpointLocation", uvCheckLocation)
      .start("s3://" + bucket + "/" + uvPath)

    queryPVCount.awaitTermination()
    queryUVCount.awaitTermination()
  }

  // entity: one parsed log line
  case class ClickStream(domain: String, ip: String, timestamp: Long)
}
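Two pieces of setup sit outside the code. The sketches below are assumptions, not part of the original post: the jar name is hypothetical, the throughput numbers are arbitrary, and the Kafka source package coordinate assumes Spark 2.2 built with Scala 2.11 (adjust to your cluster).

The parser above expects each Kafka message value to be a space-separated line of "domain ip timestamp", e.g. "example.com 10.0.0.1 1500000000". The Kafka source ships as a separate package, so a submission could look like:

  spark-submit \
    --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 \
    --class KafkaDemoToS3AndDynamoDB \
    kafka-demo.jar            # hypothetical jar name

The ForeachWriter puts items keyed by "domain" into a DynamoDB table literally named "table", so a minimal matching table (again a sketch, with made-up capacity) would be:

  aws dynamodb create-table \
    --table-name table \
    --attribute-definitions AttributeName=domain,AttributeType=S \
    --key-schema AttributeName=domain,KeyType=HASH \
    --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5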