Spark 2.1 structured streaming

来源：互联网发布：淘宝虚标电池编辑：程序博客网时间：2024/05/02 00:00

概述

在Spark2.0时，Spark引入了structured streaming，structured streaming是建立在Spark SQL之上的可扩展和高容错的流处理架构。不同于Spark1.x时代的DStream和ForeachRDD, structured streaming的目的是使用户能够像使用Spark SQL处理批处理一样，能够使用相同的方法处理流数据。Spark SQL引擎会递增式的处理到来的数据，并且持续更新流处理输出的数据。

当前的structured streaming 特性还是处于alpha版本，可以进行实验环境的验证，不建议进行生产环境。

没有边界的大表

不同于Spark1.x使用interval将流数据分为不同的mini batch, structured streaming将流数据看作是一张没有边界的表，流数据不断的向表尾增加数据。如下图所示：
这里写图片描述

在每一个周期时，新的内容将会增加到表尾，查询的结果将会更新到结果表中。一旦结果表被更新，就需要将改变后的表内容输出到外部的sink中。

structured streaming支持三种输出模式:

Complete mode: 整个更新的结果表都会被输出。
Append mode: 只有新增加到结果表的数据会被输出。
Updated mode: 只有被更新的结果表会输出。当前版本暂不支持这个特性
Word count

不同于Spark1.x，用户需要自己保存历史的数据，structured steaming会帮助用户维护历史的计算数据放到结果表中，每次只需要更新结果表的数据。

Event time

event time作为Row的一列，表示的event的实际时间，而不是到达streaming处理的时间。
event time可以用来进行基于时间相关的计算

watermark处理延迟数据

上面提到了structured streaming可以维护历史的数据，但是如果一条数据的到来时间延迟过长，那么计算这条数据没有什么意义。因此需要一种机制丢弃掉延迟到来的数据。在Spark2.1中，引入了watermark机制。watermark指定列名称为event time的，并且定义了数据延迟到大的最大阈值。超过这个阈值到来的数据将会被忽略。
使用示例如下：

import spark.implicits._val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }// Group the data by window and word and compute the count of each groupval windowedCounts = words    .withWatermark("timestamp", "10 minutes")    .groupBy(        window($"timestamp", "10 minutes", "5 minutes"),        $"word")    .count()

Kafka支持

Spark2.1支持Kafka0.10.0集成structured streaming. 可以支持从一个topic，多个topic读入数据生成data frame.

// Subscribe to 1 topicval ds1 = spark  .readStream  .format("kafka")  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")  .option("subscribe", "topic1")  .load()ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")  .as[(String, String)]// Subscribe to multiple topicsval ds2 = spark  .readStream  .format("kafka")  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")  .option("subscribe", "topic1,topic2")  .load()ds2.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")  .as[(String, String)]// Subscribe to a patternval ds3 = spark  .readStream  .format("kafka")  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")  .option("subscribePattern", "topic.*")  .load()ds3.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")  .as[(String, String)]

转自简书

0 0