Spark 2.2 Structured Streaming

Actually, the official documentation already explains all of this:
http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
package com.renjiaming.spark2T2

import java.util.concurrent.TimeUnit

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

/**
  * Created by othc on 2017/7/14.
  */
object StructuredStreaming {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    val spark: SparkSession = SparkSession.builder()
      .appName("StructuredStreaming")
      .master("local[2]")
      .getOrCreate()

    import spark.implicits._

    // Read a text stream from a socket (e.g. started with `nc -lk 9999` on host scfl4)
    val lines: DataFrame = spark.readStream.format("socket")
      .option("host", "scfl4")
      .option("port", "9999")
      .load()

    // Split each line into words
    val word: Dataset[String] = lines.as[String].flatMap(_.split(" "))

    // Running word count
    val count: DataFrame = word.groupBy("value").count()

    // Print the complete counts to the console every 10 seconds
    val query: StreamingQuery = count.writeStream
      .trigger(ProcessingTime.create(10, TimeUnit.SECONDS))
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
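The page linked above is the Kafka integration guide, while the demo reads from a socket. For reference, a minimal sketch of pointing the same word count at a Kafka source would look roughly like the following; the broker address scfl4:9092 and the topic name test are placeholders, and the spark-sql-kafka-0-10 package must be on the classpath.

    // Sketch only: reading the stream from Kafka instead of a socket.
    // "scfl4:9092" and "test" are placeholder broker/topic names.
    val kafkaLines: Dataset[String] = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "scfl4:9092")
      .option("subscribe", "test")
      .load()
      .selectExpr("CAST(value AS STRING)")   // Kafka records arrive as binary key/value
      .as[String]

    // The rest of the word count is unchanged: split into words, group, and write to the console.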
The following is from the official documentation:
Creating streaming DataFrames and streaming Datasets

Streaming DataFrames can be created through the DataStreamReader interface (Scala/Java/Python docs) returned by SparkSession.readStream(). In R, use the read.stream() method. Similar to the read interface for creating static DataFrames, you can specify the details of the source: data format, schema, options, etc.

Input Sources

In Spark 2.0, there are a few built-in sources.

- File source - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems can be achieved by file move operations.
- Kafka source - Polls data from Kafka. It is compatible with Kafka broker versions 0.10.0 or higher. See the Kafka Integration Guide for more details.
- Socket source (for testing) - Reads UTF8 text data from a socket connection. The listening server socket is at the driver. Note that this should be used only for testing, as it does not provide end-to-end fault-tolerance guarantees.

Some sources are not fault-tolerant because they do not guarantee that data can be replayed using checkpointed offsets after a failure. See the earlier section on fault-tolerance semantics. Here are the details of all the sources in Spark (Source / Options / Fault-tolerant / Notes):

File source
  Options:
    path: path to the input directory, common to all file formats.
    maxFilesPerTrigger: maximum number of new files to be considered in every trigger (default: no max).
    latestFirst: whether to process the latest new files first, useful when there is a large backlog of files (default: false).
    fileNameOnly: whether to check new files based on only the filename instead of the full path (default: false). With this set to `true`, the following files would be considered the same file, because their filenames, "dataset.txt", are the same:
      "file:///dataset.txt"
      "s3://a/dataset.txt"
      "s3n://a/b/dataset.txt"
      "s3a://a/b/c/dataset.txt"
    For file-format-specific options, see the related methods in DataStreamReader (Scala/Java/Python/R), e.g. for the "parquet" format see DataStreamReader.parquet().
  Fault-tolerant: Yes
  Notes: Supports glob paths, but does not support multiple comma-separated paths/globs.

Socket source
  Options:
    host: host to connect to, must be specified.
    port: port to connect to, must be specified.
  Fault-tolerant: No

Kafka source
  Options: See the Kafka Integration Guide.
  Fault-tolerant: Yes

Here are some examples.

val spark: SparkSession = ...

// Read text from socket
val socketDF = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

socketDF.isStreaming    // Returns True for DataFrames that have streaming sources

socketDF.printSchema

// Read all the csv files written atomically in a directory
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark.readStream
  .option("sep", ";")
  .schema(userSchema)      // Specify schema of the csv files
  .csv("/path/to/directory")    // Equivalent to format("csv").load("/path/to/directory")
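To make the file-source options above concrete, here is a minimal sketch of a streaming JSON read that sets maxFilesPerTrigger and latestFirst; the directory path is a placeholder, and userSchema is the StructType from the previous example.

// Sketch only: a JSON file source using the options described above.
// "/path/to/json/dir" is a placeholder path.
val jsonDF = spark.readStream
  .schema(userSchema)                    // streaming file sources need an explicit schema
  .option("maxFilesPerTrigger", "1")     // consider at most one new file per trigger
  .option("latestFirst", "true")         // process the newest files first when there is a backlog
  .json("/path/to/json/dir")             // equivalent to format("json").load("/path/to/json/dir")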