Spark 2.2 Structured Streaming
In fact, the official documentation already covers all of this:
http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
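The integration guide linked above configures the Kafka reader entirely through options (plus the spark-sql-kafka-0-10 dependency on the classpath), rather than the host/port pair used by the socket source. A minimal sketch of those options; the broker address and topic name here are hypothetical, and the Spark calls are shown in comments because they need a running SparkSession:

```scala
object KafkaSourceSketch {
  // Hypothetical broker address and topic name; substitute your own.
  val kafkaOptions: Map[String, String] = Map(
    "kafka.bootstrap.servers" -> "broker1:9092", // comma-separated broker list
    "subscribe"               -> "wordcount"     // topic(s) to subscribe to
  )

  // In a Spark 2.2 application these options would be applied as:
  //   val df = spark.readStream
  //     .format("kafka")
  //     .options(kafkaOptions)
  //     .load()
  //   // Kafka rows carry key/value as binary, so cast value to a string:
  //   val lines = df.selectExpr("CAST(value AS STRING)").as[String]
}
```

From there the word-count transformations are identical to the socket example below; only the source definition changes.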
package com.renjiaming.spark2T2

import java.util.concurrent.TimeUnit

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

/**
 * Created by othc on 2017/7/14.
 */
object StructuredStreaming {
  def main(args: Array[String]): Unit = {
    // Silence Spark's INFO logging so the console sink output stays readable
    Logger.getLogger("org").setLevel(Level.ERROR)

    val spark: SparkSession = SparkSession.builder()
      .appName("StructuredStreaming")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Socket source (testing only): read lines of text from host scfl4, port 9999
    val lines: DataFrame = spark.readStream.format("socket")
      .option("host", "scfl4")
      .option("port", "9999")
      .load()

    // Split each line into words and count occurrences per word
    val words: Dataset[String] = lines.as[String].flatMap(_.split(" "))
    val counts: DataFrame = words.groupBy("value").count()

    // Emit the full (complete-mode) result table to the console every 10 seconds;
    // since Spark 2.2, Trigger.ProcessingTime("10 seconds") is the preferred equivalent
    val query: StreamingQuery = counts.writeStream
      .trigger(ProcessingTime.create(10, TimeUnit.SECONDS))
      .outputMode("complete")
      .format("console")
      .start()
    query.awaitTermination()
  }
}
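The streaming query above just applies an ordinary flatMap-then-groupBy word count to each micro-batch. The same transformation on a plain Scala collection, with no Spark involved (object and method names here are illustrative):

```scala
object WordCountSketch {
  // Split each line on spaces and count occurrences of each word, mirroring
  // lines.as[String].flatMap(_.split(" ")).groupBy("value").count()
  def counts(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(_.split(" "))
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size.toLong }
}
```

For example, `WordCountSketch.counts(Seq("a b", "b c"))` yields `Map("a" -> 1, "b" -> 2, "c" -> 1)`. In the streaming version, `outputMode("complete")` re-emits this entire table of counts on every trigger.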
The relevant section of the official documentation follows:
Creating streaming DataFrames and streaming Datasets

Streaming DataFrames can be created through the DataStreamReader interface (Scala/Java/Python docs) returned by SparkSession.readStream(). In R, use the read.stream() method. Similar to the read interface for creating static DataFrames, you can specify the details of the source: data format, schema, options, etc.

Input Sources

In Spark 2.0, there are a few built-in sources.

- File source - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, and parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and the supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems can be achieved by file move operations.
- Kafka source - Polls data from Kafka. It is compatible with Kafka broker versions 0.10.0 or higher. See the Kafka Integration Guide for more details.
- Socket source (for testing) - Reads UTF-8 text data from a socket connection. The listening server socket is at the driver. Note that this should be used only for testing, as it does not provide end-to-end fault-tolerance guarantees.

Some sources are not fault-tolerant because they do not guarantee that data can be replayed using checkpointed offsets after a failure. See the earlier section on fault-tolerance semantics. Here are the details of all the sources in Spark.

File source
  Options:
    path: path to the input directory, common to all file formats.
    maxFilesPerTrigger: maximum number of new files to be considered in every trigger (default: no max)
    latestFirst: whether to process the latest new files first, useful when there is a large backlog of files (default: false)
    fileNameOnly: whether to check new files based on only the filename instead of the full path (default: false). With this set to true, the following would be considered the same file, because their filenames, "dataset.txt", are the same:
      "file:///dataset.txt"
      "s3://a/dataset.txt"
      "s3n://a/b/dataset.txt"
      "s3a://a/b/c/dataset.txt"
    For file-format-specific options, see the related methods in DataStreamReader (Scala/Java/Python/R), e.g. for the "parquet" format see DataStreamReader.parquet().
  Fault-tolerant: Yes
  Notes: Supports glob paths, but does not support multiple comma-separated paths/globs.

Socket source
  Options:
    host: host to connect to, must be specified
    port: port to connect to, must be specified
  Fault-tolerant: No

Kafka source
  Options: See the Kafka Integration Guide.
  Fault-tolerant: Yes

Here are some examples.

val spark: SparkSession = ...

// Read text from socket
val socketDF = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

socketDF.isStreaming  // Returns true for DataFrames that have streaming sources

socketDF.printSchema

// Read all the csv files written atomically in a directory
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark.readStream
  .option("sep", ";")
  .schema(userSchema)         // Specify schema of the csv files
  .csv("/path/to/directory")  // Equivalent to format("csv").load("/path/to/directory")
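The fileNameOnly option in the file source table above changes the key under which the source remembers files it has already processed: the full URI by default, or just the base name when enabled. A sketch of that keying rule as described in the docs (this is an illustration of the documented behavior, not Spark's internal code):

```scala
object FileKeySketch {
  // With fileNameOnly = true, only the base name after the last '/' identifies
  // a file for de-duplication; with false, the full URI does.
  def fileKey(path: String, fileNameOnly: Boolean): String =
    if (fileNameOnly) path.substring(path.lastIndexOf('/') + 1) else path
}
```

Under this rule all four example URIs from the table ("file:///dataset.txt", "s3://a/dataset.txt", etc.) collapse to the single key "dataset.txt", so only one of them would ever be processed.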