Spark Structured Streaming Framework (2): Input Sources in Detail
Source: Internet. Editor: 程序博客网. Time: 2024/05/22 17:13
As of version 2.1.0, Spark Structured Streaming supports only three input sources: File, Kafka, and socket.
1. Socket
The socket source is the simplest input source; the program in the Quick Example uses it. You only need to specify the "socket" format and set the host and port to read from.
val socketDF = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
Note:
The socket source reads UTF-8 text data. It is intended only for testing and should not be used in end-to-end production applications.
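Because the socket source opens a client connection to the configured host and port, something must already be listening there before the query starts. The official Quick Example uses `nc -lk 9999` for this. As an alternative, here is a minimal sketch of a test data server in plain Scala; `LineServer` and `serveOnce` are hypothetical names introduced for illustration and are not part of Spark.

```scala
import java.io.{OutputStreamWriter, PrintWriter}
import java.net.ServerSocket

// Hypothetical helper (not part of Spark): sends a fixed list of text lines
// to the first client that connects, then closes the connection.
object LineServer {
  /** Starts a daemon thread that accepts one client on `server` and sends `lines`. */
  def serveOnce(server: ServerSocket, lines: Seq[String]): Thread = {
    val t = new Thread(new Runnable {
      override def run(): Unit = {
        val client = server.accept()
        // The socket source expects UTF-8 text, one record per line.
        val out = new PrintWriter(new OutputStreamWriter(client.getOutputStream, "UTF-8"), true)
        try lines.foreach(l => out.println(l))
        finally { out.close(); client.close() }
      }
    })
    t.setDaemon(true)
    t.start()
    t
  }

  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(9999) // same port as in the example above
    serveOnce(server, Seq("hello spark", "hello streaming")).join()
    server.close()
  }
}
```

Start this program first, then start the streaming query that reads from localhost:9999. The connection is closed once the lines are sent, so this is only suitable for a quick smoke test.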
2. Kafka
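Structured Streaming provides an interface for reading from Kafka, and it is easy to use. You only need to pay attention to the extra library that the development environment depends on and to the Kafka version in the streaming runtime environment.
2.1 Development environment
To use Kafka as an input source, the development environment needs an additional dependency. For example, with Spark 2.1.0, add the following dependency to the Maven pom.xml file.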
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
  <version>2.1.0</version>
</dependency>
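For projects built with sbt rather than Maven, the same dependency could be declared roughly as follows. This is a sketch that assumes Scala 2.11 and Spark 2.1.0; the exact Scala patch version and the "provided" scope on spark-sql are assumptions, not something the article specifies. Note that the artifact is published under the name spark-sql-kafka-0-10 (with hyphens).

```scala
// build.sbt (sketch; versions are assumptions matching Spark 2.1.0 on Scala 2.11)
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // Spark SQL itself is normally supplied by the cluster at runtime.
  "org.apache.spark" %% "spark-sql"            % "2.1.0" % "provided",
  // Kafka source for Structured Streaming; %% appends the _2.11 suffix.
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.1.0"
)
```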
2.2 API
As with the socket source, you only need to specify "kafka" as the format and pass the Kafka bootstrap servers and the topics to subscribe to, as shown below:
// The three snippets below are independent alternatives.
// .as[(String, String)] requires: import spark.implicits._

// Subscribe to 1 topic
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to multiple topics
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to a pattern
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
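The subscribePattern option takes a Java regular expression. To see which topic names a pattern such as "topic.*" would select, it can be tried out with java.util.regex directly. The sketch below assumes the whole topic name has to match the pattern; `TopicPatternDemo` and the topic names are made up for illustration.

```scala
import java.util.regex.Pattern

// Illustration only: checks topic names against a subscribePattern-style regex.
object TopicPatternDemo {
  /** Returns the topic names that the pattern matches in full. */
  def matching(pattern: String, topics: Seq[String]): Seq[String] = {
    val p = Pattern.compile(pattern)
    topics.filter(t => p.matcher(t).matches())
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical topic names.
    val topics = Seq("topic1", "topic2", "topic", "mytopic", "other")
    println(matching("topic.*", topics))
  }
}
```

Under the full-match assumption, "mytopic" is not selected, while "topic" itself is, because ".*" also matches the empty string.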
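2.3 Runtime environment
Spark 2.1.0 uses the Kafka 0.10 client, so the Kafka brokers that deliver the data must also run a matching version, that is, Kafka 0.10.0 or later.
Otherwise the following error occurs:
[Figure 21: screenshot of the error]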
3. File
Structured Streaming can use the files in a directory as an input source. The supported file formats are text, csv, json, and parquet.
For example:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object StructuredFile {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("StructuredNetWordCount")
      .getOrCreate()

    // Schema of the JSON records: a string "name" and an integer "age"
    val userSchema = new StructType().add("name", "string").add("age", "integer")

    val jsonDF = spark
      .readStream
      .schema(userSchema)
      .json("/root/jar/directory") // Equivalent to format("json").load("/root/jar/directory")

    // Print each micro-batch to the console
    val query = jsonDF.writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
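There are five methods for reading files:
- format(source).load(path): the source argument is the file format, one of text, csv, json, or parquet;
- text(path): shorthand for format("text").load(path);
- json(path): shorthand for format("json").load(path);
- csv(path): shorthand for format("csv").load(path);
- parquet(path): shorthand for format("parquet").load(path);
Here path is the directory to watch. When new files appear in that directory, they are picked up as streaming data. The directory must contain only files of the specified format; files in other formats must not be stored there.
Note:
When running on a Spark cluster, the path is a path in HDFS; when running in local mode, it is a path on the local filesystem.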
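Since new files are picked up as soon as they appear, a file that is still being written could be read half-finished. The Structured Streaming Programming Guide asks for files to be placed into the directory atomically, which on most filesystems means writing elsewhere and then moving the file in. The following is a minimal sketch for the local-filesystem case using java.nio; `AtomicDrop`, `staging`, and `watched` are hypothetical names, and both directories are assumed to be on the same filesystem (otherwise ATOMIC_MOVE fails).

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

// Illustration only: write to a staging directory, then move into the watched one.
object AtomicDrop {
  /** Writes `content` to a temp file in `staging`, then moves it atomically to `watched/name`. */
  def drop(staging: Path, watched: Path, name: String, content: String): Path = {
    val tmp = Files.createTempFile(staging, name, ".tmp")
    Files.write(tmp, content.getBytes(StandardCharsets.UTF_8))
    // The file becomes visible in the watched directory in a single step.
    Files.move(tmp, watched.resolve(name), StandardCopyOption.ATOMIC_MOVE)
  }
}
```

On HDFS the equivalent would be a rename within the same filesystem; that variant is not shown here.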
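Four file formats can be read, but not all of them require a call to schema() to describe the data:
- csv, json, parquet: the user must specify the schema manually through schema();
- text: no schema is needed; the result has a single column named "value".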
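4. Custom sources
If the input sources provided by the Spark Structured Streaming API do not meet your needs, there is another option: modify the source code.
The following walks through the approach, using the classes behind the "socket" source as the starting point:
4.1 Implementing a Provider
First implement a Provider, a class that returns the source object that receives the data. The Provider implementation must implement three methods:
No. | Method | Description
1 | sourceSchema | Returns a (name, schema) pair: the name is a String and the schema is a StructType
2 | createSource | Returns the object that receives the data; it implements the Source interface
3 | shortName | Returns the identifier of the data source, i.e. the string passed to format(), such as "socket", "json", or "kafka"
The following implements a TextRabbitMQSourceProvider class. The code is adapted from Spark's socket source and still references the schema constants of TextSocketSource: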
class TextRabbitMQSourceProvider extends StreamSourceProvider with DataSourceRegister with Logging {
private def parseIncludeTimestamp(params: Map[String, String]): Boolean = {
Try(params.getOrElse("includeTimestamp", "false").toBoolean) match {
case Success(bool) => bool
case Failure(_) =>
throw new AnalysisException("includeTimestamp must be set to either \"true\" or \"false\"")
}
}
/** Returns the name and schema of the source that can be used to continually read data. */
override def sourceSchema(
sqlContext: SQLContext,
schema: Option[StructType],
providerName: String,
parameters: Map[String, String]): (String, StructType) = {
logWarning("The socket source should not be used for production applications! " +
"It does not support recovery.")
if (!parameters.contains("host")) {
throw new AnalysisException("Set a host to read from with option(\"host\", ...).")
}
if (!parameters.contains("port")) {
throw new AnalysisException("Set a port to read from with option(\"port\", ...).")
}
val schema =
if (parseIncludeTimestamp(parameters)) {
TextSocketSource.SCHEMA_TIMESTAMP
} else {
TextSocketSource.SCHEMA_REGULAR
}
("textSocket", schema)
}
override def createSource(
sqlContext: SQLContext,
metadataPath: String,
schema: Option[StructType],
providerName: String,
parameters: Map[String, String]): Source = {
val host = parameters("host")
val port = parameters("port").toInt
new TextRabbitMQSource(host, port, parseIncludeTimestamp(parameters), sqlContext)
}
/** String that represents the format that this data source provider uses. */
override def shortName(): String = "RabbitMQ"
}
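4.2 Implementing a Source
Next, implement the class that actually receives the data. The Provider creates an instance of it, as in the createSource() method above. It must implement the following methods of the Source trait so that the Structured Streaming engine can call them:
No. | Method | Description
1 | getOffset | Returns the latest available offset, which tells the engine whether new data is available
2 | getBatch | Returns the available data as a DataFrame
3 | commit | Tells the source the offset up to which data has been processed, so that older data can be discarded
4 | stop | Stops the source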
class TextRabbitMQSource(host: String, port: Int, includeTimestamp: Boolean, sqlContext: SQLContext)
extends Source with Logging {
@GuardedBy("this")
private var socket: Socket = null
@GuardedBy("this")
private var readThread: Thread = null
/**
* All batches from `lastCommittedOffset + 1` to `currentOffset`, inclusive.
* Stored in a ListBuffer to facilitate removing committed batches.
*/
@GuardedBy("this")
protected val batches = new ListBuffer[(String, Timestamp)]
@GuardedBy("this")
protected var currentOffset: LongOffset = new LongOffset(-1)
@GuardedBy("this")
protected var lastOffsetCommitted : LongOffset = new LongOffset(-1)
initialize()
private def initialize(): Unit = synchronized {
socket = new Socket(host, port)
val reader = new BufferedReader(new InputStreamReader(socket.getInputStream))
readThread = new Thread(s"TextSocketSource($host, $port)") {
setDaemon(true)
override def run(): Unit = {
try {
while (true) {
val line = reader.readLine()
if (line == null) {
// End of file reached
logWarning(s"Stream closed by $host:$port")
return
}
TextRabbitMQSource.this.synchronized {
val newData = (line,
Timestamp.valueOf(
TextSocketSource.DATE_FORMAT.format(Calendar.getInstance().getTime()))
)
currentOffset = currentOffset + 1
batches.append(newData)
}
}
} catch {
case e: IOException =>
}
}
}
readThread.start()
}
/** Returns the schema of the data from this source */
override def schema: StructType = if (includeTimestamp) TextSocketSource.SCHEMA_TIMESTAMP
else TextSocketSource.SCHEMA_REGULAR
override def getOffset: Option[Offset] = synchronized {
if (currentOffset.offset == -1) {
None
} else {
Some(currentOffset)
}
}
/** Returns the data that is between the offsets (`start`, `end`]. */
override def getBatch(start: Option[Offset], end: Offset): DataFrame = synchronized {
val startOrdinal =
start.flatMap(LongOffset.convert).getOrElse(LongOffset(-1)).offset.toInt + 1
val endOrdinal = LongOffset.convert(end).getOrElse(LongOffset(-1)).offset.toInt + 1
// Internal buffer only holds the batches after lastOffsetCommitted
val rawList = synchronized {
val sliceStart = startOrdinal - lastOffsetCommitted.offset.toInt - 1
val sliceEnd = endOrdinal - lastOffsetCommitted.offset.toInt - 1
batches.slice(sliceStart, sliceEnd)
}
import sqlContext.implicits._
val rawBatch = sqlContext.createDataset(rawList)
// Underlying MemoryStream has schema (String, Timestamp); strip out the timestamp
// if requested.
if (includeTimestamp) {
rawBatch.toDF("value", "timestamp")
} else {
// Strip out timestamp
rawBatch.select("_1").toDF("value")
}
}
override def commit(end: Offset): Unit = synchronized {
val newOffset = LongOffset.convert(end).getOrElse(
sys.error(s"TextSocketStream.commit() received an offset ($end) that did not " +
s"originate with an instance of this class")
)
val offsetDiff = (newOffset.offset - lastOffsetCommitted.offset).toInt
if (offsetDiff < 0) {
sys.error(s"Offsets committed out of order: $lastOffsetCommitted followed by $end")
}
batches.trimStart(offsetDiff)
lastOffsetCommitted = newOffset
}
/** Stop this source. */
override def stop(): Unit = synchronized {
if (socket != null) {
try {
// Unfortunately, BufferedReader.readLine() cannot be interrupted, so the only way to
// stop the readThread is to close the socket.
socket.close()
} catch {
case e: IOException =>
}
socket = null
}
}
override def toString: String = s"TextSocketSource[host: $host, port: $port]"
}
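The index arithmetic in getBatch() and commit() above is easier to follow without the Spark types. Below is a simplified, Spark-free sketch of the same bookkeeping: `OffsetBuffer` is a hypothetical class introduced for illustration, and plain Long values stand in for LongOffset, with -1 meaning "nothing yet".

```scala
import scala.collection.mutable.ListBuffer

// Simplified model of the buffer bookkeeping in the source above (not Spark code).
class OffsetBuffer[T] {
  private val batches = new ListBuffer[T]
  private var currentOffset: Long = -1L
  private var lastOffsetCommitted: Long = -1L

  /** Called by the reader thread for every received record. */
  def append(record: T): Unit = synchronized {
    currentOffset += 1
    batches += record
  }

  /** Mirrors getOffset: None until the first record arrives. */
  def getOffset: Option[Long] = synchronized {
    if (currentOffset == -1L) None else Some(currentOffset)
  }

  /** Mirrors getBatch: returns the records with offsets in (start, end]. */
  def getBatch(start: Option[Long], end: Long): List[T] = synchronized {
    val startOrdinal = start.getOrElse(-1L).toInt + 1
    val endOrdinal = end.toInt + 1
    // The buffer only holds records after lastOffsetCommitted, so shift the indices.
    val sliceStart = startOrdinal - lastOffsetCommitted.toInt - 1
    val sliceEnd = endOrdinal - lastOffsetCommitted.toInt - 1
    batches.slice(sliceStart, sliceEnd).toList
  }

  /** Mirrors commit: drops all records up to and including offset `end`. */
  def commit(end: Long): Unit = synchronized {
    val offsetDiff = (end - lastOffsetCommitted).toInt
    require(offsetDiff >= 0,
      s"Offsets committed out of order: $lastOffsetCommitted followed by $end")
    batches.trimStart(offsetDiff)
    lastOffsetCommitted = end
  }
}
```

For example, after appending three records (offsets 0 to 2), getBatch(None, 1) returns the first two. After commit(1) the buffer keeps only the third record, and getBatch(Some(1), 2) still returns it because the slice indices are shifted by the committed offset.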
4.3 Registering the Provider
The Structured Streaming engine looks up the provider for a data source by the name passed to format(); the lookup happens in DataSource.lookupDataSource(). The Provider class implemented above therefore has to be registered with the engine by adding its fully qualified class name to the file \spark-2.2.0\sql\core\src\main\resources\META-INF\services\org.apache.spark.sql.sources.DataSourceRegister in the Spark SQL project.
Append the fully qualified name of the custom implementation class as the last line of that file, as shown below:
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
org.apache.spark.sql.execution.datasources.json.JsonFileFormat
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
org.apache.spark.sql.execution.datasources.text.TextFileFormat
org.apache.spark.sql.execution.streaming.ConsoleSinkProvider
org.apache.spark.sql.execution.streaming.TextSocketSourceProvider
org.apache.spark.sql.execution.streaming.RateSourceProvider
org.apache.spark.sql.execution.streaming.TextRabbitMQSourceProvider
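To make the role of this file concrete: the classes listed in it are loaded as registered providers, and the string passed to format() is compared with each provider's shortName(). The sketch below models only that matching step in plain Scala. `Register` is a hypothetical stand-in for DataSourceRegister, the case-insensitive comparison reflects my understanding of Spark's lookup, and the fallback that real Spark performs when no short name matches (treating the string as a class name) is deliberately left out.

```scala
// Hypothetical stand-in for org.apache.spark.sql.sources.DataSourceRegister.
trait Register { def shortName(): String }

// Simplified model of resolving a format() argument to a registered provider.
object LookupSketch {
  /** Returns the single provider whose short name matches `format`, ignoring case. */
  def lookup(format: String, registered: Seq[Register]): Either[String, Register] =
    registered.filter(_.shortName().equalsIgnoreCase(format)).toList match {
      case single :: Nil => Right(single)
      case Nil           => Left(s"Failed to find data source: $format")
      case _             => Left(s"Multiple sources found for $format")
    }
}
```

In this model, a provider whose shortName() returns "RabbitMQ" is found for format("RabbitMQ") as well as format("rabbitmq"), while an unregistered name or a duplicated short name yields an error.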
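4.4 Using the API
After recompiling the Spark SQL source, put the resulting jar into Spark's jars directory. The custom "RabbitMQ" input source can then be used just like the input sources that ship with Structured Streaming: pass the string "RabbitMQ" to format(). It is used exactly like the "socket" source, because the implementation above is in fact the socket source's code.
5. References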
[1]. Structured Streaming Programming Guide.