PySpark Structured Streaming Programming Guide
Source: Internet · Editor: 程序博客网 · Date: 2024/06/07 23:10
References:
1、http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
2、https://github.com/apache/spark/tree/v2.2.0
Structured Streaming Programming Guide
- Overview
- Quick Example
- Programming Model
- Basic Concepts
- Handling Event-time and Late Data
- Fault Tolerance Semantics
- API using Datasets and DataFrames
- Creating streaming DataFrames and streaming Datasets
- Input Sources
- Schema inference and partition of streaming DataFrames/Datasets
- Operations on streaming DataFrames/Datasets
- Basic Operations - Selection, Projection, Aggregation
- Window Operations on Event Time
- Handling Late Data and Watermarking
- Join Operations
- Streaming Deduplication
- Arbitrary Stateful Operations
- Unsupported Operations
- Starting Streaming Queries
- Output Modes
- Output Sinks
- Using Foreach
- Managing Streaming Queries
- Monitoring Streaming Queries
- Interactive APIs
- Asynchronous API
- Recovering from Failures with Checkpointing
- Where to go from here
Quick Example
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

# Start running the query that prints the running counts to the console
query = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()
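Before starting the query, the guide has you run a Netcat server in a separate terminal (`nc -lk 9999`) so the socket source has something to read. The running word count that the query maintains in "complete" output mode can be sketched in plain Python, without Spark; the `batches` sample input below is hypothetical, not from the guide:

```python
from collections import Counter

# Hypothetical micro-batches of lines arriving on the socket, one list per trigger.
batches = [
    ["apache spark", "apache hadoop"],
    ["spark streaming"],
]

# In "complete" output mode the sink receives the full updated counts each trigger.
counts = Counter()
for batch in batches:
    for line in batch:
        counts.update(line.split(" "))  # explode(split(value, " "))
    print(dict(counts))  # the console sink would print the whole result table
```

The key point is that state (the counts) carries over across triggers; each micro-batch only adds to it.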
Creating streaming DataFrames and streaming Datasets
spark = SparkSession. ...

# Read text from socket
socketDF = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

socketDF.isStreaming  # Returns True for DataFrames that have streaming sources (a property in PySpark, not a method)

socketDF.printSchema()

# Read all the csv files written atomically in a directory
from pyspark.sql.types import StructType

userSchema = StructType().add("name", "string").add("age", "integer")
csvDF = spark \
    .readStream \
    .option("sep", ";") \
    .schema(userSchema) \
    .csv("/path/to/directory")  # Equivalent to format("csv").load("/path/to/directory")
Basic Operations - Selection, Projection, Aggregation
df = ...  # streaming DataFrame with IoT device data with schema { device: string, deviceType: string, signal: double, time: DateType }

# Select the devices which have signal more than 10
df.select("device").where("signal > 10")

# Running count of the number of updates for each device type
df.groupBy("deviceType").count()
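The same selection and aggregation semantics can be illustrated on plain Python dictionaries; the sample rows below are hypothetical, chosen only to match the schema above:

```python
# Hypothetical device readings matching the schema above.
rows = [
    {"device": "d1", "deviceType": "sensor", "signal": 12.0},
    {"device": "d2", "deviceType": "sensor", "signal": 5.0},
    {"device": "d3", "deviceType": "camera", "signal": 30.0},
]

# df.select("device").where("signal > 10")
strong = [r["device"] for r in rows if r["signal"] > 10]

# df.groupBy("deviceType").count()
by_type = {}
for r in rows:
    by_type[r["deviceType"]] = by_type.get(r["deviceType"], 0) + 1
```

On a stream, `groupBy(...).count()` is a running aggregation: each micro-batch updates these counts rather than recomputing them from scratch.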
Window Operations on Event Time
from pyspark.sql.functions import window

words = ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }

# Group the data by window and word and compute the count of each group
windowedCounts = words.groupBy(
    window(words.timestamp, "10 minutes", "5 minutes"),
    words.word
).count()
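With a 10-minute window sliding every 5 minutes, each event time falls into two overlapping windows. A minimal sketch of that assignment, with window boundaries expressed as minutes since some epoch (the helper name is ours, not Spark's):

```python
def assign_windows(event_min, length=10, slide=5):
    """Return (start, end) pairs of every sliding window containing event_min."""
    # Earliest window that can still contain the event.
    first_start = (event_min // slide) * slide - (length - slide)
    return [
        (start, start + length)
        for start in range(first_start, event_min + 1, slide)
        if start <= event_min < start + length
    ]
```

For example, an event at minute 12 lands in the windows 5-15 and 10-20, so its word is counted in both groups.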
Handling Late Data and Watermarking
from pyspark.sql.functions import window

words = ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }

# Group the data by window and word and compute the count of each group
windowedCounts = words \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(
        window(words.timestamp, "10 minutes", "5 minutes"),
        words.word) \
    .count()
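The effect of the watermark can be sketched in plain Python: the engine tracks the maximum event time seen so far, subtracts the delay threshold, and stops updating windows that end at or before that watermark. Times are minutes as integers and the helper name is ours; this is a simplification of the real engine:

```python
def windowed_counts_with_watermark(events, delay=10, length=10, slide=5):
    """events: (event_minute, word) pairs in arrival order.
    Returns sliding-window counts, dropping updates that arrive too late."""
    max_event_time = float("-inf")
    counts = {}
    for t, word in events:
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - delay
        # Sliding windows containing t (same assignment logic as above).
        first_start = (t // slide) * slide - (length - slide)
        for start in range(first_start, t + 1, slide):
            end = start + length
            if end > watermark:  # window still open: update its state
                key = (start, end, word)
                counts[key] = counts.get(key, 0) + 1
            # else: too late; the engine has already dropped this window's state
    return counts
```

With a 10-minute delay, an event at minute 10 arriving after the watermark has advanced past minute 15 still updates window 10-20 but no longer updates window 5-15.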
Join Operations
staticDf = spark.read. ...
streamingDf = spark.readStream. ...

streamingDf.join(staticDf, "type")                 # inner equi-join with a static DF
streamingDf.join(staticDf, "type", "right_outer")  # right outer join with a static DF
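Per micro-batch, a stream-static inner equi-join behaves like a lookup of each streaming row against the static table. A plain-Python sketch of that semantics (the sample data is hypothetical):

```python
# Static side, read once: join key "type" -> extra attributes.
static_rows = {"sensor": {"vendor": "acme"}, "camera": {"vendor": "apex"}}

# One micro-batch of streaming rows.
batch = [{"id": 1, "type": "sensor"}, {"id": 2, "type": "drone"}]

# Inner equi-join on "type": streaming rows with no static match are dropped.
joined = [
    {**row, **static_rows[row["type"]]}
    for row in batch
    if row["type"] in static_rows
]
```

A right outer join would instead keep every static row, padding the streaming side with nulls where no batch row matches.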
Streaming Deduplication
streamingDf = spark.readStream. ...

# Without watermark using guid column
streamingDf.dropDuplicates(["guid"])

# With watermark using guid and eventTime columns
streamingDf \
    .withWatermark("eventTime", "10 seconds") \
    .dropDuplicates(["guid", "eventTime"])
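Deduplication keeps a store of keys already seen; with a watermark, entries whose event time falls behind the watermark can be evicted from that state, bounding its size. A minimal sketch in plain Python (times as integers, names ours):

```python
def dedupe(events, delay=10):
    """events: (event_time, guid) pairs in arrival order.
    Emit the first occurrence of each (guid, event_time) key;
    evict state, and drop arrivals, behind the watermark."""
    seen = {}                      # (guid, event_time) -> already emitted
    max_event_time = float("-inf")
    out = []
    for t, guid in events:
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - delay
        key = (guid, t)
        if t >= watermark and key not in seen:
            out.append(key)
        seen[key] = True
        # Drop state that can no longer match any future duplicate.
        seen = {k: v for k, v in seen.items() if k[1] >= watermark}
    return out
```

Without the watermark the `seen` store would grow without bound, which is why the guide recommends pairing `dropDuplicates` with `withWatermark` on streams.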