PySpark Structured Streaming Programming Guide


References:

1、http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

2、https://github.com/apache/spark/tree/v2.2.0



Structured Streaming Programming Guide

  • Overview
  • Quick Example
  • Programming Model
    • Basic Concepts
    • Handling Event-time and Late Data
    • Fault Tolerance Semantics
  • API using Datasets and DataFrames
    • Creating streaming DataFrames and streaming Datasets
      • Input Sources
      • Schema inference and partition of streaming DataFrames/Datasets
    • Operations on streaming DataFrames/Datasets
      • Basic Operations - Selection, Projection, Aggregation
      • Window Operations on Event Time
      • Handling Late Data and Watermarking
      • Join Operations
      • Streaming Deduplication
      • Arbitrary Stateful Operations
      • Unsupported Operations
    • Starting Streaming Queries
      • Output Modes
      • Output Sinks
      • Using Foreach
    • Managing Streaming Queries
    • Monitoring Streaming Queries
      • Interactive APIs
      • Asynchronous API
    • Recovering from Failures with Checkpointing
  • Where to go from here


Quick Example

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

# Start running the query that prints the running counts to the console
query = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()
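
To try the example, the upstream guide suggests first running Netcat as a simple data server with nc -lk 9999, then launching the script above with spark-submit in a separate terminal; each line typed into the Netcat session is split into words and the running counts are printed to the console.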


Creating streaming DataFrames and streaming Datasets

from pyspark.sql.types import StructType

spark = SparkSession. ...

# Read text from socket
socketDF = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

socketDF.isStreaming    # Returns True for DataFrames that have streaming sources (a property, not a method)

socketDF.printSchema()

# Read all the csv files written atomically in a directory
userSchema = StructType().add("name", "string").add("age", "integer")
csvDF = spark \
    .readStream \
    .option("sep", ";") \
    .schema(userSchema) \
    .csv("/path/to/directory")  # Equivalent to format("csv").load("/path/to/directory")
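
The Input Sources section of the guide also covers a Kafka source. As a rough sketch, assuming the spark-sql-kafka-0-10 package is on the classpath and that the broker address and topic name below are placeholders:

# Minimal sketch of a Kafka source; broker address and topic are placeholders
kafkaDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:9092") \
    .option("subscribe", "topic1") \
    .load()

# Kafka records arrive as binary key/value columns; cast them to strings before use
kafkaDF.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")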


Basic Operations - Selection, Projection, Aggregation

df = ...  # streaming DataFrame with IoT device data with schema { device: string, deviceType: string, signal: double, time: DateType }

# Select the devices which have signal more than 10
df.select("device").where("signal > 10")

# Running count of the number of updates for each device type
df.groupBy("deviceType").count()
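
The guide also notes that a streaming DataFrame can be registered as a temporary view and queried with SQL; the result is again a streaming DataFrame. A minimal sketch (the view name "updates" is arbitrary):

# Register the streaming DataFrame as a temp view and query it with SQL;
# the returned DataFrame is itself streaming
df.createOrReplaceTempView("updates")
updatesCountDF = spark.sql("select count(*) from updates")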


Window Operations on Event Time

from pyspark.sql.functions import window

words = ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }

# Group the data by window and word and compute the count of each group
windowedCounts = words.groupBy(
    window(words.timestamp, "10 minutes", "5 minutes"),
    words.word
).count()


Handling Late Data and Watermarking

from pyspark.sql.functions import window

words = ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }

# Group the data by window and word and compute the count of each group
windowedCounts = words \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(
        window(words.timestamp, "10 minutes", "5 minutes"),
        words.word) \
    .count()
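
Once a watermark is defined, the windowed aggregation above can also be run in append output mode, where a window's count is emitted only after the watermark moves past the end of the window. A minimal sketch, with the console sink chosen purely for illustration:

# Start the watermarked windowed aggregation in append mode
query = windowedCounts \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start()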


Join Operations

staticDf = spark.read. ...
streamingDf = spark.readStream. ...

streamingDf.join(staticDf, "type")                 # inner equi-join with a static DF
streamingDf.join(staticDf, "type", "right_outer")  # right outer join with a static DF


Streaming Deduplication

streamingDf = spark.readStream. ...

# Without watermark, using guid column
streamingDf.dropDuplicates(["guid"])

# With watermark, using guid and eventTime columns
streamingDf \
    .withWatermark("eventTime", "10 seconds") \
    .dropDuplicates(["guid", "eventTime"])
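

Recovering from Failures with Checkpointing

The table of contents lists sections on starting, managing and recovering streaming queries for which no snippet was carried over. As a rough sketch of the checkpointing pattern from the upstream guide (aggDF, the query name and the checkpoint path below are placeholders):

# Restart-safe query: progress and state are written to an HDFS-compatible
# checkpoint directory so the query can resume after a failure
aggDF = df.groupBy("deviceType").count()

query = aggDF \
    .writeStream \
    .outputMode("complete") \
    .option("checkpointLocation", "path/to/HDFS/dir") \
    .format("memory") \
    .queryName("aggregates") \
    .start()

query.lastProgress    # most recent progress update of this streaming query
query.stop()          # stop the query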
