PySpark Structured Streaming Programming Guide


References:

1、http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

2、https://github.com/apache/spark/tree/v2.2.0



Structured Streaming Programming Guide

  • Overview
  • Quick Example
  • Programming Model
    • Basic Concepts
    • Handling Event-time and Late Data
    • Fault Tolerance Semantics
  • API using Datasets and DataFrames
    • Creating streaming DataFrames and streaming Datasets
      • Input Sources
      • Schema inference and partition of streaming DataFrames/Datasets
    • Operations on streaming DataFrames/Datasets
      • Basic Operations - Selection, Projection, Aggregation
      • Window Operations on Event Time
      • Handling Late Data and Watermarking
      • Join Operations
      • Streaming Deduplication
      • Arbitrary Stateful Operations
      • Unsupported Operations
    • Starting Streaming Queries
      • Output Modes
      • Output Sinks
      • Using Foreach
    • Managing Streaming Queries
    • Monitoring Streaming Queries
      • Interactive APIs
      • Asynchronous API
    • Recovering from Failures with Checkpointing
  • Where to go from here


Quick Example

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

# Start running the query that prints the running counts to the console
query = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()
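
To try the example, the upstream guide suggests first running Netcat as a simple data server with nc -lk 9999, then launching the script above with spark-submit in a separate terminal; each line typed into the Netcat session is split into words and the running counts are printed to the console.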


Creating streaming DataFrames and streaming Datasets

from pyspark.sql.types import StructType

spark = SparkSession. ...

# Read text from socket
socketDF = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

socketDF.isStreaming    # Returns True for DataFrames that have streaming sources (a property, not a method)

socketDF.printSchema()

# Read all the csv files written atomically in a directory
userSchema = StructType().add("name", "string").add("age", "integer")
csvDF = spark \
    .readStream \
    .option("sep", ";") \
    .schema(userSchema) \
    .csv("/path/to/directory")  # Equivalent to format("csv").load("/path/to/directory")
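
The Input Sources section of the guide also covers a Kafka source. As a rough sketch, assuming the spark-sql-kafka-0-10 package is on the classpath and that the broker address and topic name below are placeholders:

# Minimal sketch of a Kafka source; broker address and topic are placeholders
kafkaDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:9092") \
    .option("subscribe", "topic1") \
    .load()

# Kafka records arrive as binary key/value columns; cast them to strings before use
kafkaDF.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")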


Basic Operations - Selection, Projection, Aggregation

df = ...  # streaming DataFrame with IoT device data with schema { device: string, deviceType: string, signal: double, time: DateType }

# Select the devices which have signal more than 10
df.select("device").where("signal > 10")

# Running count of the number of updates for each device type
df.groupBy("deviceType").count()
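
The guide also notes that a streaming DataFrame can be registered as a temporary view and queried with SQL; the result is again a streaming DataFrame. A minimal sketch (the view name "updates" is arbitrary):

# Register the streaming DataFrame as a temp view and query it with SQL;
# the returned DataFrame is itself streaming
df.createOrReplaceTempView("updates")
updatesCountDF = spark.sql("select count(*) from updates")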


Window Operations on Event Time

from pyspark.sql.functions import window

words = ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }

# Group the data by window and word and compute the count of each group
windowedCounts = words.groupBy(
    window(words.timestamp, "10 minutes", "5 minutes"),
    words.word
).count()


Handling Late Data and Watermarking

from pyspark.sql.functions import window

words = ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }

# Group the data by window and word and compute the count of each group
windowedCounts = words \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(
        window(words.timestamp, "10 minutes", "5 minutes"),
        words.word) \
    .count()
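
Once a watermark is defined, the windowed aggregation above can also be run in append output mode, where a window's count is emitted only after the watermark moves past the end of the window. A minimal sketch, with the console sink chosen purely for illustration:

# Start the watermarked windowed aggregation in append mode
query = windowedCounts \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start()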


Join Operations

staticDf = spark.read. ...
streamingDf = spark.readStream. ...

streamingDf.join(staticDf, "type")                 # inner equi-join with a static DF
streamingDf.join(staticDf, "type", "right_outer")  # right outer join with a static DF


Streaming Deduplication

streamingDf = spark.readStream. ...

# Without watermark, using guid column
streamingDf.dropDuplicates(["guid"])

# With watermark, using guid and eventTime columns
streamingDf \
    .withWatermark("eventTime", "10 seconds") \
    .dropDuplicates(["guid", "eventTime"])
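

Recovering from Failures with Checkpointing

The table of contents lists sections on starting, managing and recovering streaming queries for which no snippet was carried over. As a rough sketch of the checkpointing pattern from the upstream guide (aggDF, the query name and the checkpoint path below are placeholders):

# Restart-safe query: progress and state are written to an HDFS-compatible
# checkpoint directory so the query can resume after a failure
aggDF = df.groupBy("deviceType").count()

query = aggDF \
    .writeStream \
    .outputMode("complete") \
    .option("checkpointLocation", "path/to/HDFS/dir") \
    .format("memory") \
    .queryName("aggregates") \
    .start()

query.lastProgress    # most recent progress update of this streaming query
query.stop()          # stop the query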
