Since Spark v3.2, session windows have been natively supported by Spark Structured Streaming. Session-window-based aggregation is a common requirement in streaming data processing, especially in use cases such as user behaviour analytics. In this blog post, I will discuss how session windows work under the hood in Spark Structured Streaming. Compared to the … Continue reading Spark Structured Streaming Deep Dive (8) – Session Window
Tag: Streaming
Spark Structured Streaming Deep Dive (7) – Stream-Stream Join
This blog post discusses another stateful operation supported by Spark Structured Streaming, the Stream-Stream Join, which joins two streaming datasets. Unlike a join of static datasets, for the rows arriving on one side of the input streams in a micro-batch, the matching rows will very likely not have been received on the other side of the input streams at … Continue reading Spark Structured Streaming Deep Dive (7) – Stream-Stream Join
Spark Structured Streaming Deep Dive (6) – Stateful Operations
There are two types of stream processing modes, stateless and stateful. Stateless processing is easy to understand: each message is processed independently, without the need to maintain state across multiple messages. The challenging and fun one is stateful stream processing, where the processing of a message depends on the result of the processing … Continue reading Spark Structured Streaming Deep Dive (6) – Stateful Operations
Spark Structured Streaming Deep Dive (5) – IncrementalExecution
Spark Structured Streaming reuses the Spark SQL execution engine, including the analyser, optimiser, planner, and runtime code generator. QueryExecution is the core component of the Spark SQL execution engine, which manages the primary workflow of a relational query execution using Spark. IncrementalExecution is a variant of QueryExecution that supports the execution of a logical plan … Continue reading Spark Structured Streaming Deep Dive (5) – IncrementalExecution
Spark Structured Streaming Deep Dive (4) – Azure Event Hub Integration
This blog post dives deep into the Azure Event Hubs Connector for Apache Spark, the open-source streaming data source connector for integrating Azure Event Hubs with Spark Structured Streaming. The Azure Event Hubs Connector implements the Source and Sink traits with the EventHubSource and the EventHubSink for receiving streaming data from or writing streaming data … Continue reading Spark Structured Streaming Deep Dive (4) – Azure Event Hub Integration
Spark Structured Streaming Deep Dive (3) – Sink
This blog post discusses another main component in the Spark Structured Streaming framework, the Sink. As the KafkaSink will be covered when discussing the Spark-Kafka integration, this blog post will focus on ForeachBatchSink, ForeachWriterTable, FileStreamSink and DeltaSink. Spark Structured Streaming defines the Sink trait representing the interface for external storage systems which can collect the results … Continue reading Spark Structured Streaming Deep Dive (3) – Sink
Spark Structured Streaming Deep Dive (2) – Source
As mentioned in the previous blog post discussing the execution flow of Spark Structured Streaming queries, the Spark Structured Streaming framework consists of three main components: Source, StreamExecution, and Sink. The source interfaces defined by the Spark Structured Streaming framework abstract the input data stream from the external streaming data sources and standardise the interaction patterns … Continue reading Spark Structured Streaming Deep Dive (2) – Source
Spark Structured Streaming Deep Dive (1) – Execution Flow
Starting with this blog post, I will be writing about stream processing, focusing on Spark Structured Streaming, Kafka, Flink and the Kappa architecture. This is the first blog post of the Spark Structured Streaming deep dive series. It digs into the underlying, end-to-end execution flow of Spark streaming queries. Firstly, let's have a look … Continue reading Spark Structured Streaming Deep Dive (1) – Execution Flow







