This blog post discusses another main component in the Spark Structured Streaming framework, Sink. As the KafkaSink will be covered when discussing the Spark-Kafka integration, this blog post will focus on ForeachBatchSink, ForeachWriterTable, FileStreamSink and DeltaSink. Spark Structured Streaming defines the Sink trait representing the interface for external storage systems which can collect the results … Continue reading Spark Structured Streaming Deep Dive (3) – Sink
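As a quick, hedged illustration of how the ForeachBatchSink mentioned above is typically driven from the user side (the DataFrame name, output path and checkpoint location below are illustrative, not from the post):

```scala
import org.apache.spark.sql.DataFrame

// Sketch: each micro-batch is handed to the user function by ForeachBatchSink.
// `streamingDf` is an assumed streaming DataFrame; paths are placeholders.
val query = streamingDf.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // Any batch writer can be reused inside the function
    batchDf.write.format("parquet").mode("append").save("/tmp/out")
  }
  .option("checkpointLocation", "/tmp/checkpoints")
  .start()
```

This is the public `foreachBatch` API; the post itself digs into the Sink implementation underneath it.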
Spark Structured Streaming Deep Dive (2) – Source
As mentioned in the last blog discussing the execution flow of Spark Structured Streaming queries, the Spark Structured Streaming framework consists of three main components, Source, StreamExecution, and Sink. The source interfaces defined by the Spark Structured Streaming framework abstract the input data stream from the external streaming data sources and standardise the interaction patterns … Continue reading Spark Structured Streaming Deep Dive (2) – Source
Spark Structured Streaming Deep Dive (1) – Execution Flow
From this blog post, I am starting to write about streaming processing, focusing on Spark Structured Streaming, Kafka, Flink and Kappa architecture. This is the first blog post of the Spark Structured Streaming deep dive series. This blog post digs into the underlying, end-to-end execution flow of Spark streaming queries. Firstly, let's have a look … Continue reading Spark Structured Streaming Deep Dive (1) – Execution Flow
Spark SQL Query Engine Deep Dive (20) – Adaptive Query Execution (Part 2)
In the previous blog post, we looked into how the Adaptive Query Execution (AQE) framework is implemented in Spark SQL. This blog post introduces the two core AQE optimizer rules, the CoalesceShufflePartitions rule and the OptimizeSkewedJoin rule, and how they are implemented under the hood. I will not repeat what I have covered in the previous … Continue reading Spark SQL Query Engine Deep Dive (20) – Adaptive Query Execution (Part 2)
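For reference, the two AQE rules covered in this post are gated by standard Spark 3.x configuration flags; a minimal sketch of turning them on (the advisory size value is just an example):

```scala
// Sketch: enabling AQE and the two optimizer rules discussed above.
spark.conf.set("spark.sql.adaptive.enabled", "true")
// CoalesceShufflePartitions: merge small post-shuffle partitions
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
// OptimizeSkewedJoin: split oversized skewed partitions at join time
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
// Target size AQE aims for when coalescing (example value)
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")
```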
Spark SQL Query Engine Deep Dive (19) – Adaptive Query Execution (Part 1)
Cost-based optimisation (CBO) is not a new thing. It has been widely used in the RDBMS world for many years. However, the use of CBO in a distributed system with separated storage and computing, such as Spark, is an "extremely complex problem" (as the Spark developers at Databricks put it). It is challenging and expensive to collect and maintain a … Continue reading Spark SQL Query Engine Deep Dive (19) – Adaptive Query Execution (Part 1)
Spark SQL Query Engine Deep Dive (18) – Partitioning & Bucketing
I was planning to write about the Adaptive Query Execution (AQE) in this and the next few blog posts, and then end my Spark SQL deep dive series there and move on to another topic, either Spark Core or Pulsar. However, I realised that I haven't covered the mechanism of partitioning and bucketing on file systems, … Continue reading Spark SQL Query Engine Deep Dive (18) – Partitioning & Bucketing
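As a hedged sketch of the two file-layout mechanisms this post covers (the DataFrame, table and column names are illustrative):

```scala
// Partitioning lays data out as date=.../ directories on the file system,
// enabling partition pruning at read time:
salesDf.write.partitionBy("date").parquet("/data/sales")

// Bucketing hashes rows into a fixed number of files per partition; it
// requires saveAsTable because bucket metadata lives in the metastore:
salesDf.write
  .bucketBy(8, "customer_id")
  .sortBy("customer_id")
  .saveAsTable("sales_bucketed")
```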
Spark SQL Query Engine Deep Dive (17) – Dynamic Partition Pruning
In this blog post, I will explain the Dynamic Partition Pruning (DPP), which is a performance optimisation feature introduced in Spark 3.0 along with the Adaptive Query Execution optimisation techniques (which I plan to cover in the next few blog posts). At the core, the Dynamic Partition Pruning is a type of predicate … Continue reading Spark SQL Query Engine Deep Dive (17) – Dynamic Partition Pruning
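For a feel of the query shape where DPP applies, here is a hedged sketch (the `sales` and `dates` tables are illustrative; `sales` is assumed to be partitioned by `date`). The filter on the dimension side is translated at runtime into a filter on the fact table's partition column, so unneeded partitions are never scanned:

```scala
// Sketch: a star-schema join eligible for Dynamic Partition Pruning.
val result = spark.sql("""
  SELECT s.amount, d.date
  FROM sales s
  JOIN dates d ON s.date = d.date
  WHERE d.is_holiday = true
""")
result.explain()  // the plan should show a dynamicpruning subquery on s.date
```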
Spark SQL Query Engine Deep Dive (16) – ShuffleExchangeExec & UnsafeShuffleWriter
This blog post continues to discuss the partitioning and ordering in Spark. In the last blog post, I explained the SortExec operator and the underlying UnsafeExternalSorter for ordering. This blog post focuses on the ShuffleExchangeExec operator and the Tungsten-supported UnsafeShuffleWriter for partitioning. As explained in the previous blog posts, when the output partitioning of … Continue reading Spark SQL Query Engine Deep Dive (16) – ShuffleExchangeExec & UnsafeShuffleWriter
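As a small, hedged illustration of where these operators surface in a plan (DataFrame and column names are illustrative), user-level repartitioning and sorting calls map to the operators the two posts above dissect:

```scala
// Sketch: operations that typically produce these physical operators.
val shuffled = df.repartition(200, col("key"))   // ShuffleExchangeExec
val sorted   = df.sortWithinPartitions("key")    // SortExec, per partition
sorted.explain()  // inspect the physical plan for the operators
```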
Spark SQL Query Engine Deep Dive (15) – UnsafeExternalSorter & SortExec
In the last blog post, I explained the partitioning and ordering requirements for preparing a physical operator to execute. In this and the next blog post, I look into the primary physical operators for implementing partitioning and ordering. This blog post focuses on the SortExec operator and the UnsafeExternalSorter for ordering, while the next blog … Continue reading Spark SQL Query Engine Deep Dive (15) – UnsafeExternalSorter & SortExec
Spark SQL Query Engine Deep Dive (14) – Partitioning & Ordering
In the last few blog posts, I introduced the SparkPlanner for generating physical plans from logical plans and looked into the details of the aggregation, join and runnable command execution strategies. When a physical plan is selected as the "best" plan by a cost model (not implemented in Spark 3.0 yet though) or other approaches, … Continue reading Spark SQL Query Engine Deep Dive (14) – Partitioning & Ordering