Tag: Databricks

Spark SQL Query Engine Deep Dive (18) -Partitioning & Bucketing

Spark SQL Query Engine Deep Dive (18) -Partitioning & Bucketing

I was planning to write about the Adaptive Query Execution (AQE) in this and next few blog posts, and then end my Spark SQL deep dive series there and move on to another topic, either Spark Core or Pulsar. However, I realised that I haven't covered the mechanism of partitioning and bucketing on file systems, … Continue reading Spark SQL Query Engine Deep Dive (18) -Partitioning & Bucketing

Spark SQL Query Engine Deep Dive (17) – Dynamic Partition Pruning

In this blog post, I will explain the Dynamic Partition Pruning (DPP), which is a performance optimisation feature introduced in Spark 3.0 along with the Adaptive Query Execution optimisation techniques (which I plan to cover in the next few of the blog posts). At the core, the Dynamic Partition Pruning is a type of predicate … Continue reading Spark SQL Query Engine Deep Dive (17) – Dynamic Partition Pruning

Spark SQL Query Engine Deep Dive (16) – ShuffleExchangeExec & UnsafeShuffleWriter

Spark SQL Query Engine Deep Dive (16) – ShuffleExchangeExec & UnsafeShuffleWriter

This blog post continues to discuss the partitioning and ordering in Spark. In the last blog post, I explain the SortExec operator and the underlying UnsafeExternalSorter for ordering. This blog post focuses on the ShuffleExcahngeExec operator and the Tungsten supported shuffleExternalWriter for partitioning. As explained in the previous blog posts, when the output partitioning of … Continue reading Spark SQL Query Engine Deep Dive (16) – ShuffleExchangeExec & UnsafeShuffleWriter

Spark SQL Query Engine Deep Dive (15) –  UnsafeExternalSorter & SortExec

Spark SQL Query Engine Deep Dive (15) – UnsafeExternalSorter & SortExec

In the last blog post, I explained the partitioning and ordering requirements for preparing a physical operator to execute. In this and the next blog post, I look into the primary physical operators for implementing partitioning and ordering. This blog post focuses on the SortExec operator and the UnsafeExternalSorter for ordering, while the next blog … Continue reading Spark SQL Query Engine Deep Dive (15) – UnsafeExternalSorter & SortExec

Spark SQL Query Engine Deep Dive (14) – Partitioning & Ordering

Spark SQL Query Engine Deep Dive (14) – Partitioning & Ordering

In the last few blog posts, I introduced the SparkPlanner for generating physical plans from logical plans and looked into the details of the aggregation, join and runnable command execution strategies. When a physical plan is selected as the "best" plan by a cost model (not implemented in Spark 3.0 yet though) or other approaches, … Continue reading Spark SQL Query Engine Deep Dive (14) – Partitioning & Ordering

Spark SQL Query Engine Deep Dive (13) – Cache Commands Internal

Spark SQL Query Engine Deep Dive (13) – Cache Commands Internal

This blog post looks into Spark SQL Cache Commands under the hood, walking through the execution flows of the persisting and unpersisting operations, from the physical execution plan to the cache block storages. Cache Commands Spark SQL ships with three runnable commands for caching operations, including CacheTableCommand, UncacheTableCommand, and ClearCacheCommand. End-user developers or analysts can … Continue reading Spark SQL Query Engine Deep Dive (13) – Cache Commands Internal

Spark SQL Query Engine Deep Dive (12) – SessionCatalog & RunnableCommand Internal

Spark SQL Query Engine Deep Dive (12) – SessionCatalog & RunnableCommand Internal

In this blog posts, I will dig into the execution internals of the runnable commands, which inherit from the RunnableCommand parent class. The runnable commands are normally the commands for interacting with the Spark session catalog and managing the metadata. Unlike the data query alike operations which are distributed and lazily executed in Spark, the … Continue reading Spark SQL Query Engine Deep Dive (12) – SessionCatalog & RunnableCommand Internal

Spark SQL Query Engine Deep Dive (11) – Join Strategies

Spark SQL Query Engine Deep Dive (11) – Join Strategies

In this blog post, I am going to explain the Join strategies applied by the Spark Planner for generating physical Join plans. Based on the Join strategy selection logic implemented in the JoinSelection object (core object for planning physical join operations), I draw the following decision tree chart. Spark SQL ships five built-in Join physical … Continue reading Spark SQL Query Engine Deep Dive (11) – Join Strategies

Spark SQL Query Engine Deep Dive (10) – HashAggregateExec & ObjectHashAggregateExec

Spark SQL Query Engine Deep Dive (10) – HashAggregateExec & ObjectHashAggregateExec

This blog post continues to explore the Aggregate strategy and focuses on the two hash-based aggregation operators provided by Spark SQL, HashAggregateExec and ObjectHashAggregateExec. Hash-based aggregation is the preferred approach to sort-based aggregation which was explained in the last blog post. Compared to the sort-based aggregation, the hash-based aggregation does not need the extra sorting … Continue reading Spark SQL Query Engine Deep Dive (10) – HashAggregateExec & ObjectHashAggregateExec

Spark SQL Query Engine Deep Dive (9) – SortAggregateExec

Spark SQL Query Engine Deep Dive (9) – SortAggregateExec

The last blog post explains the Aggregation strategy for generating physical plans for aggregate operations. I will continue with this topic to look into the details of the physical aggregate operators supported by Spark SQL. As explained in the last blog post, a logical aggregate operator can be transformed into a physical plan consisting of … Continue reading Spark SQL Query Engine Deep Dive (9) – SortAggregateExec