Author: Linxiao Ma

Spark SQL Query Engine Deep Dive (13) – Cache Commands Internal

Spark SQL Query Engine Deep Dive (13) – Cache Commands Internal

This blog post looks into Spark SQL Cache Commands under the hood, walking through the execution flows of the persisting and unpersisting operations, from the physical execution plan to the cache block storages. Cache Commands Spark SQL ships with three runnable commands for caching operations, including CacheTableCommand, UncacheTableCommand, and ClearCacheCommand. End-user developers or analysts can … Continue reading Spark SQL Query Engine Deep Dive (13) – Cache Commands Internal

Spark SQL Query Engine Deep Dive (12) – SessionCatalog & RunnableCommand Internal

Spark SQL Query Engine Deep Dive (12) – SessionCatalog & RunnableCommand Internal

In this blog posts, I will dig into the execution internals of the runnable commands, which inherit from the RunnableCommand parent class. The runnable commands are normally the commands for interacting with the Spark session catalog and managing the metadata. Unlike the data query alike operations which are distributed and lazily executed in Spark, the … Continue reading Spark SQL Query Engine Deep Dive (12) – SessionCatalog & RunnableCommand Internal

Spark SQL Query Engine Deep Dive (11) – Join Strategies

Spark SQL Query Engine Deep Dive (11) – Join Strategies

In this blog post, I am going to explain the Join strategies applied by the Spark Planner for generating physical Join plans. Based on the Join strategy selection logic implemented in the JoinSelection object (core object for planning physical join operations), I draw the following decision tree chart. Spark SQL ships five built-in Join physical … Continue reading Spark SQL Query Engine Deep Dive (11) – Join Strategies

Spark SQL Query Engine Deep Dive (10) – HashAggregateExec & ObjectHashAggregateExec

Spark SQL Query Engine Deep Dive (10) – HashAggregateExec & ObjectHashAggregateExec

This blog post continues to explore the Aggregate strategy and focuses on the two hash-based aggregation operators provided by Spark SQL, HashAggregateExec and ObjectHashAggregateExec. Hash-based aggregation is the preferred approach to sort-based aggregation which was explained in the last blog post. Compared to the sort-based aggregation, the hash-based aggregation does not need the extra sorting … Continue reading Spark SQL Query Engine Deep Dive (10) – HashAggregateExec & ObjectHashAggregateExec

Spark SQL Query Engine Deep Dive (9) – SortAggregateExec

Spark SQL Query Engine Deep Dive (9) – SortAggregateExec

The last blog post explains the Aggregation strategy for generating physical plans for aggregate operations. I will continue with this topic to look into the details of the physical aggregate operators supported by Spark SQL. As explained in the last blog post, a logical aggregate operator can be transformed into a physical plan consisting of … Continue reading Spark SQL Query Engine Deep Dive (9) – SortAggregateExec

Spark SQL Query Engine Deep Dive (8) – Aggregation Strategy

Spark SQL Query Engine Deep Dive (8) – Aggregation Strategy

In the last blog post, I gave an overview of the SparkPlanner for planning physical execution plans from an optimised logical plan. In the next few blog posts, I will dig a bit deeper into some important strategies. I will start with the "Aggregation" strategy in this blog post. "Aggregation" strategy plans the physical execution … Continue reading Spark SQL Query Engine Deep Dive (8) – Aggregation Strategy

Spark SQL Query Engine Deep Dive (7) – Spark Planner Overview

Spark SQL Query Engine Deep Dive (7) – Spark Planner Overview

After logical plans are optimised by the Catalyst Optimizer rules, SparkPlanner takes an optimized logical plan and converts it into one or more candidate physical plans. The best physical plan is expected to be selected based on a cost model approach, i.e. comparing the CPU and IO costs of the candidate physical plans and selecting … Continue reading Spark SQL Query Engine Deep Dive (7) – Spark Planner Overview

Spark SQL Query Engine Deep Dive (6) – Catalyst Optimizer Rules (Part 3)

Spark SQL Query Engine Deep Dive (6) – Catalyst Optimizer Rules (Part 3)

After two lengthy blog posts on Catalyst Optimizer rules, this blog post will close this topic and cover all the remaining rules. Batch - Operator Optimization (continue) Continue with the "Operator Optimization" batch: CollapseRepartition The CollapseRepartition rule combines adjacent repartition operators. As we know, the repartition function and the coalesce function can be used for … Continue reading Spark SQL Query Engine Deep Dive (6) – Catalyst Optimizer Rules (Part 3)

Spark SQL Query Engine Deep Dive (5) – Catalyst Optimizer Rules (Part 2)

Spark SQL Query Engine Deep Dive (5) – Catalyst Optimizer Rules (Part 2)

In the previous blog post, I covered the rules included in the "Eliminate Distinct", "Finish Analysis", "Union", "OptimizeLimitZero", "LocalRelation early", "Pullup Correlated Expressions", and "Subquery" batches. This blog post carries on to cover the remaining Optimizer batches, including "Replace Operators", "Aggregate", and part of "Operator Optimisation" (the largest batch with 30+ rules). Replace Operators Some … Continue reading Spark SQL Query Engine Deep Dive (5) – Catalyst Optimizer Rules (Part 2)

Spark SQL Query Engine Deep Dive (4) – Catalyst Optimizer Rules (Part 1)

Spark SQL Query Engine Deep Dive (4) – Catalyst Optimizer Rules (Part 1)

Spark SQL Catalyst provides a rule-based optimizer to optimise the resolved logical plan before feeding it into the SparkPlanner to generate the physical plans. An experienced programmer can write more efficient code because they have a set of rules, design patterns and best practices in their brain and can choose a suitable one depending on … Continue reading Spark SQL Query Engine Deep Dive (4) – Catalyst Optimizer Rules (Part 1)