Tag: Spark

Spark SQL Query Engine Deep Dive (8) – Aggregation Strategy

Spark SQL Query Engine Deep Dive (8) – Aggregation Strategy

In the last blog post, I gave an overview of the SparkPlanner for planning physical execution plans from an optimised logical plan. In the next few blog posts, I will dig a bit deeper into some important strategies. I will start with the "Aggregation" strategy in this blog post. "Aggregation" strategy plans the physical execution … Continue reading Spark SQL Query Engine Deep Dive (8) – Aggregation Strategy

Spark SQL Query Engine Deep Dive (7) – Spark Planner Overview

Spark SQL Query Engine Deep Dive (7) – Spark Planner Overview

After logical plans are optimised by the Catalyst Optimizer rules, SparkPlanner takes an optimized logical plan and converts it into one or more candidate physical plans. The best physical plan is expected to be selected based on a cost model approach, i.e. comparing the CPU and IO costs of the candidate physical plans and selecting … Continue reading Spark SQL Query Engine Deep Dive (7) – Spark Planner Overview

Spark SQL Query Engine Deep Dive (6) – Catalyst Optimizer Rules (Part 3)

Spark SQL Query Engine Deep Dive (6) – Catalyst Optimizer Rules (Part 3)

After two lengthy blog posts on Catalyst Optimizer rules, this blog post will close this topic and cover all the remaining rules. Batch - Operator Optimization (continue) Continue with the "Operator Optimization" batch: CollapseRepartition The CollapseRepartition rule combines adjacent repartition operators. As we know, the repartition function and the coalesce function can be used for … Continue reading Spark SQL Query Engine Deep Dive (6) – Catalyst Optimizer Rules (Part 3)

Spark SQL Query Engine Deep Dive (5) – Catalyst Optimizer Rules (Part 2)

Spark SQL Query Engine Deep Dive (5) – Catalyst Optimizer Rules (Part 2)

In the previous blog post, I covered the rules included in the "Eliminate Distinct", "Finish Analysis", "Union", "OptimizeLimitZero", "LocalRelation early", "Pullup Correlated Expressions", and "Subquery" batches. This blog post carries on to cover the remaining Optimizer batches, including "Replace Operators", "Aggregate", and part of "Operator Optimisation" (the largest batch with 30+ rules). Replace Operators Some … Continue reading Spark SQL Query Engine Deep Dive (5) – Catalyst Optimizer Rules (Part 2)

Spark SQL Query Engine Deep Dive (4) – Catalyst Optimizer Rules (Part 1)

Spark SQL Query Engine Deep Dive (4) – Catalyst Optimizer Rules (Part 1)

Spark SQL Catalyst provides a rule-based optimizer to optimise the resolved logical plan before feeding it into the SparkPlanner to generate the physical plans. An experienced programmer can write more efficient code because they have a set of rules, design patterns and best practices in their brain and can choose a suitable one depending on … Continue reading Spark SQL Query Engine Deep Dive (4) – Catalyst Optimizer Rules (Part 1)

Spark SQL Query Engine Deep Dive (3) – Catalyst Analyzer

Spark SQL Query Engine Deep Dive (3) – Catalyst Analyzer

After the SparkSQLParser parses a SQL query text into an ANTLR tree and then into a logical plan, the logical plan at this moment is unresolved. The table names and column names are just unidentified text. Spark SQL engine does not know how those table names or column names map to the underlying database objects. … Continue reading Spark SQL Query Engine Deep Dive (3) – Catalyst Analyzer

Spark SQL Query Engine Deep Dive (2) – Catalyst Rule Executor

Spark SQL Query Engine Deep Dive (2) – Catalyst Rule Executor

As mentioned in the last blog post, a Spark Catalyst query plan is internally represented as a tree, defined as a subclass of the TreeNode type. Each tree node can have zero or more tree node children, which makes a mutually recursive structure. TreeNode type encapsulates a list of methods for traversing the tree and … Continue reading Spark SQL Query Engine Deep Dive (2) – Catalyst Rule Executor

Spark SQL Query Engine Deep Dive (1) – Catalyst QueryExecution Overview

Spark SQL Query Engine Deep Dive (1) – Catalyst QueryExecution Overview

From this blog post on, I am going to start writing about Spark SQL Catalyst. Catalyst is the core of Spark SQL and there are many topics to cover. I don't have a formal writing plan on this, but instead, I want to keep my blog to be casual, informal, and more importantly stress-free to … Continue reading Spark SQL Query Engine Deep Dive (1) – Catalyst QueryExecution Overview

Setup a Dockerised Spark Development Environment with VS code and Docker

Setup a Dockerised Spark Development Environment with VS code and Docker

Databricks is not cheap, especially when I need to use it for my personal R&D work (where unfortunately money has to be taken from my own pocket). Therefore, I have been developing in a dockerised Spark environment since a while ago and I found this way actually works well. Here I list the steps to set … Continue reading Setup a Dockerised Spark Development Environment with VS code and Docker

Create Custom Partitioner for Spark Dataframe

Spark dataframe provides the repartition function to partition the dataframe by a specified column and/or a specified number of partitions. However, for some use cases, the repartition function doesn't work in the way as required. For example, in the previous blog post, Handling Embarrassing Parallel Workload with PySpark Pandas UDF, we want to repartition the traveller dataframe so … Continue reading Create Custom Partitioner for Spark Dataframe