Tag: Databricks / Spark

Spark SQL Query Engine Deep Dive (10) – HashAggregateExec & ObjectHashAggregateExec

This blog post continues to explore the Aggregate strategy and focuses on the two hash-based aggregation operators provided by Spark SQL, HashAggregateExec and ObjectHashAggregateExec. Hash-based aggregation is the preferred alternative to the sort-based aggregation explained in the last blog post. Compared to sort-based aggregation, hash-based aggregation does not need the extra sorting … Continue reading Spark SQL Query Engine Deep Dive (10) – HashAggregateExec & ObjectHashAggregateExec
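A minimal PySpark sketch (plan text and operator choice vary by Spark version) for surfacing the hash-based operator in a physical plan:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hash-agg-demo").getOrCreate()

df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

# sum() keeps a fixed-width, mutable aggregate buffer (LongType), so the
# planner picks HashAggregateExec -- look for the partial and final
# "HashAggregate" nodes in the printed physical plan.
df.groupBy("key").agg(F.sum("value")).explain()
```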

Spark SQL Query Engine Deep Dive (9) – SortAggregateExec

The last blog post explained the Aggregation strategy for generating physical plans for aggregate operations. I will continue with this topic and look into the details of the physical aggregate operators supported by Spark SQL. As explained in the last blog post, a logical aggregate operator can be transformed into a physical plan consisting of … Continue reading Spark SQL Query Engine Deep Dive (9) – SortAggregateExec
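For a quick look at when the planner falls back to the sort-based operator, here is a hedged PySpark sketch (my understanding is that a non-fixed-width buffer such as a string triggers the fallback; exact behaviour varies by Spark version):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sort-agg-demo").getOrCreate()

df = spark.createDataFrame([("a", "x"), ("a", "y"), ("b", "z")], ["key", "name"])

# min() over a string column keeps a variable-width buffer that the
# hash-based operator cannot mutate in place, so the physical plan should
# show "SortAggregate" nodes instead of "HashAggregate".
df.groupBy("key").agg(F.min("name")).explain()
```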

Spark SQL Query Engine Deep Dive (8) – Aggregation Strategy

In the last blog post, I gave an overview of the SparkPlanner for planning physical execution plans from an optimised logical plan. In the next few blog posts, I will dig a bit deeper into some important strategies, starting with the "Aggregation" strategy in this blog post. The "Aggregation" strategy plans the physical execution … Continue reading Spark SQL Query Engine Deep Dive (8) – Aggregation Strategy
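As a small illustration of what the strategy produces, explain(True) prints the single logical Aggregate node alongside the physical plan derived from it (a sketch; output format differs across versions):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("agg-strategy-demo").getOrCreate()

df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

# The optimised logical plan holds one Aggregate operator; the physical
# plan splits it into a partial aggregate, a shuffle Exchange, and a
# final aggregate.
df.groupBy("key").agg(F.avg("value")).explain(True)
```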

Parquet for Spark Deep Dive (4) – Vectorised Parquet Reading

In this blog post, I am going to dive into the vectorised Parquet file reading in Spark. The vectorised Parquet file reader is a feature added in Spark 2.0. Instead of reading and decoding one row at a time, the vectorised reader batches multiple rows in a columnar format and processes them column by column in batches. … Continue reading Parquet for Spark Deep Dive (4) – Vectorised Parquet Reading
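To compare the two code paths yourself, the vectorised reader can be switched off via a standard Spark config (a sketch; the /tmp path is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("vectorised-read-demo").getOrCreate()

# Write a small Parquet dataset to read back.
spark.range(1_000_000).write.mode("overwrite").parquet("/tmp/demo_parquet")

# The vectorised reader is on by default; disabling it forces the
# row-at-a-time fallback, which is handy for benchmarking the difference.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.read.parquet("/tmp/demo_parquet").selectExpr("sum(id)").show()
```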

Parquet for Spark Deep Dive (3) – Parquet Encoding

As promised in the last blog post, I am going to dedicate a whole blog post to exploring Parquet encoding, focusing on finding answers to the following questions: Why does Parquet use encoding? What encoding algorithms are used in Parquet? How does Parquet implement encoding? How does Parquet choose encoding algorithms? Why does Parquet use encoding? Short answer: … Continue reading Parquet for Spark Deep Dive (3) – Parquet Encoding
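One way to see which encodings Parquet actually chose is to read the file metadata with PyArrow (a sketch; the file path is a placeholder):

```python
import pyarrow.parquet as pq

# Walk every column chunk and print the encodings Parquet selected for it,
# e.g. ('PLAIN_DICTIONARY', 'PLAIN', 'RLE').
meta = pq.ParquetFile("/tmp/demo.parquet").metadata
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        print(rg, chunk.path_in_schema, chunk.encodings)
```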

Parquet for Spark Deep Dive (2) – Parquet Write Internal

This blog post continues the Delta Lake table write journey into the Parquet file write internals. As described in the last blog post, a ParquetOutputWriter instance is created and calls the Parquet API to write a partition of the Spark SQL dataframe into a Parquet file. From this point on, the Delta table write journey steps … Continue reading Parquet for Spark Deep Dive (2) – Parquet Write Internal
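From the Spark SQL side, the hand-off to the Parquet writer can be observed with a plain write; parquet.block.size (the row-group size) is a standard parquet-hadoop setting, though whether .option() propagates it is worth verifying on your version (a sketch):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-write-demo").getOrCreate()

# Each DataFrame partition is written by its own ParquetOutputWriter;
# 128 MB row groups are requested here purely as an illustration.
(spark.range(1_000_000)
      .repartition(4)
      .write
      .option("parquet.block.size", str(128 * 1024 * 1024))
      .mode("overwrite")
      .parquet("/tmp/parquet_write_demo"))
```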

Parquet for Spark Deep Dive (1) – Table Writing Journey Overview

In this blog post, I am going to explore the full Delta table writing journey, from a Spark SQL Dataframe down to the underlying Parquet files. The diagram above shows the whole journey of a table writing operation (some steps have been simplified or abstracted in order to make the whole table writing journey presentable in the … Continue reading Parquet for Spark Deep Dive (1) – Table Writing Journey Overview
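The journey starts from a single write call like the sketch below (it assumes a session configured with the Delta Lake package; the path is a placeholder):

```python
from pyspark.sql import SparkSession

# Assumes delta-spark (io.delta) is on the classpath and the session is
# configured with the Delta SQL extensions.
spark = SparkSession.builder.appName("delta-write-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# This one call kicks off everything the post traces: job planning, the
# file committer, ParquetOutputWriter, and the transaction log commit.
df.write.format("delta").mode("append").save("/tmp/delta_demo_table")
```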

Databricks Lakehouse Breaks Data Warehousing Performance Record – Time to Forget about Data Warehouse?

Lakehouse is faster than Data Warehouse! This is a big deal! Now, what could stop Lakehouse from replacing Data Warehouse? This is the news I have been waiting for since Databricks SQL, the full suite of data warehousing capabilities, was announced last year. In the 100TB TPC-DS performance benchmark test for data warehousing, Databricks SQL … Continue reading Databricks Lakehouse Breaks Data Warehousing Performance Record – Time to Forget about Data Warehouse?

dqops – Query Databricks Database Schema through SQL Connector for Python

dqops Data Quality Studio (DQS) is one of the R&D projects I have been working on in my spare time. I plan to note down some of the tips & tricks I use in this project in future blog posts from time to time. Databricks is one of the main data services that the dqops DQS is … Continue reading dqops – Query Databricks Database Schema through SQL Connector for Python
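The shape of the approach is roughly the following databricks-sql-connector sketch (hostname, HTTP path, token, and the DESCRIBE target are all placeholders):

```python
from databricks import sql

# Placeholder connection details -- substitute your own workspace values.
connection = sql.connect(
    server_hostname="adb-0000000000000000.0.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/xxxxxxxxxxxxxxxx",
    access_token="dapiXXXXXXXXXXXX",
)
cursor = connection.cursor()

# Pull a table's schema through plain SQL.
cursor.execute("DESCRIBE TABLE my_database.my_table")
for row in cursor.fetchall():
    print(row)

cursor.close()
connection.close()
```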

Setup a Dockerised Spark Development Environment with VS Code and Docker

Databricks is not cheap, especially when I need to use it for my personal R&D work (where, unfortunately, the money comes out of my own pocket). Therefore, I have been developing in a dockerised Spark environment for a while now, and I have found that this approach actually works well. Here I list the steps to set … Continue reading Setup a Dockerised Spark Development Environment with VS Code and Docker
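Once the container is up, a one-file smoke test like this confirms the environment works (a sketch; local[*] assumes a single-container setup):

```python
from pyspark.sql import SparkSession

# Run inside the dev container; swap local[*] for a spark:// master URL
# if you run a separate master container.
spark = (SparkSession.builder
         .appName("docker-env-smoke-test")
         .master("local[*]")
         .getOrCreate())

print(spark.version)
print(spark.range(10).count())  # expect 10 if everything is wired up
```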