Author: Linxiao Ma

Spark SQL Query Engine Deep Dive (3) – Catalyst Analyzer

After the SparkSQLParser parses a SQL query text into an ANTLR tree and then into a logical plan, the logical plan at this point is still unresolved: the table names and column names are just unidentified text. The Spark SQL engine does not yet know how those table names or column names map to the underlying database objects. … Continue reading Spark SQL Query Engine Deep Dive (3) – Catalyst Analyzer
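To make "resolving" concrete, here is a toy sketch of what analysis means: unresolved names are matched against a catalog to produce qualified, resolved references. The catalog contents and function name here are hypothetical illustrations, not Catalyst's actual API (which is Scala and far richer).

```python
# Hypothetical mini-catalog: table name -> known columns.
catalog = {"orders": ["order_id", "amount", "customer_id"]}

def resolve_columns(table, columns):
    """Map raw column-name text to fully qualified, 'resolved' attributes,
    loosely mimicking what an analyzer's resolution rule does."""
    schema = catalog.get(table)
    if schema is None:
        raise ValueError(f"unresolved relation: {table}")
    for col in columns:
        if col not in schema:
            raise ValueError(f"unresolved attribute: {col}")
    return [f"{table}.{col}" for col in columns]

print(resolve_columns("orders", ["order_id", "amount"]))
# prints ['orders.order_id', 'orders.amount']
```

If a name cannot be found in the catalog, analysis fails — which is exactly why an unresolved plan cannot be executed as-is.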

Spark SQL Query Engine Deep Dive (2) – Catalyst Rule Executor

As mentioned in the last blog post, a Spark Catalyst query plan is internally represented as a tree, defined as a subclass of the TreeNode type. Each tree node can have zero or more tree node children, forming a recursive structure. The TreeNode type encapsulates a list of methods for traversing the tree and … Continue reading Spark SQL Query Engine Deep Dive (2) – Catalyst Rule Executor
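The idea of a recursive tree transformed by rules can be sketched in a few lines. This is a minimal, hypothetical node type, not Catalyst's real Scala TreeNode; it only illustrates the pre-order "apply rule, then recurse" pattern behind methods like transformDown.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    """A toy tree node: a name plus zero or more child nodes."""
    name: str
    children: List["Node"] = field(default_factory=list)

    def transform_down(self, rule: Callable[["Node"], "Node"]) -> "Node":
        # Pre-order traversal: apply the rule to this node first, then
        # recurse into the (possibly replaced) node's children,
        # returning a fresh tree rather than mutating in place.
        applied = rule(self)
        return Node(applied.name, [c.transform_down(rule) for c in applied.children])

plan = Node("Project", [Node("Filter", [Node("UnresolvedRelation")])])
resolved = plan.transform_down(
    lambda n: Node("LogicalRelation", n.children) if n.name == "UnresolvedRelation" else n
)
print(resolved)
```

A rule is just a function from node to node; a rule executor repeatedly applies a batch of such rules until the tree stops changing.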

Spark SQL Query Engine Deep Dive (1) – Catalyst QueryExecution Overview

From this blog post on, I am going to start writing about Spark SQL Catalyst. Catalyst is the core of Spark SQL and there are many topics to cover. I don't have a formal writing plan for this; instead, I want to keep my blog casual, informal, and, more importantly, stress-free to … Continue reading Spark SQL Query Engine Deep Dive (1) – Catalyst QueryExecution Overview

Parquet for Spark Deep Dive (4) – Vectorised Parquet Reading

In this blog post, I am going to dive into vectorised Parquet file reading in Spark. The vectorised Parquet reader was added in Spark 2.0. Instead of reading and decoding one row at a time, the vectorised reader batches multiple rows in a columnar format and processes them column by column in batches. … Continue reading Parquet for Spark Deep Dive (4) – Vectorised Parquet Reading
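The shape of batched columnar decoding can be sketched schematically. The batch size, the column data, and the "decode" step (a simple +1 here) are all illustrative stand-ins; the point is that a whole batch is decoded in one tight loop, amortising per-row overhead.

```python
BATCH_SIZE = 4  # illustrative; real readers use much larger batches

encoded_column = [10, 20, 30, 40, 50, 60]  # pretend these need decoding (+1)

def read_batches(column, batch_size=BATCH_SIZE):
    """Yield decoded column batches, mimicking the shape of a vectorised read:
    decode many values per call instead of dispatching per row."""
    for start in range(0, len(column), batch_size):
        batch = column[start:start + batch_size]
        # Decode the whole batch in one tight loop.
        yield [v + 1 for v in batch]

decoded = [v for batch in read_batches(encoded_column) for v in batch]
print(decoded)  # [11, 21, 31, 41, 51, 61]
```

Keeping the inner loop tight and branch-free over a contiguous batch is what makes the vectorised path friendly to CPU caches and JIT optimisation.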

Parquet for Spark Deep Dive (3) – Parquet Encoding

As promised in the last blog post, I am going to dedicate a whole blog post to exploring Parquet encoding, focusing on finding answers to the following questions: Why does Parquet use encoding? What encoding algorithms are used in Parquet? How does Parquet implement encoding? How does Parquet choose encoding algorithms? Why does Parquet use encoding? Short answer: … Continue reading Parquet for Spark Deep Dive (3) – Parquet Encoding
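A toy run-length encoder makes the motivation concrete: columnar layouts put repeated values next to each other, so runs compress to (value, count) pairs. This is only a sketch — Parquet's actual RLE/bit-packing hybrid is considerably more involved.

```python
def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((v, 1))              # start a new run
    return runs

def rle_decode(runs):
    """Expand (value, run_length) pairs back to the original sequence."""
    return [v for v, n in runs for _ in range(n)]

column = ["UK", "UK", "UK", "DE", "DE", "UK"]
encoded = rle_encode(column)
print(encoded)  # [('UK', 3), ('DE', 2), ('UK', 1)]
assert rle_decode(encoded) == column
```

Low-cardinality columns (country codes, flags, status fields) are where this pays off most, which is also why Parquet pairs run-length encoding with dictionary encoding.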

Parquet for Spark Deep Dive (2) – Parquet Write Internal

This blog post continues the Delta Lake table write journey into the Parquet file write internals. As described in the last blog post, a ParquetOutputWriter instance is created and calls the Parquet API to write a partition of the Spark SQL dataframe into a Parquet file. From this point on, the Delta table write journey steps … Continue reading Parquet for Spark Deep Dive (2) – Parquet Write Internal

Parquet for Spark Deep Dive (1) – Table Writing Journey Overview

In this blog post, I am going to explore the full Delta table writing journey, from a Spark SQL Dataframe down to the underlying Parquet files. The diagram above shows the whole journey of a table writing operation (some steps have been simplified or abstracted in order to make the whole table writing journey presentable in the … Continue reading Parquet for Spark Deep Dive (1) – Table Writing Journey Overview

Databricks Lakehouse Breaks Data Warehousing Performance Record – Time to Forget about Data Warehouse?

Lakehouse is faster than Data Warehouse! This is a big deal! Now, what could stop the Lakehouse from replacing the Data Warehouse? This is the news I have been waiting for since Databricks SQL, a full suite of data warehousing capabilities, was announced last year. In the 100TB TPC-DS performance benchmark for data warehousing, Databricks SQL … Continue reading Databricks Lakehouse Breaks Data Warehousing Performance Record – Time to Forget about Data Warehouse?

dqops Data Quality Rules (Part 2) – CFD, Machine Learning

The previous blog post introduced a list of basic data quality rules developed for my R&D data quality improvement initiative. Those rules are fundamental and essential for detecting data quality problems. However, they have been around for a long time and are neither innovative nor exciting. More importantly, those … Continue reading dqops Data Quality Rules (Part 2) – CFD, Machine Learning

dqops Data Quality Rules (Part 1) – Basic Rules

In my previous blog post, I discussed the requirements and core elements of rule-based data quality assessments. In this and the next blog post, I am going to walk through the data quality rules designed in the dqops DQ Studio app (one of my R&D initiatives for data quality improvement), ranging from the basic data … Continue reading dqops Data Quality Rules (Part 1) – Basic Rules
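The shape of a basic rule can be sketched with a null-rate check. The rule name, threshold, and result dictionary here are hypothetical illustrations of the rule-based pattern, not dqops' actual API.

```python
def null_rate(values):
    """Fraction of missing values in a column."""
    return sum(v is None for v in values) / len(values)

def check_null_rate(values, threshold=0.1):
    """A basic data quality rule: pass if the null rate stays at or
    under a configurable threshold."""
    rate = null_rate(values)
    return {"rule": "null_rate", "value": rate, "passed": rate <= threshold}

result = check_null_rate([1, 2, None, 4, 5, 6, 7, 8, 9, 10])
print(result)  # {'rule': 'null_rate', 'value': 0.1, 'passed': True}
```

Each rule reduces a column to a measured value plus a pass/fail verdict against a threshold — the common structure that the more advanced CFD and machine-learning rules in Part 2 build on.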