Category: *Deep Dive – Parquet for Spark

Parquet for Spark Deep Dive (4) – Vectorised Parquet Reading

In this blog post, I am going to dive into vectorised Parquet file reading in Spark. The vectorised Parquet file reader has been available since Spark 2.0. Instead of reading and decoding one row at a time, the vectorised reader batches multiple rows in a columnar format and processes them column by column. … Continue reading Parquet for Spark Deep Dive (4) – Vectorised Parquet Reading →

Parquet for Spark Deep Dive (3) – Parquet Encoding

As promised in the last blog post, I am going to dedicate a whole blog post to exploring Parquet encoding, focusing on finding answers to the following questions: Why does Parquet use encoding? What encoding algorithms are used in Parquet? How does Parquet implement encoding? How does Parquet choose encoding algorithms? Why does Parquet use encoding? Short answer: … Continue reading Parquet for Spark Deep Dive (3) – Parquet Encoding →

Parquet for Spark Deep Dive (2) – Parquet Write Internal

This blog post continues the Delta Lake table write journey into the internals of Parquet file writing. As described in the last blog post, a ParquetOutputWriter instance is created and calls the Parquet API to write a partition of the Spark SQL DataFrame into a Parquet file. From this point on, the Delta table write journey steps … Continue reading Parquet for Spark Deep Dive (2) – Parquet Write Internal →

Parquet for Spark Deep Dive (1) – Table Writing Journey Overview

In this blog post, I am going to explore the full Delta table writing journey, from the Spark SQL DataFrame to the underlying Parquet files. The diagram above shows the whole journey of a table writing operation (some steps have been simplified or abstracted in order to make the whole table writing journey presentable in the … Continue reading Parquet for Spark Deep Dive (1) – Table Writing Journey Overview →
