Tag: Parquet

Parquet for Spark Deep Dive (4) – Vectorised Parquet Reading
In this blog post, I am going to dive into vectorised Parquet file reading in Spark. The vectorised Parquet file reader is a feature added in Spark 2.0. Instead of reading and decoding one row at a time, the vectorised reader batches multiple rows in a columnar format and processes them column by column in batches. … Continue reading Parquet for Spark Deep Dive (4) – Vectorised Parquet Reading
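As a quick illustration of the behaviour the excerpt describes, here is a minimal Scala sketch that reads a Parquet dataset with the vectorised reader explicitly enabled. The configuration keys are standard Spark SQL settings; the dataset path and the batch size value are placeholders of my own, not something taken from the post.

```scala
import org.apache.spark.sql.SparkSession

object VectorisedReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("vectorised-parquet-read")
      .master("local[*]")
      // The vectorised reader is on by default; the flag is set here only to make it explicit.
      .config("spark.sql.parquet.enableVectorizedReader", "true")
      // Number of rows decoded into each columnar batch (4096 is the default).
      .config("spark.sql.parquet.columnarReaderBatchSize", "4096")
      .getOrCreate()

    // Placeholder path; point it at any existing Parquet dataset.
    val df = spark.read.parquet("/tmp/parquet-demo")
    df.show(10)

    spark.stop()
  }
}
```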
Parquet for Spark Deep Dive (3) – Parquet Encoding
As promised in the last blog post, I am going to dedicate a whole blog post to exploring Parquet encoding, focusing on finding answers to the following questions: Why does Parquet use encoding? What encoding algorithms are used in Parquet? How does Parquet implement encoding? How does Parquet choose encoding algorithms? Why does Parquet use encoding? Short answer: … Continue reading Parquet for Spark Deep Dive (3) – Parquet Encoding
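To see which encodings Parquet actually chose for a given file, the column chunk metadata stored in the file footer can be inspected with the parquet-mr API. The sketch below is my own illustration rather than code from the post; it assumes Scala 2.13 (on 2.12, use scala.collection.JavaConverters instead), and the file path is a placeholder.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

import scala.jdk.CollectionConverters._

object InspectEncodings {
  def main(args: Array[String]): Unit = {
    // Placeholder path to a single Parquet file on a Hadoop-compatible filesystem.
    val file = HadoopInputFile.fromPath(
      new Path("/tmp/parquet-demo/part-00000.parquet"), new Configuration())
    val reader = ParquetFileReader.open(file)
    try {
      // Each row group (block) records, per column chunk, the encodings actually used.
      reader.getFooter.getBlocks.asScala.foreach { block =>
        block.getColumns.asScala.foreach { col =>
          println(s"${col.getPath}: encodings=${col.getEncodings}")
        }
      }
    } finally {
      reader.close()
    }
  }
}
```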
Parquet for Spark Deep Dive (2) – Parquet Write Internal
This blog post continues the Delta Lake table write journey into the Parquet file write internals. As described in the last blog post, a ParquetOutputWriter instance is created and calls the Parquet API to write a partition of the Spark SQL dataframe into a Parquet file. From this point on, the Delta table write journey steps … Continue reading Parquet for Spark Deep Dive (2) – Parquet Write Internal
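For context, the write journey described above is triggered by an ordinary DataFrame write: each write task hands its partition to a ParquetOutputWriter, which in turn drives the parquet-mr write path. The following sketch is simply a hypothetical driver program to exercise that path; the output path, partition count, and codec choice are placeholders of my own.

```scala
import org.apache.spark.sql.SparkSession

object ParquetWriteExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-write-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Small demo DataFrame; each write task passes one partition of it to a
    // ParquetOutputWriter, which calls the Parquet API to produce a part file.
    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")

    df.repartition(2)                      // two partitions -> two Parquet part files
      .write
      .mode("overwrite")
      .option("compression", "snappy")     // codec applied when column chunks are written
      .parquet("/tmp/parquet-write-demo")  // placeholder output path

    spark.stop()
  }
}
```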


