In this blog post, I am going to dive into vectorised Parquet file reading in Spark. The vectorised Parquet file reader is a feature added in Spark 2.0. Instead of reading and decoding one row at a time, the vectorised reader batches multiple rows in a columnar format and processes them column by column in batches. … Continue reading Parquet for Spark Deep Dive (4) – Vectorised Parquet Reading
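A quick way to see this in action: the vectorised path is controlled by the spark.sql.parquet.enableVectorizedReader setting, which is on by default. Below is a minimal PySpark sketch (the file path is a placeholder) for toggling it when comparing against the row-at-a-time reader.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("vectorised-parquet-demo").getOrCreate()

# Vectorised reading is enabled by default; this flag lets you switch back to
# the row-at-a-time reader when comparing the two code paths.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")

# Placeholder path for this sketch.
df = spark.read.parquet("/tmp/demo_parquet")
df.show(5)
```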
Category: Databricks / Spark
Parquet for Spark Deep Dive (3) – Parquet Encoding
As promised in the last blog post, I am going to dedicate a whole blog post to exploring Parquet encoding, focusing on finding answers to the following questions: Why does Parquet use encoding? What encoding algorithms are used in Parquet? How does Parquet implement encoding? How does Parquet choose encoding algorithms? Why does Parquet use encoding? Short answer: … Continue reading Parquet for Spark Deep Dive (3) – Parquet Encoding
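As a rough intuition for one of the encodings the post covers, here is a toy dictionary-encoding sketch in plain Python. It illustrates the idea only and is not Parquet's actual implementation: repeated column values are stored once in a dictionary, and the data keep only small integer indices.

```python
# Toy dictionary encoding: store distinct values once, keep integer indices.
values = ["GB", "GB", "US", "GB", "DE", "US", "GB"]

dictionary = sorted(set(values))            # ['DE', 'GB', 'US']
index_of = {v: i for i, v in enumerate(dictionary)}
encoded = [index_of[v] for v in values]     # [1, 1, 2, 1, 0, 2, 1]

# Decoding simply looks the indices back up in the dictionary.
decoded = [dictionary[i] for i in encoded]
assert decoded == values
```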
Parquet for Spark Deep Dive (2) – Parquet Write Internal
This blog post continues the Delta Lake table write journey into the Parquet file write internals. As described in the last blog post, a ParquetOutputWriter instance is created and calls the Parquet API to write a partition of the Spark SQL DataFrame into a Parquet file. From this point on, the Delta table write journey steps … Continue reading Parquet for Spark Deep Dive (2) – Parquet Write Internal
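For orientation, this is roughly the user-level call that kicks off that internal path: writing a DataFrame out as Parquet, with each partition handled by its own output writer. The DataFrame and output path below are placeholders for this sketch.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-write-demo").getOrCreate()

# A throwaway DataFrame; each of its partitions is handed to an output writer,
# which drives the Parquet write described in the post.
df = spark.range(1_000_000).withColumnRenamed("id", "order_id")

df.write.mode("overwrite").parquet("/tmp/orders_parquet")  # placeholder path
```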
Parquet for Spark Deep Dive (1) – Table Writing Journey Overview
In this blog post, I am going to explore the full Delta table writing journey, from the Spark SQL DataFrame down to the underlying Parquet files. The diagram above shows the whole journey of a table writing operation (some steps have been simplified or abstracted in order to make the whole table writing journey presentable in the … Continue reading Parquet for Spark Deep Dive (1) – Table Writing Journey Overview
Databricks Lakehouse Breaks Data Warehousing Performance Record – Time to Forget about Data Warehouse?
Lakehouse is faster than Data Warehouse! This is a big deal! Now, what could stop Lakehouse from replacing Data Warehouse? This is the news I have been waiting for since Databricks SQL, the full suite of data warehousing capabilities, was announced last year. In the 100TB TPC-DS performance benchmark test for data warehousing, Databricks SQL … Continue reading Databricks Lakehouse Breaks Data Warehousing Performance Record – Time to Forget about Data Warehouse?
dqops – Query Databricks Database Schema through SQL Connector for Python
dqops Data Quality Studio (DQS) is one of the R&D projects I have been working on in my spare time. I plan to note down some of the tips & tricks I use in this project in future blog posts from time to time. Databricks is one of the main data services that the dqops DQS is … Continue reading dqops – Query Databricks Database Schema through SQL Connector for Python
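As a taster for the post, here is a minimal sketch of querying a Databricks schema with the databricks-sql-connector package; the hostname, HTTP path, access token and schema name are placeholders for your own workspace values.

```python
from databricks import sql  # pip install databricks-sql-connector

connection = sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder
    http_path="/sql/1.0/warehouses/abcdef1234567890",              # placeholder
    access_token="dapiXXXXXXXXXXXXXXXX",                           # placeholder
)

cursor = connection.cursor()
cursor.execute("SHOW TABLES IN default")  # placeholder schema name
for row in cursor.fetchall():
    print(row)

cursor.close()
connection.close()
```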
Setup a Dockerised Spark Development Environment with VS code and Docker
Databricks is not cheap, especially when I need to use it for my personal R&D work (where the money unfortunately has to come out of my own pocket). Therefore, I have been developing in a dockerised Spark environment for a while, and I have found that this approach works well. Here I list the steps to set … Continue reading Setup a Dockerised Spark Development Environment with VS code and Docker
Create Custom Partitioner for Spark Dataframe
Spark DataFrame provides the repartition function to partition the DataFrame by a specified column and/or a specified number of partitions. However, for some use cases, the repartition function doesn't work in the way required. For example, in the previous blog post, Handling Embarrassing Parallel Workload with PySpark Pandas UDF, we want to repartition the traveller DataFrame so … Continue reading Create Custom Partitioner for Spark Dataframe
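To make the limitation concrete, here is a small PySpark sketch (column names and data are illustrative only): repartition(n, col) hash-partitions by the column, so equal values co-locate, but you cannot choose which partition each value lands in; for that level of control, one option is to drop to the RDD API and pass your own partition function to partitionBy, which is the direction the post explores.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-demo").getOrCreate()

# Toy "traveller" DataFrame; names are illustrative only.
df = spark.createDataFrame(
    [(1, "London"), (2, "Paris"), (3, "London"), (4, "Berlin")],
    ["traveller_id", "city"],
)

# Hash-partition by a column: rows with the same city share a partition,
# but which partition each city goes to is not under your control.
by_city = df.repartition(4, "city")
print(by_city.rdd.getNumPartitions())

# For full control, pair the rows up by key and supply a custom partition
# function through the RDD API (a sketch of the idea, not the post's code).
pairs = df.rdd.map(lambda row: (row["city"], row))
custom = pairs.partitionBy(4, lambda city: hash(city) % 4)
print(custom.getNumPartitions())
```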
Handling Embarrassing Parallel Workload with PySpark Pandas UDF
Introduction In the previous post, I walked through the approach to handling embarrassingly parallel workloads with Databricks notebook workflows. However, as all the parallel workloads run on a single node (the cluster driver), that approach can only scale up to a certain point, depending on the capability of the driver VM and … Continue reading Handling Embarrassing Parallel Workload with PySpark Pandas UDF
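For a flavour of the approach, below is a minimal sketch of the grouped-map pattern (applyInPandas in recent Spark versions): each group is shipped to an executor as a plain pandas DataFrame, so the per-task work runs across the cluster rather than on the driver. The data and column names are illustrative only.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

# Toy data: one independent task per task_id; names are illustrative only.
df = spark.createDataFrame(
    [("A", 1.0), ("A", 2.0), ("B", 3.0), ("B", 5.0)],
    ["task_id", "value"],
)

def run_task(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group arrives as a pandas DataFrame on an executor, so the
    # per-task computation is distributed across the cluster.
    return pd.DataFrame(
        {"task_id": [pdf["task_id"].iloc[0]], "result": [pdf["value"].mean()]}
    )

out = df.groupBy("task_id").applyInPandas(run_task, schema="task_id string, result double")
out.show()
```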
Handling Embarrassing Parallel Workload with Databricks Notebook Workflows
Introduction Embarrassingly parallel refers to problems where little or no effort is needed to separate the work into parallel tasks, and no communication is needed between the parallel tasks. Embarrassingly parallel problems are very common, with typical examples including group-by analyses, simulations, optimisations, cross-validations and feature selections. Normally, an Embarrassing … Continue reading Handling Embarrassing Parallel Workload with Databricks Notebook Workflows
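For a flavour of the notebook-workflow approach, here is a minimal sketch that fans out calls to dbutils.notebook.run from a thread pool on the driver; it assumes it runs inside a Databricks notebook (where dbutils is available), and the child notebook path, timeout and parameters are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder task parameters; each dict is passed to one child notebook run.
tasks = [{"task_id": str(i)} for i in range(8)]

def run_one(args):
    # Each call launches the child notebook as an independent run on the cluster.
    # "/Shared/run_single_task" and the 3600-second timeout are placeholders.
    return dbutils.notebook.run("/Shared/run_single_task", 3600, args)

# The threads mostly wait on notebook runs, so a modest pool on the driver is enough.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_one, tasks))

print(results)
```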



