In this blog post, I am going to dive into vectorised Parquet file reading in Spark. The vectorised Parquet file reader is a feature added in Spark 2.0. Instead of reading and decoding one row at a time, the vectorised reader batches multiple rows in a columnar format and processes them column by column in batches. … Continue reading Parquet for Spark Deep Dive (4) – Vectorised Parquet Reading
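The contrast the excerpt describes can be sketched in a few lines. This is a hedged illustration, not Spark's actual reader code: both helper functions (`read_row_at_a_time`, `read_vectorised`) are hypothetical names invented here to show row-oriented assembly versus columnar batching.

```python
# Illustrative sketch only (NOT Spark's implementation): the idea behind
# vectorised reading is to process each column as a contiguous chunk,
# one batch of rows at a time, instead of assembling one row per step.

def read_row_at_a_time(columns):
    """Row-oriented: materialise one row tuple per step (hypothetical helper)."""
    num_rows = len(columns[0])
    return [tuple(col[i] for col in columns) for i in range(num_rows)]

def read_vectorised(columns, batch_size=2):
    """Column-oriented: yield columnar batches; each column stays a
    contiguous slice within the batch (hypothetical helper)."""
    num_rows = len(columns[0])
    for start in range(0, num_rows, batch_size):
        yield [col[start:start + batch_size] for col in columns]

ids = [1, 2, 3, 4]
names = ["a", "b", "c", "d"]

rows = read_row_at_a_time([ids, names])
batches = list(read_vectorised([ids, names], batch_size=2))
# rows[0] is (1, "a"); batches[0] keeps the columnar layout: [[1, 2], ["a", "b"]]
```

The point of the columnar shape is that decoding and predicate evaluation can run over a whole slice of one column at a time, which is friendlier to CPU caches and amenable to SIMD.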
Parquet for Spark Deep Dive (3) – Parquet Encoding
As promised in the last blog post, I am going to dedicate a whole blog post to exploring Parquet encoding, focusing on finding answers to the following questions: Why does Parquet use encoding? What encoding algorithms are used in Parquet? How does Parquet implement encoding? How does Parquet choose encoding algorithms? Why does Parquet use encoding? Short answer: … Continue reading Parquet for Spark Deep Dive (3) – Parquet Encoding
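As a taste of what encoding buys, here is a minimal sketch of dictionary encoding, one of the encodings Parquet does use for low-cardinality columns. The helper names are invented for illustration; real Parquet dictionary pages also combine this with RLE/bit-packing of the indices, which is omitted here.

```python
# Hedged sketch of dictionary encoding (simplified; real Parquet also
# RLE/bit-packs the indices): repeated values are replaced by small
# integer indices into a dictionary, shrinking low-cardinality columns.

def dictionary_encode(values):
    dictionary = []          # distinct values, in first-seen order
    index_of = {}            # value -> its dictionary index
    indices = []             # encoded column: one small int per row
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices

def dictionary_decode(dictionary, indices):
    return [dictionary[i] for i in indices]

column = ["UK", "UK", "DE", "UK", "DE", "FR"]
dictionary, indices = dictionary_encode(column)
# dictionary == ["UK", "DE", "FR"]; indices == [0, 0, 1, 0, 1, 2]
assert dictionary_decode(dictionary, indices) == column
```

Six string values collapse to three dictionary entries plus six small integers, which is why this works so well on columns like country codes or status flags.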
Parquet for Spark Deep Dive (2) – Parquet Write Internal
This blog post continues the Delta Lake table write journey into the Parquet file write internals. As described in the last blog post, a ParquetOutputWriter instance is created and calls the Parquet API to write a partition of the Spark SQL dataframe into a Parquet file. From this point on, the Delta table write journey steps … Continue reading Parquet for Spark Deep Dive (2) – Parquet Write Internal
Parquet for Spark Deep Dive (1) – Table Writing Journey Overview
In this blog post, I am going to explore the full Delta table writing journey, from a Spark SQL DataFrame to the underlying Parquet files. The diagram above shows the whole journey of a table writing operation (some steps have been simplified or abstracted in order to make the whole table writing journey presentable in the … Continue reading Parquet for Spark Deep Dive (1) – Table Writing Journey Overview
Databricks Lakehouse Breaks Data Warehousing Performance Record – Time to Forget about Data Warehouse?
Lakehouse is faster than Data Warehouse! This is a big deal! Now, what could stop Lakehouse from replacing Data Warehouse? This is the news I have been waiting for since Databricks SQL, the full suite of data warehousing capabilities, was announced last year. In the 100TB TPC-DS performance benchmark test for data warehousing, Databricks SQL … Continue reading Databricks Lakehouse Breaks Data Warehousing Performance Record – Time to Forget about Data Warehouse?
dqops Data Quality Rules (Part 2) – CFD, Machine Learning
The previous blog post introduced a list of basic data quality rules that have been developed for my R&D data quality improvement initiative. Those rules are fundamental and essential for detecting data quality problems. However, those rules have been around for a long, long time, and they are neither innovative nor exciting. More importantly, those … Continue reading dqops Data Quality Rules (Part 2) – CFD, Machine Learning
dqops Data Quality Rules (Part 1) – Basic Rules
In my previous blog post, I discussed the requirements and core elements of rule-based data quality assessments. In this and the next blog post, I am going to walk through the data quality rules designed in the dqops DQ Studio app (one of my R&D initiatives for data quality improvement), ranging from the basic data … Continue reading dqops Data Quality Rules (Part 1) – Basic Rules
Data Quality Improvement – Rule-Based Data Quality Assessment
As discussed in the previous blog posts in my Data Quality Improvement series, the key to successful data quality management is continuous awareness and insight into how fit your data is for your business use. Data quality assessment is the core and possibly the most challenging activity in the data quality management process. … Continue reading Data Quality Improvement – Rule-Based Data Quality Assessment
What is Data Management, actually? – DAMA-DMBOK Framework
"What is data management?". I guess many people will (at least I think I will) answer "em... data management is managing data, right?", while swearing in their heads, "what a stupid question!". However, if I were asked this question in a job interview, I guess I'd better provide a bit … Continue reading What is Data Management, actually? – DAMA-DMBOK Framework
How Azure Storage Cheats Over the CAP Theorem
Microsoft claims that Azure Storage provides both high availability and strong consistency. It sounds good but appears to violate the CAP theorem, since the 'P' (partition tolerance) cannot be given up in the real world: network partitions do happen. In theory, during a partition a distributed storage system can offer either high availability or strong consistency, but not both. I have done a bit of … Continue reading How Azure Storage Cheats Over the CAP Theorem