Category: Data Platform & Lakehouse

S3 + Parquet + Iceberg + Trino: A Poor Man’s Market Data Platform

Before I start talking about how effective this architecture can be at reducing infrastructure costs, I should first make the old point that there is really no free lunch. Compared with commercial cloud data platforms and warehouses such as Databricks, BigQuery, and Snowflake, an open lakehouse setup requires significantly more engineering effort to build, operate, …

How QuantFlow Handles Large-Scale Market Data

For many years, a large portion of systematic strategies relied on relatively low-frequency signals. These approaches worked well while they were under-explored, but over time they have been widely researched and increasingly arbitraged, and their edge has been structurally compressed. As a result, a growing share of the remaining opportunity has shifted toward market microstructure: order flow dynamics, …

Databricks Lakehouse Breaks Data Warehousing Performance Record – Time to Forget about Data Warehouse?

Lakehouse is faster than Data Warehouse! This is a big deal! Now, what could stop Lakehouse from replacing Data Warehouse? This is the news I have been waiting for since Databricks SQL, the full suite of data warehousing capabilities, was announced last year. In the 100TB TPC-DS performance benchmark test for data warehousing, Databricks SQL …

How Azure Storage Cheats Over the CAP Theorem

Microsoft claims that Azure Storage provides both high availability and strong consistency. It sounds good, but it obviously violates the CAP theorem, since the 'P' (network partitioning) is unavoidable in the real world. In theory, a distributed storage system can achieve either high availability or strong consistency, but not both. I have done a bit of …

dqops – Query Databricks Database Schema through SQL Connector for Python

dqops Data Quality Studio (DQS) is one of the R&D projects I have been working on in my spare time. From time to time, I plan to note down in future blog posts some of the tips & tricks I use in this project. Databricks is one of the main data services that the dqops DQS is …

Setup a Dockerised Spark Development Environment with VS code and Docker

Databricks is not cheap, especially when I need to use it for my personal R&D work (where, unfortunately, the money has to come out of my own pocket). Therefore, I have been developing in a dockerised Spark environment for a while now, and I have found that this approach works well. Here I list the steps to set …

Why I Prefer Hand-Coded Transformations over ADF Mapping Data Flow

Firstly, I need to clarify that this blog post concerns only ADF Mapping Data Flow, not the whole ADF service. I am not going to challenge ADF’s role as a superb orchestration service in the Azure data ecosystem. In fact, I love ADF. At the control flow level, …

Create Custom Partitioner for Spark Dataframe

The Spark dataframe provides the repartition function to partition a dataframe by a specified column and/or a specified number of partitions. However, for some use cases, the repartition function does not work the way we need. For example, in the previous blog post, Handling Embarrassing Parallel Workload with PySpark Pandas UDF, we want to repartition the traveller dataframe so …

Configuration-Driven Azure Data Factory Pipelines

In this blog post, I will introduce two configuration-driven Azure Data Factory pipeline patterns I have used in previous projects: the Source-Sink pattern and the Key-Value pattern. The Source-Sink pattern is primarily used for parameterising and configuring data movement activities, with the source location and sink location of the data movement configured in a …

Handling Embarrassing Parallel Workload with PySpark Pandas UDF

In the previous post, I walked through an approach to handling embarrassingly parallel workloads with Databricks notebook workflows. However, because all the parallel workloads run on a single node (the cluster driver), that approach can only scale up to a certain point, depending on the capability of the driver VM and …
