dqops – Query Databricks Database Schema through SQL Connector for Python

dqops Data Quality Studio (DQS) is one of my R&D projects I have been doing during my spare time. I plan to note down some tips & tricks I use in this project in the future blog posts from time to time. Databricks is one of the main data services that the dqops DQS is … Continue reading dqops – Query Databricks Database Schema through SQL Connector for Python

Setup a Dockerised Spark Development Environment with VS code and Docker

Setup a Dockerised Spark Development Environment with VS code and Docker

Databricks is not cheap, especially when I need to use it for my personal R&D work (where unfortunately money has to be taken from my own pocket). Therefore, I have been developing in a dockerised Spark environment since a while ago and I found this way actually works well. Here I list the steps to set … Continue reading Setup a Dockerised Spark Development Environment with VS code and Docker

Why I Prefer Hand-Coded Transformations over ADF Mapping Data Flow

Firstly, I need to clarify that what I am discussing in this blog post is only with ADF Mapping Data Flow instead of the whole ADF service. I am not going to challenge ADF’s role as the superb orchestration service in the Azure data ecosystem. In fact, I love ADF. At the control flow level, … Continue reading Why I Prefer Hand-Coded Transformations over ADF Mapping Data Flow

Data Quality Improvement – DQ Dimensions = Confusions

Data Quality Improvement – DQ Dimensions = Confusions

DQ Dimensions are Confusing Data quality dimensions are great inventions from our data quality thought leaders and experts. Since the concept of quality dimensions was originally proposed in the course of the Total Data Quality Management (TDQM) program of MIT in the 1980s [5], a large number of data quality dimensions have been defined by … Continue reading Data Quality Improvement – DQ Dimensions = Confusions

Data Quality Improvement – Conditional Functional Dependency (CFD)

Data Quality Improvement – Conditional Functional Dependency (CFD)

To fulfil the promise I made before, I dedicate this blog post to cover the topic of Conditional Functional Dependency (CFD). The reason that I dedicate a whole blog post to this topic is that CFD is one of the most promising constraints to detect and repair inconsistencies in a dataset. The use of CFD … Continue reading Data Quality Improvement – Conditional Functional Dependency (CFD)

Data Quality Improvement – Data Profiling

This is the second post of my Data Quality Improvement blog series. This blog post discusses the data profiling tasks that I think are relevant to data quality improvement use cases. For anyone who has ever worked with data, she or he must has already done some sort of data profiling, either using a commercial … Continue reading Data Quality Improvement – Data Profiling

Data Quality – 80:20 Rule and 1:10:100 Rule

I came across two data quality rules from Martin Doyle's blog today. Martin Doyle is a data quality improvement evangelist and an industry expert on CRM. I found, to a certain extent, those data quality rules provide some kind of theoretical supports to some of my ideas with data quality improvements. 80:20 Rule The 80:20 … Continue reading Data Quality – 80:20 Rule and 1:10:100 Rule

Data Quality Improvement – Set the Scene Up

In this blog series I plan to write about data quality improvement from a data engineer's perspective. I plan this blog series to cover not only data quality concepts, methodologies, procedures but also to case study the architectural designs of some data quality management platforms and deep dive into technical details for implementing a data … Continue reading Data Quality Improvement – Set the Scene Up

Our Data Quality is Good, Nothing Breakdown

Boss: Our data quality is good, nothing breakdown IT: Our data quality is good, there are some unimportant known issues, but all under control BI Developer: All data is from source systems, the quality should be good. Hay, look, how cool is the dashboard I built Business Users: Once again, those reports don't make sense … Continue reading Our Data Quality is Good, Nothing Breakdown

What Makes Me Become a Data Quality Enthusiast

Data Quality is Important Most of the time I don't think I am an absolutist, however, I found I became more and more certain that data quality is the root of all evil. Not only bigger portion of project time should be allocated to data quality management, but also a type of lean, agile and … Continue reading What Makes Me Become a Data Quality Enthusiast