Author: Linxiao Ma

Data Quality Improvement – Rule-Based Data Quality Assessment

As discussed in the previous blog posts in my Data Quality Improvement series, the key to successful data quality management is continuous awareness of, and insight into, how fit your data is for use in your business. Data quality assessment is the core and possibly the most challenging activity in the data quality management process. … Continue reading Data Quality Improvement – Rule-Based Data Quality Assessment →

What is Data Management, actually? – DAMA-DMBOK Framework

"What is data management?" I guess many people will (at least I think I will) answer "em... data management is managing data, right?" while swearing in their heads, "what a stupid question!" However, if I were asked this question in a job interview, I guess I'd better provide a bit … Continue reading What is Data Management, actually? – DAMA-DMBOK Framework →

How Azure Storage Cheats Over the CAP Theorem

Microsoft claims that Azure Storage provides both high availability and strong consistency. That sounds good, but it obviously violates the CAP theorem, as the 'P' (network partitioning) is unavoidable in the real world. In theory, you can achieve only one of high availability or strong consistency in a distributed storage system. I have done a bit of … Continue reading How Azure Storage Cheats Over the CAP Theorem →

dqops – Query Databricks Database Schema through SQL Connector for Python

dqops Data Quality Studio (DQS) is one of the R&D projects I have been working on in my spare time. I plan to note down some of the tips & tricks I use in this project in future blog posts from time to time. Databricks is one of the main data services that the dqops DQS is … Continue reading dqops – Query Databricks Database Schema through SQL Connector for Python →

Setup a Dockerised Spark Development Environment with VS code and Docker

Databricks is not cheap, especially when I need to use it for my personal R&D work (where unfortunately the money has to come out of my own pocket). Therefore, I have been developing in a dockerised Spark environment for a while now, and I have found that this approach works well. Here I list the steps to set … Continue reading Setup a Dockerised Spark Development Environment with VS code and Docker →

Why I Prefer Hand-Coded Transformations over ADF Mapping Data Flow

Firstly, I need to clarify that what I am discussing in this blog post concerns only ADF Mapping Data Flow, not the whole ADF service. I am not going to challenge ADF’s role as the superb orchestration service in the Azure data ecosystem. In fact, I love ADF. At the control flow level, … Continue reading Why I Prefer Hand-Coded Transformations over ADF Mapping Data Flow →

Data Quality Improvement – DQ Dimensions = Confusions

DQ Dimensions are Confusing: Data quality dimensions are great inventions from our data quality thought leaders and experts. Since the concept of quality dimensions was originally proposed in the course of the Total Data Quality Management (TDQM) program of MIT in the 1980s [5], a large number of data quality dimensions have been defined by … Continue reading Data Quality Improvement – DQ Dimensions = Confusions →

Data Quality Improvement – Conditional Functional Dependency (CFD)

To fulfil the promise I made before, I dedicate this blog post to the topic of Conditional Functional Dependency (CFD). The reason I dedicate a whole blog post to this topic is that CFD is one of the most promising constraints for detecting and repairing inconsistencies in a dataset. The use of CFD … Continue reading Data Quality Improvement – Conditional Functional Dependency (CFD) →

Data Quality Improvement – Data Profiling

This is the second post in my Data Quality Improvement blog series. This blog post discusses the data profiling tasks that I think are relevant to data quality improvement use cases. Anyone who has ever worked with data must have already done some sort of data profiling, either using a commercial … Continue reading Data Quality Improvement – Data Profiling →

Data Quality – 80:20 Rule and 1:10:100 Rule

I came across two data quality rules on Martin Doyle's blog today. Martin Doyle is a data quality improvement evangelist and an industry expert on CRM. I found that, to a certain extent, those data quality rules provide some theoretical support for some of my ideas on data quality improvement. 80:20 Rule: The 80:20 … Continue reading Data Quality – 80:20 Rule and 1:10:100 Rule →