I came across two data quality rules on Martin Doyle’s blog today. Martin Doyle is a data quality improvement evangelist and an industry expert on CRM. I found that, to a certain extent, these rules provide some theoretical support for some of my ideas about data quality improvement.
Martin Doyle introduced the 80:20 rule to evaluate the costs of the Type I and Type II data quality problems defined by Jim Barker.
In brief, Type I data quality problems are “syntax” problems that require “know what” to identify, such as problems that fall under the completeness, consistency, uniqueness and validity data quality dimensions. Type I data quality problems can be easily detected, and even solved, by data quality software.
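To make the Type I category concrete, here is a minimal sketch of rule-based checks for three of those dimensions (completeness, uniqueness and validity). The records, field names and the email pattern are my own illustrative assumptions, not taken from Martin Doyle's or Jim Barker's material, and real data quality software does far more than this.

```python
import re
from collections import Counter

# Hypothetical customer records; the field names are illustrative assumptions.
RECORDS = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "", "email": "bob@example.com"},
    {"id": 2, "name": "Carol", "email": "carol@example.com"},
    {"id": 3, "name": "Dave", "email": "dave-at-example.com"},
]

REQUIRED_FIELDS = ("id", "name", "email")
# Deliberately simple pattern, not a full email validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def find_type1_issues(records):
    """Return a list of (row_index, dimension, field) tuples."""
    issues = []
    id_counts = Counter(rec.get("id") for rec in records)
    for i, rec in enumerate(records):
        # Completeness: required fields must be present and non-empty.
        for field in REQUIRED_FIELDS:
            if rec.get(field) in (None, ""):
                issues.append((i, "completeness", field))
        # Uniqueness: an id should appear only once in the dataset.
        if id_counts[rec.get("id")] > 1:
            issues.append((i, "uniqueness", "id"))
        # Validity: a non-empty email must match the expected shape.
        email = rec.get("email")
        if email and not EMAIL_PATTERN.match(email):
            issues.append((i, "validity", "email"))
    return issues


for issue in find_type1_issues(RECORDS):
    print(issue)
```

Each check only needs to know what a good value looks like, which is why this kind of problem is easy to automate.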
Type II data quality problems are “semantic” problems that require “know how”, that is, domain knowledge and experience, to detect and solve. This type of problem is more enigmatic. The data looks fine on the surface and serves well most of the time. However, the damage these problems cause to businesses can be much larger.
Martin Doyle’s 80:20 rule states that around 80% of data quality problems are Type I problems and only 20% are Type II problems. However, the Type II problems take 80% of the effort and budget to solve, while the Type I problems only take 20%.
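A quick back-of-the-envelope calculation shows what these proportions imply per problem. This is only a sketch: it assumes the percentages hold exactly, and the function name is my own.

```python
def relative_cost_per_problem(problem_pct, budget_pct):
    """Budget share divided by problem share, i.e. the relative cost of one problem."""
    return budget_pct / problem_pct


# Figures from the 80:20 rule as stated above.
type1_cost = relative_cost_per_problem(problem_pct=80, budget_pct=20)
type2_cost = relative_cost_per_problem(problem_pct=20, budget_pct=80)

print(f"Type I cost per problem:  {type1_cost:.2f}")
print(f"Type II cost per problem: {type2_cost:.2f}")
print(f"Type II / Type I ratio:   {type2_cost / type1_cost:.0f}x")
```

Under those assumptions, a single Type II problem costs about sixteen times as much to solve as a single Type I problem.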
The 1:10:100 rule was initially developed by George Labovitz and Yu Sang Chang in 1992 and highlights the importance of early prevention in quality control. In brief, the 1:10:100 rule states that the cost of quality control increases exponentially over time, roughly tenfold at each stage:
- $1: prevention cost – verifying and correcting data at the start. This is the least expensive way to control data quality.
- $10: correction cost – identifying and cleaning data. Businesses need to set up a team to validate and correct data errors.
- $100: failure cost – costs of failure caused by bad data.
This rule implies that the earlier you take care of your data, the lower the price you pay for the damage caused by bad data.
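As a rough illustration of the rule, the sketch below compares two made-up scenarios for the same 1,000 data errors. The unit costs come from the 1:10:100 ratio; the error counts and scenario names are my own assumptions, and the results are relative units rather than real dollar figures.

```python
# Relative cost per error at each stage, taken from the 1:10:100 rule.
STAGE_COST = {"prevention": 1, "correction": 10, "failure": 100}


def total_cost(errors_by_stage):
    """Sum the relative cost of errors, given the stage at which each is handled."""
    return sum(STAGE_COST[stage] * count for stage, count in errors_by_stage.items())


# Hypothetical scenarios for the same 1,000 errors.
all_prevented = {"prevention": 1000, "correction": 0, "failure": 0}
mostly_prevented = {"prevention": 700, "correction": 200, "failure": 100}

print(total_cost(all_prevented))     # 1000
print(total_cost(mostly_prevented))  # 12700
```

Even when only 10% of the errors slip through to the failure stage, they dominate the total cost in this example.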
- While data quality software is capable of automating the tasks for solving Type I data quality problems, solving Type II data quality problems requires domain experts to invest time in manual investigation and research. It is not surprising that the cost of solving Type II problems is much higher than the cost of solving Type I problems. The key question that interests me most is how to improve the effectiveness and efficiency of solving Type II problems. Not only is the current manual approach expensive and time-consuming, but it is also highly dependent on the domain experts’ knowledge and experience. The devil is in the detail. If organisations lose those experts, it would be difficult for their successors to understand and solve those hidden issues, even with comprehensive handover documents, as the knowledge and experience cannot be passed on (at least for constructivists).
- So, can we find an alternative approach to solving the Type II problems that does not depend on the domain experts’ knowledge and experience? Can we automatically identify the patterns and relations that exist in a dataset, using the computational power of modern computers to exhaustively scan through all the possibilities? I don’t know the answer yet, but I am interested in exploring this. I am in the process of building a prototype that experiments with some machine learning and data mining techniques for identifying Type II data quality problems.
- The 1:10:100 rule highlights the importance of preventing data quality issues at an early stage. However, the question is how to convince organisations to invest in data quality management when everything works fine on the surface. There are clear business drivers for error correction and failure resolution tasks, as business activities will be immediately affected if no action is taken. However, it is more difficult to evaluate and quantify the necessity and urgency of issue prevention tasks. As Arkady Maydanchik, the author of the book “Data Quality Assessment”, mentioned, people do not like those who predict rain on a sunny day. One solution is to improve organisations’ awareness of their data quality, and to let them know that their data is not perfect and has a high potential to run into problems if they do nothing to prevent them.