Data Quality Improvement – DQ Dimensions = Confusions

DQ Dimensions are Confusing

Data quality dimensions are great inventions from our data quality thought leaders and experts. Since the concept of quality dimensions was first proposed by the Total Data Quality Management (TDQM) program at MIT in the 1980s [5], people from different backgrounds and industries have defined a large number of data quality dimensions. A survey conducted by DAMA NL in 2020 [2] identified 127 data quality dimensions from 9 authoritative sources. The DQ dimension word cloud created by MIOSoft [7] illustrates the scale of the problem.

Figure 1. DQ Dimension Word Cloud from MIOSoft [7]

Having that many dimensions might not be so bad if they were clearly defined and universally agreed upon. Unfortunately, DQ dimensions are not based on concrete concepts and are not fundamental properties of the data [7]. Instead, they are context-dependent: the same DQ dimension can be interpreted differently by different people in different contexts.

The Effort from DAMA UK

In 2013, DAMA UK organised a working group of data quality experts to tackle the inconsistent understandings and interpretations of DQ dimensions and to define a set of core DQ dimensions that could serve as an industry standard accepted by data professionals. The group defined six core DQ dimensions: Completeness, Uniqueness, Consistency, Accuracy, Validity and Timeliness [3].

Figure 2. DAMA UK Data Quality Dimensions

To a certain extent, these six core DQ dimensions have indeed become the industry standard and are well accepted in the data quality community. They are used in mainstream commercial data quality software (although some products tweak one or two dimensions, e.g. replacing Timeliness with Integrity); they are widely referenced in the academic research community; and they serve as the standard dimension definitions in government data quality guidance, e.g. the UK Government Data Quality Framework published in 2020 [4].

Confusion Continues

However, even though the six core DQ dimensions are well known and accepted, confusion remains. For example, the “Accuracy” dimension, arguably the most important data quality dimension [5], is defined as “The degree to which data correctly describes the ‘real world’ object or event being described”. But how can “Accuracy” be measured in practice? In most scenarios, no gold standard dataset is available to represent the ‘real world’ object. In a survey of data quality measurement and monitoring tools conducted in 2019 [5], only one tool, Apache Griffin, was found to support the “Accuracy” dimension, and even it does not strictly follow the definition of “Accuracy”: it compares a target dataset to a source dataset without validating that the source dataset reflects the real-world objects. The same applies to the “Timeliness” dimension. According to DAMA UK’s definition, “Timeliness” is “the degree to which data represent reality from the required point in time”. Again, the time at which the real-world event happened is not available in most real-world scenarios. Instead, the time stored in a database to represent the event is often the time when the event record was created or modified in the database.
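To make the gap concrete, here is a minimal Python sketch of the kind of source-to-target comparison described above. It is not Apache Griffin's API or its exact metric; the function name `match_ratio` and the sample records are assumptions made purely for illustration. Note that the score only says how well the target agrees with the source. It says nothing about whether the source itself reflects the real world.

```python
def match_ratio(source, target, key):
    """Fraction of source records that have an identical record in target.

    Hypothetical helper: a proxy for "Accuracy" that trusts the source
    dataset without checking it against the real world.
    """
    if not source:
        return 1.0  # nothing to compare, so nothing disagrees
    target_by_key = {row[key]: row for row in target}
    matched = sum(1 for row in source if target_by_key.get(row[key]) == row)
    return matched / len(source)


# Made-up sample data: record 3 has a typo in the target, record 4 is missing.
source = [
    {"id": 1, "city": "London"},
    {"id": 2, "city": "Leeds"},
    {"id": 3, "city": "York"},
    {"id": 4, "city": "Bath"},
]
target = [
    {"id": 1, "city": "London"},
    {"id": 2, "city": "Leeds"},
    {"id": 3, "city": "Yrok"},
]

ratio = match_ratio(source, target, "id")
print(ratio)  # 0.5
```

If the source dataset recorded the wrong city for record 1, the ratio would not change, which is exactly why such a measure falls short of the DAMA UK definition.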

Although software vendors frequently reference DQ dimensions on their websites and in their sales materials, a study conducted by Ehrlinger, Rusz and Wöß [5] found that few DQ metrics have actually been implemented to measure the DQ dimensions. As Figure 3 shows, amongst the popular commercial and open-source DQ software, only Apache Griffin implements metrics to measure “Accuracy” (to a certain extent, as mentioned above). No software supports the “Consistency” and “Timeliness” dimensions. There is no widespread agreement on the implementation and definition of DQ dimensions in practice. When Ehrlinger, Rusz and Wöß [5] contacted the vendors for further details of the dimensions and metrics in their software, few vendors provided a satisfactory reply on how the dimensions are measured.

Figure 3. DQ Dimensions Implemented in Commercial and Open-Source DQ software

However, it is perhaps not fair to blame DQ software vendors for the confusion over DQ dimension measurements. A DQ dimension is not a concrete concept but is context-dependent. It may be relatively simple to define dimensions, and metrics to measure them, for a specific business domain, but those dimensions and metrics have little practical relevance for other business domains. Therefore, the question may not be how to create a set of universal dimensions and metrics that general-purpose DQ software can apply to every scenario, but whether we should pursue universal dimensions in the first place.

What Can We Do?

As it is not practical to define universal dimensions and metrics for use in general-purpose DQ software, should we bother to use the concept of dimensions in data quality assessments at all? According to Ehrlinger, Rusz and Wöß [5], several DQ tools have shown that they can measure data quality without referring to dimensions at all. They suggest a practical approach that does not rely on DQ dimensions and instead focuses on measuring core aspects that can be automated, such as missing data and duplicate detection.
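As an illustration of how such core aspects can be automated without naming any dimension, the following Python sketch computes a missing-value rate and a duplicate rate over a small list of records. The function names, the choice to treat `None` and empty strings as missing, and the sample data are all assumptions for this example, not something taken from the survey.

```python
def missing_rate(rows, field):
    """Share of rows where `field` is absent, None, or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(field) in (None, ""))
    return missing / len(rows)


def duplicate_rate(rows, key_fields):
    """Share of rows whose key repeats a key already seen earlier."""
    if not rows:
        return 0.0
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row.get(f) for f in key_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(rows)


# Made-up sample data: two rows lack an email, one row is an exact repeat.
customers = [
    {"name": "Ann", "email": "ann@example.com"},
    {"name": "Bob", "email": ""},
    {"name": "Ann", "email": "ann@example.com"},
    {"name": "Cy", "email": None},
]

missing = missing_rate(customers, "email")
duplicated = duplicate_rate(customers, ["name", "email"])
print(missing, duplicated)  # 0.5 0.25
```

Both checks need only the data itself, with no external reference, which is what makes them easy to automate.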

However, I think the DQ dimension is a great invention! It creates a common language to express and classify the quality of data. Quality itself is an abstract concept that covers many aspects, which makes it difficult for people to communicate about. DQ dimensions provide an efficient way to describe data quality in a comprehensive and organised manner.

The confusion is not caused by the DQ dimensions themselves. The problem is that they are interpreted differently in different contexts, for different business domains, and by people from different backgrounds. Because data quality assessments are context-dependent by nature, it is not realistic to expect a set of dimensions with universal definitions and interpretations, measured by uniform metrics.

Instead of pursuing a global consensus on universal DQ dimensions and using them as the common language for describing data quality in any context, DQ dimensions need to be interpreted according to the intended purposes of the data, and consensus on their meaning only needs to be reached within an internal, domain-specific environment. In other words, DQ dimensions only need to be defined as the common language within the group of people who are relevant to the data, such as its producers, consumers and governors. As long as that group reaches a consensus on the meaning of the DQ dimensions, the dimensions are effective.
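A small Python sketch may help show what a group-level agreement could look like in practice. Everything here is hypothetical: the group names, the thresholds, and the assumption that the real event time is known (which, as noted earlier, is often not the case). The point is only that the same record can pass one group's agreed meaning of “Timeliness” and fail another's.

```python
from datetime import datetime, timedelta

# Hypothetical, group-specific definitions of "Timeliness": each group agrees
# on its own maximum acceptable delay between an event and its record.
TIMELINESS_RULES = {
    "fraud_monitoring": timedelta(minutes=5),
    "monthly_reporting": timedelta(days=2),
}


def is_timely(event_time, recorded_time, group):
    """Check a record against the delay limit agreed within `group`."""
    return recorded_time - event_time <= TIMELINESS_RULES[group]


# Made-up example: the event was recorded six hours after it happened.
event_time = datetime(2021, 3, 1, 9, 0)
recorded_time = datetime(2021, 3, 1, 15, 0)

timely_for_fraud = is_timely(event_time, recorded_time, "fraud_monitoring")
timely_for_reporting = is_timely(event_time, recorded_time, "monthly_reporting")
print(timely_for_fraud, timely_for_reporting)  # False True
```

Neither answer is wrong. Each is correct within the group that agreed on the definition.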

References

[1] C. Batini, M. Scannapieco, Data and Information Quality: Concepts, Methodologies and Techniques. Springer International Publishing, Switzerland, 2016.

[2] A. Black, P. van Nederpelt, Dimensions of Data Quality (DDQ) Research Paper, DAMA NL Foundation, 3 September 2020.

[3] DAMA UK, The Six Primary Dimensions for Data Quality Assessment, 2013.

[4] I. Diamond, A. Chisholm, The Government Data Quality Framework, 2020, https://www.gov.uk/government/publications/the-government-data-quality-framework/the-government-data-quality-framework

[5] L. Ehrlinger, E. Rusz, W. Wöß, A Survey of Data Quality Measurement and Monitoring Tools, arXiv:1907.08138, 2019.

[6] B. Heinrich, M. Kaiser, and M. Klier, How to Measure Data Quality? A Metric-based Approach. In S. Rivard and J. Webster, editors, Proceedings of the 28th International Conference on Information Systems (ICIS), pages 1–15, Montreal, Canada, 2007. Association for Information Systems 2007.

[7] MIOSoft, Data Quality Dimensions Untangled, https://miosoft.com/resources/articles/data-quality-dimensions-untangled.html

[8] L. Sebastian-Coleman, Measuring Data Quality for Ongoing Improvement: A Data Quality Assessment Framework. Elsevier, Waltham, MA, USA, 2012.
