Data Quality Improvement – Rule-Based Data Quality Assessment

As discussed in the previous posts of my Data Quality Improvement series, the key to successful data quality management is continuous awareness of, and insight into, how fit your data is for its business use. Data quality assessment is the core and possibly the most challenging activity in the data quality management process. This blog post discusses the requirements and core elements of rule-based data quality assessments.

Requirements of Data Quality Assessments

What do we want from data quality assessments? In short, we want the data quality assessments to tell us the level of fitness of our data for supporting our business requirements. This statement implies two core requirements of data quality assessments:

  • Data quality assessments need to evaluate the quality of data in a business context. Quality is an abstract and multifaceted concept; the criteria for judging data quality are based on what the business requires of good data.
  • Data quality assessments need to be able to evaluate the level of data fitness. As mentioned above, continuous monitoring is the key to successful data quality management. We need a way to understand the trend and degree to which data quality is evolving.
Rule-Based Data Quality Assessments

Given those two requirements of data quality assessments, let's think about what we need in order to fulfil them.

Firstly, to evaluate the fitness of a set of data for a business requirement, we need to know the criteria the business uses to decide whether or not the data is fit to use. A simple and good representation of these business-oriented criteria is a set of constraint-based rules: fitness can be decided by checking whether or not the constraints are complied with. Constraint-based rules not only simplify the representation and organisation of the facets of data quality but also act as an effective common language between the personas involved in the data quality management process.

Secondly, we need an effective representation of the level of data fitness. This representation needs to be able to describe the degree to which data quality changes and also the distance to the different data quality statuses (e.g., good, ok, bad, very bad) expected by the business. Nothing is better for this than a numerical value, normally referred to as a DQ metric. DQ metrics are widely used and studied in both commercial and research communities, and they are often interpreted from different angles. Below is my definition of DQ metrics:

  • It represents the quality of a specific attribute of a set of data
  • It is a numerical value within a fixed range (e.g. 0..1 or 0..100)
  • It can represent data quality at different levels (e.g. column, dataset, database)

As mentioned before, I personally consider data quality assessment to be the most challenging activity in the data quality management process. At the same time, I think defining DQ metrics is the most challenging activity in the data quality assessment process. To ensure the DQ metrics you define accurately represent the data quality, not only do you need to find a good formula for calculating the metrics, you also need to take all sorts of business factors into consideration, such as business impacts, urgencies, and criticalities.

Figure 1. Pre-defined DQ Rules in dqops DQ studio
Elements of Data Quality Rules

In the rest of this blog post, I am going to explore the core elements of data quality rules and how they can be defined to support data quality assessments. The dqops DQ studio (the app I have been building for one of my personal R&D initiatives) will be used to provide examples for discussing those elements of DQ rules.

A data quality rule needs to contain the following five core elements:

  • Business Constraints
  • Metric Formula
  • Alert Thresholds
  • DQ Dimension
  • Metric Aggregation Algorithm

Business constraints specify the conditions a set of data has to meet to qualify as fit for use in a business scenario. Let's take the Regular Expression constraint rule in the dqops DQ Studio as an example (as shown in the snapshot below). This rule evaluates whether a column value in a dataset complies with a format specified by a regular expression. For example, a column that stores French postcodes is fit to use only when its values comply with the specified French postcode format.

Figure 2. Regular Expressions Constraint Rule in dqops DQ Studio

DQ metrics are the numerical representations of an aspect of the quality of a dataset. The Metric formula defines the calculation to determine the number. In the Regular Expression constraint rule example, the metric could be defined as the count of column values that comply with the regex divided by the count of rows in the dataset.
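To make the formula concrete, below is a minimal sketch (not the actual dqops implementation) of how such a regex-compliance metric could be computed with pandas; the column name, sample values and the five-digit postcode pattern are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample column of French postcodes (five digits expected).
df = pd.DataFrame({"postcode": ["75001", "13002", "7500", None, "69 007"]})

def regex_compliance_metric(series: pd.Series, pattern: str) -> float:
    """Share of rows whose value fully matches the pattern (nulls count as non-compliant)."""
    total = len(series)
    if total == 0:
        return 1.0  # an empty column is trivially compliant
    compliant = series.dropna().astype(str).str.fullmatch(pattern).sum()
    return compliant / total

metric = regex_compliance_metric(df["postcode"], r"\d{5}")
print(f"Format validity metric: {metric:.0%}")  # 2 of 5 values comply -> 40%
```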

In the example above, the result calculated from the metric formula is a number between 0 and 1, or between 0% and 100%. However, a number alone cannot tell whether the set of data is fit or not in the given business context. The Alert Threshold is the tool that makes the cut. In the Regular Expression constraint rule example, as shown in the snapshot above, a warning threshold could be set at 90% and a critical threshold at 70%. That means the quality of the dataset (for the format-validity aspect defined by the regular expression) is 'Good' when the metric is above 90%, 'OK' when it is between 70% and 90%, and not fit when it is below 70%. Alert thresholds can be defined manually based on the business/DQ analysts' experience, or automatically based on the historical distribution of the observed values.
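A small sketch of how such thresholds could turn the metric into a verdict; the function and status labels are illustrative, with the 90%/70% values mirroring the example above.

```python
def classify(metric: float, warning: float = 0.9, critical: float = 0.7) -> str:
    """Map a DQ metric onto a status using warning/critical alert thresholds."""
    if metric >= warning:
        return "Good"
    if metric >= critical:
        return "OK"
    return "Not fit"

print(classify(0.95))  # Good
print(classify(0.82))  # OK
print(classify(0.40))  # Not fit
```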

Figure 3. DQ Monitoring Panel in dqops DQ studio

Another element of a DQ rule is the Dimension that the rule evaluates. As I mentioned in my previous blog post, I have rather mixed feelings about DQ dimensions. On one hand, DQ dimensions are context-dependent and can be interpreted differently in different contexts by different people. On the other hand, DQ dimensions create a common language for expressing and classifying the quality of data, when quality itself is an abstract concept that can represent many aspects. To resolve this dilemma, as I suggested before, instead of pursuing a global consensus on universal DQ dimensions applied indifferently in any context, the DQ dimensions need to be defined and interpreted based on the expected purposes of the data, and the consensus on the DQ dimension meanings needs to be reached in an internal, domain-specific environment. In the dqops DQ Studio, the dimension of a DQ rule is not pre-defined; instead, business/DQ analysts are allowed to specify the dimension for a DQ rule based on their interpretation of that dimension in their organisation.

Last but not least, a DQ rule includes a Metric Aggregation Algorithm. Data quality needs to be assessed and analysed at different granularity levels, such as column level, dataset level, database level, and domain level, up to organisation level. To represent data quality using metrics at these different levels, the lower-level metrics need to be aggregated into higher-level ones. The aggregation algorithm needs to take business impacts into account as weights, such as the severity and priority levels of DQ issues.
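As an illustration only (dqops may aggregate differently), a weighted average is one simple way to roll column-level metrics up to a dataset-level metric, with the weights standing in for business impacts such as severity and priority; all names and numbers below are hypothetical.

```python
def aggregate_metrics(metrics: dict, weights: dict, default_weight: float = 1.0) -> float:
    """Weighted average of lower-level DQ metrics; weights encode business impact."""
    total_weight = sum(weights.get(name, default_weight) for name in metrics)
    weighted_sum = sum(value * weights.get(name, default_weight) for name, value in metrics.items())
    return weighted_sum / total_weight

column_metrics = {"postcode": 0.92, "email": 0.75, "order_total": 0.99}
impact_weights = {"order_total": 3.0, "email": 2.0}  # hypothetical severity/priority weights
print(f"Dataset-level metric: {aggregate_metrics(column_metrics, impact_weights):.2f}")
```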

Figure 4. DQ Dashboard in dqops DQ Studio

What is Data Management, actually? – DAMA-DMBOK Framework

“What is data management?”. I guess many people will (at least I think I will) answer “em… data management is managing data, right?” at the same time swearing in their heads that “what a stupid question!”.

However, if I were asked this question in a job interview, I guess I'd better provide a somewhat longer answer, such as the one given by DAMA cited below, if I could ever memorise it.

Data Management is the development, execution, and supervision of plans, policies, programs, and practices that deliver, control, protect, and enhance the value of data and information assets throughout their lifecycles.

DAMA-DMBOK

If the interviewers asked me to elaborate in further detail, it could be a challenge as there are so many facets and aspects of Data Management. Many interdependent functions with their own goals, activities, and responsibilities are required in data management. For a data management professional, it is difficult to keep track of all those components and activities involved in data management.

Fortunately, DAMA developed the DMBOK framework, which organises the data management knowledge areas in a structured form and enables data professionals to understand data management comprehensively.

I have recently been diving into the heavy reading that is the DAMA DMBOK book ('heavy' in the literal sense, too: the book weighs 1.65kg!). I recommend all data professionals give this book a read. Not only does it connect the dots in your knowledge network to give you a comprehensive understanding of data management, it also provides a common language that enables you to communicate in the data world (instead of just nodding and smiling in a meeting when hearing some data jargon).

DAMA DMBOK framework defines 11 functional areas of data management.

As you can see from the DAMA Wheel above, Data Governance is at the heart of all the data management functional areas. Data governance provides direction and oversight for data management to ensure the secure, effective, and efficient use of data within the organisation. The other functional areas include:

  • Data Architecture – defines the structure of an organisation's logical and physical data assets and data management processes through the data lifecycle
  • Data Modelling and Design – the process of discovering, analysing, representing and communicating data requirements in the form of data models
  • Data Storage and Operations – the process of designing, implementing, and supporting data storage
  • Data Security – ensuring data is accessed and used properly, with data privacy and confidentiality maintained
  • Data Integration and Interoperability – the process of designing and implementing data movement and consolidation within and between data sources
  • Document and Content Management – the process of managing data stored in unstructured media
  • Reference and Master Data – the process of maintaining the core critical shared data within the organisation
  • Data Warehousing and Business Intelligence – the process of planning, implementing and controlling the processes for managing decision-support data
  • Metadata – managing the information about the data in the organisation
  • Data Quality – the process of ensuring data is fit for use

Based on the functional areas defined by the DAMA Wheel, Peter Aiken developed the DMBOK pyramid that defines the relation between those functional areas.

From the DMBOK pyramid, we can see that the top of the pyramid is the golden function that adds the most value for the business. However, the pyramid also reveals that data analytics is just a small part of an organisation's data system. To make data analytics workable, many other functions need to work and collaborate seamlessly to build the foundation.

As the DMBOK pyramid shows, data governance sits at the bottom and forms the ground foundation for the whole data system. Data architecture, data quality and metadata make up another layer of logical foundation on top of data governance. The next layer up includes data security, data modelling & design, and data storage & operations. Building on that layer, the data integration & interoperability function moves and consolidates data to enable the functions in the upper layer: data warehousing/business intelligence, reference & master data, and documents & contents. From this layer upwards, the functions start to be business-facing, which also means the business cannot see the functions that need to run underneath.

The practical contribution of the DMBOK pyramid is that it reveals the logical progression of steps for constructing a data system. The DMBOK pyramid defines four phases for an organisation's data management journey:

Phase 1 (Blue layer) – An organisation's data management journey starts with purchasing applications that include database capabilities. That means data modelling & design, data storage, and data security are the functions to be put in place first.

Phase 2 (Orange layer) – The organisation starts to feel the pains from bad data quality. To improve data quality, reliable metadata and consistent data architecture are essential.

Phase 3 (Green layer) – Along with the development of data quality, metadata and data architecture, the need for structural support for data management activities starts to appear, which reveals the importance of a proper data governance practice. At the same time, data governance enables the execution of strategic initiatives, such as data warehousing, document management, and master data management.

Phase 4 (Red layer) – Well-managed data enables advanced analytics.

The four phases of the DMBOK pyramid seem to make sense. However, from my own experience, organisations often take different routes, mostly rushing into high-profile initiatives such as data warehousing, BI, and machine learning before any data governance, data quality, or metadata management is in place.

Data Quality Improvement – DQ Dimensions = Confusions

DQ Dimensions are Confusing

Data quality dimensions are great inventions from our data quality thought leaders and experts. Since the concept of quality dimensions was originally proposed in the course of the Total Data Quality Management (TDQM) program of MIT in the 1980s [5], a large number of data quality dimensions have been defined by people from different backgrounds and industries. A survey conducted by DAMA NL in 2020 identifies 127 data quality dimensions from 9 authoritative sources. The DQ dimension word cloud created by MIOSoft [7] has perfectly demonstrated the scale of DQ dimensions.

Figure 1. DQ Dimension Word Cloud from MIOSoft [7]

It might not be that bad to have that many dimensions if they were clearly defined and universally agreed upon. Unfortunately, DQ dimensions are not based on concrete concepts and are not a fundamental property of the data [7]. Instead, they are context-dependent. The same DQ dimension can be interpreted differently in different contexts by different people.

The Effort from DAMA UK

In 2013, DAMA UK organised a working group of data quality experts aiming to tackle the inconsistent understandings and interpretations of DQ dimensions and to define a set of core DQ dimensions that would become an industry standard and be well accepted by data professionals. The group defined six core DQ dimensions: Completeness, Uniqueness, Consistency, Accuracy, Validity and Timeliness.

Figure 2. DAMA UK Data Quality Dimensions

To a certain extent, these six core DQ dimensions have indeed become the industry standard and are well accepted in the data quality community: they have been used in mainstream commercial data quality software (although some products tweaked one or two dimensions, e.g. replacing timeliness with integrity); they have been widely referenced in the academic research community; and they have been used as the standard dimension definitions in government data quality guides, e.g. the UK Government Data Quality Framework published in 2020.

Confusion Continues

However, despite the six core DQ dimensions being well known and accepted, confusion still exists. For example, the "Accuracy" dimension, arguably the most important data quality dimension [5], is defined as "the degree to which data correctly describes the 'real world' object or event being described". But how do we measure "Accuracy" in practice? The reality is that a gold-standard dataset that can serve as the reference for the 'real world' object is not available in most scenarios. In a survey of data quality measurement and monitoring tools conducted in 2019 [5], only one tool, Apache Griffin, was found to support the "Accuracy" dimension, even though it does not strictly follow the definition of "Accuracy" (it compares a target dataset to a source dataset without validating that the source dataset reflects the real-world objects). The same situation applies to the "Timeliness" dimension. According to DAMA UK's definition, "Timeliness" is "the degree to which data represent reality from the required point in time". Again, a reference to the time the real-world event happened is not available in most real-world scenarios. Instead, the time available in a database to represent the event is often the time when the event record was created or modified in the database.

Although DQ dimensions are frequently referenced by data quality software vendors in their websites and sales materials, a study conducted by Ehrlinger, Rusz and Wöß [5] found that few DQ metrics have actually been implemented to measure the DQ dimensions. As Figure 3 shows, amongst the popular commercial and open-source DQ software, only Apache Griffin implements a metric to measure "Accuracy" (to a certain extent, as mentioned above). No software supports the "Consistency" and "Timeliness" dimensions. There is no widespread agreement on the definition and implementation of DQ dimensions in practice. When Ehrlinger, Rusz and Wöß [5] contacted the vendors for further details of the dimensions and metrics in their software, few vendors provided a satisfying reply on how the dimensions are measured.

Figure 3. DQ Dimensions Implemented in Commercial and Open-Source DQ software

However, it is probably not fair to blame DQ software vendors for the confusion around DQ dimension measurements. A DQ dimension is not a concrete concept; it is context-dependent. It may be relatively simple to define dimensions, and metrics to measure them, for a specific business domain. However, those dimensions and metrics have little practical relevance for other business domains. Therefore, the question may not be how to create a set of universal dimensions and metrics that a general-purpose DQ software can use to fit all scenarios, but rather whether we should pursue universal dimensions in the first place.

What Can We Do?

Given that it is not practical to define universal dimensions and metrics and to use them in general-purpose DQ software, should we bother with the concept of dimensions in data quality assessments at all? According to Ehrlinger, Rusz and Wöß [5], several DQ tools have shown the capability to measure data quality without referring to dimensions at all, and the authors suggest a practical approach that drops DQ dimensions and instead focuses on measurements of core aspects, such as missing data and duplicate detection, that can be automated.

However, I think the DQ dimension is a great invention! It creates a common language for expressing and classifying the quality of data. Quality itself is an abstract concept that represents many aspects, which makes it difficult for people to communicate about it. DQ dimensions provide an efficient way to give comprehensive and organised descriptions of data quality.

The confusion is not caused by the DQ dimension concept itself; the problem is that DQ dimensions are interpreted differently in different contexts, for different business domains, and by people from different backgrounds. Due to the context-dependent nature of data quality assessments, it is not realistic to have a set of dimensions with universal definitions and interpretations and uniform metrics to measure them.

Instead of pursuing a global consensus on universal DQ dimensions and using them globally as the common language for describing data quality in any context, the DQ dimensions need to be interpreted based on the expected purposes of the data, and the consensus on the DQ dimension meanings only needs to be reached in an internal, domain-specific environment. In other words, the DQ dimensions only need to be defined as the common language within the group of people who are relevant to the data, such as its producers, consumers and governors. As long as the consensus on the meaning of the DQ dimensions is reached within that group, the DQ dimensions are effective.

References

[1] C. Batini, M. Scannapieco, Data and Information Quality: Concepts, Methodologies and Techniques. Springer International Publishing, Switzerland, 2016.

[2] A. Black, P. van Nederpelt, Dimensions of Data Quality (DDQ) Research Paper, DAMA NL Foundation, 3 September 2020.

[3] DAMA UK, The Six Primary Dimensions for Data Quality Assessment, 2013

[4] I. Diamond, A. Chisholm, The Government Data Quality Framework, https://www.gov.uk/government/publications/the-government-data-quality-framework/the-government-data-quality-framework, 2020

[5] L. Ehrlinger, E. Rusz, & W. Wöß, A Survey of Data Quality Measurement and Monitoring Tools, 2019, ArXiv, abs/1907.08138.

[6] B. Heinrich, M. Kaiser, and M. Klier, How to Measure Data Quality? A Metric-based Approach. In S. Rivard and J. Webster, editors, Proceedings of the 28th International Conference on Information Systems (ICIS), pages 1–15, Montreal, Canada, 2007. Association for Information Systems 2007.

[7] MIOSoft, Data Quality Dimensions Untangled, https://miosoft.com/resources/articles/data-quality-dimensions-untangled.html

[8] L. Sebastian-Coleman, Measuring Data Quality for Ongoing Improvement: A Data Quality Assessment
Framework. Elsevier, Waltham, MA, USA, 2012.

Data Quality Improvement – Conditional Functional Dependency (CFD)

To fulfil the promise I made before, I dedicate this blog post to the topic of Conditional Functional Dependency (CFD). The reason I dedicate a whole blog post to this topic is that CFD is one of the most promising types of constraint for detecting and repairing inconsistencies in a dataset. The use of CFDs allows context-dependent quality rules to be identified automatically, which makes it a promising tool for automatic data quality rule discovery. CFD was originally developed by Professor Wenfei Fan from the University of Edinburgh. I have listed some of his papers regarding CFD [1][4][5] in the reference section at the end of this post.

Conditional Functional Dependency (CFD)

CFD is an extension of traditional functional dependencies (FD). An FD is a class of constraint where the values of a set of columns X determine the values of another set of columns Y, represented as X → Y. FDs, which were developed mainly for schema design [1], are required to hold on the entire relation, which makes them less useful for detecting data inconsistencies in real-world datasets. CFDs extend FDs by incorporating bindings of semantically related values, which makes them capable of expressing the semantics of data fundamental to data cleaning [1].

Here I will borrow the classic 'cust' relation example from [1], which I find is possibly the simplest way to explain the CFD concept. Figure 1 shows an instance r0 of the 'cust' relation that specifies a customer in terms of the customer's phone (country code (CC), area code (AC), phone number (PN)), name (NM), and address (street (STR), city (CT), zip code (ZIP)).

Figure 1. An instance of the cust relation
Traditional Functional Dependency (FD)

From the customer records shown in Figure 1, we can see that all the customers with the same country code (CC) and area code (AC) are also located in the same city (CT). For example, all customers with CC '01' and AC '908' are located in the CT 'MH' (t1, t2, and t4). Here we have the functional dependency f1: [CC, AC] → [CT], i.e. CC and AC determine the value of CT. As this dependency applies to the entire relation, it is a traditional functional dependency. In addition, we can find another traditional functional dependency in the relation, f2: [CC, AC, PN] → [STR, CT, ZIP], which represents the dependency that the phone country code, area code and number determine the street name, city, and zip code.

Conditional Functional Dependency (CFD)

Let's take a look at another example, f: [ZIP] → [STR]. The customers (t5 and t6) with the same ZIP code 'EH4 1DT' do have the same STR value 'High St.'. Now let's check the dependency on the entire relation for all customers. We can see that the dependency applies to t1 and t2, which have ZIP '07974' and STR 'Tree Ave.'. However, it does not apply to t4, which has ZIP '07974' but STR 'Elm Str.'. In this case, the ZIP code '07974' cannot determine a unique STR value. Therefore, the constraint [ZIP] → [STR] does not hold on the entire relation, so it is not a functional dependency. However, the constraint [ZIP] → [STR] does hold for the customers with country code (CC) '44', i.e. UK addresses. In other words, the postcode can determine the street in the UK address system but not in the US address system, and the constraint [ZIP] → [STR] only holds when the country code is 44. This type of constraint is a conditional functional dependency (CFD), which can be notated as ([CC, ZIP] → STR, (44, _ || _ )), where [CC, ZIP] → STR refers to the functional dependency and (44, _ || _ ) specifies the condition, i.e. when CC is 44, ZIP uniquely determines STR. The notation of a CFD can be generalised as (X → A, tp), where X → A is an FD and tp is a tuple pattern over the attributes in X and A that defines the subset of tuples on which the FD holds.
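To show how such a constraint can be checked in practice, here is a minimal sketch (my own illustration, not code from [1]) that finds tuples violating a CFD in a pandas DataFrame; the data is a simplified slice of the cust relation and the helper function is hypothetical.

```python
import pandas as pd

# Simplified slice of the cust relation (values mirror the Figure 1 example).
cust = pd.DataFrame({
    "CC":  ["01", "01", "01", "44", "44"],
    "ZIP": ["07974", "07974", "07974", "EH4 1DT", "EH4 1DT"],
    "STR": ["Tree Ave.", "Tree Ave.", "Elm Str.", "High St.", "High St."],
})

def cfd_violations(df: pd.DataFrame, lhs: list, rhs: str, condition: dict) -> pd.DataFrame:
    """Return rows violating (lhs -> rhs, condition): rows in the conditioned subset
    whose identical LHS values map to more than one RHS value."""
    subset = df
    for col, value in condition.items():       # apply the constant part of the tuple pattern
        subset = subset[subset[col] == value]
    rhs_count = subset.groupby(lhs)[rhs].transform("nunique")
    return subset[rhs_count > 1]                # LHS groups that split across RHS values

# ([CC, ZIP] -> STR, (44, _ || _)) holds on the UK rows, so no violations are returned.
print(cfd_violations(cust, lhs=["CC", "ZIP"], rhs="STR", condition={"CC": "44"}))
# The plain FD [ZIP] -> [STR] fails on the rows with ZIP '07974'.
print(cfd_violations(cust, lhs=["ZIP"], rhs="STR", condition={}))
```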

Constant Conditional Functional Dependency (CCFD)

As mentioned above, a CFD can be expressed as (X → A, tp), where tp is a tuple pattern such as the example mentioned earlier, (44, _ || _ ), in which the symbol '||' separates the left-hand side (LHS) from the right-hand side (RHS) of the dependency and the symbol '_' represents any possible value, i.e. '_' is a variable in programming terms. A special case of CFD is one where all the attributes defining the tuple pattern are constants. For example, for the CFD ([CC, AC] → CT, (01, 908 || MH)), all the attributes defining the tuple pattern are constants: CC=01, AC=908, CT=MH. In plain English, a customer's phone country code 01 and phone area code 908 determine that the city where this customer is located is MH. Compared to normal CFDs, CCFDs attract more attention from the data quality community, as a CCFD expresses the semantics at the most detailed level and is easy to use for checking data consistency.

CFD Discovery

Compared to FDs, CFDs are more effective at detecting and repairing inconsistencies in data [2][3]. CFD has demonstrated its potential for commercial adoption in the real world. However, to use it commercially, effective and efficient methods for discovering the CFDs in a dataset are required. Even for a relatively small dataset, the workload of manually identifying and validating all the CFDs is overwhelming. Therefore, the CFD discovery process has to be automated. A number of CFD discovery algorithms [2][5][6] have been developed since the concept of CFD was introduced.

In this section, I will introduce the CFD discovery algorithm developed by Chiang & Miller [2]. This method has been shown by Chiang & Miller [2] to be effective at capturing semantic data quality rules that enforce a broad range of domain constraints. In addition, as redundant candidates are pruned as early as possible, both the search space for the set of minimal CFDs and the computation time for discovering them are reduced.

Chiang & Miller used 12 double-column pages in their paper [2] to elaborate their algorithm. That paper is informative and worth reading in detail if you are interested in data quality rule discovery. In the rest of this blog post, I will try my best to use simple language to explain how this algorithm works.

Firstly, let's assume we have a relation with four attributes, A, B, C, D, and we want to find the CFDs in the relation. The result set of CFDs needs to be minimal, with redundant CFDs disregarded. In addition, we want the algorithm to take as little computation time as possible. From the four attributes A, B, C, D, the search space for the CFDs can be illustrated as an attribute search lattice (Figure 2). Each node in the lattice represents an attribute set; for example, 'AC' represents the attribute set containing attribute 'A' and attribute 'C'. A single-direction edge between two nodes represents a candidate CFD. For example, the edge (AC, ABC) represents ([A, C] → B, (x, x || x)), whose pattern tuples are unknown and need to be mined by the algorithm.

Figure 2: Attribute search lattice

The candidate CFDs are evaluated by traversing the lattice using a breadth-first search (BFS), i.e. starting at the root and traversing all nodes at the present depth level before moving to the next depth level. In our example, we first traverse the first level (k = 1), which consists of the single-attribute sets (i.e. [A], [B], [C], [D]), followed by the second level (k = 2), which consists of the 2-attribute sets (i.e. [A, B], [A, C], [A, D], [B, C], [B, D], [C, D]), and so on, until level k = total levels - 1 or until all CFDs in a minimal set have been found.

To optimise efficiency, not all nodes in the lattice are visited. Instead, the algorithm prunes nodes that are supersets of already discovered rules and nodes that do not satisfy the support threshold. In this way, the search space is reduced by skipping the evaluation of descendant nodes. For example, when the candidate ([A, C] → B, (A='x', _ || _)) is found to be a CFD, there is no point in evaluating the candidate ([A, C, D] → B, (A='x', _, _ || _)), as the CFD ([A, C] → B, (A='x', _ || _)) already covers its semantics.

To evaluate whether a candidate in the lattice is a valid CFD, we can identify the left-hand side (LHS) values of the candidate and check whether identical LHS values always map to the same right-hand side (RHS) value. To model this in programming terms, we group the tuples by their LHS attribute values and by their RHS attribute values; for a valid CFD, the tuples in an LHS group do not split into two or more RHS groups. Generally speaking, the more tuples fall into a group that represents a valid CFD, the more preferable that CFD is. A threshold can be imposed on the number of tuples a CFD needs to hold on in order to qualify as a valid CFD.
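Below is a rough sketch of this validity test, my own simplification rather than Chiang & Miller's actual implementation [2]: it evaluates a candidate by grouping on the LHS attributes and keeping the LHS value combinations whose tuples do not split across RHS values, with a minimum support threshold as an assumed parameter.

```python
import pandas as pd

def conditional_patterns(df: pd.DataFrame, lhs: list, rhs: str, min_support: int = 2) -> list:
    """Return the LHS value combinations under which lhs -> rhs holds on at least
    `min_support` tuples, i.e. candidate conditions for a CFD."""
    patterns = []
    for lhs_values, group in df.groupby(lhs):
        # The candidate holds on this group if its tuples do not split across RHS values.
        if group[rhs].nunique() == 1 and len(group) >= min_support:
            patterns.append(lhs_values)
    return patterns

# Hypothetical relation with the attributes A, B, C, D used in the lattice example.
r = pd.DataFrame({
    "A": ["x", "x", "x", "y", "y"],
    "B": [1, 1, 1, 2, 3],
    "C": ["c1", "c1", "c1", "c2", "c2"],
    "D": ["d1", "d2", "d1", "d1", "d2"],
})

# Evaluate the edge (AC, ABC), i.e. the candidate [A, C] -> B, at level k = 2.
print(conditional_patterns(r, lhs=["A", "C"], rhs="B"))  # [('x', 'c1')] is a valid pattern
```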

Tuple groups that fail the CFD validity test can still be refined by considering additional variable and conditional attributes.

Figure 3: A portion of the attribute search lattice

For example, as Figure 3 shows, assume that no CFD can be materialised for the candidate [A, C] → B of edge (AC, ABC). We can generate a new candidate in the next level of the lattice by considering an additional attribute such as D, so that we have the new candidate ([A, C, D] → B), shown as the edge (X', Y') in Figure 3. If the new candidate still cannot be materialised into a valid CFD after adding a variable attribute, we can condition on an additional attribute, for example conditioning on attribute D on top of an existing conditional attribute such as A. In Chiang & Miller's algorithm, these new candidates are added to a global candidate list and the corresponding lattice edges are marked. The algorithm only searches the nodes with marked edges, to ensure only minimal rules are returned.

References

[1] P. Bohannon, W. Fan, F. Geerts, X. Jia and A. Kementsietsidis, “Conditional Functional Dependencies for Data Cleaning,” 2007 IEEE 23rd International Conference on Data Engineering, 2007, pp. 746-755, doi: 10.1109/ICDE.2007.367920.

[2] F. Chiang, & R.J. Miller (2008). Discovering data quality rules. Proc. VLDB Endow., 1, 1166-1177.

[3] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma, “Improving data quality: Consistency and accuracy,” in VLDB, 2007.

[4] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, “Conditional functional dependencies for capturing data inconsistencies,” TODS, vol. 33, no. 2, June, 2008.

[5] W. Fan, F. Geerts, J. Li and M. Xiong, “Discovering Conditional Functional Dependencies” in IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 5, pp. 683-698, May 2011, doi: 10.1109/TKDE.2010.154.

[6] M. Li, H. Wang and J. Li, “Mining conditional functional dependency rules on big data,” in Big Data Mining and Analytics, vol. 3, no. 1, pp. 68-84, March 2020, doi: 10.26599/BDMA.2019.9020019.

[7] V.S. Santana, F.S. Lopes, (2019) Method for the Assessment of Semantic Accuracy Using Rules Identified by Conditional Functional Dependencies. In: Garoufallou E., Fallucchi F., William De Luca E. (eds) Metadata and Semantic Research. MTSR 2019. Communications in Computer and Information Science, vol 1057. Springer, Cham. https://doi.org/10.1007/978-3-030-36599-8_25

Data Quality Improvement – Data Profiling

This is the second post of my Data Quality Improvement blog series. This blog post discusses the data profiling tasks that I think are relevant to data quality improvement use cases.

Anyone who has ever worked with data must already have done some sort of data profiling, whether using commercial data profiling software, writing an ad-hoc SQL query, or just gazing over an Excel spreadsheet. Many tasks or activities can be put into the data profiling basket. To give a simple definition, data profiling is the process of finding answers to your questions about the target datasets.

For different use cases, the focus of data profiling is often different. For example, for data integration and data migration use cases, the focus is on identifying the semantics of tables and columns; for data analytics use cases, the focus is on identifying the relationships between columns and between rows. For data quality use cases, data profiling has a wider focus, as data quality issues vary widely and can occur anywhere.

There are many data profiling tasks involved in data quality use cases, and a systematic data profiling process can be helpful for organising them. This blog post suggests a set of data profiling tasks that are relevant to data quality use cases. Those tasks are organised into three top-level categories that are executed in order:

  1. Table-level profiling
  2. Single-column profiling
  3. Multi-column profiling

Table-level profiling

Table-level profiling determines the metadata and collects the statistics of a table, and includes the following profiling tasks:

  • Table schema
  • Shape (row/column counts) and size
  • Maintenance attributes (created/last updated datetime, created by)
  • Business attributes (business-critical levels, target users)
  • Update frequency and growth rate
  • Relationships to other tables

Single-column profiling

Single-column profiling captures column attributes and statistics of column values. There are many more profiling tasks involved in single-column profiling than in table-level profiling. In this section, I categorise those profiling tasks according to data quality dimensions.

Completeness

The completeness of a single column can be measured by counting the missing values and comparing it to the count of total rows.

It sounds simple enough, right? However, the devil is in the details. Missing values do not have to be null values or empty cells. They can be commonly used default values, such as "NA", "Default" and "Empty". They can also be the first option in a dropdown list, meaningless text such as "abc", or a randomly picked date, entered when the users who input data through a UI do not care about that field. Those types of missing values are more difficult to detect but are more important for data quality improvement tasks. Normally, they can be detected by inspecting the least frequent values and by outlier detection.
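A minimal pandas sketch of this idea, with the list of "disguised" missing values being an assumption that would normally come from inspecting the least frequent values with the business:

```python
import pandas as pd

# Hypothetical column containing genuine and disguised missing values.
col = pd.Series(["Alice", None, "", "NA", "Default", "Bob", "abc"])

# Placeholder values the business treats as missing (assumed for this sketch).
DISGUISED_MISSING = {"", "NA", "N/A", "Default", "Empty", "abc"}

missing = col.isna() | col.isin(DISGUISED_MISSING)
completeness = 1 - missing.sum() / len(col)
print(f"Completeness: {completeness:.0%}")   # only 2 of 7 values are genuinely present
print(col.value_counts().tail(3))            # least frequent values: candidates for disguised missing
```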

Uniqueness

The uniqueness of a single column can be profiled by counting the distinct values and comparing the count to the count of total rows. A uniqueness constraint can be used to validate the identity columns of a transactional dataset, such as orderId in an order table, or the attribute columns of a reference dataset, such as customers, employees and products.

One of the challenges with uniqueness profiling is detecting approximately matching records. For example, UK and GB refer to the same country. For some use cases, duplicate resolution techniques may be required to support the uniqueness profiling.

In addition, depending on the business requirements, conditional uniqueness profiling may be required. Conditional uniqueness refers to scenarios where the value in a column is not unique at a global level but is unique inside a partition/group. For example, in a dataset holding world country & city records, the city column may not be unique, as two cities in different countries may have the same name, such as Durham (the one in the UK and the one in the USA). However, at the country level, this column is indeed unique.

Another useful uniqueness-related column attribute is monotonicity, which indicates an identity column.
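A short sketch of these uniqueness checks with pandas; the country/city sample data is made up for illustration.

```python
import pandas as pd

cities = pd.DataFrame({
    "country": ["UK", "UK", "USA", "USA"],
    "city":    ["Durham", "London", "Durham", "Boston"],
    "city_id": [1, 2, 3, 4],
})

# Global uniqueness: distinct count versus total row count.
print(cities["city"].nunique() / len(cities))                  # 0.75: 'Durham' repeats globally

# Conditional uniqueness: the city column is unique within each country partition.
per_country = cities.groupby("country")["city"]
print((per_country.nunique() == per_country.size()).all())     # True

# Monotonicity hints at an identity column.
print(cities["city_id"].is_monotonic_increasing)               # True
```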

Validity

The validity of a single column can be profiled from a number of angles, including:

Range – the range of a column can be determined by the maximum and minimum values of the column, which can be numerical or timestamp. One thing to note is that improper default values and outliers can affect the accuracy of range profiling. For example, when 1970-01-01 is used to represent a null value, the profiled range will incorrectly stretch from 1970-01-01 to the specified end date. In addition, some extreme values may need to be disregarded from the range calculation.

Value Length – the value length attribute of a column can be used to validate the format of a column. For example, a credit card number column is expected to hold a 16-digit number, and a UK postcode is expected to have 5 to 7 characters.

Patterns – this profiling task finds the patterns occurring in a column. For example, telephone numbers follow consistent patterns.

Decimals – the number of decimal places in numeric values.

Domains – this profiling task can be conducted to determine the domain values of a single column, or to examine the validity of a column by detecting any value falling outside the domain.

Value distributions – value distributions can also provide useful information about the validity of column values. For example, the least frequent values are potential candidates for invalid values.
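The sketch below pulls a few of these validity checks together for a single column; the sample postcodes and the rough UK-postcode regex are illustrative assumptions, not a complete validation rule.

```python
import pandas as pd

postcodes = pd.Series(["SW1A 1AA", "EH4 1DT", "999", "M1 1AE", "SW1A 1AA"])

profile = {
    "min_length": int(postcodes.str.len().min()),                   # value-length range
    "max_length": int(postcodes.str.len().max()),
    "pattern_match_rate": postcodes.str.fullmatch(                  # share matching a rough UK pattern
        r"[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}").mean(),
    "least_frequent": postcodes.value_counts().tail(2).to_dict(),   # candidates for invalid values
}
print(profile)
```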

Multi-column profiling

As you may have noticed, the single-column profiling tasks mentioned above focus on extracting the "syntax" information of the data, which is useful for solving Type I data quality problems (please refer to my previous blog post for further details of the Type I and Type II data quality problems). Compared to Type I data quality problems, Type II data quality problems are "semantic" problems that require more effort and budget to tackle (80% of all data quality improvement effort). Multi-column profiling, which is often underused in data projects, is good at identifying the relationships between columns and revealing the frequent patterns within a dataset, which can be utilised to extract the "semantic" information of the data. In this section, I organise the multi-column profiling tasks based on the inter-column relationships:

Correlation – correlation profiling, or correlation analysis, is a statistical method to evaluate the strength of the relationship between two continuous, numerical columns. The strength of the relationship can be quantified as a numerical value, i.e. the correlation coefficient, ranging between -1 and 1. A positive correlation coefficient represents a same-direction correlation, and the closer the coefficient is to 1, the stronger the relationship between the columns. A negative correlation coefficient represents an opposite-direction correlation, such as between the number of people who take the Covid-19 vaccine and the number of people admitted to hospital due to Covid-19 infections.
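For example, the correlation coefficient between two numeric columns can be profiled directly with pandas; the weekly figures below are made up purely to illustrate an opposite-direction correlation.

```python
import pandas as pd

# Hypothetical weekly figures: vaccinations given vs. Covid-19 hospital admissions.
weekly = pd.DataFrame({
    "vaccinated": [100, 250, 400, 650, 900],
    "admissions": [500, 430, 300, 180,  90],
})

# Pearson correlation coefficient in [-1, 1]; a value close to -1 indicates a strong
# opposite-direction relationship between the two columns.
print(weekly["vaccinated"].corr(weekly["admissions"]))
```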

Dependency – dependency profiling refers to the activities of identifying the dependencies between columns. The dependency most relevant to data quality improvement use cases is functional dependency. A functional dependency exists when the values in one set of columns functionally determine the values of another set of columns. For example, the value "London" in the "capital" column determines the value in the "country" column to be "UK". For data quality improvement use cases, an extension of traditional functional dependency, conditional functional dependency (CFD), is more effective at capturing the semantics of the data. As conditional functional dependency is a very useful tool for detecting data quality issues at the semantic level, I will discuss it in a dedicated blog post.

Clustering – clustering profiling refers to the activities of identifying groups of similar records based on the values of a set of columns. Records that do not fit into any group are potential candidates for data errors.
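One way to sketch this (assuming scikit-learn is available) is a density-based clustering such as DBSCAN, which labels records that belong to no cluster as noise; the numeric features and parameters below are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical numeric profile of records (e.g. order amount and item count).
records = np.array([
    [10.0, 1], [11.0, 1], [10.5, 2],   # a dense cluster of typical orders
    [200.0, 3], [205.0, 3],            # another plausible cluster
    [9999.0, 1],                       # isolated record: a potential data error
])

labels = DBSCAN(eps=10.0, min_samples=2).fit_predict(records)
outliers = records[labels == -1]       # DBSCAN marks noise points with label -1
print(outliers)                        # [[9999.    1.]]
```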

In this blog post, I introduced the data profiling tasks that are relevant to data quality improvement use cases, organised into three categories: table-level, single-column and multi-column. Table-level profiling tasks mainly provide contextual information about a dataset. The single-column profiling tasks are more useful for Type I, "syntax"-level data quality problems, while the multi-column profiling tasks, which are capable of extracting "semantic" information, have the potential to support the resolution of Type II, "semantic"-level data quality problems.

This blog post only covers the data profiling tasks that I think are relevant to data quality improvement. However, data profiling covers much wider areas and topics, such as cross-table profiling, cross-source profiling, and schema matching.

Data Quality – 80:20 Rule and 1:10:100 Rule

I came across two data quality rules from Martin Doyle's blog today. Martin Doyle is a data quality improvement evangelist and an industry expert on CRM. I found that, to a certain extent, those data quality rules provide some theoretical support for some of my ideas about data quality improvement.

80:20 Rule

The 80:20 Rule was introduced by Martin Doyle to evaluate the costs of the Type I and Type II data quality problems defined by Jim Barker.

In brief, Type I data quality problems are the "syntax" problems that require "know-what" to identify, such as the problems falling under the completeness, consistency, uniqueness and validity data quality dimensions. Type I data quality problems can be easily detected, and even solved, by data quality software.

Type II data quality problems are the "semantic" problems that require "know-how", i.e. domain knowledge and experience, to detect and solve. This type of problem is more enigmatic: the data looks fine on the surface and serves well most of the time. However, the damage it causes to businesses can be much larger.

Martin Doyle's 80:20 rule states that around 80% of data quality problems are Type I problems and only 20% are Type II problems. However, the Type II problems cost 80% of the effort and budget to solve, while the Type I problems take only 20%.

1:10:100 Rule

The 1:10:100 rule was initially developed by George Labovitz and Yu Sang Chang in 1992 and highlights the importance of early prevention in quality control. In brief, the 1:10:100 rule states that the costs of quality control increase exponentially over time:

  • $1: prevention cost – verifying and correcting data at the start. This is the least expensive way to control data quality.
  • $10: correction cost – identifying and cleaning data; businesses need to set up a team to validate and correct data errors.
  • $100: failure cost – the cost of failures caused by bad data.

This rule implies that the earlier you take care of your data, the lower the price you have to pay for the damage caused by bad data.

My Takeaways

  • While data quality software is capable of automating the tasks for solving Type I data quality problems, the tasks for solving Type II data quality problems require domain experts to invest time in manual investigation and research. It is not surprising that the cost of solving Type II problems is much higher than the cost of solving Type I problems. The key question that interests me most is how to improve the effectiveness and efficiency of solving Type II problems. Not only is the current manual approach expensive and time-consuming, it is also highly dependent on the domain experts' knowledge and experience. The devil is in the detail: if organisations lose those experts, it would be difficult for their successors to understand and solve those hidden issues, even with comprehensive handover documents, as the knowledge and experience cannot simply be passed over (at least for constructivists).
  • So, can we find an alternative approach to solving the Type II problems without depending on the domain experts' knowledge and experience? Can we identify the patterns and relations existing in a dataset automatically, using the computation power of modern computers to exhaustively scan through all possibilities? I don't know the answer yet, but I am interested in exploring this. I am in the process of building a prototype that experiments with machine learning and data mining techniques for identifying Type II data quality problems.
  • The 1:10:100 rule highlights the importance of preventing data quality issues at an early stage. However, the question is how to convince organisations to invest in data quality management when everything works fine on the surface. There are clear business drivers for the error correction and failure resolution tasks, as business activities will be immediately affected if no action is taken. However, it is more difficult to evaluate and quantify the necessity and urgency of the issue-prevention tasks. As Arkady Maydanchik, the author of the book "Data Quality Assessment", mentioned, people do not like those who predict rain on a sunny day. One solution is to improve organisations' awareness of their data quality: let them know their data is not perfect and has a high potential to run into problems if they do nothing to prevent those potential issues.