Tag: Data Quality

Data Quality Improvement – Conditional Functional Dependency (CFD)

To fulfil the promise I made before, I am dedicating this blog post to the topic of Conditional Functional Dependency (CFD). The reason I dedicate a whole post to this topic is that CFD is one of the most promising types of constraint for detecting and repairing inconsistencies in a dataset. CFDs allow context-dependent quality rules to be identified automatically, which makes them a promising tool for automatic data quality rule discovery. CFD was originally developed by Professor Wenfei Fan from the University of Edinburgh. I have listed some of his papers on CFD [1][4][5] in the reference section at the end of this post.

Conditional Functional Dependency (CFD)

CFD is an extension of the traditional functional dependency (FD). An FD is a class of constraints in which the values of a set of columns X determine the values of another set of columns Y, written as X → Y. FDs were developed mainly for schema design [1] and are required to hold on an entire relation, which makes them less useful for detecting data inconsistencies in real-world datasets. CFD extends FD by incorporating bindings of semantically related values, which makes it capable of expressing the semantics of data that are fundamental to data cleaning [1].

Here I will borrow the classic ‘cust’ relation example from [1], which I find is possibly the simplest way to explain the CFD concept. Figure 1 shows an instance r0 of the ‘cust’ relation, which describes a customer in terms of the customer’s phone (country code (CC), area code (AC), phone number (PN)), name (NM), and address (street (STR), city (CT), zip code (ZIP)).

Figure 1. An instance of the cust relation
Traditional Functional Dependency (FD)

From the customer records shown in Figure 1, we can see that all the customers with the same country code (CC) and area code (AC) are also located in the same city (CT). For example, all customers with CC ’01’ and AC ‘908’ are located in the CT ‘MH’ (t1, t2, and t4). Here we have the functional dependency f1: [CC, AC] → [CT], i.e. CC and AC determine the value of CT. As this dependency holds on the entire relation, it is a traditional functional dependency. In addition, we can find another traditional functional dependency in the relation, f2: [CC, AC, PN] → [STR, CT, ZIP], which states that the phone country code, area code and number determine the street name, city, and zip code.

Conditional Functional Dependency (CFD)

Let’s take a look at another example, f: [ZIP] → [STR]. The customers (t5 and t6) with the same ZIP code ‘EH4 1DT’ do have the same STR value ‘High St.’. Now let’s check the dependency over the entire relation. The dependency applies to t1 and t2, with ZIP ‘07974’ and STR ‘Tree Ave.’. However, it does not apply to t4, which has ZIP ‘07974’ but STR ‘Elm Str.’. In this case, the ZIP code ‘07974’ cannot determine a unique STR value. Therefore, the constraint [ZIP] → [STR] does not hold on the entire relation, so it is not a functional dependency. However, the constraint [ZIP] → [STR] does hold for the customers with country code (CC) ’44’, i.e. a UK address. In other words, the postcode determines the street in the UK address system but not in the US address system, and the constraint [ZIP] → [STR] only holds when the country code is 44. This type of constraint is a conditional functional dependency (CFD), which can be written as ([CC, ZIP] → STR, (44, _ || _ )), where [CC, ZIP] → STR is the functional dependency and (44, _ || _ ) specifies the condition, i.e. when CC is 44, ZIP uniquely determines STR. The notation of a CFD can be generalised as (X → A, tp), where X → A is an FD and tp is a tuple pattern over the attributes in X and A that defines the subset on which the FD holds.
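
To make the notation concrete, here is a minimal sketch in Python (pandas), checking the CFD above against the tuples quoted from Figure 1; the helper function and the DataFrame layout are my own illustration, not code from [1]:

```python
import pandas as pd

# Tuples t1, t2, t4, t5, t6 from Figure 1, restricted to the columns used below.
cust = pd.DataFrame({
    "CC":  ["01", "01", "01", "44", "44"],
    "ZIP": ["07974", "07974", "07974", "EH4 1DT", "EH4 1DT"],
    "STR": ["Tree Ave.", "Tree Ave.", "Elm Str.", "High St.", "High St."],
})

def holds_cfd(df, lhs, rhs, condition):
    """Check (lhs -> rhs, tp): within the rows matching the constant part of the
    pattern tuple, every LHS value combination must map to a single RHS value."""
    subset = df
    for col, value in condition.items():      # apply the constant pattern, e.g. CC = 44
        subset = subset[subset[col] == value]
    return bool((subset.groupby(lhs)[rhs].nunique() <= 1).all().all())

# ([CC, ZIP] -> STR, (44, _ || _)) holds on the UK tuples ...
print(holds_cfd(cust, ["CC", "ZIP"], ["STR"], {"CC": "44"}))   # True
# ... while the plain FD [ZIP] -> [STR] fails on the whole relation (t1/t2 vs t4).
print(holds_cfd(cust, ["ZIP"], ["STR"], {}))                   # False
```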

Constant Conditional Functional Dependency (CCFD)

As mentioned above, a CFD can be expressed as (X → A, tp), where tp is a tuple pattern such as the earlier example (44, _ || _ ); the symbol ‘||’ separates the left-hand side (LHS) from the right-hand side (RHS) of the dependency, and the symbol ‘_’ represents any possible value, i.e. ‘_’ acts as a variable in programming terms. One special case of CFD is when all the attributes in the tuple pattern are constants. For example, for the CFD ([CC, AC] → CT, (01, 908 || MH)), all the attributes defining the tuple pattern are constants: CC=01, AC=908, CT=MH. In plain English, a customer’s phone country code 01 and phone area code 908 determine that the city where this customer is located is MH. Compared to general CFDs, constant CFDs (CCFDs) attract more attention from the data quality community because a CCFD expresses the semantics at the most detailed level and is easy to use for checking data consistency.
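
Because every attribute in a CCFD’s pattern tuple is a constant, checking it amounts to selecting the rows that match the constant LHS values and confirming they all carry the constant RHS value. A minimal sketch (again pandas, my own illustration) using the ([CC, AC] → CT, (01, 908 || MH)) example:

```python
import pandas as pd

# Tuples t1, t2 and t4 from Figure 1, restricted to the columns used by the CCFD.
cust = pd.DataFrame({"CC": ["01", "01", "01"],
                     "AC": ["908", "908", "908"],
                     "CT": ["MH", "MH", "MH"]})

def holds_ccfd(df, lhs_constants, rhs_constants):
    """Check a constant CFD such as ([CC, AC] -> CT, (01, 908 || MH)):
    every row matching the constant LHS must also carry the constant RHS."""
    mask = pd.Series(True, index=df.index)
    for col, value in lhs_constants.items():
        mask &= df[col] == value
    matching = df[mask]
    return all((matching[col] == value).all() for col, value in rhs_constants.items())

print(holds_ccfd(cust, {"CC": "01", "AC": "908"}, {"CT": "MH"}))  # True
```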

CFD Discovery

Compared to FD, CFD is more effective in detecting and repairing inconsistencies in data [2][3], and it has demonstrated its potential for commercial adoption in the real world. However, to use it commercially, effective and efficient methods for discovering the CFDs in a dataset are required. Even for a relatively small dataset, the workload of manually identifying and validating all the CFDs is overwhelming. Therefore, the CFD discovery process has to be automated. A number of CFD discovery algorithms [2][5][6] have been developed since the concept of CFD was introduced.

In this section, I will introduce the CFD discovery algorithm developed by Chiang & Miller [2]. This method has been shown by Chiang & Miller [2] to be effective at capturing semantic data quality rules that enforce a broad range of domain constraints. In addition, because redundant candidates are pruned as early as possible, the search space for the set of minimal CFDs is reduced, and so is the computation time for discovering the CFDs.

Chiang & Miller used 12 double-column pages in their paper [2] to elaborate their algorithm. That paper is informative and worth reading in detail if you are interested in data quality rule discovery. In the rest of this blog post, I will try my best to use simple language to explain how this algorithm works.

Firstly, let’s assume we have a relation with four attributes, A, B, C, D, and we want to find the CFDs in the relation. The result set of CFDs needs to be minimal, with redundant CFDs discarded. In addition, we expect the algorithm to take as little computation time as possible. Given the four attributes A, B, C, D, the search space for the CFDs can be illustrated as an attribute search lattice (Figure 2). Each node in the lattice represents an attribute set; for example, ‘AC’ represents the attribute set containing attribute ‘A’ and attribute ‘C’. A directed edge between two nodes represents a candidate CFD; for example, the edge (AC, ABC) represents ([A, C] → B, (x, x || x)), where the pattern tuples are unknown and need to be mined by the algorithm.

Figure 2: Attribute search lattice

The candidate CFDs are evaluated by traversing the lattice using a breadth-first search (BFS), i.e. starting at the root and visiting all nodes at the current depth level before moving on to the next. In our example, we first traverse the first level (k = 1), which consists of the single-attribute sets ([A], [B], [C], [D]), followed by the second level (k = 2), which consists of the 2-attribute sets ([A, B], [A, C], [A, D], [B, C], [B, D], [C, D]), and so on until level k = total levels - 1 or until all CFDs in a minimal set have been found.
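
The lattice levels themselves are straightforward to enumerate: level k contains every k-attribute subset, and each candidate edge pairs a k-attribute node X with a (k+1)-attribute parent X ∪ {A}, giving a candidate FD X → A. The sketch below (my own illustration of the traversal order, not Chiang & Miller's implementation) enumerates the candidates in that breadth-first order:

```python
from itertools import combinations

attributes = ["A", "B", "C", "D"]

def lattice_levels(attrs):
    """Yield the attribute sets level by level (k = 1, 2, ...), i.e. the
    breadth-first order in which the lattice nodes are visited."""
    for k in range(1, len(attrs) + 1):
        yield [frozenset(c) for c in combinations(attrs, k)]

def candidate_edges(attrs):
    """Yield candidate FDs X -> A, one per edge (X, X union {A}) in the lattice."""
    for level in lattice_levels(attrs):
        for x in level:
            for a in set(attrs) - x:
                yield sorted(x), a          # e.g. (['A', 'C'], 'B') for edge (AC, ABC)

for lhs, rhs in candidate_edges(attributes):
    print(f"{lhs} -> {rhs}")
```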

To optimise efficiency, not all nodes in the lattice are visited. Instead, the algorithm prunes nodes that are supersets of already discovered rules and nodes that do not satisfy the support threshold. In this way, the search space is reduced by skipping the evaluation of descendant nodes. For example, when the candidate ([A, C] → B, (A=’x’, _ || _)) is found to be a CFD, there is no point evaluating the candidate ([A, C, D] → B, (A=’x’, _, _ || _)), as the CFD ([A, C] → B, (A=’x’, _ || _)) already covers its semantics.
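
In code, this pruning step boils down to remembering the left-hand sides of the rules accepted so far and skipping any later candidate whose left-hand side, for the same right-hand side, is a superset of one of them. A simplified sketch, not the exact bookkeeping from [2]:

```python
def is_redundant(candidate_lhs, rhs, discovered):
    """Skip a candidate such as [A, C, D] -> B when a more general rule with the
    same RHS, e.g. [A, C] -> B, has already been accepted."""
    return any(lhs <= frozenset(candidate_lhs) for lhs in discovered.get(rhs, []))

discovered = {"B": [frozenset(["A", "C"])]}             # ([A, C] -> B, ...) already accepted
print(is_redundant(["A", "C", "D"], "B", discovered))   # True  -> prune, do not evaluate
print(is_redundant(["B", "D"], "A", discovered))        # False -> still a candidate
```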

To evaluate whether or not a candidate in the lattice is a valid CFD, we can identify the values of the left-hand side (LHS) of the candidate and check whether the same LHS values always map to the same right-hand side (RHS) value. In programming terms, we group the tuples by the LHS attribute values: for a valid CFD, the tuples in an LHS group must not split into two or more RHS groups. Generally speaking, the more tuples fall into a group that supports a valid CFD, the more preferable this CFD is. A threshold can be imposed on the number of tuples a CFD needs to hold on before it qualifies as a valid CFD.
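
As a simplified sketch of that grouping test (my own simplification, not the exact procedure from [2]): group the tuples by a conditional attribute, and within each condition check that no LHS group splits into more than one RHS value, keeping only the conditions with enough supporting tuples.

```python
import pandas as pd

def conditional_patterns(df, lhs, rhs, condition_attr, min_support=2):
    """For a candidate (lhs -> rhs) conditioned on `condition_attr`, return the
    constant values of `condition_attr` under which the FD holds on at least
    `min_support` tuples."""
    valid = []
    for value, subset in df.groupby(condition_attr):
        # the FD holds under this condition if no LHS group splits into several RHS values
        if (subset.groupby(lhs)[rhs].nunique() <= 1).all() and len(subset) >= min_support:
            valid.append(value)
    return valid

# Toy data: C determines B only where A == 'a2'.
df = pd.DataFrame({
    "A": ["a1", "a1", "a1", "a2", "a2", "a2"],
    "C": ["c1", "c1", "c2", "c1", "c1", "c2"],
    "B": ["b1", "b2", "b3", "b4", "b4", "b5"],
})
# ['a2'], i.e. the candidate materialises as ([A, C] -> B, (a2, _ || _))
print(conditional_patterns(df, lhs=["C"], rhs="B", condition_attr="A"))
```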

The tuple groups that fail the CFD validity test can still be refined by considering additional variable and conditional attributes.

Figure 3: A portion of the attribute search lattice

For example, as shown in Figure 3, assume that no CFD can be materialised from the candidate [A, C] → B of edge (AC, ABC). We can generate a new candidate in the next level of the lattice by considering an additional attribute such as D, so that the new candidate becomes ([A, C, D] → B), shown as edge (X’, Y’) in Figure 3. If the new candidate still cannot be materialised into a valid CFD after adding a variable attribute, we can condition on an additional attribute; for example, we can condition on attribute D on top of an existing conditional attribute such as A. In Chiang & Miller’s algorithm, these new candidates are added to a global candidate list and the corresponding lattice edges are marked. The algorithm only searches the nodes with marked edges to ensure that only minimal rules are returned.

References

[1] P. Bohannon, W. Fan, F. Geerts, X. Jia and A. Kementsietsidis, “Conditional Functional Dependencies for Data Cleaning,” 2007 IEEE 23rd International Conference on Data Engineering, 2007, pp. 746-755, doi: 10.1109/ICDE.2007.367920.

[2] F. Chiang and R. J. Miller, “Discovering data quality rules,” Proc. VLDB Endow., vol. 1, no. 1, pp. 1166-1177, 2008.

[3] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma, “Improving data quality: Consistency and accuracy,” in VLDB, 2007.

[4] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, “Conditional functional dependencies for capturing data inconsistencies,” TODS, vol. 33, no. 2, June, 2008.

[5] W. Fan, F. Geerts, J. Li and M. Xiong, “Discovering Conditional Functional Dependencies” in IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 5, pp. 683-698, May 2011, doi: 10.1109/TKDE.2010.154.

[6] M. Li, H. Wang and J. Li, “Mining conditional functional dependency rules on big data,” in Big Data Mining and Analytics, vol. 3, no. 1, pp. 68-84, March 2020, doi: 10.26599/BDMA.2019.9020019.

[7] V. S. Santana and F. S. Lopes, “Method for the Assessment of Semantic Accuracy Using Rules Identified by Conditional Functional Dependencies,” in Metadata and Semantic Research (MTSR 2019), Communications in Computer and Information Science, vol. 1057, Springer, Cham, 2019. https://doi.org/10.1007/978-3-030-36599-8_25

Data Quality Improvement – Data Profiling

This is the second post of my Data Quality Improvement blog series. This blog post discusses the data profiling tasks that I think are relevant to data quality improvement use cases.

Anyone who has ever worked with data has almost certainly done some sort of data profiling, whether by using commercial data profiling software, writing an ad-hoc SQL query, or just gazing over an Excel spreadsheet. Many tasks and activities can be put into the data profiling basket. To give a simple definition, data profiling is the process of finding answers to your questions about the target datasets.

For different use cases, the focus of data profiling is often different. For example, for data integration and data migration use cases, the focus is on identifying the semantics of tables and columns; for data analytics use cases, the focus is on identifying the relationships between columns and between rows. For data quality use cases, data profiling has a wider focus, as data quality issues vary widely and can occur anywhere.

Many data profiling tasks are involved in data quality use cases, and a systematic data profiling process can be helpful for organising them. This blog post suggests a set of data profiling tasks that are relevant to data quality use cases. Those tasks are categorised into three top-level categories that are executed in order:

  1. Table-level profiling
  2. Single-column profiling
  3. Multi-column profiling

Table-level profiling

Table-level profiling determines the metadata and collects the statistics of a table. It includes the following profiling tasks (a minimal sketch follows the list):

  • Table schema
  • Shape (row/column counts) and size
  • Maintenance attributes (created/last updated datetime, created by)
  • Business attributes (business-critical levels, target users)
  • Update frequency and growth rate
  • Relationships to other tables
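
To make the first few items concrete, here is a minimal table-level profiling sketch in Python using pandas (the function and table names are mine); the maintenance and business attributes usually come from a data catalogue rather than from the table itself, so they are left out:

```python
import pandas as pd

def table_profile(df: pd.DataFrame, name: str) -> dict:
    """Collect basic table-level metadata: schema, shape and size."""
    return {
        "table": name,
        "schema": df.dtypes.astype(str).to_dict(),       # column names and types
        "rows": len(df),
        "columns": df.shape[1],
        "size_bytes": int(df.memory_usage(deep=True).sum()),
    }

orders = pd.DataFrame({"order_id": [1, 2, 3], "amount": [9.99, 25.0, 7.5]})
print(table_profile(orders, "orders"))
```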

Single-column profiling

Single-column profiling captures column attributes and statistics of column values. There are many more profiling tasks involved in single-column profiling than in table-level profiling. In this section, I categorise those profiling tasks according to data quality dimensions.

Completeness

The completeness of a single column can be measured by counting the missing values and comparing the count to the total number of rows.

It sounds simple enough, right? However, the devil is in the details. Missing values do not have to be null values or empty cells. They can be commonly used default values, such as “NA”, “Default” and “Empty”. They can also be the first option in a dropdown list, meaningless text such as “abc”, or a randomly picked date, entered when the users who input data through a UI do not care about that field. These types of missing values are more difficult to detect but are more important for data quality improvement. Normally, they can be detected by inspecting the least frequent values and by outlier detection.
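
As a minimal completeness-profiling sketch in Python (pandas; the sentinel list and column values are purely illustrative), counting true nulls, flagging commonly used placeholder values, and surfacing the least frequent values for manual inspection:

```python
import pandas as pd

SENTINELS = {"", "NA", "N/A", "Default", "Empty", "abc", "1970-01-01"}  # illustrative list

def completeness_profile(series: pd.Series, n_rare: int = 5) -> dict:
    """Profile missing and 'disguised missing' values of a single column."""
    total = len(series)
    counts = series.value_counts(dropna=True)
    return {
        "total_rows": total,
        "null_ratio": series.isna().sum() / total,
        "sentinel_ratio": series.isin(SENTINELS).sum() / total,  # candidate disguised missing values
        "least_frequent": counts.tail(n_rare).to_dict(),         # inspect these manually
    }

col = pd.Series(["London", "Paris", None, "NA", "Default", "London", "abc"])
print(completeness_profile(col))
```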

Uniqueness

The uniqueness of a single column can be profiled by counting the distinct values and comparing the count to the total number of rows. A uniqueness constraint can be used to validate the identity columns of a transactional dataset, such as orderId in an order table, or the attribute columns of a reference dataset such as customers, employees and products.

One of the challenges with uniqueness profiling is detecting approximately matching records. For example, UK and GB refer to the same country. For some use cases, duplicate resolution techniques may be required to support uniqueness profiling.

In addition, depending on the business requirements, conditional uniqueness profiling may be required. Conditional uniqueness refers to scenarios where the values in a column are not unique at a global level but are unique inside a partition/group. For example, in a dataset holding world country & city records, the city column may not be unique, as two cities in different countries may have the same name, such as Durham (the one in the UK and the one in the USA). However, at the country level, this column is indeed unique.

Another useful uniqueness-related column attribute is monotonicity, which hints at an identity column.
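
A small uniqueness-profiling sketch (pandas; function and column names are mine) covering the distinct ratio, conditional uniqueness within a partition, and the monotonicity check mentioned above:

```python
import pandas as pd

def uniqueness_profile(df: pd.DataFrame, column: str, partition_by: str = None) -> dict:
    """Distinct ratio, conditional uniqueness within a partition, and monotonicity."""
    col = df[column]
    profile = {
        "distinct_ratio": col.nunique() / len(col),
        "is_unique": col.is_unique,
        "is_monotonic_increasing": col.is_monotonic_increasing,  # hints at an identity column
    }
    if partition_by:
        per_group = df.groupby(partition_by)[column]
        profile["unique_within_partition"] = bool((per_group.nunique() == per_group.size()).all())
    return profile

cities = pd.DataFrame({
    "country": ["UK", "UK", "USA", "USA"],
    "city":    ["Durham", "London", "Durham", "Boston"],
})
# 'city' is not globally unique (two Durhams) but is unique within each country.
print(uniqueness_profile(cities, "city", partition_by="country"))
```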

Validity

The validity of a single column can be profiled from a number of angles, including:

Range – the range attribute of a column is determined by the maximum and minimum values of the column, which can be numerical or a timestamp. One thing to note is that improper default values and outliers can affect the accuracy of range profiling. For example, when 1970-01-01 is used as the null value, the profiled range will be artificially stretched from 1970-01-01 to the actual end date. In addition, some extreme values may need to be disregarded from the range calculation.

Value length – the value length attribute of a column can be used to validate its format. For example, a credit card number column is expected to contain 16-digit numbers, and a UK postcode is expected to have 5 to 7 characters.

Patterns – this profiling task finds the patterns occurring in a column. For example, telephone numbers follow consistent patterns.

Decimals – the number of decimal places in numeric values.

Domains – this profiling task can be conducted to determine the domain values of a single column, or to examine the validity of a column by detecting any value falling outside the domain.

Value distributions – value distributions can also provide useful information about the validity of column values. For example, the least frequent values are potential candidates for invalid values.
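
A single-column validity sketch covering several of the angles above (range, value length, pattern, domain and value distribution); the helper function and the rough postcode regex are illustrative only:

```python
import pandas as pd

def validity_profile(series: pd.Series, pattern: str = None, domain: set = None) -> dict:
    """Range, value length, pattern conformance, domain conformance and
    frequency distribution of a single column."""
    s = series.dropna().astype(str)
    profile = {
        "min": series.min(),
        "max": series.max(),
        "min_length": s.str.len().min(),
        "max_length": s.str.len().max(),
        "least_frequent": series.value_counts().tail(3).to_dict(),
    }
    if pattern:
        profile["pattern_match_ratio"] = s.str.fullmatch(pattern).mean()
    if domain:
        profile["in_domain_ratio"] = series.isin(domain).mean()
    return profile

postcodes = pd.Series(["EH4 1DT", "SW1A 1AA", "0000000", "NE1 4LP"])
# A rough UK postcode pattern, for illustration only.
print(validity_profile(postcodes, pattern=r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}"))
```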

Multi-column profiling

As you may have noticed, the single-column profiling tasks mentioned above focus on extracting the “syntax” information of the data, which is useful for solving Type I data quality problems (please refer to my previous blog post for further details of the Type I and Type II data quality problems). Compared to Type I problems, Type II data quality problems are “semantic” problems that require more effort and budget to tackle (80% of all data quality improvement effort). Multi-column profiling, which is often underused in data projects, is good at identifying the relationships between columns and revealing the frequent patterns within a dataset, and can therefore be utilised to extract the “semantic” information of the data. In this section, I organise the multi-column profiling tasks based on the inter-column relationships:

Correlation – correlation profiling, or correlation analysis, is a statistical method for evaluating the strength of the relationship between two continuous, numerical columns. The strength of the relationship can be quantified as a numerical value, the correlation coefficient, ranging between -1 and 1. A positive correlation coefficient represents a same-direction correlation, and the closer the coefficient is to 1, the stronger the relationship between the columns. A negative correlation coefficient represents an opposite-direction correlation, such as the number of people who take a Covid-19 vaccine and the number of people admitted to hospital due to Covid-19 infections.
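
Correlation profiling is essentially a one-liner with pandas (Pearson by default); the column names below are illustrative:

```python
import pandas as pd

orders = pd.DataFrame({
    "ad_spend":   [100, 200, 300, 400, 500],
    "units_sold": [12, 22, 30, 44, 49],
    "returns":    [9, 8, 6, 4, 2],
})

# Pairwise Pearson correlation coefficients, ranging between -1 and 1.
print(orders.corr())
# A single pair, here a strong negative (opposite-direction) correlation.
print(orders["ad_spend"].corr(orders["returns"]))
```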

Dependency – dependency profiling refers to the activities of identifying the dependencies between columns. The dependency most relevant to data quality improvement use cases is the functional dependency. A functional dependency exists when the values in one set of columns functionally determine the values in another set of columns. For example, the value “London” in the “capital” column determines the value in the “country” column to be “UK”. For data quality improvement use cases, an extension of the traditional functional dependency, the conditional functional dependency (CFD), is more effective at capturing the semantics of the data. As the conditional functional dependency is a very useful tool for detecting data quality issues at the semantic level, I will discuss it in a dedicated blog post.

Clustering – clustering profiling refers to the activities of identifying groups of similar records based on the values of a set of columns. Records that do not fit into any group are potential candidates for data errors.
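
As a sketch of clustering-based profiling, the snippet below uses scikit-learn's DBSCAN (my choice of algorithm; any density-based method would do): records labelled as noise (cluster -1) do not fit into any group and are candidates for closer inspection.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Illustrative numeric columns, e.g. order quantity and unit price.
records = np.array([
    [10, 2.5], [11, 2.4], [9, 2.6], [10, 2.5],   # a dense group of similar records
    [55, 2.5], [10, 250.0],                      # records that fit no group
])

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(StandardScaler().fit_transform(records))
print(labels)                      # -1 marks noise points
print(records[labels == -1])       # potential data-error candidates
```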

In this blog post, I introduced the data profiling tasks that are relevant to data quality improvement use cases, organised into three categories: table-level, single-column and multi-column. Table-level profiling tasks mainly provide contextual information about a dataset. Single-column profiling tasks are more useful for Type I, “syntax”-level data quality problems, while multi-column profiling tasks, which are capable of extracting “semantic” information, have the potential to support the resolution of Type II, “semantic”-level data quality problems.

This blog post only covers the data profiling tasks that I think are relevant to data quality improvement. However, data profiling covers much wider areas and topics, such as cross-table profiling, cross-source profiling, and schema matching.

Data Quality – 80:20 Rule and 1:10:100 Rule

I came across two data quality rules on Martin Doyle’s blog today. Martin Doyle is a data quality improvement evangelist and an industry expert on CRM. I found that, to a certain extent, these data quality rules provide some theoretical support for some of my ideas on data quality improvement.

80:20 Rule

The 80:20 rule was introduced by Martin Doyle to evaluate the costs of the Type I and Type II data quality problems defined by Jim Barker.

In brief, Type I data quality problems are “syntax” problems that require “know-what” to identify, such as problems falling under the completeness, consistency, uniqueness and validity data quality dimensions. Type I data quality problems can be easily detected, and even solved, by data quality software.

Type II data quality problems are “semantic” problems that require “know-how” domain knowledge and experience to detect and solve. This type of problem is more enigmatic: the data looks fine on the surface and serves well most of the time, yet the damage it causes to businesses can be much larger.

Martin Doyle’s 80:20 rule states that around 80% of data quality problems are Type I problems and only 20% are Type II problems. However, the Type II problems cost 80% of the effort and budget to solve, while the Type I problems only take 20%.

1:10:100 Rule

The 1:10:100 rule was originally developed by George Labovitz and Yu Sang Chang in 1992 and highlights the importance of early prevention in quality control. In brief, the 1:10:100 rule states that the cost of quality control increases exponentially over time:

  • $1: prevention cost – verifying and correcting data at the start. This is the least expensive way to control data quality.
  • $10: correction cost – identifying and cleaning data. Businesses need to set up a team to validate and correct data errors.
  • $100: failure cost – the cost of failures caused by bad data.

This rule implies that the earlier you take care of your data, the lower the price you have to pay for the damage caused by bad data.

My Takeaways

  • While data quality software is capable of automating the tasks for solving Type I data quality problems, solving Type II data quality problems requires domain experts to invest time in manual investigation and research. It is not surprising that the cost of solving Type II problems is much higher than the cost of solving Type I problems. The key question that interests me most is how to improve the effectiveness and efficiency of solving Type II problems. Not only is the current manual approach expensive and time-consuming, but it is also highly dependent on the domain experts’ knowledge and experience. The devil is in the detail. If organisations lose those experts, it will be difficult for their successors to understand and solve those hidden issues, even with comprehensive handover documents, as knowledge and experience cannot simply be passed over (at least for constructivists).
  • So, can we find an alternative approach to solving Type II problems without depending on the domain experts’ knowledge and experience? Can we identify the patterns and relationships existing in a dataset automatically, using the computation power of modern computers to exhaustively scan through all possibilities? I don’t know the answer yet, but I am interested in exploring this. I am in the process of building a prototype that experiments with machine learning and data mining techniques for identifying Type II data quality problems.
  • The 1:10:100 rule highlights the importance of preventing data quality issues at an early stage. However, the question is how to convince organisations to invest in data quality management when everything looks fine on the surface. There are clear business drivers for error correction and failure resolution tasks, as business activities will be immediately affected if no action is taken. However, it is more difficult to quantify the necessity and urgency of the prevention tasks. As Arkady Maydanchik, the author of the book “Data Quality Assessment”, mentioned, people do not like those who predict rain on a sunny day. One solution is to improve organisations’ awareness of their data quality: let them know their data is not perfect and has a high potential to run into problems if they do nothing to prevent those potential issues.

Data Quality Improvement – Set the Scene Up

In this blog series, I plan to write about data quality improvement from a data engineer’s perspective. I plan for the series to cover not only data quality concepts, methodologies and procedures, but also case studies of the architectural designs of some data quality management platforms and deep dives into the technical details of implementing a data quality management solution. In addition, I am keen to cover some areas I have been exploring recently, such as data quality assessment of big data, automatic DQ, and DQ Ops.

This blog post aims to set the scene for my data quality improvement journey, focusing on the motivation behind data quality improvement, the nature of data quality, and the requirements for data quality improvement.

Data Quality, Why?

Firstly, organisations are overoptimistic about their data quality and underestimate the losses caused by data quality issues. The reality is:

The average financial impact of poor data quality on organizations is $9.7 million per year.

 Gartner

In the US alone, businesses lose $3.1 trillion annually due to poor data quality

IBM

Organizations make (often erroneous) assumptions about the state of their data and continue to experience inefficiencies, excessive costs, compliance risks and customer satisfaction issues as a result

Gartner

Secondly, the trend across different industries toward business process automation and data-driven decision-making makes the reliability of data quality critical for an organisation [4]. Techniques such as ML and AI are often highly sensitive to input data [5]; subtle errors in the input data can cause significant inaccuracy, no matter how good the algorithm is.

Thirdly, you may have to pay a scarily large penalty bill because of your poor data quality:

China’s banking regulator has fined the country’s four major state-owned lenders for inconsistencies in their financial data reporting.

GBRR

Citi fined $400M over risk management, data governance issues

Banking Dive

The Nature of Data Quality

Before jumping into the processes for improving data quality, we first need to understand what data quality is and why it is difficult to manage.

No, Your Data is Not Bad, It is Just Not Useful

There is no such thing as “good” data or “bad” data. Data can only be evaluated as “fit to use” for a specific purpose, or not fit to use for that purpose. Therefore, assessments of data quality have to be conducted in the context of how the data is expected to be used. The same set of data can be “good” (i.e., fit to use) for one purpose but “bad” for another.

For example, take the same set of order transaction data: when it is used for demand prediction, even if a small number of transaction records are missing, it may still be fit to use for producing reasonably accurate predictions. However, the same set of data will not be fit to use for monthly financial reports.

As we evaluate the quality of data as its fitness to use for a specific purpose, a couple of questions arise. Firstly, how shall we quantify the fitness level, and in what dimensions? On the other hand, can’t we just make our dataset universally fit all purposes? Is that economically feasible or reasonable?

Those questions and the others raised in the rest of this section will form the core requirements for a data quality improvement process.

Data is not Static, Nor is Data Quality

The current enterprise data environment has become more and more complex. Data is not static: it continuously moves about and evolves, from database schema changes due to source app redesigns and upgrades, ill-maintained ETL pipelines, and endless data migration and consolidation efforts, to ungoverned self-service data usage. A subtle error occurring in one place can trigger a butterfly effect and cause widespread infection. Even worse, those subtle errors introduced by changes in data can be very difficult to detect [2]. Based on my personal experience, I cannot agree with this more, especially when the transformation/enrichment logic is complex or you have very limited access to the source databases.

Due to the dynamic nature of data, data quality will not be static over time. As mentioned above, a single subtle error can trigger a butterfly effect and cause data quality to deteriorate unnoticed. In addition, data represents the real world. As the real world changes rapidly, the meanings held by the data change, and the criteria for judging data quality change as well.

Therefore, you should not be surprised to find that you have serious data quality issues now, even though your data achieved a high quality score last week.

Data Quality Improvement Involves Too Many Parties, Even just in IT

It is well accepted in the data quality management community that data quality is not just a technical problem but also a business problem. IT and business have to work together, with alignment on the data quality improvement goals and processes. The collaboration between IT and business is one of the main topics elaborated in the data quality management literature, such as [1][3]. However, I have found few articles that analyse the engineering challenges of data quality improvement within IT.

Data quality improvement is one of the most challenging jobs in IT. The root cause is that a running data system spans too many territories owned by different teams, each with different mindsets, different work priorities and different “best practices”. Just have a look at a data lineage diagram: there are the source databases, where the team that designs them is not the team that operates them; for a large enterprise data environment, a messaging service is often deployed to decouple the operational databases from the downstream systems, which adds one team responsible for the messaging service and probably another team maintaining a set of APIs for integration; then there is the ETL team (often with dependencies on other teams such as MDM, infrastructure and third-party vendors) for enriching and transforming the data into a state usable by end data users; and then there are the countless client analysis tools and languages. Never mind solving a data issue along this path; just travelling through the lineage and trying to identify the issue is a pain.

Within each territory along the data lineage paths, the data and the processing logic are safely protected inside its borders. The people responsible for solving a data quality issue often have no access to investigate the data and code. In addition, the processing logic can be very complex, and the techniques used can differ greatly between territories. Therefore, you have to rely on the experts in that territory to look for the data quality issue. As that territory has its own work schedule and priorities, you will find yourself in a “waiting” game. Even worse, the team investigating your issue may have to wait for other teams to get some prerequisites ready before they can look into it. And this is just one territory on your journey to find the DQ issue. If the issue is finally confirmed not to come from this territory, you have to move up along the data lineage paths and do it all over again. Even worse, sometimes you have to revisit territories you have already passed.

Data Quality Requirements

From the analysis of the nature of data quality, some high-level requirements for data quality improvement start to emerge. As data quality can only be evaluated as “fit (or not fit) to use” for a specific purpose, data quality assessments have to be conducted in the context of how the data is expected to be used. In this case, a series of questions needs to be considered when conducting a data quality assessment:

  • How to measure the ‘fitness to use’ level of a set of data for a specific purpose, in what dimensions, how to quantify the ‘fitness’ levels, and finally how to assess and monitor the effectiveness of the defined data quality assessment designs?
  • The personas involved in a data quality assessment process: who should define the data quality assessment rules, with support from whom, and who should assess the effectiveness of those rules?
  • Tooling support for data quality assessment: data quality assessment can be a daunting task that involves data profiling, data quality rule design and execution, and assessment result logging and analysis. What tooling features can make the process easier?

As mentioned earlier, data is not static but continuously moves about and evolves in an enterprise data environment. A one-shot data quality assessment is meaningless when the assessed data is constantly changing. Due to the dynamic nature of data, data quality assessment needs to be ongoing and continuous so that it reflects the up-to-date status of data quality in a timely manner.

For data quality improvement, awareness is key: the ongoing monitoring of data quality is the driver of data quality improvement. In the quality control of other products, such as an oil rig or other machinery, a quality issue is relatively visible and easy to become aware of. In comparison, data quality issues can stay much quieter and are not easy to notice. You cannot simply blame an organisation for ignoring data quality issues if it is not aware of those issues in the first place.

Compared to data quality assessment, which focuses more on business analysis and process management, the requirements of ongoing data quality monitoring focus more on engineering, for example:

  • What components are required to implement a complete, configurable and maintainable data quality monitoring platform?
  • Where should the data quality rules be evaluated: in place, where the data is located, or on the data quality monitoring platform after the data is cached there?
  • The data stores in a data system vary; they can be Oracle, Spark, a data lake and many more. How do we design an extendable, data-store-independent data quality monitoring platform while keeping good performance and scalability?
  • What are the security and compliance considerations for a data quality monitoring platform?

Last but not least, once we are aware of the data quality issues, the next step is to locate and solve them. As discussed above, solving a data quality issue can be challenging in an enterprise data environment. Fortunately, many others are feeling similar pains, and some of them, such as the DataOps community, have started to practise new ways of working to tackle those pains. The data quality improvement process should be designed to align with those new ways of working and to be an integrated part of them.

References

[1] McGilvray, D. (2008) Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information, California: Morgan Kaufmann.

[2] Polyzotis, N., Roy S., Whang S. E., and Zinkevich M., Data management challenges in production machine learning. SIGMOD, 1723–1726, 2017

[3] Redman, T. 2008. Data Driven: Profiting from Your Most Important Business Asset. Cambridge, MA, USA: Harvard Business Press.

[4] Schelter, S., Lange, D., Schmidt, P., Celikel, M., Biessmann, F., & Grafberger, A., Automating large-scale data quality verification. Proceedings of the VLDB Endowment, 11(12):1781–1794, VLDB Endowment, 2018.

[5] Sculley D., Holt G., Golovin D., Davydov E., Phillips T., Ebner D., Chaudhary V., Young M., Crespo J., and Dennison S. E., Hidden Technical Debt in Machine Learning Systems. NIPS, 2503–2511, 2015

Our Data Quality is Good, Nothing is Broken

Boss: Our data quality is good, nothing is broken

IT: Our data quality is good, there are some unimportant known issues, but all under control

BI Developer: All data is from the source systems, the quality should be good. Hey, look how cool the dashboard I built is

Business Users: Once again, those reports don’t make sense at all. I am busy, no time to report it to IT. Data-driven is not suitable for our business; let’s trust our experts’ instincts and get the job done.

Boss: We have built a successful data-driven company, our experience is…

What Makes Me Become a Data Quality Enthusiast

Data Quality is Important

Most of the time I don’t think I am an absolutist; however, I have found myself becoming more and more certain that data quality is the root of all evil. Not only should a bigger portion of project time be allocated to data quality management, but a lean, agile and more responsive type of data quality management methodology also needs to be explored to make the data quality improvement process less painful and more effective.

There is no doubt that the buzzwords “big data”, “AI” and “streaming” can decorate your data project as something “innovative” and “high-tech” (and make you not just that data guy but suddenly the posh kid). However, I find they are hardly the deciding factor in a data project’s success. With the jaw-dropping pace of advancement in data technologies, the “big data” is not “big”, the AI is not so “intelligent”, and the “streaming” concept may not exist for much longer as it is being unified with “batch” processing.

Not long ago, performance and scalability could be a deciding factor in a data project’s success. As the processing power of a single machine could not handle the volume of a big dataset (maybe not big by today’s standards), the aggregations of measures against all possible dimension attributes needed to be pre-calculated and cached in cubes. BI practitioners had to allocate much project time to implementing “best practices” and “tuning tricks” to make the process a bit faster. However, distributed and parallelised data processing models now provide BI practitioners with more options to ensure good performance and scalability without spending too much effort on tricky performance tuning. For example, in one of my previous projects, where I needed to optimise a data processing pipeline that took 8+ hours to run, instead of diving into thousands of lines of code to identify small performance improvement opportunities bit by bit, I parallelised the data processing logic and split the workload across 25 nodes with 100 cores running in parallel, which reduced the processing time from 8+ hours to 15 minutes.

As those buzzwords (“big data”, “AI” and “streaming”) are not the deciding factors of a data project’s success, and performance and scalability will no longer be the deciding challenge, how about the other elements in a data project, such as project management, maintainability, disaster recovery, and more? As a non-absolutist, I feel obliged to claim that all the elements or tasks in a data project can decide its success or failure and that they are equally important in an absolute sense. Data quality just has a higher probability of being the deciding factor, for the following reasons:

  • Many other people say so
  • Garbage in, garbage out: no matter how well the other parts of your data solution have been built, as long as the data is useless, the whole solution is useless
  • There are so many ways for data to go wrong: incomplete, duplicate, inconsistent, dated… I remember reading a paper that lists 30 or 40 dimensions in which data can go wrong.
  • People lack awareness of, or have misperceptions (most of the time overoptimistic) about, their data quality. It is not surprising to see an art-class dashboard full of nonsense data
  • The people who build a data solution are often not those who truly understand the data and the business logic it represents (and unfortunately the people who truly understand the data are not always those who have the skills to build a data solution)
  • Data quality management is a continuous process. An effective management process and tooling support are necessary, but few data projects have them.
  • There is no absolute good or bad data, and the definition of good or bad can vary significantly between usages. That is actually the reason why data quality is defined as “fitness to use”. The ‘good’ data for one group of people might be ‘bad’ for another group.

Data Quality is Fun

You might be surprised to hear me claim that data quality work is fun. Surely the endless, tedious and repetitive tasks involved in finding and cleaning dirty data cannot be fun. However, data quality work is not just that; it also involves the exciting areas of exploring technologies and methodologies for making data quality management simple. Here are some examples I find interesting:

Automatic Data Quality Rule Discovery

A set of comprehensive and suitable data quality rules is critical for an accurate data quality assessment. However, this may not be as simple as you assume, as you may find it impossible to identify all the patterns in a dataset manually. Automatic data quality rule discovery can significantly reduce DQ analysts’ workload on data profiling and analysis; more importantly, an automatic approach is capable of a full-coverage scan to identify unobvious data patterns.

Data Quality Assessment Framework

This is another interesting area. I have some questions regarding data quality assessment and am interested in seeking the answers, for example:

  • How do we measure data accuracy? I have done some literature reviews and researched the data quality products currently available in the market, but I haven’t found a proper solution. So, is data accuracy not measurable?
  • Many data quality products use a numerical data quality ‘score’ which aggregates the sub-metric scores. However, how should the ‘score’ be interpreted? Is a dataset with a DQ score of 80 better than another dataset with a DQ score of 75?

Ongoing Data Quality Monitoring

For a data engineer, this is within my comfort zone. There are some interesting areas to explore, such as efficient data quality monitoring of big datasets, and data quality trend capturing and analysis.

Data Quality is Important and Fun in 2050

In 2050, there will be no such thing as “AI” or “big data”. People may still be concerned with data processing performance, as the volume of data generated from everywhere and everything keeps exploding. However, the computing power at that time will be capable of handling that scale (will quantum computing become mainstream? Possibly not, but who knows; never underestimate the pace of technological advance).

In 2050, data quality will still be important: “garbage in, garbage out” remains true as long as the data (or the machines producing the data) is made by, used by and managed by human beings, unless robots have taken over the world (do I still need to work then? Why would robots need me to work? Would they keep human beings as slaves or pets?)