This is the second post of my Data Quality Improvement blog series. This blog post discusses the data profiling tasks that I think are relevant to data quality improvement use cases.
For anyone who has ever worked with data, she or he must has already done some sort of data profiling, either using a commercial data profiling software, or writing an ad-hoc SQL query, or just gazing over an Excel spreadsheet. Many tasks or activities can be put into the data profiling basket. To give a simple definition, data profiling is the process to find answers to your questions with the target datasets.
For different use cases, the focus of data profiling is often different. For example, for the data integration use cases and data migration use cases, the focus is to identify the semantics of tables and columns, for data analytics use cases, the focus is to identify the relationships between columns and between rows. Regarding data quality use cases, data profiling has a wider focus as data quality issues can be widely varied and can occur anywhere.
There are many data profiling tasks involved in the data quality use cases. A systematic data profiling process can be helpful for organising the profiling tasks. This blog post suggests a set of data profiling tasks that are relevant to data quality use cases. Those data profiling tasks are categories into three top-level categories that are executed in order:
- Table-level profiling
- Single-column profiling
- Multi-column profiling
Table-level profiling is to determine the metadata and to collect the statistics of a table that includes the following profiling tasks:
- Table schema
- Shape(rows/columns count) and size
- Maintenance attributes (created/last updated datetime, created by)
- Business attributes (business-critical levels, target users)
- Update frequency and growth rate
- Relationships to other tables
Single-column profiling is to capture column attributes and statistics of column values. There are much more profiling tasks involved in single-column profiling compared to table-level profiling. In this section, I categorise those profiling tasks according to data quality dimensions.
The completeness of a single column can be measured by counting the missing values and comparing it to the count of total rows.
It sounds simple enough, right? However, the devil is in the details. The missing values do not have to be the null value or the empty cell. It can be some commonly used default values, such as “NA”, “Default” and “Empty”. It can also be the first option in the dropdown list, some meaning texts such as “abc” and randomly picked date when the users who input data through a UI do not care about that field. Those types of missing values are more difficult to detect but are more important for the data quality improvement tasks. Normally, those types of missing values can be detected from inspecting the least frequent values and outlier detection.
The uniqueness of a single column can be profiled by counting the distinct values and compared it to the count of total rows. Uniqueness constraint can be used to validate the identity columns of a transactional dataset such as orderId in the order table or the attribute columns of a reference dataset such as customers, employees and products.
One of the challenges with uniqueness profiling is to detect those approximate matching records. For example, UK and GB refer to the same country. For some use cases, duplication resolution techniques may be required to support the uniqueness profiling.
In addition, depending on the business requirements, conditional uniqueness profiling is required. Conditional uniqueness refers to the scenarios where the value in a column is not unique at a global level but is unique inside a partition/group. For example, in a dataset holding world country & city records, the city column may not be unique as two cities in the different county may have the same name, such as Durham (the one in the UK and the one in the USA). However, at the country level, this column is indeed unique.
Another useful uniqueness-related column attribute is the monotonicity that indicates an identity column.
The validity of a single column can be profiled from a number of angles, including:
Range – range attribute of a column can be determined by the maximum value and the minimum value of the column. The column can be numerical or timestamp. One thing to note is that improper default values and outliers can affect the accuracy of the range profiling. For example, when 1970-01-01 is used as the null value, an infinite range will be profiled as a range from 1970-01-01 to the specified end date. In addition, some extreme cases may need to be disregarded from the range calculation.
Value Length – value length attribute of a column can be used to validate the format of a column. For example, a credit card number column is expected to have 16 digits number, and a UK address postcode is expected to have 5 to 7 characters.
Patterns – this profiling task is to find the patterns occurring in a column. For example, telephone numbers are following consistent patterns.
Decimals – the decimals in numeric values.
Domains – this profiling task can be conducted to determine the domain values for a single column or to exam the validity of a column to detect any value fallen out of the domain.
Value distributions – value distributions can also provide useful information about the validity of a column value. For example, the least frequent values are the potential candidates for the invalid values.
As you may have noticed, the single-column profiling tasks mentioned above focus to extract the “syntax” information of the data that is useful for solving the Type I data quality problems (please refer to my previous blog post for further details of the Type I and Type II data quality problems). Compared to Type I data quality problems, Type II data quality problems are “semantic” type of problems that require more efforts and budget to tackle (80% of all data quality improvement efforts). Multi-column profiling that is often underused in data projects is good at identifying the relationships between columns and revealing the frequent patterns within a dataset that can be utilised to extract the “semantics” information of the data. In this section, I organise the multi-column profiling tasks based on the inter-column relationships:
Correlation – correlation profiling or correlation analysis is a statistical method to evaluate the strength of the relationships between two continuous, numerical columns. The strength of the relationships can be quantified as a numerical number, i.e. correlation coefficient, ranging between -1 to 1. The positive correlation coefficient represents the same-direction correlation and the closer the correlation coefficient to 1 the stronger the relationships between the columns. On the other hand, the negative correlation coefficient represents the opposite-direction correlation, such as the number of people who takes Covid-19 vaccine and the number of people admitted to hospital due to Covid-19 infections.
Dependency – dependency profiling refers to the activities of identifying the dependencies between columns. The dependency most relevant to data quality improvement use cases is functional dependency. A functional dependency exists when values in a set of column combinations functionally determine the values of another set of column combinations. For example, the value in the “capital” column is “London” can determine the value in the “country” column to be “UK”. For data quality improvement use cases, an extension of traditional functional dependency, conditional functional dependency (CDF), is more effective for capture the semantics of the data. As conditional functional dependency is a very useful tool for detect data quality issues at the semantics level, I will discuss it in a dedicated blog post.
Clustering – clustering profiling refers to the activities of identifying groups of similar records based on the values of a set of columns. The records that do not fit into any group can be a potential candidate for a data error.
In this blog post, I introduce the data profiling tasks that are relevant to data quality improvement use cases. I organised the data profiling tasks into three categories, table-level, single-column and multi-column. Table-level data profiling tasks mainly provide contextual information of a dataset. The single-column profiling tasks is more useful for Type I, “syntax” level of data quality problems while the multi-column profiling tasks that are capable of extracting “semantic” information have the potentials to support the resolution of Type II, “semantic” level of data quality problems.
This blog post only covers the data profiling tasks that I think are relevant to data quality improvement. However, data profiling covers much wider areas and topics, such as cross-table profiling, cross-source profiling, and schematic matching.