Data Quality is Important
Most of the time I don’t think I am an absolutist; however, I have become more and more certain that data quality is the root of all evil. Not only should a bigger portion of project time be allocated to data quality management, but lean, agile and faster-responding data quality management methodologies also need to be explored, to make the data quality improvement process less painful and more effective.
There is no doubt that the buzzwords “big data”, “AI” and “streaming” can decorate your data project as something “innovative” and “high-tech” (and make you not just that data guy but suddenly the posh kid). However, I have found they are hardly the deciding factors of a data project’s success. With the jaw-dropping pace of advances in data technologies, “big data” is not that “big”, AI is not so “intelligent”, and the “streaming” concept may not exist much longer as it is being unified with “batch” processing.
Not long ago, performance and scalability could be a deciding factor of a data project’s success. As the processing power of a single machine could not handle the volume of a big dataset (maybe not big by today’s standard), the aggregations of measures against all possible dimension attributes needed to be pre-calculated and cached in cubes. BI practitioners had to allocate much project time to implementing “best practices” and “tuning tricks” to make the process a bit faster. However, distributed and parallelised data processing models now give BI practitioners more options to ensure good performance and scalability without spending too much effort on tricky performance tuning. For example, in one of my previous projects I needed to optimise a data processing pipeline which took 8+ hours to run. Instead of diving into thousands of lines of code to identify small performance improvement opportunities bit by bit, I parallelised the data processing logic and split the workload across 25 nodes with 100 cores running in parallel in total, which reduced the processing time from 8+ hours to 15 minutes.
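The split-and-parallelise idea can be sketched in a few lines of Python. This is a toy illustration, not the actual pipeline from that project: the `run_parallel` helper, the worker count and the `process_partition` placeholder are all my own assumptions, and a real cluster job would use a framework such as Spark with partitions mapped to nodes rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(rows):
    # Placeholder for the expensive per-record logic that used to run
    # sequentially; here it just doubles each value.
    return [r * 2 for r in rows]

def run_parallel(data, n_workers=4):
    # Split the dataset into roughly equal chunks, one per worker,
    # process the chunks concurrently, then stitch the results back
    # together in the original order (map preserves input order).
    chunk = max(1, len(data) // n_workers)
    partitions = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(process_partition, partitions)
    return [row for part in results for row in part]
```

The point is the shape of the solution: once the per-record logic is isolated in a pure function, scaling out is a matter of splitting the input, not rewriting thousands of lines.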
As those buzzwords (“big data”, “AI” and “streaming”) are not the deciding factors of a data project’s success, and performance and scalability will no longer be the deciding challenge, what about the other elements of a data project, such as project management, maintainability, disaster recovery, and more? As a non-absolutist, I feel obliged to state that any element or task in a data project can decide the success or failure of the project, and in an absolute sense they are all equally important. Data quality simply has a higher likelihood of being the deciding factor, for the following reasons:
- Many other people say so
- Garbage in, garbage out: no matter how well the other parts of your data solution have been built, as long as the data is useless, the whole solution is useless
- There are so many ways for data to go wrong: incomplete, duplicate, inconsistent, dated… I remember reading a paper which lists 30 or 40 dimensions along which data can go wrong.
- People lack awareness of, or hold misperceptions (most of the time over-optimistic) about, their data quality. It is not surprising to see an art-class dashboard full of nonsense data
- The people who build a data solution are often not those who truly understand the data and the business logic it represents (unfortunately, the people who truly understand the data are not always the people with the skills to build a data solution)
- Data quality management is a continuous process. An effective management process and tooling support are necessary, but few data projects have them.
- There is no absolutely good or bad data, and the definition of good or bad can vary significantly across different usages. That is actually why data quality is defined as “fitness for use”. The ‘good’ data for one group of people might be ‘bad’ for another group.
Data Quality is Fun
You might be surprised to hear me claim that data quality work is something fun. Surely the endless, tedious and repetitive tasks involved in finding and cleaning dirty data cannot be fun. However, data quality work is not just that; it also involves the exciting work of exploring technologies and methodologies that make data quality management simple. Here are some examples I find interesting:
Automatic Data Quality Rule Discovery
A set of comprehensive and suitable data quality rules is critical for an accurate data quality assessment. However, this may not be as simple as you might assume, as it is impossible to identify all the patterns in a dataset manually. Automatic data quality rule discovery can significantly reduce DQ analysts’ workloads on data profiling and analysis, and, more importantly, an automatic approach is capable of a full-coverage scan that identifies unobvious data patterns.
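To make the idea concrete, here is a minimal sketch of what rule discovery over a single column could look like. The function name and the candidate rule set (not-null, uniqueness, numeric range) are my own illustrative choices, far simpler than a real discovery engine; the proposed rules would still need an analyst’s review before being enforced.

```python
def discover_rules(values):
    """Infer candidate data quality rules from a sample of a column.

    A toy illustration of automatic rule discovery: it proposes
    not-null, uniqueness and numeric-range rules based on what the
    sample actually exhibits.
    """
    non_null = [v for v in values if v is not None]
    rules = []
    if len(non_null) == len(values):
        # No nulls observed in the sample, so propose a not-null rule.
        rules.append("not_null")
    if len(set(non_null)) == len(non_null):
        # No duplicates among the non-null values.
        rules.append("unique")
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        # A numeric column gets a candidate value-range rule.
        rules.append(f"range[{min(non_null)}, {max(non_null)}]")
    return rules
```

For example, `discover_rules([3, 1, 4, 1])` proposes a not-null rule and a range rule, but no uniqueness rule because the value 1 repeats.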
Data Quality Assessment Framework
This is another interesting area. I have some questions regarding data quality assessment and am interested in seeking the answers, for example:
- How do we measure data accuracy? I have done some literature review and researched the data quality products currently available in the market, but I haven’t found a proper solution. So, is data accuracy simply not measurable?
- Many data quality products use a numerical data quality ‘score’ which aggregates the sub-metric scores. However, how should we interpret the ‘score’? Is a dataset with a DQ score of 80 better than another dataset with a DQ score of 75?
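The second question is easy to demonstrate: because a score is a weighted aggregate, very different quality profiles can collapse to the same number. In the toy sketch below (the metric names and weights are invented for illustration), a dataset with perfect completeness but weak accuracy scores exactly the same as one with weak completeness but better accuracy.

```python
def dq_score(metrics, weights):
    # Weighted average of sub-metric scores, each in [0, 100].
    total = sum(weights.values())
    return sum(metrics[k] * w for k, w in weights.items()) / total

weights = {"completeness": 1, "consistency": 1, "accuracy": 2}
a = {"completeness": 100, "consistency": 100, "accuracy": 60}
b = {"completeness": 60, "consistency": 100, "accuracy": 80}
# Both datasets aggregate to a score of 80.0, yet their quality
# problems are entirely different.
```

So a single score tells you neither which problems a dataset has nor whether it is fit for your particular use.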
Ongoing Data Quality Monitoring
For a data engineer, this is within my comfort zone. There are some interesting areas to explore, such as efficient data quality monitoring of big datasets, and the capturing and analysis of data quality trends.
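As a sketch of what trend monitoring could look like, the snippet below computes a completeness metric per batch and flags a sustained downward trend. The metric choice, window size and trend rule are my own simplified assumptions; a production monitor would track many metrics and use more robust statistics.

```python
def completeness(batch):
    # Share of non-null values in a batch, as a percentage.
    return 100 * sum(v is not None for v in batch) / len(batch)

def deteriorating(scores, window=3):
    # Flag a downward trend: each of the last `window` scores is
    # strictly lower than the one before it.
    recent = scores[-window:]
    return len(recent) == window and all(
        recent[i] > recent[i + 1] for i in range(window - 1)
    )

history = [completeness(b) for b in [
    [1, 2, 3, 4],        # fully populated batch
    [1, None, 3, 4],     # one null creeps in
    [1, None, None, 4],  # two nulls
]]
```

Here `history` is `[100.0, 75.0, 50.0]`, so `deteriorating(history)` raises the alarm before completeness collapses entirely.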
Data Quality is Important and Fun in 2050
In 2050, there will be no such things called “AI” or “big data”. People may still be concerned with data processing performance, as the volume of data generated from everywhere and everything keeps exploding. However, the computing power of that time will be capable of handling that scale (will quantum computing become mainstream? Possibly not, but who knows; never underestimate the speed at which technology advances).
In 2050, data quality will still be important. “Garbage in, garbage out” will hold true as long as the data (or the machines producing data) is made by, used by and managed by human beings, unless robots have taken over the world (do I still need to work then? Why would robots need me to work? Would they keep human beings as slaves or pets?)