What is Data Management, actually? – DAMA-DMBOK Framework

“What is data management?” I guess many people would answer (at least I think I would), “em… data management is managing data, right?”, while swearing in their heads, “what a stupid question!”

However, if I were asked this question in a job interview, I had better provide a somewhat longer answer, such as the one given by DAMA cited below, if I could ever memorise it.

Data Management is the development, execution, and supervision of plans, policies, programs, and practices that deliver, control, protect, and enhance the value of data and information assets throughout their lifecycles.

DAMA-DMBOK

If the interviewers asked me to elaborate further, it could be a challenge, as data management has so many facets and aspects. It requires many interdependent functions, each with its own goals, activities, and responsibilities, and it is difficult even for a data management professional to keep track of all the components and activities involved.

Fortunately, DAMA developed the DMBOK framework, which organises data knowledge areas in a structured form and enables data professionals to understand data management comprehensively.

I have recently been diving into the heavy DAMA DMBOK book (“heavy” in its literal sense, the book weighs 1.65 kg!). I recommend that all data professionals give this book a read. It not only connects the dots in your knowledge network into a comprehensive understanding of data management, but also provides a common language for communicating in the data world (instead of just nodding and smiling in a meeting when hearing some data jargon).

The DAMA DMBOK framework defines 11 functional areas of data management.

As you can see from the DAMA Wheel above, Data Governance is at the heart of all the data management functional areas. Data governance provides direction and oversight for data management to ensure the secure, effective, and efficient use of data within the organisation. The other functional areas include:

  • Data Architecture – defines the structure of an organisation’s logical and physical data assets and data management processes throughout the data lifecycle
  • Data Modelling and Design – the process of discovering, analysing, representing, and communicating data requirements in the form of a data model
  • Data Storage and Operations – the process of designing, implementing, and supporting data storage
  • Data Security – ensuring data is accessed and used properly, with data privacy and confidentiality maintained
  • Data Integration and Interoperability – the process of designing and implementing data movement and consolidation within and between data sources
  • Document and Content Management – the process of managing data stored in unstructured media
  • Reference and Master Data – the process of maintaining the core critical shared data within the organisation
  • Data Warehousing and Business Intelligence – the process of planning, implementing, and controlling the processes for managing decision-support data
  • Metadata – managing the information about the data in the organisation
  • Data Quality – the process of ensuring data is fit for use.

Based on the functional areas defined by the DAMA Wheel, Peter Aiken developed the DMBOK pyramid, which defines the relationships between those functional areas.

From the DMBOK pyramid, we can see that the top of the pyramid is the “golden” function that adds the most value for the business. However, the pyramid also reveals that data analytics is only a very small part of an organisation’s data system. To make data analytics workable, many other functions need to work and collaborate seamlessly to build the foundation.

As the DMBOK pyramid shows, data governance sits at the bottom and forms the foundation of the whole data system. Data architecture, data quality, and metadata make up another layer of logical foundation on top of data governance. The next layer up includes data security, data modelling & design, and data storage & operations. Based on that layer, the data integration & interoperability function can move and consolidate data to enable the functions in the layer above: data warehousing/business intelligence, reference & master data, and document & content management. From this layer upwards, the functions start to be business facing, which also means the business cannot see the functions that have to run underneath.

The practical contribution of the DMBOK pyramid is to reveal the logical progression of steps for constructing a data system. It defines four phases for an organisation’s data management journey:

Phase 1 (Blue layer) – An organisation’s data management journey starts with purchasing applications that include database capabilities, which means data modelling & design, data storage, and data security are the first functions to be put in place.

Phase 2 (Orange layer) – The organisation starts to feel the pain of bad data quality. To improve data quality, reliable metadata and consistent data architecture are essential.

Phase 3 (Green layer) – As data quality, metadata, and data architecture develop, the need for structural support for data management activities starts to appear, revealing the importance of a proper data governance practice. At the same time, data governance enables the execution of strategic initiatives, such as data warehousing, document management, and master data management.

Phase 4 (Red layer) – Well-managed data enables advanced analytics.

The four phases of the DMBOK pyramid seem to make sense; however, from my own experience, organisations often take different routes, mostly rushing into high-profile initiatives such as data warehousing, BI, and machine learning when no data governance, data quality, or metadata management is in place.

The DAMA framework contains many knowledge points (no surprise that a 1.65 kg book is required to cover it). This blog post presents a high-level view of the framework. I plan to share further thoughts in the next few blog posts as I digest the book.

How Azure Storage Cheats Over the CAP Theorem

Microsoft claims that Azure Storage provides both high availability and strong consistency. That sounds good but appears to violate the CAP theorem, as the ‘P’ (network partitioning) is not avoidable in the real world. In theory, you can only achieve either high availability or strong consistency in a distributed storage system. I have done a bit of research and found out how Azure Storage works around the CAP theorem.

The figure below shows the high-level architecture of Azure Storage (from the SOSP paper authored by the Microsoft product team).

Azure Storage stores data in Storage Stamps. Each Storage Stamp physically consists of a cluster of N racks of storage nodes, with each rack built as a separate fault domain. There are three functional layers in each Storage Stamp: the Front-Ends, the Partition Layer, and the Stream Layer. The Stream Layer is where the data is stored; it can be thought of as a distributed file system layer, where “Streams” are ordered lists of storage chunks called Extents. Extents are the units of replication and are distributed across different nodes in a storage stamp to support fault tolerance. Similar to Hadoop, the default replication policy in Azure Storage is to keep three replicas of each extent.

For Azure Storage to successfully cheat the CAP theorem and provide high availability and strong consistency at the same time under network partitioning, the stream layer needs to respond to client requests immediately, and reads from any replica of an extent need to return the same data. In the SOSP paper, Microsoft describes the solution implemented in Azure Storage as a layered design, with the stream layer responsible for high availability and the partition layer responsible for strong consistency. However, I have to admit I was confused by the explanations presented in the SOSP paper.

Instead, I found the core idea behind the Azure Storage implementation rather simple: offer strong consistency at the cost of higher latency; in other words, enforce data replication on the critical path of write requests. The figure below shows the journey of a write request at the stream layer.

In a nutshell, extents are append-only and immutable. For an extent, every append is replicated three times across the extent replicas synchronously. A write request can only be marked as successful when the writes to all replicas have succeeded. Compared to the asynchronous replication that is often used to offer eventual consistency, synchronous replication will no doubt cause higher write latency; however, the append-only data model should be able to compensate for that.
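
To make the mechanism concrete, here is a minimal, hypothetical sketch of synchronous append replication; it is not Azure Storage’s actual code, just an illustration of an append being acknowledged only after all three extent replicas have accepted it.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Replica:
    # One replica of an extent on a storage node (illustrative only).
    node: str
    blocks: List[bytes] = field(default_factory=list)

    def append(self, block: bytes) -> bool:
        # In a real system this call can fail due to node or network faults.
        self.blocks.append(block)
        return True

@dataclass
class Extent:
    # An append-only unit of replication, sealed (immutable) once full.
    replicas: List[Replica]

    def append(self, block: bytes) -> bool:
        # Synchronous replication on the critical write path: the append is
        # acknowledged to the client only if every replica accepts it.
        return all(replica.append(block) for replica in self.replicas)

# Three replicas per extent, mirroring the default replication policy.
extent = Extent(replicas=[Replica("node-1"), Replica("node-2"), Replica("node-3")])
assert extent.append(b"record-1")  # success only when all three replicas acknowledge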

It is worth mentioning that synchronous replication is only supported within a storage stamp. Inter-stamp replication still takes an asynchronous approach. This design decision is understandable, as putting geo-replication on the synchronous critical write path would cause much worse write latency.

dqops – Query Databricks Database Schema through SQL Connector for Python

dqops Data Quality Studio (DQS) is one of the R&D projects I have been working on in my spare time. I plan to note down some of the tips and tricks I use in this project in future blog posts.

Databricks is one of the main data services that the dqops DQS is going to support. This blog post notes down the approach for querying Databricks database schemas using the “Databricks SQL Connector for Python” library.

The “Databricks SQL Connector for Python” library supports querying Databricks databases through the Spark Thrift Server (STS), which allows JDBC/ODBC clients to execute SQL queries over the JDBC or ODBC protocols on Apache Spark. dqops DQS uses a similar approach to connect to on-premises Spark clusters with the jaydebeapi library. However, the “Databricks SQL Connector for Python” library simplifies the setup and does not require a JDBC driver to be installed in the client environment.

Setup

The setup of the “Databricks SQL Connector for Python” library is simple. You just need to include the Python library reference “databricks-sql-connector==0.9.0” in the requirements.txt of the client environment from which your code needs to query the Databricks databases. Alternatively, you can manually install the library using “pip install databricks-sql-connector”.

Databricks Connection

The “Databricks SQL Connector for Python” library uses the same connection interface as the other JDBC/ODBC Python libraries, such as pyodbc and jaydebeapi. That makes it easy to build the Databricks connector on top of the existing dqops data connection framework that I have built, based on the open-source sodasql Dialect module, to support different types of data repositories, such as Azure Database, Databricks, Azure Synapse, and Snowflake.

To create a Databricks connection, we just need to import the “sql” module and call the “connect” function (lines 11–15 in the following figure).

The “connect” function requires the hostname and http_path of the Thrift server deployed on the target Databricks cluster, as well as a Databricks access token. The hostname and http_path can be found in the “Advanced Options” section of the cluster settings page.

A Databricks access token can be generated on the “User Settings” page of the Databricks workspace.
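
Since the figure is not reproduced here, the following is a minimal sketch of what the connection code looks like with databricks-sql-connector 0.9.x; the hostname, HTTP path, and access token values are placeholders that you need to replace with your own.

from databricks import sql

# Placeholder values: take the hostname and HTTP path from the cluster's
# "Advanced Options" (JDBC/ODBC tab) and generate the token in "User Settings".
connection = sql.connect(
    server_hostname="<workspace-hostname>.azuredatabricks.net",
    http_path="<cluster-http-path>",
    access_token="<personal-access-token>",
)
cursor = connection.cursor()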

As dqops DQS needs to allow end-users to create and test connections for a specific data source type, a UI component has been created that generates the data connection form based on predefined connection-type metadata.

Query Metadata

After the connection instance is created, we can run SQL commands against the Databricks databases using the “execute_sql” function, passing the SQL statement in as the parameter.

Spark provides a set of SQL commands for reading database/table/column metadata. In dqops DQS, the “SHOW TABLES” command is used to read the list of tables in a specified database, and the “DESCRIBE TABLE” command is used to read the column names and types of a data table.
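
Below is a minimal sketch of those two commands issued through the connector’s standard cursor interface (dqops wraps this in its own “execute_sql” helper, which is not shown here); the database and table names are placeholders.

from databricks import sql

# Placeholder connection details, as in the earlier sketch.
connection = sql.connect(server_hostname="<workspace-hostname>.azuredatabricks.net",
                         http_path="<cluster-http-path>",
                         access_token="<personal-access-token>")
cursor = connection.cursor()

# List the tables in a database; each row holds database, tableName, isTemporary.
cursor.execute("SHOW TABLES IN sales")
for row in cursor.fetchall():
    print(row)

# Read the column names and types of a table; each row holds col_name, data_type, comment.
cursor.execute("DESCRIBE TABLE sales.orders")
for row in cursor.fetchall():
    print(row)

cursor.close()
connection.close()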

More Spark SQL commands related to metadata can be found here.

The snapshot below shows the metadata UI in the dqops DQS that is built using the Databricks SQL Connector for Python.

Setup a Dockerised Spark Development Environment with VS code and Docker

Databricks is not cheap, especially when I need to use it for my personal R&D work (where, unfortunately, the money has to come from my own pocket). Therefore, I have been developing in a dockerised Spark environment for a while and have found that this approach works well. Here are the steps to set the environment up with VS Code and Docker.

1. Setup VS code remote development in Docker containers

Please refer to this guide for setting up VS Code remote development in Docker containers. In addition, docker-compose needs to be installed as well.

2. Setup base Spark cluster docker images

There are a number of open-source Spark cluster Docker images available online that can be used as the base for our dev environment. After some hands-on tests, I found that this one from André Perez suits best; it consists of a Spark base image, a Spark master image, a Spark worker image, and a JupyterLab image. A simulated Hadoop distributed file system is also configured as the storage.

Figure: cluster architecture of the base Spark docker images.

Apart from the Dockerfiles for the images, there are also a build.sh and a build.yml for configuring and building the images. In addition, a docker-compose file is provided to run up the cluster.

3. Customise the Spark cluster

We can then customise the Spark cluster to suit the requirements of our dev environment. To give an example, here is a list of customisations I made for my dev environment:

Add the additional components/services required for the development, such as Microsoft SQL Server (docker image), Hadoop (docker image), and OpenLDAP (docker image). You can add the build commands for those additional images to the build.sh and build.yml scripts to simplify the image build process. In addition, you need to add corresponding entries for each image in the docker-compose file.

Customise the JupyterLab container as the primary dev container. The Dockerfile of the JupyterLab image already contains the installation of the libraries necessary for Spark development. You can install the additional libraries required for your development by extending this Dockerfile. For example, I have added pyodbc to connect to MS SQL Server, the Spark JDBC jar to connect to the Spark Thrift Server, and a requirements.txt file containing all the dependency Python libraries I need.

# Copy the Python dependency list and the Spark JDBC driver into the image
COPY docker/jupyterlab/requirements.txt /opt/req/requirements.txt
COPY docker/jupyterlab/jdbc/SparkJDBC42.jar /opt/req/SparkJDBC42.jar

# Register the Microsoft package repository, install the MS SQL ODBC driver,
# then install pyodbc and the remaining Python dependencies
RUN curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add - && \
    curl https://packages.microsoft.com/config/debian/9/prod.list > /etc/apt/sources.list.d/mssql-release.list && \
    apt-get update && \
    ACCEPT_EULA=Y apt-get -y install msodbcsql17 && \
    apt-get -y install unixodbc-dev && \
    pip3 install pyodbc && \
    pip3 install -r /opt/req/requirements.txt

# Expose the Spark JDBC driver on the Java classpath (path matches the COPY above)
ENV CLASSPATH /opt/req/SparkJDBC42.jar

4. (Optional) Push images to Azure Container Registry

Please refer to this doc.

5. Build images

Extend the build.sh and build.yml to include entries for the additional images you have added, and then run the build.sh shell script.

6. Map the source file folder to the working folder in the dev container

Map your source file folder (on your local machine) to a folder in the dev Docker container (extended from the original JupyterLab container) by configuring the “volumes” of the dev container entry in your docker-compose file.

7. Run docker-compose up to start containers
8. Attach VS code to the dev container

Start VS code window, open the “command palette”, and select “Remote-Containers: Attach to Running Container…”.

After the VS Code window is attached to the dev container, install the “Python Extension for Visual Studio Code” (install it inside the dev container, not on the local host machine).

Open the working folder you mapped in step 6.

You now have a dev environment in a Docker container where you can author, run, and debug your code.

Here is the code for creating a SparkSession instance with Delta Lake support.
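
The snippet itself does not render here, so below is a minimal sketch of what it typically looks like. The master URL and the delta-core package version are assumptions; adjust them to match the service name in your docker-compose file and your Spark version.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dockerised-spark-dev")
    # Assumed master URL; use the Spark master service name from your docker-compose file.
    .master("spark://spark-master:7077")
    # Pull in Delta Lake and register its SQL extension and catalog.
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

print(spark.version)  # quick sanity check that the session is up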

Why I Prefer Hand-Coded Transformations over ADF Mapping Data Flow

Firstly, I need to clarify that what I am discussing in this blog post concerns only ADF Mapping Data Flow, not the whole ADF service. I am not going to challenge ADF’s role as a superb orchestration service in the Azure data ecosystem. In fact, I love ADF. At the control flow level, it is neat, easy to use, and can be very flexible if you are a little bit creative and imaginative. To avoid potential confusion, I define “Hand-Coded Transformations” and “ADF Mapping Data Flow” in this blog post as:

  • Hand-Coded Transformations – programming transformation logic in code (PySpark, SQL, or others) and orchestrating the execution of that code using ADF data transformation activities in the control flow, such as Databricks activities, the Stored Procedure activity, and the custom activity.
  • ADF Mapping Data Flow – programming transformation logic using the ADF Mapping Data Flow visual editor.

I totally understand why people prefer a GUI-based, ready-to-use, code-free ETL tool to manually crafted data wrangling/transformation code. Intuitively, this is a no-brainer choice for people who view it from 10,000 feet. However, the devil is in the details. Here is the list of reasons why I prefer hand-coded transformations to ADF Mapping Data Flow:

  • One of the main drivers for using ADF Mapping Data Flow is possibly the simplicity and flat learning curve. This may be true for simple, generic transformation logic. However, for more complicated, domain-specific requirements, ADF Mapping Data Flow is not necessarily simpler than writing code in SQL or Python: it either requires you to create complex data flow logic, or you end up embedding custom code in the data flow anyway. On the other hand, the hand-coded approach can be much simpler thanks to the flexibility of custom coding and the huge number of ready-to-use data wrangling/transformation libraries.
  • You may argue that ADF Mapping Data Flow is for people who don’t know how to code. However, ADF does not play the role of a self-service ETL tool in the Azure data ecosystem (Power Query does). The target users of ADF should be data engineers or DataOps, not data analysts. I understand some junior data engineers may resist coding and shy away without even trying. ADF Mapping Data Flow gives them the opportunity to escape and actually deprives them of the motivation to learn coding.
  • We can implement some validation logic in an ADF Mapping Data Flow, but it would be very high-level and simple; it does not allow us to conduct comprehensive unit tests on the transformation logic.
  • One of the main reasons I prefer coding over a visual editor is reusability, including the reusability of domain-specific business logic and of domain-independent non-functional logic such as auditing, QA, and exception handling. As ADF Mapping Data Flow works in a declarative paradigm, it is difficult to abstract and generalise the logic implemented in it. On the other hand, abstraction and generalisation are easy to implement with custom coding.
  • Custom coding allows you to build configurable, metadata-driven pipelines, which can significantly improve the maintainability of your ADF pipelines (a simplified sketch of this idea follows after this list). One of my previous blog posts describes some of my early designs of metadata-driven ADF pipelines with custom coding in SQL; if you are interested, you can find the blog post here.
  • The last but also the most important reason for me to favour the hand-coded approach is the control and flexibility it gives me to withstand the uncertainty in my projects. Just like any other IT project, it is not realistic to predict and control all possible situations in a data engineering project: there are always surprises popping up or user requirement changes. A code-free tool like ADF Mapping Data Flow hides the implementation under the hood and requires you to work in its way. If an unpredicted situation happens and the code-free tool cannot handle it, it is expensive to change the tool at a late stage of a project. Compared to a code-free tool, hand-coding gives you more flexibility and resilience to handle unpredicted situations.
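
To illustrate the reusability and metadata-driven points above, here is a simplified, hypothetical sketch (not the design from the referenced post); the paths, column names, and config structure are made up for the example.

from pyspark.sql import DataFrame, SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def add_audit_columns(df: DataFrame, batch_id: str) -> DataFrame:
    # Domain-independent, reusable logic: every dataset gets stamped the same way.
    return (df
            .withColumn("audit_batch_id", F.lit(batch_id))
            .withColumn("audit_loaded_at", F.current_timestamp()))

def apply_column_mapping(df: DataFrame, mapping: dict) -> DataFrame:
    # Metadata-driven renaming: the mapping lives in configuration, not in the code.
    for source_col, target_col in mapping.items():
        df = df.withColumnRenamed(source_col, target_col)
    return df

# Hypothetical pipeline metadata; in practice this would come from a control
# table or config store rather than being hard-coded.
pipeline_config = {
    "source_path": "/landing/customers",
    "column_mapping": {"cust_no": "customer_id", "cust_nm": "customer_name"},
}

df = spark.read.parquet(pipeline_config["source_path"])
df = apply_column_mapping(df, pipeline_config["column_mapping"])
df = add_audit_columns(df, batch_id="2021-06-30-001")
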
Data Quality Improvement – DQ Dimensions = Confusions

DQ Dimensions are Confusing

Data quality dimensions are great inventions from our data quality thought leaders and experts. Since the concept of quality dimensions was originally proposed in the course of the Total Data Quality Management (TDQM) program at MIT in the 1980s [5], a large number of data quality dimensions have been defined by people from different backgrounds and industries. A survey conducted by DAMA NL in 2020 identified 127 data quality dimensions from 9 authoritative sources [2]. The DQ dimension word cloud created by MIOSoft [7] demonstrates the scale of DQ dimensions perfectly.

Figure 1. DQ Dimension Word Cloud from MIOSoft [7]

It might not be that bad to have that many dimensions if they were clearly defined and universally agreed upon. Unfortunately, DQ dimensions are not based on concrete concepts and are not fundamental properties of the data [7]. Instead, they are context-dependent: the same DQ dimension can be interpreted differently in different contexts by different people.

The Effort from DAMA UK

In 2013, DAMA UK organised a working group of data quality experts aiming to tackle the inconsistent understandings and interpretations of DQ dimensions and to define a set of core DQ dimensions that would be an industry standard, well accepted by data professionals. Six core DQ dimensions were defined by the group: Completeness, Uniqueness, Consistency, Accuracy, Validity, and Timeliness [3].

Figure 2. DAMA UK Data Quality Dimensions

To a certain extent, these six core DQ dimensions have indeed become the industry standard, well accepted in the data quality community: they have been used in mainstream commercial data quality software (although some tools tweak one or two dimensions, e.g. replacing timeliness with integrity); they have been widely referenced in the academic research community; and they have been used as the standard dimension definitions in government data quality guidance, e.g. the UK Government Data Quality Framework published in 2020 [4].

Confusion Continues

However, despite the six core DQ dimensions being well known and accepted, confusion still exists. For example, the “Accuracy” dimension, arguably the most important data quality dimension [5], is defined as “the degree to which data correctly describes the ‘real world’ object or event being described”. But how do we measure “Accuracy” in practice? The reality is that a gold-standard dataset that can be used as a reference for the ‘real world’ object is not available in most scenarios. In a survey of data quality measurement and monitoring tools conducted in 2019 [5], only one tool, Apache Griffin, was found to support the “Accuracy” dimension, and even it does not strictly follow the definition of “Accuracy” (it compares a target dataset to a source dataset without validating that the source dataset reflects the real-world objects). The same situation applies to the “Timeliness” dimension as well. According to DAMA UK’s definition, “Timeliness” is “the degree to which data represent reality from the required point in time”. Again, a reference to the time the real-world event happened is not available in most real-world scenarios; instead, the time available in a database that represents the event is often the time when the event record was created or modified in the database.
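
As an illustration of this compare-to-a-reference style of accuracy measurement (not Apache Griffin’s implementation), the hypothetical sketch below scores a target dataset against a reference dataset; note that the reference itself is simply assumed to be correct, which is exactly the weakness discussed above.

import pandas as pd

# Hypothetical data: "reference" stands in for the gold-standard source, which
# in practice is rarely a verified picture of the real world.
reference = pd.DataFrame({"customer_id": [1, 2, 3],
                          "city": ["Leeds", "York", "Bath"]})
target = pd.DataFrame({"customer_id": [1, 2, 3],
                       "city": ["Leeds", "Hull", "Bath"]})

# "Accuracy" measured as the share of target values matching the reference.
merged = target.merge(reference, on="customer_id", suffixes=("_target", "_ref"))
accuracy = (merged["city_target"] == merged["city_ref"]).mean()
print(f"city accuracy vs reference: {accuracy:.0%}")  # 67%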

Although DQ dimensions are frequently referenced by data quality software vendors on their websites and in sales materials, a study conducted by Ehrlinger, Rusz and Wöß [5] found that few DQ metrics have actually been implemented to measure the DQ dimensions. As Figure 3 shows, among the popular commercial and open-source DQ software, only Apache Griffin implements a metric to measure “Accuracy” (to a certain extent, as mentioned above), and no software supports the “Consistency” and “Timeliness” dimensions. There is no widespread agreement on the implementation and definition of DQ dimensions in practice. When Ehrlinger, Rusz and Wöß [5] contacted the vendors for further details of the dimensions and metrics in their software, few vendors provided a satisfactory reply on how the dimensions are measured.

Figure 3. DQ Dimensions Implemented in Commercial and Open-Source DQ software

However, it is possibly not fair to blame the DQ software vendors for the confusion around DQ dimension measurements. A DQ dimension itself is not a concrete concept but is context-dependent. It may be relatively simple to define dimensions and metrics to measure them for a specific business domain; however, those dimensions and metrics have little practical relevance for other business domains. Therefore, the question may not be how to create a set of universal dimensions and metrics that can be used by general-purpose DQ software to fit all scenarios, but rather whether we should pursue universal dimensions in the first place.

What Can We Do?

As it is not practical to define universal dimensions and metrics and to use them in general-purpose DQ software, should we bother to use the concept of dimensions in data quality assessments at all? According to Ehrlinger, Rusz and Wöß [5], several DQ tools have shown the capability to measure data quality without referring to dimensions at all, and they suggest that a practical approach is required that does not need DQ dimensions but instead focuses on measuring core aspects, such as missing data and duplicate detection, that can be automated.
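
A hypothetical sketch of what such dimension-free, automatable checks can look like (column names and data are made up for the example):

import pandas as pd

df = pd.DataFrame({
    "order_id": [101, 102, 102, 104],
    "amount":   [25.0, None, 13.5, 13.5],
})

# Missing data: fraction of nulls per column.
missing_ratio = df.isna().mean()

# Duplicate detection: rows sharing a supposedly unique key.
duplicate_keys = int(df["order_id"].duplicated().sum())

print(missing_ratio.to_dict())                       # {'order_id': 0.0, 'amount': 0.25}
print(f"duplicate order_id rows: {duplicate_keys}")  # 1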

However, I think the DQ dimension is a great invention! It creates a common language to express and classify the quality of data. Quality itself is an abstract concept that covers many aspects, which makes it difficult for people to communicate about it. DQ dimensions provide an efficient way to give comprehensive and organised descriptions of data quality.

The confusion is not caused by the DQ dimensions themselves; the problem is that DQ dimensions are interpreted differently in different contexts, for different business domains, and by people from different backgrounds. Due to the context-dependent nature of data quality assessments, it is not realistic to have a set of dimensions with universal definitions/interpretations and uniform metrics to measure them.

Instead of pursuing a global consensus on universal DQ dimensions and using them globally as the common language for describing data quality in any context, DQ dimensions need to be interpreted based on the expected purposes of the data, and consensus on the meanings of the DQ dimensions only needs to be reached in an internal, domain-specific environment. In other words, the DQ dimensions only need to be defined as a common language within the group of people who are relevant to the data, such as the producers, consumers, and governors of the data. As long as consensus on the meaning of the DQ dimensions is reached within that group, the DQ dimensions are effective.

References

[1] C. Batini, M. Scannapieco, Data and Information Quality: Concepts, Methodologies and Techniques. Springer International Publishing, Switzerland, 2016.

[2] A. Black, P. van Nederpelt, Dimensions of Data Quality (DDQ) Research Paper, DAMA NL Foundation, 3 September 2020.

[3] DAMA UK, The Six Primary Dimensions for Data Quality Assessment, 2013

[4] I. Diamond, A. Chisholm, The Government Data Quality Framework, https://www.gov.uk/government/publications/the-government-data-quality-framework/the-government-data-quality-framework, 2020

[5] L. Ehrlinger, E. Rusz, & W. Wöß, A Survey of Data Quality Measurement and Monitoring Tools, 2019, ArXiv, abs/1907.08138.

[6] B. Heinrich, M. Kaiser, and M. Klier, How to Measure Data Quality? A Metric-based Approach. In S. Rivard and J. Webster, editors, Proceedings of the 28th International Conference on Information Systems (ICIS), pages 1–15, Montreal, Canada, 2007. Association for Information Systems 2007.

[7] MIOSoft, Data Quality Dimensions Untangled, https://miosoft.com/resources/articles/data-quality-dimensions-untangled.html

[8] L. Sebastian-Coleman, Measuring Data Quality for Ongoing Improvement: A Data Quality Assessment Framework. Elsevier, Waltham, MA, USA, 2012.