Databricks Lakehouse Breaks Data Warehousing Performance Record – Time to Forget about Data Warehouse?

Lakehouse is faster than Data Warehouse! This is a big deal! Now, what could stop Lakehouse from replacing Data Warehouse? This is the news I have been waiting for since Databricks SQL, the full suite of data warehousing capabilities, was announced last year.

In the 100TB TPC-DS performance benchmark test for data warehousing, Databricks SQL outperformed the previous record by 2.2x. In a separate test conducted by Barcelona Supercomputing Center, Databricks SQL demonstrates 2.7x faster and 12x cheaper than snowflake on the same data warehousing workloads.

Now, what could stop Lakehouse from replacing Data Warehouse?

Functionality-Wise

Databricks SQL, powered by Delta Lake, offers the full suite of data warehousing capabilities such as ACID transactions, fine-grained data security, scalable metadata handling, first-class SQL support, and BI reporting. In addition, new capabilities are being continuously added at a fast pace, such as the latest Low Shuffle Merge feature and SQL custom function feature. This is impressive to see those data warehousing capabilities on top of data lakes. Just a few years ago, we had to write rather awkward code as a workaround for the lack of merge capability when updating data in the data lake.

Regarding the ‘Lake’ side workloads, there is no doubt that Lakehouse outperformed the data warehouses (of course, that is why it is called “lakehouse”) and offers the capabilities that cannot be achieved by the data warehouses, such as native supports of unstructured or semi-structured data and machine learning type of workloads.

Capability-Wise

The born of the data lake is driven by the difficulties that businesses face to handle data at greater volume, variety and speed with a classic data warehouse. Lakehouse is capable of storing and processing very large volumes of data. It is impossible or is too expensive to handle that scale for a classic data warehouse. Lakehouse natively supports unstructured and semistructured data. Lakehouse natively supports streaming data.

Flexibility-Wise

I personally consider Lakehouse being more flexible than the classic data warehouse. Databricks Lakehouse originated from open-source initiatives and adopts an open architecture instead of building into a closed black-box as most of the classic data warehouses. It offers data engineers more options and flexibility to integrate into or extend their lakehouse.

Reliability-Wise

The data warehouse has been around for more than 30 years. It is no doubt that the data warehouse is normally more reliable and robust than Lakehouse under most of the conditions. However, the situation is changing at a fast pace.

Cost-Wise

Databricks Lakehouse is not cheap, especially when you need to pay for both the Databricks Units and the VMs provisioned for supporting it. However, for similar data warehousing workloads and data volumes, Lakehouse has the advantages: low-cost cloud-based storage, elastic and pay as you go computing powers, and the latest serverless SQL feature (Databricks claims a 40% cost saving).

Performance-Wise

Now, get to the decisive factor, the performance, more specifically the interactive query performance. Manageability/Reliability and interactive query performance are two of the biggest hurdles for Lakehouse to be competent for data warehousing workloads. With the rapid advance of Delta Lake capabilities, the hurdle of manageability/reliability is not that formidable.

Databricks and its open-source cousin, Apache Spark, were originally designed for offline processing of big data workloads. In the other words, this is a design favouring high throughput over low latency and people don’t have that high expectation for their performance with interactive queries. However, with the complete rewrite of its processing engine and performance optimisation techniques (caching, cost-based query optimizer, data skipping, data compaction, and so on), Databricks Lakehouse gets to the same performance level as data warehouse.

How about Snowflake, the Most Promising Data Warehouse?

Yes, Snowflake started as a data warehousing company, however, it has been adding more and more data lake features. Even though it takes the opposite route of Databricks, which started as a big data company but has been adding more and more data warehousing features, they are becoming more and more alike. Eventually, a new name would be given to them. It might be “Lakehouse” or not (at the end of the day, they might be called back as “Data Warehouse 2.0” or something else), however, we could expect the new “thing” is capable to make the “big” data (big volume, variety of data structures, and high-velocity) not “big” in future.

dqops Data Quality Rules (Part 2) – CFD, Machine Learning

dqops Data Quality Rules (Part 2) – CFD, Machine Learning

The previous blog post introduces a list of basic data quality rules that have been developed for my R&D data quality improvement initiative. Those rules are fundamental and essential for detecting data quality problems. However, those rules have existed since a long, long time ago and they are neither innovative nor exciting. More importantly, those rules are only capable to detect the Type I, “syntax” data quality problems but are not able to find the Type II, “semantic” data quality problems.

Type II “Semantic” Data Quality Problems

Data quality rule is expected to be an effective tool for automating the process of detecting and assessing data quality issues, especially for a large dataset that is labour-intensive to do manually. The basic rules introduced in the previous blog post are effective to detect the Type I problems that require “know what” knowledge of the data, such as the problems falling in the completeness, consistency, uniqueness and validity data quality dimensions.

Type II data quality problems are those “semantic” problems that require “know-how” domain knowledge and experience to detect and solve. This type of problem is more enigmatic. The data looks all fine on the surface and serves well most of the time. However, the damage caused by them to businesses can be much bigger.

Martin Doyle’s 80:20 rule states around 80% of data quality problems are Type I problems and only 20% of data quality problems are Type II problems. However, the Type II problems cost 80% of effort and budget to solve while the Type I problems only takes 20% of effort and budget.  

Under the pressure of limited project time and budgets, plus that the type II data quality problems are normally less obvious than the Type I problems, Type II data quality problems are often overlooked and keep unnoticeable until damages are made to the business by the bad data. The later the problems are found, the severe the damages could be. Based on the 1:10:100 rule initially developed by George Labovitz and Yu Sang Chang in 1992, the costs of solving quality problems increase exponentially over time. When it might only cost you $1 to detect and solve data quality problems at the early stage, the costs of failure caused by hidden data quality problems could be $100 or more.

Therefore, it is important to detect and solve Type II problems before they have made damage to the business. Unfortunately, the job of solving Type II data quality problems is not easy that requires domain experts to invest a significant amount of time to study the data, identify patterns and locate the issues. This could become mission impossible when the data volume is getting larger and larger. Not only the current manual approach is expensive and time-consuming but also it is highly dependant on the domain experts’ knowledge and experiences in the business processes, including the “devils in the details” type of knowledge that only exists in those domain experts’ heads.

So, can we find an alternative approach to solve the Type II problems without the dependency on the domain experts’ knowledge and experiences? can we identify the patterns and relations existing in a dataset automatically with the help of the computation power from modern computers to exhaustively scan through all possibilities? The Machine Learning and Data Mining techniques might provide an answer.

Based on my literature review, a number of ML/DM techniques are promising for handling Type II data quality problems. In my dqops R&D project, I have been experimenting with those techniques and exploring the potential ways to adapt them in a software app for non-technical DQ analysts. In the rest of the blog post, I am going to describe the ways I have been exploring in my toy dqops app, focusing on Conditional Functional Dependency and self-service supervised machine learning, with a brief introduction of anomaly detection and PII detection features.

Conditional Functional Dependency (CFD)

CFD, originally discovered by professor Wenfei Fan from the University of Edinburg, is an extension to traditional functional dependencies (FD). FD is a class of constraints where the values of a set of columns X determine the values of another set of columns Y that can be represented as X → Y. CFD extends the FD by incorporating bindings of semantically related values that is capable to express the semantics of data attribute dependencies. CFD is one of the most promising constraints to detect and repair inconsistencies in a dataset. The use of CFD allows to automatically identify context-dependent quality rules. Please refer to my previous blog post for a detailed explanation of CFD

CFD has demonstrated its potentials for commercial adoption in the real world. However, to use it commercially, effective and efficient methods for discovering the CFDs in a dataset are required. Even for a relatively small dataset, the workload to manually identify and validate all the CFDs is overwhelming. Therefore, the CFD discovery process has to be automated. A number of CFD discovery algorithms have been developed since the concept of CFD was introduced. In my dqops app, I have implemented the discovery algorithm from Chiang & Miller in Python. Please refer to the paper on this algorithm from Chiang & Miller for details.

The snapshot below shows the dependencies discovered from Kaggle Adult Income dataset.

Based on the discovered dependencies, DQ analysts can create DQ rules to detect any data row that does not comply with the dependency. In dqops DQ studio, there are two rule templates that can be used to define the dependency constraint, the “Business Logics” rule and the “Custom SQL” rule. The snapshot below shows the definition panel of the “Business Logics” rule.

Self-serivce Supervised Machine Learning

One traditional but effective method for detecting data quality problems is to use “Golden Dataset” as a reference and search for records that conflict with the pattern presented in the “Golden Dataset”. This method is feasible for a “narrow” dataset with a manageable number of columns/attributes. However, for a dataset with a large number of attributes for each record, it becomes impossible to manually identify the patterns in the dataset. Fortunately, this happens to be what Machine Learning is good at.

In my dqops R&D project, I have been exploring the potential approaches to take advantage of Machine Learning techniques to identify Type II data quality problems. There are two prerequisites required for this idea to work:

  • A “Golden Dataset”
  • A self-service machine learning model builder

The main steps for this idea are:

  1. DQ analysts prepare the “Golden Dataset”
  2. DQ analysts train and verify a regression or classification model with the “Golden Dataset”
  3. DQ analysts apply the trained model to test target datasets

In the dqops app, I have developed a self-service machine learning model builder that allows DQ analysts without machine learning skills to train machine learning models by themselves. Under the hood, the auto-sklearn package is used for the model training.

I have been trying my best to make the self-service model builder as easy to use as possible. DQ analysts just need to upload the “Golden Dataset”, specify the target column (i.e. the dependent variable) and select the task type, regression or classification. Some advanced settings are also enabled for further tunning for advanced users.

Once submitted, the model will be trained backend. When the model is ready for use, DQ analysts can create an “ML Regression” rule on the datasets with the same schema of the “Golden Dataset”.

The execution of the rule load the model and predict the target column (dependent variable) and calculate the tolerant range of expected value. When a field value falls out of the expected value range, that value is treated as inconsistent.

The decisive factor for this idea to work is the accuracy of the models trained by the self-service builder. Auto-sklearn is able to choose the most performant algorithms and the most optimal hyperparameters based on the cross-validation results automatically. Therefore, the quality of the “Golden Dataset” decides whether the trained model would work or not.

Anomaly Detection

Anomaly detection is a popular use case for commercial AI-enabled data quality monitoring software. It logs the history of data quality observations and detects the anomaly. Anomaly detection is an effective technique to raise awareness of data quality issues. I have also implemented an anomaly detection feature in the dqops app. The approach I have been using is to conduct a time-series analysis of the historical observations using fbprophet package and check whether or not the current observation is in the predicted value range.

PII Detection

Strictly speaking, data privacy is a regulation issue instead of a data quality issue. However, PII handling is often put into the data quality assurance process and treated as a data clean task. In dqops app, I created a PII detection rule that can be applied to a text column and detect any person name. The implementation of PII detection rule is rather simple with SpaCy package.

dqops Data Quality Rules (Part 1) – Basic Rules

In my previous blog post, I discussed the requirements and core elements of the rule-based data quality assessments. In this and next blog posts, I am going to walk through the data quality rules designed in the dqops DQ Studio app (one of my R&D initiatives for data quality improvement), ranging from the basic data quality rules for Type 1 (syntactic) data quality issues to the CFD, Machine Learning powered data quality rules for Type 2 (schematic) data quality issues.

This blog post covers the basic data quality rules for Type 1 data quality issues, including the rules for assessing validity, completeness, accuracy, uniqueness, timeliness and integrity.

Validity

At the time when this blog post is written, four DQ rule templates have been developed in dqops DQ Studio for assessing the validity of a data field:

  • Regex Constraint Rule
  • Range Constraint Rule
  • Length Constraint Rule
  • Domain Constraint Rule

Regex Constraint Rule

Regex Constraint rule allows the DQ analysts to specify the regular expressions of the valid formats a column has to comply with. This rule scans the column to count the values with the valid formats complying with the specified regular expressions and divide by the count of total rows to calculate the validity metric value.

metric = count({values matching regex}) / total rows

The DQ analysts can select one or more regular expressions from pre-defined, commonly used formats for a column, or define custom regular expressions.

When the rule is being executed, the column associated with this rule is scanned to look up the field values that comply with the specified regular expressions. The result is then logged into a backend database and the rows with invalid values are logged into a low-costs, object-based storage account to keep a full history for the data quality trend analysis and issue investigations.


Range Constraint Rule

Range Constraint rule allows the DQ analysts to specify a valid range of a numerical or date type column. This rule scans the target column to detect the values falling outside of specified range.

metric = 1- count({values out of valid range})/total rows

Range Constraint rule detects the data type of the column and generate input controls for DQ analysts to specify the the minimum and maximum numerical values (for numerical column type) or dates (for date column type)

Length Constraint Rule

Similar with the Range Constraint rule, the Length Constraint rule allows DQ analysts to specify the valid length range of a column and detect any field value that falls out of the range, for example, a Length Constraint rule can be used to check the validity of a bank card number column with valid length to be 16 digits.

metric = 1 - count({values out of valid range}) / total rows

Domain Constraint Rule

Domain Constraint rule allows DQ analysts to pre-define a list of valid domain values or “Golden Records” of an entity attribute, such as “Product Code” and “Cost Center Code”. Any field value that is not on the pre-defined list is invalid.

At the time when this blog post is written, DQ analysts can upload and centerally manage the pre-defined domain values (in csv file format) in the dqops DQ studio.

Once the pre-defined domain value csv file is uploaded, DQ analysts can apply the Domain Constraint rule to a column and specify the domain values.

Completeness

Completeness rule scans a column to evaluate the scale of missing values. When the scale of incompleteness of a column reaches to a level, the column can be not usable, especially for LOB datasets with transactional data, missing data implies potential business or IT exceptions or errors.

The missing values refer to not only the syntactic empty or null data type value but also possibly the default values, such as “Not Appliable”, “Default”, 0, 1970-01-01. The default values are either set by client application automatically when a field on the input UI is not filled or users just pick a default value to fill a compulsory form field.

Completeness rule counts the missing values and divide it by the total rows in the dataset and substract the result from 1 to calculate the completeness metric.

metric = 1 - count({missing or default}) / total rows
Accurary

I borrowed the ideas from Apache Griffin for assessing accuracy of a dataset, i.e. comparing the target dataset to a reference dataset that can be treated as reflecting “real world” truth. DQ analysts can create connections to the reference dataset and select columns and aggregate functions to define the comparison rules.

This “Source vs Target” rule can also be used for data reconciliation in a ETL process to validate the target dataset is same with the source dataset.

Uniqueness

Uniqueness rule scan a column to detect the duplicated values.

metrics = count(distinct values) / total rows
Timeliness

Timeliness rule exams the last modified date of each entity in the dataset to detect the entities that are older than the specified period.

metric = count(entities [last updated < (Today - late_allowed)]) / total entities

Referential Integrity

Referential Integrity rule scan a column to detect the invalid reference values.

metric = 1 - count(invalid values) / total rows
Data Quality Improvement – Rule-Based Data Quality Assessment

Data Quality Improvement – Rule-Based Data Quality Assessment

As discussed in the previous blog posts in my Data Quality Improvement series, the key for successful data quality management is the continuous awareness and insights of how fit your data is being used for your business. Data quality assessment is the core and possibly the most challenging activity in the data quality management process. This blog post discusses the requirements and core elements of rule-based data quality assessments.

Requirements of Data Quality Assessments

What do we want from data quality assessments? In short, we want the data quality assessments to tell us the level of fitness of our data for supporting our business requirements. This statement implies two core requirements of data quality assessments:

  • Data quality assessments need to evaluate the quality of data in a business context. Quality is an abstract and multifaceted concept. The criteria to decide the data quality is based on what is the business required for good data.
  • Data quality assessments need to be able to evaluate the level of data fitness. As mentioned above, continuous monitoring is the key to successful data quality management. We need a tool to understand the trend and degree of the data quality evolving.
Rule-Based Data Quality Assessments

For those two requirements of data quality assessments mentioned above, let’s have a think about what we will need in order to fulfil those requirements.

Firstly, to evaluate the fitness of a set of data to a business requirement, we first need to know the criteria the business is using to evaluate whether or not the data is fit to use. A simple and good representation of the business-oriented criteria is a set of constraint-based rules. The fitness can be decided by checking whether or not the constraints are complied. The constraint-based rules not only simplify the representation and organisation of facets of data quality but also acts as an effective common language between the persona involved in the data quality management process.

Secondly, we need an effective representation of the level of data fitness. This representation needs to be able to describe the degree of the data quality changes and also the distances to the different data quality status (e.g., good, ok, bad, very bad) expected by the business. Yes, nothing is better than a numerical value that is normally referred to as a DQ metric. DQ metrics are widely used and studied in both commercial and research communities and they are often interpreted from different angles. Below is my definition of DQ metrics:

  • It represents the quality of a specific attribute of a set of data
  • It is a numerical number in a fixed range (e.g. 0..1 or 0..100)
  • It can represent data quality at a different level

As mentioned before, I personally consider data quality assessment as the most challenging activity in the data quality management process. At the same time, I think defining DQ metrics is the most challenging activity in the data quality assessment process. To ensure the DQ metrics you defined accurately representing the data quality, not only do you need to find a good formula for calculating the metrics, but also you need to take all sorts of business factors into consideration, such as business impacts, urgencies, and criticalities.

Figure 1. Pre-defined DQ Rules in dqops DQ studio
Elements of Data Quality Rules

In the rest of this blog post, I am going to explore the core elements of data quality rules and how they could be defined to support data quality assessments. The dqops DQ studio (the app I have been building for one of my personal R&D initiatives) will be used as examples for discussing those elements of DQ rules.

A data quality rule needs to contain the following five core elements:

  • Business Constraints
  • Metric Formula
  • Alert Thresholds
  • DQ Dimension
  • Metric Aggregation Algorithm

Business constraints specify the conditions for a set of data to qualify to be fit to use in a business scenario. Let’s take the Regular Expression constraint rule in the dqops DQ Studio as an example (as shown in the snapshot below). This rule is used to evaluate whether a column value in a dataset complies with a format specified by a regular expression. For example, a column that stores France postcodes is fit to use only when its values comply with the specified France postcode format.

Figure 2. Regular Expressions Constraint Rule in dqops DQ Studio

DQ metrics are the numerical representations of an aspect of the quality of a dataset. The Metric formula defines the calculation to determine the number. In the Regular Expression constraint rule example, the metric could be defined as the count of column values that comply with the regex divided by the count of rows in the dataset.

In the example above, the result calculated from the metric formula is a number between 0 to 1 or in the format of 0% to 100%. However, a number alone cannot tell whether the set of data is fit or not in the given business context. Alert Threshold is the tool to make the cut. In the Regular Expression constraint rule example, as the snapshot above shown, a warning threshold could be set as 90% while a critical threshold could be set as 70%. That means the quality of the data set (for the format validity aspect defined in regular expressions) is ‘Good’ when the metric number is over 90% and is ‘OK’ when it is between 70% and 90% and is not fit when the number is under 70%. Alert thresholds can be manually defined based on the business/DQ analysts experiences or be automatically defined based on the historic distribution of the observed values.

Figure 3. DQ Monitoring Panel in dqops DQ studio

Another element of a DQ rule is the Dimension that the DQ rule is evaluating into. As I mentioned in my previous blog post, I have a rather mixed feeling about DQ dimension. On one hand, DQ dimensions are context-dependent that could be interpreted differently in a different context by different people. On the other hand, DQ dimensions create a common language to express and classify the quality of data when quality itself is an abstract concept that could represent many aspects. To solve this dilemma, as I suggested before, instead of pursuing a global consensus on the universal DQ dimensions in any context indifferently, the DQ dimensions need to be defined and interpreted based on the expected purposes of the data and the consensus on the DQ dimension meanings need to be made in an internal, domain-specific environment. In the dqops DQ Studio, the dimension of a DQ rule is not pre-defined, but instead, business/DQ analysts are allowed to specify dimension for a DQ rule based on their interpretations of the dimension in their organisation.

The last but not least element of a DQ rule is the Metric Aggregation Algorithm. Data quality needs to be assessed and analysed at different granularity levels, such as column-level, dataset-level, database-level, domain-level, up to organisation-level. To represent the data quality using metrics at different levels, the low-level metrics need to be aggregated into the higher level. The algorithm for aggregating the metrics needs to take the business impacts as weighted variables, such as severity level and priority level of DQ issues.

Figure 4. DQ Dashboard in dqops DQ Studio

What is Data Management, actually? – DAMA-DMBOK Framework

What is Data Management, actually? – DAMA-DMBOK Framework

“What is data management?”. I guess many people will (at least I think I will) answer “em… data management is managing data, right?” at the same time swearing in their heads that “what a stupid question!”.

However, if I was asked this question in a job interview, I guess I’d better to provide a bit longer answer, such as the one given by DAMA cited below if I could ever memorise it.

Data Management is the development, execution, and supervision of plans, policies, programs, and practices that deliver, control, protect, and enhance the value of data and information assets throughout their lifecycles.

DAMA-DMBOK

If the interviewers asked me to elaborate in further detail, it could be a challenge as there are so many facets and aspects of Data Management. Many interdependent functions with their own goals, activities, and responsibilities are required in data management. For a data management professional, it is difficult to keep track of all those components and activities involved in data management.

Fortunately, DAMA developed the DMBOK framework, organising data knowledge areas in a structured form, that enables data professionals to understand data management comprehensively.

I have recently been diving into the heavy reading DAMA DMBOK book (“heavy” is in its literal manner, the book weighs 1.65kg!). I actually recommend all data professionals to give a read of this book. It is not only able to connect the dots in your knowledge network to have a comprehensive understanding of data management but also to provide a common language enabling you to communicate in the data world (instead of just nodding and smiling in a meeting when hearing some data jargon.

DAMA DMBOK framework defines 11 functional areas of data management.

As you can see from the DAMA Wheel above, Data Governance is at the heart of all the data management functional areas. Data governance provides direction and oversight for data management to ensure the secure, effective, and efficient use of data within the organisation. The other functional areas include:

  • Data Architecture – defines the structure of an organisation’s logical and physical data assets and data management processes through the data lifecycle
  • Data Modelling and Design – the process of discovering, analysing, representing and communicating data requirements in the form of data model
  • Data Storage and Operation – the process of designing, implementing, and supporting data storage
  • Date Security – ensuring data is accessed and used properly with data privacy and confidentiality are maintained
  • Data Integration and Interoperability – the process of designing and implementing data movement and consolidation within and between data sources
  • Document and Content Management – the process of managing data stored in unstructured medias
  • Reference and Master Data – the process of maintaining the core critical shared data within the organisation
  • Data Warehousing and Business Intelligence – the process of planning, implementing and controlling the processes for managing the decision supporting data
  • Metadata – managing the information of the data in the organisation
  • Data Quality – the process of ensuring data to be fit to use.

Based on the functional areas defined by the DAMA Wheel, Peter Aiken developed the DMBOK pyramid that defines the relation between those functional areas.

From the DMBOK pyramid, we can see the top of the pyramid is the golden function that is the most value-added for the business. However, the DMBOK pyramid reveals that the data analytics is just a very small part in an organisation’s data system. To make the data analytics workable, a lot of other functions need to work and collaborate seamlessly to build the foundation.

As the DMBOK pyramid shows, data governance is at the bottom that makes the ground foundation for the whole data system. Data architecture, data quality and metadata make up another layer of logical foundation on top of the data governance. The next level of upper layer includes data security, data modelling & design, and data storage & operations. Based on that layer, data integration & interoperability function can work on moving and consolidating data for enabling the functions at the upper layer, data warehousing/business intelligence, reference & master data, and documents & contents. From this layer, the functions start to be business faced. That also means the business cannot see the functions that need to run underneath.

The practical contribution from DMBOK pyramid is to reveal the logical progression of steps for constructing a data system. The DMBOK pyramid defines four phases for an organisation’s data management journey:

Phase 1 (Blue layer) – An organisation’s data management journey starts from purchasing applications that include database capabilities. That means the data modelling & design, data storage, and data security are the functions to be in place at first.

Phase 2 (Orange layer) – The organisation starts to feel the pains from bad data quality. To improve data quality, reliable metadata and consistent data architecture are essential.

Phase 3 (Green layer) – Along with the developments of the data quality, metadata and data architecture, the needs of structural supports for data management activities starts to appear that reveals the importance of a proper data governance practice. At the same time, data governance enables the execution of strategic initiatives, such as data warehousing, document management, and master data management.

Phase 4 (Red layer) – well-managed data enables advanced analytics.

The four phases of DMBOK seem to make sense, however, from my own experiences, organisations often go on different routes, mostly starting to rush on high-profiling initiatives such as data warehousing, BI, machine learning when no data governance, data quality, metadata etc. is in place.

How Azure Storage Cheats Over the CAP Theorem

Microsoft claims Azure Storage providing both high availability and strong consistency. It sounds good but obviously violates the CAP theorem as the ‘P’ (network partitioning) is not avoidable in the real world. In theory, you can only achieve either high availability or strong consistency in a distributed storage system. I have done a bit of research and found out the way how Azure Storage walks around the CAP theorem.

The figure below shows the high-level architecture of Azure Storage (from the SOSP paper authored by Microsoft product team)

Azure Storage stores data in Storage Stamps. Each Storage Stamp physically consists of a cluster of N racks of storage nodes. Each rack is built as a separate fault domain. There are three functional layers in each Storage Stamps, Front-Ends, Partition Layer and Stream Layer. The Stream Layer is where the data is stored and it can be thought of as a distributed file system layer with “Streams” can be understood as ordered lists of storage chunks, namely Extents. Extents are the units of replication that are distributed in different nodes in a storage stamp to support fault tolerance. Similar to Hadoop, the default replication policy in Azure Storage is to keep three replicas for each extent.

For Azure Storage to successfully cheat over the CAP theorem to provide high availability and strong consistency at the same time under the condition of network partitioning, the stream layer needs to be able to immediately response requests from client and the reads from any replica of the extent needs to return the same data. In the SOSP paper, Microsoft describes their solution implemented in Azure Storage as layered design with stream layer responsible for high availability and partition layer responsible for strong consistency. However, I have to admit I was confused with the explanations presented in the SOSP paper.

Instead, I found the core idea behind Azure Storage implementation is rather simple, i.e. offering strong consistency at the cost of higher latency, in the other words, enforcing data replication on the critical path of writing requests. The figure below shows the journey of a write request at the stream layer.

In a nutshell, extents are append-only and immutable. For an extent, every append is replicated three times across the extent replicas in a synchronous mode. A write request can only marked as success when all of the writes to all replicas are successful. Compared to asynchronous replication that is often used to offer eventual consistency, the synchronous replication will no doubt cause higher write latency. However, the append-only data model design should be able to compensate that.

It is worth to mention that the synchronous replication is only supported within a storage stamp. The inter-stamp replication still takes asynchronous approach. This design decision is understandable as having geo-replication on the synchronous critical write path can cause much worse write latency.