Lakehouse is faster than Data Warehouse! This is a big deal, and it is the news I have been waiting for since Databricks SQL, Databricks' full suite of data warehousing capabilities, was announced last year.
In the 100TB TPC-DS benchmark for data warehousing performance, Databricks SQL outperformed the previous record by 2.2x. In a separate test conducted by the Barcelona Supercomputing Center, Databricks SQL came out 2.7x faster and 12x cheaper than Snowflake on the same data warehousing workloads.
Now, what could stop Lakehouse from replacing Data Warehouse?
Databricks SQL, powered by Delta Lake, offers the full suite of data warehousing capabilities, such as ACID transactions, fine-grained data security, scalable metadata handling, first-class SQL support, and BI reporting. In addition, new capabilities are being added at a fast pace, such as the recent Low Shuffle Merge feature and custom SQL functions. It is impressive to see these data warehousing capabilities running on top of data lakes. Just a few years ago, we had to write rather awkward code to work around the lack of a merge capability when updating data in the data lake.
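To make the merge point concrete, here is a minimal pure-Python sketch of upsert semantics, which is the behaviour a merge operation provides: rows that match on a key are updated, and rows that do not match are inserted. This only illustrates the idea and is not how Delta Lake implements it; the function name `merge_upsert` and the sample columns are hypothetical.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def merge_upsert(target: Iterable[Row], source: Iterable[Row], key: str) -> List[Row]:
    """Return a new table: `target` with `source` rows upserted on `key`.

    Hypothetical helper for illustration only, not a Delta Lake API.
    """
    # Index the target rows by key (copies, so the inputs are not mutated).
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        if row[key] in merged:
            merged[row[key]].update(row)   # matched: update the existing row
        else:
            merged[row[key]] = dict(row)   # not matched: insert a new row
    return list(merged.values())


if __name__ == "__main__":
    target = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    source = [{"id": 2, "name": "B"}, {"id": 3, "name": "c"}]
    for row in merge_upsert(target, source, key="id"):
        print(row)
```

Without a merge primitive, the same result on a data lake typically meant reading the affected files, rewriting them by hand, and swapping them in, which is the awkward workaround mentioned above.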
Regarding the ‘Lake’ side workloads, there is no doubt that Lakehouse outperforms data warehouses (of course, that is why it is called “lakehouse”) and offers capabilities that data warehouses cannot match, such as native support for unstructured or semi-structured data and for machine learning workloads.
The birth of the data lake was driven by the difficulties businesses faced in handling data at greater volume, variety and speed with a classic data warehouse. Lakehouse is capable of storing and processing very large volumes of data, a scale that is impossible or too expensive for a classic data warehouse to handle. Lakehouse also natively supports unstructured and semi-structured data, as well as streaming data.
I personally consider Lakehouse to be more flexible than the classic data warehouse. Databricks Lakehouse originated from open-source initiatives and adopts an open architecture, rather than being built as a closed black box like most classic data warehouses. It offers data engineers more options and flexibility to integrate with or extend their lakehouse.
The data warehouse has been around for more than 30 years. There is no doubt that a data warehouse is normally more reliable and robust than Lakehouse under most conditions. However, the situation is changing at a fast pace.
Databricks Lakehouse is not cheap, especially since you need to pay for both the Databricks Units and the VMs provisioned to support it. However, for similar data warehousing workloads and data volumes, Lakehouse has several advantages: low-cost cloud storage, elastic pay-as-you-go compute, and the latest serverless SQL feature (Databricks claims a 40% cost saving).
Now we get to the decisive factor: performance, and more specifically interactive query performance. Manageability/reliability and interactive query performance are two of the biggest hurdles Lakehouse must clear to be competitive for data warehousing workloads. With the rapid advance of Delta Lake capabilities, the manageability/reliability hurdle is no longer that formidable.
Databricks and its open-source cousin, Apache Spark, were originally designed for offline processing of big data workloads. In other words, the design favours high throughput over low latency, and people did not have high expectations for its performance on interactive queries. However, with the complete rewrite of its processing engine and a range of performance optimisation techniques (caching, a cost-based query optimizer, data skipping, data compaction, and so on), Databricks Lakehouse reaches the same performance level as a data warehouse.
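The two-part bill described above (Databricks Units plus the underlying VMs) can be sketched as simple arithmetic. The function `estimate_cost` and every rate in the example are hypothetical placeholders, not actual Databricks or cloud prices; substitute the figures from your own pricing page.

```python
from typing import Dict


def estimate_cost(
    hours: float,
    dbu_per_hour: float,
    price_per_dbu: float,
    vm_count: int,
    vm_price_per_hour: float,
) -> Dict[str, float]:
    """Split a cluster's cost into its DBU part and its VM part.

    Hypothetical helper for illustration; all inputs are assumed rates.
    """
    dbu_cost = hours * dbu_per_hour * price_per_dbu    # paid to Databricks
    vm_cost = hours * vm_count * vm_price_per_hour     # paid to the cloud provider
    return {"dbu": dbu_cost, "vm": vm_cost, "total": dbu_cost + vm_cost}


if __name__ == "__main__":
    # Made-up rates: 10 hours, 8 DBU/hour at 0.50 per DBU, 4 VMs at 1.25 per hour.
    print(estimate_cost(10, 8, 0.5, 4, 1.25))
    # → {'dbu': 40.0, 'vm': 50.0, 'total': 90.0}
```

Because both terms scale with hours, elastic compute that shuts down when idle reduces the DBU and the VM portion at the same time.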
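Of the techniques listed, data skipping is perhaps the easiest to picture. A minimal sketch, assuming the engine keeps a minimum and maximum value per data file for the filtered column: any file whose range cannot overlap the query's range is never read. The `FileStats` class, the `files_to_scan` function, and the file names are hypothetical and only illustrate the idea, not the actual engine internals.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Per-file statistics for one column (hypothetical structure)."""
    path: str
    min_value: int
    max_value: int


def files_to_scan(files: List[FileStats], lo: int, hi: int) -> List[str]:
    """Return paths of files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f.path for f in files if f.max_value >= lo and f.min_value <= hi]


if __name__ == "__main__":
    files = [
        FileStats("part-0", 1, 10),
        FileStats("part-1", 11, 20),
        FileStats("part-2", 21, 30),
    ]
    # A query such as "WHERE value BETWEEN 12 AND 18" only needs part-1.
    print(files_to_scan(files, 12, 18))  # → ['part-1']
```

Skipping works best when similar values are stored together, which is one reason compaction and data layout matter for interactive query latency.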
How about Snowflake, the Most Promising Data Warehouse?
Yes, Snowflake started as a data warehousing company, but it has been adding more and more data lake features. Databricks took the opposite route: it started as a big data company and has been adding more and more data warehousing features. Even so, the two are becoming more and more alike. Eventually, a new name will be given to what they have become. It might be “Lakehouse” or not (at the end of the day, it might be called “Data Warehouse 2.0” or something else). Either way, we can expect the new “thing” to be capable of making “big” data (big volume, a variety of data structures, and high velocity) no longer “big” in the future.