What is a data lakehouse?

Data lakehouses seek to resolve the core challenges of both data warehouses and data lakes, yielding a more complete data management solution for organizations. They represent the next evolution of data management solutions in the market.

A data lakehouse is a data platform that merges the best aspects of data warehouses and data lakes into one data management solution. Data warehouses tend to be more performant than data lakes, but they can be more expensive and limited in their ability to scale. A data lakehouse attempts to solve for this by leveraging cloud object storage to store a broader range of data types: structured, semi-structured, and unstructured data. By bringing these benefits under one data architecture, data teams can accelerate their data processing because they no longer need to straddle two disparate data systems to complete and scale more advanced analytics, such as machine learning.


Data warehouse vs. data lake vs. data lakehouse 

Since data lakehouses emerged from the challenges of both data warehouses and data lakes, it’s worth defining these different data repositories and understanding how they differ.

Data warehouse

A data warehouse gathers raw data from multiple sources into a central repository and organizes it into a relational database infrastructure. This data management system primarily supports data analytics and business intelligence applications, such as enterprise reporting. The system uses extract, transform, load (ETL) processes to move data to its destination. However, data warehouses can become inefficient and costly, particularly as the number of data sources and the quantity of data grow over time.
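The ETL pattern can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline; the source data, field names, and SQLite destination are all hypothetical stand-ins for real sources and a real warehouse.

```python
import csv
import io
import sqlite3

# Extract: read raw records from a CSV source (inlined here for illustration).
raw = io.StringIO("order_id,amount\n1,19.99\n2,5.00\n")
rows = list(csv.DictReader(raw))

# Transform: cast types and derive a cleaned field.
transformed = [
    {"order_id": int(r["order_id"]), "amount_cents": round(float(r["amount"]) * 100)}
    for r in rows
]

# Load: write the cleaned records into a relational store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount_cents INTEGER)")
conn.executemany("INSERT INTO orders VALUES (:order_id, :amount_cents)", transformed)
conn.commit()

total = conn.execute("SELECT SUM(amount_cents) FROM orders").fetchone()[0]
```

Each stage is a natural place for cost and latency to accumulate as sources multiply, which is the scaling limitation noted above.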

Data lake

Data lakes are commonly built on big data platforms such as Apache Hadoop. They are known for their low cost and storage flexibility because they lack the predefined schemas of traditional data warehouses, and they can house many different types of data, such as audio, video, and text. Since data producers largely generate unstructured data, this is an important distinction: it enables more data science and artificial intelligence (AI) projects, which in turn drive more novel insights and better decision-making across an organization. However, data lakes are not without their own set of challenges. Their size and complexity can require more technical resources, such as data scientists and data engineers, to navigate the data they store. Additionally, since data governance is implemented further downstream in these systems, data lakes are more prone to data silos, which can subsequently evolve into a data swamp. When this happens, the data lake can become unusable.

Data lakes and data warehouses are typically used in tandem: data lakes act as a catch-all system for new data, and data warehouses apply downstream structure to specific data from this system. However, coordinating these systems to provide reliable data can be costly in both time and resources. Long processing times contribute to data staleness, and additional layers of ETL introduce more risk to data quality.

Data lakehouse

The data lakehouse optimizes for the flaws within data warehouses and data lakes to form a better data management system. It provides organizations with fast, low-cost storage for their enterprise data while also delivering enough flexibility to support both data analytics and machine learning workloads.


Key features of a data lakehouse

As previously noted, data lakehouses combine the best features of data warehousing with the most optimal features of data lakes. They pair the data structures of data warehouses with the low-cost storage and flexibility of data lakes, enabling organizations to store and access big data quickly and more efficiently while mitigating potential data quality issues. A lakehouse supports diverse datasets, both structured and unstructured, meeting the needs of both business intelligence and data science workstreams. It typically supports programming languages such as Python and R alongside high-performance SQL.

Data lakehouses also support ACID transactions on larger data workloads. ACID stands for atomicity, consistency, isolation, and durability, the key properties that define a transaction and ensure data integrity:

  • Atomicity: all changes to data are performed as if they were a single operation; either all of them take effect or none do.
  • Consistency: data is in a consistent state when a transaction starts and when it ends.
  • Isolation: the intermediate state of a transaction is invisible to other transactions, so transactions that run concurrently appear to be serialized.
  • Durability: after a transaction completes successfully, its changes persist and are not undone, even in the event of a system failure.

This feature is critical in ensuring data consistency as multiple users read and write data simultaneously.
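Atomicity in particular can be demonstrated in miniature. The sketch below uses Python's built-in SQLite as a stand-in transactional store (lakehouse table formats provide these guarantees over object storage at much larger scale); the accounts table and the simulated crash are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()

def transfer(conn, amount):
    # Atomicity: the debit and credit succeed together or not at all.
    with conn:  # opens a transaction; rolls back automatically on exception
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = 'a'", (amount,)
        )
        raise RuntimeError("simulated crash between the debit and the credit")
        # (never reached) the matching credit to account 'b' would go here

try:
    transfer(conn, 100)
except RuntimeError:
    pass

# The half-finished debit was rolled back: no partial change survives.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

Because the failed transaction rolled back, concurrent readers never observe an account that was debited but not credited.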

Data lakehouse architecture

A data lakehouse typically consists of five layers: ingestion layer, storage layer, metadata layer, API layer, and consumption layer. These make up the architectural pattern of data lakehouses.

Ingestion layer

This first layer gathers data from a range of different sources and transforms it into a format that can be stored and analyzed in a lakehouse. The ingestion layer can use a variety of protocols to connect with internal and external sources, such as database management systems, NoSQL databases, social media feeds, and others.
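A minimal sketch of what ingestion means in practice: heterogeneous payloads are wrapped in a common envelope before they reach the storage layer. The source names and record shapes here are invented for illustration.

```python
import json
from datetime import datetime, timezone

# Hypothetical raw payloads from two different kinds of sources.
db_row = {"id": 7, "status": "shipped"}
event_json = '{"user": "ada", "action": "login"}'

def ingest(record, source):
    """Wrap a raw record in a common envelope for the storage layer."""
    return {
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "payload": record,
    }

batch = [
    ingest(db_row, source="orders_db"),
    ingest(event_json if isinstance(event_json, dict) else json.loads(event_json),
           source="clickstream"),
]
```

Downstream layers can then treat every record uniformly, regardless of where it originated.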

Storage layer

In this layer, structured, semi-structured, and unstructured data are stored in open-source file formats, such as Apache Parquet or Optimized Row Columnar (ORC). The real benefit of a lakehouse is the system’s ability to accept all data types at an affordable cost.
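Parquet and ORC are columnar formats, and the reason that matters for analytics can be shown with a toy comparison (this illustrates the layout idea only, not the actual Parquet encoding):

```python
# Row-oriented layout: one whole record at a time (typical of OLTP storage).
rows = [
    {"city": "Austin", "temp_c": 31},
    {"city": "Oslo", "temp_c": 12},
    {"city": "Lagos", "temp_c": 28},
]

# Column-oriented layout: each field stored contiguously, as Parquet/ORC do.
columns = {
    "city": [r["city"] for r in rows],
    "temp_c": [r["temp_c"] for r in rows],
}

# An aggregate over one column never has to read the "city" values at all,
# which is why columnar scans are cheap for analytical queries.
avg_temp = sum(columns["temp_c"]) / len(columns["temp_c"])
```

In a real lakehouse this layout, combined with compression and column statistics, is what keeps analytical scans over object storage affordable.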

Metadata layer

The metadata layer is the foundation of the data lakehouse. It is a unified catalog that delivers metadata for every object in the lake storage, helping organize and provide information about the data in the system. This layer also gives users the opportunity to use management features such as ACID transactions, file caching, and indexing for faster queries. Users can implement predefined schemas within this layer, which enable data governance and auditing capabilities.
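Conceptually, a catalog maps table names to schemas and file locations, and schema enforcement is a check against that catalog. The sketch below is a deliberately tiny stand-in (real catalogs such as Hive Metastore or Iceberg catalogs track far more); the table name, schema, and object-store path are hypothetical.

```python
# Toy catalog: table name -> schema and backing files in object storage.
catalog = {
    "orders": {
        "schema": {"order_id": int, "amount_cents": int},
        "files": ["s3://lake/orders/part-0001.parquet"],  # hypothetical path
    }
}

def validate(table, record):
    """Schema enforcement: accept only records that match the catalog entry."""
    schema = catalog[table]["schema"]
    return set(record) == set(schema) and all(
        isinstance(record[field], ftype) for field, ftype in schema.items()
    )

ok = validate("orders", {"order_id": 1, "amount_cents": 500})
bad = validate("orders", {"order_id": "1", "amount_cents": 500})  # wrong type
```

Rejecting malformed records at this layer is what gives the lakehouse the upstream governance that plain data lakes lack.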

API layer

A data lakehouse uses APIs to increase task processing and conduct more advanced analytics. Specifically, this layer gives consumers and developers the opportunity to use a range of languages and libraries, such as TensorFlow, at an abstract level. The APIs are optimized for data asset consumption.
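The value of this layer is the abstraction: consumers program against a stable read interface rather than against file formats or storage paths. A minimal sketch of that idea, with an invented `TableReader` interface and an in-memory backend standing in for a real object-store reader:

```python
from typing import Iterable, Protocol

class TableReader(Protocol):
    """Hypothetical read API: consumers depend on this, not on file formats."""
    def scan(self) -> Iterable[dict]: ...

class InMemoryReader:
    """Stand-in backend; a real one would stream Parquet from object storage."""
    def __init__(self, records):
        self._records = records

    def scan(self):
        return iter(self._records)

def count_rows(reader: TableReader) -> int:
    # Works against any backend that satisfies the protocol.
    return sum(1 for _ in reader.scan())

n = count_rows(InMemoryReader([{"a": 1}, {"a": 2}]))
```

Swapping the backend (local files, cloud storage, a cache) requires no change to consuming code, which is what lets analytics and machine learning tools share the same data assets.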

Data consumption layer

This final layer of the data lakehouse architecture hosts client apps and tools, which have access to all of the metadata and data stored in the lake. Users across an organization can make use of the lakehouse to carry out analytical tasks such as building business intelligence dashboards, visualizing data, and running machine learning jobs.

Benefits of a data lakehouse

Since the data lakehouse was designed to bring together the best features of a data warehouse and a data lake, it yields specific key benefits to its users. These include:

  • Reduced data redundancy: A single data storage system provides a streamlined platform to carry out all business data demands. Data lakehouses also simplify data observability by reducing the amount of data moving through data pipelines into multiple systems.
  • Cost-effectiveness: Since data lakehouses capitalize on the lower costs of cloud object storage, their operational costs are comparatively lower than those of data warehouses. Additionally, the hybrid architecture of a data lakehouse eliminates the need to maintain multiple data storage systems, making it less expensive to operate.
  • Support for a wide variety of workloads: Data lakehouses can address different use cases across the data management lifecycle, supporting both business intelligence and data visualization workstreams as well as more complex data science ones.
  • Better governance: The data lakehouse architecture mitigates the standard governance issues that come with data lakes. For example, as data is ingested and uploaded, the system can ensure that it meets the defined schema requirements, reducing downstream data quality issues.
  • More scale: In traditional data warehouses, compute and storage were coupled together, which drove up operational costs. Data lakehouses separate storage and compute, allowing data teams to access the same data storage while using different computing nodes for different applications. This results in more scalability and flexibility.
  • Streaming support: Many modern data sources generate real-time streams directly from devices. The lakehouse system supports this real-time ingestion, which will only become more common in the future.
Related products
Data lakehouse IBM watsonx.data

Scale AI workloads for all your data, anywhere. IBM watsonx.data is the industry’s only open data store that enables you to leverage multiple query engines to run governed workloads wherever they reside, resulting in maximized resource utilization and reduced costs.


Data management IBM Db2

Built on decades of innovation in data security, scalability, and availability, IBM Db2 keeps your applications and analytics protected, highly performant, and resilient, anywhere.


Data management Netezza Performance Server

The advanced cloud-native data warehouse designed for unified, scalable analytics and insights available anywhere. With granular elastic scaling and pause and resume functionality, Netezza Performance Server offers you cost and resource control at a massive enterprise scale.


Data lakehouse resources

AIOps Essential to Unified Resiliency Management in Data Lakehouses

IBM Research proposes that the unified approach of data lakehouses creates a unique opportunity for unified data resiliency management.

The Forrester Wave™: Data Management for Analytics, Q1 2023

IBM solutions provide capabilities that resolve analytics environment challenges. See why IBM is named a leader in data management for analytics solutions.

Data platform trinity: Competitive or complementary?

Understand how these three concepts may lead to one another or be used with each other.

Presto 101: What is Presto?

Learn about the fast and flexible open-source query engine available with watsonx.data’s open data lakehouse architecture.

Take the next step

Scale AI workloads for all your data, anywhere, with IBM watsonx.data, a fit-for-purpose data store built on an open data lakehouse architecture.
