A data lakehouse is a modern data platform that combines the low-cost, flexible data storage of a data lake with the high-performance analytics and data management capabilities of a data warehouse.
Historically, organizations often used data lakes and data warehouses in tandem. Data lakes acted as a catch-all system for raw structured, semi-structured and unstructured data, which was then moved using ETL/ELT pipelines to a data warehouse for downstream use cases such as business intelligence (BI) and predictive analytics.
However, coordinating these systems to provide reliable data can be costly in both time and resources, especially for data analytics and AI workloads. Data movement can contribute to data staleness and redundancy, while additional layers of ETL/ELT can introduce data quality and consistency risks.
Data lakehouses alleviate these challenges by bringing warehouse-style data management and analytics capabilities directly to data stored in data lakes. This arrangement helps data teams unify data management, accelerate data processing, improve data quality, and support scalable artificial intelligence (AI) and machine learning (ML) workloads.
Like a data lake, a data lakehouse uses low-cost cloud object storage, which enables it to store data in almost any format (structured, semi-structured and unstructured).
What makes it a lakehouse is the warehouse-style data management layer built on top of that storage, which adds data structure and governance to support analytics and BI workloads.
Most data lakehouses rely on open table formats (OTFs), typically Apache Iceberg, Apache Hudi or Delta Lake.
These technologies act as metadata layers that organize open data files (such as those stored in Apache Parquet) into logical, database-like tables.
This approach allows organizations to work with raw lake data as if it were structured warehouse data, supporting key capabilities such as time travel, versioning, schema evolution, data manipulation and transactional consistency (ACID).
(“ACID” stands for atomicity, consistency, isolation and durability. These properties help ensure the integrity and reliability of data transactions.)
With these added layers and features, lakehouses make data lakes more reliable and intuitive to use. They also allow users to run structured query language (SQL) queries, analytics workloads and other advanced use cases directly on a data lake, streamlining BI, AI, ML and data intelligence (DI).
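To make the metadata layer's role concrete, here is a minimal, purely illustrative sketch in Python (none of these class or method names come from a real table format such as Iceberg, Hudi or Delta Lake): a table modeled as an append-only log of snapshots, which is enough to demonstrate atomic commits and time travel.

```python
class ToyTable:
    """A minimal open-table-format sketch: the table is a list of
    immutable snapshots, each pointing at the data files valid at that
    version. Readers pick a snapshot; writers append a new one."""

    def __init__(self):
        self.snapshots = []  # append-only commit log

    def commit(self, data_files):
        # An atomic commit: a new snapshot either appears in the log
        # in full or not at all; readers never see a half-written state.
        version = len(self.snapshots)
        self.snapshots.append({"version": version, "files": list(data_files)})
        return version

    def read(self, version=None):
        # Time travel: read any historical version, defaulting to latest.
        if not self.snapshots:
            return []
        snap = self.snapshots[-1] if version is None else self.snapshots[version]
        return snap["files"]

table = ToyTable()
table.commit(["part-000.parquet"])
table.commit(["part-000.parquet", "part-001.parquet"])

print(table.read())           # latest version: both files
print(table.read(version=0))  # time travel: only the first file
```

Real table formats store these snapshots as metadata files next to the data in object storage, but the principle is the same: the commit log, not the data files themselves, defines what the table contains at any version.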
The architecture of a data lakehouse typically consists of five layers:
The first layer, the ingestion layer, collects data from a range of internal and external sources and prepares it for storage and analysis. It can use connectors to integrate with sources such as database management systems, NoSQL databases, SaaS applications and social media feeds. Ingestion can happen in batches or in real time.
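The difference between the two ingestion modes can be sketched in a few lines of Python (the source and record shapes here are invented for illustration; a real connector would talk to a database, API or message queue):

```python
from typing import Iterable, Iterator

# A stand-in source: in practice this would be a database connector,
# a SaaS API or a message queue; here it is just a list of records.
SOURCE = [{"id": 1, "event": "click"},
          {"id": 2, "event": "view"},
          {"id": 3, "event": "click"}]

def batch_ingest(source: Iterable[dict]) -> list[dict]:
    """Batch mode: collect everything available, then load it in one go."""
    return list(source)

def stream_ingest(source: Iterable[dict]) -> Iterator[dict]:
    """Streaming mode: yield records one at a time as they arrive,
    so downstream layers can process them incrementally."""
    for record in source:
        yield record

landed = batch_ingest(SOURCE)        # one bulk load of all records
first = next(stream_ingest(SOURCE))  # records consumed as they arrive
print(len(landed), first["event"])   # 3 click
```

Batch ingestion suits periodic bulk loads; the generator pattern mirrors streaming ingestion, where each record can flow into storage without waiting for the full batch.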
The storage layer holds structured, unstructured and semi-structured datasets in low-cost cloud object storage. Common services include Amazon Simple Storage Service (Amazon S3), Microsoft Azure Blob Storage, Google Cloud Storage and IBM Cloud Object Storage.
Data is typically stored in columnar storage formats optimized for large analytical workloads, such as Apache Parquet or Optimized Row Columnar (ORC). This layer provides a major benefit of the data lakehouse—its ability to cost-efficiently accommodate virtually all data types.
The metadata layer is a unified catalog that organizes and provides information about data in the lake. It is typically powered by open table formats such as Apache Iceberg, Apache Hudi or Delta Lake.
The capabilities of this layer enable ACID transactions, time travel and schema enforcement, which help improve data governance. Robust access controls at this layer are critical for organizations that handle sensitive data and valuable for tracking data access and modifications to maintain audit trails.1
Application programming interfaces (APIs) provide standardized access to lakehouse data and metadata. This layer lets data consumers and developers use a range of analytics engines and machine learning frameworks (such as TensorFlow) to run advanced analytics and model training directly on lakehouse data.
The final layer of data lakehouse architecture hosts apps and tools that have access to all data stored in the lake. This opens data access to users across an organization, who can use the lakehouse to perform tasks such as creating business intelligence dashboards, data visualizations and machine learning jobs.
Medallion data architecture (MDA) is a multi-layered, quality-focused data design approach that ensures lakehouse data is progressively cleansed, validated and trustworthy as it moves from ingestion to consumption. It can help organizations build a scalable, governed data lakehouse suitable for everyday business reporting as well as advanced analytics and machine learning workloads.
That scalability is critical for maintaining quality as data volumes grow. According to a January 2025 benchmark study, 87.4% of organizations found legacy data quality frameworks became operationally unsustainable beyond seven petabytes.2
The framework organizes data into three distinct layers throughout its lifecycle, improving data quality at each step: the bronze layer holds raw data exactly as ingested, the silver layer holds cleansed, validated and deduplicated data, and the gold layer holds refined, business-ready datasets.
The gold layer also strengthens AI readiness. It provides a high-quality stream of AI-ready data directly to ML pipelines, which can help improve model accuracy and reduce data-preparation efforts.
This structured data progression ensures that any final data file can be traced backward through its transformation to its original state. It also provides more predictable and often lower costs, as data storage and compute resources can be optimized according to each layer’s purpose.
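A minimal sketch of this bronze-to-gold progression, using plain Python and invented order records, might look like this:

```python
# Bronze: raw records exactly as ingested, including bad and duplicate rows.
bronze = [
    {"order_id": "1", "amount": "19.99"},
    {"order_id": "2", "amount": "n/a"},    # malformed value, kept raw in bronze
    {"order_id": "1", "amount": "19.99"},  # duplicate, kept raw in bronze
]

def to_silver(rows):
    """Silver: cleanse and validate - parse types, drop bad rows, dedupe."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # reject rows whose amount cannot be parsed
        key = row["order_id"]
        if key not in seen:
            seen.add(key)
            silver.append({"order_id": key, "amount": amount})
    return silver

def to_gold(rows):
    """Gold: aggregate into a business-ready, consumption-oriented view."""
    return {"order_count": len(rows),
            "total_revenue": sum(r["amount"] for r in rows)}

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'order_count': 1, 'total_revenue': 19.99}
```

Because every layer is materialized, the gold figure can always be traced back through silver to the raw bronze rows, which is the lineage property described above.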
Data lakehouses offer several key features:
Open, columnar storage formats (or open data formats) such as Apache Parquet or ORC improve query performance and reduce storage costs through efficient compression, column pruning and predicate pushdown. These formats are compatible with popular analytics engines, allowing different teams and tools to access the same data at the same time. This interoperability helps organizations avoid vendor lock-in.
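The effect of column pruning and predicate pushdown can be illustrated with a toy columnar table in Python (real formats such as Parquet add compression, encodings and file statistics on top of this idea):

```python
# A toy columnar table: each column is stored contiguously, unlike a
# row store where each record is stored whole.
columns = {
    "region": ["eu", "us", "eu", "us"],
    "sales":  [100,  250,  75,   300],
    "notes":  ["a",  "b",  "c",  "d"],
}

def scan(table, select, where_col=None, predicate=None):
    """Column pruning: touch only the columns named in `select` (plus the
    filter column). Predicate pushdown: apply the filter while scanning,
    before materializing any rows."""
    n = len(next(iter(table.values())))
    keep = range(n)
    if where_col is not None:
        # push the predicate down to the single column it needs
        keep = [i for i in keep if predicate(table[where_col][i])]
    # read only the selected columns - "notes" is never touched below
    return {col: [table[col][i] for i in keep] for col in select}

result = scan(columns, select=["sales"], where_col="region",
              predicate=lambda r: r == "eu")
print(result)  # {'sales': [100, 75]}
```

The query reads only the "region" and "sales" columns and filters before building rows, which is why columnar scans over wide tables can skip most of the stored bytes.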
Most data lakehouses use open table formats such as Apache Iceberg, Apache Hudi and Delta Lake to provide ACID transactions. These transactions, such as inserts, updates and deletes, guarantee that data remains consistent and reliable during and after data operations.
A single data storage system creates a centralized platform that can meet all business data demands, reducing data silos and duplication across systems and teams. This unification also simplifies end-to-end data observability as data movement through various data pipelines and systems is significantly reduced.
Data lakehouses capitalize on low-cost cloud object storage, making them more cost-effective for large data volumes and workloads than traditional data warehouses. The hybrid architecture of a data lakehouse also eliminates the need to maintain multiple data storage systems, often reducing operational expenses.
Data lakehouses can address different use cases across the data management lifecycle. They can support business intelligence and data-driven visualization workflows, or more complex data science projects (such as machine learning model training or real-time analytics)—all on the same data.
Data lakehouse architecture mitigates the governance issues of data lakes with centralized metadata catalogs, schema enforcement and built-in data quality management tools. Data security can be strengthened using access controls, monitoring and audits, data anonymization, blockchain and even quantum computing.3,4
Data lakehouses separate storage and compute, allowing data teams to scale them separately. This decoupling also provides the flexibility to access the same data while using different computing engines or nodes for different applications.
Modern data lakehouses are built for today's businesses and technology. Many workloads now involve real-time streaming data from sources such as Internet of Things (IoT) devices, and the lakehouse supports them through real-time data ingestion and incremental processing.
A data lakehouse isn’t simply a data warehouse combined with a data lake. It’s a unified architecture that brings the best parts of both together into a single platform.
Data warehouses are built for structured analytics. They deliver excellent performance for business intelligence applications and reporting by storing and transforming enterprise data.
However, data warehouses lack the flexibility of data lakes and become inefficient and costly as data volumes and workloads grow. Data warehousing also requires strict schemas, meaning data must conform to a predefined model before ingestion into the data repository (schema-on-write). Due to these constraints, warehouses don't work well with unstructured or semi-structured data, which is critical for AI and ML use cases.
Data lakes allow organizations to store all types of data—structured, unstructured and semi-structured—from diverse sources in one location. They use a schema-on-read approach, so data models are applied when data is used rather than when it’s stored. They also typically have more scalable and affordable data storage (often cloud object storage).
However, they do not have built-in data processing tools and rely on external capabilities to perform analytics. Their size and complexity can also require the expertise of more technical users, such as data scientists and data engineers. And, because data governance occurs downstream, data lakes can be prone to data silos, subsequently devolving into data swamps (where good data is inaccessible due to poor management).
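The schema-on-write versus schema-on-read distinction can be sketched as follows (plain Python, with invented records; a real warehouse or lake applies these ideas at far larger scale):

```python
raw_records = [
    {"user_id": "42", "signup": "2024-01-05"},
    {"user_id": "not-an-id", "signup": "2024-02-11"},
]

def schema_on_write(records):
    """Warehouse style: validate and convert types at ingestion time.
    Non-conforming records are rejected before they are ever stored."""
    stored = []
    for rec in records:
        if not rec["user_id"].isdigit():
            continue  # rejected at write time
        stored.append({"user_id": int(rec["user_id"]),
                       "signup": rec["signup"]})
    return stored

def schema_on_read(records, parse):
    """Lake style: store everything as-is; apply a schema (here, a `parse`
    function chosen by the reader) only when the data is consumed."""
    return [parse(rec) for rec in records]

warehouse = schema_on_write(raw_records)  # only the valid record survives
lake = schema_on_read(raw_records,        # every record kept; the reader
                      parse=lambda r: r["user_id"])  # decides its meaning
print(len(warehouse), lake)
```

The warehouse guarantees clean, typed data but discards anything that does not fit; the lake keeps everything and defers interpretation, which is exactly the trade-off a lakehouse tries to reconcile with schema enforcement layered over open storage.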
Data lakehouses are designed to resolve the challenges of data warehouses and data lakes, bringing their benefits under one platform. They leverage flexible, low-cost storage that supports a broad range of data types while also delivering data management and high-performance capabilities to support BI, analytics and AI/ML workloads in a single architecture.
Anson Kokkat, Principal Product Manager of IBM Software, emphasizes the importance of lakehouses for modern AI programs:
“AI models are only as good as the governed, scalable data platform beneath them. The right data lakehouse becomes the foundation that turns raw enterprise data into production-ready AI. When built on open architecture, that translates to AI flexibility—you are not locked into one engine, you can integrate with existing open source tools such as Presto, Apache Spark, OpenSearch and Cassandra.”
Another major benefit: Organizations can often implement data lakehouses alongside their existing data lakes and data warehouses without a full teardown and rebuild.
Today, many providers offer open data lakehouses. This architecture supports open data and open formats for storing vast amounts of data in vendor-agnostic formats, such as Parquet, Avro and Apache ORC. It can also leverage Apache Iceberg to share large volumes of data through an open table format.
Common data lakehouse challenges include complex implementations (including migrations from existing data platforms); balancing data governance and security with unified data access; and ensuring that query performance remains optimal as data volumes grow.
Yes. Data lakehouses support AI and ML workloads by providing unified access to large volumes of diverse data with strong governance. They use open data and open table formats to prevent vendor lock-in and enable direct integration between the storage layer and ML frameworks.
Avoiding a data swamp requires strong data governance, data quality and data security practices. Additionally, a tiered (medallion) storage architecture keeps data organized, and open table formats with ACID transactions help ensure data integrity, consistency and reliability.
1 Data Lakehouse Architecture: The Evolution of Enterprise Data Management, Journal of Computer Science and Technology Studies, 23 June 2025.
2 Data Lakehouse Implementation: A Journey From Traditional Data Warehouses, World Journal of Advanced Engineering Technology and Sciences, 26 February 2025.
3 Data Lakehouse: A Survey and Experimental Study, ScienceDirect, 26 September 2024.
4 Minimizing Incident Response Time in Real-World Scenarios Using Quantum Computing, Springer Nature Link, 26 May 2023.