A data lakehouse is a modern data platform that combines the low-cost, flexible data storage of a data lake with the high-performance analytics and data management capabilities of a data warehouse.
Historically, organizations often used data lakes and data warehouses in tandem. Data lakes acted as a catch-all system for raw structured, semi-structured and unstructured data, which was then moved using ETL/ELT pipelines to a data warehouse for downstream use cases such as business intelligence (BI) and predictive analytics.
However, coordinating these systems to provide reliable data can be costly in both time and resources, especially for data analytics and AI workloads. Data movement can contribute to data staleness and redundancy, while additional layers of ETL/ELT can introduce data quality and consistency risks.
Data lakehouses alleviate these challenges by bringing warehouse-style data management and analytics capabilities directly to data stored in data lakes. This arrangement helps data teams unify data management, accelerate data processing, improve data quality, and support scalable artificial intelligence (AI) and machine learning (ML) workloads.
Like a data lake, a data lakehouse uses low-cost cloud object storage, which enables it to store data in almost any format (structured, semi-structured and unstructured).
What makes it a lakehouse is the warehouse-style data management layer built on top of that storage, which adds data structure and governance to support analytics and BI workloads.
Most data lakehouses rely on open table formats (OTFs), typically Apache Iceberg, Apache Hudi or Delta Lake.
These technologies act as metadata layers that organize open data files (such as those stored in Apache Parquet) into logical, database-like tables.
This approach allows organizations to work with raw lake data as if it were structured warehouse data, supporting key capabilities such as time travel, versioning, schema evolution, data manipulation and transactional consistency (ACID).
(“ACID” stands for atomicity, consistency, isolation and durability. These properties help ensure the integrity and reliability of data transactions.)
With these added layers and features, lakehouses make data lakes more reliable and intuitive to use. They also allow users to run structured query language (SQL) queries, analytics workloads and other advanced use cases directly on a data lake, streamlining BI, AI, ML and data intelligence (DI).
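To make the metadata layer's role concrete, here is a minimal, purely illustrative sketch in Python (none of these class or method names come from a real table format such as Iceberg, Hudi or Delta Lake): a table modeled as an append-only log of snapshots, which is enough to demonstrate atomic commits and time travel.

```python
class ToyTable:
    """A minimal open-table-format sketch: the table is a list of
    immutable snapshots, each pointing at the data files valid at that
    version. Readers pick a snapshot; writers append a new one."""

    def __init__(self):
        self.snapshots = []  # append-only commit log

    def commit(self, data_files):
        # An atomic commit: a new snapshot either appears in the log
        # in full or not at all; readers never see a half-written state.
        version = len(self.snapshots)
        self.snapshots.append({"version": version, "files": list(data_files)})
        return version

    def read(self, version=None):
        # Time travel: read any historical version, defaulting to latest.
        if not self.snapshots:
            return []
        snap = self.snapshots[-1] if version is None else self.snapshots[version]
        return snap["files"]

table = ToyTable()
table.commit(["part-000.parquet"])
table.commit(["part-000.parquet", "part-001.parquet"])

print(table.read())           # latest version: both files
print(table.read(version=0))  # time travel: only the first file
```

Real table formats store these snapshots as metadata files next to the data in object storage, but the principle is the same: the commit log, not the data files themselves, defines what the table contains at any version.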
The architecture of a data lakehouse typically consists of five layers:
The first layer, the ingestion layer, collects data from a range of internal and external sources and prepares it for storage and analysis. It can use connectors to integrate with sources such as database management systems, NoSQL databases, SaaS applications and social media feeds. Ingestion can happen in batches or in real time.
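The difference between the two ingestion modes can be sketched in a few lines of Python (the source and record shapes here are invented for illustration; a real connector would talk to a database, API or message queue):

```python
from typing import Iterable, Iterator

# A stand-in source: in practice this would be a database connector,
# a SaaS API or a message queue; here it is just a list of records.
SOURCE = [{"id": 1, "event": "click"},
          {"id": 2, "event": "view"},
          {"id": 3, "event": "click"}]

def batch_ingest(source: Iterable[dict]) -> list[dict]:
    """Batch mode: collect everything available, then load it in one go."""
    return list(source)

def stream_ingest(source: Iterable[dict]) -> Iterator[dict]:
    """Streaming mode: yield records one at a time as they arrive,
    so downstream layers can process them incrementally."""
    for record in source:
        yield record

landed = batch_ingest(SOURCE)        # one bulk load of all records
first = next(stream_ingest(SOURCE))  # records consumed as they arrive
print(len(landed), first["event"])   # 3 click
```

Batch ingestion suits periodic bulk loads; the generator pattern mirrors streaming ingestion, where each record can flow into storage without waiting for the full batch.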
The storage layer holds structured, unstructured and semi-structured datasets in low-cost cloud object storage. Common services include Amazon Simple Storage Service (Amazon S3), Microsoft Azure Blob Storage, Google Cloud Storage and IBM Cloud Object Storage.
Data is typically stored in columnar storage formats optimized for large analytical workloads, such as Apache Parquet or Optimized Row Columnar (ORC). This layer provides a major benefit of the data lakehouse—its ability to cost-efficiently accommodate virtually all data types.
The metadata layer is a unified catalog that organizes and provides information about data in the lake. It is typically powered by open table formats such as Apache Iceberg, Apache Hudi or Delta Lake.
The capabilities of this layer enable ACID transactions, time travel and schema enforcement, which help improve data governance. Robust access controls at this layer are critical for organizations that handle sensitive data and valuable for tracking data access and modifications to maintain audit trails.1
Application programming interfaces (APIs) provide standardized access to lakehouse data and metadata. This layer lets data consumers and developers use a range of analytics engines and machine learning frameworks (such as TensorFlow) to run advanced analytics and model training directly on lakehouse data.
The final layer of data lakehouse architecture hosts apps and tools that have access to all data stored in the lake. This opens data access to users across an organization, who can use the lakehouse to perform tasks such as creating business intelligence dashboards, data visualizations and machine learning jobs.
Medallion data architecture (MDA) is a multi-layered, quality-focused data design approach that ensures lakehouse data is progressively cleansed, validated and trustworthy as it moves from ingestion to consumption. It can help organizations build a scalable, governed data lakehouse suitable for everyday business reporting as well as advanced analytics and machine learning workloads.
That scalability is critical for maintaining quality as data volumes grow. According to a January 2025 benchmark study, 87.4% of organizations found legacy data quality frameworks became operationally unsustainable beyond seven petabytes.2
The framework organizes data into three distinct layers throughout its lifecycle, improving data quality at each step: the bronze layer holds raw data exactly as ingested, the silver layer holds cleansed, validated and deduplicated data, and the gold layer holds refined, business-ready datasets.
The gold layer also strengthens AI readiness. It provides a high-quality stream of AI-ready data directly to ML pipelines, which can help improve model accuracy and reduce data-preparation efforts.
This structured data progression ensures that any final data file can be traced backward through its transformation to its original state. It also provides more predictable and often lower costs, as data storage and compute resources can be optimized according to each layer’s purpose.
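A minimal sketch of this bronze-to-gold progression, using plain Python and invented order records, might look like this:

```python
# Bronze: raw records exactly as ingested, including bad and duplicate rows.
bronze = [
    {"order_id": "1", "amount": "19.99"},
    {"order_id": "2", "amount": "n/a"},    # malformed value, kept raw in bronze
    {"order_id": "1", "amount": "19.99"},  # duplicate, kept raw in bronze
]

def to_silver(rows):
    """Silver: cleanse and validate - parse types, drop bad rows, dedupe."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # reject rows whose amount cannot be parsed
        key = row["order_id"]
        if key not in seen:
            seen.add(key)
            silver.append({"order_id": key, "amount": amount})
    return silver

def to_gold(rows):
    """Gold: aggregate into a business-ready, consumption-oriented view."""
    return {"order_count": len(rows),
            "total_revenue": sum(r["amount"] for r in rows)}

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'order_count': 1, 'total_revenue': 19.99}
```

Because every layer is materialized, the gold figure can always be traced back through silver to the raw bronze rows, which is the lineage property described above.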
Data lakehouses offer several key features:
Open, columnar storage formats (or open data formats) such as Apache Parquet or ORC improve query performance and reduce storage costs through efficient compression, column pruning and predicate pushdown. These formats are compatible with popular analytics engines, allowing different teams and tools to access the same data at the same time. This interoperability helps organizations avoid vendor lock-in.
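The effect of column pruning and predicate pushdown can be illustrated with a toy columnar table in Python (real formats such as Parquet add compression, encodings and file statistics on top of this idea):

```python
# A toy columnar table: each column is stored contiguously, unlike a
# row store where each record is stored whole.
columns = {
    "region": ["eu", "us", "eu", "us"],
    "sales":  [100,  250,  75,   300],
    "notes":  ["a",  "b",  "c",  "d"],
}

def scan(table, select, where_col=None, predicate=None):
    """Column pruning: touch only the columns named in `select` (plus the
    filter column). Predicate pushdown: apply the filter while scanning,
    before materializing any rows."""
    n = len(next(iter(table.values())))
    keep = range(n)
    if where_col is not None:
        # push the predicate down to the single column it needs
        keep = [i for i in keep if predicate(table[where_col][i])]
    # read only the selected columns - "notes" is never touched below
    return {col: [table[col][i] for i in keep] for col in select}

result = scan(columns, select=["sales"], where_col="region",
              predicate=lambda r: r == "eu")
print(result)  # {'sales': [100, 75]}
```

The query reads only the "region" and "sales" columns and filters before building rows, which is why columnar scans over wide tables can skip most of the stored bytes.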
Most data lakehouses use open table formats such as Apache Iceberg, Apache Hudi and Delta Lake to provide ACID transactions. These transactions, such as inserts, updates and deletes, guarantee that data remains consistent and reliable during and after data operations.
A single data storage system creates a centralized platform that can meet all business data demands, reducing data silos and duplication across systems and teams. This unification also simplifies end-to-end data observability as data movement through various data pipelines and systems is significantly reduced.
Data lakehouses capitalize on low-cost cloud object storage, making them more cost-effective for large data volumes and workloads than traditional data warehouses. The hybrid architecture of a data lakehouse also eliminates the need to maintain multiple data storage systems, often reducing operational expenses.
Data lakehouses can address different use cases across the data management lifecycle. They can support business intelligence and data-driven visualization workflows, or more complex data science projects (such as machine learning model training or real-time analytics)—all on the same data.
Data lakehouse architecture mitigates the governance issues of data lakes with centralized metadata catalogs, schema enforcement and built-in data quality management tools. Data security can be strengthened using access controls, monitoring and audits, data anonymization, blockchain and even quantum computing.3,4
Data lakehouses separate storage and compute, allowing data teams to scale them separately. This decoupling also provides the flexibility to access the same data while using different computing engines or nodes for different applications.
Modern data lakehouses are built for today's businesses and technology. Many workloads now involve real-time streaming data from sources such as Internet of Things (IoT) devices, and the lakehouse supports them through real-time data ingestion and incremental processing.
A data lakehouse isn’t simply a data warehouse combined with a data lake. It’s a unified architecture that brings the best parts of both together into a single platform.
Data warehouses are built for structured analytics. They deliver excellent performance for business intelligence applications and reporting by storing and transforming enterprise data.
However, data warehouses lack the flexibility of data lakes and become inefficient and costly as data volumes and workloads grow. Data warehousing also requires strict schemas, meaning data must conform to a predefined model before ingestion into the data repository (schema-on-write). Due to these constraints, warehouses don't work well with unstructured or semi-structured data, which is critical for AI and ML use cases.
Data lakes allow organizations to store all types of data—structured, unstructured and semi-structured—from diverse sources in one location. They use a schema-on-read approach, so data models are applied when data is used rather than when it’s stored. They also typically have more scalable and affordable data storage (often cloud object storage).
However, they do not have built-in data processing tools and rely on external capabilities to perform analytics. Their size and complexity can also require the expertise of more technical users, such as data scientists and data engineers. And, because data governance occurs downstream, data lakes can be prone to data silos, subsequently devolving into data swamps (where good data is inaccessible due to poor management).
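The schema-on-write versus schema-on-read distinction can be sketched as follows (plain Python, with invented records; a real warehouse or lake applies these ideas at far larger scale):

```python
raw_records = [
    {"user_id": "42", "signup": "2024-01-05"},
    {"user_id": "not-an-id", "signup": "2024-02-11"},
]

def schema_on_write(records):
    """Warehouse style: validate and convert types at ingestion time.
    Non-conforming records are rejected before they are ever stored."""
    stored = []
    for rec in records:
        if not rec["user_id"].isdigit():
            continue  # rejected at write time
        stored.append({"user_id": int(rec["user_id"]),
                       "signup": rec["signup"]})
    return stored

def schema_on_read(records, parse):
    """Lake style: store everything as-is; apply a schema (here, a `parse`
    function chosen by the reader) only when the data is consumed."""
    return [parse(rec) for rec in records]

warehouse = schema_on_write(raw_records)  # only the valid record survives
lake = schema_on_read(raw_records,        # every record kept; the reader
                      parse=lambda r: r["user_id"])  # decides its meaning
print(len(warehouse), lake)
```

The warehouse guarantees clean, typed data but discards anything that does not fit; the lake keeps everything and defers interpretation, which is exactly the trade-off a lakehouse tries to reconcile with schema enforcement layered over open storage.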
Data lakehouses are designed to resolve the challenges of data warehouses and data lakes, bringing their benefits under one platform. They leverage flexible, low-cost storage that supports a broad range of data types while also delivering data management and high-performance capabilities to support BI, analytics and AI/ML workloads in a single architecture.
Anson Kokkat, Principal Product Manager of IBM Software, emphasizes the importance of lakehouses for modern AI programs:
“AI models are only as good as the governed, scalable data platform beneath them. The right data lakehouse becomes the foundation that turns raw enterprise data into production-ready AI. When built on open architecture, that translates to AI flexibility—you are not locked into one engine, you can integrate with existing open source tools such as Presto, Apache Spark, OpenSearch and Cassandra.”
Another major benefit: Organizations can often implement data lakehouses alongside their existing data lakes and data warehouses without a full teardown and rebuild.
Today, many providers offer open data lakehouses. This architecture supports open data and open formats for storing vast amounts of data in vendor-agnostic formats, such as Parquet, Avro and Apache ORC. It can also leverage Apache Iceberg to share large volumes of data through an open table format.
Common data lakehouse challenges include complex implementations (including migrations from existing data platforms); balancing data governance and security with unified data access; and ensuring that query performance remains optimal as data volumes grow.
Yes. Data lakehouses support AI and ML workloads by providing unified access to large volumes of diverse data with strong governance. They use open data and open table formats to prevent vendor lock-in and enable direct integration between the storage layer and ML frameworks.
Avoiding a data swamp requires strong data governance, data quality and data security practices. Additionally, a tiered (medallion) storage architecture keeps data organized, and open table formats with ACID transactions help ensure data integrity, consistency and reliability.
1 Data Lakehouse Architecture: The Evolution of Enterprise Data Management, Journal of Computer Science and Technology Studies, 23 June 2025.
2 Data Lakehouse Implementation: A Journey From Traditional Data Warehouses, World Journal of Advanced Engineering Technology and Sciences, 26 February 2025.
3 Data Lakehouse: A Survey and Experimental Study, ScienceDirect, 26 September 2024.
4 Minimizing Incident Response Time in Real-World Scenarios Using Quantum Computing, Springer Nature Link, 26 May 2023.