Organizations are dealing with large volumes of data from an array of different sources, and these datasets vary widely in type and quality. At the same time, organizations want to minimize the cost of data processing and insight extraction while maximizing efficiency and value. To satisfy these somewhat opposing requirements, they store data across a complex, messy landscape of data lakes, data warehouses, and data marts.

Maintaining this siloed, complex analytics ecosystem of questionable data quality and varying data structure takes effort, time, and money before it can serve as a reliable source of truth for analytics and decision making. The ecosystem evolved over decades from bandages applied to existing data management investments, without consideration for a holistic approach to the data management lifecycle.

All that is changing.


The emergence of data lakehouse architecture

To address the challenge of this distributed data landscape, the data lakehouse emerged to combine the enterprise features and high performance of a data warehouse with the openness, flexibility, and scalability of data lakes.

The current generation of lakehouse solutions reduces the burden of maintaining and managing multiple systems by consolidating data from warehouses and lakes into a single storage location on low-cost, commodity S3-compatible object storage. These lakehouses address the performance issue with modern distributed SQL engines and the openness issue with open data and table formats. The issues of consistency and data quality are addressed by modern table formats such as Iceberg, Hudi, and Delta, which also bring data warehousing qualities such as ACID transactions.
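Modern table formats achieve ACID guarantees largely by never mutating data files in place: each commit writes new, immutable data files plus a new metadata snapshot, then atomically swaps a single pointer to the latest snapshot. The following is a minimal Python sketch of that idea only; it is a toy model, not the actual Iceberg, Hudi, or Delta implementation.

```python
import json
import os
import tempfile


class ToyTable:
    """Toy snapshot-based table: data files are immutable; commits swap a pointer."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)
        self.pointer = os.path.join(root, "current_snapshot.json")

    def _read_snapshot(self):
        if not os.path.exists(self.pointer):
            return {"files": []}
        with open(self.pointer) as f:
            return json.load(f)

    def commit(self, new_rows):
        # 1. Write a new immutable data file (existing files are never modified).
        snap = self._read_snapshot()
        data_file = os.path.join(self.root, f"data-{len(snap['files'])}.json")
        with open(data_file, "w") as f:
            json.dump(new_rows, f)
        # 2. Write a new snapshot listing every file that is live after this commit.
        new_snap = {"files": snap["files"] + [data_file]}
        fd, tmp = tempfile.mkstemp(dir=self.root)
        with os.fdopen(fd, "w") as f:
            json.dump(new_snap, f)
        # 3. Atomic pointer swap: a reader sees the old or the new snapshot, never a mix.
        os.replace(tmp, self.pointer)

    def scan(self):
        # Read every row from the files listed in the current snapshot.
        rows = []
        for path in self._read_snapshot()["files"]:
            with open(path) as f:
                rows.extend(json.load(f))
        return rows
```

Because step 3 is a single atomic rename, a failed or concurrent write never leaves readers with a half-applied commit; that is the essence of how these formats layer transactional behavior over plain object storage.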

Here is an overview of the major components of a lakehouse:

Storage: This is the layer that physically stores the data. The most common data lake/lakehouse storage types are S3-compatible object storage and HDFS. In this layer, data is stored as files, typically in open file formats such as Parquet and Avro.
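On the storage layer, a lakehouse table is usually laid out as a directory of immutable files, often partitioned Hive-style by embedding column values in the path (for example, `.../sales/year=2023/month=05/part-0000.parquet`). A small illustrative sketch of building and parsing such paths follows; real engines also handle value escaping and typed partition columns.

```python
def partition_path(table_root, partitions):
    """Build a Hive-style partition directory: key=value segments under the table root."""
    segments = [f"{k}={v}" for k, v in partitions.items()]
    return "/".join([table_root] + segments)


def parse_partitions(path):
    """Recover partition column values from a Hive-style path."""
    return {
        seg.split("=", 1)[0]: seg.split("=", 1)[1]
        for seg in path.split("/")
        if "=" in seg
    }
```

For example, `partition_path("s3://bucket/sales", {"year": "2023", "month": "05"})` yields `s3://bucket/sales/year=2023/month=05`, and `parse_partitions` recovers the column values from any file path under it.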

Technical metadata storage/service: This component is required to understand what data is available in the storage layer. The query engine needs table and file metadata to understand where the data is located, what its schema looks like, and how to read it. The de facto open metadata storage solution is the Hive metastore.
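Conceptually, a metastore is a mapping from table names to schema and data location; the query engine asks the catalog before it ever touches storage. Here is a minimal in-memory sketch of that contract, a toy stand-in for illustration only, not the Hive metastore's actual API.

```python
class ToyCatalog:
    """Toy metastore: maps table names to schema and data file locations."""

    def __init__(self):
        self._tables = {}

    def create_table(self, name, schema, location):
        # schema: column name -> type; location: root path of the table's data files
        self._tables[name] = {"schema": schema, "location": location}

    def load_table(self, name):
        # An engine calls this to learn where the data lives and how to read it.
        return self._tables[name]
```

A real metastore adds partitions, statistics, and access control, but the core service is the same lookup: name in, schema and location out.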

SQL query engine: This component is at the heart of the data lakehouse. It executes queries against the data and is often referred to as the “compute” component. There are many open-source query engines for the lakehouse, such as Presto and Apache Spark. In a lakehouse architecture, the query engine is fully modular and ephemeral, meaning it can be dynamically scaled to meet big data workload demands and concurrency. SQL query engines can attach to any number of catalogs and storage systems.
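One way the engine exploits catalog metadata is partition pruning: given a filter on a partition column, it reads only the files whose path values can match and skips everything else. This stdlib sketch shows the idea in miniature; engines such as Presto and Spark do this against catalog and table-format metadata rather than raw strings.

```python
def prune_files(files, column, value):
    """Keep only files whose Hive-style path segment for `column` matches `value`."""
    wanted = f"{column}={value}"
    return [f for f in files if wanted in f.split("/")]


# Hypothetical file listing for a table partitioned by year.
files = [
    "s3://bucket/sales/year=2022/part-0000.parquet",
    "s3://bucket/sales/year=2023/part-0000.parquet",
    "s3://bucket/sales/year=2023/part-0001.parquet",
]
```

A query with `WHERE year = 2023` would touch only the last two files, which is why partition layout and metadata quality have such a direct effect on lakehouse query cost.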

Although the lakehouse offers a lot of promise, a few questions remain. Most vendors in the market are optimizing a single SQL engine to tackle a range of workloads, which is often insufficient: some applications demand greater performance while others require greater language flexibility.

While a lakehouse is open by design, and many vendors tout the ability to prevent lock-in at the data store layer with support for open data and table formats, metadata portability can still be lacking. Customers may have to perform significant rework when onboarding to, or leaving, a solution.

Data lakehouse architecture is getting attention, and organizations will want to optimize the components most critical to their business. A lakehouse architecture can bring the flexibility, modularity, and cost-effective extensibility that modern data engineering, data science, and analytics use cases demand, and it can simplify taking advantage of future enhancements. However, there is still much that can be done to provide greater openness and flexibility: the industry is looking for an open data lakehouse approach.

Learn about IBM’s new approach to scaling AI workloads with a fit-for-purpose data store built on an open lakehouse architecture and optimized for all data, analytics, and AI workloads.

