Data Lakehouse

Overview

A data lakehouse is a data platform that merges the best aspects of data warehouses and data lakes into one data management solution.

IBM's Data Lakehouse and governance architecture for hybrid cloud environments is anchored on its watsonx.data platform. This platform enables enterprises to scale analytics and AI on a data store built on an open lakehouse architecture. The architecture combines the performance and usability of a data warehouse with the flexibility and scalability of a data lake, offering a balanced solution for data management and analytics tasks.

Deployment

The watsonx.data platform is offered both as a SaaS offering and as an on-premises solution. For clients in a geography without a SaaS offering, or who require the lakehouse platform to remain on-premises due to regulatory or other constraints, IBM provides the following deployment options to enable data lakehouse capabilities anywhere:

Provision watsonx.data SaaS on IBM Cloud or AWS.
Deploy watsonx.data as a standalone solution on top of OpenShift, either on-premises or on other hyperscalers with managed OpenShift.
Deploy watsonx.data as part of an IBM Cloud Pak for Data (CP4D) cluster.

A data lakehouse architecture enabling the use of multiple fit-for-purpose query engines while providing simultaneous access to the same data across all engines.

Data sources - These include structured data from databases, applications, and enterprise data warehouses, as well as unstructured data from files, social media, IoT devices, and other unstructured data stores, originating from both client on-premises applications and SaaS.
Client Applications - Clients may have applications on-premises or in SaaS with their own data stores (structured and unstructured) whose data is not in the data lake, and may wish to bring that data into the lakehouse for easy querying.
Data Lakehouse - watsonx.data is the next-gen data store architecture that balances the capabilities of data lakes and data warehouses. This is foundational to IBM's Data Lakehouse approach, facilitating the scaling of AI and Machine Learning (ML) workloads while ensuring efficient data governance.
GenAI Platform - The data lakehouse may optionally be connected to a GenAI platform for augmenting queries with LLMs. Users may input a prompt, which is sent to a fine-tuned LLM to generate retrieval queries which may be executed by the engines supported in the data lakehouse.

Lakehouse Patterns

Lakehouse Pattern 1: Multiple Fit-for-purpose Query Engines

Use fit-for-purpose compute to optimize cost by matching the right engine to the right workload, while sharing data and metadata across all engines through a shared metastore (i.e. the data catalog) in the same environment.

A data lakehouse architecture enabling use of multiple query engines to optimize cost and performance.
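
The shared-metastore idea in this pattern can be sketched in a few lines of plain Python. This is a hypothetical illustration, not the watsonx.data API: the `SharedCatalog` and `Engine` classes, the table name, and the bucket path are all invented for the example. The sketch only shows that each engine resolves a table through the same catalog entry, so both plan their scans against the same files and no copy of the data is made.

```python
from dataclasses import dataclass, field


@dataclass
class SharedCatalog:
    """Hypothetical stand-in for a shared metastore (data catalog)."""
    tables: dict = field(default_factory=dict)

    def register(self, name: str, location: str, fmt: str) -> None:
        # One entry per table, visible to every engine that holds this catalog.
        self.tables[name] = {"location": location, "format": fmt}

    def resolve(self, name: str) -> dict:
        return self.tables[name]


@dataclass
class Engine:
    """Hypothetical query engine that looks tables up in the shared catalog."""
    name: str
    catalog: SharedCatalog

    def plan_scan(self, table: str) -> str:
        entry = self.catalog.resolve(table)
        return f"{self.name} scans {entry['format']} files at {entry['location']}"


catalog = SharedCatalog()
# Table name and storage path are made up for the example.
catalog.register("sales.orders", "s3://example-bucket/sales/orders", "parquet")

interactive = Engine("presto", catalog)  # e.g. ad hoc SQL
batch = Engine("spark", catalog)         # e.g. heavy transformations

print(interactive.plan_scan("sales.orders"))
print(batch.plan_scan("sales.orders"))
```

Both engines report the same storage location, which is the property the pattern relies on: workloads can move between engines for cost reasons without moving the data.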

Lakehouse Pattern 2: Single pane of glass for all your data

A data lakehouse offers a modern alternative to current data architectures, in which enterprises have built up several silos of data stores over the years to cater to different needs. These range from structured, high-performance enterprise data warehouses (EDW) to high-volume, unstructured or semi-structured data lakes, which often turn into data swamps (duplication, poor data quality, lack of governance). A data lakehouse with watsonx.data enables a single layer of access to a variety of data stores through multiple query engines, open data formats, and governance, without the need for data movement.

A data lakehouse architecture to provide a single access layer (single pane of glass) for all of an enterprise's data stores including object storage, relational data, and data lakes.

Lakehouse Pattern 3: Optimize Data Warehouse workloads to reduce cost

Reduce warehousing cost while still maintaining temporal query capabilities by leveraging the lakehouse's cheap storage and compute, and by allowing multiple query engines to consume the same data set. Query engines like Spark can run vacuumed or materialized queries against data in its current state (i.e. without the full change history), which reduces the amount of data queried and the query compute cost. In addition, the lakehouse's preprocessing and selective transformation capabilities allow for optimal distribution of data warehouse workloads, further reducing cost.

A data lakehouse architecture to minimize data warehouse costs and optimize warehouse query performance.
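
To make the "current state versus full change history" point concrete, the following is a minimal sketch in plain Python. It is not how Spark or any table format implements vacuuming or materialization. The record layout (`key`, `version`, `op`, `value`) and the sample rows are assumptions made for the illustration. It only shows why a query against the collapsed current state touches fewer rows than a query over the whole history.

```python
from typing import Iterable


def current_state(change_log: Iterable[dict]) -> dict:
    """Collapse a change log into the latest value per key.

    Each entry is a dict with 'key', 'version', 'op' ('upsert' or 'delete')
    and 'value'. These field names are assumptions made for this sketch.
    """
    latest = {}
    # Apply changes in version order so later changes win.
    for change in sorted(change_log, key=lambda c: c["version"]):
        if change["op"] == "delete":
            latest.pop(change["key"], None)
        else:
            latest[change["key"]] = change["value"]
    return latest


# Hypothetical change history for two customers.
history = [
    {"key": "cust-1", "version": 1, "op": "upsert", "value": {"tier": "bronze"}},
    {"key": "cust-2", "version": 2, "op": "upsert", "value": {"tier": "silver"}},
    {"key": "cust-1", "version": 3, "op": "upsert", "value": {"tier": "gold"}},
    {"key": "cust-2", "version": 4, "op": "delete", "value": None},
]

snapshot = current_state(history)
print(len(history), "history rows ->", len(snapshot), "current rows")
print(snapshot)
```

The full history remains available for temporal queries, while routine reporting can run against the much smaller snapshot.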

Lakehouse Pattern 4: Hybrid Multi Cloud Deployment

Connect to and access data remotely across hybrid cloud with the ability to cache remote sources.

A data lakehouse architecture to integrate on-premises and cloud data across multiple providers.
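
The caching of remote sources mentioned in this pattern can be pictured as a time-to-live cache placed in front of the remote query path. The sketch below is a generic illustration under stated assumptions, not the caching mechanism of watsonx.data: `RemoteSourceCache`, `fake_remote_fetch`, the 60-second TTL, and the query text are all hypothetical, and a fake clock is injected so the behavior is deterministic.

```python
import time
from typing import Callable


class RemoteSourceCache:
    """Hypothetical time-to-live cache in front of a remote data source."""

    def __init__(self, fetch: Callable[[str], list], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # query text -> (expires_at, rows)
        self.remote_calls = 0

    def query(self, sql: str) -> list:
        now = self._clock()
        cached = self._entries.get(sql)
        if cached is not None and cached[0] > now:
            return cached[1]       # served locally, no remote round trip
        self.remote_calls += 1
        rows = self._fetch(sql)    # goes to the remote source
        self._entries[sql] = (now + self._ttl, rows)
        return rows


def fake_remote_fetch(sql: str) -> list:
    # Placeholder for a call to a remote on-premises or cloud source.
    return [("row-for", sql)]


fake_now = [0.0]  # mutable fake clock, advanced by hand below
cache = RemoteSourceCache(fake_remote_fetch, ttl_seconds=60,
                          clock=lambda: fake_now[0])

cache.query("SELECT * FROM remote.orders")  # miss: fetched remotely
cache.query("SELECT * FROM remote.orders")  # hit: served from the cache
fake_now[0] = 61.0                          # move past the TTL
cache.query("SELECT * FROM remote.orders")  # expired: fetched again
print(cache.remote_calls)
```

The TTL is the main design knob in such a scheme: a longer TTL saves more cross-cloud traffic but lets cached results drift further from the remote source.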

Lakehouse Pattern 5: Integrating Mainframe Data with the analytical ecosystem

Synchronize and incorporate Db2 for z/OS data for lakehouse analytics, and perform real-time analytics on the mainframe across VSAM and Db2 data. Data virtualization always queries data directly from the mainframe, which brings additional load considerations, while CDC captures data in Iceberg format at a frequency defined by the administrator, which adds no load to the mainframe but also does not provide real-time data.

A data lakehouse architecture using a Data Gateway and Data Virtualization to integrate mainframe data with non-mainframe sourced data.
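
The trade-off between data virtualization and CDC can be written down as a small decision helper. This is a sketch of the reasoning in this pattern, not a product feature: the `AccessPlan` type, the method labels, and the simplification that worst-case CDC staleness equals the capture interval (ignoring capture and write latency) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPlan:
    method: str                      # "cdc_to_iceberg" or "data_virtualization"
    adds_mainframe_query_load: bool  # does each query hit the mainframe?
    worst_case_staleness_s: float    # assumed upper bound on data age


def plan_access(max_staleness_s: float, cdc_interval_s: float) -> AccessPlan:
    """Pick an access path for mainframe data.

    Assumption: data captured by CDC is at most one capture interval old.
    If that is fresh enough, prefer CDC so queries do not touch the mainframe;
    otherwise fall back to querying the mainframe directly.
    """
    if cdc_interval_s <= max_staleness_s:
        return AccessPlan("cdc_to_iceberg", False, cdc_interval_s)
    return AccessPlan("data_virtualization", True, 0.0)


# Hourly freshness is acceptable and CDC runs every 15 minutes.
print(plan_access(max_staleness_s=3600, cdc_interval_s=900))
# Data must be at most 5 seconds old; a 15-minute CDC cycle is too slow.
print(plan_access(max_staleness_s=5, cdc_interval_s=900))
```

In practice the choice may also depend on query volume and mainframe capacity, which this sketch does not model.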

Other Lakehouse Use Cases

Storage tier for new data assets - Modern applications often rely on new data sets and advanced data processing techniques to provide more efficient, scalable, and data-driven services. A data lakehouse can provide the required data and storage tier, integration, performance, scalability, and cost efficiency.
Natural language data prompt and response - Data Lakehouse (watsonx.data), in conjunction with generative AI and large language model (LLM) capabilities (watsonx.ai), enables an analyst who does not know the technical structure of the data or is not proficient in SQL to use natural language prompts to conduct cross-analysis across the different data stores and get responses from the LLM.

Architecture Decisions

Selection of Query Engine

The selection of which query engine to use is generally driven by the type of data to be queried.

The Presto query engine is best suited for use with Hive and Parquet tables/buckets.
The Spark query engine is best suited for use when Scala coding is used within an existing Hadoop/Cloudera environment.
The Db2 query engine is best suited for use with Db2 data stores.
The Netezza query engine is best suited for querying the Netezza data warehouse.
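
The guidance above amounts to a lookup from the kind of data being queried to a preferred engine. A minimal sketch of that lookup follows. The source-type keys and the `select_engine` helper are hypothetical names chosen for the example and do not correspond to a watsonx.data configuration setting.

```python
# Mapping taken from the guidance above; the keys are illustrative labels.
ENGINE_BY_SOURCE = {
    "hive": "presto",
    "parquet": "presto",
    "hadoop_scala": "spark",  # Scala code in an existing Hadoop/Cloudera environment
    "db2": "db2",
    "netezza": "netezza",
}


def select_engine(source_type: str) -> str:
    """Return the suggested query engine for a given source type."""
    key = source_type.strip().lower()
    try:
        return ENGINE_BY_SOURCE[key]
    except KeyError:
        raise ValueError(
            f"no engine guidance for source type {source_type!r}"
        ) from None


for source in ("Parquet", "Db2", "netezza"):
    print(source, "->", select_engine(source))
```

Raising an error for unknown source types, rather than silently defaulting to one engine, keeps the decision explicit when a new kind of data store is added.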

Data Lakehouse Characteristics

Unified Data Management: Ensuring that the Data Lakehouse serves as a single source of truth is crucial for consistency and reliability in data analytics and decision-making.
Data Integration: Integration of data from diverse sources and in various formats should be seamless, with support for real-time and batch data ingestion.
Query Performance: Optimized query performance to support analytics and reporting needs in line with enterprise SLAs/SLOs.
Data Governance: Successful data lakehouse implementations require a robust data governance framework to ensure data quality, metadata management, and lineage tracing.
Security: Ensure data encryption, access control, and audit trails to comply with organizational and regulatory requirements.
Deployment Flexibility: Support for on-premises, hybrid, and multi-cloud deployments provides flexibility and aids in optimizing costs and performance.
Data Sensitivity: Ensure easy data movement across different environments while maintaining data consistency and integrity.
Monitoring and Management: Implement monitoring, logging, and management tools for visibility into data movement, job completion times & rates, and performance tuning.