A data lakehouse is a data platform that merges the best aspects of data warehouses and data lakes into one data management solution.
IBM's data lakehouse and governance architecture for hybrid cloud environments is anchored on its watsonx.data platform. This platform enables enterprises to scale analytics and AI, providing a robust data store built on an open lakehouse architecture. The architecture combines the performance and usability of a data warehouse with the flexibility and scalability of a data lake, offering a balanced solution for data management and analytics tasks.
The watsonx.data platform is offered both as a SaaS offering and as an on-premises solution. For clients in a geography without a SaaS offering, or who require the lakehouse platform to remain on-premises due to regulatory or other constraints, IBM provides flexibility through the following deployment options to enable data lakehouse capabilities anywhere:
Data Lakehouse - watsonx.data is the next-gen data store architecture that balances the capabilities of data lakes and data warehouses. This is foundational to IBM's Data Lakehouse approach, facilitating the scaling of AI and Machine Learning (ML) workloads while ensuring efficient data governance.
GenAI Platform - The data lakehouse may optionally be connected to a GenAI platform for augmenting queries with LLMs. Users may input a prompt, which is sent to a fine-tuned LLM to generate retrieval queries which may be executed by the engines supported in the data lakehouse.
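The prompt-to-query flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the watsonx.data API: `stub_generate_query` is a hypothetical stand-in for the fine-tuned LLM, the `sales` table is invented for the example, and an in-memory SQLite database stands in for a lakehouse query engine.

```python
import sqlite3
from typing import Callable, List, Tuple


def stub_generate_query(prompt: str) -> str:
    """Hypothetical stand-in for the fine-tuned LLM: returns a canned SQL query."""
    canned = {
        "total sales by region": (
            "SELECT region, SUM(amount) FROM sales "
            "GROUP BY region ORDER BY region"
        ),
    }
    return canned.get(prompt.strip().lower(), "")


def answer_prompt(
    prompt: str,
    generate_query: Callable[[str], str],
    conn: sqlite3.Connection,
) -> List[Tuple]:
    """Turn a prompt into a retrieval query and run it on the engine."""
    query = generate_query(prompt)
    # Only accept read-only retrieval queries; reject anything else.
    if not query.lstrip().upper().startswith("SELECT"):
        raise ValueError(f"no retrieval query generated for prompt: {prompt!r}")
    return conn.execute(query).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build the stand-in engine with a small, invented sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("EMEA", 10.0), ("AMER", 5.0), ("EMEA", 2.5)],
    )
    return conn
```

In a real deployment the generated query would be validated more thoroughly before execution; the prefix check here only illustrates that the step exists.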
Lakehouse Pattern 1: Multiple Fit-for-purpose Query Engines
Use fit-for-purpose compute to optimize cost by leveraging the right engine for the right workload, while sharing data and metadata between all engines through a shared metastore (i.e. Data Catalog) in the same environment.
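As a rough sketch of this pattern, the snippet below models several engines resolving tables through one shared catalog, with a routing table that assigns each workload type to an engine. All names here (`SharedCatalog`, `Engine`, `pick_engine`, the engine names, the storage path, and the routing table) are hypothetical and chosen for illustration; they do not reflect the actual watsonx.data interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class TableEntry:
    location: str      # hypothetical object-storage path
    table_format: str  # open table format, e.g. "iceberg"


@dataclass
class SharedCatalog:
    """One metastore that every engine consults for table metadata."""
    tables: Dict[str, TableEntry] = field(default_factory=dict)

    def register(self, name: str, entry: TableEntry) -> None:
        self.tables[name] = entry

    def resolve(self, name: str) -> TableEntry:
        return self.tables[name]


@dataclass
class Engine:
    name: str
    catalog: SharedCatalog

    def plan(self, table: str) -> str:
        # Every engine looks the table up in the same catalog,
        # so no copy of the data or metadata is needed per engine.
        entry = self.catalog.resolve(table)
        return f"{self.name} scans {entry.location} as {entry.table_format}"


def pick_engine(
    workload: str, routing: Dict[str, str], engines: Dict[str, Engine]
) -> Engine:
    """Return the engine assigned to a workload type (routing is an assumption)."""
    return engines[routing[workload]]


catalog = SharedCatalog()
catalog.register("sales", TableEntry("s3://example-bucket/sales", "iceberg"))
engines = {n: Engine(n, catalog) for n in ("interactive-engine", "batch-engine")}
routing = {"dashboard": "interactive-engine", "etl": "batch-engine"}
```

The table is registered once, and both engines plan against the same entry; only the choice of compute changes per workload.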
Lakehouse Pattern 2: Single pane of glass for all your data
A data lakehouse enables a modern approach to current data architectures, in which enterprises have built up several silos of data stores over the years to cater to different needs. These range from structured, high-performance enterprise data warehouses (EDW) to high-volume data lakes of unstructured and semi-structured data, which often turn into data swamps (duplication, poor data quality, lack of governance). A data lakehouse with watsonx.data enables a single layer of access to a variety of data stores through multiple query engines, open data formats and governance, without the need for data movement.
Lakehouse Pattern 3: Optimize Data Warehouse workloads to reduce cost
Reduce warehousing cost while still maintaining temporal query capabilities by leveraging the lakehouse's low-cost storage and compute, and by allowing multiple query engines to consume the same data set. Query engines like Spark enable vacuumed/materialized queries of data in its current state (i.e. without the full change history), which reduces the amount of data queried and the query compute cost. In addition, the lakehouse's preprocessing and selective transformation capabilities allow for optimal distribution of Data Warehouse workloads, thus reducing cost.
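To make the idea of querying data "in its current state" concrete, the sketch below collapses a change history into one row per key. It is a plain-Python illustration of the concept only, not how Spark or any table format implements it; the `Change` record and its fields are assumptions made for the example.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class Change(NamedTuple):
    key: str
    version: int
    value: Optional[dict]  # None marks a delete


def current_state(changes: Iterable[Change]) -> Dict[str, dict]:
    """Keep only the highest-version change per key and drop deleted keys."""
    latest: Dict[str, Change] = {}
    for change in changes:
        previous = latest.get(change.key)
        if previous is None or change.version > previous.version:
            latest[change.key] = change
    return {k: c.value for k, c in latest.items() if c.value is not None}


history = [
    Change("order-1", 1, {"status": "new"}),
    Change("order-1", 2, {"status": "shipped"}),
    Change("order-2", 1, {"status": "new"}),
    Change("order-2", 2, None),  # order-2 was deleted
    Change("order-3", 1, {"status": "new"}),
]
state = current_state(history)
```

Here five history records reduce to two current rows, which is the source of the smaller query size: downstream queries scan the materialized state instead of every recorded change.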
Lakehouse Pattern 4: Hybrid Multi Cloud Deployment
Connect to and access data remotely across hybrid cloud with the ability to cache remote sources.
Lakehouse Pattern 5: Integrating Mainframe Data with analytical ecosystem
Synchronize and incorporate Db2 for z/OS data for lakehouse analytics, and perform real-time analytics on the mainframe across VSAM and Db2 data. Data virtualization always queries data directly from the mainframe, which brings additional load considerations, while change data capture (CDC) captures information in Iceberg format at a frequency defined by the administrator (adding no load to the mainframe, but also not providing real-time data).
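The trade-off between the two approaches can be illustrated with a toy model. Everything below is a simplified assumption for illustration: `MainframeSource`, `VirtualizedView`, and `CdcReplica` are hypothetical classes, and the model treats reading the change log as free of query load on the source, in line with the description above.

```python
from typing import Dict, List, Tuple


class MainframeSource:
    """Toy source system that counts the queries it has to serve."""

    def __init__(self) -> None:
        self.rows: Dict[str, int] = {}
        self.log: List[Tuple[str, int]] = []  # change log consumed by CDC
        self.query_count = 0

    def write(self, key: str, value: int) -> None:
        self.rows[key] = value
        self.log.append((key, value))

    def query(self) -> Dict[str, int]:
        self.query_count += 1  # each direct query adds load to the source
        return dict(self.rows)


class VirtualizedView:
    """Always reads from the source: real-time data, but load on every query."""

    def __init__(self, source: MainframeSource) -> None:
        self.source = source

    def query(self) -> Dict[str, int]:
        return self.source.query()


class CdcReplica:
    """Applies logged changes on a schedule: no query load, but possibly stale."""

    def __init__(self, source: MainframeSource) -> None:
        self.source = source
        self.offset = 0
        self.snapshot: Dict[str, int] = {}

    def sync(self) -> None:
        # Apply only the changes recorded since the last sync.
        for key, value in self.source.log[self.offset:]:
            self.snapshot[key] = value
        self.offset = len(self.source.log)

    def query(self) -> Dict[str, int]:
        return dict(self.snapshot)
```

Between two `sync()` calls the replica answers from its last snapshot, so the sync frequency chosen by the administrator bounds how stale the lakehouse copy can be.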
The selection of which query engine to use is generally driven by the type of data to be queried.
The complete IBM Generative AI Architecture is available in IBM IT Architect Assistant (IIAA), an architecture development and management tool. Using IIAA, architects can elaborate and customize the architecture to create their own generative AI solutions.
This repository contains a Tekton pipeline to deploy IBM watsonx.data onto a Red Hat OpenShift cluster.
This repository contains assets to run a lab and workshop for watsonx.data enablement.