Medallion Accelerator

Try the Medallion Accelerator sample project to build scalable, governed lakehouse pipelines on watsonx.data by implementing the bronze–silver–gold medallion pattern. The pipelines include batch and streaming ingestion, enrichment, and analytics.

A medallion pattern logically organizes data into the following layers that have progressively better data quality and structure:

Bronze: The data is ingested into the lakehouse as raw data, without enforcing data types or consistency.
Silver: The data is filtered, cleaned, and organized or partitioned. Data quality rules are applied, such as handling null values, removing duplicates, and enforcing constraints. Silver layer data is suitable for deep analytics, training machine learning models, and generating reports. This layer provides the building blocks for further data refinement.
Gold: The data is aligned with specific business use cases, such as data marts, high-performance dashboards, and other business intelligence tools. The data is organized, pre-aggregated, and enriched with metadata semantics.
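
The following minimal sketch illustrates how data quality improves from layer to layer. It uses plain Python lists and dictionaries in place of lakehouse tables, so it is not how the accelerator itself runs. The sample records, the column names, and the to_silver and to_gold helpers are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical raw complaint records, as they might land in a Bronze table:
# every value is an untyped string and nothing has been validated yet.
bronze = [
    {"complaint_id": "1", "product": "Loan", "received": "2024-01-05"},
    {"complaint_id": "1", "product": "Loan", "received": "2024-01-05"},  # duplicate
    {"complaint_id": "2", "product": "", "received": "2024-01-06"},  # missing product
    {"complaint_id": "3", "product": "Card", "received": "2024-01-07"},
    {"complaint_id": "4", "product": "Loan", "received": "2024-02-01"},
]


def to_silver(rows):
    """Skip rows with empty values, remove duplicate IDs, and cast the ID to int."""
    seen = set()
    silver_rows = []
    for row in rows:
        if not all(row.values()):
            continue  # null handling: skip incomplete rows
        complaint_id = int(row["complaint_id"])  # enforce a data type
        if complaint_id in seen:
            continue  # deduplication on the ID column
        seen.add(complaint_id)
        silver_rows.append({**row, "complaint_id": complaint_id})
    return silver_rows


def to_gold(rows):
    """Pre-aggregate Silver rows into a complaint count per product."""
    return dict(Counter(row["product"] for row in rows))


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

In this sketch, the Bronze list keeps every record as received, the Silver list keeps only complete and unique records, and the Gold result is a small summary that a dashboard could read directly.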

Obtaining and running the accelerator

Ask your client team for a copy of the accelerator sample project archive file.

Required services

You need the following services to run the Medallion Accelerator:

watsonx.data
watsonx.ai
Amazon S3 or another object storage service on your IBM Software Hub cluster
Optional: IBM watsonx.data intelligence
Optional: Kafka

Running the accelerator

To run the accelerator, open the README - Medallion Accelerator PDF file on the project Assets page and follow the instructions.

Overview

The Medallion Accelerator contains a set of notebooks, parameter sets, and Python scripts that demonstrate the following tasks:

Ingest flat files into Iceberg format.
Run ingestion, transformation, enrichment, and analysis jobs with a Spark engine.
Create a streaming data pipeline from a Kafka topic.
Validate and transform data.
Enrich data with insights from natural language processing models.
Generate visualizations of insights.
Publish data assets in an IBM watsonx.data intelligence catalog.

Static structured data ingestion flow

The dataset for this flow is a CSV file that contains customer complaint data.

The data pipeline that processes the static data has the following steps:

The dataset file is ingested into the lakehouse in Iceberg format and processed. For example, data types are assigned to columns.
The data is copied to Silver tables and refined. For example, primary keys are determined and partitions are created.
The data is copied to Gold tables and summary tables are created.
Optional: The Gold tables are enriched with natural language processing to analyze customer complaints.
Optional: Visualizations for the enriched Gold tables are created.
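
The first two steps of this flow (assigning data types during ingestion, then determining a primary key and creating partitions) can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the accelerator's code: the accelerator runs these steps on a Spark engine against Iceberg tables, and the CSV content, column names, SCHEMA mapping, and helper functions shown here are hypothetical.

```python
import csv
import io
from collections import defaultdict
from datetime import date

# Hypothetical CSV content; the columns of the real dataset are assumptions.
RAW_CSV = """complaint_id,received,product
101,2024-01-15,Mortgage
102,2024-01-20,Credit card
103,2024-02-03,Mortgage
"""

# Assumed schema: column name -> function that converts the raw string.
SCHEMA = {
    "complaint_id": int,
    "received": date.fromisoformat,
    "product": str,
}


def ingest(csv_text):
    """Read CSV text and assign a data type to each column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{col: cast(row[col]) for col, cast in SCHEMA.items()} for row in reader]


def partition_by_month(rows):
    """Group rows by (year, month) of the received date.

    Inside each partition, rows are keyed by complaint_id, which plays the
    role of the primary key in this sketch.
    """
    partitions = defaultdict(dict)
    for row in rows:
        key = (row["received"].year, row["received"].month)
        partitions[key][row["complaint_id"]] = row
    return dict(partitions)


typed_rows = ingest(RAW_CSV)
partitions = partition_by_month(typed_rows)
for key, rows in sorted(partitions.items()):
    print(key, sorted(rows))
```

Partitioning by month is only one possible choice. A real pipeline would pick the partition column based on how the Silver tables are queried.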

Streaming data processing flow

The data for this flow comes from a Kafka topic that contains customer complaint data.

The data pipeline that processes the streaming data has the following steps:

A REST API call pulls records from the data source and publishes them to the Kafka topic.
The records from the Kafka topic are ingested into a staging table in the lakehouse and marked with timestamps.
The records are merged into the Bronze table and validated.
The records are merged into the Silver table and refined to assign data types.
The records are merged into the Gold table.
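
The staging and merge steps above can be sketched as a timestamped upsert. The example below uses an in-memory dictionary in place of a lakehouse table and plain lists in place of Kafka micro-batches, so it shows only the merge logic, not the accelerator's actual implementation. The stage and merge helpers, the complaint_id key, and the sample records are hypothetical.

```python
from datetime import datetime, timezone


def stage(records, ingested_at):
    """Copy incoming records into a staging batch and stamp each with an ingestion time."""
    return [{**record, "ingested_at": ingested_at} for record in records]


def merge(table, staged, key="complaint_id"):
    """Upsert staged rows into a table (a dict keyed by `key`).

    A staged row replaces an existing row only if its timestamp is the same
    or newer, so replaying an old batch does not overwrite fresher data.
    """
    for row in staged:
        current = table.get(row[key])
        if current is None or row["ingested_at"] >= current["ingested_at"]:
            table[row[key]] = row
    return table


bronze_table = {}
t1 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
t2 = datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)

# First micro-batch: two new records.
first_batch = [
    {"complaint_id": "7", "status": "open"},
    {"complaint_id": "8", "status": "open"},
]
merge(bronze_table, stage(first_batch, t1))

# Second micro-batch: an update to record 7 arrives later.
merge(bronze_table, stage([{"complaint_id": "7", "status": "closed"}], t2))

print(bronze_table["7"]["status"])
```

The same merge function could be applied again to move rows from the Bronze table to the Silver and Gold tables, with validation or type assignment applied between the steps.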

Publish to catalog flow

You can optionally publish the Bronze, Silver, and Gold tables as data assets in an existing IBM watsonx.data intelligence catalog.

The flow for publishing tables to a catalog is the same for static and streaming data:

The Bronze tables are published to the catalog.
The Silver tables are published to the catalog.
The Gold tables are published to the catalog.