Apache Kafka

Apache Kafka is an open-source distributed event streaming platform used for messaging, website activity tracking, stream processing, and similar workloads.

IBM Manta Data Lineage supports two ways of describing a Kafka environment. It can connect to Confluent Platform’s Schema Registry and automatically extract the schemas used by Kafka topics, or the user can describe the Kafka environment manually. Either way, the environment can be visualized and benefits from integrations with other scanners. The Kafka visualization includes objects such as a cluster, topics, schemas, and columns.

Main scanner features include:

Check out the guides below for more details on setting up this scanner.

Extraction and Analysis Phase Scenarios

Extraction Phase

The extraction phase for Kafka has three scenarios.

  1. Kafka dictionary mapping scenario — creates the mapping between the dictionary ID and broker URLs for each configured Kafka cluster

  2. Kafka extractor scenario — connects to each configured Kafka Schema Registry server and extracts the schemas

  3. Apache Kafka ingestion scenario - pulls inputs from Git (see Manta Flow Agent Configuration for Extraction: Git Source) or from a remote agent filesystem location (see Manta Flow Agent Configuration for Extraction: Agent Source)
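
To make the extractor scenario more concrete, the sketch below shows what reading schemas from a Confluent Schema Registry can look like over its REST API. It is an illustration only and not the scanner's actual implementation: the function names `http_get_json` and `extract_latest_schemas` are hypothetical, authentication and error handling are omitted, and the only parts taken from the Schema Registry API are the `GET /subjects` and `GET /subjects/{subject}/versions/latest` endpoints.

```python
import json
from typing import Callable, Dict
from urllib.parse import quote
from urllib.request import urlopen


def http_get_json(url: str):
    """Fetch a URL and decode its JSON body (no authentication, for brevity)."""
    with urlopen(url) as response:
        return json.load(response)


def extract_latest_schemas(
    registry_url: str,
    get_json: Callable[[str], object] = http_get_json,
) -> Dict[str, dict]:
    """Return the latest registered schema version for every subject.

    Hypothetical helper: `get_json` is injectable so the logic can be
    exercised without a running Schema Registry.
    """
    base = registry_url.rstrip("/")
    schemas: Dict[str, dict] = {}
    # GET /subjects returns a JSON array of subject names.
    for subject in get_json(f"{base}/subjects"):
        # GET /subjects/{subject}/versions/latest returns a JSON object
        # with the subject, version, schema id, and the schema as a string.
        encoded = quote(subject, safe="")
        schemas[subject] = get_json(f"{base}/subjects/{encoded}/versions/latest")
    return schemas
```

With the registry's default subject naming strategy, subjects are typically named `<topic>-key` and `<topic>-value`, which is one way a schema can be tied back to its topic. A call such as `extract_latest_schemas("http://localhost:8081")` assumes a registry reachable at that placeholder address.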

Analysis Phase

The analysis phase for Kafka clusters has a single scenario.

  1. Kafka dictionary dataflow scenario — analyzes metadata from the extracted Kafka dictionaries and stores it in the internal metadata repository