Apache Kafka
Apache Kafka is an open-source distributed event streaming platform used for messaging, website activity tracking, stream processing, and other use cases.
IBM Manta Data Lineage can either connect to Confluent Platform’s Schema Registry and automatically extract the schemas of Kafka topics, or let the user describe the Kafka environment manually in order to visualize it and benefit from integrations with other scanners. The Kafka visualization includes objects such as clusters, topics, schemas, and columns.
Main scanner features include:
- Metadata extraction from Confluent Platform Schema Registry
- Option to define the elements in Kafka manually by providing a simple JSON file
- Schema definitions in JSON Schema and Avro format
- Integrations with the DataStage and StreamSets scanners
- Schema definitions using “raw” JSON files or payloads
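As an illustration of the Avro format mentioned above, a topic’s value schema registered in the Schema Registry might look like the following (the record and field names are purely illustrative, not part of the Manta configuration):

```json
{
  "type": "record",
  "name": "PageView",
  "namespace": "com.example.tracking",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "page_url", "type": "string"},
    {"name": "viewed_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
```

In the lineage visualization, the fields of such a record would correspond to the columns of the topic’s schema.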
Check out the guides below for more details on setting up this scanner.
Extraction and Analysis Phase Scenarios
Extraction Phase
The extraction phase for Kafka consists of three scenarios.
- Kafka dictionary mapping scenario — creates the mapping between the dictionary ID and the broker URLs for each configured Kafka cluster
- Kafka extractor scenario — connects to each configured Kafka Schema Registry server and extracts the schemas
- Apache Kafka ingestion scenario — pulls inputs from Git (see Manta Flow Agent Configuration for Extraction: Git Source) or from a remote agent filesystem location (see Manta Flow Agent Configuration for Extraction: Agent Source)
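The extractor scenario above can be sketched against the public Confluent Schema Registry REST API. This is not Manta’s actual implementation — the registry URL, subject name, and field-to-column mapping below are assumptions made for illustration only:

```python
import json
import urllib.request


def fetch_latest_schema(registry_url: str, subject: str) -> dict:
    """Fetch the latest registered schema for a subject via the
    Schema Registry REST API (GET /subjects/{subject}/versions/latest)."""
    url = f"{registry_url}/subjects/{subject}/versions/latest"
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    # The registry returns the schema itself as a JSON-encoded string.
    return json.loads(payload["schema"])


def avro_columns(schema: dict) -> list[tuple[str, str]]:
    """Flatten an Avro record schema into (column, type) pairs, roughly
    as a lineage scanner might map schema fields to columns."""
    cols = []
    for field in schema.get("fields", []):
        ftype = field["type"]
        # Unions such as ["null", "string"] mark nullable columns.
        if isinstance(ftype, list):
            ftype = next(t for t in ftype if t != "null")
        # Complex types: prefer the logical type if one is declared.
        if isinstance(ftype, dict):
            ftype = ftype.get("logicalType", ftype.get("type"))
        cols.append((field["name"], ftype))
    return cols


# Demo on a hardcoded sample schema instead of a live registry.
sample = {
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "viewed_at",
         "type": {"type": "long", "logicalType": "timestamp-millis"}},
        {"name": "referrer", "type": ["null", "string"]},
    ],
}
print(avro_columns(sample))
```

In a real extraction, `fetch_latest_schema` would be called once per subject listed by `GET /subjects`; here the parsing step is demonstrated on a hardcoded schema so the sketch runs without a registry.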
Analysis Phase
The analysis phase for Kafka clusters has only one scenario.
- Kafka dictionary dataflow scenario — analyzes metadata from the extracted Kafka dictionaries and stores it in the internal metadata repository