Apache Kafka
Apache Kafka is an open-source distributed event streaming platform used for messaging, website activity tracking, stream processing, and other use cases.
IBM Manta Data Lineage can either connect to Confluent Platform’s Schema Registry and automatically extract the schemas of Kafka topics, or let the user describe the Kafka environment manually in order to visualize it and benefit from integrations with other scanners. The Kafka visualization includes objects such as clusters, topics, schemas, and columns.
Main scanner features include:
- Metadata extraction from Confluent Platform Schema Registry
- Option to define the elements in Kafka manually by providing a simple JSON file
- Schema definitions in JSON Schema and Avro format
- Integrations with the DataStage and StreamSets scanners
- Schema definitions using “raw” JSON files or payloads
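As an illustration of the Avro format mentioned above, a topic’s value schema registered in the Schema Registry might look like the following (the record and field names are purely illustrative, not part of the Manta configuration):

```json
{
  "type": "record",
  "name": "PageView",
  "namespace": "com.example.tracking",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "page_url", "type": "string"},
    {"name": "viewed_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
```

In the lineage visualization, the fields of such a record would correspond to the columns of the topic’s schema.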
Check out the guides below for more details on setting up this scanner.
Extraction and Analysis Phase Scenarios
Extraction Phase
The extraction phase for Kafka consists of three scenarios.
- Kafka dictionary mapping scenario — creates the mapping between the dictionary ID and the broker URLs for each configured Kafka cluster
- Kafka extractor scenario — connects to each configured Kafka Schema Registry server and extracts the schemas
- Apache Kafka ingestion scenario — pulls inputs from Git (see Manta Flow Agent Configuration for Extraction: Git Source) or from a remote agent filesystem location (see Manta Flow Agent Configuration for Extraction: Agent Source)
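The extractor scenario above can be sketched against the public Confluent Schema Registry REST API. This is not Manta’s actual implementation — the registry URL, subject name, and field-to-column mapping below are assumptions made for illustration only:

```python
import json
import urllib.request


def fetch_latest_schema(registry_url: str, subject: str) -> dict:
    """Fetch the latest registered schema for a subject via the
    Schema Registry REST API (GET /subjects/{subject}/versions/latest)."""
    url = f"{registry_url}/subjects/{subject}/versions/latest"
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    # The registry returns the schema itself as a JSON-encoded string.
    return json.loads(payload["schema"])


def avro_columns(schema: dict) -> list[tuple[str, str]]:
    """Flatten an Avro record schema into (column, type) pairs, roughly
    as a lineage scanner might map schema fields to columns."""
    cols = []
    for field in schema.get("fields", []):
        ftype = field["type"]
        # Unions such as ["null", "string"] mark nullable columns.
        if isinstance(ftype, list):
            ftype = next(t for t in ftype if t != "null")
        # Complex types: prefer the logical type if one is declared.
        if isinstance(ftype, dict):
            ftype = ftype.get("logicalType", ftype.get("type"))
        cols.append((field["name"], ftype))
    return cols


# Demo on a hardcoded sample schema instead of a live registry.
sample = {
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "viewed_at",
         "type": {"type": "long", "logicalType": "timestamp-millis"}},
        {"name": "referrer", "type": ["null", "string"]},
    ],
}
print(avro_columns(sample))
```

In a real extraction, `fetch_latest_schema` would be called once per subject listed by `GET /subjects`; here the parsing step is demonstrated on a hardcoded schema so the sketch runs without a registry.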
Analysis Phase
The analysis phase for Kafka clusters has only one scenario.
- Kafka dictionary dataflow scenario — analyzes metadata from the extracted Kafka dictionaries and stores it in the internal metadata repository