StreamSets

StreamSets Data Collector is an open-source execution engine for fast data ingestion and light transformations. The engine is designed to execute smart data pipelines for streaming, change data capture (CDC), and batch data without hand coding. The IBM Automatic Data Lineage StreamSets scanner supports, among others, Hadoop, JDBC, and Google BigQuery as both origin and destination stages, as well as processor stages such as fields, expressions, schemas, and data parsers.

Automatic Data Lineage currently scans:

Pipelines and their stages
Database connections

Check out the guides below for more details on setting up this scanner.

Extraction and Analysis Phase Scenarios

Extraction Phase

For the extraction phase for StreamSets Data Collector, there are two scenarios.

StreamSets extractor scenario — connects to the configured StreamSets Data Collector and extracts the configured pipelines
StreamSets ingestion scenario - pulls inputs from Git (see Manta Flow Agent Configuration for Extraction: Git Source) or from a remote agent filesystem location (see Manta Flow Agent Configuration for Extraction: Agent Source)

Analysis Phase

For the analysis phase for the StreamSets pipeline, there is only one scenario.

StreamSets dataflow scenario — harvests metadata and lineage from the provided StreamSets pipelines and saves it in your Automatic Data Lineage metadata repository
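
To make the idea of stage-level lineage more concrete, the following minimal Python sketch derives stage-to-stage links from a pipeline definition. It is an illustration only and does not describe how the Automatic Data Lineage scanner is implemented. It assumes that each stage in the pipeline JSON carries instanceName, inputLanes, and outputLanes fields, as in a typical Data Collector pipeline export. The sample pipeline, its stage names, and the stage_edges helper are hypothetical.

```python
def stage_edges(pipeline_config):
    """Return (upstream, downstream) stage pairs that share a lane.

    Assumes each stage dict has "instanceName", "inputLanes", and
    "outputLanes" keys (an assumption about the exported pipeline format).
    """
    stages = pipeline_config.get("stages", [])

    # Map every lane name to the stages that read from it.
    consumers = {}
    for stage in stages:
        for lane in stage.get("inputLanes", []):
            consumers.setdefault(lane, []).append(stage["instanceName"])

    # A stage that writes to a lane feeds every stage that reads from it.
    edges = []
    for stage in stages:
        for lane in stage.get("outputLanes", []):
            for downstream in consumers.get(lane, []):
                edges.append((stage["instanceName"], downstream))
    return edges


# Hypothetical three-stage pipeline: origin -> processor -> destination.
sample_pipeline = {
    "stages": [
        {"instanceName": "JDBCQueryConsumer_01",
         "inputLanes": [], "outputLanes": ["lane1"]},
        {"instanceName": "ExpressionEvaluator_01",
         "inputLanes": ["lane1"], "outputLanes": ["lane2"]},
        {"instanceName": "GoogleBigQuery_01",
         "inputLanes": ["lane2"], "outputLanes": []},
    ]
}

edges = stage_edges(sample_pipeline)
for upstream, downstream in edges:
    print(f"{upstream} -> {downstream}")
```

The sketch only links stages to each other. Resolving the actual source and target objects, such as tables or files, would additionally require the stage configuration and the database connections listed above.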