
StreamSets

StreamSets Data Collector is an open-source execution engine for fast data ingestion and light transformations. The engine is designed to execute smart data pipelines for streaming, change data capture (CDC), and batch data without hand coding. The IBM Manta Data Lineage StreamSets scanner supports, among others, Hadoop, JDBC, and Google BigQuery as both origin and destination stages, as well as processor stages such as fields, expressions, schemas, and data parsers.

Manta Data Lineage currently scans:

Check out the guides below for more details on setting up this scanner.

Extraction and Analysis Phase Scenarios

Extraction Phase

The extraction phase for StreamSets Data Collector has a single scenario.

  1. StreamSets extractor scenario — connects to the configured StreamSets Data Collector and extracts the configured pipelines
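To make the extractor scenario more concrete, the sketch below shows how a list of configured pipeline IDs could be mapped to export requests against a Data Collector instance. It is an illustration only, not the scanner's actual implementation. It assumes the Data Collector REST API serves pipeline exports under `/rest/v1/pipeline/{pipelineId}/export`; the function name `export_urls`, the host, and the pipeline IDs are hypothetical.

```python
from urllib.parse import quote


def export_urls(base_url, pipeline_ids):
    """Build one export URL per configured pipeline ID.

    Assumption: exports live under /rest/v1/pipeline/{pipelineId}/export.
    """
    root = base_url.rstrip("/") + "/"
    return [
        # Percent-encode the ID so spaces or slashes cannot break the path.
        root + "rest/v1/pipeline/" + quote(pid, safe="") + "/export"
        for pid in pipeline_ids
    ]
```

For example, `export_urls("http://sdc.example.com:18630", ["orders01"])` yields a single URL for the hypothetical pipeline `orders01`. Authentication and the HTTP request itself are left out of the sketch.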

Analysis Phase

The analysis phase for StreamSets pipelines has a single scenario.

  1. StreamSets dataflow scenario — harvests metadata and lineage from the provided StreamSets pipelines and saves it in your Manta Data Lineage metadata repository
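As a rough illustration of what harvesting lineage from a pipeline involves, the sketch below derives stage-to-stage edges from a simplified pipeline definition. It assumes each stage carries an `instanceName` plus `inputLanes` and `outputLanes` lists, as in an exported pipeline's JSON; the function `stage_edges` and the sample stage names are hypothetical, and the real scanner analyzes far more than this (stage configuration, fields, schemas).

```python
def stage_edges(pipeline_config):
    """Return (upstream, downstream) stage pairs that share a lane."""
    stages = pipeline_config.get("stages", [])

    # Map each output lane to the stage that produces it.
    producers = {}
    for stage in stages:
        for lane in stage.get("outputLanes", []):
            producers[lane] = stage["instanceName"]

    # A stage consuming a lane is downstream of that lane's producer.
    edges = []
    for stage in stages:
        for lane in stage.get("inputLanes", []):
            if lane in producers:
                edges.append((producers[lane], stage["instanceName"]))
    return edges


# Hypothetical three-stage pipeline: JDBC origin, expression processor, Hadoop destination.
sample = {
    "stages": [
        {"instanceName": "JDBC_01", "inputLanes": [], "outputLanes": ["jdbc_out"]},
        {"instanceName": "Expr_01", "inputLanes": ["jdbc_out"], "outputLanes": ["expr_out"]},
        {"instanceName": "HadoopFS_01", "inputLanes": ["expr_out"], "outputLanes": []},
    ]
}
edges = stage_edges(sample)
```

Here `edges` links the origin to the processor and the processor to the destination, which is the stage-level skeleton that field-level lineage would then be attached to.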