StreamSets Resource Configuration

Before you configure your scanner, make sure you meet the prerequisites. Read our guide on StreamSets integration requirements to double-check.

Source System Properties

This configuration can be set up by creating a new connection on the Admin UI > Connections tab or by editing an existing connection in Admin UI > Connections > Data Integration Tools > Streamsets > specific connection. A new connection can also be created via the Manta Orchestration API.

One IBM Manta Data Lineage connection for StreamSets corresponds to one StreamSets server that will be analyzed.

Property name

Description

Example

streamsets.extractor.server

Custom name used to identify this StreamSets connection in Manta Data Lineage

template

streamsets.extractor.service

Type of StreamSets service used for the extraction; values to pick from are:

  • “Data Collector”, “Control Hub Cloud”, and “Control Hub On-Premises”

If not set, the default value is “Data Collector”

Data Collector

streamsets.extractor.address

Address of the server used for the actual connection to the StreamSets repository; only considered for the extraction from Data Collector or Control Hub On-Premises

192.168.0.16
prod.getmanta.com

streamsets.extractor.port

Port of the server used for the actual connection to the StreamSets repository; only considered for the extraction from Data Collector or Control Hub On-Premises

80
443

streamsets.extractor.scheme

Scheme of the server used for the actual connection to the StreamSets repository; only considered for the extraction from Data Collector or Control Hub On-Premises

http

streamsets.extractor.user

Name of the user used for the connection to StreamSets Data Collector (SDC) or StreamSets Control Hub (SCH)

guest

streamsets.extractor.password

Password of the user used for the connection to SDC or SCH

guest

streamsets.extractor.use.pipeline.extraction

Whether pipelines should be extracted as part of the extraction process; if true, the following four properties determine which pipelines are extracted

true

streamsets.extractor.include.pipelines

Comma-separated list of pipeline IDs to include in the extraction

Note that if both the streamsets.extractor.include.pipelines and streamsets.extractor.include.labels properties are left empty, all pipelines will be extracted (except those that have been excluded)

pipelineId01,pipelineId02,pipelineId03

streamsets.extractor.include.labels

Comma-separated list of pipeline labels to include in the extraction

Note that if both the streamsets.extractor.include.pipelines and streamsets.extractor.include.labels properties are left empty, all pipelines will be extracted (except those that have been excluded)

label01,label02,label03

streamsets.extractor.exclude.pipelines

Comma-separated list of pipeline IDs to exclude from the extraction

pipelineId01,pipelineId04

streamsets.extractor.exclude.labels

Comma-separated list of pipeline labels to exclude from the extraction

label01,label04

streamsets.extractor.use.job.extraction

Whether jobs should be extracted as part of the extraction process; job extraction is only supported for SCH; if true, the following four properties determine which jobs are extracted

true

streamsets.extractor.include.jobs

Comma-separated list of job IDs to include in the extraction

Note that if both the streamsets.extractor.include.jobs and streamsets.extractor.include.job.tags properties are left empty, all jobs will be extracted (except those that have been excluded)

jobId01,jobId02,jobId03

streamsets.extractor.include.job.tags

Comma-separated list of job tags to include in the extraction

Note that if both the streamsets.extractor.include.jobs and streamsets.extractor.include.job.tags properties are left empty, all jobs will be extracted (except those that have been excluded)

tag01,tag02,tag03

streamsets.extractor.exclude.jobs

Comma-separated list of job IDs to exclude from the extraction

jobId01,jobId04

streamsets.extractor.exclude.job.tags

Comma-separated list of job tags to exclude from the extraction

tag01,tag04

streamsets.input.encoding

Encoding of extracted pipelines. See Encodings for applicable values.

UTF-8

streamsets.extractor.verifyHostname

When using HTTPS, whether the hostname in the server's certificate should be validated against the hostname of the server

true
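
The include and exclude properties above can be read as a simple filter. The following Python sketch is only an illustration of that reading and is not Manta's implementation: the Pipeline class and both helper functions are hypothetical. The documented rule is that empty include lists select every pipeline except the excluded ones. Two further behaviors are assumptions made for this sketch: when both include lists are set, a match on either list is enough, and an exclusion always wins over an inclusion.

```python
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    """Hypothetical stand-in for a pipeline; not a Manta or StreamSets type."""
    pipeline_id: str
    labels: set = field(default_factory=set)


def parse_list(value):
    """Split a comma-separated property value into a set, ignoring blanks."""
    return {item.strip() for item in (value or "").split(",") if item.strip()}


def select_pipelines(pipelines, props):
    """Return the IDs of the pipelines that pass the include/exclude properties."""
    include_ids = parse_list(props.get("streamsets.extractor.include.pipelines"))
    include_labels = parse_list(props.get("streamsets.extractor.include.labels"))
    exclude_ids = parse_list(props.get("streamsets.extractor.exclude.pipelines"))
    exclude_labels = parse_list(props.get("streamsets.extractor.exclude.labels"))

    selected = []
    for p in pipelines:
        if include_ids or include_labels:
            # Assumption: matching either include list is sufficient.
            included = p.pipeline_id in include_ids or bool(p.labels & include_labels)
        else:
            # Documented: both include lists empty means all pipelines.
            included = True
        # Assumption: exclusion takes precedence over inclusion.
        excluded = p.pipeline_id in exclude_ids or bool(p.labels & exclude_labels)
        if included and not excluded:
            selected.append(p.pipeline_id)
    return selected


pipelines = [
    Pipeline("pipelineId01", {"label01"}),
    Pipeline("pipelineId02", {"label02"}),
    Pipeline("pipelineId04", {"label03"}),
    Pipeline("pipelineId05", {"label09"}),
]
props = {
    "streamsets.extractor.include.pipelines": "",
    "streamsets.extractor.include.labels": "",
    "streamsets.extractor.exclude.pipelines": "pipelineId01,pipelineId04",
}
print(select_pipelines(pipelines, props))
```

With both include lists empty, the sketch keeps pipelineId02 and pipelineId05 and drops the two excluded IDs. The same reading would apply to the job properties, with job IDs and job tags in place of pipeline IDs and labels.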

Common Scanner Properties

This configuration is common to all StreamSets source systems and all StreamSets scenarios, and is configured in Admin UI > Configuration > CLI > Streamsets > Streamsets Common. It can be overridden at the individual connection level.

Property name

Description

Example

streamsets.input.dir

Directory with pipelines extracted from the StreamSets server

${manta.dir.temp}/streamsets/${streamsets.extractor.server}

streamsets.runtime.values.dir

Directory with manually provided runtime values that are used in SDC pipelines; the properties file and TXT files used in the pipelines should be stored here

${manta.dir.input}/streamsets/${streamsets.extractor.server}

filepath.lowercase

Whether paths to files should be lowercase (false for case-sensitive file systems, true otherwise)

false
true

streamsets.data.collectors.settings.file

Path to the automatically generated file with Data Collector settings

${manta.dir.temp}/streamsets/${streamsets.extractor.server}/dataCollectors.csv

streamsets.data.collectors.manual.settings.file

Path to the optional file with manual Data Collector settings

${manta.dir.input}/streamsets/${streamsets.extractor.server}/dataCollectors.csv

streamsets.data.collectors.manual.settings.encoding

Encoding of the manual Data Collector settings file. See Encodings for applicable values.

UTF-8

streamsets.extractor.itemsPerRequest

Number of items (pipelines/jobs) extracted per HTTP request

100

streamsets.extraction.method

Extraction method to use. Set to Agent:default to use the default Manta Extractor Agent, to Agent:{remote_agent_name} to use a remote Agent, or to Git:{git.dictionary.id} to use the Git ingest method. For more information on setting up a remote extractor Agent, refer to the Manta Flow Agent Configuration for Extraction documentation. For details on configuring the Git ingest method, refer to the Manta Flow Agent Configuration for Extraction:Git Source documentation.

default
Git
agent
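
Several example values in this table use ${...} placeholders that refer to other properties. The Python sketch below shows how such a value might expand. It is an illustration only and does not reproduce Manta's actual resolver: the resolve function is hypothetical, the value of manta.dir.temp is an invented path, and the substitution is a single pass without nested resolution.

```python
import re

# Matches ${property.name} references inside a property value.
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def resolve(value, props):
    """Replace each ${name} in value with props[name]; unknown names stay as-is."""
    def substitute(match):
        name = match.group(1)
        # Leave the placeholder untouched if the property is not defined.
        return props.get(name, match.group(0))

    return _PLACEHOLDER.sub(substitute, value)


props = {
    "manta.dir.temp": "/opt/manta/temp",  # hypothetical path, for illustration only
    "streamsets.extractor.server": "template",
    "streamsets.input.dir": "${manta.dir.temp}/streamsets/${streamsets.extractor.server}",
}

print(resolve(props["streamsets.input.dir"], props))
```

Under these assumed values, the input directory expands to /opt/manta/temp/streamsets/template, so each connection gets its own subdirectory named after streamsets.extractor.server.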

Data Collector Settings

If Control Hub is enabled, a Data Collector settings file is created automatically during extraction. This file is used during lineage analysis to assign hostnames to Data Collectors. Alternatively, these hostnames can be configured manually in the optional file set by streamsets.data.collectors.manual.settings.file, as follows.

Example of a configuration file:

Data Collector ID,Hostname
fb2debf5-ad7d-11eb-9136-1929a68f9fe7,sdc.getmanta.com
24216e30-b83c-11e9-9d92-93f90b5c1ed8,streamsets.int.getmanta.com
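
To check a manually written settings file before running a scan, it can be loaded into an ID-to-hostname mapping. The Python sketch below assumes the file is comma-separated with exactly the header row shown in the example above. The function name is hypothetical and this is not how Manta itself reads the file.

```python
import csv
import io


def load_data_collector_hostnames(text):
    """Parse the settings CSV text into a {Data Collector ID: hostname} dict."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["Data Collector ID"].strip(): row["Hostname"].strip()
        for row in reader
    }


sample = """Data Collector ID,Hostname
fb2debf5-ad7d-11eb-9136-1929a68f9fe7,sdc.getmanta.com
24216e30-b83c-11e9-9d92-93f90b5c1ed8,streamsets.int.getmanta.com
"""

hostnames = load_data_collector_hostnames(sample)
for collector_id, hostname in hostnames.items():
    print(collector_id, "->", hostname)
```

When reading a real file from disk, open it with the encoding configured in streamsets.data.collectors.manual.settings.encoding (UTF-8 in the example) and pass its contents to the function.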