StreamSets Resource Configuration
Before you configure your scanner, make sure you meet the prerequisites. Read our guide on StreamSets integration requirements to double-check.
Source System Properties
This configuration can be set up by creating a new connection on the Admin UI > Connections tab or by editing an existing connection in Admin UI > Connections > Data Integration Tools > Streamsets > specific connection. A new connection can also be created via the Manta Orchestration API.
One IBM Manta Data Lineage connection for StreamSets corresponds to one StreamSets server that will be analyzed.
Property name | Description | Example |
---|---|---|
streamsets.extractor.server | Custom name used to identify this StreamSets connection in Manta Data Lineage | template |
streamsets.extractor.service | Type of StreamSets service used for the extraction; values to pick from are: If not set, the default value is “Data Collector” | Data Collector |
streamsets.extractor.address | Address of the server used for the actual connection to the StreamSets repository; only considered for the extraction from Data Collector or Control Hub On-Premises | 192.168.0.16 |
streamsets.extractor.port | Port of the server used for the actual connection to the StreamSets repository; only considered for the extraction from Data Collector or Control Hub On-Premises | 80 |
streamsets.extractor.scheme | Scheme of the server used for the actual connection to the StreamSets repository; only considered for the extraction from Data Collector or Control Hub On-Premises | http |
streamsets.extractor.user | Name of the user used for the connection to SDC or SCH | guest |
streamsets.extractor.password | Password of the user used for the connection to SDC or SCH | guest |
streamsets.extractor.use.pipeline.extraction | Whether pipelines should be extracted as a part of the extraction process; if true, the following four properties should be set | true |
streamsets.extractor.include.pipelines | Comma-separated list of pipeline IDs to include in the extraction. Note that if both the | pipelineId01,pipelineId02,pipelineId03 |
streamsets.extractor.include.labels | Comma-separated list of pipeline labels to include in the extraction. Note that if both the | label01,label02,label03 |
streamsets.extractor.exclude.pipelines | Comma-separated list of pipeline IDs to exclude from the extraction | pipelineId01,pipelineId04 |
streamsets.extractor.exclude.labels | Comma-separated list of pipeline labels to exclude from the extraction | label01,label04 |
streamsets.extractor.use.job.extraction | Whether jobs should be extracted as part of the extraction process; job extraction is only supported for SCH; if true, the following four properties should be set | true |
streamsets.extractor.include.jobs | Comma-separated list of job IDs to include in the extraction. Note that if both the | jobId01,jobId02,jobId03 |
streamsets.extractor.include.job.tags | Comma-separated list of job tags to include in the extraction. Note that if both the | tag01,tag02,tag03 |
streamsets.extractor.exclude.jobs | Comma-separated list of job IDs to exclude from the extraction | jobId01,jobId04 |
streamsets.extractor.exclude.job.tags | Comma-separated list of job tags to exclude from the extraction | tag01,tag04 |
streamsets.input.encoding | Encoding of extracted pipelines. See Encodings for applicable values. | UTF-8 |
streamsets.extractor.verifyHostname | When using HTTPS, whether the hostname of the server's certificate should be validated to match the hostname of the server | true |
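To illustrate how the connection properties relate to each other, the following minimal Python sketch assembles a base URL from the scheme, address, and port properties and splits a comma-separated include list into individual IDs. The helper names `base_url` and `split_list` are hypothetical and are not part of Manta Data Lineage; the values are the examples from the table above.

```python
# Hypothetical helpers for illustration only; not part of Manta Data Lineage.

def base_url(props):
    """Combine the scheme, address, and port properties into one URL string."""
    scheme = props["streamsets.extractor.scheme"]
    address = props["streamsets.extractor.address"]
    port = props["streamsets.extractor.port"]
    return f"{scheme}://{address}:{port}"


def split_list(value):
    """Split a comma-separated property value into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


# Example values taken from the property table.
connection = {
    "streamsets.extractor.scheme": "http",
    "streamsets.extractor.address": "192.168.0.16",
    "streamsets.extractor.port": "80",
    "streamsets.extractor.include.pipelines": "pipelineId01,pipelineId02,pipelineId03",
}

print(base_url(connection))
print(split_list(connection["streamsets.extractor.include.pipelines"]))
```

The sketch deliberately does not model how include and exclude lists interact, since that behavior is not fully specified here.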
Common Scanner Properties
This configuration is common for all Streamsets source systems and for all Streamsets scenarios, and is configured in Admin UI > Configuration > CLI > Streamsets > Streamsets Common. It can be overridden at the individual connection level.
Property name | Description | Example |
---|---|---|
streamsets.input.dir | Directory with pipelines extracted from the StreamSets server | ${manta.dir.temp}/streamsets/${streamsets.extractor.server} |
streamsets.runtime.values.dir | Directory with manually provided runtime values that are used in SDC pipelines; the properties file and TXT files used in the pipelines should be stored here | ${manta.dir.input}/streamsets/${streamsets.extractor.server} |
filepath.lowercase | Whether paths to files should be lowercase (false for case-sensitive file systems, true otherwise) | false |
streamsets.data.collectors.settings.file | Path to the automatically generated file with Data Collector settings | ${manta.dir.temp}/streamsets/${streamsets.extractor.server}/dataCollectors.csv |
streamsets.data.collectors.manual.settings.file | Path to the optional file with manual Data Collector settings | ${manta.dir.input}/streamsets/${streamsets.extractor.server}/dataCollectors.csv |
streamsets.data.collectors.manual.settings.encoding | Encoding of the manual Data Collector settings file. See Encodings for applicable values. | UTF-8 |
streamsets.extractor.itemsPerRequest | Number of items (pipelines/jobs) extracted per HTTP request | 100 |
streamsets.extraction.method | Set to Agent:default when the desired extraction method is the default Manta Extractor Agent, set to Agent:{remote_agent_name} when a remote Agent is the desired extraction method, set to Git:{git.dictionary.id} when the Git ingest method is the desired extraction method. For more information on setting up a remote extractor Agent please refer to the Manta Flow Agent Configuration for Extraction documentation. For additional details on configuring a Git ingest method, please refer to the Manta Flow Agent Configuration for Extraction:Git Source documentation. | default Git agent |
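As a rough illustration of how the example values above can be read, the following Python sketch expands `${...}` placeholders in a path and splits a `streamsets.extraction.method` value into its kind and target. Both helpers (`expand`, `parse_extraction_method`) are hypothetical, the install path and the Git identifier are made up, and Manta's own resolution logic may differ.

```python
import re

# Hypothetical helpers for illustration only; Manta's own logic may differ.

def expand(value, props):
    """Replace each ${name} placeholder with props[name]; unknown names stay as-is."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: props.get(m.group(1), m.group(0)), value)


def parse_extraction_method(value):
    """Split 'Agent:default', 'Agent:<name>', or 'Git:<id>' into (kind, target)."""
    kind, sep, target = value.partition(":")
    if not sep or kind not in ("Agent", "Git") or not target:
        raise ValueError(f"unsupported extraction method: {value!r}")
    return kind, target


props = {
    "manta.dir.temp": "/opt/manta/temp",        # assumed install path
    "streamsets.extractor.server": "template",  # example connection name
}

print(expand("${manta.dir.temp}/streamsets/${streamsets.extractor.server}", props))
print(parse_extraction_method("Agent:default"))
print(parse_extraction_method("Git:my-git-id"))  # 'my-git-id' is a made-up identifier
```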
Data Collector Settings
If Control Hub is enabled, a Data Collector settings file is created automatically during extraction. This file is used during lineage analysis to assign hostnames to Data Collectors. Alternatively, these hostnames can be configured manually as follows.
- Create or open the file referenced by the streamsets.data.collectors.settings.file property (e.g., <MANTA_DIR_HOME>/temp/streamsets/{streamsets.extractor.server}/dataCollectors.csv).
- It is also possible to override the automatically extracted connection settings stored in this file by providing a manual configuration in the file referenced by the streamsets.data.collectors.manual.settings.file property (e.g., <MANTA_DIR_HOME>/input/streamsets/{streamsets.extractor.server}/dataCollectors.csv).
- Add a new line for each Data Collector with the following values.
  - Data Collector ID: ID of the Data Collector. This can be found in Control Hub under https://cloud.streamsets.com/sch/jobRunner/dataCollectors by clicking on Individual Data Collector. Details about the Data Collector will appear, including its ID.
  - Hostname: Hostname of the Data Collector
Example of a configuration file:
Data Collector ID,Hostname
fb2debf5-ad7d-11eb-9136-1929a68f9fe7,sdc.getmanta.com
24216e30-b83c-11e9-9d92-93f90b5c1ed8,streamsets.int.getmanta.com
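
The override behavior described above can be sketched in Python as follows. This is a minimal illustration, not Manta's actual implementation: it assumes both files start with the `Data Collector ID,Hostname` header row shown in the example, the helper names `read_collectors` and `merge_settings` are hypothetical, and the override hostname is made up.

```python
import csv
import io

# Hypothetical helpers for illustration only; not Manta's actual implementation.

def read_collectors(text):
    """Parse 'Data Collector ID,Hostname' CSV text into a dict of ID -> hostname."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["Data Collector ID"]: row["Hostname"] for row in reader}


def merge_settings(automatic, manual):
    """Return the automatic settings with manual entries taking precedence."""
    merged = dict(automatic)
    merged.update(manual)
    return merged


# Content of the automatically generated file (from the example above).
automatic_csv = """Data Collector ID,Hostname
fb2debf5-ad7d-11eb-9136-1929a68f9fe7,sdc.getmanta.com
24216e30-b83c-11e9-9d92-93f90b5c1ed8,streamsets.int.getmanta.com
"""

# Made-up manual override for the first Data Collector.
manual_csv = """Data Collector ID,Hostname
fb2debf5-ad7d-11eb-9136-1929a68f9fe7,sdc-override.example.com
"""

merged = merge_settings(read_collectors(automatic_csv), read_collectors(manual_csv))
for collector_id, hostname in merged.items():
    print(collector_id, hostname)
```

Entries that appear only in the automatically generated file are kept unchanged; only IDs present in the manual file have their hostname replaced.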