StreamSets Integration Requirements
The following are the prerequisites necessary for IBM Automatic Data Lineage to connect to this third-party system, which you may choose to do at your sole discretion. Note that while these are usually sufficient to connect to this third-party system, we cannot guarantee the success of the connection or integration since we have no control, liability, or responsibility for third-party products or services, including for their performance.
- StreamSets Data Collector (SDC): version 3.10.1 (other versions greater than 3.0.0 should work but haven't been tested)
- If Control Hub is not enabled
    - Connection parameters to StreamSets Data Collector
        - Address
        - Port
        - Scheme
        - User name
        - Password
    - StreamSets Data Collector must be accessible via network
- If Control Hub Cloud is enabled
    - Credentials to the Control Hub account
        - User ID
        - Password
- If Control Hub On-Premises is enabled
    - Connection parameters to StreamSets Control Hub
        - Address
        - Port
        - Scheme
    - Credentials to the Control Hub account
        - User ID
        - Password
- StreamSets objects can also be exported manually. In such cases, the appropriate *.json StreamSets export files must be placed in the Automatic Data Lineage CLI temp folder, in the subfolder whose name matches the model system: ${manta.dir.temp}/streamsets/${streamsets.extractor.server}. A best practice is to create a subdirectory with the same name as the StreamSets project. Inside this directory, the user must create two subdirectories (see the directory sketch after this list):
    - /pipelines — this directory should contain all exported pipelines
    - /jobs — this directory should contain all exported jobs
- If you use runtime values in the pipelines and you want to include these values in the Automatic Data Lineage analysis, supply the properties file and other files with the runtime values to ${manta.dir.input}/streamsets/${streamsets.extractor.server} (also shown in the sketch after this list).
    - The runtime values are called with the functions ${runtime:conf(<property name>)} and ${runtime:loadResource(<file name>, <restricted: true | false>)}.
    - The path to the runtime properties is $SDC_CONF/sdc.properties (Automatic Data Lineage expects *.properties), and the path to the SDC runtime resources is $SDC_RESOURCES. For more information about SDC directories, go to Administration → SDC Directories in SDC.
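To make the expected layout concrete, here is a minimal sketch of where the manually exported files and the runtime-value files would go. The resolved paths /opt/adl/temp and /opt/adl/input and the server name sdc-prod are assumptions for illustration only; substitute the actual values of ${manta.dir.temp}, ${manta.dir.input}, and ${streamsets.extractor.server} from your own CLI configuration.

```sh
# Illustrative layout only -- the resolved paths and the server name "sdc-prod" are assumptions.

# Manually exported StreamSets objects go under ${manta.dir.temp}/streamsets/${streamsets.extractor.server}
mkdir -p /opt/adl/temp/streamsets/sdc-prod/pipelines
mkdir -p /opt/adl/temp/streamsets/sdc-prod/jobs
cp pipeline-exports/*.json /opt/adl/temp/streamsets/sdc-prod/pipelines/   # exported pipelines
cp job-exports/*.json      /opt/adl/temp/streamsets/sdc-prod/jobs/        # exported jobs

# Runtime values go under ${manta.dir.input}/streamsets/${streamsets.extractor.server}
mkdir -p /opt/adl/input/streamsets/sdc-prod
cp "$SDC_CONF"/sdc.properties /opt/adl/input/streamsets/sdc-prod/         # runtime properties
cp "$SDC_RESOURCES"/*         /opt/adl/input/streamsets/sdc-prod/         # runtime resources
```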
Supported Features
Here is a list of supported stages. (Default visualization without any data flow analysis is provided for unlisted stages.)

| Sources: Stage name | Supported |
|---|---|
| Directory | |
| Hadoop FS Standalone | |
| JDBC Query Consumer | |
| Kafka Multitopic Consumer | Automatic Data Lineage doesn't extract metadata from Kafka, so the analysis might be incomplete. |
| Salesforce | Automatic Data Lineage doesn't support Salesforce Object Query Language (SOQL). Default SQL analysis is provided. |
| Oracle CDC Client | |
| PostgreSQL CDC Client | |
| Google BigQuery | |

| Destinations: Stage name | Supported |
|---|---|
| Hadoop FS | |
| Hive Metastore | |
| HTTP Client | |
| Kafka Producer | Automatic Data Lineage doesn't extract metadata from Kafka, so the analysis might be incomplete. |
| Local FS | |
| Trash | |
| JDBC Producer | |
| Google BigQuery | |
| Snowflake | |

| Processors: Stage name | Supported |
|---|---|
| Expression Evaluator | |
| Field Hasher | |
| Field Masker | |
| Field Order | |
| Field Remover | |
| Field Renamer | |
| Field Replacer | |
| Field Splitter | |
| Field Type Converter | |
| Hive Metadata | |
| Schema Generator | |
| Stream Selector | |
| Field Pivoter | |
| Data Parser | |

| Executors: Stage name | Supported |
|---|---|
| Shell | Automatic Data Lineage doesn't support script analysis. |

JSP 2.0 Expression Language (EL)
- Evaluation of the EL functions with runtime values (str:trim function included)
    - Runtime parameters (defined in the pipeline parameters)
    - Runtime properties (defined in sdc.properties or in a separate runtime properties file)
    - Runtime resources (files defined in the $SDC_RESOURCES directory, by default /opt/streamsets-datacollector/resources; each file must contain one piece of information to be used when the resource is called)
- Evaluation of the EL functions, where the field path can be stored, to determine the field (column) dependencies in the Expression Evaluator stage
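As an illustration of the supported runtime-value evaluation, the snippet below shows a runtime property and a few EL expressions that reference it. The property name, resource file name, and field path are made-up examples, not values required by Automatic Data Lineage.

```properties
# Illustrative runtime property (in sdc.properties or a separate runtime properties file)
warehouse.schema=SALES_DW

# Illustrative EL expressions in a pipeline (e.g., in an Expression Evaluator stage)
# ${runtime:conf('warehouse.schema')}            -> resolves to SALES_DW
# ${runtime:loadResource('dbSchema.txt', false)} -> resolves to the single value stored in $SDC_RESOURCES/dbSchema.txt
# ${str:trim(record:value('/customer_name'))}    -> the field path /customer_name is tracked as a column dependency
```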
Known Unsupported Features
Automatic Data Lineage does not support the following StreamSets features. This list includes all of the unsupported features that IBM is aware of, but it might not be comprehensive.
- Evaluation of the extended regular expressions (Field Path Expression Syntax) that are used in field definitions
- The following features from StreamSets Control Hub: topologies and fragments
- Evaluation of all JSP 2.0 Expression Language (EL) functions other than the supported features listed above
Extraction of Jobs
Job extraction from StreamSets Control Hub is supported. Pipeline extraction and job extraction can be independently turned on and off using the properties streamsets.extractor.use.pipeline.extraction and streamsets.extractor.use.job.extraction. If streamsets.extractor.service is set to "Data Collector", streamsets.extractor.use.job.extraction must not be set to true.
Each job is built on a pipeline, and job dataflow is typically similar to the underlying pipeline dataflow. However, there can be differences caused by the job configuration, which can affect the values of various node attributes, make the job connect to a different database, and change the names of tables and columns. In such cases, the job dataflow better reflects reality. If the job is extracted, extracting the corresponding pipeline adds no information and slows down the analysis. It is therefore recommended, if possible, to use job extraction and disable pipeline extraction.
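A minimal sketch of that recommendation, using the property names quoted above (any other keys your extractor configuration requires are not shown):

```properties
# Extract jobs from Control Hub and skip the underlying pipelines.
# Note: job extraction is not allowed when streamsets.extractor.service is set to "Data Collector".
streamsets.extractor.use.pipeline.extraction=false
streamsets.extractor.use.job.extraction=true
```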