StreamSets Integration Requirements

The following are the prerequisites necessary for IBM Automatic Data Lineage to connect to this third-party system, which you may choose to do at your sole discretion. Note that while these are usually sufficient to connect to this third-party system, we cannot guarantee the success of the connection or integration since we have no control, liability, or responsibility for third-party products or services, including for their performance.

Supported Features

This is a list of the StreamSets features that Automatic Data Lineage supports. The list is a high-level overview, so some supported features are not named explicitly but are covered by the named items. Any other feature that is not listed should generally be considered unsupported.

Here is a list of supported stages. (Default visualization without any data flow analysis is provided for unlisted stages.)

Sources: Stage name Supported Notes
Directory (tick)
Hadoop FS Standalone (tick)
JDBC Query Consumer (tick)
Kafka Multitopic Consumer (warning) Automatic Data Lineage doesn't extract metadata from Kafka, so the analysis might be incomplete.
Salesforce (warning) Automatic Data Lineage doesn’t support Salesforce Object Query Language (SOQL). Default SQL analysis is provided.
Oracle CDC Client (tick)
PostgreSQL CDC Client (tick)
Google BigQuery (tick)

Destinations: Stage name Supported Notes
Hadoop FS (tick)
Hive Metastore (tick)
HTTP Client (tick)
Kafka Producer (warning) Automatic Data Lineage doesn't extract metadata from Kafka, so the analysis might be incomplete.
Local FS (tick)
Trash (tick)
JDBC Producer (tick)
Google BigQuery (tick)
Snowflake (tick)

Processors: Stage name Supported
Expression Evaluator (tick)
Field Hasher (tick)
Field Masker (tick)
Field Order (tick)
Field Remover (tick)
Field Renamer (tick)
Field Replacer (tick)
Field Splitter (tick)
Field Type Converter (tick)
Hive Metadata (tick)
Schema Generator (tick)
Stream Selector (tick)
Field Pivoter (tick)
Data Parser (tick)

Executors: Stage name Supported Notes
Shell (warning) Automatic Data Lineage doesn’t support script analysis.

JSP 2.0 Expression Language (EL)

Known Unsupported Features

Automatic Data Lineage does not support the following StreamSets features. This list includes all of the features that IBM is aware are unsupported, but it might not be comprehensive.

Extraction of Jobs

Job extraction from StreamSets Control Hub is supported. Pipeline extraction and job extraction can be independently turned on and off using the properties streamsets.extractor.use.pipeline.extraction and streamsets.extractor.use.job.extraction. If streamsets.extractor.service is set to “Data Collector”, streamsets.extractor.use.job.extraction must not be set to true.
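The constraint between these properties can be checked before a scan is started. The following is a minimal sketch, not part of the product: the three property names are taken from the text above, while the helper name check_extraction_settings, the "true"/"false" string encoding, and the treatment of a missing property as "false" are assumptions made for illustration.

```python
# Property names come from the text above. Everything else (the helper name,
# the "true"/"false" string encoding, the defaults) is assumed for illustration.
SERVICE_KEY = "streamsets.extractor.service"
JOB_KEY = "streamsets.extractor.use.job.extraction"
PIPELINE_KEY = "streamsets.extractor.use.pipeline.extraction"


def check_extraction_settings(props):
    """Return a list of problems found in a dict of property strings."""
    problems = []
    service = props.get(SERVICE_KEY, "")
    # Assumption: a missing job-extraction property counts as "false".
    job_enabled = props.get(JOB_KEY, "false").strip().lower() == "true"
    if service == "Data Collector" and job_enabled:
        problems.append(
            f"{JOB_KEY} must not be true when {SERVICE_KEY} is 'Data Collector'"
        )
    return problems


if __name__ == "__main__":
    example = {
        SERVICE_KEY: "Data Collector",
        JOB_KEY: "true",
        PIPELINE_KEY: "true",
    }
    for problem in check_extraction_settings(example):
        print(problem)
```

An empty list means the combination does not violate the rule stated above. The check says nothing about whether the values are otherwise valid for a given installation.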

Each job is built on a pipeline, and the job dataflow is typically similar to the dataflow of the underlying pipeline. However, job configuration can introduce differences: it can affect the values of various node attributes, make the job connect to a different database, and change the names of tables and columns. In such cases, the job dataflow better reflects reality. If a job is extracted, extracting the corresponding pipeline adds no information and slows down the analysis. It is therefore recommended, where possible, to use job extraction and disable pipeline extraction.
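The recommendation can be expressed as a small helper that proposes values for the two extraction properties. This is a hedged sketch under stated assumptions: the function name recommended_extraction_flags is hypothetical, the values are assumed to be the strings "true" and "false", and any service value other than “Data Collector” is assumed to allow job extraction.

```python
def recommended_extraction_flags(service):
    """Suggest extraction flags for a given streamsets.extractor.service value.

    Assumption: job extraction is usable whenever the service is not
    "Data Collector"; in that case pipeline extraction is switched off
    because it would add no information and slow down the analysis.
    """
    job_allowed = service != "Data Collector"
    return {
        "streamsets.extractor.use.job.extraction": "true" if job_allowed else "false",
        "streamsets.extractor.use.pipeline.extraction": "false" if job_allowed else "true",
    }


if __name__ == "__main__":
    # Print the suggestion in key=value form for each example service value.
    for service in ("Data Collector", "another service"):
        print(f"# service: {service}")
        for key, value in recommended_extraction_flags(service).items():
            print(f"{key}={value}")
```

When the service is “Data Collector”, the sketch falls back to pipeline extraction only, because job extraction must not be enabled in that case.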