StreamSets Integration Requirements
The following are the prerequisites necessary for IBM Automatic Data Lineage to connect to this third-party system, which you may choose to do at your sole discretion. Note that while these are usually sufficient to connect to this third-party system, we cannot guarantee the success of the connection or integration since we have no control, liability, or responsibility for third-party products or services, including for their performance.
- StreamSets Data Collector (SDC): version 3.10.1 (other versions greater than 3.0.0 should work but haven't been tested)
- If Control Hub is not enabled
    - Connection parameters to StreamSets Data Collector
        - Address
        - Port
        - Scheme
        - User name
        - Password
    - StreamSets Data Collector must be accessible via network
- If Control Hub Cloud is enabled
    - Credentials to the Control Hub account
        - User ID
        - Password
- If Control Hub On-Premises is enabled
    - Connection parameters to StreamSets Control Hub
        - Address
        - Port
        - Scheme
    - Credentials to the Control Hub account
        - User ID
        - Password
- StreamSets objects can also be exported manually. In such cases, the appropriate *.json StreamSets export files must be placed in the Automatic Data Lineage CLI temp folder, in the subfolder whose name matches the model system: ${manta.dir.temp}/streamsets/${streamsets.extractor.server}. A best practice is to create a subdirectory with the same name as the StreamSets project. Inside this directory, the user must create two subdirectories (see the directory sketch after this list):
    - /pipelines — this directory should contain all exported pipelines
    - /jobs — this directory should contain all exported jobs
- If you use runtime values in the pipelines and you want to include these values in the Automatic Data Lineage analysis, supply the properties file and other files with the runtime values to ${manta.dir.input}/streamsets/${streamsets.extractor.server} (also shown in the sketch after this list).
    - The runtime values are called with the functions ${runtime:conf(<property name>)} and ${runtime:loadResource(<file name>, <restricted: true | false>)}.
    - The path to the runtime properties is $SDC_CONF/sdc.properties (Automatic Data Lineage expects *.properties), and the path to the SDC runtime resources is $SDC_RESOURCES. For more information about SDC directories, go to Administration → SDC Directories in SDC.
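To make the expected layout concrete, here is a minimal sketch of where the manually exported files and the runtime-value files would go. The resolved paths /opt/adl/temp and /opt/adl/input and the server name sdc-prod are assumptions for illustration only; substitute the actual values of ${manta.dir.temp}, ${manta.dir.input}, and ${streamsets.extractor.server} from your own CLI configuration.

```sh
# Illustrative layout only -- the resolved paths and the server name "sdc-prod" are assumptions.

# Manually exported StreamSets objects go under ${manta.dir.temp}/streamsets/${streamsets.extractor.server}
mkdir -p /opt/adl/temp/streamsets/sdc-prod/pipelines
mkdir -p /opt/adl/temp/streamsets/sdc-prod/jobs
cp pipeline-exports/*.json /opt/adl/temp/streamsets/sdc-prod/pipelines/   # exported pipelines
cp job-exports/*.json      /opt/adl/temp/streamsets/sdc-prod/jobs/        # exported jobs

# Runtime values go under ${manta.dir.input}/streamsets/${streamsets.extractor.server}
mkdir -p /opt/adl/input/streamsets/sdc-prod
cp "$SDC_CONF"/sdc.properties /opt/adl/input/streamsets/sdc-prod/         # runtime properties
cp "$SDC_RESOURCES"/*         /opt/adl/input/streamsets/sdc-prod/         # runtime resources
```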
Supported Features
Here is a list of supported stages. (Default visualization without any data flow analysis is provided for unlisted stages.)

| Sources: Stage name | Supported |
|---|---|
| Directory | |
| Hadoop FS Standalone | |
| JDBC Query Consumer | |
| Kafka Multitopic Consumer | Automatic Data Lineage doesn't extract metadata from Kafka, so the analysis might be incomplete. |
| Salesforce | Automatic Data Lineage doesn't support Salesforce Object Query Language (SOQL). Default SQL analysis is provided. |
| Oracle CDC Client | |
| PostgreSQL CDC Client | |
| Google BigQuery | |

| Destinations: Stage name | Supported |
|---|---|
| Hadoop FS | |
| Hive Metastore | |
| HTTP Client | |
| Kafka Producer | Automatic Data Lineage doesn't extract metadata from Kafka, so the analysis might be incomplete. |
| Local FS | |
| Trash | |
| JDBC Producer | |
| Google BigQuery | |
| Snowflake | |

| Processors: Stage name | Supported |
|---|---|
| Expression Evaluator | |
| Field Hasher | |
| Field Masker | |
| Field Order | |
| Field Remover | |
| Field Renamer | |
| Field Replacer | |
| Field Splitter | |
| Field Type Converter | |
| Hive Metadata | |
| Schema Generator | |
| Stream Selector | |
| Field Pivoter | |
| Data Parser | |

| Executors: Stage name | Supported |
|---|---|
| Shell | Automatic Data Lineage doesn't support script analysis. |

JSP 2.0 Expression Language (EL)
- Evaluation of the EL functions with runtime values (str:trim function included)
    - Runtime parameters (defined in the pipeline parameters)
    - Runtime properties (defined in sdc.properties or in a separate runtime properties file)
    - Runtime resources (files defined in the $SDC_RESOURCES directory, by default /opt/streamsets-datacollector/resources; each file must contain one piece of information to be used when the resource is called)
- Evaluation of the EL functions, where the field path can be stored, to determine the field (column) dependencies in the Expression Evaluator stage
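As an illustration of the supported runtime-value evaluation, the snippet below shows a runtime property and a few EL expressions that reference it. The property name, resource file name, and field path are made-up examples, not values required by Automatic Data Lineage.

```properties
# Illustrative runtime property (in sdc.properties or a separate runtime properties file)
warehouse.schema=SALES_DW

# Illustrative EL expressions in a pipeline (e.g., in an Expression Evaluator stage)
# ${runtime:conf('warehouse.schema')}            -> resolves to SALES_DW
# ${runtime:loadResource('dbSchema.txt', false)} -> resolves to the single value stored in $SDC_RESOURCES/dbSchema.txt
# ${str:trim(record:value('/customer_name'))}    -> the field path /customer_name is tracked as a column dependency
```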
Known Unsupported Features
Automatic Data Lineage does not support the following StreamSets features. This list includes all of the unsupported features that IBM is aware of, but it might not be comprehensive.
- Evaluation of the extended regular expressions (Field Path Expression Syntax) that are used in field definitions
- The following features from StreamSets Control Hub: topologies and fragments
- Evaluation of all JSP 2.0 Expression Language (EL) functions other than the supported features listed above
Extraction of Jobs
Job extraction from StreamSets Control Hub is supported. Pipeline extraction and job extraction can be independently turned on and off using the properties streamsets.extractor.use.pipeline.extraction and streamsets.extractor.use.job.extraction. If streamsets.extractor.service is set to "Data Collector", streamsets.extractor.use.job.extraction must not be set to true.
Each job is built on a pipeline, and job dataflow is typically similar to the underlying pipeline dataflow. However, there can be differences caused by the job configuration, which can affect the values of various node attributes, make the job connect to a different database, and change the names of tables and columns. In such cases, the job dataflow better reflects reality. If the job is extracted, extracting the corresponding pipeline adds no information and slows down the analysis. It is therefore recommended, if possible, to use job extraction and disable pipeline extraction.
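A minimal sketch of that recommendation, using the property names quoted above (any other keys your extractor configuration requires are not shown):

```properties
# Extract jobs from Control Hub and skip the underlying pipelines.
# Note: job extraction is not allowed when streamsets.extractor.service is set to "Data Collector".
streamsets.extractor.use.pipeline.extraction=false
streamsets.extractor.use.job.extraction=true
```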