InfoSphere DataStage Scanner Guide
Follow these steps to configure a connection to IBM InfoSphere DataStage (standalone). Note that there is a separate guide for IBM DataStage Next Generation / IBM DataStage on Cloud Pak.
Step 1: Configure the Connection
Create a new connection in the Admin UI (http://localhost:8181/manta-admin-gui/app/index.html?#/platform/connections/) to enable lineage analysis of IBM DataStage jobs by IBM Manta Data Lineage. A best practice is to create a separate connection for each DataStage project, so set the "DataStage server name" to the name of the DataStage project. (The wording of the property can be misleading; it should be read as a project name.) The connection requirements and privileges are listed in DataStage Integration Requirements.
Properties that must be configured:
- Connection information for the DataStage instance: datastage.extractor.server and datastage.edition; the property datastage.edition must be set to Standalone DataStage
Optional properties:
- To control the scope of the lineage analysis, use datastage.value.files, datastage.design.time.analysis.jobs.included, and datastage.design.time.analysis.jobs.excluded to restrict the analysis of design-time lineage.
See DataStage Resource Configuration for the full list and a detailed explanation of the properties that can be configured for the scanner.
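For illustration, a minimal sketch of the two required settings, assuming a hypothetical project named SALES_DW (shown here as key = value pairs; the actual values are entered in the Admin UI connection form):

datastage.extractor.server = SALES_DW
datastage.edition = Standalone DataStage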
Step 2: Provide DataStage Export Files
Manta Data Lineage does not provide an extractor for InfoSphere DataStage, which means that the jobs must be exported and provided to Manta Data Lineage by the Manta Administrator prior to or upon execution of the lineage analysis.
The appropriate *.xml DataStage export files must be placed in the folder specified by the configurable datastage.input.dir property, which by default is set to ${manta.dir.input}/datastage/${datastage.extractor.server}, or provided via Process Manager or the Orchestration API during workflow execution.
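As a sketch, assuming the hypothetical project name SALES_DW and the default datastage.input.dir, the input folder might look like this (the job export file names are illustrative):

${manta.dir.input}/datastage/SALES_DW/
    load_customers.xml
    load_orders.xml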
Step 3: Provide Parameter Files
DataStage offers several types of parameters to ease changes to the configuration properties of jobs, such as DB connections, names of databases and schemas, and parts of (or even the complete) SQL to be used within the job. These are not included in XML job exports and need to be provided separately. DataStage has three types of parameters; see https://www.ibm.com/support/knowledgecenter/SSZJPZ_11.7.0/com.ibm.swg.im.iis.ds.design.doc/topics/c_ddesref_Parameter_Sets.html for more details. The three types of parameters, along with how they can be identified in the DataStage job XML, are as follows. (A combined sketch of all three formats appears after the list.)
- (Regular) parameters have the format #parameter_name#. These are provided as a file with the name DSParams and should be placed in the location specified by the datastage.dsparams.file property of the DataStage scanner. In the transformer stage, parameters are used without the # qualifier; for example, parameter_name.
- Parameter set parameters are in the format #parameter_set.parameter_name#. These are in XML format and should be exported as part of a DataStage job.
- Environment variables $ENV_VARIABLE have the format #$ENV_VARIABLE# or, in the transformer stage, simply $ENV_VARIABLE.
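A hedged sketch of how the three formats might appear in a job design (all names are hypothetical):

#JOB_DATE#                a regular parameter
#CONN_PARAMS.DB_NAME#     a parameter from the parameter set CONN_PARAMS
#$ENV_SOURCE_SCHEMA#      an environment variable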
Project/Environment-Level Parameters
If your jobs use project/environment-level parameters, provide Manta Data Lineage with a DSParams file, which you can get from the project directory or export from the Administrator client in the environment variables window using the Export to File... function. After that, rename the file to DSParams, if necessary, and add it at the path ${manta.dir.input}/datastage/${datastage.extractor.server}/DSParams (creating the folders on the path, if necessary).
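For example, with the hypothetical project name SALES_DW substituted for ${datastage.extractor.server}, the file would end up at:

${manta.dir.input}/datastage/SALES_DW/DSParams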
Parameter Sets
If your jobs use parameter sets (references to parameter sets in jobs look like #PARAMETERSET.PARAMETER#), provide them in the datastageParameterOverride.txt file. (This file is optional and is expected in the path defined by the common property datastage.parameter.override.file.) This file overrides previously loaded parameter values, so you can set them manually or add new ones.
The File Format in Standalone DataStage
[ENVIRONMENT]
PARAM1_NAME = "param1_value"
PARAM2_NAME = "param2_value"
PARAM3_NAME = "param3_value"
[PARAMETER_SET/parameter_set_name]
param4_name = "default_param4_value"
param5_name = "default_param5_value"
$PARAM3_NAME = "$PROJDEF"
[VALUE_FILE/parameter_set_name/value_file1_name]
param4_name = "some_param4_value"
param5_name = "some_param5_value"
$PARAM3_NAME = "some_param3_value"
[VALUE_FILE/parameter_set_name/value_file2_name]
param4_name = "other_param4_value"
param5_name = "other_param5_value"
$PARAM3_NAME = "other_param3_value"
[JOB/job1_name]
param6_name = "param6_value"
param7_name = "param7_value"
[JOB/job2_name]
param7_name = "param8_value"
Four scopes of parameters can be added to the file.
- Project/environment-level parameters — You can add any number of these under the [ENVIRONMENT] heading. Note that the values from the DSParams file can be overridden here.
- Parameter set default parameters — You can add any number of these under the [PARAMETER_SET/parameter_set_name] heading, where parameter_set_name is replaced with the name of the parameter set. You can also refer to default environment parameters, as in the example above referring to the value $PROJDEF.
- Parameter set value file parameters — You can add any number of these under the [VALUE_FILE/parameter_set_name/value_file_name] heading, where parameter_set_name is replaced with the name of the parameter set and value_file_name with the name of the value file. Referring to the value $PROJDEF is also allowed here. Note: Don't forget to enter the names of the value files in the datastage.value.files property.
- Job parameters — You can add any number of these under the [JOB/job_name] heading, where job_name is replaced by the name of the job. Referring to the value $PROJDEF is allowed.
Ensure that you format the file properly. Spaces and tabs are allowed, but each scope definition and parameter entry must be on a separate line.
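For example, if a job references #CONN_PARAMS.DB_NAME# and no value is available from the export, a minimal override entry might look like this (the parameter set, parameter, and value are hypothetical):

[PARAMETER_SET/CONN_PARAMS]
DB_NAME = "SALES_DW_DB"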
Parameter Override Helper File
Manta Data Lineage understands which parameters must be defined by the user (i.e., where there is no default value) and logs any unresolved parameters found during analysis to a "helper" file, datastageParameterOverride.txt, in the location defined by the datastage.parameter.override.helper.file property. The hint that the helper file gives can also be found in the log file and in the Admin UI Log Viewer logs relevant to the current DataStage dataflow analysis. After the analysis, you need to perform the following steps.
- Copy the helper file to the location defined by the datastage.parameter.override.file property. If such a file already exists in the destination folder, you can safely replace it. (All the properties and values that were there are also included in the helper file.) You can also copy the contents of the helper file from the log or the Log Viewer and create the file in the above location. If such a file already exists, you can simply replace its contents.
- Open the file in a text editor and fill in the values of all unresolved parameters; that is, parameters with a blank value (my_param = ""). The parameter value must be enclosed in double quotes; for example, my_param = "my_value". (A short before/after sketch appears after this list.)
- Save and close the file.
- Run the analysis again.
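As an illustration of the first two steps (the job and parameter names are hypothetical), the helper file might contain an unresolved entry such as

[JOB/load_customers]
TGT_SCHEMA = ""

which you would complete, for example, as

[JOB/load_customers]
TGT_SCHEMA = "DWH"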
Operational Metadata
If you want to use operational metadata (OMD files) collected during job runs to resolve any type of parameter in your jobs, put the OMD files in the folder defined by the datastage.omd.files.directory property.
Step 4: Set Memory Allocation
To successfully analyze DataStage jobs, the minimum memory allocation for DatastageDataflowScenario should be 20x the size of the largest export file. You can change the maximum allowed allocated memory for a lineage analysis scenario as described in Configure Runtime and Limitations. For example, for an XML export of 200MB, SCENARIO_LOAD_MEMORY should be set to 4096 (4GB); the default setting is 3GB, so the change is only required for XML exports larger than 150MB.
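As another worked application of the 20x rule (the export size is hypothetical): a 300MB export calls for roughly 300MB x 20 = 6000MB of memory, so SCENARIO_LOAD_MEMORY would be set to 6144 (6GB).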
Optional Additional Steps
Applicable only in certain situations, as described in each case.
Step 5: ODBC Connection Definition Settings
It is very common for InfoSphere DataStage jobs to use ODBC connections to connect to databases. ODBC connections may contain pieces of information critical for successful lineage analysis, such as database and schema names. Definitions of ODBC connections are not included in the DataStage XML exports and need to be specified separately.
- Create or open the file referenced by the datastage.odbc.connection.definition.file property (e.g., <MANTA_DIR_HOME>/input/datastage/${datastage.extractor.server}/connection_definition/odbcConnectionDefinition.ini).
- Follow the instructions in Manually Define a Database Connection, and see How to Convert the odbc.ini File into a Manta Connections File for automation options. (A sample odbc.ini input entry appears after this list.)
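For reference, a typical entry in an odbc.ini file (the input to the conversion article referenced above) looks roughly like this; the DSN name and values are hypothetical, and the exact keys vary by ODBC driver:

[SALES_DSN]
Driver=/usr/lib/libsomedriver.so
Server=dbhost.example.com
Port=1521
Database=SALESDB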
Step 6: Override Component Data Lineage (Optional)
It is possible to specify or override the lineage for components between their inputs and outputs. This is useful in cases where the component lineage is not analyzed or the result of the analysis may be incomplete. A typical case is unsupported components, which by default have their inputs connected to their outputs by name (if there is a match) or everything to everything, which may produce inaccurate lineage. Another good example is programmable components implemented with, for example, C# code that Manta Data Lineage does not analyze. Overriding the default lineage for such components allows admins to make the lineage more accurate.
Note that this feature is not suitable for handling unsupported sources and targets. For example, if there is an issue connecting the DataStage job to the underlying database, this feature is not applicable as it overrides the lineage within a component between its input and output ports.
To override component lineage, create a file as specified in the common property datastage.manual.mapping.file. The format of the CSV is:
"Full path to Stage";"Input Link name";"Input Column name";"Output Link name";"Output Column name";"Edge Type (DIRECT | FILTER)";"Description (optional)"
"manual_mapping_job/Generic_3";"DSLink2";"a";"DSLink5";"b";"DIRECT";""
The path to the stage is in the format Job/[Shared and Local containers optional]/Stage.
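Purely as an illustration (the job, container, stage, link, and column names are hypothetical), an additional row mapping a column through a shared container with a FILTER edge might look like this:

"etl_job/SharedContainer1/Generic_1";"DSLink1";"cust_id";"DSLink9";"customer_key";"FILTER";"indirect lineage via a condition"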