InfoSphere DataStage Scanner Guide
Follow these steps to configure a connection to IBM InfoSphere DataStage (standalone). Note that there is a separate guide for IBM DataStage Next Generation / IBM DataStage on Cloud Pak.
Step 1: Configure the Connection
Create a new connection in the Admin UI (http://localhost:8181/manta-admin-gui/app/index.html?#/platform/connections/) to enable lineage analysis of IBM DataStage jobs by IBM Manta Data Lineage. A best practice is to create a separate connection for each DataStage project, so set the "DataStage server name" to the name of the DataStage project. (The wording of the property can be misleading; it should be read as a project name.) The connection requirements and privileges are listed in DataStage Integration Requirements.
Properties that must be configured:
- Connection information for the DataStage instance: datastage.extractor.server and datastage.edition; the property datastage.edition must be set to Standalone DataStage
Optional properties:
- To control the scope of the lineage analysis, use datastage.value.files, datastage.design.time.analysis.jobs.included, and datastage.design.time.analysis.jobs.excluded to restrict the analysis of design-time lineage.
See DataStage Resource Configuration for the full list and a detailed explanation of the properties that can be configured for the scanner.
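For illustration, a minimal sketch of the two required settings, assuming a hypothetical project named SALES_DW (shown here as key = value pairs; the actual values are entered in the Admin UI connection form):

datastage.extractor.server = SALES_DW
datastage.edition = Standalone DataStage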
Step 2: Provide DataStage Export Files
Manta Data Lineage does not provide an extractor for InfoSphere DataStage, which means that the jobs must be exported and provided to Manta Data Lineage by the Manta Administrator prior to or upon execution of the lineage analysis.
The appropriate *.xml DataStage export files must be placed in the folder specified by the configurable datastage.input.dir property, which by default is set to ${manta.dir.input}/datastage/${datastage.extractor.server}, or provided via Process Manager or the Orchestration API during workflow execution.
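As a sketch, assuming the hypothetical project name SALES_DW and the default datastage.input.dir, the input folder might look like this (the job export file names are illustrative):

${manta.dir.input}/datastage/SALES_DW/
    load_customers.xml
    load_orders.xml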
Step 3: Provide Parameter Files
DataStage offers several types of parameters to ease changes to the configuration properties of jobs, such as DB connections, names of databases and schemas, and parts of (or even the complete) SQL to be used within the job. These are not included in XML job exports and need to be provided separately. DataStage has three types of parameters; see https://www.ibm.com/support/knowledgecenter/SSZJPZ_11.7.0/com.ibm.swg.im.iis.ds.design.doc/topics/c_ddesref_Parameter_Sets.html for more details. The three types of parameters, along with how they can be identified in the DataStage job XML, are as follows. (A combined sketch of all three formats appears after the list.)
- (Regular) parameters have the format #parameter_name#. These are provided as a file with the name DSParams and should be placed in the location specified by the datastage.dsparams.file property of the DataStage scanner. In the transformer stage, parameters are used without the # qualifier; for example, parameter_name.
- Parameter set parameters are in the format #parameter_set.parameter_name#. These are in XML format and should be exported as part of a DataStage job.
- Environment variables $ENV_VARIABLE have the format #$ENV_VARIABLE# or, in the transformer stage, simply $ENV_VARIABLE.
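A hedged sketch of how the three formats might appear in a job design (all names are hypothetical):

#JOB_DATE#                a regular parameter
#CONN_PARAMS.DB_NAME#     a parameter from the parameter set CONN_PARAMS
#$ENV_SOURCE_SCHEMA#      an environment variable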
Project/Environment-Level Parameters
If your jobs use project/environment-level parameters, provide Manta Data Lineage with a DSParams file, which you can get from the project directory or export from the Administrator client in the environment variables window using the Export to File... function. After that, rename the file to DSParams, if necessary, and add it at the path ${manta.dir.input}/datastage/${datastage.extractor.server}/DSParams (creating the folders on the path, if necessary).
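For example, with the hypothetical project name SALES_DW substituted for ${datastage.extractor.server}, the file would end up at:

${manta.dir.input}/datastage/SALES_DW/DSParams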
Parameter Sets
If your jobs use parameter sets (references to parameter sets in jobs look like #PARAMETERSET.PARAMETER#), provide them in the datastageParameterOverride.txt file. (This file is optional and is expected in the path defined by the common property datastage.parameter.override.file.) This file overrides previously loaded parameter values, so you can set them manually or add new ones.
The File Format in Standalone DataStage
[ENVIRONMENT]
PARAM1_NAME = "param1_value"
PARAM2_NAME = "param2_value"
PARAM3_NAME = "param3_value"
[PARAMETER_SET/parameter_set_name]
param4_name = "default_param4_value"
param5_name = "default_param5_value"
$PARAM3_NAME = "$PROJDEF"
[VALUE_FILE/parameter_set_name/value_file1_name]
param4_name = "some_param4_value"
param5_name = "some_param5_value"
$PARAM3_NAME = "some_param3_value"
[VALUE_FILE/parameter_set_name/value_file2_name]
param4_name = "other_param4_value"
param5_name = "other_param5_value"
$PARAM3_NAME = "other_param3_value"
[JOB/job1_name]
param6_name = "param6_value"
param7_name = "param7_value"
[JOB/job2_name]
param7_name = "param8_value"
Four scopes of parameters can be added to the file.
- Project/environment-level parameters — You can add any number of these under the [ENVIRONMENT] heading. Note that the values from the DSParams file can be overridden here.
- Parameter set default parameters — You can add any number of these under the [PARAMETER_SET/parameter_set_name] heading, where parameter_set_name is replaced with the name of the parameter set. You can also refer to default environment parameters, as in the example above referring to the value $PROJDEF.
- Parameter set value file parameters — You can add any number of these under the [VALUE_FILE/parameter_set_name/value_file_name] heading, where parameter_set_name is replaced with the name of the parameter set and value_file_name with the name of the value file. Referring to the value $PROJDEF is also allowed here. Note: Don't forget to enter the names of the value files in the datastage.value.files property.
- Job parameters — You can add any number of these under the [JOB/job_name] heading, where job_name is replaced by the name of the job. Referring to the value $PROJDEF is allowed.
Ensure that you format the file properly. Spaces and tabs are allowed, but each scope definition and parameter entry must be on a separate line.
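For example, if a job references #CONN_PARAMS.DB_NAME# and no value is available from the export, a minimal override entry might look like this (the parameter set, parameter, and value are hypothetical):

[PARAMETER_SET/CONN_PARAMS]
DB_NAME = "SALES_DW_DB"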
Parameter Override Helper File
Manta Data Lineage understands which parameters must be defined by the user (i.e., where there is no default value) and logs any unresolved parameters found during analysis to a "helper" file, datastageParameterOverride.txt, in the location defined by the datastage.parameter.override.helper.file property. The hint that the helper file gives can also be found in the log file and in the Admin UI Log Viewer logs relevant to the current DataStage dataflow analysis. After the analysis, you need to perform the following steps.
- Copy the helper file to the location defined by the datastage.parameter.override.file property. If such a file already exists in the destination folder, you can safely replace it. (All the properties and values that were there are also included in the helper file.) You can also copy the contents of the helper file from the log or the Log Viewer and create the file in the above location. If such a file already exists, you can simply replace its contents.
- Open the file in a text editor and fill in the values of all unresolved parameters; that is, parameters with a blank value (my_param = ""). The parameter value must be enclosed in double quotes; for example, my_param = "my_value". (A short before/after sketch appears after this list.)
- Save and close the file.
- Run the analysis again.
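As an illustration of the first two steps (the job and parameter names are hypothetical), the helper file might contain an unresolved entry such as

[JOB/load_customers]
TGT_SCHEMA = ""

which you would complete, for example, as

[JOB/load_customers]
TGT_SCHEMA = "DWH"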
Operational Metadata
If you want to use operational metadata (OMD files) collected during job runs to resolve any type of parameter in your jobs, put the OMD files in the folder defined by the datastage.omd.files.directory property.
Step 4: Set Memory Allocation
To successfully analyze DataStage jobs, the minimum memory allocation for DatastageDataflowScenario should be 20x the size of the largest export file. You can change the maximum allowed allocated memory for a lineage analysis scenario as described in Configure Runtime and Limitations. For example, for an XML export of 200MB, SCENARIO_LOAD_MEMORY should be set to 4096 (4GB); the default setting is 3GB, so the change is only required for XML exports larger than 150MB.
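As another worked application of the 20x rule (the export size is hypothetical): a 300MB export calls for roughly 300MB x 20 = 6000MB of memory, so SCENARIO_LOAD_MEMORY would be set to 6144 (6GB).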
Optional Additional Steps
Applicable only in certain situations, as described in each case.
Step 5: ODBC Connection Definition Settings
It is very common for InfoSphere DataStage jobs to use ODBC connections to connect to databases. ODBC connections may contain pieces of information critical for successful lineage analysis, such as database and schema names. Definitions of ODBC connections are not included in the DataStage XML exports and need to be specified separately.
- Create or open the file referenced by the datastage.odbc.connection.definition.file property (e.g., <MANTA_DIR_HOME>/input/datastage/${datastage.extractor.server}/connection_definition/odbcConnectionDefinition.ini).
- Follow the instructions in Manually Define a Database Connection, and see How to Convert the odbc.ini File into a Manta Connections File for automation options. (A sample odbc.ini input entry appears after this list.)
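For reference, a typical entry in an odbc.ini file (the input to the conversion article referenced above) looks roughly like this; the DSN name and values are hypothetical, and the exact keys vary by ODBC driver:

[SALES_DSN]
Driver=/usr/lib/libsomedriver.so
Server=dbhost.example.com
Port=1521
Database=SALESDB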
Step 6: Override Component Data Lineage (Optional)
It is possible to specify or override the lineage for components between their inputs and outputs. This is useful in cases where the component lineage is not analyzed or the result of the analysis may be incomplete. A typical case is unsupported components, which by default have their inputs connected to their outputs by name (if there is a match) or everything to everything, which may produce inaccurate lineage. Another good example is programmable components implemented with, for example, C# code that Manta Data Lineage does not analyze. Overriding the default lineage for such components allows admins to make the lineage more accurate.
Note that this feature is not suitable for handling unsupported sources and targets. For example, if there is an issue connecting the DataStage job to the underlying database, this feature is not applicable as it overrides the lineage within a component between its input and output ports.
To override component lineage, create a file as specified in the common property datastage.manual.mapping.file. The format of the CSV is:
"Full path to Stage";"Input Link name";"Input Column name";"Output Link name";"Output Column name";"Edge Type (DIRECT | FILTER)";"Description (optional)"
"manual_mapping_job/Generic_3";"DSLink2";"a";"DSLink5";"b";"DIRECT";""
The path to the stage is in the format Job/[Shared and Local containers optional]/Stage.
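Purely as an illustration (the job, container, stage, link, and column names are hypothetical), an additional row mapping a column through a shared container with a FILTER edge might look like this:

"etl_job/SharedContainer1/Generic_1";"DSLink1";"cust_id";"DSLink9";"customer_key";"FILTER";"indirect lineage via a condition"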