IBM DataStage for Cloud Pak for Data lineage configuration
To import lineage metadata from IBM DataStage for Cloud Pak for Data, create a data source definition, a connection to the data source, and a metadata import job.
To import lineage metadata for IBM DataStage for Cloud Pak for Data, complete these steps:
- Create a data source definition.
- Create a connection to the data source in a project.
- Create a metadata import.
Creating a data source definition
Create a data source definition. Select IBM DataStage for Cloud Pak for Data as the data source type.
Creating a connection to DataStage for Cloud Pak for Data
Create a connection to the data source in a project. For connection details, see DataStage for Cloud Pak for Data connection.
- To connect to DataStage for Cloud Pak for Data, provide a username and a password.
- To connect to DataStage for Cloud Pak for Data as a Service, provide the API key. If you do not have one, from the navigation menu go to Administration > Access (IAM) > Manage identities > API keys and create a new API key. Use the created API key value in the connection details.
- Specify a certificate if your DataStage for Cloud Pak for Data instance is not on the same cluster as the project where you want to create a metadata import job.
Creating a metadata import
Create a metadata import. Learn more about options that are specific to the DataStage for Cloud Pak for Data data source:
Include and exclude lists
You can include or exclude assets up to the flow level. Provide projects and flows in the format project/flow. Each part is evaluated as a regular expression. Assets that are added to the data source later are also included or excluded if they match the conditions specified in the lists. Example values:
myProject/
- All flows in the myProject project.
myProject3/myFlow1
- The myFlow1 flow from the myProject3 project.
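Because each part of an entry is evaluated as a regular expression, you might want to test your patterns before you run the import. The following Python sketch illustrates the matching behavior described above, under the assumption that each part must fully match; the helper function and the sample asset names are illustrative only, not part of the product:

import re

# Hypothetical include list; entries use the project/flow format described above.
include_patterns = ["myProject/", "myProject3/myFlow1"]

def is_included(asset: str, patterns: list[str]) -> bool:
    """Return True if a project/flow asset matches any pattern (each part is a regex)."""
    project, _, flow = asset.partition("/")
    for pattern in patterns:
        p_project, _, p_flow = pattern.partition("/")
        # An empty flow part, as in "myProject/", matches every flow in the project.
        if re.fullmatch(p_project, project) and (not p_flow or re.fullmatch(p_flow, flow)):
            return True
    return False

for asset in ["myProject/flowA", "myProject3/myFlow1", "otherProject/flowB"]:
    print(asset, is_included(asset, include_patterns))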
External inputs
Optionally, you can provide external input in the form of a .zip file. You add this file in the Add inputs from file field. You can add the external input in addition to the defined scope of extracted data, or you can import data from the external input only. To add an external input, complete these steps:
- Prepare a .zip file as an external input.
- Upload the .zip file to the project.
- Configure the import to use only the external input.
Prepare a .zip file as an external input
You can provide DataStage flows as external inputs in a .zip file. The .zip file must have the following structure:
<project_export.zip>
- A DataStage project exported to a .zip file.
DSParams
- A file that contains the project- or environment-level parameters if applicable. You can get this file from the project directory.
datastageParameterOverride.txt
- A file with parameter-set overrides if your jobs use parameter sets.
connection_definition/odbcConnectionDefinition.ini
- A file with connection definitions for ODBC connections. Definitions of ODBC connections are not included in the DataStage XML exports and must be specified separately.
datastageComponentOverrides.csv
- A file with component-lineage overrides.
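As an illustration only, the following Python sketch packages the files listed above into a single external input .zip. The output file name is a placeholder, and you include only the optional files that apply to your project:

import zipfile
from pathlib import Path

# Files and folders described above; include only the ones that apply to your project.
inputs = [
    "project_export.zip",
    "DSParams",
    "datastageParameterOverride.txt",
    "connection_definition/odbcConnectionDefinition.ini",
    "datastageComponentOverrides.csv",
]

with zipfile.ZipFile("external_input.zip", "w", zipfile.ZIP_DEFLATED) as archive:
    for name in inputs:
        if Path(name).exists():
            # Keep the relative path inside the archive, e.g. connection_definition/...
            archive.write(name, arcname=name)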
The format of the .zip file with the exported DataStage project
When you export a DataStage project, it must have the following structure:
assets
- required folder.
.METADATA
- required folder.
data_intg_flow.*.json
- required files that contain information about flows.
connection.*.json
- optional files that contain information about connections.
parameter_set.*.json
- optional files that contain information about parameter sets.
job.*.json
- optional files that contain information about jobs.
job_run.*.json
- optional files that contain information about particular executions of the job.
data_intg_flow
- required folder.
- At least one file that contains the string "schemas":[{, but does not end in px_executables.
assettypes
- required folder.
project.json
- required file. There might be multiple instances of this file as a result of ZIP decompression, which is correct.
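If you assemble the export yourself, a rough structural check can catch missing pieces before the import runs. The following Python sketch only verifies that the items described above appear somewhere in the archive; the exact nesting in a real export can differ, so treat it as an approximation rather than a definitive validator:

import fnmatch
import zipfile

def check_project_export(path: str) -> list[str]:
    """Rough check that the exported project .zip contains the items described above."""
    problems = []
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
        for required in ("assets", ".METADATA", "data_intg_flow", "assettypes"):
            if not any(required in n.split("/") for n in names):
                problems.append(f"missing required folder: {required}")
        if not any(fnmatch.fnmatch(n.rsplit("/", 1)[-1], "data_intg_flow.*.json") for n in names):
            problems.append("no data_intg_flow.*.json files found")
        if not any(n.rsplit("/", 1)[-1] == "project.json" for n in names):
            problems.append("missing project.json")
        # At least one file in data_intg_flow must contain "schemas":[{
        # and must not end in px_executables.
        flow_ok = any(
            "data_intg_flow" in n.split("/")
            and not n.endswith("px_executables")
            and not n.endswith("/")
            and b'"schemas":[{' in archive.read(n)
            for n in names
        )
        if not flow_ok:
            problems.append('no flow file containing "schemas":[{ found')
    return problems

print(check_project_export("project_export.zip"))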
The datastageParameterOverride.txt file format
The datastageParameterOverride.txt file has the following content:
[ENVIRONMENT]
PARAM1_NAME = "param1_value"
PARAM2_NAME = "param2_value"
PARAM3_NAME = "param3_value"
[PARAMETER_SET/parameter_set_name]
param4_name = "default_param4_value"
param5_name = "default_param5_value"
$PARAM3_NAME = "$PROJDEF"
[VALUE_FILE/parameter_set_name/value_file1_name]
param4_name = "some_param4_value"
param5_name = "some_param5_value"
$PARAM3_NAME = "some_param3_value"
[VALUE_FILE/parameter_set_name/value_file2_name]
param4_name = "other_param4_value"
param5_name = "other_param5_value"
$PARAM3_NAME = "other_param3_value"
[JOB/job1_name]
param6_name = "param6_value"
param7_name = "param7_value"
[JOB/job2_name]
param7_name = "param8_value"
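If you keep the override values elsewhere, a simple writer such as the following Python sketch can produce the layout shown above. The section and parameter names are placeholders taken from the example:

# Minimal sketch: write datastageParameterOverride.txt from a dictionary of sections.
overrides = {
    "ENVIRONMENT": {"PARAM1_NAME": "param1_value"},
    "PARAMETER_SET/parameter_set_name": {"param4_name": "default_param4_value"},
    "VALUE_FILE/parameter_set_name/value_file1_name": {"param4_name": "some_param4_value"},
    "JOB/job1_name": {"param6_name": "param6_value"},
}

with open("datastageParameterOverride.txt", "w") as f:
    for section, params in overrides.items():
        f.write(f"[{section}]\n")
        for name, value in params.items():
            f.write(f'{name} = "{value}"\n')
        f.write("\n")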
The connection_definition/odbcConnectionDefinition.ini file format
The connection_definition/odbcConnectionDefinition.ini file has the following content. Create a separate [Shortcut_Name] section for each connection.
[<Shortcut_Name>]
Type=<connection_type>
Connection_String=<connection_string>
Server_Name=<server_name>
Database_Name=<database_name>
Schema_Name=<schema_name>
User_Name=<user_name>
- Shortcut_Name: The name of the connection or data server that is used by the data integration tool.
- connection_type: The type of data source.
- connection_string: A JDBC connection string or any identification of the database such as the system ID (SID) or the host name.
- server_name: The value depends on the type of data source:
- Db2, Microsoft SQL Server, Netezza Performance Server, SAP ASE (formerly Sybase), or Teradata: The server name.
- FTP: The hostname.
- Oracle and other databases: The value is ignored.
- database_name: The value depends on the type of data source:
- Oracle: The global database name.
- Db2, Microsoft SQL Server, Netezza Performance Server, SAP ASE (formerly Sybase), Teradata, and other databases: The name of the default database.
- user_name: The name of the user that logs in to the database.
Add a new line at the end of the parameters for each section.
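If you generate this file programmatically, Python's configparser can produce the expected layout, including the blank line after each section. This is a sketch with placeholder connection values; adjust the type and connection details for your data source:

import configparser
from pathlib import Path

config = configparser.ConfigParser()
config.optionxform = str  # keep key names such as Type and Server_Name capitalized

# Placeholder values; replace them with the details of your ODBC connection.
config["MyOdbcConnection"] = {
    "Type": "DB2",
    "Connection_String": "jdbc:db2://dbserver.example.com:50000/MYDB",
    "Server_Name": "dbserver.example.com",
    "Database_Name": "MYDB",
    "Schema_Name": "MYSCHEMA",
    "User_Name": "db2user",
}

Path("connection_definition").mkdir(exist_ok=True)
with open("connection_definition/odbcConnectionDefinition.ini", "w") as f:
    # configparser ends each section with a blank line, which provides the
    # required new line after the parameters.
    config.write(f, space_around_delimiters=False)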
The datastageComponentOverrides.csv file format
The datastageComponentOverrides.csv file has the following content:
"Full path to Stage";"Input Link name";"Input Column name";"Output Link name";"Output Column name";"Edge Type (DIRECT | FILTER)";"Description (optional)"
"manual_mapping_job/Generic_3";"DSLink2";"a";"DSLink5";"b";"DIRECT";""
The path to the stage is in the format Job/[Shared and Local containers optional]/Stage.
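As a sketch, the same file can be produced with Python's csv module by using the semicolon delimiter and full quoting shown above; the data row is taken from the example:

import csv

header = [
    "Full path to Stage", "Input Link name", "Input Column name",
    "Output Link name", "Output Column name",
    "Edge Type (DIRECT | FILTER)", "Description (optional)",
]
rows = [
    ["manual_mapping_job/Generic_3", "DSLink2", "a", "DSLink5", "b", "DIRECT", ""],
]

with open("datastageComponentOverrides.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=";", quoting=csv.QUOTE_ALL)
    writer.writerow(header)
    writer.writerows(rows)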
Upload the .zip file to the project
To use the .zip file in the metadata import, you must add it to the project where you create the metadata import.
- In the project, click Import assets.
- In the Local file section, click Data asset.
- Add the .zip file with the DataStage project.
When you create the metadata import, you will be able to select this file in the Add inputs from file step.
Configure the import to use only the external input
If you want to import metadata only from the provided external input, and not directly from the connected DataStage for Cloud Pak for Data instance, complete these steps:
- Add the .zip file in the Add inputs from file section and click Next.
- Expand the Lineage import phases list, and disable the Transformations extraction phase.
Advanced import options
- Analyze job runs
- Specifies whether job runs are analyzed.
- Analyze job runs since
- Specifies the date after which runs are analyzed. If the value is empty, all runs are analyzed. Example value: 1970/01/01 00:00:00.000.
- Analyze jobs separately
- Specifies whether to analyze jobs separately, even when other runs are associated with them.
- Analyze flows without jobs
- Specifies whether flows without jobs are analyzed.
- Oracle proxy user authentication
- You can use Oracle proxy user authentication. Set the value to true to change Oracle usernames in "USERNAME[SCHEMA_OWNER]" format to "SCHEMA_OWNER" format. In other cases, set the value to false.
- Value files
- Specify the names of value files to use in Parameter Sets in order of priority. For example, DEV1,TEST,PROD.
Parent topic: Supported connectors for lineage import