Migrating DataStage jobs

You migrate DataStage jobs by using ISX files that contain the job information, and then complete post-migration tasks where applicable.

Procedure

Creating and exporting an ISX file

Create and export an ISX file by using one of the methods that are listed in the following table:
Option | Instructions
ISTOOL | Use ISTOOL to create an ISX file and export the file. For instructions, see Export command for InfoSphere DataStage and QualityStage assets and How to use ISTOOL for EXPORT IMPORT Information Server Components.
MettleCI | Use MettleCI, a third-party service, to convert a server job design into an equivalent parallel job design, then create an ISX file and export the file to your system. For more information, see the MettleCI documentation.
InfoSphere Information Server Manager GUI client | Use the Information Server Manager GUI client to export the ISX file. For detailed instructions, see Exporting assets.
Note: Make sure that the ISX file export includes any dependencies, such as parameter sets and table definitions.
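
For reference, an ISTOOL export that is run from an Information Server client or engine tier might look like the following sketch. The host names, credentials, project path, archive name, and the dependency option are placeholders and assumptions; check the Export command reference that is linked above for the exact syntax that your Information Server version supports.
istool export -domain services-host:9443 -username isadmin -password <password> -archive /tmp/MyProject.isx -datastage ' -incdep "engine-host/MyProject/*/*.*" '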

Importing the ISX file

Complete the following steps to import the ISX file:
  1. Open an existing project or create a new one.
  2. From the Assets tab of the project, click New asset > Graphical builders > DataStage.
  3. Click the Local file tab and upload the ISX file from your local computer. Then, click Create.
    Note: The ISX file must exist on your desktop or network drive. Do not drag the file as an attachment from another application.
Asset import report
After the import process is complete, the DataStage asset import report is available. You can review this report in your browser, download the report as a .csv file, or obtain the report by using the command-line interface (CLI).
Browser
To view the report in your browser:
  1. Click the notifications (bell) icon.
  2. From the list of notifications, find the ISX file that you imported and click View import summary.
To download the report as a CSV file:
  1. Open the report in your browser.
  2. Click Download as CSV.
CLI
To download the report by using the Cloud Pak for Data CLI (cpdctl), use the dsjob migrate command. Use the --status command option to retrieve the status report. See the following example:
cpdctl dsjob migrate {--project PROJECT | --project-id PROJID} [--on-failure ONFAILURE] [--conflict-resolution CONFLICT-RESOLUTION] [--attachment-type ATTACHMENT-TYPE] [--import-only] [--create-missing] [--file-name FILENAME] [--status IMPORT-ID --format csv/json] [--stop IMPORT-ID]
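
For example, the following sketch starts a migration and then retrieves the import report for that migration as a CSV file. The project name, file name, and import ID are placeholders, and the sketch assumes that you take the import ID from the output of the first command:
cpdctl dsjob migrate --project MyMigrationProject --file-name /tmp/MyProject.isx
cpdctl dsjob migrate --project MyMigrationProject --status <import-id> --format csv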
Asset import report: status information
The import report contains status information about the assets that are included in the ISX file:
Imported
The total number of assets that were imported. These assets were created; however, some of them, especially DataStage flows, might have compilation errors that need attention.
Failed
The total number of assets that were not imported. Each failed import is associated with one or more errors. The failed count does not include compilation failures, which are counted under the "Compilation failed" status.
Compilation failed
The compilation process failed. This is frequently caused by missing parameter sets or other assets that are referenced in the DataStage flow.
Currently unsupported
The asset is not supported at the time of the import, but might be supported in a future release.
Deprecated
The asset is deprecated in the modern version of DataStage.
Dependency missing
Some assets that are referenced by a DataStage flow or pipeline do not exist in the project, or were not imported earlier. A missing dependency is likely to contribute to a compilation error. A DataStage flow might have multiple missing dependency objects. A missing dependency might appear in multiple DataStage flows or pipelines.
Common messages during the import process
During the process of importing jobs, you might encounter the following common messages:
Parallel job imported with warnings. The import created the following DataStage flow asset but could not create its associated job.
An internal error in the flow or pipeline job creation step can prevent the flow or pipeline job from being created. An error message with more detailed information is included in the import status report.
Unsupported stage type
Some traditional DataStage stage types (or stage types with certain features) might not be supported in the modern version of DataStage. An “unsupported stage type” error is given for these stage types.
Unsupported secure gateway
The secure gateway feature is only available in the Cloud Pak for Data as a service (cloud) environment. For Cloud Pak for Data, an “Unsupported secure gateway” warning is given if the flow contains secure gateway parameters.
Connection parameters replaced with defaults
When a required connection property is not specified in the flow, it is replaced by a job parameter or a parameter set parameter. A warning is given when the substitution happens in a flow import.
Deprecated (server jobs)
Server jobs are deprecated and not supported in the modern version of DataStage. In the import status report, server jobs are included in the asset list and marked with “deprecated.”
Empty Stage List
When a flow or container contains no stages, an empty stage list warning is given.
Unbound Reference
A reference to an activity variable (for pipelines only) can be made only to an activity that was defined in the execution path before the activity node that contains the reference; otherwise, the reference is unbound. For each unbound reference, DataStage creates an additional user variable so that the import can succeed and functionality is not impacted. When this happens, an unbound reference warning is given to indicate that user variables were created automatically.
Assets that are created from parallel job and sequence job imports
The following table lists the assets that are created when parallel jobs and sequence jobs are imported.
Asset type | Type label | Assets created on import
Parallel job (pjb) | px_job | DataStage flow + job + connection (from connection properties that are embedded in the job definition)
Sequence job | sequence_job | Watson pipeline flow + job
Connection (dcn) | connection | Connection (named)
Parameter set | parameter_set | Parameter set
Table definition | xx | Data definition

Reloading JAR files for connections that use JDBC JAR files

If you have connections that use JDBC JAR files, you must reload the JAR files to IBM Cloud Pak for Data.
To upload the JAR files, complete the following steps:
  1. Locate a JDBC connection in a project, then open it.
  2. Click Upload new file to upload JDBC driver JAR files. See Importing JDBC drivers for the procedure and required permissions to upload a JAR file to IBM Cloud Pak for Data.

Next, edit the Generic JDBC connection to use the JAR files. Go to Add to project > connection. Select the Generic JDBC connection. For more information, see Generic JDBC connection.

Migrating JDBC connections

After you migrate a DataStage job that has a JDBC connection, you must configure the connection to ensure the job works correctly.

Typically, the JDBC connection in the traditional version of DataStage contains the following properties:
  • URL
  • Username
  • Password
  • Attributes
  • JDBC driver classpath
  • JDBC driver classname

After migration, the flow in DataStage will have the use_datastage property set to "true."

After you migrate the job, complete the following steps:
  1. Upload the required JDBC drivers to the Cloud Pak for Data cluster.
  2. Check the import summary for any instructions that you need to follow. If the original flow did not have the JDBC Driver classname property, then during the import process a job parameter is created so you can enter the driver classname.
    See the following example of an import summary. The warnings section describes a parameter that needs to be updated.
    {
        "id": "010c11d2-03d4-4a1d-af02-db5d2de43cd1",
        "name": "jdbc_oracle_peek",
        "original_name": "jdbc_oracle_peek",
        "type": "px_job",
        "job_type": "px_job",
        "job_id": "1a90fadb-53e0-4d05-a080-856f5e804e5c",
        "job_name": "jdbc_oracle_peek.DataStage job",
        "status": "completed",
        "conflict_resolution_status": "flow_replacement_succeeded",
        "end_time": "2023-02-14T18:47:48.071Z",
        "warnings": [
            {
                "type": "default_connection_parameter",
                "name": "JDBC_Connector_0",
                "description": "Parameter values could not be resolved and default values have been added for parameters [jdbc_driver] in JDBC_Connector_0. Resolution: replace default values with actual values."
            }
        ]
    }
    

    Provide the classname as the default value for this job parameter or provide the classname in the properties panel.
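
    For instance, for an Oracle source behind the Generic JDBC connection, the jdbc_driver parameter is typically set to the Oracle thin driver classname, oracle.jdbc.OracleDriver. The following sketch assumes that your version of cpdctl dsjob run supports the --param option for job parameters; otherwise, set the default value in the properties panel:
    cpdctl dsjob run --project MyMigrationProject --job "jdbc_oracle_peek.DataStage job" --param jdbc_driver=oracle.jdbc.OracleDriver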

Specifying hostname and port number information

In jobs in the traditional version of DataStage, data connection information might be specified in the form of a parameter reference.

When you import one of these jobs into the modern version of DataStage, the connection information is not recognized, and the import process produces warnings. To address the warnings, manually specify the information in the DataStage job settings. Then, recompile the job.

The following table shows the connections that might have this issue, the default values of the parameters, and the fields in traditional DataStage that the parameter information is derived from.

Note: Some traditional connection names map to multiple modern connection names and some modern connection names map to multiple traditional connection names.
Table 1.
Connection name in traditional DataStage | Connection name in modern DataStage | Field that the original value was derived from in traditional DataStage | Parameter that receives the parameter reference | Default value | Notes
Sybase Enterprise | SAP ASE | Server | host | port = 5000 |
Sybase OC | SAP ASE | Server | host | port = 5000 |
Db2® | Db2 (optimized) | Instance | advanced.host | advanced.port = 50000 |
Db2 Enterprise | Db2 (optimized) | Server | advanced.host | advanced.port = 50000 |
Oracle | Oracle (optimized) | Server | oracle_service_name | oracle_db_port = 1522 |
Oracle Enterprise | Oracle (optimized) | Server | oracle_service_name | oracle_db_port = 1522 |
Hive | Apache Hive | Server | host | port = 10000; database = default |
Netezza® Enterprise | Netezza (optimized) | Database | database | hostname = <hostname>; port = 5480 |
FTP Enterprise | FTP | URI | host | connection_mode = SSH; port = 22 | The file_name property is also set to a default value of <file_name> in the associated flow stage. In addition to manually updating the connection properties and parameter value, you must also change this usage property within the stage, then compile the job.
Azure Storage Connector | Microsoft Azure File Storage | accountName, accountKey | N/A | https;AccountName=accountName;AccountKey=accountKey | If either accountName or accountKey is a parameter reference, the connection string is replaced with a parameter reference for that field.

Updating job settings for missing properties

When you import certain jobs from the traditional version of DataStage into the modern version, some properties are not specified in the original job and you might need to specify them manually. DataStage automatically parameterizes some properties as job parameters. For example, host is parameterized to #ConnProp_[connection type]_host#.

You can also set the --create-connection-paramsets flag on the cpdctl dsjob migrate command to parameterize connection details as part of a parameter set asset instead of as job parameters. For example, host is parameterized to #ConnParamSet_[connection type].host#. This option allows you to set parameter values at the project level by modifying the parameter set asset instead of specifying the value for each individual job. For more information on the CLI command, see DataStage command-line tools.
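
For example, a hedged invocation that creates parameter sets for connection details during migration might look like the following; the project and file names are placeholders:
cpdctl dsjob migrate --project MyMigrationProject --file-name /tmp/MyProject.isx --create-connection-paramsets
After the import, you can then set the connection values once in the resulting parameter set asset instead of editing each job.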

A special case for parameter naming is ODBCConnectorPx and PxODBC, which might insert dsn_type into the name, similar to ConnProp_[connectorName]_[dsn_type]_hostname. This pattern also applies to JDBC (ConnProp_JDBC_[protocol]_[property]). If the dsn_type or protocol portion of the string is available and readable, it is carried over; otherwise, it is cut from the string and the name takes the shorter form, for example ConnProp_JDBC_[property].
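
As a hypothetical illustration of these patterns, an ODBC connection whose dsn_type resolves to DB2 and a JDBC connection with a readable oracle protocol might produce parameter references such as the following; the names are examples only, not output from a specific job:
#ConnProp_ODBC_DB2_hostname#
#ConnProp_JDBC_oracle_hostname#
#ConnProp_JDBC_hostname# (when the protocol is not readable and is cut from the name)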

The following values have defaults:
Apache Cassandra
local_datacenter = #ConnProp_Cassandra_local_datacenter# will have a default value of "datacenter1"
Apache Hive
database = #ConnProp_Hive_database# will have a default value of "default"
DRS as Oracle
service_name = #ConnProp_Oracle_service_name# will have a default value of "orcl"
Informix® CLI
database = #ConnProp_Informix_database# will have a default value of "default"
Informix Enterprise
database = #ConnProp_Informix_database# will have a default value of "default"
Informix Load
database = #ConnProp_Informix_database# will have a default value of "default"
Informix XPS Load
database = #ConnProp_Informix_database# will have a default value of "default"
Oracle
service_name = #ConnProp_Oracle_service_name# will have a default value of "orcl"
Oracle Enterprise
service_name = #ConnProp_Oracle_service_name# will have a default value of "orcl"

For Db2, the field names have periods in them. These periods are dropped when the names are converted into parameter references.

If the missing property is port, the import process produces warnings. To address the warnings, manually specify the information in the DataStage job settings. Then, recompile the job.

The following table shows the connections that might have missing properties, the properties that cannot be retrieved, the default value that will be given, and the field that the original value was derived from in traditional DataStage.

Note: Some traditional connection names map to multiple modern connection names and some modern connection names map to multiple traditional connection names.
Table 2.
Connection name in traditional DataStage | Connection name in modern DataStage | Missing properties that are automatically given a default parameter | Missing properties that are automatically given a default value | Notes
ODBC Connector | ODBC | hostname, database | port = 0; dsn_type = DB2 | If the data source is parameterized or if the value does not contain an accepted data source, it defaults to Db2. Specify the correct data source type. You can specify a different value for the database property if the correct data source does not use database. For example, Oracle uses service_name instead.
ODBC Enterprise Connector | ODBC Enterprise | hostname, database | port = 0; dsn_type = DB2 | If dsn_type is not in the list of allowed values, the default is DB2.
JDBC Connector | Generic JDBC | jdbc_driver | | You must edit the connection for this property and ensure that the driver JAR file is in the runtime pod.
DRS Connector | ODBC, Oracle, Db2 (optimized); DRS is not available in the modern version of DataStage | advanced.hostname | advanced.port = 0 | The connection is migrated as a Db2 connector by default. If you use a parameter for Database type, DRS is also migrated as Db2 by default.
Informix Load | Informix | host, password, username | port = 1526 |
Informix Enterprise (if remote server = true) | Informix | host, server | port = 1526 |
Informix Enterprise (if remote server = false) | Informix | username, password, host, server | port = 1526 |
Informix CLI | Informix | host, username, password | port = 1526 |
Informix XPS Load | Informix | host | port = 1526 |
Db2 Z Bulk Load (PxDB2Z) | Db2 for z/OS® | host | port = 446 | In migrated jobs that contain this connection, DataStage automatically converts other properties to comparable ones in modern DataStage. After you migrate jobs, check all connection properties for their proper values and provide them where necessary.

Updating credential information

For the following stages, you must replace some properties after the import process is finished:
  • Google BigQuery
  • Google Cloud Storage
  • Google Cloud Pub/Sub
After the import process is finished, specify the correct values for the following fields:
  • client_id
  • client_secret
  • access_token
  • refresh_token

Configuring DataStage jobs with Kerberos connection properties

After you migrate a job from legacy DataStage, if the job contains connections that have Kerberos authentication enabled, you must configure the job to work with the authentication.

This task applies to the following connectors:
  • Apache Hive
  • Apache HDFS
  • Apache Kafka
To configure an applicable job with Kerberos authentication:
  1. Provide the Kerberos configuration file by following the instructions in Enabling platform connections to use Kerberos authentication, in the section "Providing the Kerberos configuration file (krb5.conf) to the connections microservice."
  2. Upload the keytab.
    1. Open the DataStage flow that is associated with the job, then double-click the connection on the DataStage designer canvas to open the connection properties panel.
    2. Drag the keytab file from your local computer onto the Keytab file field or click in the field to browse for the file and upload it.

Handling migrated jobs that contain the APT_CONFIG_FILE environment variable

The modern version of DataStage uses dynamic configuration file generation by default. If a migrated flow contains the APT_CONFIG_FILE environment variable, you must review this variable setting and either remove it or create the configuration file on /px-storage or another accessible persistent volume so that the original job can run. For more information, see Creating and setting the APT_CONFIG_FILE environment variable in DataStage.
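
If you keep the variable, the configuration file that it points to must exist on a persistent volume that the DataStage runtime can read, such as /px-storage. The following is a minimal sketch of a one-node configuration file; the node name, fastname, and resource paths are placeholders that you must adapt to your environment:
{
    node "node1"
    {
        fastname "px-runtime-host"
        pools ""
        resource disk "/px-storage/datasets" {pools ""}
        resource scratchdisk "/px-storage/scratch" {pools ""}
    }
}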

Enabling flow connections

If a job in traditional DataStage that you want to import has a connection, you can have the connection properties included in the flow definition when you migrate the job to modern DataStage. To migrate the job this way, you must use the dsjob CLI command to migrate the ISX file that contains the jobs, and you must enable flow connections by adding the --enable-local-connection parameter to the command.
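
For example, a hedged dsjob invocation that migrates the ISX file with flow connections enabled might look like the following; the project and file names are placeholders:
cpdctl dsjob migrate --project MyMigrationProject --file-name /tmp/MyProject.isx --enable-local-connection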

Enabling user-defined functions

If you are migrating a job that contains a parallel routine from traditional DataStage, you must create a function library to enable user-defined functions in the Transformer stage of modern DataStage.

Create the new function library from the existing .so file that the parallel routine points to. Then, configure the library by setting the return data type for each function that you want to use.

Manually creating a new surrogate key file

The stage PxSurrogateKeyGenerator is a type of Surrogate Key Generator (SKG) stage that is deprecated in traditional DataStage. When you migrate a job from traditional DataStage that has this stage, and the stage has both input and output links, the stage is automatically converted to the PxSurrogateKeyGeneratorN type of SKG stage. After migration, you must manually create a new surrogate key file before you run the migrated stage.

To manually create a new surrogate key file, complete the following procedure, which uses example file names:
  1. Migrate the job from traditional DataStage.
  2. Compile the job in modern DataStage. An error is produced that says "Could not find key state file: Surrogate_Key_Generator_1." Record the file name for use in the next step.
  3. Create a separate DataStage flow that contains an SKG stage to create a key file. Use the name that you recorded for the value of the Source name field in the properties for the SKG stage in the new flow.
  4. Compile and run the new flow to create the key file Surrogate_Key_Generator_1.
  5. Compile and run the flow from Step 2 again. The flow compiles and runs successfully because the required key file Surrogate_Key_Generator_1 exists now.