Migrating DataStage jobs

You migrate DataStage jobs by using ISX files that contain the job information, and then complete post-migration tasks where applicable.

Procedure

Creating and exporting an ISX file

Create and export an ISX file by using one of the methods that are listed in the following table:
Option | Instructions
ISTOOL | Use ISTOOL to create an ISX file and export the file. For instructions, see Export command for InfoSphere DataStage and QualityStage assets and How to use ISTOOL for EXPORT IMPORT Information Server Components.
MettleCI | Use MettleCI, a third-party service, to convert a server job design into an equivalent parallel job design, then create an ISX file and export the file to your system. For more information, see the MettleCI documentation.
InfoSphere Information Server Manager GUI client | Use the Information Server Manager GUI client to export the ISX file. For detailed instructions, see Exporting assets.
Note: Make sure that the ISX file export includes any dependencies, such as parameter sets and table definitions.
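
For reference, an ISTOOL export that is run from an Information Server client or engine tier might look like the following sketch. The host names, credentials, project path, archive name, and the dependency option are placeholders and assumptions; check the Export command reference that is linked above for the exact syntax that your Information Server version supports.
istool export -domain services-host:9443 -username isadmin -password <password> -archive /tmp/MyProject.isx -datastage ' -incdep "engine-host/MyProject/*/*.*" '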

Importing the ISX file

Complete the following steps to import the ISX file:
  1. Open an existing project or create a new one.
  2. From the Assets tab of the project, click New asset > Graphical builders > DataStage.
  3. Click the Local file tab and upload the ISX file from your local computer. Then, click Create.
    Note: The ISX file must exist on your desktop or network drive. Do not drag the file as an attachment from another application.
Asset import report
After the import process is complete, the DataStage asset import report is available. You can review this report in your browser, download the report as a .csv file, or obtain the report by using the command-line interface (CLI).
Browser
To view the report in your browser:
  1. Click the notifications (bell) icon.
  2. From the list of notifications, find the ISX file that you imported and click View import summary.
To download the report as a CSV file:
  1. Open the report in your browser.
  2. Click Download as CSV.
CLI
To download the report by using the Cloud Pak for Data CLI (cpdctl), use the dsjob migrate command. Use the --status command option to retrieve the status report. See the following example:
cpdctl dsjob migrate {--project PROJECT | --project-id PROJID} [--on-failure ONFAILURE] [--conflict-resolution CONFLICT-RESOLUTION] [--attachment-type ATTACHMENT-TYPE] [--import-only] [--create-missing] [--file-name FILENAME] [--status IMPORT-ID --format csv/json] [--stop IMPORT-ID]
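
For example, the following sketch starts a migration and then retrieves the import report for that migration as a CSV file. The project name, file name, and import ID are placeholders, and the sketch assumes that you take the import ID from the output of the first command:
cpdctl dsjob migrate --project MyMigrationProject --file-name /tmp/MyProject.isx
cpdctl dsjob migrate --project MyMigrationProject --status <import-id> --format csv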
Asset import report: status information
The import report contains status information about the assets that are included in the ISX file:
Imported
The total number of assets that were imported. These assets were created; however, some of them, especially DataStage flows, might have compilation errors that need attention.
Failed
The total number of assets that were not imported. Each failed import is associated with one or more errors. The failed count does not include compilation failures, which are counted under the "Compilation failed" status.
Compilation failed
The compilation process failed. This is frequently caused by missing parameter sets or other assets that are referenced in the DataStage flow.
Currently unsupported
The asset is not supported at the time of the import, but might be supported in a future release.
Deprecated
The asset is deprecated in the modern version of DataStage.
Dependency missing
Some assets that are referenced by a DataStage flow or pipeline do not exist in the project, or were not imported earlier. A missing dependency is likely to contribute to a compilation error. A DataStage flow might have multiple missing dependency objects. A missing dependency might appear in multiple DataStage flows or pipelines.
Common messages during the import process
During the process of importing jobs, you might encounter the following common messages:
Parallel job imported with warnings. The import created the following DataStage flow asset but could not create its associated job.
An internal error in the flow or pipeline job creation step can prevent the flow or pipeline job from being created. An error message with more detailed information is included in the import status report.
Unsupported stage type
Some traditional DataStage stage types (or stage types with certain features) might not be supported in the modern version of DataStage. An “unsupported stage type” error is given for these stage types.
Unsupported secure gateway
The secure gateway feature is only available in the Cloud Pak for Data as a service (cloud) environment. For Cloud Pak for Data, an “Unsupported secure gateway” warning is given if the flow contains secure gateway parameters.
Connection parameters replaced with defaults
When a required connection property is not specified in the flow, it is replaced by a job parameter or a parameter set parameter. A warning is given when the substitution happens in a flow import.
Deprecated (server jobs)
Server jobs are deprecated and not supported in the modern version of DataStage. In the import status report, server jobs are included in the asset list and marked with “deprecated.”
Empty Stage List
When a flow or container contains no stages, an empty stage list warning is given.
Unbound Reference
A reference to an activity variable (for pipelines only) can be made only to an activity that was defined in the execution path before the activity node that contains the reference; otherwise, the reference is unbound. For each unbound reference, DataStage creates an additional user variable so that the import can succeed and functionality is not impacted. When this happens, an unbound reference warning is given to indicate that user variables were created automatically.
Assets that are created from parallel job and sequence job imports
The following table lists the assets that are created when parallel jobs and sequence jobs are imported.
Asset type | Type label | Assets created on import
Parallel job (pjb) | px_job | DataStage flow + job + connection (from connection properties that are embedded in the job definition)
Sequence job | sequence_job | Watson pipeline flow + job
Connection (dcn) | connection | Connection (named)
Parameter set | parameter_set | Parameter set
Table definition | xx | Data definition

Reloading JAR files for connections that use JDBC JAR files

If you have connections that use JDBC JAR files, you must reload the JAR files to IBM Cloud Pak for Data.
To upload the JAR files, complete the following steps:
  1. Locate a JDBC connection in a project, then open it.
  2. Click Upload new file to upload JDBC driver JAR files. See Importing JDBC drivers for the procedure and required permissions to upload a JAR file to IBM Cloud Pak for Data.

Next, edit the Generic JDBC connection to use the JAR files. Go to Add to project > connection. Select the Generic JDBC connection. For more information, see Generic JDBC connection.

Migrating JDBC connections

After you migrate a DataStage job that has a JDBC connection, you must configure the connection to ensure the job works correctly.

Typically, the JDBC connection in the traditional version of DataStage contains the following properties:
  • URL
  • Username
  • Password
  • Attributes
  • JDBC driver classpath
  • JDBC driver classname

After migration, the flow in DataStage will have the use_datastage property set to "true."

After you migrate the job, complete the following steps:
  1. Upload the required JDBC drivers to the Cloud Pak for Data cluster.
  2. Check the import summary for any instructions that you need to follow. If the original flow did not have the JDBC Driver classname property, then during the import process a job parameter is created so you can enter the driver classname.
    See the following example of an import summary. The warnings section describes a parameter that needs to be updated.
    {
        "id": "010c11d2-03d4-4a1d-af02-db5d2de43cd1",
        "name": "jdbc_oracle_peek",
        "original_name": "jdbc_oracle_peek",
        "type": "px_job",
        "job_type": "px_job",
        "job_id": "1a90fadb-53e0-4d05-a080-856f5e804e5c",
        "job_name": "jdbc_oracle_peek.DataStage job",
        "status": "completed",
        "conflict_resolution_status": "flow_replacement_succeeded",
        "end_time": "2023-02-14T18:47:48.071Z",
        "warnings": [
            {
                "type": "default_connection_parameter",
                "name": "JDBC_Connector_0",
                "description": "Parameter values could not be resolved and default values have been added for parameters [jdbc_driver] in JDBC_Connector_0. Resolution: replace default values with actual values."
            }
        ]
    }
    

    Provide the classname as the default value for this job parameter or provide the classname in the properties panel.
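
    For instance, for an Oracle source behind the Generic JDBC connection, the jdbc_driver parameter is typically set to the Oracle thin driver classname, oracle.jdbc.OracleDriver. The following sketch assumes that your version of cpdctl dsjob run supports the --param option for job parameters; otherwise, set the default value in the properties panel:
    cpdctl dsjob run --project MyMigrationProject --job "jdbc_oracle_peek.DataStage job" --param jdbc_driver=oracle.jdbc.OracleDriver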

Specifying hostname and port number information

In jobs in the traditional version of DataStage, data connection information might be specified in the form of a parameter reference.

When you import one of these jobs into the modern version of DataStage, the connection information is not recognized, and the import process produces warnings. To address the warnings, manually specify the information in the DataStage job settings. Then, recompile the job.

The following table shows the connections that might have this issue, the default values of the parameters, and the fields in traditional DataStage that the parameter information is derived from.

Note: Some traditional connection names map to multiple modern connection names and some modern connection names map to multiple traditional connection names.
Table 1.
Connection name in traditional DataStage | Connection name in modern DataStage | Field that the original value was derived from in traditional DataStage | Parameter that receives the parameter reference | Default value | Notes
Sybase Enterprise | SAP ASE | Server | host | port = 5000 |
Sybase OC | SAP ASE | Server | host | port = 5000 |
Db2® | Db2 (optimized) | Instance | advanced.host | advanced.port = 50000 |
Db2 Enterprise | Db2 (optimized) | Server | advanced.host | advanced.port = 50000 |
Oracle | Oracle (optimized) | Server | oracle_service_name | oracle_db_port = 1522 |
Oracle Enterprise | Oracle (optimized) | Server | oracle_service_name | oracle_db_port = 1522 |
Hive | Apache Hive | Server | host | port = 10000; database = default |
Netezza® Enterprise | Netezza (optimized) | Database | database | hostname = <hostname>; port = 5480 |
FTP Enterprise | FTP | URI | host | connection_mode = SSH; port = 22 | The file_name property is also set to a default value of <file_name> in the associated flow stage. In addition to manually updating the connection properties and parameter value, you must also change this usage property within the stage, then compile the job.
Azure Storage Connector | Microsoft Azure File Storage | accountName, accountKey | N/A | https;AccountName=accountName;AccountKey=accountKey | If either accountName or accountKey is a parameter reference, the connection string is replaced with a parameter reference for that field.

Updating job settings for missing properties

When you import certain jobs from the traditional version of DataStage into the modern version, some properties are not specified in the original job and you might need to specify them manually. DataStage automatically parameterizes some properties as job parameters. For example, host is parameterized to #ConnProp_[connection type]_host#.

You can also set the --create-connection-paramsets flag on the cpdctl dsjob migrate command to parameterize connection details as part of a parameter set asset instead of as job parameters. For example, host is parameterized to #ConnParamSet_[connection type].host#. This option allows you to set parameter values at the project level by modifying the parameter set asset instead of specifying the value for each individual job. For more information on the CLI command, see DataStage command-line tools.
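
For example, a hedged invocation that creates parameter sets for connection details during migration might look like the following; the project and file names are placeholders:
cpdctl dsjob migrate --project MyMigrationProject --file-name /tmp/MyProject.isx --create-connection-paramsets
After the import, you can then set the connection values once in the resulting parameter set asset instead of editing each job.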

A special case for parameter naming is ODBCConnectorPx and PxODBC, which might insert dsn_type into the name, similar to ConnProp_[connectorName]_[dsn_type]_hostname. This pattern also applies to JDBC (ConnProp_JDBC_[protocol]_[property]). If the dsn_type or protocol portion of the string is available and readable, it is carried over; otherwise, it is cut from the string and the name takes the shorter form, for example ConnProp_JDBC_[property].
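
As a hypothetical illustration of these patterns, an ODBC connection whose dsn_type resolves to DB2 and a JDBC connection with a readable oracle protocol might produce parameter references such as the following; the names are examples only, not output from a specific job:
#ConnProp_ODBC_DB2_hostname#
#ConnProp_JDBC_oracle_hostname#
#ConnProp_JDBC_hostname# (when the protocol is not readable and is cut from the name)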

The following values have defaults:
Apache Cassandra
local_datacenter = #ConnProp_Cassandra_local_datacenter# will have a default value of "datacenter1"
Apache Hive
database = #ConnProp_Hive_database# will have a default value of "default"
DRS as Oracle
service_name = #ConnProp_Oracle_service_name# will have a default value of "orcl"
Informix® CLI
database = #ConnProp_Informix_database# will have a default value of "default"
Informix Enterprise
database = #ConnProp_Informix_database# will have a default value of "default"
Informix Load
database = #ConnProp_Informix_database# will have a default value of "default"
Informix XPS Load
database = #ConnProp_Informix_database# will have a default value of "default"
Oracle
service_name = #ConnProp_Oracle_service_name# will have a default value of "orcl"
Oracle Enterprise
service_name = #ConnProp_Oracle_service_name# will have a default value of "orcl"

For Db2, the field names have periods in them. These periods are dropped when the names are converted into parameter references.

If the missing property is port, the import process produces warnings. To address the warnings, manually specify the information in the DataStage job settings. Then, recompile the job.

The following table shows the connections that might have missing properties, the properties that cannot be retrieved, the default value that will be given, and the field that the original value was derived from in traditional DataStage.

Note: Some traditional connection names map to multiple modern connection names and some modern connection names map to multiple traditional connection names.
Table 2.
Connection name in traditional DataStage | Connection name in modern DataStage | Missing properties that are automatically given a default parameter | Missing properties that are automatically given a default value | Notes
ODBC Connector | ODBC | hostname, database | port = 0; dsn_type = DB2 | If the data source is parameterized or if the value does not contain an accepted data source, it defaults to Db2. Specify the correct data source type. You can specify a different value for the database property if the correct data source does not use database. For example, Oracle uses service_name instead.
ODBC Enterprise Connector | ODBC Enterprise | hostname, database | port = 0; dsn_type = DB2 | If dsn_type is not in the list of allowed values, the default is DB2.
JDBC Connector | Generic JDBC | jdbc_driver | | You must edit the connection for this property and ensure that the driver JAR file is in the runtime pod.
DRS Connector | ODBC, Oracle, Db2 (optimized); DRS is not available in the modern version of DataStage | advanced.hostname | advanced.port = 0 | The connection is migrated as a Db2 connector by default. If you use a parameter for Database type, DRS is also migrated as Db2 by default.
Informix Load | Informix | host, password, username | port = 1526 |
Informix Enterprise (if remote server = true) | Informix | host, server | port = 1526 |
Informix Enterprise (if remote server = false) | Informix | username, password, host, server | port = 1526 |
Informix CLI | Informix | host, username, password | port = 1526 |
Informix XPS Load | Informix | host | port = 1526 |
Db2 Z Bulk Load (PxDB2Z) | Db2 for z/OS® | host | port = 446 | In migrated jobs that contain this connection, DataStage automatically converts other properties to comparable ones in modern DataStage. After you migrate jobs, check all connection properties for their proper values and provide them where necessary.

Updating credential information

For the following stages, you must replace some properties after the import process is finished:
  • Google BigQuery
  • Google Cloud Storage
  • Google Cloud Pub/Sub
After the import process is finished, specify the correct values for the following fields:
  • client_id
  • client_secret
  • access_token
  • refresh_token

Configuring DataStage jobs with Kerberos connection properties

After you migrate a job from legacy DataStage, if the job contains connections that have Kerberos authentication enabled, you must configure the job to work with the authentication.

This task applies to the following connectors:
  • Apache Hive
  • Apache HDFS
  • Apache Kafka
To configure an applicable job with Kerberos authentication:
  1. Provide the Kerberos configuration file by following the instructions in Enabling platform connections to use Kerberos authentication, in the section "Providing the Kerberos configuration file (krb5.conf) to the connections microservice."
  2. Upload the keytab.
    1. Open the DataStage flow that is associated with the job, then double-click the connection on the DataStage designer canvas to open the connection properties panel.
    2. Drag the keytab file from your local computer onto the Keytab file field or click in the field to browse for the file and upload it.

Handling migrated jobs that contain the APT_CONFIG_FILE environment variable

The modern version of DataStage uses dynamic configuration file generation by default. If a migrated flow contains the APT_CONFIG_FILE environment variable, you must review this variable setting and either remove it or create the configuration file on /px-storage or another accessible persistent volume so that the original job can run. For more information, see Creating and setting the APT_CONFIG_FILE environment variable in DataStage.
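
If you keep the variable, the configuration file that it points to must exist on a persistent volume that the DataStage runtime can read, such as /px-storage. The following is a minimal sketch of a one-node configuration file; the node name, fastname, and resource paths are placeholders that you must adapt to your environment:
{
    node "node1"
    {
        fastname "px-runtime-host"
        pools ""
        resource disk "/px-storage/datasets" {pools ""}
        resource scratchdisk "/px-storage/scratch" {pools ""}
    }
}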

Enabling flow connections

If a job in traditional DataStage that you want to import has a connection, you can have the connection properties included in the flow definition when you migrate the job to modern DataStage. To migrate the job this way, you must use the dsjob CLI command to migrate the ISX file that contains the jobs, and you must enable flow connections by adding the --enable-local-connection parameter to the command.
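
For example, a hedged dsjob invocation that migrates the ISX file with flow connections enabled might look like the following; the project and file names are placeholders:
cpdctl dsjob migrate --project MyMigrationProject --file-name /tmp/MyProject.isx --enable-local-connection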

Enabling user-defined functions

If you are migrating a job that contains a parallel routine from traditional DataStage, you must create a function library to enable user-defined functions in the Transformer stage of modern DataStage.

Create the new function library from the existing .so file that the parallel routine points to. Then, configure the library by setting the return data type for each function that you want to use.

Manually creating a new surrogate key file

The stage PxSurrogateKeyGenerator is a type of Surrogate Key Generator (SKG) stage that is deprecated in traditional DataStage. When you migrate a job from traditional DataStage that has this stage, and the stage has both input and output links, the stage is automatically converted to the PxSurrogateKeyGeneratorN type of SKG stage. After migration, you must manually create a new surrogate key file before you run the migrated stage.

To manually create a new surrogate key file, complete the following procedure, which uses example file names:
  1. Migrate the job from traditional DataStage.
  2. Compile the job in modern DataStage. An error is produced that says "Could not find key state file: Surrogate_Key_Generator_1." Record the file name for use in the next step.
  3. Create a separate DataStage flow that contains an SKG stage to create a key file. Use the name that you recorded for the value of the Source name field in the properties for the SKG stage in the new flow.
  4. Compile and run the new flow to create the key file Surrogate_Key_Generator_1.
  5. Compile and run the flow from Step 2 again. The flow compiles and runs successfully because the required key file Surrogate_Key_Generator_1 exists now.