Migrating DataStage jobs
You migrate DataStage jobs by importing ISX files that contain the job information, then completing any applicable post-migration tasks.
Procedure
- Creating and exporting an ISX file
- Importing the ISX file
- If you have connections that use JDBC JAR files, reload the JAR files to IBM Cloud Pak® for Data. See Reloading JAR files for connections that use JDBC JAR files
- Migrating JDBC connections
- Specifying hostname and port number information
- Updating job settings for missing properties
- Updating credential information
- Configuring DataStage jobs with Kerberos connection properties
- Handling migrated jobs that contain the APT_CONFIG_FILE environment variable
- Enabling flow connections
- Enabling user-defined functions
- Manually creating a new surrogate key file
Creating and exporting an ISX file
Option | Instructions |
---|---|
ISTOOL | Use ISTOOL to create an ISX file and export the file. For instructions, see Export command for InfoSphere DataStage and QualityStage assets and How to use ISTOOL for EXPORT IMPORT Information Server Components. |
MettleCI | Use MettleCI, which is a third-party service, to convert a server job design into an equivalent parallel job design, then create an ISX file and export the file to your system. For more information, see the MettleCI docs. |
InfoSphere Information Server Manager GUI client | Use the Information Server Manager GUI client to export the ISX file. For detailed instructions, see Exporting assets. |
Importing the ISX file
- Open an existing project or create a new one.
- From the Assets tab of the project, click .
- Click the Local file tab, then upload the ISX file from your local computer. Then, click Create. Note: The ISX file must exist on your desktop or network drive. Do not drag the file as an attachment from another application.
- Asset import report
- After the import process is complete, the DataStage
asset import report is available. You can review this report in your browser, download the report as
a .csv file, or obtain the report by using the command-line interface (CLI).
- Browser
- To view the report in your browser:
- Click the bell notifications icon.
- From the list of notifications, select the ISX file that you imported by clicking View import summary.
- CLI
- To download the report by using the Cloud Pak for Data CLI
(cpdctl), use the dsjob migrate command. Use the --status
command option to retrieve the status report. See the following
example:
```
cpdctl dsjob migrate {--project PROJECT | --project-id PROJID} [--on-failure ONFAILURE] [--conflict-resolution CONFLICT-RESOLUTION] [--attachment-type ATTACHMENT-TYPE] [--import-only] [--create-missing] [--file-name FILENAME] [--status IMPORT-ID --format csv/json] [--stop IMPORT-ID]
```
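After you save the JSON form of the status report, you can post-process it with a short script. The following sketch tallies assets by status and flags those with warnings. The two-asset report is hypothetical, and the field names (`name`, `status`, `warnings`) are assumptions based on the asset-record example in this section; the exact schema can vary by release.

```python
import json
from collections import Counter

# Hypothetical excerpt of a status report saved with, for example:
#   cpdctl dsjob migrate --project MyProject --status IMPORT-ID --format json > report.json
# Field names mirror the asset-record example in this document;
# the real schema may differ by release.
report_json = """
[
  {"name": "jdbc_oracle_peek", "status": "completed",
   "warnings": [{"type": "default_connection_parameter"}]},
  {"name": "legacy_server_job", "status": "deprecated", "warnings": []}
]
"""

assets = json.loads(report_json)

# Tally assets by status and collect the ones whose warnings need attention.
status_counts = Counter(asset["status"] for asset in assets)
needs_review = [asset["name"] for asset in assets if asset.get("warnings")]

print(dict(status_counts))
print(needs_review)
```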
- Asset import report: status information
- The import report contains status information about the assets that are included in the ISX file:
- Imported
- The total number of assets imported. These assets were created. However, especially for DataStage flows, some compilation errors might need attention.
- Failed
- The total number of assets that were not imported. Each failed import is associated with one or more errors. The failed count does not include compilation failures, which are counted under the "Compilation failed" status.
- Compilation failed
- The compilation process failed. This is frequently caused by missing parameter sets or other assets that are referenced in the DataStage flow.
- Currently unsupported
- The asset is not supported at the time of the import, but might be supported in a future release.
- Deprecated
- The asset is deprecated in the modern version of DataStage.
- Dependency missing
- Some assets that are referenced by a DataStage flow or pipeline do not exist in the project, or were not imported earlier. A missing dependency is likely to contribute to a compilation error. A DataStage flow might have multiple missing dependency objects. A missing dependency might appear in multiple DataStage flows or pipelines.
- Common messages during the import process
- During the process of importing jobs, you might encounter the following common messages:
- Parallel job imported with warnings. The import created the following DataStage flow asset but could not create its associated job.
- An internal error in the flow or pipeline job creation step can prevent the flow or pipeline job from being created. An error message with more detailed information is included in the import status report.
- Unsupported stage type
- Some traditional DataStage stage types (or stage types with certain features) might not be supported in the modern version of DataStage. An “unsupported stage type” error is given for these stage types.
- Unsupported secure gateway
- The secure gateway feature is only available in the Cloud Pak for Data as a service (cloud) environment. For Cloud Pak for Data, an “Unsupported secure gateway” warning is given if the flow contains secure gateway parameters.
- Connection parameters replaced with defaults
- When a required connection property is not specified in the flow, it is replaced by a job parameter or a parameter set parameter. A warning is given when the substitution happens in a flow import.
- Deprecated (server jobs)
- Server jobs are deprecated and not supported in the modern version of DataStage. In the import status report, server jobs are included in the asset list and marked with “deprecated.”
- Empty Stage List
- When a flow or container contains no stages, an empty stage list warning is given.
- Unbound Reference
- A reference to an activity variable (for pipelines only) can be made only to an activity that was defined in the execution path before the activity node that contains the reference. Otherwise, the reference is unbound. For each unbound reference, DataStage creates an additional user variable so that the import can succeed and functionality is not impacted. When this happens, an unbound reference warning also notes that DataStage automatically created the additional user variables.
- Assets that are created from parallel job and sequence job imports
- The following table lists the assets that are created when parallel jobs and sequence jobs are imported.
Asset type | Type label | Assets created on import |
---|---|---|
Parallel job (pjb) | px_job | DataStage flow + job + connection (from connection properties that are embedded in the job definition) |
Sequence job | sequence_job | Watson pipeline flow + job |
Connection (dcn) | connection | Connection (named) |
Parameter set | parameter_set | Parameter set |
Table definition | xx | Data definition |
Reloading JAR files for connections that use JDBC JAR files
If you have connections that use JDBC JAR files, you must reload the JAR files to IBM Cloud Pak for Data.
- Locate a JDBC connection in a project, then open it.
- Click Upload new file to upload the JDBC driver JAR files. See Importing JDBC drivers for the procedure and required permissions to upload a JAR file to IBM Cloud Pak for Data.
Next, edit the Generic JDBC connection to use the JAR files. Go to . Select the Generic JDBC connection. For more information, see Generic JDBC connection.
Migrating JDBC connections
After you migrate a DataStage job that has a JDBC connection, you must configure the connection to ensure that the job works correctly. The following connection properties might need to be configured:
- URL
- Username
- Password
- Attributes
- JDBC driver classpath
- JDBC driver classname
After migration, the flow in DataStage will have the use_datastage property set to "true."
- Upload the required JDBC drivers to the Cloud Pak for Data cluster.
- Check the import summary for any instructions that you need to follow. If the original flow did not have the JDBC Driver classname property, then during the import process a job parameter is created so you can enter the driver classname. See the following example of an import summary. The warnings section describes a parameter that needs to be updated.

```
{
  "id": "010c11d2-03d4-4a1d-af02-db5d2de43cd1",
  "name": "jdbc_oracle_peek",
  "original_name": "jdbc_oracle_peek",
  "type": "px_job",
  "job_type": "px_job",
  "job_id": "1a90fadb-53e0-4d05-a080-856f5e804e5c",
  "job_name": "jdbc_oracle_peek.DataStage job",
  "status": "completed",
  "conflict_resolution_status": "flow_replacement_succeeded",
  "end_time": "2023-02-14T18:47:48.071Z",
  "warnings": [
    {
      "type": "default_connection_parameter",
      "name": "JDBC_Connector_0",
      "description": "Parameter values could not be resolved and default values have been added for parameters [jdbc_driver] in JDBC_Connector_0. Resolution: replace default values with actual values."
    }
  ]
}
```
Provide the classname as the default value for this job parameter or provide the classname in the properties panel.
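To find every stage whose connection parameters were replaced with defaults, you can scan the warnings array of the import summary. The following sketch uses a shortened form of the summary shown above; the filtering logic is illustrative, not part of the product.

```python
import json

# Import summary for one asset, shortened from the example above.
summary = json.loads("""
{
  "name": "jdbc_oracle_peek",
  "status": "completed",
  "warnings": [
    {
      "type": "default_connection_parameter",
      "name": "JDBC_Connector_0",
      "description": "Parameter values could not be resolved and default values have been added for parameters [jdbc_driver] in JDBC_Connector_0."
    }
  ]
}
""")

# Stages whose parameters were defaulted; update these before running the job.
stages_to_fix = [
    w["name"]
    for w in summary.get("warnings", [])
    if w.get("type") == "default_connection_parameter"
]
print(stages_to_fix)
```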
Specifying hostname and port number information
In jobs in the traditional version of DataStage, data connection information might be specified in the form of a parameter reference.
When you import one of these jobs into the modern version of DataStage, the connection information is not recognized, and the import process produces warnings. To address the warnings, manually specify the information in the DataStage job settings. Then, recompile the job.
The following table shows the connections that might have this issue, the default values of the parameters, and what fields in traditional DataStage that the parameter information is derived from.
Connection name in traditional DataStage | Connection name in modern DataStage | Field that the original value was derived from in traditional DataStage | Parameter that receives the parameter reference | Default value | Notes |
---|---|---|---|---|---|
Sybase Enterprise | SAP ASE | Server | host | port = 5000 | |
Sybase OC | SAP ASE | Server | host | port = 5000 | |
Db2® | Db2 (optimized) | Instance | advanced.host | advanced.port = 50000 | |
Db2 Enterprise | Db2 (optimized) | Server | advanced.host | advanced.port = 50000 | |
Oracle | Oracle (optimized) | Server | oracle_service_name | oracle_db_port = 1522 | |
Oracle Enterprise | Oracle (optimized) | Server | oracle_service_name | oracle_db_port = 1522 | |
Hive | Apache Hive | Server | host | | |
Netezza® Enterprise | Netezza (optimized) | Database | database | | |
FTP Enterprise | FTP | URI | host | | The file_name property will also be set to a default in the associated flow stage. The default value will be <file_name>. In addition to manually updating the connection properties and parameter value, you must also change this usage property within the stage, then compile the job. |
Azure Storage Connector | Microsoft Azure File Storage | accountName, accountKey | N/A | https;AccountName=accountName; AccountKey=accountKey | If either accountName or accountKey is a parameter reference, the connection string is replaced with a parameter reference for that field. |
Updating job settings for missing properties
When you import certain jobs from the traditional version of DataStage into the modern version, some properties are not specified in the original job and you might need to specify them manually. DataStage automatically parameterizes some properties as job parameters. For example, host is parameterized to #ConnProp_[connection type]_host#.

You can also set the --create-connection-paramsets flag on the cpdctl dsjob migrate command to parameterize connection details as part of a parameter set asset instead of as job parameters. For example, host is parameterized to #ConnParamSet_[connection type].host#. This option allows you to set parameter values at the project level by modifying the parameter set asset instead of specifying the value for each individual job. For more information on the CLI command, see DataStage command-line tools.

A special case for the parameter naming is ODBCConnectorPx and PxODBC, which might insert dsn_type into the name, similar to ConnProp_[connectorName]_[dsn_type]_hostname. This pattern also applies to JDBC (ConnProp_JDBC_[protocol]_[property]). If the protocol portion of the string is readable, it is carried over; otherwise it is cut from the string and the name takes the form ConnProp_JDBC_[property]. These cases apply only if dsn_type or protocol is available.
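The naming patterns above can be summarized in a small helper. This is an illustrative reimplementation under the rules as described, not the product's actual code; the function name and signature are hypothetical.

```python
from typing import Optional

def conn_param_ref(connector: str, prop: str,
                   dsn_type: Optional[str] = None,
                   protocol: Optional[str] = None) -> str:
    """Build a job-parameter reference per the patterns described above.

    Illustrative sketch only: ODBC-style connectors may insert dsn_type
    into the name, and JDBC keeps the protocol segment only when it is
    readable; otherwise the protocol is dropped.
    """
    if connector == "JDBC":
        parts = ["ConnProp", "JDBC"] + ([protocol] if protocol else []) + [prop]
    elif dsn_type:
        parts = ["ConnProp", connector, dsn_type, prop]
    else:
        parts = ["ConnProp", connector, prop]
    return "#" + "_".join(parts) + "#"

print(conn_param_ref("Hive", "host"))                                # #ConnProp_Hive_host#
print(conn_param_ref("ODBCConnectorPx", "hostname", dsn_type="DB2")) # #ConnProp_ODBCConnectorPx_DB2_hostname#
print(conn_param_ref("JDBC", "jdbc_driver"))                         # #ConnProp_JDBC_jdbc_driver#
```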
- Apache Cassandra
- local_datacenter = #ConnProp_Cassandra_local_datacenter# will have a default value of "datacenter1"
- Apache Hive
- database = #ConnProp_Hive_database# will have a default value of "default"
- DRS as Oracle
- service_name = #ConnProp_Oracle_service_name# will have a default value of "orcl"
- Informix® CLI
- database = #ConnProp_Informix_database# will have a default value of "default"
- Informix Enterprise
- database = #ConnProp_Informix_database# will have a default value of "default"
- Informix Load
- database = #ConnProp_Informix_database# will have a default value of "default"
- Informix XPS Load
- database = #ConnProp_Informix_database# will have a default value of "default"
- Oracle
- service_name = #ConnProp_Oracle_service_name# will have a default value of "orcl"
- Oracle Enterprise
- service_name = #ConnProp_Oracle_service_name# will have a default value of "orcl"
For Db2, the field names have periods in them. These periods are dropped when the names are converted into parameter references.
If the missing property is port, the import process produces warnings. To address the warnings, manually specify the information in the DataStage job settings. Then, recompile the job.
The following table shows the connections that might have missing properties, the properties that cannot be retrieved, the default value that will be given, and the field that the original value was derived from in traditional DataStage.
Connection name in traditional DataStage | Connection name in modern DataStage | Missing properties that are automatically given a default parameter | Missing properties that are automatically given a default value | Notes |
---|---|---|---|---|
ODBC Connector | ODBC | | | If the data source is parameterized or if the value does not contain an accepted data source, it will default to Db2. Specify the correct data source type. You can specify a different value for the database property if the correct data source does not use database. For example, Oracle uses service_name instead. |
ODBC Enterprise Connector | ODBC Enterprise | | | If dsn_type is not in the list of allowed values, the default is DB2. |
JDBC Connector | Generic JDBC | jdbc_driver | | You must edit the connection for this property and ensure that the driver JAR file is in the runtime pod. |
DRS Connector | ODBC, Oracle, Db2 (optimized) (not available in the modern version of DataStage) | advanced.hostname | advanced.port = 0 | |
Informix Load | Informix | | port = 1526 | |
Informix Enterprise (if remote server = true) | Informix | | port = 1526 | |
Informix Enterprise (if remote server = false) | Informix | | port = 1526 | |
Informix CLI | Informix | | port = 1526 | |
Informix XPS Load | Informix | host | port = 1526 | |
Db2 Z Bulk Load (PxDB2Z) | Db2 for z/OS® | host | port = 446 | In migrated jobs that contain this connection, DataStage automatically converts other properties to comparable ones in modern DataStage. After you migrate jobs, make sure to check all connection properties and provide proper values where necessary. |
Updating credential information
After migration, you must update the credential information for jobs that use the following connections:
- Google BigQuery
- Google Cloud Storage
- Google Cloud Pub/Sub
Update the following credential properties:
- client_id
- client_secret
- access_token
- refresh_token
Configuring DataStage jobs with Kerberos connection properties
After you migrate a job from legacy DataStage, if the job contains connections that have Kerberos authentication enabled, you must configure the job to work with the authentication. The following connections are affected:
- Apache Hive
- Apache HDFS
- Apache Kafka
- Provide the Kerberos configuration file by following the instructions in Enabling platform connections to use Kerberos authentication, in the section "Providing the Kerberos configuration file (krb5.conf) to the connections microservice."
- Upload the keytab.
- Open the DataStage flow that is associated with the job, then double-click the connection on the DataStage designer canvas to open the connection properties panel.
- Drag the keytab file from your local computer onto the Keytab file field or click in the field to browse for the file and upload it.
Handling migrated jobs that contain the APT_CONFIG_FILE environment variable
The modern version of DataStage uses dynamic configuration file generation by default. If a migrated flow contains the APT_CONFIG_FILE environment variable, you must review this variable setting and either remove it, or create the configuration file on /px-storage or another accessible persistent volume so that the original job can run. For more information, see Creating and setting the APT_CONFIG_FILE environment variable in DataStage.
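For reference, a static parallel-engine configuration file (the file that APT_CONFIG_FILE points to) is a plain-text node map. The following is a minimal single-node sketch; the node name, fastname, and paths are placeholder assumptions and must match your environment, with the disk and scratch directories on an accessible persistent volume such as /px-storage:

```
{
  node "node1"
  {
    fastname "px-runtime"
    pools ""
    resource disk "/px-storage/datasets" {pools ""}
    resource scratchdisk "/px-storage/scratch" {pools ""}
  }
}
```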
Enabling flow connections
If a job in traditional DataStage that you want to import has a connection, you can have the connection properties included in the flow definition when you migrate the job to modern DataStage. To migrate the job this way, you must use the dsjob CLI command to migrate the ISX file that contains the jobs. You must also enable flow connections by adding the parameter --enable-local-connection to the command.
Enabling user-defined functions
If you are migrating a job that contains a parallel routine from traditional DataStage, you must create a function library to enable user-defined functions in the Transformer stage of modern DataStage.
Create the new function library from the existing .so file that the parallel routine points to. Then, configure the library by setting the return data type for each function that you want to use.
Manually creating a new surrogate key file
The stage PxSurrogateKeyGenerator is a type of Surrogate Key Generator (SKG) stage that is deprecated in traditional DataStage. When you migrate a job from traditional DataStage that has this stage, and the stage has both input and output links, the stage is automatically converted to the PxSurrogateKeyGeneratorN type of SKG stage. After migration, you must manually create a new surrogate key file before you run the migrated stage.
- Migrate the job from traditional DataStage.
- Compile the job in modern DataStage. An error is produced that says "Could not find key state file: Surrogate_Key_Generator_1." Record the file name for use in the next step.
- Create a separate DataStage flow that contains an SKG stage to create a key file. Use the name that you recorded for the value of the Source name field in the properties for the SKG stage in the new flow.
- Compile and run the new flow to create the key file Surrogate_Key_Generator_1.
- Compile and run the flow from Step 2 again. The flow compiles and runs successfully because the required key file Surrogate_Key_Generator_1 exists now.