Troubleshoot IBM DataStage

Use these solutions to help resolve problems that you might encounter with IBM® DataStage®.

Getting help and support for DataStage

If you have problems or questions when you use DataStage, you can get help by searching for information or by asking questions through a forum. You can also open a support ticket.

When you ask a question on the forums, tag your question so that it is seen by the DataStage development teams.

For questions about the service and getting started instructions, use the forum at https://stackoverflow.com/questions/tagged/datastage.

General
Connectors
Runtime
ds-metrics

General

Jobs fail with "Resource temporarily unavailable" error

Jobs might fail when the number of processes that are running on a pod exceeds the process IDs limit.

Workaround: Increase the process IDs limit. For more information, see Changing the process IDs limit.
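For reference, a minimal sketch of an OpenShift KubeletConfig that raises the limit might look like the following; the resource name, pool selector, and the 16384 value are examples, so follow Changing the process IDs limit for the supported procedure:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: datastage-pid-limit                                          # example name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""    # example selector for the worker pool
  kubeletConfig:
    podPidsLimit: 16384                                               # raise the per-pod process IDs limit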

Jobs fail because SQL and Before SQL statements run in incorrect order

When the Teradata connector is set to ANSI transaction mode, the Before SQL statement might run after the SQL statement instead of before it, causing the job to fail.

Workaround: Add a commit statement after each Before SQL statement.
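For example, if the Before SQL statement clears a staging table (the table name is illustrative), follow it with an explicit commit so that the statement completes before the main SQL statement runs:

DELETE FROM stage_orders;   -- illustrative Before SQL statement
COMMIT;                     -- commit explicitly so this work finishes first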

The mailx command fails to run in before-job and after-job subroutines without SMTP server info

If the mailx command is used in a before-job or after-job subroutine, you must provide the SMTP server information; otherwise, execution is forwarded to sendmail and the command fails.
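For example, a sketch that passes the SMTP server on the command line; the host and addresses are placeholders, and the -S option syntax can vary between mailx implementations:

echo "DataStage job finished" | mailx -S smtp=smtp://smtp.example.com:25 -s "DataStage job status" ops@example.com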

Properties selections not preserved if you deselect "Use DataStage properties"

If you enter other properties (for example for tables or schemas) with the default Use DataStage properties option selected, and then deselect Use DataStage properties, the properties are not preserved.

Workaround: If you do not intend to use the default DataStage properties, deselect Use DataStage properties before you enter other properties. Otherwise, specify the properties again after you deselect the option.

Routine fails when CEL function ds.getUserStatus is run on an external flow

When the built-in CEL function ds.getUserStatus is run on a target that is not within the same pipeline, it fails and cannot retrieve the user status. Use the dsjob CLI in your Run Bash script node instead.

For an example of how to rewrite this, see the dsjob command used by DSGetUserStatus() in Routine replacement examples in DataStage.

Job fails when loading a large Excel file

A job with a connector that is processing a large Excel file might fail with this error:

"CDICO9999E: Internal error occurred: IO error: The Excel file is too large. (error code: DATA_IO_ERROR)" 

Try increasing the heap size. The Heap size properties option is in the Other properties section of the connector's Stage tab.

Cannot add DataStage component after system power outage

In some cases, after a power outage of the underlying system for Cloud Pak for Data, you might get errors or other issues when power is restored. These issues might include problems when you try to add a DataStage component to a project. You might also encounter other problems, such as pages not loading or flows not compiling.

To resolve the issues, you can try restarting the DataStage pods:
oc -n ${PROJECT_CPD_INST_OPERANDS} delete pod -l app.kubernetes.io/name=datastage
watch "oc get pods -l app.kubernetes.io/name=datastage | grep -v 1/1 | grep -v 2/2 | grep -v Completed"
Compilation of DataStage flow fails if the Java classpath is parameterized
If you use the Java Integration stage and you parameterize the Java classpath, you must specify the fully qualified JAR file location in the classpath for each JAR file that is used by the stage. See the following example:
PARAM_NAME=/ds-storage/projects/<project_id>/java/<JavaLibraryName>/<classLibraryName>
Exported flows generate JSON connection files that contain plaintext passwords
Downloaded flows might include connection assets that have credentials or other sensitive information. You can run the following command to change the export behavior so that all future exports remove credentials by default.

oc -n ${PROJECT_CPD_INST_OPERANDS} patch datastage datastage --patch '{"spec":{"migration_export_remove_secrets":true}}' --type=merge
Issues browsing database tables with columns that contain special characters

You might have issues when you use the Asset Browser to browse database tables if the selected table contains a column with special characters such as ., $, or #, and you add that table into a DataStage flow. DataStage does not support column names that contain special characters. DataStage flows that reference columns with names that include these special characters will not work.

To work around this problem, create a view over the database table and redefine the column name in the view. For example:

create view view1 as select column1$ as column1, column2# as column2 ... from table

Then, when you use the Asset Browser, find the view and add it to the DataStage flow.

Incorrect inferences assigned to a schema read by the Asset Browser

The Asset Browser reads the first 1000 records in the files in IBM Cloud Object Storage, Amazon S3, Google Cloud Storage, Azure File Storage, Azure Blob Storage, or the Azure Data Lake service and infers the schema (column name, length, data type, and nullability) from those records only. For instance, the Asset Browser might identify a column as an integer based on the first 1000 records, even though later records in the file show that the column ought to be treated as the varchar data type. Similarly, the Asset Browser might infer a column as varchar(20) even though later records show that the column ought to be varchar(100).

Workaround:
  • Profile the source data to generate better metadata.
  • Change all columns to be varchar(1024) and gradually narrow down the data type.
Using sequential files as a source

To use sequential files as a source, make sure that the source file can be accessed from the mounted persistent volume.
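To confirm that the runtime can see the file, you can list it from one of the compute pods; the pod name matches the default runtime instance shown later in this topic, and the file path is an example:

oc -n ${PROJECT_CPD_INST_OPERANDS} exec ds-px-default-ibm-datastage-px-compute-0 -- ls -l /ds-storage/projects/<project_id>/source_data.txt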

Error running jobs with a parquet file format
You might receive the following error when you try to run a job with a parquet file format:
Error: CDICO9999E: Internal error occurred: Illegal state error: INTEGER(32,false) can only annotate INT32.
The unsigned 32-bit integer (uint32) and unsigned 64-bit integer (uint64) data types are not supported in the Parquet format that DataStage uses for the file connectors.

Workaround: You must use supported data types.

Migration pod getting evicted for exceeding its ephemeral storage limits
During import, pod usage of ephemeral local storage can exceed the total limit of containers. You might receive the following message:

Status: Failed
Reason: Evicted
Message: Pod ephemeral local storage usage exceeds the total limit of containers 900Mi.
Workaround: To avoid this problem, increase the ephemeral storage limit from the default of 900Mi to 4Gi by running the following command:

oc -n ${PROJECT_CPD_INST_OPERANDS} patch datastage datastage --type merge -p '{"spec": {"custom": {"resources":{"components":{"migration":{"limits":{"ephemeral":"4Gi"}}}}}}}'
Error occurs during the upgrade of Cloud Pak for Data from 5.0.0 to 5.0.1

You might encounter this error while upgrading Cloud Pak for Data from 5.0.0 to 5.0.1. The upgrade fails in the new upgrade tasks for remote instances.

Workaround: When the DataStage CR is alternating between Failed and InProgress during the 5.0.1 upgrade, go through the following steps:
  1. Log in to the Red Hat® OpenShift Container Platform cluster by using oc and set the default project to the project where Cloud Pak for Data is installed.
    oc project $PROJECT_CPD_INST_OPERANDS
  2. Verify whether the PXRuntime instances were successfully upgraded to version 5.0.1.
    oc get pxruntime
  3. If the PXRuntime CR is not successfully upgraded to version 5.0.1, run the following commands:
    
    echo "Adding installedVersion to DataStage CR"
    oc patch datastage datastage --type='json' -p='[{"op": "add", "path": "/spec/installedVersion", "value": "5.0.1" }]'
    while true; do echo "Waiting for DataStage CR to be in Completed state"; sleep 30; if [ $(oc get datastage datastage -o=jsonpath="{.status.dsStatus}") = "Completed" ]; then break; fi; done
    echo "Removing installedVersion from DataStage CR"
    oc patch datastage datastage --type='json' -p='[{"op": "remove", "path": "/spec/installedVersion"}]'
    while true; do echo "Waiting for DataStage CR to be in Completed state"; sleep 30; if [ $(oc get datastage datastage -o=jsonpath="{.status.dsStatus}") = "Completed" ]; then break; fi; done
Flows that contain the Transformer stage time out during compilation

A timeout might occur while compiling flows that contain the Transformer stage.

Default value of the APT_COMPILEOPT environment variable:
-c -O -fPIC -Wno-deprecated -m64 -mtune=generic -mcmodel=small
Workaround: Disable compile time optimization by changing -O to -O0 in compile options in the APT_COMPILEOPT environment variable:
-c -O0 -fPIC -Wno-deprecated -m64 -mtune=generic -mcmodel=small
Ephemeral local storage usage exceeds the total limit of a container

Exceeding the ephemeral local storage can cause unexpected termination of a compute pod and multiple jobs to fail.

Workaround: Check your ephemeral storage usage with the following commands:
$ oc get nodes                                                                                         
NAME                                   STATUS   ROLES                  AGE    VERSION
master0.tahoetest882.cp.fyre.ibm.com   Ready    control-plane,master   220d   v1.28.15+ff493be
master1.tahoetest882.cp.fyre.ibm.com   Ready    control-plane,master   220d   v1.28.15+ff493be
master2.tahoetest882.cp.fyre.ibm.com   Ready    control-plane,master   220d   v1.28.15+ff493be
worker0.tahoetest882.cp.fyre.ibm.com   Ready    worker                 219d   v1.28.15+ff493be
worker1.tahoetest882.cp.fyre.ibm.com   Ready    worker                 219d   v1.28.15+ff493be
worker2.tahoetest882.cp.fyre.ibm.com   Ready    worker                 219d   v1.28.15+ff493be
worker3.tahoetest882.cp.fyre.ibm.com   Ready    worker                 219d   v1.28.15+ff493be
worker4.tahoetest882.cp.fyre.ibm.com   Ready    worker                 219d   v1.28.15+ff493be
$ oc get --raw "/api/v1/nodes/worker0.tahoetest882.cp.fyre.ibm.com/proxy/stats/summary"
... 
  "ephemeral-storage": {
    "time": "2025-03-06T21:48:21Z",
    "availableBytes": 71087955968,
    "capacityBytes": 267830407168,
    "usedBytes": 408403968,
    "inodesFree": 125135426,
    "inodes": 130809280,
    "inodesUsed": 3288
   },
For more information on ephemeral storage, see Modifying the ephemeral storage limit.

Connectors

Netezza connector: Duplicate records occur when partitioned reads are enabled

When partitioned reads are enabled on the Netezza connector in parallel execution mode, duplicate records may occur. To avoid duplicate records, add partition placeholders into the SQL or set the execution mode to sequential. To add partition placeholders, add the string mod(datasliceid,[[node-count]])=[[node-number]], as in the following example.
SELECT * FROM table WHERE mod(datasliceid,[[node-count]])=[[node-number]]
MySQL connector: Jobs might fail if you use Write mode "Update" for the target without a primary key

If you create a table in a MySQL database without specifying a primary key to use in the WHERE clause, and then you try to run a job that uses that table with the Write mode Update for the target, the job might fail.

Solution: Specify a primary key name in the Key column names field. If the table is big and does not have a primary column, you can create a separate column with auto-increment values to use as the primary key.
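For example, a possible way to add such a column in MySQL (the table and column names are illustrative):

ALTER TABLE orders ADD COLUMN row_id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY;

You can then specify row_id in the Key column names field.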

FTP connector: The home directory path is prepended to the path

When you run a job that uses data from an FTP data source, the home or login directory is prepended to the path that you specified. This happens regardless of whether you specify an absolute path (with a leading forward slash) or a relative path (without a leading forward slash). For example, if you specify the directory as /tmp/SampleData.txt, the path resolves to /home/username/tmp/SampleData.txt.

Workaround: Edit the File name in the FTP connector. Specify the absolute path to the source or target file.

Jobs fail with error "The connector could not establish a connection to Db2 database"

Jobs may fail with error "The connector could not establish a connection to Db2 database".

Workaround: Go to the connection properties and set the Options property to connectTimeout=0.

Job with source data from a SAP OData connector fails

If your flow includes source data from SAP OData, the flow might fail if you created the flow by manually adding columns that do not follow the SAP naming convention.

Workaround: Update the flow by adding the columns with the Asset Browser or by renaming the columns according to the SAP naming convention. The SAP naming convention follows the SAP object hierarchy with two underscore characters (__) as a separator. For example, if the PurchaseOrder column belongs to PurchaseOrderNote, then the column name should be specified as PurchaseOrderNote__PurchaseOrder.

Cannot run transactional SQL on data from Apache Hive version 2.0 or earlier

If your data is from Apache Hive version 2.0 or earlier and your DataStage flow executes UPDATE or DELETE statements, the job might fail. Make sure that the target table has been created according to Hive transactions requirements and that the Apache Hive server is configured to support ACID operations.

The minimum set of parameters (configured in the hive-site.xml file) that you must enable for ACID tables in Apache Hive is:

hive.support.concurrency = true
hive.enforce.bucketing = true (not required as of Hive 2.0)
hive.exec.dynamic.partition.mode = nonstrict
hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager

hive.compactor.initiator.on = true
hive.compactor.worker.threads = <a positive number>
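If you configure these settings in the hive-site.xml file, each parameter takes the standard property form inside the <configuration> element, for example:

<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>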

For more information, refer to Hive Transactions.

IBM Db2 for DataStage connection with SSL certificate fails with "Protocol specific error code 414" error

If you use an SSL certificate in the IBM Db2 for DataStage connection and the connection fails with a "Protocol specific error code 414" error, use this workaround:

  1. Identify the root certificate on the Db2 server. You can use this command to view the certificate chain:
    openssl s_client -connect <hostname>:<port> -showcerts
  2. Ensure that the certificate has the same subject and issuer (that is, the certificate is self-signed). See the check that follows these steps.
  3. In the Create connection: IBM Db2 for DataStage page, enter the root certificate in the SSL certificate (arm) field.
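For step 2, a quick way to compare the subject and issuer of a saved certificate file (the file name is an example); for a self-signed root certificate, the two values match:

openssl x509 -in db2_root_ca.pem -noout -subject -issuer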
Error parameterizing the credential field for a flow connection in IBM Cloud Object Storage

When the Authentication method property is set to Service credentials (full JSON snippet), do not parameterize the Service credentials field. If a parameter is provided for that field, the flow will not compile.

PostgreSQL connector times out on large tables

The PostgreSQL connector might fail with a timeout error when a large table (100,000+ rows) is used as a source. To fix this error, try setting a higher timeout value for the APT_RECORD_TIMEOUT environment variable. See Managing environment variables in DataStage.

Jobs with an Apache Hive connection that has ZooKeeper discovery enabled fail

If your DataStage flow includes data from an Apache Hive connection and you have selected Use ZooKeeper discovery for the connection, the flow might fail because it has too many warnings.

Workaround: Increase the number of allowed warnings in the DataStage flow. Go to Settings > Run > Warnings. Then recompile the job.

Schema changes that originate in data from the HTTP connector can cause the job to fail

When you use the HTTP connector to download a file and then upload the same file into IBM Cloud Object Storage or a database, if the file's schema changes over time, the job might fail.

Workaround: Re-create the stage.

Cannot preview data from the generic JDBC connector

If your DataStage flow uses the generic JDBC connector as a target, you cannot preview the data for the following data sources in the Generic JDBC connection:

Vendors:

  • Amazon Redshift
  • Apache Hive
  • MongoDB
  • MySQL

Cloud Pak for Data connections:

  • Amazon Redshift
  • Oracle

Workaround for the data sources except Amazon Redshift (vendor or Cloud Pak for Data connection): In the target stage, select Enable quoted identifiers under the Stage properties.

Cannot create a successful database connection that uses an SSL certificate

The OpenSSL 3.0.9 version does not allow weak ciphers to be used while generating the SSL certificate. When the ValidateServerCertificate (VSC) attribute is set to 0, the connection treats the certificate as invalid. The value must be set to 1 to create a connection.

Workaround: Generate a new SSL certificate that uses strong ciphers with OpenSSL 3.0.x.
  1. Check the ciphers in your SSL certificate with the following command. If the certificate uses the sha1WithRSAEncryption signature algorithm, it is considered weak:
    openssl x509 -in cert.pem -text -noout
  2. Generate the SSL certificate with the following command:
    openssl.exe pkcs12 -in certificate_name -export -out truststore_filename -nokeys -keypbe cryptographic_algorithm -certpbe cryptographic_algorithm -password pass:truststore_password -nomac
Snowflake connector: Jobs are failing with "java/lang/OutOfMemoryError" error
The log displays the following error message:
java/lang/OutOfMemoryError", exception "Failed to create a thread: retVal -1073741830, errno 11"

Workaround: Increase the heap size on the Output or Input tab for your Snowflake connector.

Snowflake connector: Jobs are failing with "Fork failed: Resource temporarily unavailable" error
The log displays the following error messages:
<SCLoadAudIdLd.sf_write__JOB_EXECUTION_LOG__Ins,1> Error: Unable to create iprofiler thread
11/6/2024 06:13:03 WARNING IIS-DSEE-USBP-00002 <Sf_STG_VEH_ALFA,0> Error: Unable to create iprofiler thread
and
WARNING IIS-DSEE-USBP-00002 <sc_AFT_COST_CENTRE_CD.sf_write__AUTO_FINANCE_TYPE__Ins,1>
Type=Segmentation error vmState=0x00000000
WARNING IIS-DSEE-USBP-00002 <sc_AFT_COST_CENTRE_CD.sf_write__AUTO_FINANCE_TYPE__Ins,1> J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001

Workaround: Increase the PID limits inside the OpenShift cluster. Set podsPIDLimit to 16,384, which is the maximum value for PID limits. For information on how to increase PID limits, see the Red Hat Customer Portal.

Runtime

Large ISX imports produce gateway timeout and compilation errors

Importing a large ISX file can cause gateway timeout errors, and importing one with many Build stages can cause compilation errors due to lack of resources.

Workaround: Modify the configurations for DataStage and px-runtime to allocate more resources. See the following recommended custom configurations.

For more information, see Customizing resources.

Edit the DataStage custom resource to modify the migration resources:

oc edit datastage datastage -n ${PROJECT_CPD_INST_OPERANDS}
Inside the spec section of the YAML, add the following content:


spec:
  custom:
    resources:
      components: 
        migration:
          replicas: 1
          limits:
            cpu: "6"
            memory: 20Gi
          requests:
            cpu: "2"
            memory: 6Gi
Update PXRuntime resources:
# retrieve pxruntime cr
oc -n ${PROJECT_CPD_INST_OPERANDS} get pxruntime
Edit the custom resource:

oc -n ${PROJECT_CPD_INST_OPERANDS} edit pxruntime <cr-name>
Inside the spec section of the YAML, add the following content:


spec:
  custom:
    resources:
      components:
        pxruntime:
          limits:
            cpu: "4"
            memory: 8Gi
          requests:
            cpu: "2"
            memory: 2Gi
Out of memory issue for the DataStage operator

When more than 5 PX runtime instances are deployed on the cluster, the operator may run out of memory. To resolve this issue, update the CSV to increase the memory limits:

Retrieve the DataStage CSV:

oc -n ${PROJECT_CPD_INST_OPERATORS} get csv | grep datastage
Patch the DataStage CSV to increase the operator memory from 1Gi to 2Gi:



oc -n ${PROJECT_CPD_INST_OPERATORS} patch csv <DataStage-CSV-name> --type='json' -p='[{"op": "replace", "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/memory", "value": "2Gi" }]'
Jobs queue when compute pods fail to start

If the runtime instance compute pods do not start, all jobs run on the px-runtime pod, and resource limits cause jobs to be queued.

Workaround: Fix any issues that are preventing the compute pods from starting.

Message handler changes are not applied to a job run

The message handler is cached for 3 minutes after it is fetched. Immediate changes to the message handler will not affect the job run.

Workaround: Wait a few minutes to run the job.

Job queuing

When you submit a job, it might not start immediately. Instead, it enters a queued state. This is how the system manages resources and prioritizes work.

Why is your job run queued?
  • Resource limitations: Your job waits for the necessary resources (CPU, GPU, memory) to become available. This happens when other jobs use all available capacity or when your job requires more resources than are currently free.
  • Concurrency limits: Some systems enforce limits on the number of jobs that can run simultaneously. Your job queues until other jobs finish.
  • Priority and scheduling: Lower-priority jobs may be queued while higher-priority jobs run first.
  • Maintenance or downtime: The system might be undergoing maintenance or updates, which reduces available capacity or pauses job execution.
  • Queue backlog: High traffic or a large number of submitted jobs can create a backlog. Jobs start when the earlier jobs are completed.

Resolution: The DataStage Workload Manager (WLM) ships with a default configuration that allows 5 jobs to run simultaneously. If you have more resources, scale pxruntime to a larger instance. For more information, see Customizing hardware configurations for DataStage service instances with the command line. After you modify the scale configuration, it is important to update the WLM configuration file so that more jobs can run simultaneously. If you need to run more concurrent jobs, adjust the RunJob settings in the XML file.
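For example, one way to move the default runtime instance to a larger size is to patch the scaleConfig parameter that is shown in the PXRuntime custom resource later in this topic; the instance name and the medium value are examples, so check the linked documentation for the supported sizes:

oc -n ${PROJECT_CPD_INST_OPERANDS} patch pxruntime ds-px-default --type merge -p '{"spec":{"parameters":{"scaleConfig":"medium"}}}'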

Restarting container pods
If you notice that compute pods have different ages, the younger containers were probably restarted recently.

ds-px-default-ibm-datastage-px-compute-0                          1/1     Running     0               169m
ds-px-default-ibm-datastage-px-compute-1                          1/1     Running     0               5h51m
ds-px-default-ibm-datastage-px-compute-2                          1/1     Running     0               2d7h
A container restart can cause some job failures. For example, you might notice the following error messages in the job log:

##I IIS-DSEE-TLCP-00013 2024-06-19 01:53:18(000) <main_program> SSLConnection(16252672:411,248): Error from SSL read bytesRead=0 ssl_error=5 ssl_error_string= errno=0 errno_string=Success
##E IIS-DSEE-TFPM-00330 2024-06-19 01:53:18(001) <main_program> The Section Leader on node node4 has terminated unexpectedly.
##I IIS-DSEE-TLCP-00013 2024-06-19 01:53:18(002) <main_program> SSLConnection(15598048:411,248): Error from SSL read bytesRead=-1 ssl_error=5 ssl_error_string= errno=104 errno_string=Connection reset by peer
##E IIS-DSEE-TFPM-00330 2024-06-19 01:53:18(003) <main_program> The Section Leader on node node2 has terminated unexpectedly.
##I IIS-DSEE-TLCP-00013 2024-06-19 01:53:18(004) <main_program> SSLConnection(16202192:411,248): Error from SSL read bytesRead=-1 ssl_error=5 ssl_error_string= errno=104 errno_string=Connection reset by peer
##E IIS-DSEE-TFPM-00330 2024-06-19 01:53:18(005) <main_program> The Section Leader on node node3 has terminated unexpectedly.
##W IIS-DSEE-TFPM-00647 2024-06-19 01:53:23(000) <main_program> APT_PMwaitForSectionLeadersCleanup: non-zero status 2 from APT_PMpollUntilZero. If this message persists try increasing the poll timeout seconds by setting APT_PM_CLEANUP_TIMEOUT.
##I IIS-DSEE-TFSR-00115 2024-06-19 01:53:28(000) <main_program> Starting job postRun
Containers restart for multiple reasons, for example:
  • Too many jobs are running concurrently. The container has no CPU cycles to respond to the OpenShift health probe. You can detect this by running oc adm top pods | grep -i datastage and checking the CPU usage. If the CPU usage is very high, add more resources or lower the concurrency in wlm.config.xml.
  • Running a large job exhausts all available memory, which causes the container to be OOMKilled. You can detect this by analyzing the output of oc get events | grep -i OOM.
  • Ephemeral storage is used up. By default, the pxruntime CR limits temporary storage for compute pods to 10 GB. See the following example:
[root@solo1 761]# oc get pxruntime -o yaml
apiVersion: v1
items:
- apiVersion: ds.cpd.ibm.com/v1
  kind: PXRuntime
  metadata:
    annotations:
      ansible.sdk.operatorframework.io/verbosity: "3"
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"ds.cpd.ibm.com/v1","kind":"PXRuntime","metadata":{"annotations":{"ansible.sdk.operatorframework.io/verbosity":"3"},"name":"ds-px-default","namespace":"ds"},"spec":{"description":"The default DataStage runtime instance","ephemeralStorageLimit":"10Gi","license":{"accept":true},"parameters":{"scaleConfig":"small","storageClass":"nfs-client"},"version":"5.0.0","zenCloudPakInstanceId":"dda5c8d1-9e57-472b-90b9-ce27adbe28ea","zenControlPlaneNamespace":"ds","zenServiceInstanceId":1715711527523198,"zenServiceInstanceNamespace":"ds","zenServiceInstanceOwnerUID":1000331001}}
    creationTimestamp: "2024-05-14T18:32:08Z"
    finalizers:
    - ds.cpd.ibm.com/finalizer
    generation: 50
    name: ds-px-default
    namespace: ds
    resourceVersion: "38308991"
    uid: e9d537c6-5d6b-459a-8ea4-dc7fe4c9b5d5
  spec:
    additional_storage:
    - mount_path: /mnts/pipline
      pvc_name: volumes-pipline-pvc
    - mount_path: /mnts/user-ing-flows
      pvc_name: volumes-user-ing-flows-pvc
    description: The default DataStage runtime instance
    ephemeralStorageLimit: 10Gi

If you want to remove the ephemeral storage limit or change it to a higher value, run oc edit pxruntime or oc edit sts ds-px-default-ibm-datastage-px-compute. If you have other DataStage runtime instances, make the same modifications to those instances.

The migration service can face a similar problem. To resolve it, run oc edit deployment datastage-ibm-datastage-migration, then modify ephemeral-storage: 4Gi to allow more space.
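As an alternative to editing the PXRuntime CR interactively, you can patch the ephemeralStorageLimit field that is shown in the CR above; the instance name and the 20Gi value are examples:

oc -n ${PROJECT_CPD_INST_OPERANDS} patch pxruntime ds-px-default --type merge -p '{"spec":{"ephemeralStorageLimit":"20Gi"}}'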

PXRuntime instance that is deployed on a physical location does not appear

When you deploy a remote data plane on a cluster, a PXRuntime instance that was deployed on a physical location might not appear. This happens because the connection is closed before the request completes.

Workaround: Increase the timeout setting for the load balancer. For more information, see Changing load balancer timeout settings.

Jobs get stuck, nonexistent jobs display in Starting/Running state
When stale jobs are running, you might have problems deleting them from the UI. You can face a similar problem with projects, because the stale jobs might still be using resources (for example, memory). To clean up those processes, use one of the following commands:
cpdctl asset delete --asset-id ASSET-ID --purge-on-delete=true
cpdctl dsjob jobrunclean {{--project PROJECT | --project-id PROJID} | {--space SPACE | --space-id SPACEID}} {--name NAME | --id ID} [--run-id RUNID] [--dry-run] [--threads n] [--all-spaces] [--before YYYY-MM-DD:hh:mm:ss]
Note: Using those commands cleans up all active jobs in a project. Make sure to stop running new jobs.
Synchronization error across pods with mounted NFS path in PXRuntime

If you use NFS for storage, a synchronization issue can occur when one compute pod writes a file to an NFS-mounted path and another compute pod reads the file immediately. Although the file exists in the NFS path, read operations sometimes fail because of delayed visibility across pods.

Workaround: Add the actimeo=0 parameter to the persistent volume (PV) configuration.

cat <<EOF |oc apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: sample-nfs-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteMany
  mountOptions:
  - actimeo=0
  nfs:
    path: /data/sample-pv    # all worker nodes on the cluster should be allowed 
    server: <NFS server IP>  # to mount the path specified on the NFS server
  persistentVolumeReclaimPolicy: Retain
EOF 

For more information, see Setting up an NFS mount.

ds-metrics

Running ds-metrics on FIPS clusters for 5.0.3 version fails
When a job is run for the first time, the database initialization on a FIPS cluster fails. ds-metrics displays the following error in the logs:
java.lang.NoClassDefFoundError: liquibase.snapshot.SnapshotIdService (initialization failure)
liquibase.exception.UnexpectedLiquibaseException: java.security.NoSuchAlgorithmException: Error constructing implementation (algorithm: MD5, provider: SUN,
class: sun.security.provider.NativeMD5)
java.security.NoSuchAlgorithmException: Error constructing implementation (algorithm: MD5, provider: SUN, class: sun.security.provider.NativeMD5)
java.security.ProviderException: Error in Native Digest
This error causes the database to fill up with connections and locks the user out. If you try to log in to the database again, the following error message displays:
FATAL: remaining connection slots are reserved for non-replication superuser connections
Workaround:
  1. Open the project where you use your database credentials and remove the credentials.
  2. Disable the Enable metrics option and save your changes.
  3. Restart ds-metrics by deleting its pods.
This causes the large number of connections to disappear from the database and makes it accessible again.
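For step 3, you can find and delete the pods with commands like the following; the grep pattern assumes that the pod names contain ds-metrics:

oc -n ${PROJECT_CPD_INST_OPERANDS} get pods | grep ds-metrics
oc -n ${PROJECT_CPD_INST_OPERANDS} delete pod <ds-metrics-pod-name>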
Cannot enable ds-metrics on FIPS-tolerant or FIPS-enabled clusters for 5.1.0 version
ds-metrics displays the following error in the logs:
java.lang.NoClassDefFoundError: liquibase.snapshot.SnapshotIdService (initialization failure)
liquibase.exception.UnexpectedLiquibaseException: java.security.NoSuchAlgorithmException: Error constructing implementation (algorithm: MD5, provider: SUN,
class: sun.security.provider.NativeMD5)
java.security.NoSuchAlgorithmException: Error constructing implementation (algorithm: MD5, provider: SUN, class: sun.security.provider.NativeMD5)
java.security.ProviderException: Error in Native Digest

Workaround: Initialize your database and configure ds-metrics through environment variables. For more information on enabling ds-metrics, see Storing and persisting metrics.

Job_run_log stores only non-INFO level log messages
Workaround: Set the environment variable METRICS_SEND_FULL_LOG to true in the px-runtime instance. If the variable is set to true, the full job run log is sent to metrics and stored in the job_run_log table. To set the environment variable, see Storing and persisting metrics.
Metrics database cannot be cleared out fully after deleting the ds-metrics schema
If you want to clear out the database that was initialized by ds-metrics, you need to delete the databasechangelog and databasechangeloglock tables in the default public schema.
Workaround: Use the following command to clear out the metrics database:
drop schema if exists ds_metrics cascade;
drop table if exists public.databasechangelog;
drop table if exists public.databasechangeloglock;
ds-metrics displays short error messages in the service logs
You might see error messages at the start of a job run, for example:
org.hibernate.engine.jdbc.spi.SqlExceptionHelper E logExceptions Batch entry 0 /* insert for
com.ibm.ds.metrics.api.models.JobRun */insert into ds_metrics.job_run (conductor_pid,config_file,controller_id,create_time,duration,instance_id,job_id,last_update_time,partition,queue_name,run_name,run_status,start_time,stop_time,user_status,run_id) values ((NULL),(NULL),(NULL),('2024-10-02 17:54:56.52134+00'),(NULL),('ds-px-default'),('324cac2e-8eac-4452-af75-81ee357fcb82'),('2024-10-02
17:54:56.521442+00'),('2'::int4),('Medium'),('name'),(NULL),(NULL),(NULL),(NULL),('00b84b48-47ea-4a6e-9343-1ed68afb7179')) was aborted: ERROR: duplicate key
value violates unique constraint "job_run_pkey" Detail: Key (run_id)=(00b84b48-47ea-4a6e-9343-1ed68afb7179) already exists. Call getNextException to see other errors in the batch.

Workaround: No action is needed. ds-metrics handles these errors, and they can be ignored.