Troubleshoot IBM DataStage
Use these solutions to help resolve problems that you might encounter with IBM® DataStage®.
Getting help and support for DataStage
If you have problems or questions when you use DataStage, you can get help by searching for information or by asking questions through a forum. You can also open a support ticket.
When you ask a question on the forums, tag your question so that it is seen by the DataStage development teams.
For questions about the service and getting started instructions, use the forum at https://stackoverflow.com/questions/tagged/datastage.
- General
-
- Jobs fail with "Resource temporarily unavailable" error
- Jobs fail because SQL and Before SQL statements run in incorrect order
- The mailx command fails to run in before-job and after-job subroutines without SMTP server info
- Properties selections not preserved if you deselect "Use DataStage properties"
- Routine fails when CEL function ds.getUserStatus is run on an external flow
- Job fails when loading a large Excel file
- Cannot add DataStage component after system power outage
- Compilation of DataStage flow fails if the Java classpath is parameterized
- Exported flows generate JSON connection files that contain plaintext passwords
- Issues browsing database tables with columns that contain special characters
- Incorrect inferences assigned to a schema read by the Asset Browser
- Using sequential files as a source
- Error running jobs with a parquet file format
- Migration pod getting evicted for exceeding its ephemeral storage limits
- Error occurs during the upgrade of the Cloud Pak for Data from 5.0.0 to 5.0.1 version
- Compilation of a flow that contains the transformer stage is timed out
- Ephemeral local storage usage exceeds the total limit of a container
- Connectors
-
- Netezza® connector: Duplicate records occur when partitioned reads are enabled
- MySQL connector: Jobs might fail if you use Write mode "Update" for the target without a primary key
- FTP connector: The home directory path is prepended to the path
- Jobs fail with error "The connector could not establish a connection to Db2® database"
- Job with source data from a SAP OData connector fails
- Cannot run transactional SQL on data from Apache Hive version 2.0 or earlier
- IBM Db2 for DataStage connection with SSL certificate fails with "Protocol specific error code 414" error
- Error parameterizing the credential field for a flow connection in IBM Cloud® Object Storage
- PostgreSQL connector times out on large tables
- Jobs with an Apache Hive connection that has ZooKeeper discovery enabled fail
- Schema changes that originate in data from the HTTP connector can cause the job to fail
- Cannot preview data from the generic JDBC connector
- Not able to create successful database connection that uses the SSL certificate
- Snowflake connector: Jobs are failing with "java/lang/OutOfMemoryError" error
- Snowflake connector: Jobs are failing with "Fork failed: Resource temporarily unavailable" error
- Runtime
-
- Large ISX imports produce gateway timeout and compilation errors
- Out of memory issue for the DataStage operator
- Jobs queue when compute pods fail to start
- Message handler changes are not applied to a job run
- Job queuing
- Restarting container pods
- PXRuntime instance that is deployed on a physical location does not appear
- Jobs get stuck, nonexistent jobs display in Starting or Running state
- Synchronization error across pods with mounted NFS path in PXRuntime
- ds-metrics
-
- Running ds-metrics on FIPS clusters for the 5.0.3 version fails
- Cannot enable ds-metrics on FIPS-tolerant or FIPS-enabled clusters for the 5.1.0 version
- Job_run_log stores only non-INFO level log messages
- Metrics database cannot be cleared out fully after deleting the ds_metrics schema
- ds-metrics displays short error messages in the service logs
General
- Jobs fail with "Resource temporarily unavailable" error
-
Jobs may fail when the number of processes running on a pod exceeds the process IDs limit.
Workaround: Increase the process IDs limit. For more information, see Changing the process IDs limit.
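The linked topic has the exact procedure. As a rough sketch, on OpenShift the per-pod process IDs limit is typically raised with a KubeletConfig resource; the 16384 value and the worker-pool selector label below are assumptions, so match them to your cluster:

```shell
# Sketch: generate a KubeletConfig manifest that raises the per-pod process IDs limit.
# The podPidsLimit value (16384) and the pool selector label are assumptions; follow
# the "Changing the process IDs limit" topic for the settings for your cluster.
cat > pid-limit.yaml <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: increase-pid-limit
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    podPidsLimit: 16384
EOF
# Review the manifest, then apply it with: oc apply -f pid-limit.yaml
echo "pid-limit.yaml written"
```

Applying a KubeletConfig triggers a rolling reboot of the affected nodes, so schedule the change accordingly.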
- Jobs fail because SQL and Before SQL statements run in incorrect order
-
On the Teradata connector set to ANSI transaction mode, the Before SQL statement may run after the SQL statement instead of before, causing the job to fail.
Workaround: Add a commit statement after each Before SQL statement.
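For illustration, a Before SQL block with an explicit commit after each statement might look like the following; the table names are made up:

```shell
# Hypothetical Before SQL statements for the Teradata connector in ANSI transaction
# mode: each statement is followed by an explicit COMMIT (table names are made up).
cat > before-sql.txt <<'EOF'
DELETE FROM staging_orders;
COMMIT;
UPDATE run_control SET last_run = CURRENT_DATE;
COMMIT;
EOF
grep -c 'COMMIT;' before-sql.txt
```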
- The mailx command fails to run in before-job and after-job subroutines without SMTP server info
-
If the mailx command is used in a before-job or after-job subroutine, you must provide the SMTP server information. Otherwise, mailx forwards execution to sendmail and fails.
- Properties selections not preserved if you deselect "Use DataStage properties"
-
If you enter other properties (for example for tables or schemas) with the default Use DataStage properties option selected, and then deselect Use DataStage properties, the properties are not preserved.
Workaround: If you do not intend to use the DataStage properties, deselect the default Use DataStage properties option before you enter other properties. Otherwise, reselect the properties after you deselect it.
- Routine fails when CEL function ds.getUserStatus is run on an external flow
-
When the built-in CEL function ds.getUserStatus is run on a target that is not within the same pipeline, it fails and cannot retrieve the user status. Use the dsjob CLI in your Run Bash script node instead. For an example of how to rewrite this, see the dsjob command used by DSGetUserStatus() in Routine replacement examples in DataStage.
- Job fails when loading a large Excel file
-
A job with a connector that is processing a large Excel file might fail with this error:
"CDICO9999E: Internal error occurred: IO error: The Excel file is too large. (error code: DATA_IO_ERROR)"
Try increasing the heap size. The Heap size properties option is in the Other properties section of the connector's Stage tab.
- Cannot add DataStage component after system power outage
-
In some cases of a power outage of the underlying system for Cloud Pak for Data, after power is restored you might get errors or other issues. These issues might include problems when you try to add a DataStage component to a project, for example. You might also encounter other problems like pages not loading or flows not compiling.
To resolve the issues, you can try restarting the DataStage pods:
oc -n ${PROJECT_CPD_INST_OPERANDS} delete pod -l app.kubernetes.io/name=datastage
watch "oc get pods -l app.kubernetes.io/name=datastage | grep -v 1/1 | grep -v 2/2 | grep -v Completed"
- Compilation of DataStage flow fails if the Java classpath is parameterized
-
If you use the Java Integration stage and you parameterize the Java classpath, you must specify the fully qualified JAR file location in the classpath for each JAR file that is used by the stage. See the following example:
PARAM_NAME=/ds-storage/projects/<project_id>/java/<JavaLibraryName>/<classLibraryName>
- Exported flows generate JSON connection files that contain plaintext passwords
-
Downloaded flows might include connection assets that have credentials or other sensitive information. You can run the following command to change the export behavior so that all future exports remove credentials by default.
oc -n ${PROJECT_CPD_INST_OPERANDS} patch datastage datastage --patch '{"spec":{"migration_export_remove_secrets":true}}' --type=merge
- Issues browsing database tables with columns that contain special characters
-
You might have issues when you use the Asset Browser to browse database tables if the selected table contains a column with special characters such as ., $, or #, and you add that table into a DataStage flow. DataStage does not support column names that contain special characters. DataStage flows that reference columns with names that include these special characters will not work.
To work around this problem, create a view over the database table and redefine the column name in the view. For example:
create view view1 as select column1$ as column1, column2# as column2 ... from table
Then, when you use the Asset Browser, find the view and add it to the DataStage flow.
- Incorrect inferences assigned to a schema read by the Asset Browser
-
The Asset Browser will read the first 1000 records and infer the schema, such as column name, length, data type, and nullable, based on these first 1000 records in the files in IBM Cloud Object Storage, Amazon S3, Google Cloud Storage, Azure File Storage, Azure Blob Storage, or the Azure Data Lake service. For instance, the Asset Browser might identify a column as an integer based on what is detected in the first 1000 records, however, later records in the file might show that this column ought to be treated as varchar data type. Similarly, the Asset Browser might infer a column as varchar(20) even though later records show that the column ought to be varchar(100).
Workaround:
- Profile the source data to generate better metadata.
- Change all columns to be varchar(1024) and gradually narrow down the data type.
- Using sequential files as a source
-
To use sequential files as a source, make sure that the source file can be accessed from the mounted persistent volume.
- Error running jobs with a parquet file format
- You might receive the following error when you try to run a job with a parquet file format:
Error: CDICO9999E: Internal error occurred: Illegal state error: INTEGER(32,false) can only annotate INT32.
The unsigned 32-bit integer (uint32) and unsigned 64-bit integer (uint64) data types are not supported in the Parquet format that DataStage uses for all the file connectors.
Workaround: You must use supported data types.
- Migration pod getting evicted for exceeding its ephemeral storage limits
-
During import, pod usage of ephemeral local storage can exceed the total limit of containers. You might receive the following message:
Status: Failed Reason: Evicted Message: Pod ephemeral local storage usage exceeds the total limit of containers 900Mi.
Workaround: To avoid this problem, increase the ephemeral storage limit from the default of 900Mi to 4Gi by running the following command:
oc -n ${PROJECT_CPD_INST_OPERANDS} patch datastage datastage --type merge -p '{"spec": {"custom": {"resources":{"components":{"migration":{"limits":{"ephemeral":"4Gi"}}}}}}}'
- Error occurs during the upgrade of the Cloud Pak for Data from 5.0.0 to 5.0.1 version
-
You may face this error while upgrading Cloud Pak for Data from version 5.0.0 to 5.0.1: the upgrade fails in new upgrade tasks for the remote instances.
Workaround: When the DataStage CR is alternating between Failed and InProgress during the 5.0.1 upgrade, go through the following steps:- Log in to the Red Hat® OpenShift Container Platform cluster with oc and set the default project to where Cloud Pak for Data is installed:
oc project $PROJECT_CPD_INST_OPERANDS
- Verify that the PXRuntime instances have been successfully upgraded to version 5.0.1:
oc get pxruntime
- If the PXRuntime CR is not successfully upgraded to version 5.0.1, then run the following commands:
echo "Adding installedVersion to DataStage CR"
oc patch datastage datastage --type='json' -p='[{"op": "add", "path": "/spec/installedVersion", "value": "5.0.1" }]'
while true; do echo "Waiting for DataStage CR to be in Completed state"; sleep 30; if [ $(oc get datastage datastage -o=jsonpath="{.status.dsStatus}") = "Completed" ]; then break; fi; done
echo "Removing installedVersion from DataStage CR"
oc patch datastage datastage --type='json' -p='[{"op": "remove", "path": "/spec/installedVersion"}]'
while true; do echo "Waiting for DataStage CR to be in Completed state"; sleep 30; if [ $(oc get datastage datastage -o=jsonpath="{.status.dsStatus}") = "Completed" ]; then break; fi; done
- Flows that contain the Transformer stage time out during compilation
-
A timeout may occur while compiling flows that contain the Transformer stage.
Default value of the APT_COMPILEOPT environment variable:
-c -O -fPIC -Wno-deprecated -m64 -mtune=generic -mcmodel=small
Workaround: Disable compile-time optimization by changing -O to -O0 in the compile options in the APT_COMPILEOPT environment variable:
-c -O0 -fPIC -Wno-deprecated -m64 -mtune=generic -mcmodel=small
- Ephemeral local storage usage exceeds the total limit of a container
Connectors
- Netezza connector: Duplicate records occur when partitioned reads are enabled
-
When partitioned reads are enabled on the Netezza connector in parallel execution mode, duplicate records may occur. To avoid duplicate records, add partition placeholders into the SQL or set the execution mode to sequential. To add partition placeholders, add the string mod(datasliceid,[[node-count]])=[[node-number]], as in the following example:
SELECT * FROM table WHERE mod(datasliceid,[[node-count]])=[[node-number]]
- MySQL connector: Jobs might fail if you use Write mode "Update" for the target without a primary key
-
If you create a table in a MySQL database without specifying a primary key in the WHERE column, and then you try to run a job that uses that table with the Write mode Update for the target, the job might fail.
Solution: Specify a primary key name in the Key column names field. If the table is big and does not have a primary column, you can create a separate column with auto-increment values to use as the primary key.
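A surrogate key of this kind can be added with a single ALTER TABLE statement; the table and column names below are made up for illustration:

```shell
# Hypothetical example: add an auto-increment column to a keyless MySQL table so
# that it can serve as the key for Write mode "Update". Names are made up.
cat > add-surrogate-key.sql <<'EOF'
ALTER TABLE orders
  ADD COLUMN row_id BIGINT NOT NULL AUTO_INCREMENT,
  ADD PRIMARY KEY (row_id);
EOF
# Run it against your database, for example: mysql mydb < add-surrogate-key.sql
echo "add-surrogate-key.sql written"
```

Then enter row_id (or your equivalent column) in the Key column names field of the target stage.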
- FTP connector: The home directory path is prepended to the path
-
When you run a job that uses data from an FTP data source, the home or login directory is prepended to the path that you specified. This happens regardless of whether you specify an absolute path (with a leading forward slash) or a relative path (without a leading forward slash). For example, if you specify the directory as /tmp/SampleData.txt, the path resolves to /home/username/tmp/SampleData.txt.
Workaround: Edit the File name in the FTP connector. Specify the absolute path to the source or target file.
- Jobs fail with error "The connector could not establish a connection to Db2 database"
-
Jobs may fail with the error "The connector could not establish a connection to Db2 database".
Workaround: Go to the connection properties and set the Options property to connectTimeout=0.
- Job with source data from a SAP OData connector fails
-
If your flow includes source data from SAP OData, the flow might fail if you created the flow by manually adding columns that do not follow the SAP naming convention.
Workaround: Update the flow by adding the columns with the Asset browser or by renaming the columns according to the SAP naming convention. The SAP naming convention follows the SAP object hierarchy with two underscore characters (__) as a separator. For example, if the PurchaseOrder column belongs to PurchaseOrderNote, then the column name should be specified as PurchaseOrderNote__PurchaseOrder.
- Cannot run transactional SQL on data from Apache Hive version 2.0 or earlier
-
If your data is from Apache Hive version 2.0 or earlier and your DataStage flow executes UPDATE or DELETE statements, the job might fail. Make sure that the target table has been created according to Hive transactions requirements and that the Apache Hive server is configured to support ACID operations.
The minimum set of parameters (configured in the hive-site.xml file) that you must enable for ACID tables in Apache Hive is:
hive.support.concurrency = true
hive.enforce.bucketing = true (not required as of Hive 2.0)
hive.exec.dynamic.partition.mode = nonstrict
hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.compactor.initiator.on = true
hive.compactor.worker.threads = <a positive number>
For more information, refer to Hive Transactions.
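The same parameters can be expressed as hive-site.xml entries, as sketched below; the worker thread count of 1 is an example value:

```shell
# Sketch: write the ACID-related parameters above as hive-site.xml property entries.
# The values mirror the list in this topic; hive.compactor.worker.threads=1 is an
# example for "<a positive number>".
cat > hive-acid-site.xml <<'EOF'
<configuration>
  <property><name>hive.support.concurrency</name><value>true</value></property>
  <property><name>hive.enforce.bucketing</name><value>true</value></property>
  <property><name>hive.exec.dynamic.partition.mode</name><value>nonstrict</value></property>
  <property><name>hive.txn.manager</name><value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value></property>
  <property><name>hive.compactor.initiator.on</name><value>true</value></property>
  <property><name>hive.compactor.worker.threads</name><value>1</value></property>
</configuration>
EOF
echo "hive-acid-site.xml written"
```

Merge these entries into your existing hive-site.xml rather than replacing the whole file.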
- IBM Db2 for DataStage connection with SSL certificate fails with "Protocol specific error code 414" error
-
If you use an SSL certificate in the IBM Db2 for DataStage connection and the connection fails with a "Protocol specific error code 414" error, use this workaround:
- Identify the root certificate on the Db2 server. You can use this command to view the certificate chain:
openssl s_client -connect <hostname>:<port> -showcerts
- Ensure that the certificate has the same subject and issuer.
- In the Create connection: IBM Db2 for DataStage page, enter the root certificate in the SSL certificate (arm) field.
- Error parameterizing the credential field for a flow connection in IBM Cloud Object Storage
-
When the Authentication method property is set to Service credentials (full JSON snippet), do not parameterize the Service credentials field. If a parameter is provided for that field, the flow will not compile.
- PostgreSQL connector times out on large tables
-
The PostgreSQL connector might fail with a timeout error when a large table (100,000+ rows) is used as a source. To fix this error, try setting a higher timeout value for the APT_RECORD_TIMEOUT environment variable. See Managing environment variables in DataStage.
- Jobs with an Apache Hive connection that has ZooKeeper discovery enabled fail
-
If your DataStage flow includes data from an Apache Hive connection and you have selected Use ZooKeeper discovery for the connection, the flow might fail because it has too many warnings.
Workaround: Increase the number of allowed warnings in the DataStage flow settings, and then recompile the job.
- Schema changes that originate in data from the HTTP connector can cause the job to fail
-
When you use the HTTP connector to download a file and then upload the same file into IBM Cloud Object Storage or a database, if the file's schema changes over time, the job might fail.
Workaround: Re-create the stage.
- Cannot preview data from the generic JDBC connector
-
If your DataStage flow uses the generic JDBC connector as a target, you cannot preview the data for the following data sources in the Generic JDBC connection:
Vendors:
- Amazon Redshift
- Apache Hive
- MongoDB
- MySQL
Cloud Pak for Data connections:
- Amazon Redshift
- Oracle
Workaround for the data sources except Amazon Redshift (vendor or Cloud Pak for Data connection): In the target stage, select Enable quoted identifiers under the Stage properties.
- Not able to create successful database connection that uses the SSL certificate
- Snowflake connector: Jobs are failing with "java/lang/OutOfMemoryError" error
- Snowflake connector: Jobs are failing with "Fork failed: Resource temporarily unavailable" error
Runtime
- Large ISX imports produce gateway timeout and compilation errors
-
Importing a large ISX file can cause gateway timeout errors, and importing one with many Build stages can cause compilation errors due to lack of resources.
Workaround: Modify the configurations for DataStage and px-runtime to allocate more resources. See the following recommended custom configurations. For more information, see Customizing resources.
Edit the DataStage custom resource to modify the migration resources:
oc edit datastage datastage -n ${PROJECT_CPD_INST_OPERANDS}
Inside the spec: section of the YAML, add the following content:
spec:
  custom:
    resources:
      components:
        migration:
          replicas: 1
          limits:
            cpu: "6"
            memory: 20Gi
          requests:
            cpu: "2"
            memory: 6Gi
Update the PXRuntime resources. Retrieve the PXRuntime CR:
oc -n ${PROJECT_CPD_INST_OPERANDS} get pxruntime
Edit the custom resource:
oc -n ${PROJECT_CPD_INST_OPERANDS} edit pxruntime <cr-name>
Inside the spec: section of the YAML, add the following content:
spec:
  custom:
    resources:
      components:
        pxruntime:
          limits:
            cpu: "4"
            memory: 8Gi
          requests:
            cpu: "2"
            memory: 2Gi
- Out of memory issue for the DataStage operator
-
When more than 5 PX runtime instances are deployed on the cluster, the operator may run out of memory. To resolve this issue, update the CSV to increase the memory limits:
Retrieve the DataStage CSV:
oc -n ${PROJECT_CPD_INST_OPERATORS} get csv | grep datastage
Patch the DataStage CSV to increase the operator memory from 1Gi to 2Gi:
oc -n ${PROJECT_CPD_INST_OPERATORS} patch csv <DataStage-CSV-name> --type='json' -p='[{"op": "replace", "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/memory", "value": "2Gi" }]'
- Jobs queue when compute pods fail to start
-
If the runtime instance compute pods don't start, all jobs will run on the px-runtime pod. Resource limits lead to jobs being queued.
Workaround: Fix any issues that are preventing the compute pods from starting.
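To identify what is blocking the compute pods, a quick inspection script along these lines can help; the label selector and the instance name ds-px-default follow the examples elsewhere on this page and may differ on your cluster:

```shell
# Sketch: write a small diagnostic script for compute pods that are not starting.
# The label selector and the "ds-px-default" instance name are examples; adjust
# them to match your runtime instance.
cat > inspect-compute-pods.sh <<'EOF'
#!/bin/sh
# List the DataStage pods and their states
oc -n "${PROJECT_CPD_INST_OPERANDS}" get pods -l app.kubernetes.io/name=datastage
# Show recent details for a compute pod (scheduling, image pull, OOM events, and so on)
oc -n "${PROJECT_CPD_INST_OPERANDS}" describe pod ds-px-default-ibm-datastage-px-compute-0 | tail -n 25
EOF
chmod +x inspect-compute-pods.sh
echo "inspect-compute-pods.sh ready"
```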
- Message handler changes are not applied to a job run
-
The message handler is cached for 3 minutes after it is fetched. Immediate changes to the message handler will not affect the job run.
Workaround: Wait a few minutes to run the job.
- Job queuing
-
When you submit a job, it may not start immediately. Instead, it enters a queued state. This is how the system manages resources and prioritizes work.
Why is your job run queued?
- Resource limitations: Your job waits for the necessary resources (CPU, GPU, memory) to become available. This happens when other jobs use all available capacity, or when your job requires more resources than are currently free.
- Concurrency limits: Some systems enforce limits on the number of jobs that can run simultaneously. Your job queues until other jobs finish.
- Priority and scheduling: Lower-priority jobs may be queued while higher-priority jobs run first.
- Maintenance or downtime: The system can be undergoing maintenance or updates, which reduces available capacity or pauses job execution.
- Queue backlog: High traffic or large numbers of submitted jobs can create a backlog. Jobs start when the earlier jobs are completed.
Resolution: The DataStage Workload Manager (WLM) ships with a default configuration that allows running 5 jobs simultaneously. If you have more resources, you should scale pxruntime to a larger instance. For more information, see Customizing hardware configurations for DataStage service instances with the command line. After modifying the scale configuration, it is important to update the WLM configuration file to enable the execution of more jobs simultaneously. If you need to run more concurrent jobs, adjust the RunJob settings in the XML file.
- Restarting container pods
-
If you notice that compute pods have different ages, the younger containers were recently restarted.
ds-px-default-ibm-datastage-px-compute-0 1/1 Running 0 169m
ds-px-default-ibm-datastage-px-compute-1 1/1 Running 0 5h51m
ds-px-default-ibm-datastage-px-compute-2 1/1 Running 0 2d7h
Container restarts can cause some job failures. For example, you might notice the following error messages in the job log:
##I IIS-DSEE-TLCP-00013 2024-06-19 01:53:18(000) <main_program> SSLConnection(16252672:411,248): Error from SSL read bytesRead=0 ssl_error=5 ssl_error_string= errno=0 errno_string=Success
##E IIS-DSEE-TFPM-00330 2024-06-19 01:53:18(001) <main_program> The Section Leader on node node4 has terminated unexpectedly.
##I IIS-DSEE-TLCP-00013 2024-06-19 01:53:18(002) <main_program> SSLConnection(15598048:411,248): Error from SSL read bytesRead=-1 ssl_error=5 ssl_error_string= errno=104 errno_string=Connection reset by peer
##E IIS-DSEE-TFPM-00330 2024-06-19 01:53:18(003) <main_program> The Section Leader on node node2 has terminated unexpectedly.
##I IIS-DSEE-TLCP-00013 2024-06-19 01:53:18(004) <main_program> SSLConnection(16202192:411,248): Error from SSL read bytesRead=-1 ssl_error=5 ssl_error_string= errno=104 errno_string=Connection reset by peer
##E IIS-DSEE-TFPM-00330 2024-06-19 01:53:18(005) <main_program> The Section Leader on node node3 has terminated unexpectedly.
##W IIS-DSEE-TFPM-00647 2024-06-19 01:53:23(000) <main_program> APT_PMwaitForSectionLeadersCleanup: non-zero status 2 from APT_PMpollUntilZero. If this message persists try increasing the poll timeout seconds by setting APT_PM_CLEANUP_TIMEOUT.
##I IIS-DSEE-TFSR-00115 2024-06-19 01:53:28(000) <main_program> Starting job postRun
Containers restart for multiple reasons, for example:- Too many jobs are running concurrently. The container has no CPU cycles to respond to the OpenShift health probe. You can detect this by running oc adm top pods | grep -i datastage and checking the CPU usage. If the CPU usage is very high, then you need to add more resources or lower the concurrency in wlm.config.xml.
- Running a large job exhausts all available memory. This causes the container to be OOMKilled. You can detect this by analyzing output from oc get events | grep -i OOM.
- Ephemeral storage is used up. By default, the PXRuntime CR limits temporary storage on compute pods to 10GB. See:
[root@solo1 761]# oc get pxruntime -o yaml
apiVersion: v1
items:
- apiVersion: ds.cpd.ibm.com/v1
  kind: PXRuntime
  metadata:
    annotations:
      ansible.sdk.operatorframework.io/verbosity: "3"
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"ds.cpd.ibm.com/v1","kind":"PXRuntime","metadata":{"annotations":{"ansible.sdk.operatorframework.io/verbosity":"3"},"name":"ds-px-default","namespace":"ds"},"spec":{"description":"The default DataStage runtime instance","ephemeralStorageLimit":"10Gi","license":{"accept":true},"parameters":{"scaleConfig":"small","storageClass":"nfs-client"},"version":"5.0.0","zenCloudPakInstanceId":"dda5c8d1-9e57-472b-90b9-ce27adbe28ea","zenControlPlaneNamespace":"ds","zenServiceInstanceId":1715711527523198,"zenServiceInstanceNamespace":"ds","zenServiceInstanceOwnerUID":1000331001}}
    creationTimestamp: "2024-05-14T18:32:08Z"
    finalizers:
    - ds.cpd.ibm.com/finalizer
    generation: 50
    name: ds-px-default
    namespace: ds
    resourceVersion: "38308991"
    uid: e9d537c6-5d6b-459a-8ea4-dc7fe4c9b5d5
  spec:
    additional_storage:
    - mount_path: /mnts/pipline
      pvc_name: volumes-pipline-pvc
    - mount_path: /mnts/user-ing-flows
      pvc_name: volumes-user-ing-flows-pvc
    description: The default DataStage runtime instance
    ephemeralStorageLimit: 10Gi
If you want to remove the ephemeral storage limit or change it to a higher value, run oc edit pxruntime or oc edit sts ds-px-default-ibm-datastage-px-compute. If you have other DataStage runtime instances, then you need to make the same modifications to those instances. The migration service can face a similar problem. To resolve it, run oc edit deployment datastage-ibm-datastage-migration, then modify ephemeral-storage: 4Gi to have more space.
to have more space. - Too many jobs are running concurrently. The container have no CPU cycle to respond on OpenShift
health probe. You can detect it by running
- PXRuntime instance that is deployed on a physical location does not appear
-
While deploying a remote data plane on the clusters, the PXRuntime instance that was deployed on a physical location might not appear. This happens because the connection is closed before the request completes.
Workaround: The timeout setting for the load balancer needs to be increased. For more information, see: Changing load balancer timeout settings.
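The linked topic has the full procedure. As a rough sketch, when the OpenShift router is the load balancer, the timeout can be raised per route with the standard haproxy.router.openshift.io/timeout annotation; the 360s value and the --all route scope below are assumptions:

```shell
# Sketch: script to raise the HAProxy router timeout on routes. The annotation name
# is the standard OpenShift router one; the 360s value and annotating all routes in
# the namespace are assumptions, so adjust for your environment.
cat > bump-route-timeout.sh <<'EOF'
#!/bin/sh
oc -n "${PROJECT_CPD_INST_OPERANDS}" annotate route --all --overwrite \
  haproxy.router.openshift.io/timeout=360s
EOF
chmod +x bump-route-timeout.sh
echo "bump-route-timeout.sh ready"
```

If an external load balancer sits in front of the cluster, its timeout must be raised there instead; see Changing load balancer timeout settings.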
- Jobs get stuck, nonexistent jobs display in Starting/Running state
-
When stale job runs exist, you can have a problem deleting them from the UI. You can face a similar problem with projects, because the stale runs can still be using resources (for example, memory). To clean up those processes, use one of the following commands:
cpdctl asset delete --asset-id ASSET-ID --purge-on-delete=true
cpdctl dsjob jobrunclean {{--project PROJECT | --project-id PROJID} | {--space SPACE | --space-id SPACEID}} {--name NAME | --id ID} [--run-id RUNID] [--dry-run] [--threads n] [--all-spaces] [--before YYYY-MM-DD:hh:mm:ss]
Note: These commands clean up all active jobs in a project. Make sure to stop running new jobs first.
- Synchronization error across pods with mounted NFS path in PXRuntime
ds-metrics
- Running ds-metrics on FIPS clusters for the 5.0.3 version fails
-
When a job is run for the first time, the database initialization on a FIPS cluster fails. ds-metrics displays the following error in the logs:
java.lang.NoClassDefFoundError: liquibase.snapshot.SnapshotIdService (initialization failure) liquibase.exception.UnexpectedLiquibaseException: java.security.NoSuchAlgorithmException: Error constructing implementation (algorithm: MD5, provider: SUN, class: sun.security.provider.NativeMD5) java.security.NoSuchAlgorithmException: Error constructing implementation (algorithm: MD5, provider: SUN, class: sun.security.provider.NativeMD5) java.security.ProviderException: Error in Native Digest
This error causes the database to fill up with connections and locks the user out. If you log in to the database again, the following error message displays:
FATAL: remaining connection slots are reserved for non-replication superuser connections
Workaround:- Open the project where you use your database credentials and remove the credentials.
- Disable the Enable metrics option and save your changes.
- Restart ds-metrics by deleting its pods.
- Cannot enable ds-metrics on FIPS-tolerant or FIPS-enabled clusters for the 5.1.0 version
-
ds-metrics displays the following error in the logs:
java.lang.NoClassDefFoundError: liquibase.snapshot.SnapshotIdService (initialization failure) liquibase.exception.UnexpectedLiquibaseException: java.security.NoSuchAlgorithmException: Error constructing implementation (algorithm: MD5, provider: SUN, class: sun.security.provider.NativeMD5) java.security.NoSuchAlgorithmException: Error constructing implementation (algorithm: MD5, provider: SUN, class: sun.security.provider.NativeMD5) java.security.ProviderException: Error in Native Digest
Workaround: Initialize your database and configure ds-metrics through environment variables. For more information on enabling ds-metrics, see Storing and persisting metrics.
- Job_run_log stores only non-INFO level log messages
-
Workaround: Set the environment variable METRICS_SEND_FULL_LOG to true in the px-runtime instance. If the variable is set to true, the full job run log is sent to metrics and stored in the job_run_log table. To set the environment variable, see Storing and persisting metrics.
- Metrics database cannot be cleared out fully after deleting the ds_metrics schema
-
If you want to clear out the database that was initialized by ds-metrics, you need to delete the databasechangelog and databasechangeloglock tables in the default public schema.
Workaround: Use the following commands to clear out the metrics database:
drop schema if exists ds_metrics cascade;
drop table if exists public.databasechangelog;
drop table if exists public.databasechangeloglock;
- ds-metrics displays short error messages in the service logs
-
You may face error messages at the start of a job run, for example:
org.hibernate.engine.jdbc.spi.SqlExceptionHelper E logExceptions Batch entry 0 /* insert for com.ibm.ds.metrics.api.models.JobRun */insert into ds_metrics.job_run (conductor_pid,config_file,controller_id,create_time,duration,instance_id,job_id,last_update_time,partition,queue_name,run_name,run_status,start_time,stop_time,user_status,run_id) values ((NULL),(NULL),(NULL),('2024-10-02 17:54:56.52134+00'),(NULL),('ds-px-default'),('324cac2e-8eac-4452-af75-81ee357fcb82'),('2024-10-02 17:54:56.521442+00'),('2'::int4),('Medium'),('name'),(NULL),(NULL),(NULL),(NULL),('00b84b48-47ea-4a6e-9343-1ed68afb7179')) was aborted: ERROR: duplicate key value violates unique constraint "job_run_pkey" Detail: Key (run_id)=(00b84b48-47ea-4a6e-9343-1ed68afb7179) already exists. Call getNextException to see other errors in the batch.
Workaround: ds-metrics handles these errors; they can be ignored.