Fix Readme
Abstract
This consolidated patch addresses multiple critical issues in IBM watsonx.data Spark Analytics Engine for CPD versions 5.2.1/2.2.1.
Content
Table of Contents
- Overview
- Issues Addressed
- Patch Images
- Prerequisites
- Backup Steps
- Air-Gapped Installation Steps
- Patch Application
- Validation
- Rollback Instructions
- Troubleshooting
- Support
- Additional Notes
- Summary of Changes
Overview
This consolidated patch addresses multiple critical issues in IBM watsonx.data Spark Analytics Engine for CPD versions 5.2.1/2.2.1.
Issues Addressed
Issue 1: Ephemeral Storage Configuration
Problem: When a Mira application is submitted, ephemeral storage is automatically set in resource requests/limits with incorrect values.
Solution: Ephemeral storage values are entirely removed from request/limit configurations.
Issue 2: Iceberg Merge Table Operations
Problem: Enabling a required table property on Iceberg tables during migration from Delta Lake caused MERGE INTO operations to fail due to an open-source bug.
Solution: The updated runtime images include a fix so that MERGE INTO operations on Iceberg tables complete successfully.
Issue 3: Authorization Token Expiry
Problem: The authorization token expires after 20 minutes, so any job or operation still running at that point fails.
Solution: Updated runtime images to handle token refresh for long-running operations.
Issue 4: Authorization Failures with an Idle Query Server
Problem: Running queries after the Query Server has been idle for approximately 15 minutes causes authorization failures (instance not found) for Spark queries.
Solution: Updated the Spark runtime images to include a fixed version of ACExtension, which contained a race condition.
Patch Images
This hotfix updates the following container images:
| Component Name (Key) | Image Name | Full Image Reference | Digest |
|---|---|---|---|
| spark-hb-control-plane | spark-hb-control-plane | cp.icr.io/cp/cpd/spark-hb-control-plane@sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100 | sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100 |
| spark-hb-helm-repo | spark-hb-helm-repo | cp.icr.io/cp/cpd/spark-hb-helm-repo@sha256:5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8 | sha256:5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8 |
| spark-hb-wxd-cpd-miniforge-runtimes-v34 | spark-hb-wxd-cpd-miniforge-runtimes | cp.icr.io/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68 | sha256:c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68 |
| spark-hb-wxd-cpd-miniforge-runtimes-v35 | spark-hb-wxd-cpd-miniforge-runtimes | cp.icr.io/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf | sha256:682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf |
| spark-hb-wxd-cpd-miniforge-runtimes-v40 | spark-hb-wxd-cpd-miniforge-runtimes | cp.icr.io/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065 | sha256:7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065 |
Prerequisites
Before applying this hotfix, ensure you have:
- OpenShift cluster admin access
- Access to the CPD instance namespace
- Backup of current configuration
- For air-gapped environments: skopeo tool installed
- For air-gapped environments: Valid auth.json credentials file
- Set environment variable:
```
export PROJECT_CPD_INSTANCE=<your-cpd-namespace>
```
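As an optional preflight check (not part of the official procedure), you can confirm that you are logged in and that the namespace is reachable before continuing:
```
# Optional preflight: confirm the namespace exists and your user can read
# pods in it (requires an active oc login session).
oc get ns "${PROJECT_CPD_INSTANCE}"
oc auth can-i get pods -n "${PROJECT_CPD_INSTANCE}"
```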
Backup Steps
Why backup? This allows you to quickly revert to the previous state if needed.
Create Backup of AnalyticsEngine Custom Resource
```
oc get analyticsengine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} -o yaml > analyticsengine_bkp.yaml
```
Verify Backup
```
ls -lh analyticsengine_bkp.yaml
head -20 analyticsengine_bkp.yaml
```
Expected Result: The file should contain valid Kubernetes YAML configuration for the AnalyticsEngine resource.
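If you want a machine-checkable test in addition to eyeballing the head of the file, a minimal sketch is:
```
# Optional: fail loudly if the backup did not capture an AnalyticsEngine resource.
grep -q "kind: AnalyticsEngine" analyticsengine_bkp.yaml \
  && echo "Backup OK" \
  || echo "Backup missing or incomplete - do not proceed"
```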
Air-Gapped Installation Steps
Note: Skip this section if you have direct internet connectivity to IBM Container Registry.
Step 1: Login to OpenShift
```
oc login --token=<your-token> --server=<your-server-url>
```
Step 2: Prepare Authentication Credentials
You need an auth.json file with credentials for both IBM Container Registry and your private registry.
Option A: Use existing auth.json from CASE download
```
export AUTH_FILE="${HOME}/.airgap/auth.json"
```
Option B: Create a new auth.json file
```
{
  "auths": {
    "cp.icr.io": {
      "email": "unused",
      "auth": "<base64 encoded id:apikey>"
    },
    "<private-registry-hostname>": {
      "email": "unused",
      "auth": "<base64 encoded id:password>"
    }
  }
}
```
How to encode credentials:
```
echo -n "iamapikey:<your-api-key>" | base64
```
Reference: containers-auth.json documentation
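Optionally, you can sanity-check the resulting file before using it. This sketch assumes jq is installed and that <path-to-auth-json> is replaced with your actual path:
```
# Confirm auth.json parses as JSON, then decode the cp.icr.io entry to verify
# it matches the expected id:apikey pair.
jq . "<path-to-auth-json>/auth.json"
jq -r '.auths["cp.icr.io"].auth' "<path-to-auth-json>/auth.json" | base64 -d
```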
Step 3: Install Skopeo
```
yum install skopeo -y
```
Or for RHEL 8+:
```
dnf install skopeo -y
```
Step 4: Identify Your Private Registry
Find the current image location:
```
oc describe pod -l function=spark-hb-control-plane -n ${PROJECT_CPD_INSTANCE} | grep -i "image:"
```
Step 5: Get Private Registry Details
```
oc get imageContentSourcePolicy
oc describe imageContentSourcePolicy cloud-pak-for-data-mirror
```
Look for output like:
```
- mirrors:
  - ${PRIVATE_REGISTRY_LOCATION}/cp/
  source: cp.icr.io/cp/cpd
```
Reference: Configuring cluster to pull CPD images
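If you prefer a non-interactive lookup, the mirror hosts can also be extracted directly; the field path below assumes the standard ImageContentSourcePolicy schema:
```
# Print only the mirror hosts configured for the cloud-pak-for-data-mirror policy.
oc get imagecontentsourcepolicy cloud-pak-for-data-mirror \
  -o jsonpath='{.spec.repositoryDigestMirrors[*].mirrors}'
```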
Step 6: Copy Hotfix Images to Private Registry
Important: When copying the commands below, paste them into a text editor first to ensure no extra newline characters are added after the backslashes (\). Extra characters will cause the command to fail.
Copy spark-hb-control-plane:
```
skopeo copy --all --authfile "<path-to-auth-json>/auth.json" \
  --dest-tls-verify=false --src-tls-verify=false \
  docker://cp.icr.io/cp/cpd/spark-hb-control-plane@sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100 \
  docker://<your-private-registry>/cp/cpd/spark-hb-control-plane@sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100
```
Copy spark-hb-helm-repo:
```
skopeo copy --all --authfile "<path-to-auth-json>/auth.json" \
  --dest-tls-verify=false --src-tls-verify=false \
  docker://cp.icr.io/cp/cpd/spark-hb-helm-repo@sha256:5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8 \
  docker://<your-private-registry>/cp/cpd/spark-hb-helm-repo@sha256:5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8
```
Copy spark-hb-wxd-cpd-miniforge-runtimes-v34:
```
skopeo copy --all --authfile "<path-to-auth-json>/auth.json" \
  --dest-tls-verify=false --src-tls-verify=false \
  docker://cp.icr.io/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68 \
  docker://<your-private-registry>/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68
```
Copy spark-hb-wxd-cpd-miniforge-runtimes-v35:
```
skopeo copy --all --authfile "<path-to-auth-json>/auth.json" \
  --dest-tls-verify=false --src-tls-verify=false \
  docker://cp.icr.io/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf \
  docker://<your-private-registry>/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf
```
Copy spark-hb-wxd-cpd-miniforge-runtimes-v40:
```
skopeo copy --all --authfile "<path-to-auth-json>/auth.json" \
  --dest-tls-verify=false --src-tls-verify=false \
  docker://cp.icr.io/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065 \
  docker://<your-private-registry>/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065
```
Replace:
- <path-to-auth-json> with your actual auth.json file path
- <your-private-registry> with your private registry hostname
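As a convenience alternative to running the five commands above one by one, a loop like the following copies every hotfix image in a single pass. This is a sketch, not part of the official procedure; AUTHFILE and REGISTRY are placeholders you must set:
```
AUTHFILE="<path-to-auth-json>/auth.json"
REGISTRY="<your-private-registry>"
# Image name@digest pairs taken from the Patch Images table above.
for IMG in \
  "spark-hb-control-plane@sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100" \
  "spark-hb-helm-repo@sha256:5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8" \
  "spark-hb-wxd-cpd-miniforge-runtimes@sha256:c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68" \
  "spark-hb-wxd-cpd-miniforge-runtimes@sha256:682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf" \
  "spark-hb-wxd-cpd-miniforge-runtimes@sha256:7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065"; do
  skopeo copy --all --authfile "${AUTHFILE}" \
    --dest-tls-verify=false --src-tls-verify=false \
    "docker://cp.icr.io/cp/cpd/${IMG}" \
    "docker://${REGISTRY}/cp/cpd/${IMG}"
done
```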
Patch Application
This section applies whether you pull images directly from the IBM Container Registry or from a private registry in an air-gapped environment.
Step 1: Set Environment Variable
```
export PROJECT_CPD_INSTANCE=<your-cpd-namespace>
```
Example:
```
export PROJECT_CPD_INSTANCE=cpd-instance
```
Step 2: Apply Comprehensive Patch to AnalyticsEngine
Important: Ensure the command is on a single line with no line breaks or extra spaces.
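Optionally, you can save the JSON payload from the command below to a file (for example, patch.json) and validate it with a server-side dry run before applying; --patch-file and --dry-run=server are standard kubectl/oc flags, but confirm your oc version supports them:
```
# Validates the patch against the API server without persisting any change.
oc patch analyticsengine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} \
  --type merge --patch-file patch.json --dry-run=server
```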
Complete Patch Command (includes all fixes):
```
oc patch analyticsengine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} --type merge --patch '{
"spec": {
"autoScaleConfig": true,
"scaleConfig": "large",
"controlPlaneResourceConfig": {
"requests": {
"cpu": "2",
"memory": "8Gi",
"ephemeralStorage": "1200Mi"
},
"limits": {
"cpu": "2",
"memory": "8Gi",
"ephemeralStorage": "1200Mi"
}
},
"hpa": {
"control_plane": {
"autoscaling": {
"min_replicas": 1,
"max_replicas": 3,
"medium_min_replicas": 3,
"medium_max_replicas": 7,
"large_min_replicas": 5,
"large_max_replicas": 11,
"target_cpu_utilization_percentage": 50
}
},
"deployer_agent": {
"autoscaling": {
"min_replicas": 1,
"max_replicas": 3,
"medium_min_replicas": 3,
"medium_max_replicas": 7,
"large_min_replicas": 5,
"large_max_replicas": 11,
"target_cpu_utilization_percentage": 50
}
},
"nginx": {
"autoscaling": {
"min_replicas": 1,
"max_replicas": 3,
"medium_min_replicas": 3,
"medium_max_replicas": 7,
"large_min_replicas": 5,
"large_max_replicas": 11,
"target_cpu_utilization_percentage": 50
}
},
"ui": {
"autoscaling": {
"min_replicas": 1,
"max_replicas": 3,
"medium_min_replicas": 3,
"medium_max_replicas": 7,
"large_min_replicas": 5,
"large_max_replicas": 9,
"target_cpu_utilization_percentage": 140
}
}
},
"image_digests": {
"spark-hb-control-plane": "sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100",
"spark-hb-helm-repo": "sha256:5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8",
"spark-hb-wxd-cpd-miniforge-runtimes-v34": "sha256:c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68",
"spark-hb-wxd-cpd-miniforge-runtimes-v35": "sha256:682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf",
"spark-hb-wxd-cpd-miniforge-runtimes-v40": "sha256:7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065"
}
}
}'
```
Expected Output:
```
analyticsengine.ae.cpd.ibm.com/analyticsengine-sample patched
```
Step 3: Monitor Reconciliation
Watch the AnalyticsEngine resource status:
```
oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} -w
```
What to look for:
- Status should transition to Completed
- This may take 5-10 minutes
Alternative monitoring:
```
watch -n 10 "oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE}"
```
Step 4: Verify Pod Updates
Check that the pods are running with the new images:
```
oc get pods -l function=spark-hb-control-plane -n ${PROJECT_CPD_INSTANCE}
```
Expected Result: All pods should be in Running state with recent restart times.
Validation
1. Check AnalyticsEngine Status
```
oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE}
```
Expected Output:
```
NAME                     STATUS      AGE
analyticsengine-sample   Completed   45d
```
2. Verify Image Digests
Verify control-plane image:
```
oc get pod -l function=spark-hb-control-plane -n ${PROJECT_CPD_INSTANCE} -o jsonpath='{.items[0].status.containerStatuses[?(@.name=="spark-hb-control-plane")].imageID}' | grep -q "33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100" && echo "Control plane patch applied successfully" || echo "Control plane patch validation failed"
```
Verify all image digests:
```
oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} -o yaml | grep -A 10 image_digests
```
Expected output should show:
```
image_digests:
  spark-hb-control-plane: sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100
  spark-hb-helm-repo: sha256:5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8
  spark-hb-wxd-cpd-miniforge-runtimes-v34: sha256:c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68
  spark-hb-wxd-cpd-miniforge-runtimes-v35: sha256:682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf
  spark-hb-wxd-cpd-miniforge-runtimes-v40: sha256:7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065
```
3. Verify Pods are Running
```
oc get pods -n ${PROJECT_CPD_INSTANCE} | grep spark-hb
```
Expected Result: All spark-hb pods should be in Running state.
4. Test Functionality
Test 1: Validate Ephemeral Storage Fix
- Ensure the AnalyticsEngine CR is in Completed state
- Run an autoscale job
- Verify that the Mira worker pod does not have ephemeral storage configured in its resource requests and limits
```
oc describe pod <mira-worker-pod-name> -n ${PROJECT_CPD_INSTANCE} | grep -A 10 "Limits:"
```
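For a machine-checkable variant of the same check, the sketch below inspects the pod spec directly (using the same pod name placeholder as above):
```
# Succeeds with "removed" only if no container in the pod requests or limits
# ephemeral storage.
oc get pod <mira-worker-pod-name> -n ${PROJECT_CPD_INSTANCE} \
  -o jsonpath='{.spec.containers[*].resources}' \
  | grep -qi "ephemeral" \
  && echo "Ephemeral storage still present" \
  || echo "Ephemeral storage removed"
```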
Test 2: Validate Iceberg Merge Fix
- Run a DBT application performing a MERGE INTO operation
- Ensure the target table has the property "write.spark.accept-any-schema" = true
- Confirm that the operation completes successfully without errors (an illustrative SQL sketch follows this list)
Test 3: Validate Token Expiry Fix
- Submit a long-running Spark job (>20 minutes)
- Monitor the job execution
- Verify that the job completes successfully without authentication errors
```
oc logs <spark-driver-pod> -n ${PROJECT_CPD_INSTANCE} | grep -i "auth\|token"
```
Test 4: Validate Long-Running Idle Query Server Fix
- Start a Spark Query Server with ACExtension enabled
- Wait until it has started, then run sample LH queries or a DBT model
- Leave the Query Server idle for approximately 15 minutes by not running any queries
- Run the same LH queries or DBT model to verify that the authorization issue is resolved
```
oc logs <spark-driver-pod> -n ${PROJECT_CPD_INSTANCE} | grep -i "auth\|token"
```
Rollback Instructions
If you need to revert the hotfix, follow these steps:
Prerequisites
- Ensure you have the analyticsengine_bkp.yaml backup file created in the Backup Steps section
Rollback Steps
1. Restore Original Configuration
```
oc apply -f analyticsengine_bkp.yaml
```
Expected Output:
```
analyticsengine.ae.cpd.ibm.com/analyticsengine-sample configured
```
2. Monitor Reconciliation
```
oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} -w
```
Wait for the status to return to Completed.
3. Verify Rollback
Check that pods are running with the original images:
```
oc get pods -l function=spark-hb-control-plane -n ${PROJECT_CPD_INSTANCE}
oc describe pod -l function=spark-hb-control-plane -n ${PROJECT_CPD_INSTANCE} | grep -i "image:"
```
The image digests should not contain the patched versions.
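A quick machine check that the rollback took effect, keyed on the control-plane digest from this hotfix:
```
# Reports "Rollback confirmed" when no running control-plane container uses
# the patched digest.
oc get pod -l function=spark-hb-control-plane -n ${PROJECT_CPD_INSTANCE} \
  -o jsonpath='{.items[*].status.containerStatuses[*].imageID}' \
  | grep -q "33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100" \
  && echo "Still on patched image" \
  || echo "Rollback confirmed"
```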
Troubleshooting
Issue 1: Patch command fails with "invalid character" error
Solution: Ensure there are no line breaks or extra spaces in the oc patch command. Copy to a text editor first, verify it's properly formatted, then execute.
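If the error persists, one way to rule out copy-paste damage is to save the JSON payload to a file (as in the dry-run sketch under Patch Application) and confirm it parses. This assumes python3 is available on your workstation:
```
# Prints the pretty-printed JSON on success, or the exact parse error and
# position on failure.
python3 -m json.tool patch.json
```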
Issue 2: Pods not updating after patch
Solution:
Verify operator pod is running:
```
oc get pods -l name=analyticsengine -n ${PROJECT_CPD_INST_OPERATORS}
```
Check operator logs:
```
oc logs -l name=analyticsengine -n ${PROJECT_CPD_INST_OPERATORS} --tail=100
```
Verify the patch was applied:
```
oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} -o yaml | grep -A 10 image_digests
```
Issue 3: Image pull errors in air-gapped environment
Solution:
- Verify the images exist in your private registry
- Check imageContentSourcePolicy is correctly configured
- Verify auth.json has valid credentials for both registries
Test image pull manually:
```
oc run test-pull --image=<your-private-registry>/cp/cpd/spark-hb-control-plane@sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100 --rm -it --restart=Never -- /bin/sh
```
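You can also confirm the image is actually present in the private registry without scheduling a pod; this sketch assumes skopeo is available where you run it:
```
# Queries the registry for the image manifest; failure here means the copy
# step did not complete for this image.
skopeo inspect --authfile "<path-to-auth-json>/auth.json" --tls-verify=false \
  docker://<your-private-registry>/cp/cpd/spark-hb-control-plane@sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100
```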
Issue 4: Validation command shows wrong digest
Solution:
- Wait longer for pod rollout to complete (can take 10-15 minutes)
- Check if pods are still in ContainerCreating or ImagePullBackOff state
Review pod events:
```
oc describe pod <pod-name> -n ${PROJECT_CPD_INSTANCE}
```
Issue 5: Long-running jobs still failing after patch
Solution:
Verify all runtime images are updated:
```
oc get cj spark-hb-preload-jkg-image -n ${PROJECT_CPD_INSTANCE} -o yaml | grep image | grep wxd
```
Check Spark driver logs for authentication errors:
```
oc logs <spark-driver-pod> -n ${PROJECT_CPD_INSTANCE} | grep -i "auth\|token\|expire"
```
- Ensure the job is using the updated runtime version
Support
If you encounter issues not covered in this document:
Collect diagnostic information:
```
oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} -o yaml > analyticsengine_status.yaml
oc get pods -n ${PROJECT_CPD_INSTANCE} > pods_status.txt
oc logs -l name=analyticsengine-operator -n ${PROJECT_CPD_INSTANCE} --tail=500 > operator_logs.txt
```
- Open a support ticket with IBM Support
- Include the collected diagnostic files
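Optionally, bundle the collected files into a single archive for the ticket:
```
# Creates one archive containing all three diagnostic files.
tar czf spark-hotfix-diagnostics.tar.gz \
  analyticsengine_status.yaml pods_status.txt operator_logs.txt
```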
Additional Notes
- Downtime: This hotfix requires pod restarts but does not require full AnalyticsEngine downtime
- Duration: Typical CR reconciliation time is 10-15 minutes
- Compatibility: This hotfix is compatible with CPD 5.2.1/2.2.1
- Components Updated:
- spark-hb-control-plane
- spark-hb-helm-repo
- spark-hb-wxd-cpd-miniforge-runtimes (v34, v35, v40)
Summary of Changes
| Issue | Component | Fix Description |
|---|---|---|
| Ephemeral Storage | Mira Worker Pods | Removed incorrect ephemeral storage values from resource requests/limits |
| Iceberg Merge | Runtime Images (v34, v35, v40) | Fixed MERGE INTO operations for Iceberg tables with schema evolution |
| Token Expiry | Runtime Images (v34, v35, v40) | Implemented token refresh mechanism for long-running jobs (>20 minutes) |
| Autoscaling | Control Plane, HPA Config | Updated autoscaling configuration for large-scale deployments |
| Authorization Issue with idle query server | Runtime Images (v34, v35, v40) | Updated runtime images with fixed version of ACExtension |
Document Information
Modified date:
07 May 2026
UID
ibm17272337