IBM Support

Consolidated Patch for IBM watsonx.data Spark Analytics Engine for CPD versions 5.2.1/2.2.1

Fix Readme


Abstract

This consolidated patch addresses multiple critical issues in IBM watsonx.data Spark Analytics Engine for CPD versions 5.2.1/2.2.1.

Content

Table of Contents

  1. Overview
  2. Issues Addressed
  3. Patch Images
  4. Prerequisites
  5. Backup Steps
  6. Air-Gapped Installation Steps
  7. Patch Application
  8. Validation
  9. Rollback Instructions
  10. Troubleshooting
  11. Summary of Changes

Overview

This consolidated patch addresses multiple critical issues in IBM watsonx.data Spark Analytics Engine for CPD versions 5.2.1/2.2.1.

Issues Addressed

Issue 1: Ephemeral Storage Configuration

Problem: When a Mira application is submitted, ephemeral storage is automatically set in resource requests/limits with incorrect values.

Solution: Ephemeral storage values are entirely removed from request/limit configurations.

Issue 2: Iceberg Merge Table Operations

Problem: Enabling a required table property on Iceberg tables during migration from Delta Lake caused MERGE INTO operations to fail due to an open-source bug.

Solution: Updated runtime images include a fix so that MERGE INTO operations on Iceberg tables complete successfully.

Issue 3: Authorization Token Expiry

Problem: Jobs running longer than 20 minutes fail because the authorization token expires after 20 minutes, interrupting any in-flight operations.

Solution: Updated runtime images to handle token refresh for long-running operations.

Issue 4: Authorization Issues with an Idle Query Server

Problem: Running queries after the Query Server has been idle for approximately 15 minutes causes an authorization failure (instance not found) for Spark queries.

Solution: Updated the Spark runtime images to include a fixed version of ACExtension, which previously contained a race condition.


Patch Images

This hotfix updates the following container images:

  • Component key: spark-hb-control-plane
    Image name: spark-hb-control-plane
    Full image reference: cp.icr.io/cp/cpd/spark-hb-control-plane@sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100
    Digest: sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100
  • Component key: spark-hb-helm-repo
    Image name: spark-hb-helm-repo
    Full image reference: cp.icr.io/cp/cpd/spark-hb-helm-repo@sha256:5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8
    Digest: sha256:5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8
  • Component key: spark-hb-wxd-cpd-miniforge-runtimes-v34
    Image name: spark-hb-wxd-cpd-miniforge-runtimes
    Full image reference: cp.icr.io/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68
    Digest: sha256:c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68
  • Component key: spark-hb-wxd-cpd-miniforge-runtimes-v35
    Image name: spark-hb-wxd-cpd-miniforge-runtimes
    Full image reference: cp.icr.io/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf
    Digest: sha256:682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf
  • Component key: spark-hb-wxd-cpd-miniforge-runtimes-v40
    Image name: spark-hb-wxd-cpd-miniforge-runtimes
    Full image reference: cp.icr.io/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065
    Digest: sha256:7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065

Prerequisites

Before applying this hotfix, ensure you have:

  • OpenShift cluster admin access
  • Access to the CPD instance namespace
  • Backup of current configuration
  • For air-gapped environments: skopeo tool installed
  • For air-gapped environments: Valid auth.json credentials file
  • Set environment variable: export PROJECT_CPD_INSTANCE=<your-cpd-namespace>
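The prerequisites above can be verified with a short pre-flight script. This is a sketch, not part of the official procedure; the namespace value is a placeholder, and skopeo is only required for air-gapped environments.

```shell
# Pre-flight sketch: check required tools and the namespace variable.
# "cpd-instance" is a placeholder default; set your real namespace.
export PROJECT_CPD_INSTANCE="${PROJECT_CPD_INSTANCE:-cpd-instance}"

# skopeo is only needed for air-gapped installs
for tool in oc skopeo; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "OK: $tool found"
  else
    echo "MISSING: $tool is not on PATH"
  fi
done

echo "Target namespace: ${PROJECT_CPD_INSTANCE}"
```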

Backup Steps

Why backup? This allows you to quickly revert to the previous state if needed.

Create Backup of AnalyticsEngine Custom Resource

oc get analyticsengine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} -o yaml > analyticsengine_bkp.yaml

Verify Backup

ls -lh analyticsengine_bkp.yaml
head -20 analyticsengine_bkp.yaml

Expected Result: The file should contain valid Kubernetes YAML configuration for the AnalyticsEngine resource.


Air-Gapped Installation Steps

Note: Skip this section if you have direct internet connectivity to IBM Container Registry.

Step 1: Login to OpenShift

oc login --token=<your-token> --server=<your-server-url>

Step 2: Prepare Authentication Credentials

You need an auth.json file with credentials for both IBM Container Registry and your private registry.

Option A: Use existing auth.json from CASE download

export AUTH_FILE="${HOME}/.airgap/auth.json"

Option B: Create a new auth.json file

{
  "auths": { 
    "cp.icr.io": {
      "email": "unused",
      "auth": "<base64 encoded id:apikey>"
    },
    "<private-registry-hostname>": {
      "email": "unused",
      "auth": "<base64 encoded id:password>"
    } 
  }
}

How to encode credentials:

echo -n "iamapikey:<your-api-key>" | base64

Reference: containers-auth.json documentation
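Putting the pieces together, one way to build and sanity-check an auth.json is sketched below. The API key is a placeholder, only the cp.icr.io entry is shown (add your private registry entry the same way), and validation relies on Python's standard json.tool module.

```shell
# Sketch: build an auth.json for cp.icr.io and validate it.
APIKEY="your-api-key-here"   # placeholder, not a real key
CP_AUTH=$(printf '%s' "iamapikey:${APIKEY}" | base64 | tr -d '\n')

cat > auth.json <<EOF
{
  "auths": {
    "cp.icr.io": { "email": "unused", "auth": "${CP_AUTH}" }
  }
}
EOF

# Confirm the file parses as JSON before handing it to skopeo
python3 -m json.tool auth.json >/dev/null && echo "auth.json is valid JSON"
```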

Step 3: Install Skopeo

yum install skopeo -y

Or for RHEL 8+:

dnf install skopeo -y

Step 4: Identify Your Private Registry

Find the current image location:

oc describe pod -l function=spark-hb-control-plane -n ${PROJECT_CPD_INSTANCE} | grep -i "image:"

Step 5: Get Private Registry Details

oc get imageContentSourcePolicy
oc describe imageContentSourcePolicy cloud-pak-for-data-mirror

Look for output like:

- mirrors:
  - ${PRIVATE_REGISTRY_LOCATION}/cp/
  source: cp.icr.io/cp/cpd

Reference: Configuring cluster to pull CPD images

Step 6: Copy Hotfix Images to Private Registry

Important: When copying the commands below, paste them into a text editor first to ensure no extra newline characters are added after the backslashes (\). Extra characters will cause the command to fail.

Copy spark-hb-control-plane:

skopeo copy --all --authfile "<path-to-auth-json>/auth.json" \
  --dest-tls-verify=false --src-tls-verify=false \
  docker://cp.icr.io/cp/cpd/spark-hb-control-plane@sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100 \
  docker://<your-private-registry>/cp/cpd/spark-hb-control-plane@sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100

Copy spark-hb-helm-repo:

skopeo copy --all --authfile "<path-to-auth-json>/auth.json" \
  --dest-tls-verify=false --src-tls-verify=false \
  docker://cp.icr.io/cp/cpd/spark-hb-helm-repo@sha256:5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8 \
  docker://<your-private-registry>/cp/cpd/spark-hb-helm-repo@sha256:5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8

Copy spark-hb-wxd-cpd-miniforge-runtimes-v34:

skopeo copy --all --authfile "<path-to-auth-json>/auth.json" \
  --dest-tls-verify=false --src-tls-verify=false \
  docker://cp.icr.io/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68 \
  docker://<your-private-registry>/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68

Copy spark-hb-wxd-cpd-miniforge-runtimes-v35:

skopeo copy --all --authfile "<path-to-auth-json>/auth.json" \
  --dest-tls-verify=false --src-tls-verify=false \
  docker://cp.icr.io/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf \
  docker://<your-private-registry>/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf

Copy spark-hb-wxd-cpd-miniforge-runtimes-v40:

skopeo copy --all --authfile "<path-to-auth-json>/auth.json" \
  --dest-tls-verify=false --src-tls-verify=false \
  docker://cp.icr.io/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065 \
  docker://<your-private-registry>/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065

Replace:

  • <path-to-auth-json> with your actual auth.json file path
  • <your-private-registry> with your private registry hostname
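Because the five copies above all follow the same pattern, they can also be generated from a list, so the registry hostname and auth file path only need editing once. This sketch only echoes the commands for review; remove the `echo` to execute them. PRIVATE_REGISTRY and AUTH_FILE are placeholder values.

```shell
# Sketch: generate the five skopeo copy commands from an image@digest list.
PRIVATE_REGISTRY="registry.example.com:5000"   # placeholder
AUTH_FILE="${HOME}/.airgap/auth.json"          # placeholder

IMAGES="
spark-hb-control-plane@sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100
spark-hb-helm-repo@sha256:5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8
spark-hb-wxd-cpd-miniforge-runtimes@sha256:c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68
spark-hb-wxd-cpd-miniforge-runtimes@sha256:682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf
spark-hb-wxd-cpd-miniforge-runtimes@sha256:7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065
"

for ref in $IMAGES; do
  # echo prints the command for review; delete "echo" to actually copy
  echo skopeo copy --all --authfile "$AUTH_FILE" \
    --dest-tls-verify=false --src-tls-verify=false \
    "docker://cp.icr.io/cp/cpd/${ref}" \
    "docker://${PRIVATE_REGISTRY}/cp/cpd/${ref}"
done
```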

Patch Application

This section applies whether you are pulling directly from the IBM Container Registry or using an air-gapped private registry.

Step 1: Set Environment Variable

export PROJECT_CPD_INSTANCE=<your-cpd-namespace>

Example:

export PROJECT_CPD_INSTANCE=cpd-instance

Step 2: Apply Comprehensive Patch to AnalyticsEngine

Important: Copy the command exactly as shown. If your terminal mangles multi-line input, paste the command into a text editor first and confirm that no stray characters or broken quoting were introduced before executing it.

Complete Patch Command (includes all fixes):

oc patch analyticsengine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} --type merge --patch '{
  "spec": {
    "autoScaleConfig": true,
    "scaleConfig": "large",
    "controlPlaneResourceConfig": {
      "requests": {
        "cpu": "2",
        "memory": "8Gi",
        "ephemeralStorage": "1200Mi"
      },
      "limits": {
        "cpu": "2",
        "memory": "8Gi",
        "ephemeralStorage": "1200Mi"
      }
    },
    "hpa": {
      "control_plane": {
        "autoscaling": {
          "min_replicas": 1,
          "max_replicas": 3,
          "medium_min_replicas": 3,
          "medium_max_replicas": 7,
          "large_min_replicas": 5,
          "large_max_replicas": 11,
          "target_cpu_utilization_percentage": 50
        }
      },
      "deployer_agent": {
        "autoscaling": {
          "min_replicas": 1,
          "max_replicas": 3,
          "medium_min_replicas": 3,
          "medium_max_replicas": 7,
          "large_min_replicas": 5,
          "large_max_replicas": 11,
          "target_cpu_utilization_percentage": 50
        }
      },
      "nginx": {
        "autoscaling": {
          "min_replicas": 1,
          "max_replicas": 3,
          "medium_min_replicas": 3,
          "medium_max_replicas": 7,
          "large_min_replicas": 5,
          "large_max_replicas": 11,
          "target_cpu_utilization_percentage": 50
        }
      },
      "ui": {
        "autoscaling": {
          "min_replicas": 1,
          "max_replicas": 3,
          "medium_min_replicas": 3,
          "medium_max_replicas": 7,
          "large_min_replicas": 5,
          "large_max_replicas": 9,
          "target_cpu_utilization_percentage": 140
        }
      }
    },
    "image_digests": {
      "spark-hb-control-plane": "sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100",
      "spark-hb-helm-repo": "sha256:5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8",
      "spark-hb-wxd-cpd-miniforge-runtimes-v34": "sha256:c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68",
      "spark-hb-wxd-cpd-miniforge-runtimes-v35": "sha256:682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf",
      "spark-hb-wxd-cpd-miniforge-runtimes-v40": "sha256:7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065"
    }
  }
}'

Expected Output:

analyticsengine.ae.cpd.ibm.com/analyticsengine-sample patched

Step 3: Monitor Reconciliation

Watch the AnalyticsEngine resource status:

oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} -w

What to look for:

  • Status should transition to Completed
  • This may take 5-10 minutes

Alternative monitoring:

watch -n 10 "oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE}"

Step 4: Verify Pod Updates

Check that the pods are running with the new images:

oc get pods -l function=spark-hb-control-plane -n ${PROJECT_CPD_INSTANCE}

Expected Result: All pods should be in Running state with recent restart times.


Validation

1. Check AnalyticsEngine Status

oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE}

Expected Output:

NAME                      STATUS      AGE
analyticsengine-sample    Completed   45d

2. Verify Image Digests

Verify control-plane image:

oc get pod -l function=spark-hb-control-plane -n ${PROJECT_CPD_INSTANCE} -o jsonpath='{.items[0].status.containerStatuses[?(@.name=="spark-hb-control-plane")].imageID}' | grep -q "33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100" && echo "Control plane patch applied successfully" || echo "Control plane patch validation failed"

Verify all image digests:

oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} -o yaml | grep -A 10 image_digests

Expected output should show:

image_digests:
  spark-hb-control-plane: sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100
  spark-hb-helm-repo: sha256:5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8
  spark-hb-wxd-cpd-miniforge-runtimes-v34: sha256:c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68
  spark-hb-wxd-cpd-miniforge-runtimes-v35: sha256:682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf
  spark-hb-wxd-cpd-miniforge-runtimes-v40: sha256:7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065
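Rather than checking each digest by eye, a small helper can grep all five out of an exported copy of the CR. This is a sketch; the function name and temporary file path are illustrative.

```shell
# Sketch: verify that every patched digest appears in the AnalyticsEngine CR.
check_digests() {
  # $1: path to a YAML export of the AnalyticsEngine CR
  for digest in \
    33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100 \
    5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8 \
    c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68 \
    682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf \
    7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065
  do
    if grep -q "$digest" "$1" 2>/dev/null; then
      echo "FOUND:   $digest"
    else
      echo "MISSING: $digest"
    fi
  done
}

oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} -o yaml \
  > /tmp/ae_current.yaml 2>/dev/null || true
check_digests /tmp/ae_current.yaml
```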

3. Verify Pods are Running

oc get pods -n ${PROJECT_CPD_INSTANCE} | grep spark-hb

Expected Result: All spark-hb pods should be in Running state.

4. Test Functionality

Test 1: Validate Ephemeral Storage Fix

  1. Ensure the AnalyticsEngine CR is in Completed state
  2. Run an autoscale job
  3. Verify that the Mira worker pod does not have ephemeral storage configured in its resource requests and limits:

oc describe pod <mira-worker-pod-name> -n ${PROJECT_CPD_INSTANCE} | grep -A 10 "Limits:"

Test 2: Validate Iceberg Merge Fix

  1. Run a DBT application performing a MERGE INTO operation
  2. Ensure the target table has the property "write.spark.accept-any-schema" = true
  3. Confirm that the operation completes successfully without errors

Test 3: Validate Token Expiry Fix

  1. Submit a long-running Spark job (>20 minutes)
  2. Monitor the job execution
  3. Verify that the job completes successfully without authentication errors:

oc logs <spark-driver-pod> -n ${PROJECT_CPD_INSTANCE} | grep -i "auth\|token"

Test 4: Validate Long Running Idle Query Server fix

  1. Start a Spark query server with ACExtension enabled
  2. Wait until the server has started, then run sample LH queries or a DBT model
  3. Leave the query server idle for approximately 15 minutes by not running any queries
  4. Run the same LH queries or DBT model to verify that the authorization issue is resolved:

oc logs <spark-driver-pod> -n ${PROJECT_CPD_INSTANCE} | grep -i "auth\|token"

Rollback Instructions

If you need to revert the hotfix, follow these steps:

Prerequisites

  • Ensure you have the analyticsengine_bkp.yaml backup file created in the Backup Steps section

Rollback Steps

1. Restore Original Configuration

oc apply -f analyticsengine_bkp.yaml

Expected Output:

analyticsengine.ae.cpd.ibm.com/analyticsengine-sample configured

2. Monitor Reconciliation

oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} -w

Wait for the status to return to Completed.

3. Verify Rollback

Check that pods are running with the original images:

oc get pods -l function=spark-hb-control-plane -n ${PROJECT_CPD_INSTANCE}
oc describe pod -l function=spark-hb-control-plane -n ${PROJECT_CPD_INSTANCE} | grep -i "image:"

The image digests should not contain the patched versions.
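One way to spot-check this (a sketch; the helper function name is illustrative) is to compare the running control-plane image against the patched digest:

```shell
# Sketch: warn if the control-plane pod still runs the patched image.
PATCHED_DIGEST="33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100"

check_rollback() {
  # $1: the "Image:" line(s) from the control-plane pod description
  case "$1" in
    *"$PATCHED_DIGEST"*) echo "WARNING: patched image still in use" ;;
    *)                   echo "OK: patched digest no longer referenced" ;;
  esac
}

current=$(oc describe pod -l function=spark-hb-control-plane \
  -n ${PROJECT_CPD_INSTANCE} 2>/dev/null | grep -i "image:")
check_rollback "$current"
```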


Troubleshooting

Issue 1: Patch command fails with "invalid character" error

Solution: Ensure there are no line breaks or extra spaces in the oc patch command. Copy to a text editor first, verify it's properly formatted, then execute.
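If the error persists, it can help to save the payload to a file and validate it separately before patching. This is a sketch: the file path is illustrative, the payload is shown with a single field for brevity (use the full payload from the Patch Application section), and reading the payload from a file via `--patch-file` requires a reasonably recent oc/kubectl release.

```shell
# Sketch: validate the patch payload as JSON before applying it.
cat > /tmp/ae_patch.json <<'EOF'
{
  "spec": {
    "image_digests": {
      "spark-hb-control-plane": "sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100"
    }
  }
}
EOF

# Fails loudly on malformed JSON (stray characters, broken quoting, etc.)
python3 -m json.tool /tmp/ae_patch.json >/dev/null && echo "payload is valid JSON"

# Recent oc/kubectl releases can then read the payload from the file:
# oc patch analyticsengine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} \
#   --type merge --patch-file /tmp/ae_patch.json
```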

Issue 2: Pods not updating after patch

Solution:

  1. Verify operator pod is running:

    oc get pods -l name=analyticsengine -n ${PROJECT_CPD_INST_OPERATORS}
  2. Check operator logs:

    oc logs -l name=analyticsengine -n ${PROJECT_CPD_INST_OPERATORS} --tail=100
  3. Verify the patch was applied:

    oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} -o yaml | grep -A 10 image_digests

Issue 3: Image pull errors in air-gapped environment

Solution:

  1. Verify the images exist in your private registry
  2. Check imageContentSourcePolicy is correctly configured
  3. Verify auth.json has valid credentials for both registries
  4. Test image pull manually:

    oc run test-pull --image=<your-private-registry>/cp/cpd/spark-hb-control-plane@sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100 --rm -it --restart=Never -- /bin/sh

Issue 4: Validation command shows wrong digest

Solution:

  1. Wait longer for pod rollout to complete (can take 10-15 minutes)
  2. Check if pods are still in ContainerCreating or ImagePullBackOff state
  3. Review pod events:

    oc describe pod <pod-name> -n ${PROJECT_CPD_INSTANCE}

Issue 5: Long-running jobs still failing after patch

Solution:

  1. Verify all runtime images are updated:

    oc get cj spark-hb-preload-jkg-image -n ${PROJECT_CPD_INSTANCE} -o yaml | grep image | grep wxd
  2. Check Spark driver logs for authentication errors:

    oc logs <spark-driver-pod> -n ${PROJECT_CPD_INSTANCE} | grep -i "auth\|token\|expire"
  3. Ensure the job is using the updated runtime version

Support

If you encounter issues not covered in this document:

  1. Collect diagnostic information:

    oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} -o yaml > analyticsengine_status.yaml
    oc get pods -n ${PROJECT_CPD_INSTANCE} > pods_status.txt
    oc logs -l name=analyticsengine-operator -n ${PROJECT_CPD_INSTANCE} --tail=500 > operator_logs.txt
  2. Open a support ticket with IBM Support
  3. Include the collected diagnostic files
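The collected files can be bundled into a single archive for the ticket. This is a sketch; the archive name is arbitrary, and the file names match the collection commands above.

```shell
# Sketch: bundle the diagnostic files gathered above into one archive.
ARCHIVE="ae_diagnostics_$(date +%Y%m%d).tar.gz"
tar czf "$ARCHIVE" \
  analyticsengine_status.yaml pods_status.txt operator_logs.txt \
  && echo "Created $ARCHIVE" \
  || echo "One or more diagnostic files are missing"
```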

Additional Notes

  • Downtime: This hotfix requires pod restarts but does not require full AnalyticsEngine downtime
  • Duration: Typical CR reconciliation time is 10-15 minutes
  • Compatibility: This hotfix is compatible with CPD 5.2.1/2.2.1
  • Components Updated:
    • spark-hb-control-plane
    • spark-hb-helm-repo
    • spark-hb-wxd-cpd-miniforge-runtimes (v34, v35, v40)

Summary of Changes

  • Ephemeral Storage (Mira Worker Pods): Removed incorrect ephemeral storage values from resource requests/limits
  • Iceberg Merge (Runtime Images v34, v35, v40): Fixed MERGE INTO operations for Iceberg tables with schema evolution
  • Token Expiry (Runtime Images v34, v35, v40): Implemented token refresh mechanism for long-running jobs (>20 minutes)
  • Autoscaling (Control Plane, HPA Config): Updated autoscaling configuration for large-scale deployments
  • Authorization Issue with Idle Query Server (Runtime Images v34, v35, v40): Updated runtime images with fixed version of ACExtension


Document Information

Modified date:
07 May 2026

UID

ibm17272337