IBM Support

Consolidated Patch for IBM watsonx.data Spark Analytics Engine for CPD versions 5.2.1/2.2.1

Fix Readme


Abstract

This consolidated patch addresses multiple critical issues in IBM watsonx.data Spark Analytics Engine for CPD versions 5.2.1/2.2.1.

Content

Table of Contents

  1. Overview
  2. Issues Addressed
  3. Patch Images
  4. Prerequisites
  5. Backup Steps
  6. Air-Gapped Installation Steps
  7. Patch Application
  8. Validation
  9. Rollback Instructions
  10. Troubleshooting
  11. Summary of Changes

Overview

This consolidated patch addresses multiple critical issues in IBM watsonx.data Spark Analytics Engine for CPD versions 5.2.1/2.2.1.

Issues Addressed

Issue 1: Ephemeral Storage Configuration

Problem: When a Mira application is submitted, ephemeral storage is automatically set in resource requests/limits with incorrect values.

Solution: Ephemeral storage values are entirely removed from request/limit configurations.

Issue 2: Iceberg Merge Table Operations

Problem: Enabling a required table property on Iceberg tables during migration from Delta Lake caused MERGE INTO operations to fail due to an open-source bug.

Solution: Updated runtime images include a fix so that MERGE INTO operations on Iceberg tables complete successfully.

Issue 3: Authorization Token Expiry

Problem: Jobs running longer than 20 minutes fail because the authorization token expires after 20 minutes, interrupting any in-flight operations.

Solution: Updated runtime images to handle token refresh for long-running operations.

Issue 4: Authorization Issues with an Idle Query Server

Problem: Running queries after the Query Server has been idle for approximately 15 minutes causes an authorization failure (instance not found) for Spark queries.

Solution: Updated the Spark runtime images to include a fixed version of ACExtension, which previously contained a race condition.


Patch Images

This hotfix updates the following container images:

  • Component key: spark-hb-control-plane
    Image name: spark-hb-control-plane
    Full image reference: cp.icr.io/cp/cpd/spark-hb-control-plane@sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100
    Digest: sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100
  • Component key: spark-hb-helm-repo
    Image name: spark-hb-helm-repo
    Full image reference: cp.icr.io/cp/cpd/spark-hb-helm-repo@sha256:5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8
    Digest: sha256:5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8
  • Component key: spark-hb-wxd-cpd-miniforge-runtimes-v34
    Image name: spark-hb-wxd-cpd-miniforge-runtimes
    Full image reference: cp.icr.io/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68
    Digest: sha256:c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68
  • Component key: spark-hb-wxd-cpd-miniforge-runtimes-v35
    Image name: spark-hb-wxd-cpd-miniforge-runtimes
    Full image reference: cp.icr.io/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf
    Digest: sha256:682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf
  • Component key: spark-hb-wxd-cpd-miniforge-runtimes-v40
    Image name: spark-hb-wxd-cpd-miniforge-runtimes
    Full image reference: cp.icr.io/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065
    Digest: sha256:7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065

Prerequisites

Before applying this hotfix, ensure you have:

  • OpenShift cluster admin access
  • Access to the CPD instance namespace
  • Backup of current configuration
  • For air-gapped environments: skopeo tool installed
  • For air-gapped environments: Valid auth.json credentials file
  • Set environment variable: export PROJECT_CPD_INSTANCE=<your-cpd-namespace>
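The prerequisites above can be verified with a short pre-flight script. This is a sketch, not part of the official procedure; the namespace value is a placeholder, and skopeo is only required for air-gapped environments.

```shell
# Pre-flight sketch: check required tools and the namespace variable.
# "cpd-instance" is a placeholder default; set your real namespace.
export PROJECT_CPD_INSTANCE="${PROJECT_CPD_INSTANCE:-cpd-instance}"

# skopeo is only needed for air-gapped installs
for tool in oc skopeo; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "OK: $tool found"
  else
    echo "MISSING: $tool is not on PATH"
  fi
done

echo "Target namespace: ${PROJECT_CPD_INSTANCE}"
```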

Backup Steps

Why backup? This allows you to quickly revert to the previous state if needed.

Create Backup of AnalyticsEngine Custom Resource

oc get analyticsengine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} -o yaml > analyticsengine_bkp.yaml

Verify Backup

ls -lh analyticsengine_bkp.yaml
head -20 analyticsengine_bkp.yaml

Expected Result: The file should contain valid Kubernetes YAML configuration for the AnalyticsEngine resource.


Air-Gapped Installation Steps

Note: Skip this section if you have direct internet connectivity to IBM Container Registry.

Step 1: Login to OpenShift

oc login --token=<your-token> --server=<your-server-url>

Step 2: Prepare Authentication Credentials

You need an auth.json file with credentials for both IBM Container Registry and your private registry.

Option A: Use existing auth.json from CASE download

export AUTH_FILE="${HOME}/.airgap/auth.json"

Option B: Create a new auth.json file

{
  "auths": { 
    "cp.icr.io": {
      "email": "unused",
      "auth": "<base64 encoded id:apikey>"
    },
    "<private-registry-hostname>": {
      "email": "unused",
      "auth": "<base64 encoded id:password>"
    } 
  }
}

How to encode credentials:

echo -n "iamapikey:<your-api-key>" | base64

Reference: containers-auth.json documentation
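Putting the pieces together, one way to build and sanity-check an auth.json is sketched below. The API key is a placeholder, only the cp.icr.io entry is shown (add your private registry entry the same way), and validation relies on Python's standard json.tool module.

```shell
# Sketch: build an auth.json for cp.icr.io and validate it.
APIKEY="your-api-key-here"   # placeholder, not a real key
CP_AUTH=$(printf '%s' "iamapikey:${APIKEY}" | base64 | tr -d '\n')

cat > auth.json <<EOF
{
  "auths": {
    "cp.icr.io": { "email": "unused", "auth": "${CP_AUTH}" }
  }
}
EOF

# Confirm the file parses as JSON before handing it to skopeo
python3 -m json.tool auth.json >/dev/null && echo "auth.json is valid JSON"
```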

Step 3: Install Skopeo

yum install skopeo -y

Or for RHEL 8+:

dnf install skopeo -y

Step 4: Identify Your Private Registry

Find the current image location:

oc describe pod -l function=spark-hb-control-plane -n ${PROJECT_CPD_INSTANCE} | grep -i "image:"

Step 5: Get Private Registry Details

oc get imageContentSourcePolicy
oc describe imageContentSourcePolicy cloud-pak-for-data-mirror

Look for output like:

- mirrors:
  - ${PRIVATE_REGISTRY_LOCATION}/cp/
  source: cp.icr.io/cp/cpd

Reference: Configuring cluster to pull CPD images

Step 6: Copy Hotfix Images to Private Registry

Important: When copying the commands below, paste them into a text editor first to ensure no extra newline characters are added after the backslashes (\). Extra characters will cause the command to fail.

Copy spark-hb-control-plane:

skopeo copy --all --authfile "<path-to-auth-json>/auth.json" \
  --dest-tls-verify=false --src-tls-verify=false \
  docker://cp.icr.io/cp/cpd/spark-hb-control-plane@sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100 \
  docker://<your-private-registry>/cp/cpd/spark-hb-control-plane@sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100

Copy spark-hb-helm-repo:

skopeo copy --all --authfile "<path-to-auth-json>/auth.json" \
  --dest-tls-verify=false --src-tls-verify=false \
  docker://cp.icr.io/cp/cpd/spark-hb-helm-repo@sha256:5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8 \
  docker://<your-private-registry>/cp/cpd/spark-hb-helm-repo@sha256:5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8

Copy spark-hb-wxd-cpd-miniforge-runtimes-v34:

skopeo copy --all --authfile "<path-to-auth-json>/auth.json" \
  --dest-tls-verify=false --src-tls-verify=false \
  docker://cp.icr.io/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68 \
  docker://<your-private-registry>/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68

Copy spark-hb-wxd-cpd-miniforge-runtimes-v35:

skopeo copy --all --authfile "<path-to-auth-json>/auth.json" \
  --dest-tls-verify=false --src-tls-verify=false \
  docker://cp.icr.io/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf \
  docker://<your-private-registry>/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf

Copy spark-hb-wxd-cpd-miniforge-runtimes-v40:

skopeo copy --all --authfile "<path-to-auth-json>/auth.json" \
  --dest-tls-verify=false --src-tls-verify=false \
  docker://cp.icr.io/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065 \
  docker://<your-private-registry>/cp/cpd/spark-hb-wxd-cpd-miniforge-runtimes@sha256:7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065

Replace:

  • <path-to-auth-json> with your actual auth.json file path
  • <your-private-registry> with your private registry hostname
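Because the five copies above all follow the same pattern, they can also be generated from a list, so the registry hostname and auth file path only need editing once. This sketch only echoes the commands for review; remove the `echo` to execute them. PRIVATE_REGISTRY and AUTH_FILE are placeholder values.

```shell
# Sketch: generate the five skopeo copy commands from an image@digest list.
PRIVATE_REGISTRY="registry.example.com:5000"   # placeholder
AUTH_FILE="${HOME}/.airgap/auth.json"          # placeholder

IMAGES="
spark-hb-control-plane@sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100
spark-hb-helm-repo@sha256:5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8
spark-hb-wxd-cpd-miniforge-runtimes@sha256:c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68
spark-hb-wxd-cpd-miniforge-runtimes@sha256:682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf
spark-hb-wxd-cpd-miniforge-runtimes@sha256:7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065
"

for ref in $IMAGES; do
  # echo prints the command for review; delete "echo" to actually copy
  echo skopeo copy --all --authfile "$AUTH_FILE" \
    --dest-tls-verify=false --src-tls-verify=false \
    "docker://cp.icr.io/cp/cpd/${ref}" \
    "docker://${PRIVATE_REGISTRY}/cp/cpd/${ref}"
done
```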

Patch Application

This section applies whether you are pulling directly from the IBM Container Registry or using an air-gapped private registry.

Step 1: Set Environment Variable

export PROJECT_CPD_INSTANCE=<your-cpd-namespace>

Example:

export PROJECT_CPD_INSTANCE=cpd-instance

Step 2: Apply Comprehensive Patch to AnalyticsEngine

Important: Copy the command exactly as shown. If your terminal mangles multi-line input, paste the command into a text editor first and confirm that no stray characters or broken quoting were introduced before executing it.

Complete Patch Command (includes all fixes):

oc patch analyticsengine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} --type merge --patch '{
  "spec": {
    "autoScaleConfig": true,
    "scaleConfig": "large",
    "controlPlaneResourceConfig": {
      "requests": {
        "cpu": "2",
        "memory": "8Gi",
        "ephemeralStorage": "1200Mi"
      },
      "limits": {
        "cpu": "2",
        "memory": "8Gi",
        "ephemeralStorage": "1200Mi"
      }
    },
    "hpa": {
      "control_plane": {
        "autoscaling": {
          "min_replicas": 1,
          "max_replicas": 3,
          "medium_min_replicas": 3,
          "medium_max_replicas": 7,
          "large_min_replicas": 5,
          "large_max_replicas": 11,
          "target_cpu_utilization_percentage": 50
        }
      },
      "deployer_agent": {
        "autoscaling": {
          "min_replicas": 1,
          "max_replicas": 3,
          "medium_min_replicas": 3,
          "medium_max_replicas": 7,
          "large_min_replicas": 5,
          "large_max_replicas": 11,
          "target_cpu_utilization_percentage": 50
        }
      },
      "nginx": {
        "autoscaling": {
          "min_replicas": 1,
          "max_replicas": 3,
          "medium_min_replicas": 3,
          "medium_max_replicas": 7,
          "large_min_replicas": 5,
          "large_max_replicas": 11,
          "target_cpu_utilization_percentage": 50
        }
      },
      "ui": {
        "autoscaling": {
          "min_replicas": 1,
          "max_replicas": 3,
          "medium_min_replicas": 3,
          "medium_max_replicas": 7,
          "large_min_replicas": 5,
          "large_max_replicas": 9,
          "target_cpu_utilization_percentage": 140
        }
      }
    },
    "image_digests": {
      "spark-hb-control-plane": "sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100",
      "spark-hb-helm-repo": "sha256:5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8",
      "spark-hb-wxd-cpd-miniforge-runtimes-v34": "sha256:c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68",
      "spark-hb-wxd-cpd-miniforge-runtimes-v35": "sha256:682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf",
      "spark-hb-wxd-cpd-miniforge-runtimes-v40": "sha256:7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065"
    }
  }
}'

Expected Output:

analyticsengine.ae.cpd.ibm.com/analyticsengine-sample patched

Step 3: Monitor Reconciliation

Watch the AnalyticsEngine resource status:

oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} -w

What to look for:

  • Status should transition to Completed
  • This may take 5-10 minutes

Alternative monitoring:

watch -n 10 "oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE}"

Step 4: Verify Pod Updates

Check that the pods are running with the new images:

oc get pods -l function=spark-hb-control-plane -n ${PROJECT_CPD_INSTANCE}

Expected Result: All pods should be in Running state with recent restart times.


Validation

1. Check AnalyticsEngine Status

oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE}

Expected Output:

NAME                      STATUS      AGE
analyticsengine-sample    Completed   45d

2. Verify Image Digests

Verify control-plane image:

oc get pod -l function=spark-hb-control-plane -n ${PROJECT_CPD_INSTANCE} -o jsonpath='{.items[0].status.containerStatuses[?(@.name=="spark-hb-control-plane")].imageID}' | grep -q "33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100" && echo "Control plane patch applied successfully" || echo "Control plane patch validation failed"

Verify all image digests:

oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} -o yaml | grep -A 10 image_digests

Expected output should show:

image_digests:
  spark-hb-control-plane: sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100
  spark-hb-helm-repo: sha256:5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8
  spark-hb-wxd-cpd-miniforge-runtimes-v34: sha256:c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68
  spark-hb-wxd-cpd-miniforge-runtimes-v35: sha256:682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf
  spark-hb-wxd-cpd-miniforge-runtimes-v40: sha256:7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065
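Rather than checking each digest by eye, a small helper can grep all five out of an exported copy of the CR. This is a sketch; the function name and temporary file path are illustrative.

```shell
# Sketch: verify that every patched digest appears in the AnalyticsEngine CR.
check_digests() {
  # $1: path to a YAML export of the AnalyticsEngine CR
  for digest in \
    33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100 \
    5e72d5d33d242bff2e28423b9419921afc6bbe52dc1f3becd53469ab72d5ddf8 \
    c49d8ad3b467ac1669e798cfd4810ea65a8d424f8f7cebd32f748f6936f45c68 \
    682d5db5bff70fbd6a3a87af6903963c306a3eac8043df3d740afa555dda2ecf \
    7c3b3f35b01816ec4f21241982d61f61cf2786c53460bae6f92060e752143065
  do
    if grep -q "$digest" "$1" 2>/dev/null; then
      echo "FOUND:   $digest"
    else
      echo "MISSING: $digest"
    fi
  done
}

oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} -o yaml \
  > /tmp/ae_current.yaml 2>/dev/null || true
check_digests /tmp/ae_current.yaml
```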

3. Verify Pods are Running

oc get pods -n ${PROJECT_CPD_INSTANCE} | grep spark-hb

Expected Result: All spark-hb pods should be in Running state.

4. Test Functionality

Test 1: Validate Ephemeral Storage Fix

  1. Ensure the AnalyticsEngine CR is in Completed state
  2. Run an autoscale job
  3. Verify that the Mira worker pod does not have ephemeral storage configured in its resource requests and limits:

oc describe pod <mira-worker-pod-name> -n ${PROJECT_CPD_INSTANCE} | grep -A 10 "Limits:"

Test 2: Validate Iceberg Merge Fix

  1. Run a DBT application performing a MERGE INTO operation
  2. Ensure the target table has the property "write.spark.accept-any-schema" = true
  3. Confirm that the operation completes successfully without errors

Test 3: Validate Token Expiry Fix

  1. Submit a long-running Spark job (>20 minutes)
  2. Monitor the job execution
  3. Verify that the job completes successfully without authentication errors:

oc logs <spark-driver-pod> -n ${PROJECT_CPD_INSTANCE} | grep -i "auth\|token"

Test 4: Validate Long Running Idle Query Server fix

  1. Start a Spark query server with ACExtension enabled
  2. Wait until the server has started, then run sample LH queries or a DBT model
  3. Leave the query server idle for approximately 15 minutes by not running any queries
  4. Run the same LH queries or DBT model to verify that the authorization issue is resolved:

oc logs <spark-driver-pod> -n ${PROJECT_CPD_INSTANCE} | grep -i "auth\|token"

Rollback Instructions

If you need to revert the hotfix, follow these steps:

Prerequisites

  • Ensure you have the analyticsengine_bkp.yaml backup file created in the Backup Steps section

Rollback Steps

1. Restore Original Configuration

oc apply -f analyticsengine_bkp.yaml

Expected Output:

analyticsengine.ae.cpd.ibm.com/analyticsengine-sample configured

2. Monitor Reconciliation

oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} -w

Wait for the status to return to Completed.

3. Verify Rollback

Check that pods are running with the original images:

oc get pods -l function=spark-hb-control-plane -n ${PROJECT_CPD_INSTANCE}
oc describe pod -l function=spark-hb-control-plane -n ${PROJECT_CPD_INSTANCE} | grep -i "image:"

The image digests should not contain the patched versions.
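One way to spot-check this (a sketch; the helper function name is illustrative) is to compare the running control-plane image against the patched digest:

```shell
# Sketch: warn if the control-plane pod still runs the patched image.
PATCHED_DIGEST="33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100"

check_rollback() {
  # $1: the "Image:" line(s) from the control-plane pod description
  case "$1" in
    *"$PATCHED_DIGEST"*) echo "WARNING: patched image still in use" ;;
    *)                   echo "OK: patched digest no longer referenced" ;;
  esac
}

current=$(oc describe pod -l function=spark-hb-control-plane \
  -n ${PROJECT_CPD_INSTANCE} 2>/dev/null | grep -i "image:")
check_rollback "$current"
```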


Troubleshooting

Issue 1: Patch command fails with "invalid character" error

Solution: Ensure there are no line breaks or extra spaces in the oc patch command. Copy to a text editor first, verify it's properly formatted, then execute.
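If the error persists, it can help to save the payload to a file and validate it separately before patching. This is a sketch: the file path is illustrative, the payload is shown with a single field for brevity (use the full payload from the Patch Application section), and reading the payload from a file via `--patch-file` requires a reasonably recent oc/kubectl release.

```shell
# Sketch: validate the patch payload as JSON before applying it.
cat > /tmp/ae_patch.json <<'EOF'
{
  "spec": {
    "image_digests": {
      "spark-hb-control-plane": "sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100"
    }
  }
}
EOF

# Fails loudly on malformed JSON (stray characters, broken quoting, etc.)
python3 -m json.tool /tmp/ae_patch.json >/dev/null && echo "payload is valid JSON"

# Recent oc/kubectl releases can then read the payload from the file:
# oc patch analyticsengine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} \
#   --type merge --patch-file /tmp/ae_patch.json
```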

Issue 2: Pods not updating after patch

Solution:

  1. Verify operator pod is running:

    oc get pods -l name=analyticsengine -n ${PROJECT_CPD_INST_OPERATORS}
  2. Check operator logs:

    oc logs -l name=analyticsengine -n ${PROJECT_CPD_INST_OPERATORS} --tail=100
  3. Verify the patch was applied:

    oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} -o yaml | grep -A 10 image_digests

Issue 3: Image pull errors in air-gapped environment

Solution:

  1. Verify the images exist in your private registry
  2. Check imageContentSourcePolicy is correctly configured
  3. Verify auth.json has valid credentials for both registries
  4. Test image pull manually:

    oc run test-pull --image=<your-private-registry>/cp/cpd/spark-hb-control-plane@sha256:33fc3bdb873ef1c2ad979aff4e1418fbf5ea41a904dbcf78cbaa105093bce100 --rm -it --restart=Never -- /bin/sh

Issue 4: Validation command shows wrong digest

Solution:

  1. Wait longer for pod rollout to complete (can take 10-15 minutes)
  2. Check if pods are still in ContainerCreating or ImagePullBackOff state
  3. Review pod events:

    oc describe pod <pod-name> -n ${PROJECT_CPD_INSTANCE}

Issue 5: Long-running jobs still failing after patch

Solution:

  1. Verify all runtime images are updated:

    oc get cj spark-hb-preload-jkg-image -n ${PROJECT_CPD_INSTANCE} -o yaml | grep image | grep wxd
  2. Check Spark driver logs for authentication errors:

    oc logs <spark-driver-pod> -n ${PROJECT_CPD_INSTANCE} | grep -i "auth\|token\|expire"
  3. Ensure the job is using the updated runtime version

Support

If you encounter issues not covered in this document:

  1. Collect diagnostic information:

    oc get AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INSTANCE} -o yaml > analyticsengine_status.yaml
    oc get pods -n ${PROJECT_CPD_INSTANCE} > pods_status.txt
    oc logs -l name=analyticsengine-operator -n ${PROJECT_CPD_INSTANCE} --tail=500 > operator_logs.txt
  2. Open a support ticket with IBM Support
  3. Include the collected diagnostic files
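The collected files can be bundled into a single archive for the ticket. This is a sketch; the archive name is arbitrary, and the file names match the collection commands above.

```shell
# Sketch: bundle the diagnostic files gathered above into one archive.
ARCHIVE="ae_diagnostics_$(date +%Y%m%d).tar.gz"
tar czf "$ARCHIVE" \
  analyticsengine_status.yaml pods_status.txt operator_logs.txt \
  && echo "Created $ARCHIVE" \
  || echo "One or more diagnostic files are missing"
```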

Additional Notes

  • Downtime: This hotfix requires pod restarts but does not require full AnalyticsEngine downtime
  • Duration: Typical CR reconciliation time is 10-15 minutes
  • Compatibility: This hotfix is compatible with CPD 5.2.1/2.2.1
  • Components Updated:
    • spark-hb-control-plane
    • spark-hb-helm-repo
    • spark-hb-wxd-cpd-miniforge-runtimes (v34, v35, v40)

Summary of Changes

  • Ephemeral Storage (Mira Worker Pods): Removed incorrect ephemeral storage values from resource requests/limits
  • Iceberg Merge (Runtime Images v34, v35, v40): Fixed MERGE INTO operations for Iceberg tables with schema evolution
  • Token Expiry (Runtime Images v34, v35, v40): Implemented token refresh mechanism for long-running jobs (>20 minutes)
  • Autoscaling (Control Plane, HPA Config): Updated autoscaling configuration for large-scale deployments
  • Authorization Issue with Idle Query Server (Runtime Images v34, v35, v40): Updated runtime images with fixed version of ACExtension


Document Information

Modified date:
07 May 2026

UID

ibm17272337