Customize Apache Spark runtimes in IBM Analytics Engine

After installing IBM Analytics Engine powered by Apache Spark, an instance administrator can customize the Apache Spark runtime by creating a container image that extends the default runtime. Use this image to add system packages, Python libraries, or JAR files required by your workloads.

Before you begin

Ensure that you have the following:

- Access to a Cloud Pak for Data cluster with Analytics Engine installed.
- Administrator privileges to run oc and docker commands.
- The base image ID for the Spark version you want to customize:
  - Spark 3.4: os-image-id-jkg34-cp4d-wxd
  - Spark 3.5: os-image-id-jkg35-cp4d-wxd
- A working Docker environment to build the image.

Steps

Get the base image details:

Run the following command to retrieve the base image and tag. Replace <os-image-id> with the appropriate ID for your Spark version, and replace cpd-instance with the namespace where Cloud Pak for Data is installed. The command requires jq.

oc get cm -n cpd-instance spark-hb-cluster-template \
  -o=jsonpath='{$.data.os-image-json}' \
  | jq -r '.docs[] | select(._id | test("<os-image-id>")) | "\(.docker_repo)/\(.docker_image)@\(.docker_image_tag)"'

The output has the form <docker_repo>/<docker_image>@<docker_image_tag>. From the output, copy the following values:

REGISTRY=<docker_repo>
IMAGE_NAME=<docker_image>
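
If you prefer to split the reference in a script rather than by hand, Bash parameter expansion is enough. The following is a minimal sketch, assuming the command output has the form <docker_repo>/<docker_image>@<docker_image_tag>. The function name split_image_ref, the extra IMAGE_DIGEST variable, and the sample reference are hypothetical and do not come from the product.

```shell
#!/bin/bash
# Split "<docker_repo>/<docker_image>@<docker_image_tag>" into its parts.
split_image_ref() {
  local ref="$1"
  local repo_and_image="${ref%@*}"     # everything before the "@"
  REGISTRY="${repo_and_image%/*}"      # everything before the last "/"
  IMAGE_NAME="${repo_and_image##*/}"   # text after the last "/"
  IMAGE_DIGEST="${ref##*@}"            # text after the "@"
}

# Made-up sample reference; paste the real command output here instead.
split_image_ref "registry.example.com/cp/cpd/spark-hb-cpd-miniconda-runtimes@sha256:0123abcd"
echo "REGISTRY=${REGISTRY}"
echo "IMAGE_NAME=${IMAGE_NAME}"
echo "IMAGE_DIGEST=${IMAGE_DIGEST}"
```

The sketch does not validate its input: a reference without an "@" leaves IMAGE_DIGEST equal to the whole string, so check the values before you use them.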

Create a file named Dockerfile with the following content:

ARG REGISTRY
ARG IMAGE_NAME
FROM ${REGISTRY}/${IMAGE_NAME}
USER root:3000
COPY *.sh /tmp/
RUN bash /tmp/install-os-packages.sh && \
    bash /tmp/install-conda-packages.sh && \
    bash /tmp/install-jar.sh
USER ${CLUSTER_USER}
WORKDIR ${WORK_DIR}

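Create the customization scripts in the same directory as the Dockerfile. The Dockerfile runs install-os-packages.sh, install-conda-packages.sh, and install-jar.sh, so all three files must exist for the build to succeed: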

install-os-packages.sh

#!/bin/bash
set -e -o pipefail

# Install operating system packages
microdnf update -y
microdnf install -y vim

install-jar.sh

#!/bin/bash
set -e -o pipefail

# Install additional JAR files
wget -c -P /opt/ibm/spark/external-jars/ \
https://repo1.maven.org/maven2/com/fasterxml/jackson/core/jackson-databind/2.18.0/jackson-databind-2.18.0.jar
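
The Dockerfile also runs install-conda-packages.sh, for which no listing is given here. The following sketch writes one possible version of that file and checks its syntax without executing it. It assumes that conda is on the PATH of the build user in the base image; pyyaml is only a placeholder package name.

```shell
# Write a hypothetical install-conda-packages.sh next to the Dockerfile.
cat > install-conda-packages.sh <<'EOF'
#!/bin/bash
set -e -o pipefail

# Install additional Python libraries (placeholder package: pyyaml)
conda install -y pyyaml
# Remove cached package archives to keep the image small
conda clean -y --all
EOF

# Parse the script without running it to catch syntax errors early.
bash -n install-conda-packages.sh && echo "install-conda-packages.sh: syntax OK"
```

Replace pyyaml with the libraries that your workloads need. If your libraries are only available from PyPI, a pip install line in the same script is a possible alternative.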

Build the custom image:

Get the digest of the base image. Replace <namespace> with the namespace where Cloud Pak for Data is installed:

os_image_json=$(oc get cm spark-hb-cluster-template -o jsonpath='{.data.os-image-json}' -n <namespace>)
echo "$os_image_json" | grep -A 6 '"docker_image": "spark-hb-cpd-miniconda-runtimes"' | grep 'docker_image_tag' | awk -F': "' '{print $2}' | sed 's/"$//'

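Use the digest to build your custom image. Run the command from the directory that contains the Dockerfile and the scripts; the trailing dot sets that directory as the build context:

docker build -t <YOUR_REGISTRY>/spark-hb-cpd-miniconda-runtimes:custom-runtime \
  --build-arg REGISTRY=<YOUR_REGISTRY> \
  --build-arg IMAGE_NAME=<image_name_COPIED> \
  .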
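
The configuration steps expect a digest reference for the custom image, so the image has to be pushed to a registry first. The following is a minimal sketch, assuming that you are already logged in to <YOUR_REGISTRY> with docker login. The function name push_and_get_digest is hypothetical.

```shell
# Push an image, then print its repository digest (<repo>@sha256:...).
push_and_get_digest() {
  local image="$1"
  # Send push progress to stderr so that stdout carries only the digest.
  docker push "$image" >&2 || return 1
  # RepoDigests is populated once the image has been pushed or pulled.
  docker inspect --format '{{index .RepoDigests 0}}' "$image"
}
```

For example, calling push_and_get_digest <YOUR_REGISTRY>/spark-hb-cpd-miniconda-runtimes:custom-runtime prints the value to use wherever <your_custom_image_digest> appears.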

Configure the custom image:

Cluster level

Update the AnalyticsEngine custom resource:

spec:
  customSparkImage: <your_custom_image_digest>
  builtInSparkImages: <existing_images>
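
One way to apply this change without opening an editor is a JSON merge patch, which changes only the fields it names and leaves the rest of the spec, such as builtInSparkImages, as it is. This is a sketch under assumptions: the resource type name analyticsengine, the function name set_custom_spark_image, and the arguments <cr_name> and <namespace> are not given in this document, so confirm them on your cluster before you run it.

```shell
# Set spec.customSparkImage on an AnalyticsEngine custom resource.
set_custom_spark_image() {
  local cr_name="$1" namespace="$2" image="$3"
  local patch
  # Build the JSON merge patch; only spec.customSparkImage is changed.
  patch=$(printf '{"spec":{"customSparkImage":"%s"}}' "$image")
  oc patch analyticsengine "$cr_name" -n "$namespace" --type merge -p "$patch"
}
```

A call would look like set_custom_spark_image <cr_name> <namespace> "<your_custom_image_digest>".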

Instance level

Run the following command to patch the default configuration. Replace <cpd-url> and <engine_id> with your values, and make sure that $TOKEN holds a valid bearer token:

curl -k --request PATCH https://<cpd-url>/v4/analytics_engines/<engine_id>/default_configs \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  --data-raw '{
  "ae.spark.kubernetes.container.image": "<your_custom_image_digest>",
  "ae.spark.kubernetes.driver.container.image": "",
  "ae.spark.kubernetes.executor.container.image": ""
}'

Job or kernel level

Specify the image in the conf section of the job or kernel configuration:

"conf": {
  "ae.spark.kubernetes.container.image": "<your_custom_image_digest>",
  "ae.spark.kubernetes.driver.container.image": "",
  "ae.spark.kubernetes.executor.container.image": ""
}
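
Because the same three keys are used at both the instance level and the job level, a small helper can render them from one image reference and reject references that are not pinned by digest. The following is a sketch, assuming sha256 digests. The function name render_image_conf is hypothetical.

```shell
# Print the three image settings as a JSON object for one image reference.
render_image_conf() {
  local image="$1"
  # Require a digest-pinned reference: <repo>/<name>@sha256:<64 hex characters>.
  if ! printf '%s' "$image" | grep -Eq '@sha256:[0-9a-f]{64}$'; then
    echo "error: expected <image>@sha256:<digest>, got: $image" >&2
    return 1
  fi
  printf '{\n'
  printf '  "ae.spark.kubernetes.container.image": "%s",\n' "$image"
  printf '  "ae.spark.kubernetes.driver.container.image": "",\n'
  printf '  "ae.spark.kubernetes.executor.container.image": ""\n'
  printf '}\n'
}
```

Calling render_image_conf "<your_custom_image_digest>" prints an object that can be pasted into the conf section of a job or passed to curl with --data-raw. A tag such as :custom-runtime is rejected, which guards against configuring an image that can change under the same name.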

What to do next

Complete the following tasks in order before users can access the service:

An instance administrator can set the scale of the service to adjust the number of available pods. See Scaling services.
Before you can submit Spark jobs by using the Spark jobs API, you must provision a service instance. See Provisioning the service instance.
The service is ready to use. See Spark environments.