Customize Apache Spark runtimes in IBM Analytics Engine
After installing IBM Analytics Engine powered by Apache Spark, an instance administrator can customize the Apache Spark runtime by creating a container image that extends the default runtime. Use this image to add system packages, Python libraries, or JAR files required by your workloads.
Before you begin
Ensure that you have the following:
- Access to a Cloud Pak for Data cluster with Analytics Engine installed.
- Administrator privileges to run oc and docker commands.
- The base image ID for the Spark version you want to customize:
  - Spark 3.4: os-image-id-jkg34-cp4d-wxd
  - Spark 3.5: os-image-id-jkg35-cp4d-wxd
- A working Docker environment to build the image.
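The prerequisites above can be checked before starting. The following is a minimal sketch, not part of the product: missing_tools is a hypothetical helper name, and jq is included only because the command in the first step pipes its output through it.

```shell
#!/bin/bash
# Print the name of every listed command that is not found on PATH.
# Prints nothing when all of the commands are available.
missing_tools() {
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}
```

For example, running missing_tools oc docker jq prints one line per tool that still needs to be installed, and prints nothing when the workstation is ready.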
Steps
- Get the base image details:
  - Run the following command to retrieve the base image and tag. Replace <os-image-id> with the appropriate ID for your Spark version, and replace cpd-instance with the namespace where Cloud Pak for Data is installed.

        oc get cm -n cpd-instance spark-hb-cluster-template \
          -o=jsonpath='{$.data.os-image-json}' \
          | jq -r '.docs[] | select(._id | test("<os-image-id>")) | "\(.docker_repo)/\(.docker_image)@\(.docker_image_tag)"'

  - From the output, copy the following values:

        REGISTRY=<docker_repo>
        IMAGE_NAME=<docker_image>
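The command in this step prints a single reference of the form <docker_repo>/<docker_image>@<docker_image_tag>. Rather than copying the parts by hand, they could be separated with shell parameter expansion. This is a sketch under two assumptions: the image name itself contains no slash, and the reference contains exactly one @ character. The function name split_image_ref and the variable IMAGE_DIGEST are hypothetical.

```shell
#!/bin/bash
# Split "<docker_repo>/<docker_image>@<docker_image_tag>" into three variables.
# Assumes the image name has no "/" and the reference has exactly one "@".
split_image_ref() {
  local ref="$1"
  local path="${ref%@*}"       # everything before "@"
  IMAGE_DIGEST="${ref##*@}"    # everything after "@"
  REGISTRY="${path%/*}"        # everything before the last "/"
  IMAGE_NAME="${path##*/}"     # everything after the last "/"
}
```

After calling split_image_ref with the output of the oc command, REGISTRY and IMAGE_NAME hold the two values that this step asks you to copy.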
- Create a file named Dockerfile with the following content:

      ARG REGISTRY
      ARG IMAGE_NAME
      FROM ${REGISTRY}/${IMAGE_NAME}
      USER root:3000
      COPY *.sh /tmp/
      RUN bash /tmp/install-os-packages.sh && \
          bash /tmp/install-conda-packages.sh && \
          bash /tmp/install-jar.sh
      USER ${CLUSTER_USER}
      WORKDIR ${WORK_DIR}

- Create the customization scripts in the same directory as the Dockerfile:
  - install-os-packages.sh

        #!/bin/bash
        set -e -o pipefail
        # Install operating system packages
        microdnf update -y
        microdnf install -y vim

  - install-jar.sh

        #!/bin/bash
        set -e -o pipefail
        # Install additional JAR files
        wget -c -P /opt/ibm/spark/external-jars/ \
          https://repo1.maven.org/maven2/com/fasterxml/jackson/core/jackson-databind/2.18.0/jackson-databind-2.18.0.jar

  The Dockerfile also runs install-conda-packages.sh, so a script with that name must exist in the build directory, or the corresponding line must be removed from the RUN instruction.
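The Dockerfile calls install-conda-packages.sh, but this section does not show its content. The following is only a sketch of what such a script might contain. It assumes that the base image provides conda on the PATH, which the source does not confirm. The CONDA_BIN variable and the install_conda_packages function are hypothetical names.

```shell
#!/bin/bash
set -e -o pipefail

# Command used to install packages. Assumption: the base image provides
# conda on PATH; set CONDA_BIN to override the location.
CONDA_BIN="${CONDA_BIN:-conda}"

# Install the given Python packages into the active conda environment.
# Does nothing when no package names are passed.
install_conda_packages() {
  if [ "$#" -eq 0 ]; then
    echo "no packages requested" >&2
    return 0
  fi
  "$CONDA_BIN" install -y "$@"
}
```

To use this as the script file, end it with a call that lists the libraries your workloads need, for example install_conda_packages pandas, where pandas is only an illustrative package name.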
- Build the custom image:
  - To get the digest of the base image:

        os_image_json=$(oc get cm spark-hb-cluster-template -o jsonpath='{.data.os-image-json}' -n <namespace>)
        echo "$os_image_json" \
          | grep -A 6 '"docker_image": "spark-hb-cpd-miniconda-runtimes"' \
          | grep 'docker_image_tag' \
          | awk -F': "' '{print $2}' \
          | sed 's/"$//'

  - Use the digest to build your custom image. Run the command from the directory that contains the Dockerfile and the customization scripts; the trailing . passes that directory as the build context:

        docker build -t <YOUR_REGISTRY>/spark-hb-cpd-miniconda-runtimes:custom-runtime \
          --build-arg REGISTRY=<YOUR_REGISTRY> \
          --build-arg IMAGE_NAME=<image_name_COPIED> \
          .
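Because the FROM line of the Dockerfile is assembled from the REGISTRY and IMAGE_NAME build arguments, an empty value would leave it with an incomplete image reference. A small guard can catch that before docker build runs. This is a sketch; require_values is a hypothetical helper, not part of docker or oc.

```shell
#!/bin/bash
# Each argument has the form NAME=value. Return non-zero and report the
# first NAME whose value is empty; return zero when every value is set.
require_values() {
  local pair
  for pair in "$@"; do
    if [ -z "${pair#*=}" ]; then
      echo "missing value for ${pair%%=*}" >&2
      return 1
    fi
  done
}
```

One way to use it is to run require_values "REGISTRY=$REGISTRY" "IMAGE_NAME=$IMAGE_NAME" and continue with docker build only when it succeeds.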
- Configure the custom image:
  - Cluster level: Update the AnalyticsEngine custom resource:

        spec:
          customSparkImage: <your_custom_image_digest>
          builtInSparkImages: <existing_images>

  - Instance level: Run the following command to patch the default configuration:

        curl -k --request PATCH https://<cpd-url>/v4/analytics_engines/<engine_id>/default_configs \
          -H "Authorization: Bearer $TOKEN" \
          --data-raw '{
            "ae.spark.kubernetes.container.image": "<your_custom_image_digest>",
            "ae.spark.kubernetes.driver.container.image": "",
            "ae.spark.kubernetes.executor.container.image": ""
          }'

  - Job or Kernel level: Specify the image in the job configuration:

        "conf": {
          "ae.spark.kubernetes.container.image": "<your_custom_image_digest>",
          "ae.spark.kubernetes.driver.container.image": "",
          "ae.spark.kubernetes.executor.container.image": ""
        }
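The instance-level and job-level settings use the same three keys, so the JSON body can be generated from one image reference instead of being typed twice. The sketch below assumes the image reference contains no characters that need JSON escaping; image_conf_json is a hypothetical helper name.

```shell
#!/bin/bash
# Print a JSON object that sets the custom image and leaves the driver-
# and executor-specific image keys as empty strings, matching the
# configuration examples in this section.
image_conf_json() {
  local image="$1"
  printf '{\n'
  printf '  "ae.spark.kubernetes.container.image": "%s",\n' "$image"
  printf '  "ae.spark.kubernetes.driver.container.image": "",\n'
  printf '  "ae.spark.kubernetes.executor.container.image": ""\n'
  printf '}\n'
}
```

The output could then be passed to the PATCH request with --data-raw "$(image_conf_json "<your_custom_image_digest>")".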
What to do next
Complete the following tasks in order before users can access the service:
- An instance administrator can set the scale of the service to adjust the number of available pods. See Scaling services.
- Before you can submit Spark jobs by using the Spark jobs API, you must provision a service instance. See Provisioning the service instance.
- The service is ready to use. See Spark environments.