Customize Apache Spark runtimes in IBM Analytics Engine

After installing IBM Analytics Engine powered by Apache Spark, an instance administrator can customize the Apache Spark runtime by creating a container image that extends the default runtime. Use this image to add system packages, Python libraries, or JAR files required by your workloads.

Before you begin

Ensure that you have the following:

  • Access to a Cloud Pak for Data cluster with Analytics Engine installed.
  • Administrator privileges to run oc and docker commands.
  • The base image ID for the Spark version you want to customize:
    • Spark 3.4: os-image-id-jkg34-cp4d-wxd
    • Spark 3.5: os-image-id-jkg35-cp4d-wxd
  • A working Docker environment to build the image.

Steps

  1. Get the base image details:
    1. Run the following command to retrieve the base image and tag. Replace <os-image-id> with the ID for your Spark version, and replace cpd-instance with the namespace where Cloud Pak for Data is installed.

       oc get cm -n cpd-instance spark-hb-cluster-template \
        -o=jsonpath='{$.data.os-image-json}' \
      | jq -r '.docs[] | select(._id | test("<os-image-id>")) | "\(.docker_repo)/\(.docker_image)@\(.docker_image_tag)"' 
      
    2. From the output, copy the following values:

      
      REGISTRY=<docker_repo>
      IMAGE_NAME=<docker_image>
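
The command in step 1 prints a single string of the form <docker_repo>/<docker_image>@<docker_image_tag>. As a minimal sketch, that string can be split into its parts with shell parameter expansion. The function name split_image_ref, the variable IMAGE_DIGEST, and the sample reference are illustrative assumptions and are not part of the product.

```shell
#!/bin/bash
# Split "<docker_repo>/<docker_image>@<docker_image_tag>" into its parts.
# Assumes the reference contains an "@" separator and at least one "/".
split_image_ref() {
  local ref="$1"
  local name_part="${ref%@*}"      # everything before the last "@"
  IMAGE_DIGEST="${ref##*@}"        # everything after the last "@"
  REGISTRY="${name_part%/*}"       # everything before the last "/"
  IMAGE_NAME="${name_part##*/}"    # final path component
}

# Hypothetical reference, for illustration only.
split_image_ref "registry.example.com/cp/cpd/spark-hb-cpd-miniconda-runtimes@sha256:0123abcd"
echo "REGISTRY=${REGISTRY}"
echo "IMAGE_NAME=${IMAGE_NAME}"
echo "IMAGE_DIGEST=${IMAGE_DIGEST}"
```

In practice the argument would be the output of the oc and jq command rather than a literal string.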
      
  2. Create a file named Dockerfile with the following content:
    ARG REGISTRY
    ARG IMAGE_NAME
    FROM ${REGISTRY}/${IMAGE_NAME}
    USER root:3000
    COPY *.sh /tmp/
    RUN bash /tmp/install-os-packages.sh && \
        bash /tmp/install-conda-packages.sh && \
        bash /tmp/install-jar.sh
    USER ${CLUSTER_USER}
    WORKDIR ${WORK_DIR}
    
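  3. Create the customization scripts in the build context directory, next to the Dockerfile, so that the COPY *.sh instruction picks them up: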
    1. install-os-packages.sh
      #!/bin/bash
      set -e -o pipefail
      
      # Install operating system packages
      microdnf update -y
      microdnf install -y vim
      
    2. install-jar.sh
      #!/bin/bash
      set -e -o pipefail
      
      # Install additional JAR files
      wget -c -P /opt/ibm/spark/external-jars/ \
      https://repo1.maven.org/maven2/com/fasterxml/jackson/core/jackson-databind/2.18.0/jackson-databind-2.18.0.jar
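
The Dockerfile in step 2 runs three scripts from /tmp, but only two of them are listed in this step. A file named install-conda-packages.sh must also be present, or the RUN instruction fails. The following sketch checks the build context for all three files before a build is started. The function name missing_scripts is hypothetical, and the check assumes the scripts sit at the top level of the build context.

```shell
#!/bin/bash
# Print the name of every script that the Dockerfile's RUN instruction calls
# but that is missing from the build context (default: current directory).
missing_scripts() {
  local context_dir="${1:-.}"
  local script
  for script in install-os-packages.sh install-conda-packages.sh install-jar.sh; do
    if [ ! -f "${context_dir}/${script}" ]; then
      echo "${script}"
    fi
  done
}

missing="$(missing_scripts .)"
if [ -n "${missing}" ]; then
  # Left unquoted on purpose so the names are joined on one line.
  echo "Missing from build context:" ${missing}
else
  echo "All customization scripts are present."
fi
```

Running the check from the directory that holds the Dockerfile gives an early, readable error instead of a failed image build.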
      
  4. Build the custom image:
    1. Get the digest of the base image. Replace <namespace> with the namespace where Cloud Pak for Data is installed:

      os_image_json=$(oc get cm spark-hb-cluster-template -o jsonpath='{.data.os-image-json}' -n <namespace>)
      echo "$os_image_json" | grep -A 6 '"docker_image": "spark-hb-cpd-miniconda-runtimes"' | grep 'docker_image_tag' | awk -F': "' '{print $2}' | sed 's/"$//'
      
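    2. Use the digest to build your custom image. Run the command from the directory that contains the Dockerfile and the scripts; the trailing dot sets that directory as the build context:
      docker build -t <YOUR_REGISTRY>/spark-hb-cpd-miniconda-runtimes:custom-runtime \
      --build-arg REGISTRY=<YOUR_REGISTRY> \
      --build-arg IMAGE_NAME=<image_name_COPIED> \
      .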
      
  5. Configure the custom image:
    • Cluster level

      Update the AnalyticsEngine custom resource:

      spec:
        customSparkImage: <your_custom_image_digest>
        builtInSparkImages: <existing_images>
      
    • Instance level
      Run the following command to patch the default configuration:
      curl -k --request PATCH https://<cpd-url>/v4/analytics_engines/<engine_id>/default_configs \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      --data-raw '{
        "ae.spark.kubernetes.container.image": "<your_custom_image_digest>",
        "ae.spark.kubernetes.driver.container.image": "",
        "ae.spark.kubernetes.executor.container.image": ""
      }'
      
    • Job or Kernel level
      Specify the image in the job configuration:
      "conf": {
        "ae.spark.kubernetes.container.image": "<your_custom_image_digest>",
        "ae.spark.kubernetes.driver.container.image": "",
        "ae.spark.kubernetes.executor.container.image": ""
      }
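
The instance-level and job-level settings use the same three keys, so the JSON can be generated from one image reference instead of being edited by hand in each place. The following is a minimal sketch of that idea. The function name image_conf_json and the sample image reference are hypothetical, and the function assumes the reference contains no double quotes or backslashes.

```shell
#!/bin/bash
# Print the three image settings as a JSON object for a given image reference.
# Assumes the reference needs no JSON escaping.
image_conf_json() {
  local image_ref="$1"
  printf '{\n'
  printf '  "ae.spark.kubernetes.container.image": "%s",\n' "${image_ref}"
  printf '  "ae.spark.kubernetes.driver.container.image": "",\n'
  printf '  "ae.spark.kubernetes.executor.container.image": ""\n'
  printf '}\n'
}

# Hypothetical image reference, for illustration only.
image_conf_json "registry.example.com/spark-hb-cpd-miniconda-runtimes@sha256:0123abcd"
```

The output could then be passed to the PATCH request, for example with --data-raw "$(image_conf_json "<your_custom_image_digest>")".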
      

What to do next

Complete the following tasks in order before users can access the service:

  1. An instance administrator can scale the service to adjust the number of available pods. See Scaling services.
  2. Before you can submit Spark jobs by using the Spark jobs API, you must provision a service instance. See Provisioning the service instance.
  3. The service is ready to use. See Spark environments.