Upgrading Analytics Engine powered by Apache Spark from Version 5.2 to Version 5.4

An instance administrator can upgrade Analytics Engine powered by Apache Spark from Version 5.2 to Version 5.4.

Who needs to complete this task?

Instance administrator
To upgrade Analytics Engine powered by Apache Spark, you must be an instance administrator. An instance administrator has permission to manage software in the following projects:

The operators project for the instance

The operators for this instance of Analytics Engine powered by Apache Spark are installed in the operators project. In the upgrade commands, the ${PROJECT_CPD_INST_OPERATORS} environment variable refers to the operators project.

The operands project for the instance

The custom resources for the control plane and Analytics Engine powered by Apache Spark are installed in the operands project. In the upgrade commands, the ${PROJECT_CPD_INST_OPERANDS} environment variable refers to the operands project.

When do you need to complete this task?

Review the following options to determine whether you need to complete this task:

  • If you want to upgrade the IBM Software Hub control plane and one or more services at the same time, follow the process in Upgrading an instance of IBM Software Hub instead.
  • If you didn't upgrade Analytics Engine powered by Apache Spark when you upgraded the IBM Software Hub control plane, complete this task to upgrade Analytics Engine powered by Apache Spark.

    Repeat as needed
    If you are responsible for multiple instances of IBM Software Hub, you can repeat this task to upgrade more instances of Analytics Engine powered by Apache Spark on the cluster.

Information you need to complete this task

Review the following information before you upgrade Analytics Engine powered by Apache Spark:

Version requirements

All the components that are associated with an instance of IBM Software Hub must be installed at the same release. For example, if the IBM Software Hub control plane is at Version 5.4.0, you must upgrade Analytics Engine powered by Apache Spark to Version 5.4.0.

Environment variables
The commands in this task use environment variables so that you can run the commands exactly as written.
  • If you don't have the script that defines the environment variables, see Setting up installation environment variables.
  • To use the environment variables from the script, you must source the environment variables before you run the commands in this task. For example, run:
    source ./cpd_vars.sh
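
After you source the script, it can help to confirm that the variables used by the commands in this task are actually set in the current shell. The following is a minimal sketch, not part of the product: require_vars is a hypothetical helper name, and it checks only that each variable is non-empty, not that its value is correct.

```shell
# Hypothetical helper (not part of cpd-cli): report any variable names
# passed as arguments that are unset or empty in the current shell.
require_vars() {
  rv_missing=0
  for rv_name in "$@"; do
    # Look up the value of the variable whose name is stored in rv_name.
    eval "rv_value=\${${rv_name}:-}"
    if [ -z "${rv_value}" ]; then
      echo "Missing environment variable: ${rv_name}" >&2
      rv_missing=1
    fi
  done
  return "${rv_missing}"
}
```

For example, you might run require_vars VERSION PATCH_ID PROJECT_CPD_INST_OPERATORS PROJECT_CPD_INST_OPERANDS IMAGE_PULL_PREFIX IMAGE_PULL_SECRET before you upgrade the service, and require_vars CPD_PROFILE_NAME before you upgrade the service instances. A nonzero return code means that at least one variable is missing.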

Before you begin

This task assumes that the following prerequisites are met:

System requirements
This task assumes that the cluster meets the minimum requirements for Analytics Engine powered by Apache Spark.
Where to find more information
If this task is not complete, see System requirements.
Workstation
This task assumes that the workstation from which you will run the upgrade is set up as a client workstation and has the following command-line interfaces:
  • IBM Software Hub CLI: cpd-cli
  • OpenShift® CLI: oc
  • Helm CLI: helm
Where to find more information
If this task is not complete, see Updating client workstations.
Control plane
This task assumes that the IBM Software Hub control plane is upgraded.
Where to find more information
If this task is not complete, see Upgrading an instance of IBM Software Hub.
Private container registry
If your environment uses a private container registry (for example, your cluster is air-gapped), this task assumes that the following tasks are complete:
  1. The Analytics Engine powered by Apache Spark software images are mirrored to the private container registry.
    Where to find more information
    If this task is not complete, see Mirroring images to a private container registry.
  2. The cpd-cli is configured to pull the olm-utils-v4 image from the private container registry.
    Where to find more information
    If this task is not complete, see Pulling the olm-utils-v4 image from the private container registry.
Cluster-scoped resources
This task assumes that the cluster-scoped resources, such as custom resource definitions, cluster roles, and cluster role bindings, were updated.
Where to find more information
If this task is not complete, see Updating the cluster-scoped resources for the platform and services.
Image pull secrets
This task assumes that the secrets that contain the image pull credentials for the instance exist.
Where to find more information
If this task is not complete, see Creating image pull secrets for an instance of IBM Software Hub.
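
The workstation prerequisite lists three command-line interfaces. As a quick check before you start, you could verify that each one is on your PATH. This is a sketch only: check_clis is a hypothetical helper name, and finding a command on the PATH does not confirm that it is at the required version.

```shell
# Hypothetical helper: print the name of each command that is not on PATH
# and return nonzero if any are missing.
check_clis() {
  cc_missing=0
  for cc_cmd in "$@"; do
    if ! command -v "${cc_cmd}" >/dev/null 2>&1; then
      echo "Not found on PATH: ${cc_cmd}" >&2
      cc_missing=1
    fi
  done
  return "${cc_missing}"
}
```

For this task, the call would be check_clis cpd-cli oc helm.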

Procedure

Complete the following tasks to upgrade Analytics Engine powered by Apache Spark:

  1. Specifying installation options
  2. Upgrading the service
  3. Validating the upgrade
  4. Upgrading existing service instances
  5. What to do next

Analytics Engine powered by Apache Spark parameters

When you upgrade Analytics Engine powered by Apache Spark, you can specify the following installation options in a file named install-options.yml in the cpd-cli work directory (for example, cpd-cli-workspace/olm-utils-workspace/work).

The parameters are optional. If you do not set these installation parameters, the default values are used.

Retain the --- syntax at the beginning of the entry to ensure that this entry is treated as a separate document.

---
# ............................................................................
# Analytics Engine powered by Apache Spark parameters
# ............................................................................
non_olm:
  analyticsengine:

# ------------------------------------------------------------------------------
# Analytics Engine powered by Apache Spark service configuration parameters
# ------------------------------------------------------------------------------
    serviceConfig:
      sparkAdvEnabled: true
      jobAutoDeleteEnabled: true
      kernelCullTime: 30
      imagePullParallelism: "40"
      imagePullCompletions: "20"
      kernelCleanupSchedule: "*/30 * * * *"
      jobCleanupSchedule: "*/30 * * * *"
      skipSelinuxRelabeling: false
      mountCustomizationsFromCchome: false

# ------------------------------------------------------------------------------
# Spark runtime configuration parameters
# ------------------------------------------------------------------------------
    sparkRuntimeConfig:
      maxDriverCpuCores: 5
      maxExecutorCpuCores: 5
      maxDriveMemory: "50g"
      maxExecutorMemory: "50g"
      maxNumWorkers: 50
      localDirScaleFactor: 10
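
Because the parameters are optional, an options file presumably needs to contain only the settings that you want to change. The following sketch writes such a file from the shell. The helper name write_install_options is hypothetical, and the values 60 and 20 are illustrative only, not recommendations. The helper overwrites any existing install-options.yml in the target directory, so if your file already holds entries for other components, edit it by hand instead.

```shell
# Sketch: write an install-options.yml that overrides only two parameters.
# The values 60 and 20 are illustrative, not recommendations.
# Warning: this overwrites an existing install-options.yml in the directory.
write_install_options() {
  wio_dir="$1"
  mkdir -p "${wio_dir}"
  cat > "${wio_dir}/install-options.yml" <<'EOF'
---
non_olm:
  analyticsengine:
    serviceConfig:
      kernelCullTime: 60
    sparkRuntimeConfig:
      maxNumWorkers: 20
EOF
}
```

With the example work directory from this section, the call would be write_install_options cpd-cli-workspace/olm-utils-workspace/work.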
Analytics Engine powered by Apache Spark service configuration parameters

The service configuration parameters determine how the Analytics Engine powered by Apache Spark service behaves.

Property Description
sparkAdvEnabled Specify whether to display the job UI.
Default value
true
Valid values
false
Do not display the job UI.
true
Display the job UI.
jobAutoDeleteEnabled Specify whether to automatically delete jobs after they reach a terminal state, such as FINISHED or FAILED.
Default value
true
Valid values
true
Delete jobs after they reach a terminal state.
false
Retain jobs after they reach a terminal state.
kernelCullTime The amount of time, in minutes, that idle kernels are kept.
Default value
30
Valid values
An integer greater than 0.
imagePullParallelism The number of pods that are scheduled to pull the Spark image in parallel.

For example, if you have 100 nodes in the cluster, set:

  • imagePullCompletions: "100"
  • imagePullParallelism: "150"

In this example, at least 100 nodes will pull the image successfully with 150 pods pulling the image in parallel.

Default value
"40"
Valid values
An integer greater than or equal to 1.

Increase this value only if you have a very large cluster and you have sufficient network bandwidth and disk I/O to support more pulls in parallel.

imagePullCompletions The number of pods that should be completed in order for the image pull job to be completed.

For example, if you have 100 nodes in the cluster, set:

  • imagePullCompletions: "100"
  • imagePullParallelism: "150"

In this example, at least 100 nodes will pull the image successfully with 150 pods pulling the image in parallel.

Default value
"20"
Valid values
An integer greater than or equal to 1.

Increase this value only if you have a very large cluster and you have sufficient network bandwidth and disk I/O to support more pulls in parallel.

kernelCleanupSchedule Override the kernelCullTime setting for the kernel cleanup CronJob.

By default, the kernel cleanup CronJob runs every 30 minutes.

Default value
"*/30 * * * *"
Valid values
A string that uses the CronJob schedule syntax.
jobCleanupSchedule Override the kernelCullTime setting for the job cleanup CronJob.

By default, the job cleanup CronJob runs every 30 minutes.

Default value
"*/30 * * * *"
Valid values
A string that uses the CronJob schedule syntax.
skipSelinuxRelabeling Specify whether to skip the SELinux relabeling.

To use this feature, you must create the required MachineConfig and RuntimeClass definitions. For more information, see Enabling MachineConfig and RuntimeClass definitions for certain properties.

Default value
false
Valid values
false
Do not skip the SELinux relabeling.
true
Skip the SELinux relabeling.
mountCustomizationsFromCchome Specify whether you want to enable custom drivers. These drivers must be mounted from the cc-home-pvc directory.

Common core services
This feature is available only when the Cloud Pak for Data common core services are installed.

Default value
false
Valid values
false
You do not want to use custom drivers.
true
You want to enable custom drivers.
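
The kernelCleanupSchedule and jobCleanupSchedule parameters take a schedule string in CronJob syntax, which in its standard form has five whitespace-separated fields. If you edit these values by hand, a rough check such as the following can catch a dropped field before you run the upgrade. This is a sketch only: has_five_cron_fields is a hypothetical helper name, and it does not validate the contents of each field.

```shell
# Rough sanity check only: a standard cron expression has five
# whitespace-separated fields (minute, hour, day of month, month, day of week).
# This does not validate the contents of each field.
has_five_cron_fields() {
  # Disable globbing so that "*" is not expanded to file names while splitting.
  set -f
  # Split the schedule string into positional parameters on whitespace.
  set -- $1
  set +f
  [ "$#" -eq 5 ]
}
```

For example, has_five_cron_fields "*/30 * * * *" succeeds for the default schedule, and the same call with only four fields fails.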
Spark runtime configuration parameters

The runtime configuration parameters determine how the Spark runtimes generated by the Analytics Engine powered by Apache Spark service behave.

Property Description
maxDriverCpuCores The number of CPUs to allocate to the Spark jobs driver.
Default value
5
Valid values
An integer greater than or equal to 1.
maxExecutorCpuCores The number of CPUs to allocate to the Spark jobs executor.
Default value
5
Valid values
An integer greater than or equal to 1.
maxDriveMemory The amount of memory, in gigabytes, to allocate to the driver.
Default value
"50g"
Valid values
An integer greater than or equal to 1.
maxExecutorMemory The amount of memory, in gigabytes, to allocate to the executor.
Default value
"50g"
Valid values
An integer greater than or equal to 1.
maxNumWorkers The number of workers (also called executors) to allocate to Spark jobs.
Default value
50
Valid values
An integer greater than or equal to 1.
localDirScaleFactor The number that is used to calculate the temporary disk size on Spark nodes.

The formula is:

temp_disk_size = number_of_cpu * local_dir_scale_factor
Default value
10
Valid values
An integer greater than or equal to 1.
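
To see how localDirScaleFactor affects the temporary disk size, you can evaluate the formula from the table directly. The following sketch does only that arithmetic. The helper name temp_disk_size is hypothetical, and because the table does not state the unit of the result, the sketch does not assume one.

```shell
# Evaluate the formula from the localDirScaleFactor description:
#   temp_disk_size = number_of_cpu * local_dir_scale_factor
temp_disk_size() {
  tds_cpus="$1"
  tds_scale="$2"
  echo "$(( ${tds_cpus} * ${tds_scale} ))"
}
```

For example, temp_disk_size 5 10 prints 50, which corresponds to 5 CPUs and the default scale factor of 10.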

Upgrading the service

To upgrade Analytics Engine powered by Apache Spark:

  1. Log the cpd-cli in to the Red Hat® OpenShift Container Platform cluster:
    ${CPDM_OC_LOGIN}
    Remember: CPDM_OC_LOGIN is an alias for the cpd-cli manage login-to-ocp command.
  2. Update the operator and custom resource for Analytics Engine powered by Apache Spark.

    Run the appropriate command to update the custom resource.

    Default installation (without installation options)
    cpd-cli manage install-components \
    --license_acceptance=true \
    --components=analyticsengine \
    --release=${VERSION} \
    --patch_id=${PATCH_ID} \
    --operator_ns=${PROJECT_CPD_INST_OPERATORS} \
    --instance_ns=${PROJECT_CPD_INST_OPERANDS} \
    --image_pull_prefix=${IMAGE_PULL_PREFIX} \
    --image_pull_secret=${IMAGE_PULL_SECRET} \
    --upgrade=true
    Custom installation (with installation options)
    cpd-cli manage install-components \
    --license_acceptance=true \
    --components=analyticsengine \
    --release=${VERSION} \
    --patch_id=${PATCH_ID} \
    --operator_ns=${PROJECT_CPD_INST_OPERATORS} \
    --instance_ns=${PROJECT_CPD_INST_OPERANDS} \
    --image_pull_prefix=${IMAGE_PULL_PREFIX} \
    --image_pull_secret=${IMAGE_PULL_SECRET} \
    --param-file=/tmp/work/install-options.yml \
    --upgrade=true
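
The two commands above differ only in the --param-file option. If you script the upgrade, one possible approach is to build the command from a single template and add that option only when an options file exists. The following is a sketch under that assumption: print_upgrade_command is a hypothetical helper name, it prints the command for review instead of running it, and the --param-file value is copied unchanged from the command in this task.

```shell
# Hypothetical wrapper: print the install-components command from this task,
# adding --param-file only when an options file exists on the workstation.
# It prints the command instead of running it so that you can review it first.
print_upgrade_command() {
  puc_options_file="$1"   # workstation path to install-options.yml
  puc_cmd="cpd-cli manage install-components"
  puc_cmd="${puc_cmd} --license_acceptance=true"
  puc_cmd="${puc_cmd} --components=analyticsengine"
  puc_cmd="${puc_cmd} --release=${VERSION}"
  puc_cmd="${puc_cmd} --patch_id=${PATCH_ID}"
  puc_cmd="${puc_cmd} --operator_ns=${PROJECT_CPD_INST_OPERATORS}"
  puc_cmd="${puc_cmd} --instance_ns=${PROJECT_CPD_INST_OPERANDS}"
  puc_cmd="${puc_cmd} --image_pull_prefix=${IMAGE_PULL_PREFIX}"
  puc_cmd="${puc_cmd} --image_pull_secret=${IMAGE_PULL_SECRET}"
  if [ -f "${puc_options_file}" ]; then
    puc_cmd="${puc_cmd} --param-file=/tmp/work/install-options.yml"
  fi
  puc_cmd="${puc_cmd} --upgrade=true"
  echo "${puc_cmd}"
}
```

For example, print_upgrade_command cpd-cli-workspace/olm-utils-workspace/work/install-options.yml prints the custom form of the command if that file exists and the default form otherwise.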

Validating the upgrade

Analytics Engine powered by Apache Spark is upgraded when the install-components command returns:
[SUCCESS]... The install-components command ran successfully

If you want to confirm that the custom resource status is Completed, you can run the cpd-cli manage get-cr-status command:

cpd-cli manage get-cr-status \
--cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
--components=analyticsengine
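
If you capture the output of the install-components command in a file, you can check for the success message that is described at the start of this section without reading the whole log. The following is a minimal sketch: upgrade_succeeded is a hypothetical helper name, and the log file is one that you create yourself, for example by piping the command output through tee.

```shell
# Hypothetical helper: succeed only if the given log file contains the
# success message that this task says install-components prints.
upgrade_succeeded() {
  grep -q "The install-components command ran successfully" "$1"
}
```

For example, if you saved the output to a file named upgrade.log, upgrade_succeeded upgrade.log returns zero only when the success message is present.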

Upgrading existing service instances

After you upgrade Analytics Engine powered by Apache Spark, you must upgrade any service instances that are associated with Analytics Engine powered by Apache Spark.

Before you begin

Create a profile on the workstation from which you will upgrade the service instances.

The profile must be associated with an IBM Software Hub user who has either of the following permissions:

  • Create service instances (can_provision)
  • Manage service instances (manage_service_instances)

For more information, see Creating a profile to use the cpd-cli management commands.

Procedure

To upgrade the service instances:

cpd-cli service-instance upgrade \
--service-type=spark \
--profile=${CPD_PROFILE_NAME} \
--all

What to do next

  1. If you used self-signed certificates or CA certificates to securely connect between the Spark runtime and your resources, you need to add these certificates to the Spark truststore again after upgrading Analytics Engine powered by Apache Spark. For details, see Using a CA certificate to connect to internal servers from the platform.
  2. Analytics Engine powered by Apache Spark is ready to use. For details, see Extending analytics using Spark.