Installing Analytics Engine powered by Apache Spark

An instance administrator can install Analytics Engine powered by Apache Spark on IBM Cloud Pak® for Data Version 4.8.

Who needs to complete this task?

Instance administrator: To install Analytics Engine powered by Apache Spark, you must be an instance administrator. An instance administrator has permission to install software in the following projects:

The operators project for the instance

The operators for this instance of Cloud Pak for Data are installed in the operators project.

In the installation commands, the ${PROJECT_CPD_INST_OPERATORS} environment variable refers to the operators project.

The operands project for the instance

The Cloud Pak for Data control plane and the services for this instance of Cloud Pak for Data are installed in the operands project.

In the installation commands, the ${PROJECT_CPD_INST_OPERANDS} environment variable refers to the operands project.

When do you need to complete this task?

Review the following options to determine whether you need to complete this task:

  • If you want to install the Cloud Pak for Data control plane and one or more services at the same time, follow the process in Installing an instance of Cloud Pak for Data instead.
  • If you didn't install Analytics Engine powered by Apache Spark when you installed the Cloud Pak for Data control plane, complete this task to add Analytics Engine powered by Apache Spark to your environment.

    Repeat as needed: If you are responsible for multiple instances of Cloud Pak for Data, you can repeat this task to install more instances of Analytics Engine powered by Apache Spark on the cluster.

Information you need to complete this task

Review the following information before you install Analytics Engine powered by Apache Spark:

Version requirements

All of the components that are associated with an instance of Cloud Pak for Data must be installed at the same release. For example, if the Cloud Pak for Data control plane is installed at Version 4.8.7, you must install Analytics Engine powered by Apache Spark at Version 4.8.7.

Environment variables

The commands in this task use environment variables so that you can run the commands exactly as written.

  • If you don't have the script that defines the environment variables, see Setting up installation environment variables.
  • To use the environment variables from the script, you must source the environment variables before you run the commands in this task. For example, run:
    source ./cpd_vars.sh
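
For reference, the environment variables that this task relies on are typically exported by that script. The following excerpt is a minimal sketch with placeholder values; the project names, version, and credentials shown here are examples only, so substitute the values for your environment:

# Illustrative excerpt from cpd_vars.sh (placeholder values)
export PROJECT_CPD_INST_OPERATORS=cpd-operators       # operators project for the instance
export PROJECT_CPD_INST_OPERANDS=cpd-instance         # operands project for the instance
export VERSION=4.8.7                                  # Cloud Pak for Data release to install
export OCP_URL=https://<cluster-api-server>:<port>    # placeholder API server URL
export OCP_USERNAME=<username>                        # placeholder credentials
export OCP_PASSWORD=<password>
export CPDM_OC_LOGIN="cpd-cli manage login-to-ocp --username=${OCP_USERNAME} --password=${OCP_PASSWORD} --server=${OCP_URL}"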

Security context constraint

Analytics Engine powered by Apache Spark works with the default Red Hat® OpenShift® Container Platform security context constraint, restricted-v2.

Storage requirements

You don't need to specify storage information when you install Analytics Engine powered by Apache Spark. However, you do need to specify storage when you provision an instance of Analytics Engine powered by Apache Spark.

Before you begin

This task assumes that the following prerequisites are met:

  • The cluster meets the minimum requirements for installing Analytics Engine powered by Apache Spark. If this task is not complete, see System requirements.
  • The workstation from which you will run the installation is set up as a client workstation and includes the following command-line interfaces:
      • Cloud Pak for Data CLI: cpd-cli
      • OpenShift CLI: oc
    If this task is not complete, see Setting up a client workstation.
  • The Cloud Pak for Data control plane is installed. If this task is not complete, see Installing an instance of Cloud Pak for Data.
  • For environments that use a private container registry, such as air-gapped environments, the Analytics Engine powered by Apache Spark software images are mirrored to the private container registry. If this task is not complete, see Mirroring images to a private container registry.
  • For environments that use a private container registry, such as air-gapped environments, the cpd-cli is configured to pull the olm-utils-v2 image from the private container registry. If this task is not complete, see Pulling the olm-utils-v2 image from the private container registry.
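
If you want to confirm quickly that the client workstation prerequisites are in place, you can check that both CLIs respond before you continue. This is an optional sanity check; the exact output depends on the versions that you installed:

cpd-cli version
oc version --client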

Procedure

Complete the following tasks to install Analytics Engine powered by Apache Spark:

  1. Specifying installation options
  2. Installing the service
  3. Validating the installation
  4. What to do next

Analytics Engine powered by Apache Spark parameters

If you plan to install Analytics Engine powered by Apache Spark, you can specify the following installation options in a file named install-options.yml in the work directory. In the apply-cr command later in this task, this file is referenced by the path /tmp/work/install-options.yml.

The parameters are optional. If you do not set these installation parameters, the default values are used. Uncomment the parameters that you want to override and update the values appropriately.

The sample YAML content uses the default values.

################################################################################
# Analytics Engine powered by Apache Spark parameters
################################################################################

# ------------------------------------------------------------------------------
# Analytics Engine powered by Apache Spark service configuration parameters
# ------------------------------------------------------------------------------
#analyticsengine_spark_adv_enabled: true
#analyticsengine_job_auto_delete_enabled: true
#analyticsengine_kernel_cull_time: 30
#analyticsengine_image_pull_parallelism: "40"
#analyticsengine_image_pull_completions: "20"
#analyticsengine_kernel_cleanup_schedule: "*/30 * * * *"
#analyticsengine_job_cleanup_schedule: "*/30 * * * *"
#analyticsengine_skip_selinux_relabeling: false
#analyticsengine_mount_customizations_from_cchome: false

# ------------------------------------------------------------------------------
# Spark runtime configuration parameters
# ------------------------------------------------------------------------------
#analyticsengine_max_driver_cpu_cores: 5          # The number of CPUs to allocate to the Spark jobs driver. The default is 5.
#analyticsengine_max_executor_cpu_cores: 5        # The number of CPUs to allocate to the Spark jobs executor. The default is 5.
#analyticsengine_max_driver_memory: "50g"         # The amount of memory, in gigabytes, to allocate to the driver. The default is 50g.
#analyticsengine_max_executor_memory: "50g"       # The amount of memory, in gigabytes, to allocate to the executor. The default is 50g.
#analyticsengine_max_num_workers: 50              # The number of workers (also called executors) to allocate to Spark jobs. The default is 50.
#analyticsengine_local_dir_scale_factor: 10       # The number that is used to calculate the temporary disk size on Spark nodes. The formula is temp_disk_size = number_of_cpu * local_dir_scale_factor. The default is 10.
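
For example, if you want to turn off the Spark job UI and allow up to 100 workers per Spark job, you could keep only the following lines uncommented in install-options.yml. The values shown here are illustrative; choose values that fit your workloads:

analyticsengine_spark_adv_enabled: false
analyticsengine_max_num_workers: 100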

Analytics Engine powered by Apache Spark service configuration parameters

The service configuration parameters determine how the Analytics Engine powered by Apache Spark service behaves.

Property Description
analyticsengine_spark_adv_enabled Specify whether to display the job UI.
Default value
true
Valid values
false
Do not display the job UI.
true
Display the job UI.
analyticsengine_job_auto_delete_enabled Specify whether to automatically delete jobs after they reach a terminal state, such as FINISHED or FAILED.
Default value
true
Valid values
true
Delete jobs after they reach a terminal state.
false
Retain jobs after they reach a terminal state.
analyticsengine_kernel_cull_time The amount of time, in minutes, that idle kernels are kept before they are culled.
Default value
30
Valid values
An integer greater than 0.
analyticsengine_image_pull_parallelism The number of pods that are scheduled to pull the Spark image in parallel.

For example, if you have 100 nodes in the cluster, set:

  • analyticsengine_image_pull_completions: "100"
  • analyticsengine_image_pull_parallelism: "150"

In this example, at least 100 nodes pull the image successfully, with up to 150 pods pulling the image in parallel.

Default value
"40"
Valid values
An integer greater than or equal to 1.

Increase this value only if you have a very large cluster and you have sufficient network bandwidth and disk I/O to support more pulls in parallel.

analyticsengine_image_pull_completions The number of pods that must complete successfully for the image pull job to be considered complete.

For example, if you have 100 nodes in the cluster, set:

  • analyticsengine_image_pull_completions: "100"
  • analyticsengine_image_pull_parallelism: "150"

In this example, at least 100 nodes pull the image successfully, with up to 150 pods pulling the image in parallel.

Default value
"20"
Valid values
An integer greater than or equal to 1.

Increase this value only if you have a very large cluster and you have sufficient network bandwidth and disk I/O to support more pulls in parallel.

analyticsengine_kernel_cleanup_schedule Override the default schedule for the kernel cleanup CronJob.

By default, the kernel cleanup CronJob runs every 30 minutes.

Default value
"*/30 * * * *"
Valid values
A string that uses the CronJob schedule syntax.
analyticsengine_job_cleanup_schedule Override the default schedule for the job cleanup CronJob.

By default, the job cleanup CronJob runs every 30 minutes.

Default value
"*/30 * * * *"
Valid values
A string that uses the CronJob schedule syntax. For an example schedule, see the sketch after this table.
analyticsengine_skip_selinux_relabeling Specify whether to skip the SELinux relabeling.

To use this feature, you must create the required MachineConfig and RuntimeClass definitions. For more information, see Enabling MachineConfig and RuntimeClass definitions for certain properties.

Default value
false
Valid values
false
Do not skip the SELinux relabeling.
true
Skip the SELinux relabeling.
analyticsengine_mount_customizations_from_cchome Specify whether you want to enable custom drivers. The drivers must be mounted from the cc-home-pvc directory.

Common core services: This feature is available only when the Cloud Pak for Data common core services are installed.

Default value
false
Valid values
false
You do not want to use custom drivers.
true
You want to enable custom drivers.
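
Both cleanup schedules use standard CronJob schedule syntax. As an illustrative example, the following overrides in install-options.yml would run the kernel and job cleanup CronJobs every 6 hours instead of every 30 minutes; the schedule values are examples only:

analyticsengine_kernel_cleanup_schedule: "0 */6 * * *"
analyticsengine_job_cleanup_schedule: "0 */6 * * *"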

Spark runtime configuration parameters

The runtime configuration parameters determine how the Spark runtimes generated by the Analytics Engine powered by Apache Spark service behave.

Property Description
analyticsengine_max_driver_cpu_cores The number of CPUs to allocate to the Spark jobs driver.
Default value
5
Valid values
An integer greater than or equal to 1.
analyticsengine_max_executor_cpu_cores The number of CPUs to allocate to the Spark jobs executor.
Default value
5
Valid values
An integer greater than or equal to 1.
analyticsengine_max_driver_memory The amount of memory, in gigabytes, to allocate to the driver.
Default value
"50g"
Valid values
An integer greater than or equal to 1.
analyticsengine_max_executor_memory The amount of memory, in gigabytes, to allocate to the executor.
Default value
"50g"
Valid values
An integer greater than or equal to 1.
analyticsengine_max_num_workers The number of workers (also called executors) to allocate to Spark jobs.
Default value
50
Valid values
An integer greater than or equal to 1.
analyticsengine_local_dir_scale_factor The number that is used to calculate the temporary disk size on Spark nodes.

The formula is:

temp_disk_size = number_of_cpu * local_dir_scale_factor
Default value
10
Valid values
An integer greater than or equal to 1.
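
As a worked example of this formula, assume a Spark node with 4 CPUs (an illustrative value) and the default scale factor of 10:

temp_disk_size = 4 * 10 = 40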

Installing the service

To install Analytics Engine powered by Apache Spark:

  1. Log the cpd-cli in to the Red Hat OpenShift Container Platform cluster:
    ${CPDM_OC_LOGIN}
    Remember: CPDM_OC_LOGIN is an alias for the cpd-cli manage login-to-ocp command.
  2. Run the following command to create the required OLM objects for Analytics Engine powered by Apache Spark in the operators project for the instance:
    cpd-cli manage apply-olm \
    --release=${VERSION} \
    --cpd_operator_ns=${PROJECT_CPD_INST_OPERATORS} \
    --components=analyticsengine
    Wait for the cpd-cli to return the following message before you proceed to the next step:
    [SUCCESS]... The apply-olm command ran successfully

    If the apply-olm command fails, see Troubleshooting the apply-olm command during installation or upgrade.

  3. Create the custom resource for Analytics Engine powered by Apache Spark.

    Run the appropriate command to create the custom resource.

    Default installation (without installation options)
    cpd-cli manage apply-cr \
    --components=analyticsengine \
    --release=${VERSION} \
    --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
    --license_acceptance=true
    Custom installation (with installation options)
    cpd-cli manage apply-cr \
    --components=analyticsengine \
    --release=${VERSION} \
    --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
    --param-file=/tmp/work/install-options.yml \
    --license_acceptance=true

Validating the installation

Analytics Engine powered by Apache Spark is installed when the apply-cr command returns:
[SUCCESS]... The apply-cr command ran successfully

If you want to confirm that the custom resource status is Completed, you can run the cpd-cli manage get-cr-status command:

cpd-cli manage get-cr-status \
--cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
--components=analyticsengine

What to do next

Before you can submit Spark jobs by using the Spark jobs API, you must provision a service instance. See Provisioning the service instance.

Optionally, you can perform the following tasks:

  1. For additional properties that you can specify in the custom resource, see Specifying additional configurations for Analytics Engine powered by Apache Spark.
  2. An instance administrator can set the scale of the service to adjust the number of available pods. See Scaling services.

The service is ready to use. See Spark environments.