Installing Analytics Engine Powered by Apache Spark

A project administrator can install Analytics Engine Powered by Apache Spark on IBM® Cloud Pak for Data.

Permissions you need for this task
You must be an administrator of the OpenShift® project (Kubernetes namespace) where you will deploy Analytics Engine Powered by Apache Spark.
Information you need to complete this task
  • Analytics Engine Powered by Apache Spark needs only the restricted security context constraint (SCC).
  • Analytics Engine Powered by Apache Spark must be installed in the same project as Cloud Pak for Data.
  • Analytics Engine Powered by Apache Spark uses the following storage classes. If you don't use these storage classes on your cluster, ensure that you have a storage class with an equivalent definition (see the check after this list):
    • OpenShift Container Storage: ocs-storagecluster-cephfs
    • NFS: managed-nfs-storage
    • Portworx: portworx-shared-gp3
    • IBM Cloud File Storage: ibmc-file-gold-gid or ibm-file-custom-gold-gid
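
You can check whether a suitable storage class exists before you start. The following commands are a minimal sketch; the storage class name in the second command is one of the defaults listed above and might differ in your environment.

  # List all storage classes on the cluster
  oc get storageclass

  # Confirm that a specific storage class exists (replace with your storage class name)
  oc get storageclass ocs-storagecluster-cephfs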

Before you begin

Ensure that the cluster meets the minimum requirements for installing Analytics Engine Powered by Apache Spark. For details, see System requirements.

Additionally, ensure that a cluster administrator completed the required Pre-installation tasks for your environment. Specifically, verify that a cluster administrator completed the following tasks:

  1. Cloud Pak for Data is installed. For details, see Installing Cloud Pak for Data.
  2. For environments that use a private container registry, such as air-gapped environments, the Analytics Engine Powered by Apache Spark software images are mirrored to the private container registry. For details, see Mirroring images to your container registry.
  3. The cluster is configured to pull the Analytics Engine Powered by Apache Spark software images. For details, see Configuring your cluster to pull images.
  4. The Analytics Engine Powered by Apache Spark operator subscription exists. For details, see Creating operator subscriptions.

If these tasks are not complete, the Analytics Engine Powered by Apache Spark installation will fail.
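
You can spot-check the last two tasks from the command line before you continue. The following commands are a sketch; the exact subscription and operator names depend on how the operator subscriptions were created in your environment.

  # List operator subscriptions across projects and look for the Analytics Engine operator
  oc get subscriptions.operators.coreos.com --all-namespaces | grep -i analyticsengine

  # Confirm that the corresponding ClusterServiceVersion reports the Succeeded phase
  oc get csv --all-namespaces | grep -i analyticsengine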

Procedure

Complete the following tasks to install Analytics Engine Powered by Apache Spark:

  1. Installing the service
  2. Verifying the installation
  3. What to do next

Installing the service

To install Analytics Engine Powered by Apache Spark:

  1. Log in to Red Hat® OpenShift Container Platform as a user with sufficient permissions to complete the task:
    oc login OpenShift_URL:port
  2. Create an AnalyticsEngine custom resource to install Analytics Engine Powered by Apache Spark. Follow the appropriate guidance for your environment.
    Tip: For additional properties that you can specify in the custom resource, see Additional installation options.
    • If you use OpenShift Container Storage, NFS, or Portworx storage, specify the storageVendor property. The recommended storage class names are described in Setting up shared persistent storage.

      Create a custom resource with the following format.

      cat <<EOF |oc apply -f -
      apiVersion: ae.cpd.ibm.com/v1
      kind: AnalyticsEngine
      metadata:
        name: analyticsengine-sample     # This is the recommended name, but you can change it
        namespace: project-name     # Replace with the project where you will install Analytics Engine Powered by Apache Spark
      spec:
        license:
          accept: true
          license: Enterprise|Standard     # Specify the license you purchased
        version: 4.0.1
        storageVendor: nfs|ocs|portworx     # Specify the type of storage to use, such as ocs
      EOF
    • If you use another type of storage, specify the storageClass property. Important: Use a storage class with attributes similar to the storage class described in the Service persistent storage requirements section of Storage requirements.

      Create a custom resource with the following format.

      cat <<EOF |oc apply -f -
      apiVersion: ae.cpd.ibm.com/v1
      kind: AnalyticsEngine
      metadata:
        name: analyticsengine-sample     # This is the recommended name, but you can change it
        namespace: project-name     # Replace with the project where you will install Analytics Engine Powered by Apache Spark
      spec:
        license:
          accept: true
          license: Enterprise|Standard     # Specify the license you purchased
        version: 4.0.1
        storageClass: storage-class-name     # See the guidance in "Information you need to complete this task"
      EOF
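
If you prefer to keep the custom resource in a file instead of a heredoc, you can validate it against the API server before you create it. This is a sketch; analyticsengine-cr.yaml is a placeholder file name that contains one of the custom resources shown above.

  # Validate the custom resource on the server without creating it
  oc apply -f analyticsengine-cr.yaml --dry-run=server

  # Create the custom resource
  oc apply -f analyticsengine-cr.yaml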

Additional installation options

The following specifications are optional and can be changed to adjust service-level configurations.

serviceConfig:
  sparkAdvEnabled: true                        # Enables or disables the job UI capabilities of Analytics Engine Powered by Apache Spark
  jobAutoDeleteEnabled: true                   # Set this to false if you do not want Analytics Engine Powered by Apache Spark jobs to be removed after they reach a terminal state, such as FINISHED or FAILED
  fipsEnabled: false                           # Set this to true if your system is FIPS enabled
  kernelCullTime: 30                           # The number of minutes after which an idle kernel is removed
  imagePullCompletions: 20                     # If you have a large OpenShift cluster, update imagePullCompletions and imagePullParallelism accordingly.
  imagePullParallelism: "40"                   # For example, if the cluster has 100 nodes, set imagePullCompletions: "100" and imagePullParallelism: "150"
  kernelCleanupSchedule: "*/30 * * * *"        # By default, the kernel and job cleanup cron jobs look for idle Spark kernels and jobs based on the kernelCullTime parameter
  jobCleanupSchedule: "*/30 * * * *"           # and remove them. For a more or less aggressive cleanup, change these values. For example, "0 */1 * * *" (Kubernetes cron format) runs the cleanup every hour
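
If Analytics Engine Powered by Apache Spark is already installed, you can typically change these options by patching the AnalyticsEngine custom resource rather than recreating it; the operator reconciles the change. The following command is a sketch that assumes the default custom resource name (analyticsengine-sample) and uses project-name as a placeholder for your project.

  oc patch AnalyticsEngine analyticsengine-sample \
    --namespace project-name \
    --type merge \
    --patch '{"spec": {"serviceConfig": {"sparkAdvEnabled": true, "kernelCullTime": 60}}}'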

The following specifications are optional and can be changed to adjust Spark runtime-level configurations.

sparkRuntimeConfig:
  maxDriverCpuCores: 5                         # Increase this value if you want to create Spark jobs with more than 5 driver CPUs
  maxExecutorCpuCores: 5                       # Increase this value if you want to create Spark jobs with more than 5 CPUs per executor
  maxDriverMemory: "50g"                       # Increase this value if you want to create Spark jobs with more than 50 GB of driver memory
  maxExecutorMemory: "50g"                     # Increase this value if you want to create Spark jobs with more than 50 GB of memory per executor
  maxNumWorkers: 50                            # Increase this value if you want to create Spark jobs with more than 50 workers (executors)

The following specifications are optional and can be changed to adjust service instance-level configurations. Each Analytics Engine Powered by Apache Spark service instance has a default resource quota (CPU and memory). The quota for an existing instance can be changed through the API, but to change the default values that are applied to new instances, update the following values.

serviceInstanceConfig:
  defaultCpuQuota: 20                          # defaultCpuQuota is the cumulative CPU consumption of the Spark jobs created under an instance. By default, it can be no more than 20 CPUs
  defaultMemoryQuota: 80                       # defaultMemoryQuota is the cumulative memory consumption of the Spark jobs created under an instance. By default, it can be no more than 80 gigabytes
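
These optional sections are nested under spec in the AnalyticsEngine custom resource, as shown in Table 1. The following heredoc is a sketch that combines them with the required properties; the values shown are the defaults documented in this topic.

cat <<EOF |oc apply -f -
apiVersion: ae.cpd.ibm.com/v1
kind: AnalyticsEngine
metadata:
  name: analyticsengine-sample
  namespace: project-name     # Replace with the project where you will install Analytics Engine Powered by Apache Spark
spec:
  license:
    accept: true
    license: Enterprise
  version: 4.0.1
  storageClass: storage-class-name     # Replace with your storage class
  serviceConfig:
    sparkAdvEnabled: true
  sparkRuntimeConfig:
    maxDriverCpuCores: 5
    maxExecutorMemory: "50g"
  serviceInstanceConfig:
    defaultCpuQuota: 20
    defaultMemoryQuota: 80
EOF
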
Table 1. Analytics Engine Powered by Apache Spark custom resource description

Property | Type | Default | Required/Optional | Description
spec.version | String | | Optional | To avoid automatically upgrading the service, specify the Cloud Pak for Data version that you want; otherwise, do not set this parameter.
spec.license | Object | | Required |
spec.license.accept | String | True | Required |
spec.license.license | String (choice parameter) | Enterprise | Required | Possible values: Enterprise or Standard.
spec.scaleConfig | String (choice parameter) | Small | Optional | Possible values: Small, Medium, or Large.
spec.serviceConfig | Object | | Optional | Change service-level configurations.
spec.serviceConfig.sparkAdvEnabled | Boolean | False | Optional | Enables or disables the job UI capabilities of Analytics Engine Powered by Apache Spark.
spec.serviceConfig.jobAutoDeleteEnabled | Boolean | True | Optional | Set to false if you do not want jobs to be removed after they reach a terminal state, such as FINISHED or FAILED.
spec.serviceConfig.fipsEnabled | Boolean | False | Optional | Set to true if your system is FIPS enabled.
spec.serviceConfig.kernelCullTime | Integer | 30 | Optional | The number of minutes after which an idle kernel is removed.
spec.serviceConfig.imagePullCompletions | Integer | 20 | Optional | If you have a large OpenShift cluster, update imagePullCompletions and imagePullParallelism accordingly.
spec.serviceConfig.imagePullParallelism | Integer | 40 | Optional | For example, if the cluster has 100 nodes, set imagePullCompletions: "100" and imagePullParallelism: "150".
spec.serviceConfig.kernelCleanupSchedule | String | "*/30 * * * *" | Optional | By default, the kernel and job cleanup cron jobs look for idle Spark kernels and jobs based on the kernelCullTime parameter. For a more or less aggressive cleanup, change the value. For example, "0 */1 * * *" (Kubernetes cron format) runs the cleanup every hour.
spec.serviceConfig.jobCleanupSchedule | String | "*/30 * * * *" | Optional | By default, the kernel and job cleanup cron jobs look for idle Spark kernels and jobs based on the kernelCullTime parameter. For a more or less aggressive cleanup, change the value. For example, "0 */1 * * *" (Kubernetes cron format) runs the cleanup every hour.
spec.sparkRuntimeConfig | Object | | Optional | Change Spark runtime-level configurations.
spec.sparkRuntimeConfig.maxDriverCpuCores | Integer | 5 | Optional | Maximum number of driver CPUs.
spec.sparkRuntimeConfig.maxExecutorCpuCores | Integer | 5 | Optional | Maximum number of executor CPUs.
spec.sparkRuntimeConfig.maxDriverMemory | String | 50g | Optional | Maximum driver memory in gigabytes.
spec.sparkRuntimeConfig.maxExecutorMemory | String | 50g | Optional | Maximum executor memory in gigabytes.
spec.sparkRuntimeConfig.maxNumWorkers | Integer | 50 | Optional | Maximum number of workers (executors).
spec.serviceInstanceConfig | Object | | Optional | Service instance-level configurations. Each Analytics Engine Powered by Apache Spark service instance has a default resource quota (CPU and memory). The quota for an existing instance can be changed through the API; to change the default values for new instances, update serviceInstanceConfig.
spec.serviceInstanceConfig.defaultCpuQuota | Integer | 20 | Optional | The cumulative CPU consumption of the Spark jobs created under an instance. By default, it can be no more than 20 CPUs.
spec.serviceInstanceConfig.defaultMemoryQuota | String | 80 | Optional | The cumulative memory consumption of the Spark jobs created under an instance. By default, it can be no more than 80 gigabytes.
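
The spec.scaleConfig property from Table 1 does not appear in the earlier examples. As a sketch, it is set alongside the other properties under spec in the AnalyticsEngine custom resource; Small is the default.

# Fragment of an AnalyticsEngine custom resource (not a complete resource)
spec:
  license:
    accept: true
    license: Enterprise
  version: 4.0.1
  scaleConfig: Medium     # Possible values: Small, Medium, or Large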

Verifying the installation

When you create the custom resource, the Analytics Engine Powered by Apache Spark operator processes the contents of the custom resource and starts up the microservices that comprise Analytics Engine Powered by Apache Spark, including AnalyticsEngine. (The AnalyticsEngine microservice is defined by the analyticsengine-sample custom resource.) Analytics Engine Powered by Apache Spark is installed when the AnalyticsEngine status is Completed.

To check the status of the installation:

  1. Change to the project where you installed Analytics Engine Powered by Apache Spark:
    oc project project-name
  2. Get the status of Analytics Engine Powered by Apache Spark (analyticsengine-sample):
    oc get AnalyticsEngine analyticsengine-sample -o jsonpath='{.status.analyticsengineStatus} {"\n"}'

    Analytics Engine Powered by Apache Spark is ready when the command returns Completed.
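
If you are installing from a script, you can poll until the status reaches Completed. The following loop is a sketch that assumes the default custom resource name analyticsengine-sample and checks the status every 60 seconds.

  while [[ "$(oc get AnalyticsEngine analyticsengine-sample -o jsonpath='{.status.analyticsengineStatus}')" != "Completed" ]]; do
    echo "Waiting for Analytics Engine Powered by Apache Spark to finish installing..."
    sleep 60
  done
  echo "Analytics Engine Powered by Apache Spark is installed."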

What to do next

Complete the following tasks in order before users can access the service:

  1. A project administrator can set the scale of the service to adjust the number of available pods. See Scaling services.
  2. Before you can submit Spark jobs by using the Spark jobs API, you must provision a service instance. See Provisioning the service instance.
  3. The service is ready to use. See Spark environments.