Specifying additional configurations for Analytics Engine Powered by Apache Spark

A project administrator can specify additional configurations for Analytics Engine Powered by Apache Spark on IBM Cloud Pak® for Data.

You can set additional configurations, other than the default ones, as part of pre-install or post-install steps. The following specifications are optional and can be altered to change the service level configurations.

serviceConfig:
  schedulerForQuotaAndQueuing: "ibm-cpd-scheduler" # set the scheduler for resource quota and queueing features. Supported scheduler is ibm-cpd-scheduler.
  sparkAdvEnabled: true                  # This flag will enable or disable job UI capabilities of Analytics Engine Powered by Apache Spark
  jobAutoDeleteEnabled: true             # Set this to false if you do not want to remove Analytics Engine Powered by Apache Spark jobs once they have reached terminal states. For example, FINISHED/FAILED
  kernelCullTime: 30                     # Number of minutes a kernel can be idle before it is removed
  imagePullCompletions: 20               # If you have a large OpenShift cluster, update imagePullCompletions and imagePullParallelism accordingly
  imagePullParallelism: "40"             # For example, if you have 100 nodes in the cluster, set imagePullCompletions: "100" and imagePullParallelism: "150"
  kernelCleanupSchedule: "*/30 * * * *"  # The kernel and job cleanup cron jobs look for idle Spark kernels/jobs based on the kernelCullTime parameter
  jobCleanupSchedule: "*/30 * * * *"     # and remove them. For a less or more aggressive cleanup, change these values. For example, "0 */1 * * *" (Kubernetes cron format) runs the cleanup every hour
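
Table 1 below shows that these keys are nested under spec.serviceConfig in the service's custom resource. As an illustrative sketch only (assuming you are editing that custom resource directly), the block above maps into the spec like this:

spec:
  serviceConfig:
    sparkAdvEnabled: true          # enable the job UI capabilities
    jobAutoDeleteEnabled: true     # remove jobs after they reach a terminal state
    kernelCullTime: 30             # remove kernels that have been idle for 30 minutes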

The following specifications are optional and can be altered to change Spark runtime level configurations.

sparkRuntimeConfig:                   
  maxDriverCpuCores: 5                         # If you want to create Spark jobs with more than 5 driver CPUs, set this value accordingly
  maxExecutorCpuCores: 5                       # If you want to create Spark jobs with more than 5 CPUs per executor, set this value accordingly
  maxDriverMemory: "50g"                       # If you want to create Spark jobs with more than 50g of driver memory, set this value accordingly
  maxExecutorMemory: "50g"                     # If you want to create Spark jobs with more than 50g of memory per executor, set this value accordingly
  maxNumWorkers: 50                            # If you want to create Spark jobs with more than 50 workers/executors, set this value accordingly
  localDirScaleFactor: 10                      # If you want to increase local disk space for your Spark jobs, set this value accordingly.
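
The localDirScaleFactor setting scales temporary disk space with the number of CPUs (temp_disk_space = numCpu * localDirScaleFactor, as described in Table 1). A short illustration, assuming the factor is expressed in gigabytes per CPU:

sparkRuntimeConfig:
  maxNumWorkers: 100          # allow jobs with up to 100 workers/executors
  localDirScaleFactor: 20     # an executor that requests 4 CPUs gets roughly 4 * 20 = 80 GB of temporary disk space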

The following specifications are optional and can be altered to change the service instance level configurations. Each Analytics Engine Powered by Apache Spark service instance has a resource quota (CPU/memory) set by default. The quota can be changed through the API for an existing instance, but to change the default values used for any new instance, update the following values.

serviceInstanceConfig:                   
  defaultCpuQuota: 20                       # The accumulative CPU consumption of Spark jobs created under an instance cannot exceed this quota
  defaultMemoryQuota: 80                    # The accumulative memory consumption, in gigabytes, of Spark jobs created under an instance cannot exceed this quota
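
For example, to have new service instances start with a larger quota, you could raise both defaults as sketched below; this affects only instances created after the change, while the quota of an existing instance is changed through the API as noted above:

serviceInstanceConfig:
  defaultCpuQuota: 40          # new instances allow up to 40 CPUs of accumulative Spark job usage
  defaultMemoryQuota: 160      # new instances allow up to 160 gigabytes of accumulative Spark job memory
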
Table 1. Analytics Engine Powered by Apache Spark Custom Resource description

| Property | Description | Type | Specification for .yml files |
|---|---|---|---|
| spec.scaleConfig | Possible values: Small, Medium, or Large. Default: Small | String (choice parameter) | N/A |
| spec.serviceConfig | Service level configurations. | Object | N/A |
| spec.serviceConfig.schedulerForQuotaAndQueuing | Sets the scheduler for the resource quota and queuing features. The supported scheduler is ibm-cpd-scheduler. | String | N/A |
| spec.serviceConfig.sparkAdvEnabled | Enables or disables the job UI capabilities of Analytics Engine Powered by Apache Spark. Default: false | Boolean | analyticsengine_spark_adv_enabled |
| spec.serviceConfig.jobAutoDeleteEnabled | Set to false if you do not want jobs to be removed once they reach a terminal state (for example, FINISHED or FAILED). Default: true | Boolean | analyticsengine_job_auto_delete_enabled |
| spec.serviceConfig.kernelCullTime | Number of minutes a kernel can be idle before it is removed. Default: 30 | Integer | analyticsengine_kernel_cull_time |
| spec.serviceConfig.imagePullCompletions | If you have a large OpenShift cluster, update imagePullCompletions and imagePullParallelism accordingly. Default: 20 | Integer | analyticsengine_image_pull_completions |
| spec.serviceConfig.imagePullParallelism | For example, if you have 100 nodes in the cluster, set imagePullCompletions: "100" and imagePullParallelism: "150". Default: 40 | Integer | analyticsengine_image_pull_parallelism |
| spec.serviceConfig.kernelCleanupSchedule | The kernel and job cleanup cron jobs look for idle Spark kernels/jobs based on the kernelCullTime parameter and remove them. For a less or more aggressive cleanup, change the value; for example, "0 */1 * * *" (Kubernetes cron format) runs the cleanup every hour. Default: "*/30 * * * *" | String | analyticsengine_kernel_cleanup_schedule |
| spec.serviceConfig.jobCleanupSchedule | The kernel and job cleanup cron jobs look for idle Spark kernels/jobs based on the kernelCullTime parameter and remove them. For a less or more aggressive cleanup, change the value; for example, "0 */1 * * *" (Kubernetes cron format) runs the cleanup every hour. Default: "*/30 * * * *" | String | analyticsengine_job_cleanup_schedule |
| spec.sparkRuntimeConfig | Spark runtime level configurations. | Object | N/A |
| spec.sparkRuntimeConfig.maxDriverCpuCores | Maximum number of driver CPUs. Default: 5 | Integer | analyticsengine_max_driver_cpu_cores |
| spec.sparkRuntimeConfig.maxExecutorCpuCores | Maximum number of executor CPUs. Default: 5 | Integer | analyticsengine_max_executor_cpu_cores |
| spec.sparkRuntimeConfig.maxDriverMemory | Maximum driver memory in gigabytes. Default: 50g | String | analyticsengine_max_driver_memory |
| spec.sparkRuntimeConfig.maxExecutorMemory | Maximum executor memory in gigabytes. Default: 50g | String | analyticsengine_max_executor_memory |
| spec.sparkRuntimeConfig.maxNumWorkers | Maximum number of workers/executors. Default: 50 | Integer | analyticsengine_max_num_workers |
| spec.sparkRuntimeConfig.localDirScaleFactor | Temporary disk size on a Spark master/worker is a factor of the number of CPUs: temp_disk_space = numCpu * localDirScaleFactor. Default: 10 | Integer | analyticsengine_local_dir_scale_factor |
| spec.serviceInstanceConfig | Service instance level configurations. Each Analytics Engine Powered by Apache Spark service instance has a resource quota (CPU/memory) set by default. The quota can be changed through the API for an existing instance; to change the default values used for new instances, update serviceInstanceConfig. | Object | N/A |
| spec.serviceInstanceConfig.defaultCpuQuota | The accumulative CPU consumption of Spark jobs created under an instance cannot exceed this quota. Default: 20 | Integer | analyticsengine_default_cpu_quota |
| spec.serviceInstanceConfig.defaultMemoryQuota | The accumulative memory consumption, in gigabytes, of Spark jobs created under an instance cannot exceed this quota. Default: 80 | String | analyticsengine_default_memory_quota |
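
Table 1 also lists spec.scaleConfig, which does not appear in the blocks above. A minimal sketch of setting it in the same custom resource spec:

spec:
  scaleConfig: Medium          # one of Small (default), Medium, or Large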

What to do next

Complete the following tasks in order before users can access the service:

  1. A project administrator can set the scale of the service to adjust the number of available pods. See Scaling services.
  2. Before you can submit Spark jobs by using the Spark jobs API, you must provision a service instance. See Provisioning the service instance.
  3. The service is ready to use. See Spark environments.