Configuring Spark Environment Variables

Spark environment variables can be configured at three distinct levels.

  • Service Level (Service-Wide Configuration): Applies to all Spark instances and jobs under the Analytics Engine service. Immutable variables defined here cannot be overridden at any lower level.
  • Instance Level (Specific to Instances of Analytics Engine): Applies to a single instance of the Analytics Engine. You can override the mutable variables defined here, but you cannot override the immutable variables.
  • Job/Kernel Level (Specific to individual Spark jobs): Environment variables that are passed when submitting a Spark job. Only mutable variables can be set at this level.
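The override rules above can be sketched as a simple merge. This is an illustrative model only, not part of the Analytics Engine API; the function name and the example values are assumptions made for demonstration:

```python
def effective_env(service_immutable, service_mutable,
                  instance_immutable, instance_mutable, job_env):
    """Compute the environment a Spark job effectively sees.

    Mutable variables cascade service -> instance -> job, with later
    levels overriding earlier ones. Immutable variables always win and
    cannot be overridden below the level that defines them.
    """
    env = {}
    # Mutable settings: each later level overrides the previous one.
    env.update(service_mutable)
    env.update(instance_mutable)
    env.update(job_env)
    # Immutable settings take precedence; attempts to change them at a
    # lower level are ignored.
    env.update(instance_immutable)
    env.update(service_immutable)
    return env

result = effective_env(
    service_immutable={"SPARK_WORKER_CORES": "4"},
    service_mutable={"SPARK_EXECUTOR_MEMORY": "8g"},
    instance_immutable={},
    instance_mutable={"SPARK_DRIVER_MEMORY": "6g"},
    job_env={"SPARK_EXECUTOR_MEMORY": "12g", "SPARK_WORKER_CORES": "8"},
)
# SPARK_EXECUTOR_MEMORY -> "12g" (job-level value overrides the mutable default)
# SPARK_WORKER_CORES    -> "4"  (immutable; the job-level value is ignored)
```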

Defining service level configuration

The Spark environment variables defined at service level (both immutable and mutable) in the Analytics Engine Custom Resource are global configurations that affect all instances and jobs under the Analytics Engine service.

apiVersion: ae.cpd.ibm.com/v1
kind: AnalyticsEngine
metadata:
  name: analyticsengine-sample
  namespace: cpd-instance
spec:
  blockStorageClass: managed-nfs-storage
  fileStorageClass: managed-nfs-storage
  sparkDefaults:
    immutableConfigs:
      spark.ui.requestHeaderSize: "12k"
    mutableConfigs:
      ae.kernel.idle_timeout: "1000"
    immutableEnvVars:
      SPARK_WORKER_CORES: "4"
      SPARK_EXECUTOR_INSTANCES: "2"
    mutableEnvVars:
      SPARK_EXECUTOR_MEMORY: "8g"
      SPARK_DRIVER_MEMORY: "4g"
  license:
    accept: true

Defining instance level configuration

The Spark environment variables defined at instance level (mutable and immutable) are stored in the instance database table.

Use the following request to set immutable environment variables (the endpoint accepts either PUT or PATCH):

curl -k -X PATCH "<cpd-route>/v4/analytics_engines/<instance_id>/immutable_env_vars" -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" --data-raw '{
  "SPARK_WORKER_CORES": "4",
  "SPARK_EXECUTOR_INSTANCES": "2"
}'

Use the following request to set mutable environment variables (the endpoint accepts either PUT or PATCH):

curl -k -X PATCH "<cpd-route>/v4/analytics_engines/<instance_id>/default_env_vars" -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" --data-raw '{
  "SPARK_EXECUTOR_MEMORY": "10g",
  "SPARK_DRIVER_MEMORY": "6g"
}'
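The same request can be built programmatically. The sketch below mirrors the curl command above using only the Python standard library; the route, instance ID, and token values are placeholders, and the helper function is an illustration rather than part of any official client:

```python
import json
import urllib.request

def build_env_vars_request(cpd_route, instance_id, token, env_vars,
                           endpoint="default_env_vars", method="PATCH"):
    """Build the HTTP request that updates instance-level environment
    variables. Pass endpoint='immutable_env_vars' to target the
    immutable variables instead of the mutable defaults."""
    url = f"{cpd_route}/v4/analytics_engines/{instance_id}/{endpoint}"
    return urllib.request.Request(
        url,
        data=json.dumps(env_vars).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method=method,
    )

# Placeholder route, instance ID, and token for illustration.
req = build_env_vars_request(
    "https://cpd.example.com", "inst-123", "TOKEN",
    {"SPARK_EXECUTOR_MEMORY": "10g", "SPARK_DRIVER_MEMORY": "6g"},
)
# urllib.request.urlopen(req)  # sending it requires a live cluster
```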

Defining kernel level configuration

The Spark environment variables defined at kernel level are specified while submitting Spark jobs.

curl -k -X POST "$job_endpoint" -H "Authorization: Bearer $token" -H "Content-Type: application/json" -d '{
  "name": "spark-job",
  "engine": {
    "env": {
      "SPARK_EXECUTOR_MEMORY": "12g",
      "SPARK_WORKER_CORES": "4"
    }
  }
}'

In this request, SPARK_EXECUTOR_MEMORY overrides the mutable value defined at a higher level, while SPARK_WORKER_CORES is ignored because it is declared as an immutable variable at the service or instance level.