Best practices for tuning Spark applications

The following best practices describe how to configure your Spark application and cluster, and how to monitor and profile the Spark application.

Spark executor configuration

To change the resources for the Spark application in the notebook, complete the following steps:

  1. Stop the pre-created Spark context, and then create a new Spark context with the proper resource configuration. Python example:

    # Stop the existing context before creating a new one
    sc.stop()
    from pyspark import SparkConf, SparkContext
    conf = (SparkConf()
        .set("spark.cores.max", "15")                          # total cores across all executors
        .set("spark.dynamicAllocation.initialExecutors", "3")  # executors to start with
        .set("spark.executor.cores", "5")                      # cores per executor
        .set("spark.executor.memory", "6g"))                   # memory per executor
    sc = SparkContext(conf=conf)
    
  2. Verify the new settings by running the following command in a cell that uses the new Spark context:

    for item in sorted(sc._conf.getAll()): print(item)
    

Note that these resource settings also apply when notebooks are run as scheduled jobs.

More details can be found in Creating notebooks.

Spark compute nodes

If you add more compute nodes, one additional Spark worker is started on each added compute node. Each Spark application consists of a Spark driver (the Jupyter pod) and Spark executors (Spark worker pods). Both the Jupyter pod and the Spark worker pods run on the compute nodes. When you are estimating the compute node requirements for your workload, complete the following steps:

  1. Estimate each application's CPU/memory usage. See Jupyter notebook configuration and Spark executor configuration.
  2. Calculate the total CPU/memory usage as the number of concurrent applications multiplied by each application's CPU/memory usage.
  3. Prepare the compute nodes based on the total CPU/memory usage. In all cases, it is recommended that you allocate at most 75% of the memory for Spark, and leave the rest for the operating system and buffer cache.
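The sizing arithmetic in the steps above can be sketched as follows. The per-application figures and the concurrency count are hypothetical examples for illustration, not defaults:

```python
# Hypothetical per-application resource usage (step 1)
app_cpu_cores = 6        # e.g. 5 executor cores + 1 driver core
app_memory_gb = 8        # e.g. 6 GB executor memory + 2 GB driver memory
concurrent_apps = 4      # expected number of concurrent Spark applications

# Step 2: total usage = number of concurrent applications x per-application usage
total_cpu_cores = concurrent_apps * app_cpu_cores
total_memory_gb = concurrent_apps * app_memory_gb

# Step 3: Spark should get at most 75% of node memory, so the compute
# nodes must supply at least total / 0.75 memory in aggregate
required_node_memory_gb = total_memory_gb / 0.75

print(total_cpu_cores, total_memory_gb, round(required_node_memory_gb, 1))
# -> 24 32 42.7
```

Under these example numbers, the cluster needs compute nodes providing at least 24 cores and roughly 43 GB of memory in total.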

Note that Spark on the Watson Studio cluster runs in standalone mode on Kubernetes with dynamic allocation enabled.
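For reference, dynamic allocation is controlled by standard Spark properties. The following is a sketch of a spark-defaults.conf fragment; the values shown are illustrative, not the cluster's actual defaults:

```
spark.dynamicAllocation.enabled           true
spark.dynamicAllocation.initialExecutors  3
spark.dynamicAllocation.minExecutors      1
spark.dynamicAllocation.maxExecutors      10
spark.shuffle.service.enabled             true
```

With dynamic allocation, Spark grows and shrinks the executor count between the min and max bounds based on workload, which is why the notebook example above sets only the initial executor count.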

Monitoring

You can monitor the per-pod CPU/memory/disk usage from the dashboard in the Admin Console.

Admin dashboard

The Watson Studio administrator can also sign in to the cluster and use the command line to check the usage of each pod:

[root@allan5-master-1 ~]# kubectl top pod -n zen jupyter-server-1531839933048-999-5dd9b4f47f-qs6js
NAME                                                CPU(cores)   MEMORY(bytes)
jupyter-server-1531839933048-999-5dd9b4f47f-qs6js   5m           102Mi