Guidelines for Spark application code

A Spark application defines RDDs to describe how data is obtained and transformed. The relationships between RDDs are used to optimize the logical plan of the job. When writing your Spark application, keep a few application code guidelines in mind to optimize performance.

Set partition numbers properly

When creating Spark RDDs, explicitly specify the number of RDD partitions instead of relying on the default level of parallelism, which can be as low as 2. Many RDD APIs take an optional partitions (parallelism) parameter, and you can choose a suitable value according to your application workload. If you do not set it in the code, configure the spark.default.parallelism property instead.

When specifying partitions, bigger is not always better. More partitions mean more tasks processing data in parallel, and each task carries its own overhead, including communication and data-handling costs. Submitting too many tasks causes many TaskStarted and TaskEnded messages to be sent and handled in the messaging channel, and many events to be written to the event log; both factors slow down the driver. Task metrics also consume memory when the Spark UI is rebuilt, and can crash the driver program with a Java OutOfMemoryError.
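The guidance above can be sketched as follows. This is a minimal example, not a complete application; the application name, input path, and partition count of 64 are placeholder assumptions you would tune for your workload.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("PartitionExample")                // placeholder name
  // Fallback parallelism for RDD APIs not given an explicit value.
  .set("spark.default.parallelism", "64")
val sc = new SparkContext(conf)

// Explicitly request 64 partitions instead of relying on the default.
val lines = sc.textFile("hdfs:///data/input", 64) // placeholder path
val pairs = lines.map(line => (line, 1))

// Shuffle operations such as reduceByKey also accept a numPartitions argument.
val counts = pairs.reduceByKey(_ + _, 64)
```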

Be aware of data shuffling

For most operations that repartition data, such as groupByKey and join, a ShuffleDependency is created, which causes data to be shuffled. However, if all the RDDs involved share the same partitioner, the Spark scheduler uses a OneToOneDependency instead. In this case, tasks can be pipelined and data is not shuffled. Explicitly setting a partitioner for your RDDs might therefore help reduce data shuffles.
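As a sketch of this technique, the fragment below assumes two existing pair RDDs, left and right (hypothetical names). Pre-partitioning both sides with the same HashPartitioner lets the subsequent join use a narrow (OneToOneDependency) relationship instead of shuffling again:

```scala
import org.apache.spark.HashPartitioner

// Use one shared partitioner; 100 partitions is an assumed workload-dependent value.
val partitioner = new HashPartitioner(100)

// Pre-partition both sides once and cache the partitioned results,
// so the shuffle cost is paid a single time.
val leftPart  = left.partitionBy(partitioner).persist()
val rightPart = right.partitionBy(partitioner).persist()

// Both RDDs now share the same partitioner, so this join does not
// shuffle data again; matching partitions are combined locally.
val joined = leftPart.join(rightPart)
```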

Make the best of data caching

Spark provides a well-designed API, so application developers can use chained programming conventions to keep code short and simple. However, if part of the chain is reused and is expensive to compute, cache that intermediate data by invoking the cache() or persist(storageLevel) methods.

By calling cache(), you actually call persist(MEMORY_ONLY), which caches data in memory. Other storage levels, such as MEMORY_AND_DISK, are also available. For recommendations on how to balance memory and CPU, see the Spark programming guide.
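The pattern can be sketched as follows. The RDD name records and the function parse are hypothetical stand-ins for an existing RDD and an expensive transformation in your application:

```scala
import org.apache.spark.storage.StorageLevel

// `records` and `parse` are assumed to exist in your application.
val parsed = records.map(parse)               // expensive, reused computation

// Cache the reused RDD; MEMORY_AND_DISK spills to disk if memory is short.
parsed.persist(StorageLevel.MEMORY_AND_DISK)

val total  = parsed.count()                   // first action computes and caches
val sample = parsed.take(10)                  // reuses the cached data
parsed.unpersist()                            // release storage when done
```

Note that persist() only marks the RDD for caching; the data is actually materialized the first time an action, such as count() above, triggers its computation.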