Guidelines for Spark application code
A Spark application defines RDDs to describe how data is obtained and transformed. The relationship between RDDs is used to optimize the job logical plan. When writing your Spark application, factor in a few application code guidelines to optimize performance.
Set partition numbers properly
When creating Spark RDDs, explicitly specify the number of RDD partitions instead of relying on the default level of parallelism (2 by default). Many RDD APIs accept an optional partitions (parallelism) parameter, and you can choose an appropriate value according to your application workload. If you do not set it in the code, set the spark.default.parallelism configuration property instead.
When specifying partitions, bigger is not always better. More partitions mean more tasks processing data in parallel, and each task has its own overhead, including communication and data overhead. Submitting too many tasks causes many TaskStarted and TaskEnded messages to be sent and handled in the messaging channel, and many events to be written to the event log - both factors slow down the driver. Task metrics also consume memory when the Spark UI is rebuilt and can crash the driver program if a Java OutOfMemoryError occurs.
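As a minimal sketch of both approaches, the snippet below sets spark.default.parallelism as a fallback and also passes an explicit partition count to RDD APIs that accept one. The input path, application name, and the count of 8 are illustrative assumptions, not recommendations.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Fallback parallelism used when an operation is not given an explicit count.
// The value 8 is only an example; tune it to your cluster and workload.
val conf = new SparkConf()
  .setAppName("PartitionExample")
  .set("spark.default.parallelism", "8")
val sc = new SparkContext(conf)

// Many RDD APIs take an optional partitions parameter:
val lines  = sc.textFile("hdfs:///data/input", 8)   // minPartitions for the input
val pairs  = lines.map(line => (line.length, 1))
val counts = pairs.reduceByKey(_ + _, 8)            // numPartitions for the shuffle output
```

Passing the count at the call site takes precedence over spark.default.parallelism, so you can tune individual stages without changing the cluster-wide default.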
Be aware of data shuffling
For most operations that repartition data, such as groupByKey and join, a
ShuffleDependency is created, which causes data to be shuffled. However, if all the
RDDs involved share the same partitioner, the Spark scheduler uses a
OneToOneDependency instead. In this case, tasks can be pipelined and data is not
shuffled. Explicitly setting a partitioner on your RDDs can therefore help reduce data shuffles.
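The following sketch shows this technique: both sides of a join are pre-partitioned with the same HashPartitioner, so the join itself produces narrow (OneToOneDependency-style) dependencies rather than a shuffle. The sample data and the partition count of 8 are illustrative; sc is assumed to be an existing SparkContext.

```scala
import org.apache.spark.HashPartitioner

// Use one partitioner instance for every RDD that will be joined.
val partitioner = new HashPartitioner(8)

// partitionBy shuffles once, up front; persist() keeps the partitioned
// layout so it is not recomputed on each reuse.
val left = sc.parallelize(Seq((1, "a"), (2, "b")))
  .partitionBy(partitioner)
  .persist()
val right = sc.parallelize(Seq((1, "x"), (2, "y")))
  .partitionBy(partitioner)
  .persist()

// Both inputs share the partitioner, so join() does not reshuffle them.
val joined = left.join(right)
```

Persisting after partitionBy matters: without it, the repartitioning work is repeated every time the RDD is used.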
Make the best use of data caching
Spark provides a well-designed API, so application developers can use chained programming
conventions to make the code shorter and simpler. However, if part of the chain is
reused and its computation cost is high, that part of the data should be cached by invoking the
cache() or persist(storageLevel) methods. Calling cache() actually calls
persist(MEMORY_ONLY), which caches the data in memory. Other storage levels are
available for you to choose. For recommendations on how to balance memory and CPU, see the Spark programming guide.
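As a sketch, the chain below is reused by two downstream actions, so it is cached once instead of being recomputed for each. The input path and transformations are illustrative assumptions; sc is assumed to be an existing SparkContext.

```scala
import org.apache.spark.storage.StorageLevel

// An expensive chain that more than one action will reuse.
val base = sc.textFile("hdfs:///data/input")
  .map(_.split(","))
  .filter(_.length > 2)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
base.cache()
// If the data may not fit in memory, choose another level instead, e.g.:
// base.persist(StorageLevel.MEMORY_AND_DISK)

// Both actions now read from the cache rather than re-reading and
// re-parsing the input.
val total  = base.count()
val sample = base.take(10)
```

Without the cache() call, each action would trigger the full textFile/map/filter chain again from the source data.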