Creating shared Spark batch applications
Create shared Spark batch applications where multiple users can submit Spark jobs to an application and use the same Resilient Distributed Datasets (RDDs).
The Spark RDD abstraction is a collection of partitioned data elements that can be operated on in
parallel. RDDs work at the application level, wherein each application manages its own RDDs. By
design, RDDs cannot be shared between different Spark batch applications because each application
has its own SparkContext. In some cases, however, different Spark batch applications need to
work with the same RDD.
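For reference, here is what a minimal standalone batch application looks like in Scala; the object name, application name, and data are illustrative. The RDD it creates is partitioned, is operated on in parallel, and exists only within this application's SparkContext:

    import org.apache.spark.{SparkConf, SparkContext}

    object RddScopeExample {
      def main(args: Array[String]): Unit = {
        // Each batch application builds its own SparkContext; any RDD
        // created here is visible only to this application.
        val sc = new SparkContext(new SparkConf().setAppName("rdd-scope-example"))

        // An RDD is a partitioned collection: 100 integers split across
        // 4 partitions that are processed in parallel.
        val numbers = sc.parallelize(1 to 100, numSlices = 4)
        val sum = numbers.map(_ * 2).reduce(_ + _)
        println(s"Sum of doubled values: $sum")

        sc.stop() // The RDD disappears with its SparkContext.
      }
    }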
Consider a scenario where a raw log file is scanned by the Spark batch applications of different departments for different purposes. This workload typically does not use extensive CPU resources, but it is sensitive to disk and network I/O. Because the different applications read the same file, each one creates its own RDD objects and loads the same data blocks repeatedly, wasting I/O resources. In this case, sharing the Spark context (that is, the driver) of one Spark batch application among all applications that use the same RDDs can reduce the cost of constructing the same RDDs again and again.
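To make the duplication concrete, the following sketch shows two independent batch applications, each submitted separately, that scan the same raw log; the file path and object names are hypothetical. Because each application has its own SparkContext, each one builds its own RDD and reads the same blocks from storage:

    import org.apache.spark.{SparkConf, SparkContext}

    // Application 1: counts ERROR lines in the shared raw log.
    object ErrorCounter {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("error-counter"))
        // Triggers a full scan of the file into this application's own RDD.
        val logs = sc.textFile("hdfs:///data/raw.log")
        println(logs.filter(_.contains("ERROR")).count())
        sc.stop()
      }
    }

    // Application 2: counts WARN lines in the very same log.
    object WarnCounter {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("warn-counter"))
        // A separate SparkContext, so the same blocks are read again.
        val logs = sc.textFile("hdfs:///data/raw.log")
        println(logs.filter(_.contains("WARN")).count())
        sc.stop()
      }
    }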
Shared Spark batch applications use a sharable RDD API to create and manage sharable RDDs within a Spark context. The sharable RDD API provides a data caching layer, wherein the shared RDD data is computed once and cached for reuse.
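The sharable RDD API itself is not reproduced here, but the caching idea can be sketched with plain Spark APIs: a single long-running driver materializes the RDD once with cache(), and every job submitted to that driver reuses the in-memory copy instead of rescanning the file. The object name and file path are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    object SharedDriverSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("shared-driver"))

        // Computed once, then kept in executor memory for reuse.
        val sharedLogs: RDD[String] = sc.textFile("hdfs:///data/raw.log").cache()

        // Job 1: the first action scans the file and populates the cache.
        val errors = sharedLogs.filter(_.contains("ERROR")).count()

        // Job 2: served from the cached partitions; no second scan.
        val warns = sharedLogs.filter(_.contains("WARN")).count()

        println(s"errors=$errors warns=$warns")
        sc.stop()
      }
    }

In a shared Spark batch application, jobs from different users arrive at one such driver instead of at separate drivers, which is why the cached RDD data can be reused across them.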
You can create shared Spark batch applications only with certain Spark versions; Spark version 1.5.2 is not supported.