Introduction to IBM Z Platform for Apache Spark

This topic provides a brief introduction to the product components and terminology in IBM® Z Platform for Apache Spark (Spark).

Product components

IBM Z Platform for Apache Spark consists of the following components:
z/OS® Spark (FMID HSPK130)
z/OS Spark is built on Apache Spark, a high-performance, general execution engine for large-scale data processing. One of its key features is the ability to perform in-memory computing. Unlike traditional large-scale data processing technologies, which write intermediate results to disk, Spark can cache those results in memory, dramatically improving the performance of iterative processing.
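
For example, an application can parse its input once, cache the result in memory, and then run several actions against the cached data without rereading the input. The following Scala sketch illustrates the idea; the class name, input path, and computations are illustrative assumptions, not values or samples from the product.

import org.apache.spark.sql.SparkSession

object CacheExample {
  def main(args: Array[String]): Unit = {
    // Master, memory, and other settings are supplied when the application is submitted.
    val spark = SparkSession.builder().appName("CacheExample").getOrCreate()

    // Parse a hypothetical input file once and keep the parsed values in memory.
    val values = spark.sparkContext
      .textFile("/u/sparkuser/data/values.txt") // placeholder input path
      .map(_.trim.toDouble)
      .cache()                                  // cache intermediate results in memory

    // Both actions below reuse the cached data; without cache(), each
    // action would reread and reparse the input file from disk.
    val count = values.count()
    val mean = values.sum() / count
    val variance = values.map(v => (v - mean) * (v - mean)).sum() / count

    println(s"mean=$mean variance=$variance")
    spark.stop()
  }
}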

Terminology

The following terms and abbreviations appear throughout this documentation:
Master
The Spark daemon that allocates resources across applications.
Worker
The Spark daemon that monitors and reports resource availability and, when directed by the master, spawns executors. The worker also monitors the liveness and resource consumption of the executors.
Executor
A process that the worker creates for an application. The executors perform the actual computation and data processing for an application. Each application has its own executors.
Driver program
The process that runs the main function of the application and creates the SparkContext. For an example of how the driver, SparkContext, executors, and tasks fit together, see the sketch at the end of this topic.
SparkContext
The object, created by the driver program, that coordinates all executors in the cluster and sends tasks for the executors to run.
Deploy mode
Distinguishes where the driver process runs. In cluster deploy mode, the framework starts the driver inside the cluster. In client deploy mode, which is the default, the submitter starts the driver from outside the cluster. If you use Jupyter Notebook to interact with Spark, you are likely using client deploy mode.
Local mode
A non-distributed, single-JVM deployment mode in which all of the Spark execution components—driver, master, worker, and executors—run in the same JVM.
Cluster mode
Cluster mode, not to be confused with cluster deploy mode, means that, unlike local mode, each Spark execution component—driver, master, worker, and executors—runs in a separate JVM. An application can be submitted to a Spark cluster in either cluster deploy mode or client deploy mode.
Cluster manager
The software that manages resources for the Spark cluster. Apache Spark supports Standalone, Mesos, and YARN. Only the Standalone cluster manager is available for Z Platform for Apache Spark.
Task
A unit of work that is sent to one executor.
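
To see how these terms relate, the following Scala sketch shows a minimal driver program. The class name, application name, master URL, host name, and JAR path are illustrative placeholders, not values from this product. The driver creates the SparkContext; the SparkContext splits the job into tasks and sends them to the executors that the workers spawn for the application.

import org.apache.spark.{SparkConf, SparkContext}

// A packaged application can be submitted to a Standalone cluster in either
// deploy mode; the host name and JAR path below are placeholders:
//   spark-submit --master spark://master-host:7077 --deploy-mode client --class TerminologyExample app.jar
//   spark-submit --master spark://master-host:7077 --deploy-mode cluster --class TerminologyExample app.jar
object TerminologyExample {
  def main(args: Array[String]): Unit = {
    // The driver program starts here and creates the SparkContext.
    // "local[*]" runs every component in a single JVM (local mode); a URL
    // such as "spark://master-host:7077" points at a Standalone master, so
    // the components run in separate JVMs (cluster mode). Omit setMaster to
    // let spark-submit --master choose the cluster instead.
    val conf = new SparkConf()
      .setAppName("TerminologyExample")
      .setMaster("local[*]")

    val sc = new SparkContext(conf)

    // The SparkContext splits this job into tasks, one per partition, and
    // sends the tasks to the executors that run on behalf of this application.
    val sumOfSquares = sc.parallelize(1L to 1000000L, 8)
      .map(n => n * n)
      .reduce(_ + _)

    println(s"Sum of squares: $sumOfSquares")
    sc.stop()
  }
}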