Spark environments (Watson Studio)

If your notebook includes Spark APIs, or you want to create machine learning models or model flows with Spark runtimes, you need to associate the tool that you are using with a Spark environment. With Spark environments, you can configure the size of the Spark driver, and the size and number of the executors.

Spark environments are not available by default. They require the IBM Analytics Engine powered by Apache Spark service, which an administrator must install. To determine whether the service is installed, open the Services catalog. If the service is installed and ready to use, its tile in the catalog shows Ready to use.

You can run Spark workloads in two ways: by using the included environment templates, or by creating your own custom environment templates.

Included environment templates

You can use the included Spark environment templates to quickly get started with Spark in Watson Studio tools, without having to create your own environment templates. The included environment templates are listed under Templates on the Environments page on the Manage tab of your project.

Note:
  • R-based runtimes are not supported on IBM Z (s390x) platforms.
  • Spark 3.3 in Notebooks and JupyterLab is deprecated. Although you can still use Spark 3.3 to run your notebooks and scripts, you should consider moving to Spark 3.4.
  • Runtime environments based on Spark 3.4 and Python 3.10 or R 4.2 (Default Spark 3.4 & Python 3.10 and Default Spark 3.4 & R 4.2) are deprecated and will be removed in a future release.

~ Indicates that the environment includes libraries from a 22.2 Runtime release.

Table 1. Environment templates available for Spark with Python and R

Each of the following templates uses the same hardware configuration: 2 executors, each with 1 vCPU and 4 GB RAM, and a driver with 1 vCPU and 4 GB RAM.

  • From release 5.1.2: Default Spark 3.5 & Python 3.11
  • Default Spark 3.4 & Python 3.11
  • Default Spark 3.4 & Python 3.10 (deprecated)
  • Default Spark 3.3 & Python 3.10 (deprecated) ~
  • From release 5.1.2: Default Spark 3.5 & R 4.3
  • Default Spark 3.4 & R 4.3
  • Default Spark 3.4 & R 4.2
  • Default Spark 3.3 & R 4.2 (deprecated) ~
  • Masking Flow Spark

When you start a Spark environment, extra resources are needed for the Jupyter Enterprise Gateway, the Spark master, and the Spark worker daemons. These extra resources amount to 1 vCPU and 2 GB of RAM for the driver and 1 GB of RAM for each executor, and you must take them into account when selecting the hardware size of a Spark environment. For example, if you create a notebook and select Default Spark 3.4 & Python 3.11, the Spark cluster consumes 3 vCPU and 12 GB of RAM in total; because 1 vCPU and 4 GB of RAM are required for the extra resources, 2 vCPU and 8 GB of RAM remain for the notebook.
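The arithmetic above can be sketched as a small helper. The overhead figures (1 vCPU and 2 GB of RAM for the driver, 1 GB of RAM per executor) come from the text; the function name and layout are illustrative, not a product API.

```python
def usable_resources(executors, exec_vcpu, exec_gb, drv_vcpu, drv_gb):
    """Return (vCPU, GB RAM) left for the notebook after cluster overhead."""
    total_vcpu = drv_vcpu + executors * exec_vcpu
    total_gb = drv_gb + executors * exec_gb
    overhead_vcpu = 1                   # Jupyter Enterprise Gateway and daemons
    overhead_gb = 2 + executors * 1     # 2 GB for the driver, 1 GB per executor
    return total_vcpu - overhead_vcpu, total_gb - overhead_gb

# Default Spark 3.4 & Python 3.11: 2 executors and a driver, each 1 vCPU / 4 GB
print(usable_resources(2, 1, 4, 1, 4))  # (2, 8)
```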

Notebooks and Spark environments

Note:
  • You cannot use the Masking Flow Spark environment template with notebooks. This environment is designed for the Masking flow tool.
  • Spark runtimes for notebooks are currently not supported on IBM Z (s390x) platforms.

When you create a notebook, you can select the Spark runtime that you want the notebook to run in. From the Environments page of your project, select an included Spark environment template or a Spark environment template that you created.

You can create more than one notebook and select the same Spark environment template. Every notebook associated with the environment has its own dedicated Spark cluster and no resources are shared. For example, if you create two notebooks using the same Spark environment template, two Spark clusters are started, one for each notebook, which means that each notebook has its own Spark driver and set of Spark executors.

If you want to pass environment variables to your Spark environment or control Spark behavior by using Spark variables, you must create a custom Spark environment. For more information on customizing environments, see Creating non-standard environment templates.

File system on a Spark cluster

If you want to share files across executors and the driver or kernel of a Spark cluster, you can use the shared file system at /home/spark/shared.
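A minimal sketch of passing a small file between the driver and the executors through this shared file system follows. The path comes from the text; the helper names are illustrative, not a product API, and the path is only valid inside a running Spark environment.

```python
import os

# Path of the shared file system described above; visible to the driver
# (or kernel) and to every executor of the Spark cluster.
SHARED = "/home/spark/shared"

def write_shared(name, text, base=SHARED):
    """Write a small text file where every node of the cluster can see it."""
    path = os.path.join(base, name)
    with open(path, "w") as f:
        f.write(text)
    return path

def read_shared(name, base=SHARED):
    """Read the file back, for example from a task running on an executor."""
    with open(os.path.join(base, name)) as f:
        return f.read()
```

For example, the driver could call `write_shared("settings.txt", ...)` before submitting work, and code running inside the executors could call `read_shared("settings.txt")`.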

If you want to use your own custom libraries, you can store them under /home/spark/shared/user-libs/. The subdirectories under /home/spark/shared/user-libs/ are pre-configured to be made available to the Python, R, and Java runtimes.

The following table lists the pre-configured subdirectories where you can add your custom libraries.

Table 2. Pre-configured subdirectories for custom libraries
  Directory                               Type of library
  /home/spark/shared/user-libs/python3/   Python 3 libraries
  /home/spark/shared/user-libs/R/         R packages
  /home/spark/shared/user-libs/spark2/    Java JAR files

To share libraries across a Spark driver and executors:

  1. Download your custom libraries or JAR files to the appropriate pre-configured directory.
  2. Restart the kernel from the notebook menu by clicking Kernel > Restart Kernel. Restarting loads your custom libraries or JAR files into Spark.

Note that these libraries are not persisted. When you stop the environment runtime and restart it later, you must load the libraries again.
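If you need the shared Python directory on the import path of the kernel without waiting for a restart, one workaround is to prepend it to sys.path yourself. This is a sketch under the assumption that the directory exists and that the runtime has not already added it; the documented route, restarting the kernel, remains the reliable way to make libraries visible to both the driver and the executors.

```python
import sys

# Directory from Table 2: the pre-configured location for custom Python 3
# libraries. Adding it to sys.path by hand is a workaround sketch, not a
# documented product API.
USER_LIBS = "/home/spark/shared/user-libs/python3"

if USER_LIBS not in sys.path:
    sys.path.insert(0, USER_LIBS)  # imports now search the shared directory first
```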

Parent topic: Environments