Spark environments (Watson Studio)

If your notebook includes Spark APIs, or you want to create machine learning models or model flows with Spark runtimes, you need to associate the tool that you are using with a Spark environment. With Spark environments, you can configure the size of the Spark driver, and the size and number of the executors.

Spark environments are not available by default. An administrator must install the Analytics Engine Powered by Apache Spark service on the IBM Cloud Pak for Data platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.

Included environment templates

You can use the included Spark environment templates to quickly get started with Spark notebooks in Watson Studio tools, without having to create your own environment templates. The included environment templates are listed under Templates on the Environments page on the Manage tab of your project.

~ Indicates that the environment includes libraries from the 22.1 Runtime release.

* Indicates that the environment is deprecated.

Table 1. Environment templates available in Watson Studio for Spark with Python, R, and Scala
Name | Hardware configuration | Note
Default Spark 3.2 & Python 3.9 ~ | 2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM |
Default Spark 3.2 & R 3.6 ~ | 2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM |
Default Spark 3.2 & Scala 2.12 | 2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM |
Default Spark 3.0 & Python 3.9 ~ * | 2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM | Spark 3.0 is available only if you are running Cloud Pak for Data 4.5.0.
Default Spark 3.0 & R 3.6 ~ * | 2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM | Spark 3.0 is available only if you are running Cloud Pak for Data 4.5.0.
Default Spark 3.0 & Scala 2.12 * | 2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM | Spark 3.0 is available only if you are running Cloud Pak for Data 4.5.0.

Note: When you start a Spark environment, extra resources are needed for the Jupyter Enterprise Gateway, the Spark Master, and the Spark worker daemons. These extra resources amount to 1 vCPU and 2 GB of RAM for the driver and 1 GB of RAM for each executor. Take them into account when you select the hardware size of a Spark environment. For example, if you create a notebook and select Default Spark 3.2 & Python 3.9, the Spark cluster consumes 3 vCPU and 12 GB of RAM. Because 1 vCPU and 4 GB of RAM are required for the extra resources, 2 vCPU and 8 GB of RAM remain for the notebook.
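The arithmetic in the note can be checked with a short Python sketch. The class and function names here (SparkTemplate, cluster_total, overhead, remaining) are made up for illustration and are not part of any Watson Studio API; the overhead figures are the ones stated in the note.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SparkTemplate:
    """Hardware configuration of a Spark environment template (illustrative type)."""
    executors: int
    executor_vcpu: int
    executor_ram_gb: int
    driver_vcpu: int
    driver_ram_gb: int


# Overhead figures taken from the note above.
OVERHEAD_DRIVER_VCPU = 1
OVERHEAD_DRIVER_RAM_GB = 2
OVERHEAD_EXECUTOR_RAM_GB = 1


def cluster_total(t):
    """Return (vCPU, GB RAM) consumed by the whole Spark cluster."""
    vcpu = t.driver_vcpu + t.executors * t.executor_vcpu
    ram = t.driver_ram_gb + t.executors * t.executor_ram_gb
    return vcpu, ram


def overhead(t):
    """Return (vCPU, GB RAM) reserved for the gateway, master, and worker daemons."""
    ram = OVERHEAD_DRIVER_RAM_GB + t.executors * OVERHEAD_EXECUTOR_RAM_GB
    return OVERHEAD_DRIVER_VCPU, ram


def remaining(t):
    """Return (vCPU, GB RAM) left for the notebook workload."""
    total_vcpu, total_ram = cluster_total(t)
    extra_vcpu, extra_ram = overhead(t)
    return total_vcpu - extra_vcpu, total_ram - extra_ram


# Default Spark 3.2 & Python 3.9: 2 executors and a driver, each 1 vCPU and 4 GB RAM.
default_spark_32_py39 = SparkTemplate(
    executors=2, executor_vcpu=1, executor_ram_gb=4, driver_vcpu=1, driver_ram_gb=4
)

print(cluster_total(default_spark_32_py39))  # (3, 12)
print(overhead(default_spark_32_py39))       # (1, 4)
print(remaining(default_spark_32_py39))      # (2, 8)
```

The same functions can be applied to a custom template by changing the field values, which makes it easier to see how much of a larger configuration is actually available to the notebook.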

Notebooks and Spark environments

When you create a notebook, you can select the Spark runtime you want the notebook to run in. You can select an included Spark environment template or a Spark environment template you created from the Environments page of your project.

You can create more than one notebook and select the same Spark environment template. Every notebook associated with the environment has its own dedicated Spark cluster and no resources are shared. For example, if you create two notebooks using the same Spark environment template, two Spark clusters are started, one for each notebook, which means that each notebook has its own Spark driver and set of Spark executors.
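Because no resources are shared, the footprint grows linearly with the number of running notebooks. The following Python sketch illustrates this, assuming every notebook uses a template whose cluster consumes 3 vCPU and 12 GB of RAM (the size of the included default templates); the function name total_footprint is hypothetical.

```python
def total_footprint(notebooks, cluster_vcpu=3, cluster_ram_gb=12):
    """Return (vCPU, GB RAM) consumed when each notebook starts its own Spark cluster.

    The defaults match a driver plus 2 executors at 1 vCPU and 4 GB RAM each.
    """
    if notebooks < 0:
        raise ValueError("notebooks must not be negative")
    return notebooks * cluster_vcpu, notebooks * cluster_ram_gb


# Two notebooks on the same template start two separate clusters.
print(total_footprint(2))  # (6, 24)
```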

File system on a Spark cluster

If you want to share files across executors and the driver or kernel of a Spark cluster, you can use the shared file system at /home/spark/shared.
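As a minimal sketch of how the shared file system might be used from a Python notebook, the helpers below write a text file under the shared directory and read it back. The function names and the shared_dir parameter are illustrative assumptions; only the path /home/spark/shared comes from this page.

```python
from pathlib import Path

# Shared file system path documented above.
SHARED_DIR = Path("/home/spark/shared")


def write_shared_text(name, text, shared_dir=SHARED_DIR):
    """Write text to a file under the shared directory and return its path."""
    target = Path(shared_dir) / name
    # Create intermediate directories if the name contains subfolders.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target


def read_shared_text(name, shared_dir=SHARED_DIR):
    """Read a text file from the shared directory."""
    return (Path(shared_dir) / name).read_text(encoding="utf-8")
```

For example, calling write_shared_text("lookup.csv", "id,label\n1,a\n") on the driver would place a hypothetical lookup.csv under /home/spark/shared, and code running on the executors could then open the same path.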

If you want to use your own custom libraries, you can store them under /home/spark/shared/user-libs/. The subdirectories under /home/spark/shared/user-libs/ are pre-configured to be made available to Python, R, and Scala or Java runtimes.

The following table lists the pre-configured subdirectories where you can add your custom libraries.

Table 2. Pre-configured subdirectories for custom libraries
Directory | Type of library
/home/spark/shared/user-libs/python3/ | Python 3 libraries
/home/spark/shared/user-libs/R/ | R packages
/home/spark/shared/user-libs/spark2/ | Java or Scala JAR files

To share libraries across a Spark driver and executors:

  1. Download your custom libraries or JAR files to the appropriate pre-configured directory.
  2. Restart the kernel from the notebook menu by clicking Kernel > Restart Kernel. This loads your custom libraries or JAR files in Spark.

Note that these libraries are not persisted. When you stop the environment runtime and restart it later, you must load the libraries again.
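Because the libraries must be loaded again after each restart of the runtime, it can help to keep the copy step in a notebook cell. The following Python sketch copies a file into the pre-configured directory for its library type. The function name, the kind keys, and the user_libs parameter are assumptions made for illustration; the directory names are the ones listed in the table above.

```python
import shutil
from pathlib import Path

# Base directory for custom libraries, as documented above.
USER_LIBS = Path("/home/spark/shared/user-libs")

# Library type -> pre-configured subdirectory (keys are illustrative).
LIBRARY_DIRS = {
    "python": "python3",  # Python 3 libraries
    "r": "R",             # R packages
    "jar": "spark2",      # Java or Scala JAR files
}


def install_custom_library(source, kind, user_libs=USER_LIBS):
    """Copy a library file into the subdirectory for its type and return the new path."""
    try:
        subdir = LIBRARY_DIRS[kind]
    except KeyError:
        raise ValueError(f"unknown library kind: {kind!r}") from None
    target_dir = Path(user_libs) / subdir
    target_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy2 returns the path of the copied file inside target_dir.
    return Path(shutil.copy2(source, target_dir))
```

After the copy, the kernel still has to be restarted (Kernel > Restart Kernel) before the library is loaded in Spark.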
