Spark environments (Watson Studio)
If your notebook includes Spark APIs, or you want to create machine learning models or model flows with Spark runtimes, you need to associate the tool that you are using with a Spark environment. With Spark environments, you can configure the size of the Spark driver, and the size and number of the executors.
Spark environments are not available by default. You need the IBM Analytics Engine powered by Apache Spark service.
This service is not available by default. An administrator must install the service. To determine whether the service is installed, open the Services catalog. If the service is installed and ready to use, the tile in the catalog shows Ready to use.
You can run Spark workloads in two ways:
- Outside Watson Studio, in an IBM Analytics Engine powered by Apache Spark instance, by using the Spark job APIs. For details, see Extending analytics using Spark.
- In a notebook that runs in a Spark environment in a project in Watson Studio, as shown in the sketch after this list. This option is described in the following sections of this topic.
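For example, a notebook that runs in a Spark environment works with Spark through a SparkSession. The following is a minimal sketch using standard PySpark; getOrCreate() reuses the session that the environment provides if one is already running, and otherwise starts a new one.

```python
# Minimal sketch: obtain the notebook's SparkSession and run a small job.
# getOrCreate() reuses the session preconfigured by the Spark environment
# if one is already running; otherwise it starts a new one.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.version)  # for example 3.4.x or 3.5.x, depending on the template

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()
```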
Included environment templates
You can use the included Spark environment templates to quickly get started with Spark in Watson Studio tools, without having to create your own environment templates. The included environment templates are listed under Templates on the Environments page on the Manage tab of your project.
- R-based runtimes are not supported on IBM Z (s390x) platforms.
- Spark 3.3 in Notebooks and JupyterLab is deprecated. Although you can still use Spark 3.3 to run your notebooks and scripts, you should consider moving to Spark 3.4.
- Runtime environments based on Spark 3.4 and Python 3.10 or R 4.2 (Default Spark 3.4 & Python 3.10 and Default Spark 3.4 & R 4.2) are deprecated and will be removed in a future release.
Name | Hardware configuration |
---|---|
From release 5.1.2: Default Spark 3.5 & Python 3.11 | 2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM |
Default Spark 3.4 & Python 3.11 | 2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM |
Default Spark 3.4 & Python 3.10 (deprecated) | 2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM |
Default Spark 3.3 & Python 3.10 (deprecated) ~ | 2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM |
From release 5.1.2: Default Spark 3.5 & R 4.3 | 2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM |
Default Spark 3.4 & R 4.3 | 2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM |
Default Spark 3.4 & R 4.2 (deprecated) | 2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM |
Default Spark 3.3 & R 4.2 (deprecated) ~ | 2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM |
Masking Flow Spark | 2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM |
~ Indicates that the environment includes libraries from a 22.2 Runtime release.
When you start a Spark environment, extra resources are needed for the Jupyter Enterprise Gateway, Spark Master, and the Spark worker daemons. These extra resources amount to 1 vCPU and 2 GB of RAM for the driver and 1 GB RAM for each executor.
You need to take these extra resources into account when selecting the hardware size of a Spark environment. For example, if you create a notebook and select Default Spark 3.4 & Python 3.11, the Spark cluster consumes 3 vCPU and 12 GB RAM. However, because 1 vCPU and 4 GB RAM are required for the extra resources, the resources remaining for the notebook are 2 vCPU and 8 GB RAM.
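As a worked example, the following sketch repeats this calculation in Python. The variable names are illustrative only and are not product settings.

```python
# Illustrative arithmetic for the "Default Spark 3.4 & Python 3.11" template:
# total cluster resources, minus the per-cluster overhead described above.

executors = 2
executor_cpu, executor_ram_gb = 1, 4   # per executor
driver_cpu, driver_ram_gb = 1, 4       # driver

total_cpu = driver_cpu + executors * executor_cpu          # 3 vCPU
total_ram = driver_ram_gb + executors * executor_ram_gb    # 12 GB RAM

# Overhead: 1 vCPU and 2 GB RAM on the driver, plus 1 GB RAM per executor
overhead_cpu = 1
overhead_ram = 2 + executors * 1                           # 4 GB RAM

print(total_cpu - overhead_cpu, "vCPU,",
      total_ram - overhead_ram, "GB RAM remain for the notebook")  # 2 vCPU, 8 GB RAM
```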
Notebooks and Spark environments
- You cannot use the Masking Flow Spark environment template with notebooks. This environment is designed for the Masking flow tool.
- Spark runtimes for notebooks are currently not supported on IBM Z (s390x) configurations.
When you create a notebook, you can select the Spark runtime that you want the notebook to run in. From the Environments page of your project, select an included Spark environment template or a Spark environment template that you created.
You can create more than one notebook and select the same Spark environment template. Every notebook associated with the environment has its own dedicated Spark cluster and no resources are shared. For example, if you create two notebooks using the same Spark environment template, two Spark clusters are started, one for each notebook, which means that each notebook has its own Spark driver and set of Spark executors.
If you want to pass environment variables to your Spark environment or control Spark behavior by using Spark variables, you must create a custom Spark environment. For more information on customizing environments, see Creating non-standard environment templates.
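For example, a notebook can read such settings at run time. The following is a minimal sketch that assumes a custom environment defining an environment variable named MY_SETTING and setting the Spark property spark.executor.memory; both names are placeholders for whatever your custom template actually defines.

```python
# Minimal sketch: read an environment variable and a Spark property from a
# notebook. MY_SETTING is a hypothetical variable defined in a custom
# environment template, not a product setting.
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

print(os.environ.get("MY_SETTING"))                       # environment variable passed to the runtime
print(spark.conf.get("spark.executor.memory", "not set")) # Spark property set through Spark variables
```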
File system on a Spark cluster
If you want to share files across the executors and the driver or kernel of a Spark cluster, you can use the shared file system at /home/spark/shared.
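For example, a file that the driver writes to the shared file system can be read inside a Spark task on the executors. The following is a minimal sketch for a Python notebook; the file name and contents are hypothetical.

```python
# Minimal sketch: the driver writes a small lookup file to the shared file
# system, and the executors read it inside a Spark task.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

shared_path = "/home/spark/shared/lookup.txt"  # hypothetical file name
with open(shared_path, "w") as f:              # runs on the driver
    f.write("blue,1\ngreen,2\n")

def count_lines(_):
    with open(shared_path) as f:               # runs on an executor
        return [sum(1 for _ in f)]

print(sc.parallelize(range(2), 2).mapPartitions(count_lines).collect())  # [2, 2]
```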
If you want to use your own custom libraries, you can store them under /home/spark/shared/user-libs/. The subdirectories under /home/spark/shared/user-libs/ are pre-configured to be made available to the Python, R, and Java runtimes.
The following table lists the pre-configured subdirectories where you can add your custom libraries.
Directory | Type of library |
---|---|
/home/spark/shared/user-libs/python3/ | Python 3 libraries |
/home/spark/shared/user-libs/R/ | R packages |
/home/spark/shared/user-libs/spark2/ | Java JAR files |
To share libraries across a Spark driver and executors:
- Download your custom libraries or JAR files to the appropriate pre-configured directory.
- Restart the kernel from the notebook menu by clicking Kernel > Restart Kernel. This loads your custom libraries or JAR files in Spark.
Note that these libraries are not persisted. When you stop the environment runtime and restart it again later, you must load the libraries again.
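For example, in a Python notebook you can install a library directly into the pre-configured directory from a notebook cell and then restart the kernel. The package name some-package is a hypothetical placeholder.

```python
# Run in a notebook cell. Installs the package into the shared, pre-configured
# directory so that the driver and executors can load it after a kernel restart.
# "some-package" is a hypothetical placeholder.
!pip install --target=/home/spark/shared/user-libs/python3/ some-package

# Then restart the kernel (Kernel > Restart Kernel) and import the library:
# import some_package
```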
Next steps
- Creating a notebook
- Creating your own environment template
- Customizing an environment template
- Changing the environment template of a notebook
- Stopping active notebook runtimes
Parent topic: Environments