Spark environments (Watson Studio)
If your notebook includes Spark APIs, or you want to create machine learning models or model flows with Spark runtimes, you need to associate the tool with a Spark service or environment. With Spark environments, you can configure the size of the Spark driver and the size and number of the executors.
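As a rough illustration, the driver and executor sizing that a Spark environment definition controls maps to standard Spark configuration properties. The values below are illustrative only, not Watson Studio defaults:

```
spark.driver.cores        1
spark.driver.memory       4g
spark.executor.instances  2
spark.executor.cores      1
spark.executor.memory     4g
```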
Spark environments are not available by default. An administrator must install the Analytics Engine Powered by Apache Spark service on the IBM Cloud Pak for Data platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.
Default environment definitions
You can use the default Spark environment definitions to quickly get started with Spark notebooks in Watson Studio tools, without having to create your own environment definitions. The default environment definitions are listed on the project’s Environments page.
Note: When you start a Spark environment, extra resources are needed for the Jupyter Enterprise Gateway, Spark Master, and the Spark worker daemons. These extra resources amount to 1 vCPU and 2 GB of RAM for the driver, and 1 GB of RAM for each executor. Take these extra resources into account when you select the hardware size of a Spark environment. For example, if you create a notebook and select Default Spark 3.0 & Python 3.7, the Spark cluster consumes 3 vCPU and 12 GB RAM. Because 1 vCPU and 4 GB RAM are required for the extra resources (1 vCPU and 2 GB for the driver, plus 1 GB for each of the two executors), the resources remaining for the notebook are 2 vCPU and 8 GB RAM.
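The resource arithmetic in the note can be sketched as a small helper. The overhead figures (1 vCPU and 2 GB RAM for the driver, 1 GB RAM per executor) come from the note above; the assumption that the default environment runs two executors is inferred from the example's totals.

```python
# Overhead figures from the documentation note; treat them as
# documentation values, not something queried from the cluster.
DRIVER_OVERHEAD_VCPU = 1
DRIVER_OVERHEAD_GB = 2
EXECUTOR_OVERHEAD_GB = 1

def remaining_resources(total_vcpu, total_ram_gb, num_executors):
    """Resources left for the notebook after the Spark cluster's
    extra resources are subtracted from the environment's totals."""
    overhead_vcpu = DRIVER_OVERHEAD_VCPU
    overhead_gb = DRIVER_OVERHEAD_GB + num_executors * EXECUTOR_OVERHEAD_GB
    return total_vcpu - overhead_vcpu, total_ram_gb - overhead_gb

# The "Default Spark 3.0 & Python 3.7" example: 3 vCPU and 12 GB RAM
# with two executors leaves 2 vCPU and 8 GB RAM for the notebook.
print(remaining_resources(3, 12, 2))
```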
Notebooks and Spark environments
When you create a notebook, you can select the Spark runtime you want the notebook to run in. You can select a default Spark environment definition or a Spark environment definition you created from the Environments page of your project.
You can create more than one notebook and select the same Spark environment definition. Every notebook associated with the environment has its own dedicated Spark cluster and no resources are shared. For example, if you create two notebooks using the same Spark environment definition, two Spark clusters are started, one for each notebook, which means that each notebook has its own Spark driver and set of Spark executors.
File system on a Spark cluster
If you want to share files across the executors and the driver or kernel of a Spark cluster, you can use the shared file system.
If you want to use your own custom libraries, you can store them under /home/spark/shared/user-libs/. Four subdirectories under /home/spark/shared/user-libs/ are pre-configured to be made available to Python, R, and Scala or Java runtimes.
The following table lists the pre-configured subdirectories where you can add your custom libraries.
To share libraries across a Spark driver and executors:
- Download your custom libraries or JAR files to the appropriate pre-configured directory.
- Restart the kernel from the notebook menu by clicking Kernel > Restart Kernel. This loads your custom libraries or JAR files in Spark.
Note that these libraries are not persisted. When you stop the environment runtime and restart it later, you must load the libraries again.
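As a quick way to check the pre-configured library directories from a notebook cell, a sketch like the following can help. The root path is the one named above; the function is defensive and simply returns an empty list when the shared file system is not mounted (for example, when the code runs outside the Spark cluster).

```python
import os

# Pre-configured root for custom libraries, from the documentation above.
SHARED_LIBS_ROOT = "/home/spark/shared/user-libs"

def list_user_lib_dirs(root=SHARED_LIBS_ROOT):
    """Return the per-runtime subdirectories under the shared
    user-libs root, or an empty list if the root does not exist."""
    if not os.path.isdir(root):
        return []
    return sorted(
        os.path.join(root, name)
        for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )

print(list_user_lib_dirs())
```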