Create a notebook in Watson Studio Local
To create a notebook in Watson Studio Local, set up a project, create the notebook file, and use the notebook interface to develop your notebook.
- Create the notebook file
- Create the SparkContext
- Set Watson Studio Local Spark resources
- Analyze data in the notebook
Create the notebook file
To create or get a notebook file to add to the project:
- From your project assets view, click the add notebook link.
- In the Create Notebook window, specify the method to use to create your notebook. You can create a blank notebook, upload a notebook file from your file system, or upload a notebook file from a URL. The notebook you create or select must be a .ipynb file.
- Specify the rest of the details for your notebook.
- Click Create Notebook.
Alternatively, you can copy a sample notebook from the community page. The sample notebooks are
based on real-world scenarios and contain many useful examples of computations and visualizations
that you can adapt to your analysis needs. To work with a copy of the sample notebook, click the
Open Notebook icon and specify your project and the Spark service for the notebook.
For information about the notebook interface, see Parts of a notebook.
Create the SparkContext
A SparkContext is required if you want to analyze data by using Spark. The SparkContext can point either to Spark running in cluster mode or to Spark running in local mode within the user environment.
Watson Studio Local provides five Spark environments that differ in Spark version and deploy mode.
| Spark Version | Deploy Mode | Environment | Kernel |
|---|---|---|---|
| 2.3 | Local | Jupyter with Python 3.6 for GPU | Python 3.6 |
| 2.2.1 | Local | Jupyter with Python 3.5 and Spark 2.2.1 | Python 3.5 |
| 2.2.1 | Cluster | Jupyter with Python 3.5 and Spark 2.2.1 | Python 3.5 with Spark 2.2.1 |
| 2.0.2 | Local | Jupyter with Python 2.7 and Spark 2.0.2 | Python 2.7 |
| 2.0.2 | Cluster | Jupyter with Python 2.7 and Spark 2.0.2 | Python 2.7 with Spark 2.0.2 |
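To confirm which environment a notebook is running in, you can check the Python and Spark versions from a cell. This is a minimal sketch; it assumes the default SparkContext variable sc, described in the next paragraph, is available:

```python
import sys

# The Python level of the kernel (2.7, 3.5, or 3.6).
print(sys.version)

# The Spark version behind the default context, for example "2.2.1".
# Requires the default SparkContext `sc` that Watson Studio Local creates.
print(sc.version)
```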
A default SparkContext is set up in a variable called sc for Python 2.7, 3.5 and
GPU notebooks when a user environment starts up.
If you choose the Python 2.7 with Watson Studio Spark 2.0.2 or
Python 3.5 with Watson Studio Spark 2.2.1 kernel, sc
points to Spark running in cluster mode. If you choose the Python 2.7, Python 3.5, or Python 3.6 kernel, sc points to Spark running in local mode within the user environment. You
can run the sc.getConf().getAll() command to see the configuration properties of
the SparkContext. The configuration properties of the default SparkContext cannot be modified.
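For example, the following cell lists the configuration of the default context and shows which master it points to (the output varies by environment):

```python
# List the (read-only) configuration properties of the default SparkContext.
for key, value in sorted(sc.getConf().getAll()):
    print(key, "=", value)

# The master URL indicates the deploy mode: "local[*]" for local mode,
# or a spark:// URL for cluster mode.
print(sc.master)
```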
AutoStartJupyterSC
- Each SparkContext can take up to 400 MB of memory. If data scientists do not intend to use Spark for their analytics, disabling the creation of a SparkContext by default can save a significant amount of RAM and improve system stability.
- The configuration properties of the default SparkContext cannot be modified. If data scientists want to customize the configurations of the SparkContext to work with large sets of data, the default SparkContext should not be used.
If your data scientists don't intend to use Spark for analytics, the cluster admin can disable
the creation of the default SparkContext by setting the AutoStartJupyterSC property
to false. Complete these steps to set the property.
- SSH to the Watson Studio Local cluster.
- Run kubectl exec to a usermgmt pod.
- Set the AutoStartJupyterSC configuration to false in the /user-home/global/config/config.properties file to disable the creation of the SparkContext (see the example after this list). The default value for AutoStartJupyterSC is true.
- Notify data scientists to restart their Jupyter environment.
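As a rough sketch, the relevant entry in config.properties would look something like the following (the rest of the file is omitted, and the exact contents of your file will differ):

```
AutoStartJupyterSC=false
```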
If data scientists want a SparkContext with custom configuration (for example, after the default context has been disabled, or to tune memory for larger data sets), they can create their own in a Python notebook:

```python
import pyspark

conf = pyspark.SparkConf()
conf.setMaster("local[*]")
conf.set("spark.driver.extraClassPath", "/dbdrivers/*")
conf.set("spark.executor.memory", "4g")
conf.set("spark.driver.memory", "4g")
sc = pyspark.SparkContext(conf=conf)
sc.getConf().getAll()
```

By default, SparkContext is not set up for R notebooks.
Watson Studio Local users can modify one of the following templates to create a SparkContext setup for their R notebooks:
- sparklyr library, for Spark 2.0.2
- SparkR library. For Spark 2.0.2, use master="spark://spark-master-svc:7077". For Spark 2.2.1, use master="spark://spark-master221-svc:7077".
Set Watson Studio Local Spark resources
Depending on your use case, you might need to change the resources allocated for the Spark application. The default settings of Watson Studio Local Spark are as follows:
| Parameter | Watson Studio Local Defaults | Meaning |
|---|---|---|
| spark.cores.max | 3 | The maximum amount of CPU cores to request for the application from across the cluster (not from each machine). |
| spark.dynamicAllocation.initialExecutors | 3 | Initial number of executors to run. |
| spark.executor.cores | 1 | The number of cores to use on each executor. |
| spark.executor.memory | 4g | Amount of memory to use per executor process. |
By default, Watson Studio Local uses three Spark workers on the compute nodes. If you add more compute nodes, one additional Spark worker will be started on each added compute node.
To change the resources for the Spark application in the notebook:
First, stop the pre-created sc, and then create a new SparkContext with the proper resource configuration. Python example:
```python
sc.stop()

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.cores.max", "15")
        .set("spark.dynamicAllocation.initialExecutors", "3")
        .set("spark.executor.cores", "5")
        .set("spark.executor.memory", "6g"))
sc = SparkContext(conf=conf)
```
Then you can verify the new settings by running the following command in a cell using the new
sc:
```python
for item in sorted(sc._conf.getAll()):
    print(item)
```
Note that these resource settings also apply when notebooks are run as scheduled jobs.
See Spark Configuration for more information.
Analyze data in the notebook
Now you're ready for the real work to begin!
Typically, you'll install any necessary libraries, load the data, and then start analyzing it. You and your collaborators can prepare the data, visualize data, make predictions, make prescriptive recommendations, and more.
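A minimal sketch of this flow in a Python notebook is shown below. The package and file names are placeholders: it assumes a CSV file named my_data.csv is accessible from the notebook's working directory, and that the library you need can be installed with pip.

```python
# Install an extra library into the user environment (the package is just an example).
!pip install --user matplotlib

import pandas as pd

# Load a data file into a DataFrame; "my_data.csv" is a placeholder
# for a file that you added to your project.
df = pd.read_csv("my_data.csv")

# Basic exploration before deeper analysis or modeling.
print(df.shape)
print(df.describe())
```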
Tips:
- To change how often your notebook is automatically saved, run the %autosave magic command in a cell, for example, %autosave 5 to save every 5 seconds.
- If you run the %%javascript Jupyter.notebook.session.delete(); command to stop the kernel, note that the preceding cell might still appear to be running ([*]) even though it has actually finished.