Creating notebook packages

Create the notebook packages that contain the components that are required to run a notebook. This task is not required for the built-in notebooks. All the files that are required to run the built-in notebooks are installed with IBM® Spectrum Conductor.

Before you begin

If you have a local environment with a mixed cluster that uses both Linux and Linux on POWER, the Jupyter notebook packages for Linux must be in a different resource group than the packages for Linux on POWER, because the packages differ between the two platforms.

About this task

Create the package for a notebook, which you can then add to IBM Spectrum Conductor and make available to an instance group.

Creating a package involves bundling all the files that the notebook requires to run. Ensure that the package size does not exceed 4 GB and that it uses only one of the following supported formats:

.zip
.tar
.taz
.tar.zip
.tar.Z
.tar.gz
.tgz
.jar
.gz
.exe
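The size and format constraints above can be checked before you upload a package. The following is a minimal sketch, not part of IBM Spectrum Conductor: the function names `has_supported_format` and `check_package` are hypothetical, and the 4 GB limit is assumed to mean 4 x 1024^3 bytes.

```shell
#!/bin/sh
# Hypothetical pre-upload check for a notebook package file.

MAX_BYTES=4294967296   # assumption: 4 GB read as 4 * 1024^3 bytes

# Succeeds when the file name ends in one of the supported extensions.
has_supported_format() {
    case "$1" in
        *.zip|*.tar|*.taz|*.tar.zip|*.tar.Z|*.tar.gz|*.tgz|*.jar|*.gz|*.exe) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds when the file exists, has a supported format, and is within the size limit.
check_package() {
    pkg=$1
    [ -f "$pkg" ] || { echo "not a file: $pkg" >&2; return 1; }
    has_supported_format "$pkg" || { echo "unsupported format: $pkg" >&2; return 1; }
    # Arithmetic expansion strips any padding that wc adds on some platforms.
    size=$(( $(wc -c < "$pkg") ))
    [ "$size" -le "$MAX_BYTES" ] || { echo "package exceeds 4 GB: $pkg" >&2; return 1; }
    return 0
}
```

For example, `check_package testconductor.tar.gz || exit 1` stops a build script early if the archive is too large or has an unsupported extension.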

Procedure

Create scripts for the notebook, such as the scripts that are required to start or stop the notebook or for package deployment.

When you are creating your scripts, ensure that all the scripts have execution permission. Also, reference the correct environment variables so that your notebook works. IBM Spectrum Conductor provides the following environment variables for use in your scripts. Each environment variable is available to deployment scripts, service scripts, or both. These environment variables are defined when you create the notebook type, or overridden when you add a notebook to an instance group:

Environment variable name and description	Can be used in deployment scripts (for deploying or undeploying)	Can be used in service scripts (for starting, stopping, and job monitoring)
ANACONDA_DEPLOY_DIR: Specifies the deployment directory for the Anaconda distribution instance to use. This environment variable is only available and applicable to notebooks using Anaconda.	Yes	Yes
ANACONDA_RELATIVE_DIR: Specifies the Anaconda directory relative to the ANACONDA_DEPLOY_DIR. This environment variable is only available and applicable to notebooks using Anaconda.	Yes	Yes
ASCD_REST_CACERT_PATH: Specifies the path to the ascd service's certificate authority (CA) certificate, for clusters with SSL enabled.	No	Yes
CONDA_ENV_NAME: Specifies the conda environment for the notebook. This environment variable is only available and applicable to notebooks using Anaconda.	Yes	Yes
DEPLOY_NB_ADMIN_USER_GROUP: Specifies the administrator user group for deploying the notebook. Takes the value of the Administrator user group field when creating a notebook using the cluster management console, if that field is set. If the field does not have a value, then this DEPLOY_NB_ADMIN_USER_GROUP environment variable is also not set.	Yes	No
EGO_MASTER_LIST_PEM: Specifies a space-separated list of management hosts.	No	Yes (only for service scripts for Dockerized notebooks)
EGO_REST_URL and CONDUCTOR_REST_URL: Specifies the URLs on which the RESTful APIs are available. EGO_REST_URL: Specifies the URL on which the resource management RESTful APIs are available. This URL is by default `https://HOSTNAME:8543/platform/rest/ego/v1`. CONDUCTOR_REST_URL: Specifies the URL on which the instance group RESTful APIs are available. This URL is by default `https://HOSTNAME:8643/platform/rest/platform/rest/conductor/v1`. Note: Both these URLs are dynamically generated when notebook services are started. After the services start, take manual steps in the following cases: If failover occurs for the REST and ascd services that manage the APIs, manually restart the notebook services to pick up the new URLs. If you change the port for the REST or ascd services or switch between enabling and disabling SSL, unassign or assign the notebook users to ensure that the CONDUCTOR_REST_URL environment variable references the updated URL.	No	Yes
IBM_PLATFORM_DEPLOY_HOOK_EXEC_USER: Specifies the notebook execution user who is deploying the notebook.	Yes	No
NOTEBOOK_BASE_PORT: Specifies the base port from which the system tries to find available ports for use by the notebook.	Yes	Yes
NOTEBOOK_DATA_BASE_DIR: Specifies the top-level directory to store notebook data. Each notebook service then gets a unique directory within this directory, which is the NOTEBOOK_DATA_DIR environment variable available to service scripts.	Yes	No
NOTEBOOK_DATA_DIR: Specifies the directory to store notebook data.	No	Yes
NOTEBOOK_DEPLOY_DIR: Specifies the directory to which the notebook is deployed.	Yes	Yes
NOTEBOOK_EXTRA_CONF_FILE: Specifies a file that contains additional configuration variables or steps required by the notebook.	Yes	Yes
NOTEBOOK_SSL_ENABLED: Specifies whether SSL is enabled (that is, set to true) for the notebook.	No	Yes
SPARK_EGO_USER: Specifies the user who is assigned to the notebook.	No	Yes
SPARK_HOME: Specifies the Spark installation directory.	Yes	Yes
SPARK_INSTANCE_GROUP_UUID: Specifies the UUID of the instance group.	No	Yes
SPARK_INSTANCE_GROUP_NAME: Specifies the name of the instance group.	No	Yes
SPARKMS_HOST: Specifies the name of the host on which the Spark notebook master service is running. This host name is used to construct the Spark master URL in the format spark://`HOST`:`PORT`.	No	Yes

In addition to these environment variables available to service scripts, when you add a notebook to your cluster or to the instance group, any environment variables that you add to that notebook will be available to service scripts. Therefore, if you have a custom notebook that requires additional environment variables, you can set them when adding a notebook to your cluster or to the instance group, or when configuring each individual assigned notebook.
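Because a service script depends on these variables being set, it can help to fail early when one is missing. The following is a minimal sketch of such a check. The helper name `require_env` is hypothetical, and the variables named in the usage comment are only an illustration taken from the table above.

```shell
#!/bin/sh
# Hypothetical preamble for a service script (for example, a start script).

# require_env NAME... returns 1 if any named variable is unset or empty.
require_env() {
    missing=""
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        [ -n "$value" ] || missing="$missing $name"
    done
    if [ -n "$missing" ]; then
        echo "missing environment variables:$missing" >&2
        return 1
    fi
}

# Example use at the top of a start script:
# require_env NOTEBOOK_DEPLOY_DIR NOTEBOOK_DATA_DIR NOTEBOOK_BASE_PORT SPARK_HOME || exit 1
```

Only pass literal variable names to `require_env`, because the lookup uses `eval`. The same check works for custom variables that you define when you add the notebook.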

To perform HTTP calls to the EGO and CONDUCTOR REST servers, use cURL to obtain a CSRF token, which authenticates the REST calls that require authentication.
For example, to perform a GET REST call to CONDUCTOR_REST_URL and then parse the returned message to obtain the CSRF token, run:
```
curl -XGET -H'Accept: application/json' ${CONDUCTOR_REST_URL}conductor/v1/auth/logon ${tlsVersion}
```
The CSRF token can be used as authentication for subsequent POST, PUT, and DELETE REST calls, which avoids logging on multiple times.
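Parsing the token out of the logon response might look like the following minimal sketch. It assumes that the response is JSON with a string field named `csrftoken`; that field name and the helper name `extract_csrf_token` are assumptions, so check them against the response that your cluster actually returns.

```shell
#!/bin/sh
# Hypothetical helper: reads a JSON logon response on stdin and prints the token.
# Assumption: the response contains a string field named "csrftoken".
extract_csrf_token() {
    sed -n 's/.*"csrftoken"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage sketch (not run here; needs a reachable REST server and valid credentials):
# response=$(curl -s -XGET -H'Accept: application/json' "${CONDUCTOR_REST_URL}conductor/v1/auth/logon" ${tlsVersion})
# csrf_token=$(printf '%s\n' "$response" | extract_csrf_token)
```

The function prints nothing when the field is absent, so a script can test for an empty `csrf_token` before it makes further calls.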

Save the scripts in a local directory.
Depending on the network accessibility of the target environment, download the required binaries for the system or, if applicable, run the scripts that download the binaries.
Collect all the files that are required for the notebook to work, which might include:
- Third-party binaries, depending on the network accessibility of the target environment.
- Scripts that describe how to manage the notebook service lifecycle, such as commands to start or stop the service.
- The deployment configuration file (deployment.xml).
- The deploy script (deploy.sh). This script is called when an instance group is deployed, to help deploy the notebook package onto the client host.
- The undeploy script (undeploy.sh). This script is called when an instance group is removed, to remove the notebook package and all notebook-related files from the client host.
Go to the directory where the files are located. For example:
```
deployment_dir
  deployment.xml
  package_scripts/
  package/
```
Copy the deployment.xml file that is provided at the root of the samples folder.

Note: The $EGO_CONFDIR/../../ascd/conf/samples/deployment.xml file is only a sample and must be customized for each package.
Create a package_scripts subfolder and copy the scripts to that folder.
Create a package subfolder:
1. Optional: Copy any sample packages to this folder.
2. Optional: If hosts are connected to the Internet:
  - Create a server subfolder and copy the server dependency software there.
  - Create an agent subfolder and copy the agent dependency software there.
Generate the package in one of the supported formats. For example:

tar czvf testconductor.tar.gz deployment.xml package_scripts package
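The packaging step can also be scripted so that missing files are caught before the archive is built. The following is a minimal sketch under stated assumptions: the function name `build_package` and the staging directory argument are hypothetical, and the layout (deployment.xml, package_scripts, package) follows the example command above.

```shell
#!/bin/sh
# Hypothetical helper: build_package STAGING_DIR OUTPUT_FILE
# Bundles deployment.xml, package_scripts/ and package/ from STAGING_DIR.
build_package() {
    staging=$1
    output=$2
    # Refuse to build if any expected top-level item is missing.
    for item in deployment.xml package_scripts package; do
        [ -e "$staging/$item" ] || { echo "missing: $staging/$item" >&2; return 1; }
    done
    # All scripts need execution permission before they are bundled.
    find "$staging/package_scripts" -type f -exec chmod +x {} +
    # -C keeps the paths in the archive relative to the staging directory.
    tar -czf "$output" -C "$staging" deployment.xml package_scripts package
}
```

For example, `build_package deployment_dir testconductor.tar.gz` produces the archive, and `tar -tzf testconductor.tar.gz` lists its contents so that you can confirm that deployment.xml is at the top level.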

Results

The notebook package that contains the binaries, scripts, and other files that are required by the notebook is created.

What to do next

Add the notebook to the cluster. See Adding notebooks.