Planning your notebooks and scripts experience

To make a plan for using Jupyter notebooks and scripts, first understand the choices that you have, the implications of those choices, and how those choices affect the order of implementation tasks.

You can perform most tasks related to notebooks and scripts with the Editor or Admin role in an analytics project. You need the IBM Cloud Pak for Data Administrator user role only to install services or create custom runtime images.

Before you start working with notebooks and scripts, consider the following questions, because most tasks must be completed in a particular order:

  • Which programming language do you want to work in?
  • Which development tool do you prefer?
  • Do you want to collaborate with others through Git?
  • What will your notebooks be doing?
  • What libraries do you want to work with?
  • Do you want to work in the product UI, automate the entire process, or use a mixture of both methods?
  • How can you use the notebook or script in Cloud Pak for Data?

To create a plan for using Jupyter notebooks or scripts, determine which of the following tasks you must complete.

Task | Mandatory? | Timing
Selecting the project type | Yes | This must be your very first task
Adding data assets to the project | Yes | Before you begin creating notebooks
Picking a programming language | Yes | Before you select the tool
Selecting a tool | Yes | After you've picked the language
Checking the library packages | Yes | Before you select a runtime environment
Choosing an appropriate runtime environment | Yes | Before you open the development environment
Automating the lifecycle of a notebook or script | No | You can automate the entire lifecycle or parts of it
Managing the notebooks and scripts lifecycle | No | When the notebook is ready
Uses for notebooks and scripts after creation | No | When the notebook is ready

Selecting the project type

The type of project you create affects the way collaboration works and the tools you can use.

Projects without Git integration
You can create an empty project or create one from a file. In projects without Git integration:

  • You can use the Jupyter Notebook editor and RStudio.
  • Notebooks are run as standalone files, with no direct access to any other notebook or script in the project.
  • Notebook collaboration is based on locking by user.
  • You can run R scripts and Shiny apps interactively in RStudio.
  • There is no collaboration on R scripts or Shiny apps.
  • Notebooks can be shared with others by publishing them to a catalog, to a GitHub repository, as a Gist, or by sharing the URL.
  • You can't use JupyterLab or the Visual Studio Code editor.

See Creating a project.

Git-integrated projects
You can create a project that is associated with a Git repository. In Git-integrated projects:

  • You can use JupyterLab, RStudio, and the Visual Studio Code editor.
  • Collaboration is available across all files in all branches of the Git repository that is associated with the project.
  • When you run notebooks and scripts, you can directly refer to any other notebook or script in the project.
  • Notebooks and scripts can't be published to a catalog.
  • You can't use the Jupyter Notebook editor.

See Accessing a Git repository for creating a project with Git integration.

Picking a programming language

You can choose to work in the following languages:

Python
Python is always included when you install Watson Studio.
R
R is not available by default. An administrator must install an R Notebook runtime or the RStudio Server Runtimes service on the IBM Cloud Pak for Data platform. To determine whether an R runtime or RStudio Server Runtimes is installed, open the Services catalog and check whether the service is enabled.

Selecting a tool

You can work with notebooks and scripts in the following tools. Your choice of tool depends on the programming language and the development environment that you want to work in, and it determines the type of project that you need to create.

Tool | Programming language | Project type | Collaboration | Why pick this tool?
Jupyter Notebook editor | Python or R | Project without Git integration | Collaboration only at the project level. The notebook is locked by a user and can only be unlocked by the same user or a project admin. | Matter of preference: The Jupyter Notebook editor feels more standalone, as new notebooks are opened in new tabs in the project. The editor is easy to use because it consists of just a file browser and an editor view.
RStudio | R | Project with or without Git integration | Without Git integration, no collaboration. With Git integration, collaboration across all files. | Ideal environment for writing R scripts, navigating the files on your computer, visualizing your results, supporting version control, developing packages, and writing Shiny apps.
JupyterLab | Python | Git-integrated project | Git-based collaboration across files in the associated repository. | JupyterLab is an IDE with a modular structure, where you can open several notebooks or scripts as tabs in the same window. JupyterLab supports useful extensions like Git and Elyra.
Visual Studio Code editor | Python | Git-integrated project | Git-based collaboration across all files in the repository. | You create the notebook or script in the Visual Studio (VS) Code editor on your workstation and then run and debug the code in a Watson Studio runtime directly from the VS Code editor. VS Code offers a huge ecosystem with more than 30,000 extensions, for example to help you analyze your code, find vulnerabilities, detect bad code patterns, enforce code style guides, and get code suggestions (AI-assisted programming).

Checking the library packages

When you open a notebook in a runtime environment, you have access to a large selection of preinstalled data science library packages. Many environments also include libraries provided by IBM at no extra charge, such as:

  • The Watson Natural Language Processing library in Python environments
  • Libraries to help you access project assets
  • Libraries for time series or geo-spatial analysis in Spark environments

For a list of the library packages and the versions included in an environment template, select the template on the Templates page from the Manage tab on the project's Environments page.

If libraries are missing in a template, you can add them:

Through the notebook or script
You can use familiar package install commands for your environment. For example, in Python notebooks, you can use mamba, conda or pip.
By creating a custom environment template
When you create a custom template, you can either add a software customization with your libraries, or a custom runtime image that you build with the libraries you want to include. For details, see Customizing environment templates.
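
Before you run an install command, it can help to check which libraries are actually missing from the runtime. The following is a minimal sketch that uses only the Python standard library. The helper names `missing_packages` and `pip_install_command` and the `required` list are hypothetical, and the check works on top-level import names, which can differ from the package name that you pass to pip.

```python
import importlib.util
import sys


def missing_packages(import_names):
    """Return the top-level import names that cannot be found in this runtime."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]


def pip_install_command(packages):
    """Build a pip command that targets the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", *packages]


# Hypothetical requirement list; replace it with the libraries your code imports.
required = ["json", "csv"]

to_install = missing_packages(required)
if to_install:
    # Only print the command here; run it yourself once you have reviewed it.
    print("Run:", " ".join(pip_install_command(to_install)))
else:
    print("All required packages are available.")
```

Using `sys.executable -m pip` rather than a bare `pip` is one way to make sure the package is installed into the same interpreter that runs the notebook or script.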

Choosing a runtime environment

The compute environment that you choose for your notebook depends on the amount of data that you want to process and the complexity of the data analysis.

Watson Studio offers many default environment templates with different hardware sizes and software configurations to help you quickly get started, without having to create your own templates. These included templates are listed on the Templates page from the Manage tab on the project's Environments page. For more information about the included environments, see Environments.

If the available templates don't suit your needs, you can create custom templates and determine the hardware size and software configuration. For details, see Customizing environment templates.

Important: Make sure that the environment has enough memory to store the data that you load into the notebook. This often means that the environment must have significantly more memory than the total size of the loaded data, because some data frameworks, like pandas, can hold multiple copies of the data in memory.
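
As a rough planning aid, you can compare the size of your data with the memory of an environment template before you pick one. The sketch below is an illustration only: the function name `has_enough_memory` is hypothetical, and the default `copy_factor` of 3 is an assumed safety margin, not a figure published for pandas or for any environment template.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def has_enough_memory(data_bytes, env_memory_bytes, copy_factor=3.0):
    """Return True if env memory covers the data size times an assumed copy factor."""
    if data_bytes < 0 or env_memory_bytes <= 0 or copy_factor < 1:
        raise ValueError("sizes must be positive and copy_factor must be at least 1")
    return data_bytes * copy_factor <= env_memory_bytes


# 2 GiB of data with an assumed factor of 3 needs about 6 GiB.
print(has_enough_memory(2 * GIB, 4 * GIB))  # False
print(has_enough_memory(2 * GIB, 8 * GIB))  # True
```

Adjust `copy_factor` after you measure the real memory use of your own workload.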

Working with data

To work with data in a notebook, you need to:

  • Add the data to your project, which turns the data into a project asset. See Adding data to a project for the different methods for adding data to a project.
  • Use generated code that loads data from the asset to a data structure in your notebook. For a list of the supported data types, see Data load support.
  • Write your own code to load data if the data source isn't added as a project asset or support for adding generated code isn't available for the project asset.
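
When you write your own loading code, the simplest case is reading a delimited file into a Python data structure. The following sketch uses only the standard library `csv` module. The function name `load_csv_rows` is hypothetical, and the demo writes a small temporary file so that it runs anywhere. It does not show how to resolve the location of a project asset, which depends on your project setup.

```python
import csv
import os
import tempfile


def load_csv_rows(path, encoding="utf-8"):
    """Read a CSV file with a header row into a list of dictionaries."""
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.DictReader(handle))


# Demo: create a small sample file, load it, then clean up.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="", encoding="utf-8") as tmp:
    tmp.write("id,score\n1,0.5\n2,0.9\n")
    sample_path = tmp.name

rows = load_csv_rows(sample_path)
os.remove(sample_path)

# csv.DictReader returns every value as a string; convert types as needed.
print(rows[0]["score"])  # 0.5
```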

The following table shows which notebook tools support generating code to load data to a data structure:

Tool | Generated code supported? | When to write code
Jupyter Notebook editor | Yes | When generating code for the file type or database connection isn't supported, or when the file or database connection isn't a project asset.
JupyterLab | Yes | When generating code for the file type or database connection isn't supported, or when the file or database connection isn't a project asset.
Visual Studio Code editor | No | At all times. You can copy the generated code added to a notebook cell from the Jupyter Notebook editor or JupyterLab to a notebook in the Visual Studio Code editor.

Automating the lifecycle of a notebook or script

You can use CPDCTL, a command-line interface, to manage the lifecycle of a notebook or script in Cloud Pak for Data. You can automate the entire flow, or only parts of the flow. For details, see Automating the lifecycle of notebooks and scripts.

Managing the notebooks and scripts lifecycle

After you have created and tested your notebooks or scripts in your tool in a project, you can:

  • Move notebooks and scripts into a deployment space.
  • [Notebooks only] Publish the notebook to a catalog so that other catalog members can use it in their projects. See Publishing assets from a project into a catalog.
  • [Notebooks only] Share a read-only copy outside of Watson Studio so that people who aren't collaborators in your projects can see and use it. See Sharing notebooks with a URL.

Uses for notebooks and scripts after creation

The options for a notebook or a script that is created and ready to use in IBM Cloud Pak for Data include:

To ensure that a notebook or script can be run as a job or in a pipeline (notebooks only):

  • Ensure that no cells require interactive input by a user.
  • Ensure that the code logs enough detail for you to follow the progress and diagnose any failures from the log alone.
  • Use environment variables in the code to access configurations if a notebook or script requires them, for example the input data file or the number of training runs.
  • If you're loading data from data sources as part of your code, make sure to properly handle error cases such as network connection or timeout errors.
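
The points above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed pattern: the environment variable names `INPUT_FILE` and `TRAINING_RUNS`, their defaults, and the helper names `read_config` and `load_with_retry` are all hypothetical.

```python
import logging
import os
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("job")


def read_config(environ=os.environ):
    """Read job settings from environment variables instead of prompting a user."""
    return {
        "input_file": environ.get("INPUT_FILE", "data.csv"),
        "training_runs": int(environ.get("TRAINING_RUNS", "1")),
    }


def load_with_retry(loader, attempts=3, delay_seconds=1.0):
    """Call loader(), retrying when it raises a connection or timeout error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return loader()
        except (ConnectionError, TimeoutError) as exc:
            log.warning("Load attempt %d of %d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # give up and let the job fail with the original error
            time.sleep(delay_seconds)


config = read_config()
log.info("Starting with input_file=%s training_runs=%d",
         config["input_file"], config["training_runs"])
```

In a real job, `loader` would be the function that opens your data source. Re-raising after the last attempt keeps the failure visible in the job log instead of hiding it.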

The following table shows the differences between running Python or R scripts as platform jobs or as batch deployments in a deployment space.

Job | Which variables can I pass? | Are mounted storage volumes supported? | Required compute support
Platform jobs | Can pass environment variables and command-line type arguments | Yes | Environment runtimes without Watson Machine Learning
Batch deployment jobs | Can only pass parameters that match a predefined pattern | No | Software specifications in Watson Machine Learning
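
For scripts that run as platform jobs, one way to accept both kinds of input is to let command-line arguments override values taken from environment variables. The sketch below assumes hypothetical option names (`--input-file`, `--runs`) and variable names (`INPUT_FILE`, `TRAINING_RUNS`). It does not show how a job definition passes these values, and it does not apply to the parameter pattern used by batch deployment jobs.

```python
import argparse
import os


def parse_settings(argv=None, environ=os.environ):
    """Parse settings; command-line arguments override environment variables."""
    parser = argparse.ArgumentParser(description="Example script settings")
    parser.add_argument("--input-file",
                        default=environ.get("INPUT_FILE", "data.csv"))
    parser.add_argument("--runs", type=int,
                        default=int(environ.get("TRAINING_RUNS", "1")))
    return parser.parse_args(argv)


# An empty list keeps this demo independent of the real command line;
# call parse_settings() with no arguments in a script to read sys.argv.
settings = parse_settings([])
print(settings.input_file, settings.runs)
```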
