Configuring pipeline components

Configure the components of your pipeline to specify inputs and to create outputs as part of your pipeline.

Beta notice: The Watson Studio Pipelines service is provided as a Beta, solely for purposes of testing and providing feedback to IBM prior to general availability. It is not intended for production use. You can download this service from the Watson Studio Pipelines Early Access Site.

Specifying the Cloud Pak for Data scope

By default, the scope for a pipeline is the project containing the pipeline. You can explicitly specify a scope other than the default, to locate an asset used in the pipeline. The scope is the project, catalog, or space containing the asset. From the user interface, you can browse for the scope. In a notebook, specify the scope as part of the path to an asset, as follows:

[cpd://]/(projects|spaces|catalogs)/<scope-id>/<resource-type>/<resource-ID>

where:

  • The optional cpd:// prefix represents the context: the URL for the cluster or server where you access Cloud Pak for Data.
  • <scope-id> identifies the scope: the project, space, or catalog that contains the asset.
  • <resource-type> and <resource-ID> identify the asset, which you can specify by asset name or asset ID.
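
For example, in a notebook you might build the path to a data asset as in the following minimal sketch. The resource type "assets" and all IDs shown here are placeholders, not real identifiers from your cluster.

    # Minimal sketch: building a scope path to a data asset from inside a notebook.
    # "assets" is a hypothetical resource type; the IDs are placeholders.
    scope_id = "0a1b2c3d-0000-1111-2222-333344445555"   # project, space, or catalog ID
    asset_id = "9f8e7d6c-6666-7777-8888-9999aaaabbbb"   # asset ID (or asset name)
    path = f"cpd:///projects/{scope_id}/assets/{asset_id}"
    print(path)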

Changing the input mode

When you are configuring a node, you can specify any resources, including data and notebooks, in various ways: by directly entering a name or ID, by browsing for an asset, or by using the output from a prior node in the pipeline to populate a field. To see what options are available for a field, click the input icon for the field. Depending on the context, options can include:

  • Select resource: use the asset browser to find an asset such as a data file.
  • Assign pipeline parameter: assign a value by using a variable configured with a pipeline parameter. For more information, see Configuring global objects.
  • Select from another node: use the output from a node earlier in the pipeline as the value for this field.
  • Enter expression: enter code to assign values or identify resources. For details, see Coding elements.

Configure the following types of pipeline components:

Copy

Use Copy nodes to add assets to your pipeline or to export pipeline assets.

Copy assets

Copy selected assets from a project or space to a non-empty space. You can copy these asset types to a space:

  • Data asset
  • Model
  • ML Pipeline
  • Function
  • Connection
  • Software specification
  • Data Refinery flow
  • Environment
  • Script

Input parameters

Parameter Description
Source assets Browse or search for the source asset to add to the list. You can also specify an asset with a pipeline parameter, with the output of another node, or by entering the asset ID
Target Browse or search for the target space
Copy mode Choose how to handle the case where the flow tries to copy an asset and one with the same name exists. One of: ignore, fail, overwrite. See the sketch after this table.
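
The three copy modes behave roughly as in the following sketch. This is an illustration of the conflict-handling rules only, not the product implementation.

    # Illustrative sketch of the copy modes; not the product implementation.
    def resolve_copy_conflict(copy_mode, existing_asset, source_asset):
        if existing_asset is None:
            return source_asset                  # no name conflict: copy normally
        if copy_mode == "ignore":
            return existing_asset                # keep the existing asset, skip the copy
        if copy_mode == "fail":
            raise RuntimeError("An asset with the same name already exists")
        if copy_mode == "overwrite":
            return source_asset                  # replace the existing asset
        raise ValueError(f"Unknown copy mode: {copy_mode}")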

Output parameters

Parameter Description
Output assets Path to the file for the copied assets

Export assets

Export selected assets from the scope, for example, a project or deployment space. The operation exports all the assets by default. You can limit asset selection by building a list of resources to export.

Input parameters

Parameter Description
Assets Path to project, catalog, or space. You can export all exportable items or create a list of specific items
Source project or space Name of project or space containing the assets to export
Exported file File location for storing the export file
Creation mode (optional) Choose how to handle a case where the flow tries to create an asset and one of the same name exists. One of: ignore, fail, overwrite

Output parameters

Parameter Description
Exported file Path to exported file

Notes:

  • If you export a project that contains a notebook, the latest version of the notebook is included in the export file. If the Run notebook node in the pipeline was configured to use a notebook version other than the latest, the exported pipeline is automatically reconfigured to use the latest version when it is imported. This might produce unexpected results or require some reconfiguration after the import.
  • If assets are self-contained in the exported project, they are retained when you import the new project. Otherwise, some configuration might be required after you import the exported assets.

Import assets

Import assets from the exported location.

Input parameters

Parameter Description
Path to import target Browse or search for the assets to import
Archive file to import Specify the path to a ZIP file or archive

Notes: After you import a file, paths and references to the imported assets are updated, following these rules:

  • References to assets from the exported project or space are updated in the new project or space after the import.
  • If assets from the exported project refer to external assets (included in a different project), the reference to the external asset will persist after the import.
  • If the external asset no longer exists, the parameter is replaced with an empty value and you have to reconfigure the field to point to a valid asset.

Send

Communicate with other users about events in your pipeline.

Send email

Send an email to notify recipients about an event in the pipeline. Provide the recipient addresses as a comma-separated string, and enter the subject, body, and any attachments.

Prerequisites

  • Users must configure their user preferences to receive emails.
  • An SMTP service must be configured on the Cloud Pak for Data cluster. For details, see Enabling email notifications. Note that an administrator must configure SMTP before you complete the mail setup.

Input parameters

Parameter Description
Recipient input type Input users as a list or as a string.
Recipient email addresses List of recipient email addresses. Enter in prefix@domain format
Email subject Title for the message
Attachments (optional) Optionally, attach a document to the message
Email body Message text you want to send
Error policy (optional) Optionally, override the default error policy for the node

Output parameters

Parameter Description
Execution status Returns a value of: Completed, Completed with warnings, Completed with errors, Failed, or Canceled
Status message Message associated with the status

Create

Configure the nodes for creating assets in your pipeline.

Create AutoAI experiment

Use this component to train an AutoAI experiment and generate model-candidate pipelines.

Input parameters

Parameter Description
AutoAI experiment name Name of the new experiment
Scope A project or a space where the pipeline is created
Prediction type The type of model for the given data: binary classification, multiclass classification, or regression
Prediction column (label) The prediction column name
Positive class (optional) Specify a positive class for a binary classification experiment
Training data split ratio (optional) The percentage of data to hold back from training and use to test the pipelines (float: 0.0 - 1.0)
Algorithms to include (optional) Limit the list of estimators to be used (the list depends on the learning type)
Algorithms to use Specify the list of estimators to be used (the list depends on the learning type)
Optimize metric (optional) The metric used for model ranking
Hardware specification (optional) Specify a hardware specification for the experiment
AutoAI experiment description Description of the experiment
AutoAI experiment tags (optional) Tags to identify the experiment
Creation mode (optional) Choose how to handle a case where the pipeline tries to create an experiment and one of the same name exists. One of: ignore, fail, overwrite

Output parameters

Parameter Description
AutoAI experiment Path to the saved model

Create batch deployment

Use this component to create a batch deployment.

Input parameters

Parameter Description
ML asset Name or ID of the machine learning asset to deploy
New deployment name (optional) Name of the new deployment, with optional description and tags
Creation mode (optional) How to handle a case where the pipeline tries to create a deployment and one of the same name exists. One of: ignore, fail, overwrite
New deployment description (optional) Description of the deployment
New deployment tags (optional) Tags to identify the deployment
Hardware specification (optional) Specify a hardware specification for the job

Output parameters

Parameter Description
New deployment Path of the newly created deployment

Create data asset

Use this component to create a data asset.

Input parameters

Parameter Description
File Path to file in a file storage
Target scope Path to the target space or project
Name (optional) Name of the data source with optional description, country of origin, and tags
Description (optional) Description for the asset
Origin country (optional) Origin country for data regulations
Tags (optional) Tags to identify assets
Creation mode How to handle a case where the pipeline tries to create an asset and one of the same name exists. One of: ignore, fail, overwrite

Output parameters

Parameter Description
Data asset The newly created data asset

Create deployment space

Use this component to create and configure a space that you can use to organize and create deployments.

Input parameters

Parameter Description
New space name Name of the new space with optional description and tags
New space tags (optional) Tags to identify the space
New space COS instance CRN CRN of the COS service instance
New space WML instance CRN (optional) CRN of the Watson Machine Learning service instance
Creation mode (optional) How to handle a case where the pipeline tries to create a space and one of the same name exists. One of: ignore, fail, overwrite
Space description (optional) Description of the space

Output parameters

Parameter Description
Space CPD Path of the newly created space

Create web service

Use this component to create an online deployment where you can submit test data directly to a web service REST API endpoint.

Input parameters

Parameter Description
ML asset Name or ID of the machine learning asset to deploy
New deployment name (optional) Name of the new deployment, with optional description and tags
Creation mode (optional) How to handle a case where the pipeline tries to create a deployment and one of the same name exists. One of: ignore, fail, overwrite
New deployment description (optional) Description of the deployment
New deployment tags (optional) Tags to identify the deployment
Hardware specification (optional) Specify a hardware specification for the job

Output parameters

Parameter Description
New deployment CPD Path of the newly created deployment

Wait

Use nodes in this section to pause a pipeline until an asset is available in the location specified in the path.

Wait for all results

Use this node to wait until all results from the previous nodes in the pipeline are available so the pipeline can continue.

This node takes no inputs and produces no output. When the results are all available, the pipeline continues automatically.

Wait for any results

Use this node to wait until any result from the previous nodes in the pipeline is available so the pipeline can continue. Run the downstream nodes as soon as any of the upstream conditions have been met.

This node takes no inputs and produces no output. When any results are available, the pipeline continues automatically.

Wait for file

Wait for an asset to be created or updated in the location that is specified in the path from a job or process earlier in the pipeline. Specify a timeout length to wait for the condition to be met. If 00:00:00 is the specified timeout length, the flow waits indefinitely.
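
Conceptually, the node behaves like the following polling sketch. This is an analogy only, assuming a simple poll loop over a file-system path; it is not the actual implementation.

    # Conceptual analogy of the Wait for file node; not the actual implementation.
    import os
    import time

    def wait_for_file(path, wait_for_appear=True, timeout="00:00:00", poll_seconds=5):
        hours, minutes, seconds = (int(part) for part in timeout.split(":"))
        limit = hours * 3600 + minutes * 60 + seconds   # 00:00:00 means wait indefinitely
        waited = 0
        while True:
            exists = os.path.exists(path)
            if exists == wait_for_appear:
                return True                             # condition met
            if limit and waited >= limit:
                return False                            # timed out
            time.sleep(poll_seconds)
            waited += poll_seconds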

Input parameters

Parameter Description
File location Specify the location in the asset browser where the asset resides. Use the format data_asset/filename where the path is relative to the root. The file must exist and be in the location you specify or the node fails with an error.
Wait mode By default, the node waits for the file to appear. You can change the mode to wait for the file to disappear
Timeout length (optional) Specify the length of time to wait before proceeding with the pipeline. Use the format hh:mm:ss
Error policy (optional) See Handling errors

Output parameters

Parameter Description
Return value Return value from the component
Execution status Returns a value of: Completed, Completed with warnings, Completed with errors, Failed, or Canceled
Status message Message associated with the status

Control

Control the pipeline by adding error handling and logic.

Loops

A loop is a node in a pipeline that operates like a coded loop.

There are two types of loops: parallel and sequential.

You can use loops when the number of iterations for an operation is dynamic. For example, if you don't know the number of notebooks to process, or you want to choose the number of notebooks at run time, you can use a loop to iterate through the list of notebooks.

You can also use a loop to iterate through the output of a node or through elements in a data array.

Loop in parallel

Add a parallel looping construct to the pipeline. A parallel loop runs the iterating nodes independently and possibly simultaneously.

For example, to train a machine learning model with a set of hyperparameters to find the best performer, you can use a loop to iterate over a list of hyperparameters to use for training the notebook variations in parallel. The results can be compared later in the flow to find the best notebook. To see limits on the number of loops you can run simultaneously, see Limitations.

In the following example, a Run Bash script node searches for and retrieves notebooks that match specified criteria. A Run DataStage flow node retrieves data from a Git repository. When input from each node is available, the loop process begins, running each notebook retrieved by the search and processing the data retrieved from the Git repository.

Example of parallel loop

If you click the lower part of the node, where the green and pink badges are, the full loop process is displayed. The colored badges summarize the types of nodes that are in the loop. As the notebooks run, any errors in the notebooks are captured in a condition called Poor quality. The condition triggers a Bash script to increment a user variable named Increase error count. When the value of the Increase error count variable meets the specified threshold, the loop is terminated.

Example of parallel loop

Because the flow runs in parallel for each notebook, it returns results faster than a sequential loop.
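
As an analogy, a parallel loop over a list behaves roughly like the following sketch, where the Parallelism value caps how many iterations run at the same time. The list items and the iteration body are placeholders.

    # Conceptual analogy of a parallel loop; not the pipeline engine itself.
    from concurrent.futures import ThreadPoolExecutor

    def run_iteration(item):
        # Placeholder for the subpipeline that runs once per list element,
        # for example running a notebook with `item` as input.
        return f"processed {item}"

    list_input = ["notebook-a", "notebook-b", "notebook-c"]   # hypothetical List input
    parallelism = 2                                           # maximum simultaneous tasks

    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = list(pool.map(run_iteration, list_input))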

Input parameters
Table 1. Loop input of type list
Parameter Description
List input List literal that the loop iterates over or a standard link to pipeline input or pipeline output
Parallelism Maximum number of tasks to be run simultaneously. Must be greater than zero
Table 2. Loop input of type string
Parameter Description
Text input Text data that the loop reads from
Separator The character used to split the text
Parallelism (optional) Maximum number of tasks to be run simultaneously. Must be greater than zero

If the input array's element type is JSON, or any type that is represented as JSON, this field might decompose each element into a dictionary. The keys are the original element keys, and the values are the aliases for the output names.
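
For the string input described in Table 2, the iteration is roughly equivalent to splitting the text on the separator, as in this small sketch; the text and separator shown are placeholders.

    # Conceptual sketch of a loop over text input split by a separator.
    text_input = "alpha;beta;gamma"   # hypothetical Text input
    separator = ";"                   # hypothetical Separator
    for element in text_input.split(separator):
        # Placeholder for the subpipeline that runs once per element.
        print(f"iterating over {element}")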

Loop in sequence

Add a sequential loop construct to the pipeline. A sequential loop iterates over a given range, for example 1 to 100 in steps of 4, and performs an action for each iteration, such as running a notebook, script, or another flow. Loops can iterate over a numeric range, a list, or text with a delimiter. Define a subpipeline flow inside the loop to run until the loop is complete.

A typical use case for a sequential loop is to retry an operation up to 3 times before determining that it has failed.
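
A loop over a numeric range (see Table 3) is roughly equivalent to the following sketch. The Start, End, and Step values are examples only, and whether End is inclusive may differ from Python's range, which excludes the end value.

    # Conceptual analogy of a sequential loop over a range: Start=1, End=100, Step=4.
    start, end, step = 1, 100, 4
    for counter in range(start, end, step):
        # Placeholder for the subpipeline that runs once per iteration,
        # for example running a notebook or script with `counter` as input.
        print(f"iteration {counter}")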

Input parameters
Table 1. Loop input of type list
Parameter Description
List input List literal that the loop iterates over or a standard link to pipeline input or pipeline output
Table 2. Loop input of type string
Parameter Description
Text input Text data that the loop reads from
Separator A char used to split the text
Table 3. Loop with range
Parameter Description
Start The integer at which the loop counter starts
Step (optional) The increment of the counter for each loop. If unspecified, the default is 1.
End The integer at which the loop counter stops

Set user variables

Configure a user variable with a key/value pair, then add the list of dynamic variables for this node.

For details on creating a user variable, see Configuring global objects.

Input parameters

Parameter Description
Name Enter the name, or key, for the variable
Input type Choose Expression or Pipeline parameter as the input type.
  • For expressions, use the built-in Expression Builder to create a variable that results from a custom expression.
  • For pipeline parameters, assign a pipeline parameter and use the parameter value as input for the user variable.

Terminate pipeline

You can initiate and control the termination of a pipeline with a Terminate pipeline node from the Control category. When the error flow runs, you can specify how to handle notebook or training jobs that were started by nodes in the pipeline: wait for the jobs to finish, cancel the jobs and then stop the pipeline, or stop the pipeline without canceling the jobs. Specify these options on the Terminate pipeline node.

Input parameters

Parameter Description
Terminator mode (optional) Choose the behavior for the error flow

Terminator mode can be:

  • Terminate pipeline run and all running jobs stops all jobs and stops the pipeline.
  • Cancel all running jobs then terminate pipeline cancels any running jobs before stopping the pipeline.
  • Terminate pipeline run after running jobs finish waits for running jobs to finish, then stops the pipeline.
  • Terminate pipeline run without stopping jobs stops the pipeline but allows running jobs to continue.

Update

Use update components to replace or update the details of assets. For example, if you want to standardize your tags, you can replace an existing tag with a new tag.

Update AutoAI experiment

Update the training details for an AutoAI experiment.

Input parameters

Parameter Description
AutoAI experiment Path to a project or a space, where the experiment resides
AutoAI experiment name (optional) Name of the experiment to be updated, with optional description and tags
AutoAI experiment description (optional) Description of the experiment
AutoAI experiment tags (optional) Tags to identify the experiment

Output parameters

Parameter Description
AutoAI experiment CPD Path of the updated experiment

Update batch deployment

Use these parameters to update a batch deployment.

Input parameters

Parameter Description
Deployment Path to the deployment to be updated
New name for the deployment (optional) A new name for the deployment
New description for the deployment (optional) Description of the deployment
New tags for the deployment (optional) Tags to identify the deployment
ML asset Name or ID of the machine learning asset to deploy
Hardware specification Update the hardware specification for the job

Output parameters

Parameter Description
Deployment CPD Path of the updated deployment

Update deployment space

Update the details for a space.

Input parameters

Parameter Description
Space Path of the existing space
Space name (optional) Update the space name
Space description (optional) Description of the space
Space tags (optional) Tags to identify the space
WML Instance (optional) Specify a new Machine Learning instance

Output parameters

Parameter Description
Space CPD Path of the updated space

Update web service

Use these parameters to update a web service (online deployment).

Input parameters

Parameter Description
Deployment Path of the existing deployment
Deployment name (optional) Update the deployment name
Deployment description (optional) Description of the deployment
Deployment tags (optional) Tags to identify the deployment
Asset (optional) Machine learning asset (or version) to be redeployed

Output parameters

Parameter Description
Deployment CPD Path of the updated deployment

Delete

Configure parameters for delete operations. You can delete:

  • AutoAI experiment
  • Batch deployment
  • Deployment space
  • Web service

For each item, choose the asset for deletion.

Run

Use these nodes to train an experiment, execute a script, or run a data flow.

Run AutoAI experiment

Trains and stores AutoAI experiment pipelines and models.

Input parameters

Parameter Description
AutoAI experiment Browse for the ML Pipeline asset or get the experiment from a pipeline parameter or the output from a previous node.
Training data assets Browse or search for the data to train the experiment. Note that you can supply data at runtime by using a pipeline parameter
Models count (optional) Specify how many models to save from best performing pipelines. The limit is 3 models
Run name (optional) Name of the experiment and optional description and tags
Model name prefix (optional) Prefix used to name trained models. Defaults to <(experiment name)>
Run description (optional) Description of the new training run
Run tags (optional) Tags for new training run
Creation mode (optional) Choose how to handle a case where the pipeline flow tries to create an asset and one of the same name exists. One of: ignore, fail, overwrite
Error policy (optional) Optionally, override the default error policy for the node

Output parameters

Parameter Description
Models List of paths of the top N trained and persisted models (ordered by the selected evaluation metric)
Best model Path of the winning model (based on the selected evaluation metric)
Model metrics List of trained model metrics (each item is a nested object with metrics such as holdout_accuracy, holdout_average_precision, and so on)
Winning model metric Selected evaluation metric of the winning model
Optimized metric Metric used to tune the model
Execution status Information on the state of the job: pending, starting, running, completed, canceled, or failed with errors
Status message Information about the state of the job

Run Bash script

Run an inline Bash script to automate a function or process for the pipeline. You can enter the Bash script code manually, or you can import the Bash script from a resource, pipeline parameter, or the output of another node.

In the following example, the user entered the inline script code manually. The script uses the cpdctl tool to search all notebooks with a set variable tag and aggregates the results in a JSON list. The list can then be used in another node, such as running the notebooks returned from the search.

Example of a bash script node

Input parameters

Parameter Description
Inline script code Enter a Bash script in the inline code editor. Alternatively, you can select a resource, assign a pipeline parameter, or select the output from another node.
Environment variables (optional) Specify a variable name (the key) and a data type and add to the list of variables to use in the script.
Error policy (optional) Optionally, override the default error policy for the node

Output parameters

Parameter Description
Output variables Configure a key/value pair for each custom variable, then click the Add button to populate the list of dynamic variables for the node
Return value Return value from the component
Standard output Standard output from the script
Execution status Information on the state of the job: pending, starting, running, completed, canceled, or failed with errors
Status message Message associated with the status

Run batch deployment

Configure this component to run selected deployment jobs.

Input parameters

Parameter Description
Deployment Browse or search for the deployment
Input data assets Specify the data used for the batch job
Output data asset Name of the output file for the results of the batch job
Hardware specification (optional) Browse for a hardware specification to apply for the job
Error policy (optional) Optionally, override the default error policy for the node

Output parameters

Parameter Description
Output asset Path to the file with results from the deployment job
Execution status Information on the state of the job: pending, starting, running, completed, canceled, or failed with errors
Status message Information about the state of the job

Run DataStage flow

IBM DataStage is a data integration tool for designing, developing, and running jobs that move and transform data. Run a DataStage flow and use the output in a later node. For details on building a DataStage pipeline, see Orchestrating flows with Watson Pipelines.

For example, the following flow shows a Run DataStage node that retrieves and runs a flow from a Git repository. If the flow completes successfully, the pipeline executes the next node and creates a deployment space. If the flow fails, a notification email is triggered, and the loop is terminated.

Running a DataStage flow in a pipeline

Input parameters

Parameter Description
DataStage flow Specify the flow to run
Environment Find and select the environment used to run the DataStage flow
Environment variables (optional) Specify a variable name (the key) and a data type and add to the list of variables to use in the flow
Job parameters (optional) Additional parameters to pass to the flow when it runs. Specify a key/value pair and add it to the list
Error policy (optional) Optionally, override the default error policy for the node

Output parameters

Parameter Description
Job Path to the results from the DataStage job
Job run Information about the job run
Job name Name of the job
Execution status Information on the state of the flow: pending, starting, running, completed, canceled, or failed with errors
Status message Information about the state of the flow

Run Data Refinery flow

This component runs a specified Data Refinery flow.

Input parameters

Parameter Description
Data-refinery flow Name or ID of the Data Refinery flow asset
Environment Path of the environment used to run the flow
Error policy (optional) Optionally, override the default error policy for the node

Output parameters

Parameter Description
Output asset Path to the results from the Data Refinery job
Execution status Information on the state of the flow: pending, starting, running, completed, canceled, or failed with errors
Status message Information about the state of the flow

Run notebook

Use these configuration options to specify how to run a Jupyter Notebook in a pipeline.

Input parameters

Parameter Description
Notebook Path of the notebook asset. The path cannot be to a notebook in a space
Environment Path of the environment used to run the notebook
Environment variables (optional) List of environment variables used to run the notebook
Error policy (optional) Optionally, override the default error policy for the node

Notes:

  • Environment variables you define in a pipeline cannot be used for notebook jobs you run outside of Watson Studio Pipelines.
  • You cannot run notebooks from a space.
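
A notebook can read the values from the Environment variables parameter listed earlier. The following is a minimal sketch, assuming the variables are exposed to the notebook as ordinary process environment variables; the variable names are hypothetical.

    # Minimal sketch, assuming the node's environment variables are exposed as
    # process environment variables; ERROR_THRESHOLD and INPUT_DATA_PATH are hypothetical.
    import os

    threshold = float(os.environ.get("ERROR_THRESHOLD", "0.1"))
    input_path = os.environ.get("INPUT_DATA_PATH", "")
    print(f"Running with threshold={threshold} and input={input_path}")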

Output parameters

Parameter Description
Output variables Configure a key/value pair for each custom variable, then click Add to populate the list of dynamic variables for the node
Execution status Information on the state of the run: pending, starting, running, completed, canceled, or failed with errors
Status message Information about the state of the notebook run

Run pipeline

Starting in 4.5.2, you can add a pipeline to run a nested pipeline job as part of a containing pipeline. This is a way of adding reusable processes to multiple pipelines. You can use the output from a nested pipeline run as input for a node in the containing pipeline.

Input parameters

Parameter Description
Pipeline Select or enter a path to an existing pipeline.
Job parameter Optionally, specify the name and type of a parameter used in the nested pipeline. Note: The parameter must match both the name and type of the global parameter as defined in the nested pipeline, or the job will not run.

Output parameters

Parameter Description
Job Path to the results from the pipeline job
Job run Information about the job run
Job name Name of the job
Execution status Returns a value of: Completed, Completed with warnings, Completed with errors, Failed, or Canceled
Status message Message associated with the status

Notes for running nested pipeline jobs

If you create a pipeline with nested pipelines and run a pipeline job from the top-level, the pipelines are named and saved as project assets using this convention:

  • The top-level pipeline job is named "Trial job - pipeline guid".
  • All subsequent jobs are named "pipeline_ pipeline guid".

Parent topic: Creating a pipeline