Known issues and limitations for Orchestration Pipelines

The following known issues and limitations apply to Orchestration Pipelines.

Known issues for Pipelines

The following are known issues for Pipelines.

IBM Orchestration Pipelines is incompatible with OpenShift Pipelines

Applies to: 5.2.0

You cannot install both Orchestration Pipelines and OpenShift Pipelines because the two services are incompatible and use the same resources. This limitation applies to the default installation.

Workaround: You can switch to the embedded runtime option, which is configured in the custom resource. This removes the limitation and lets you use Orchestration Pipelines on the same cluster where Red Hat OpenShift Pipelines is installed. For more information, see Installing Orchestration Pipelines in the IBM Software Hub documentation.
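
As a sketch only, the embedded runtime option is enabled by patching the wspipelines custom resource; the exact option name depends on your release, so the key used below is a placeholder and you should use the field documented in Installing Orchestration Pipelines:

# Sketch: enable the embedded runtime option in the custom resource.
# <embedded_runtime_option> is a placeholder; replace it with the field
# documented in Installing Orchestration Pipelines for your release.
oc -n <PROJECT_CPD_INST_OPERANDS> patch wspipelines wspipelines-cr --type=merge \
  -p '{"spec": {"<embedded_runtime_option>": true}}'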

Job runs with large loads can cause cluster access problems

Applies to: 5.2.0 or later

Pipeline jobs with large loads that run in parallel can create an excessive number of objects in etcd, which can cause problems accessing the cluster. If this happens, follow the steps in this troubleshooting article to clean up the objects and restore the cluster.
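
If you suspect that this is happening, a quick way to gauge the load (assuming the default Tekton-based runtime, where pipeline job runs create PipelineRun and TaskRun objects) is to count those objects before you follow the troubleshooting steps:

# Count the Tekton objects that pipeline runs have accumulated.
# Replace <PROJECT_CPD_INST_OPERANDS> with the Cloud Pak for Data operands namespace.
oc get pipelineruns -n <PROJECT_CPD_INST_OPERANDS> --no-headers | wc -l
oc get taskruns -n <PROJECT_CPD_INST_OPERANDS> --no-headers | wc -l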

Deleting an AutoAI experiment fails under some conditions

Applies to: 5.2.0 or later

Using a Delete AutoAI experiment node to delete an AutoAI experiment that was created from the Projects UI does not delete the AutoAI asset. However, the rest of the flow can complete successfully.

Cache is not restored following upgrade

Applies to: 5.2.0 or later

Upgrading Cloud Pak for Data from a version earlier than 5.1.0 clears the cache. The cached data is not restored following the upgrade.

Dynamic parameter set names are incompatible with general names

Applies to: 5.2.0 or later

Dynamic names for parameter sets cannot begin with a numerical prefix, while general parameter set names can. As a result, you cannot override a general parameter set whose name begins with a number.

Cache reset on pipeline version change incompatible with "latest"

Applies to: 5.2.0 or later

Configuring the cache to reset on pipeline version change does not work when the pipeline job is configured to use the "latest" version. Changes to a pipeline that are not saved as a new version are not recognized as a modification that invalidates the cache.

Inadequate resources cause timeout issues

Applies to: 5.2.0 or later

Your cluster might be overloaded because it does not have adequate resources to run jobs. Often, this results in request timeout errors. The following are some examples of error messages that you might see on an unhealthy cluster:

context deadline exceeded
context canceled
resource quota evaluation timed out
http2: client connection lost
etcdserver: request timed out, possibly due to connection lost
etcdserver: request timed out
etcdserver: too many requests
keepalive ping failed to receive ACK within timeout
read: connection timed out
dial tcp 172.30.0.1:443: connect: connection refused
dial tcp 172.30.0.1:443: i/o timeout
net/http: TLS handshake timeout
no endpoints available for service

To resolve this issue, you must delete some jobs.
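
For example, assuming the default Tekton-based runtime, where each pipeline job run is backed by a PipelineRun object, you can list the runs from oldest to newest and delete finished runs that you no longer need:

# List pipeline runs from oldest to newest in the Cloud Pak for Data namespace.
oc get pipelineruns -n <PROJECT_CPD_INST_OPERANDS> --sort-by=.metadata.creationTimestamp
# Delete finished runs that are no longer needed (example names only).
oc delete pipelinerun <old-run-name-1> <old-run-name-2> -n <PROJECT_CPD_INST_OPERANDS>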

Pipelines cannot run tasks because quota is applied to the namespace

Applies to: 5.2.0 or later

Pipelines Redis pods might fail to start or cannot run Tekton tasks if a quota is applied to the namespace but no LimitRange is set. Because the Redis init containers do not define limits.cpu, limits.memory, requests.cpu, or requests.memory, an error occurs.

To solve the issue, apply a LimitRange with defaults for limits and requests. Change the namespace in the following YAML to the namespace where Cloud Pak for Data is installed:

apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-resource-limits
  namespace:  zen  #Change it to the namespace where CPD is installed
spec:
  limits:
  - default:
      cpu: 300m
      memory: 200Mi
    defaultRequest:
      cpu: 200m
      memory: 200Mi
    type: Container

After applying the solution, manually restart your pods.
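
For example, assuming you saved the YAML above as cpu-resource-limits.yaml, you can apply it and then restart the affected Redis pods (the pod names and namespace will differ in your cluster):

# Apply the LimitRange, then restart the Pipelines Redis pods so that the new
# defaults are picked up. Replace zen with the namespace where Cloud Pak for
# Data is installed, and <redis-pod-name> with your actual pod names.
oc apply -f cpu-resource-limits.yaml
oc get pods -n zen | grep redis
oc delete pod <redis-pod-name> -n zen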

Local parameter not updating without updating parameter set

Applies to: 5.2.0 or later

If you have both a local parameter and a parameter set associated with your pipeline and you update a local parameter with a new value, the local parameter continues to use the value from the previous run instead of the new value. To work around this, ensure that your parameter set is updated after you update the local parameter.

Hardware specification cannot override project-level settings

Applies to: 5.2.0 or later

If you set your pipeline flow to use defined custom environments for Pipelines, you cannot override the hardware specification for the Run Bash script node. The checkbox to enable the hardware specification in the node details panel is not available.

To work around this issue, provide the hardware specification through another data type, such as a parameter. In the node details panel, choose the hardware specification data by clicking the folder icon.

IBM Orchestration Pipelines installation failing with TRANSIENT_ERROR

Applies to: 5.2.0 or later

Your Pipelines installation might fail with the catalog source showing TRANSIENT_FAILURE. This is due to updated security measures: the catalog pod cannot run with the runAsNonRoot security context.

To work around this issue, change your specs before running the apply-olm command. On the CatalogSource, set spec.grpcPodConfig.securityContextConfig to legacy instead of restricted.
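
For example, assuming the affected catalog source is in the openshift-marketplace namespace, the setting can be changed with a patch along these lines (substitute the actual CatalogSource name):

# Set the catalog pod security context to legacy on the affected CatalogSource.
# Replace <catalog_source_name> with the catalog source that shows the failure;
# openshift-marketplace is the usual namespace for catalog sources.
oc patch catalogsource <catalog_source_name> -n openshift-marketplace \
  --type=merge -p '{"spec":{"grpcPodConfig":{"securityContextConfig":"legacy"}}}'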

Pipelines settings do not display in the Manage tab for projects

Applies to: 5.2.0 or later

If you are working in the watsonx.ai context, you will not see settings specific to Pipelines on the Manage tab for projects. The project-level settings for runtime variables and save frequency display when you are working in the Cloud Pak for Data context.

Override of custom pipeline environment after upgrade or import

If you upgrade from a software version prior to 5.1, or if you import an exported project file that was created in a version prior to 5.1, note this behavior for a custom pipeline environment. If a pipeline contains a Run Bash script node that was not configured with a hardware specification, and you import or upgrade the pipeline to a project in version 5.1 or later that has a custom pipeline environment defined at the project level, the hardware specification is overridden with the default of Extra extra small: 1 CPU and 2 GB RAM. You must reconfigure the environment.

Artifact-store pod crashing after upgrading to 5.1.2 on OpenShift version 4.14.X

If your OpenShift cluster is running version 4.14.X during installation or upgrade of Orchestration Pipelines to version 5.1.2 GA, the artifact-store microservice pod might restart repeatedly:

oc get pod | grep artifact
artifact-store-0 0/1 CrashLoopBackOff 5 (23s ago) 3m18s

The wspipelines-cr custom resource is stuck at 20% of progress:

oc get -n <PROJECT_CPD_INST_OPERANDS> wspipelines.wspipelines.cpd.ibm.com wspipelines-cr
NAME VERSION RECONCILED STATUS PERCENT AGE
wspipelines-cr 5.1.2 5.1.2 InProgress 20% 113m

Solution: If you plan to upgrade the OpenShift cluster from version 4.14.X, you can pre-apply the custom change to the artifact-store StatefulSet to prevent the failure during the upgrade.

If you already encountered this problem during installation or upgrade, complete the following steps:

  1. Export operands namespace:
  export PROJECT_CPD_INST_OPERANDS="<operands_namespace>"
  2. Edit artifact-store statefulset:
  oc project $PROJECT_CPD_INST_OPERANDS
  oc edit statefulset artifact-store
  3. Change .spec.containers shell command.

The default artifact-store StatefulSet spec.containers section looks like this:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - amd64
  automountServiceAccountToken: true
  containers:
  - env:
    - name: DATA_PLANE_NAMESPACE
      value: cpd-instance
    - name: ZEN_WATCH_DOG_SVC
      value: https://zen-watchdog-svc.cpd-instance.svc:4444

Add the - args: and command: lines under containers, before env:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - amd64
  automountServiceAccountToken: true
  containers:
  - args:
    - -c
    - export POD_INDEX=${HOSTNAME/artifact-store-/}; echo "$POD_INDEX" ; /opt/ibm/service/bin/artifact_store
    command:
    - /bin/sh
    env:
    - name: DATA_PLANE_NAMESPACE
      value: cpd-instance
    - name: ZEN_WATCH_DOG_SVC
      value: https://zen-watchdog-svc.cpd-instance.svc:4444
  4. Wait until artifact-store pods are recreated.

Note: If you want a persistent resolution, apply 5.1.2 Hotfix 1, which introduces a permanent fix for this artifact-store problem.

After the migration to 5.2.0, Run Bash script nodes are not able to run

After the upgrade from the previous Cloud Pak for Data release, pipelines with a Run Bash script node start but remain in a running state. To resolve this issue, apply the following patch:

export PROJECT_CPD_INST_OPERANDS="<operands_namespace>"

oc -n $PROJECT_CPD_INST_OPERANDS patch wspipelines wspipelines-cr --type=merge -p '{"spec": {"enableTaskRunManager": true }}'

Then wait about 10-15 minutes for the changes to apply.
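
To confirm that the patch was picked up, you can check the custom resource in the same way as in the artifact-store example above and wait until its status is no longer InProgress:

# Check the reconciliation status of the Orchestration Pipelines custom resource.
oc get -n $PROJECT_CPD_INST_OPERANDS wspipelines.wspipelines.cpd.ibm.com wspipelines-cr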

Limitations for Pipelines

The following are limitations for Pipelines.

Failed job runs might require increasing cluster resources

If your manual or scheduled pipeline jobs fail to run or do not complete, check the log for one of these errors:

  • Failure Internal error occurred: resource quota evaluation timed out InternalError
  • the server was unable to return a response in the time allotted, but may still be processing the request
  • Unable to load original variables, error: 1 error occurred: * context deadline exceeded
  • Internal error occurred: resource quota evaluation timed out
  • Internal error occurred: admission plugin "OwnerReferencesPermissionEnforcement" failed to complete validation
  • Resuming pipeline execution after activity failure
  • cannot retrieve Run resources

Any of these issues can indicate that you need to increase resources on your cluster. Refer to the Red Hat OpenShift documentation for guidance on how to manage and scale a cluster. See Scaling your OpenShift Container Platform cluster and tuning performance in production environments.
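
As a first check before scaling, you can look at current node utilization and requested resources; this is a general OpenShift check, not specific to Pipelines, and oc adm top requires cluster metrics to be available:

# Show current CPU and memory usage per node (requires cluster metrics).
oc adm top nodes
# Show how much of each node's allocatable CPU and memory is already requested.
oc describe nodes | grep -A 8 "Allocated resources"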

Limitations by configuration size

Note: Resources consumed during each loop iteration are released and cleaned after the iteration completes.
Warning: Due to technical limitations, do not use more than 1200 nodes per standard pipeline.

Small configuration

A SMALL configuration supports 6000 standard nodes across all active pipelines. For example:

  • 2 parallel pipelines containing a single loop node with 500 iterations and executing 6 nodes per iteration = (2 * 500 * 6 = 6000) nodes across active pipelines.

Medium configuration

A MEDIUM configuration supports 9000 standard nodes across all active pipelines. For example:

  • 3 parallel pipelines containing a single loop node with 500 iterations and executing 6 nodes per iteration = (3 * 500 * 6 = 9000) nodes across active pipelines.

Large configuration

A LARGE configuration supports 12000 standard nodes across all active pipelines. For example:

  • 2 parallel runs of a pipeline containing a single loop node with 1000 iterations and executing 6 nodes per iteration = (2 * 1000 * 6 = 12000) nodes across active pipelines.

To scale pipeline resources for suitable sizing, see the guidance for scaling in Resource Management for Administering Orchestration Pipelines.

Variable size limit

User variables and parameter values such as RunJob stage parameters cannot exceed 2K, including the name.

To work around this issue, see Configuring the size limit for a user variable.

Logs larger than 4 MB are truncated

Applies to: 5.1.0 or later

Orchestration Pipelines logs larger than about 4 MB are truncated to avoid breaking the consolidated logs view.

Environment variable size limit

Environment variables cannot exceed 128 KB.

Adding an SPSS node to an existing pipeline can break batch job

Applies to: 5.1.0 or later

If you configured a batch deployment job for a pipeline in a Cloud Pak for Data version prior to 4.7 and then add an SPSS flow to the pipeline, you might encounter a runtime error indicating a missing value for a connection field when you run the pipeline. The Watson Machine Learning runtime for SPSS requires that a connection field be passed as part of the data_asset; the connection field is mandatory. For data_assets where a connection is not used, an empty JSON object ({}) can be used as the value. A full input_data_reference looks like this:

[
  {
      "connection": {},
      "location": {
          "href": "/v2/assets/a066a855-dea4-40e7-ad93-e54c87de2bd8?space_id=fbdbb348-b88b-4374-8531-0de331bf587d"
      },
      "type": "data_asset"
  }
]

Pipeline versioning incompatible with trial runs

Applies to: 5.1.0 or later

Pipeline trial runs do not use the assigned version IDs in their payloads. You cannot use a previous version or the cache from a previous run in a trial run. To use the cache, create a job with the previous version instead.

Email attachments size limit

Applies to: 5.1.0 or later

When you add attachments by using the Send email node in a pipeline, your total attachment size cannot exceed 25 MB.

Limit on duration of pipeline job runs

Applies to: 5.1.0 or later

If a pipeline job run does not complete within 72 hours, it fails because the pipeline timeout is set to 72 hours.

Cached data for storage volume does not update immediately

Applies to: 5.1.0 or later

Deleting or adding a storage volume does not affect the cache immediately. The cache might take up to ten minutes to update.

Encrypted parameter set values are lost after import

Applies to: 5.1.0 or later

Encrypted parameter set values are lost when importing a Pipelines project. You must reconfigure or override the default parameter values on import.

Encrypted parameter values not supported in some run job nodes

Applies to: 5.1.0 or later

Encrypted parameter values are supported for decryption in job runs only for the following nodes:

  • Run Bash script
  • Run DataStage job
  • Run Pipelines job

Limited use of Terminate pipeline node for exception handlers

Applies to: 5.1.0 or later

The Terminate pipeline node terminates the main pipeline only, and does not work for exception handlers. To terminate exception handlers with this node, you need to set the exception handler mode to abort (Set the Terminator mode to Terminate pipeline run without stopping jobs).

Limited use of options for Bash commands

Applies to: 5.1.0 or later

Using certain command options when running a Bash script in a Run Bash script node gives an error, including but not limited to:

ls -l
ls -la
ls -lrt

This results from certain character sequences that are blocked by security verification, which prevents the pipeline from being validated.

Latency for cache updates

Updates to parameter values are limited by the cache, which refreshes approximately every 5 minutes. Any processes that depend on dynamic parameter values updating quickly will throw errors. Ensure that your parameter set remains static if your pipeline uses caching.

Environment variables cannot be exported to other clusters

Environment variables are exported in an encrypted form that can be decrypted only by the cluster where they were created. Environment variables can be shared between services within the same cluster, but they cannot be used on a different cluster.