
What's new in IBM Cloud Pak for Data?

See what new features and improvements are available in the latest release of IBM® Cloud Pak for Data.


December 2020 refresh of Version 3.5.0

Service | What's new
Watson™ OpenScale
Batch processing
Configure Watson OpenScale to work in batch mode by connecting a custom Watson OpenScale machine learning engine to an Apache Hive database and an Apache Spark analytics engine. Unlike online scoring, where scoring happens in real time and the payload data can be logged into the Watson OpenScale data mart, batch processing is done asynchronously. The batch processor reads the scored data and derives various model metrics, which means that millions of transactions can be processed without bringing the data into the data mart. To enable batch processing, you must apply the cpd-aiopenscale-3.5.0-patch-1 patch.

For details about installing the patch, see Available patches for Watson OpenScale. For details about using the batch processor, see Batch processing.

What's new in Version 3.5.0

IBM Cloud Pak for Data 3.5.0 introduces a home page that is now customizable, simpler navigation, improved platform and production workload management, and broader support for connections from the platform with easier connection management. The release also includes support for zLinux, a vault to store sensitive data, several new services, and numerous updates to existing services.

Platform enhancements

The following table lists the new features that were introduced in Cloud Pak for Data Version 3.5.0.

What's new | What does it mean for me?
Customize the home page In Cloud Pak for Data Version 3.5.0, you can customize the home page in two ways:
Platform-level customization
A Cloud Pak for Data administrator can specify which cards and links to display on the home page.
Cards
The cards that are available from the home page are determined by the services that are installed on the platform.

You can disable cards if you don't want users to see them. The changes apply to all users. However, the cards that an individual user sees are determined by their permissions and the services that they have access to.

Resource links
You can customize the links that are displayed in the Resources section of the home page.

For details, see Customizing the home page.

Personal customization
Each user can specify the cards that are displayed on their home page. (However, the list of cards that they can choose from is determined by the Cloud Pak for Data administrator.)

In addition, each user can specify which links to display in the Quick navigation section of their home page.

Home page

These features are offered in addition to the branding features that were introduced in Cloud Pak for Data 3.0.1.

Create user groups A Cloud Pak for Data administrator can create user groups to make it easier to manage large numbers of users who need similar permissions.

When you create a user group, you specify the roles that all of the members of the group have.

If you configure a connection to an LDAP server, user groups can include:
  • Existing platform users
  • LDAP users
  • LDAP groups
You can assign a user group access to various assets on the platform in the same way that you assign an individual user access. The benefit of a group is that it is easier to:
  • Give many users access to an asset.
  • Remove a user's access to assets by removing them from the user group.
Manage your cluster resources with quotas Cloud Pak for Data Version 3.5.0 makes it easier to manage and monitor your Cloud Pak for Data deployment.

The Platform management page gives you a quick overview of the services, service instances, environments, and pods running in your Cloud Pak for Data deployment. The Platform management page also shows any unhealthy or pending pods. If you see an issue, you can use the cards on the page to drill down to get more information about the problem.

The Platform management page

In addition, you can see your current vCPU and memory use. You can optionally set quotas to help you track your actual use against your target use. When you set quotas, you specify alert thresholds for vCPU and memory use. When you reach the alert threshold, the platform sends you an alert so that you aren't surprised by unexpected spikes in resource use.

Manage and monitor production workloads The Deployment spaces page gives you a dashboard that you can use to monitor and manage production workloads in multiple deployment spaces.
This page makes it easier for Operations Engineers to manage jobs and online deployments, regardless of where they are running. The dashboard helps you assess the status of workloads, identify issues, and manage workloads. You can use this page to:
  • Compare jobs.
  • Identify issues as they surface.
  • Accelerate problem resolution.

Deployment spaces page

Common core services This feature is available only when the Cloud Pak for Data common core services are installed. The common core services are automatically installed by services that rely on them. If you don't see the Deployment spaces page, it's because none of the services that are installed on your environment rely on the common core services.

Store secrets in a secure vault Cloud Pak for Data introduces a new set of APIs that you can use to protect access to sensitive data. You can create a vault that you can use to store:
  • Tokens
  • Database credentials
  • API keys
  • Passwords
  • Certificates

For more information, see Credentials and secrets API.
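
For example, here is a minimal sketch of storing a secret through the credentials and secrets API by using Python and the requests library. The endpoint path and payload fields shown here are assumptions for illustration only; check the Credentials and secrets API reference for the exact routes in your release.

import os
import requests

CPD_URL = os.environ["CPD_URL"]        # for example, https://cpd.example.com
CPD_TOKEN = os.environ["CPD_TOKEN"]    # a platform bearer token

headers = {
    "Authorization": f"Bearer {CPD_TOKEN}",
    "Content-Type": "application/json",
}

# Hypothetical request: store database credentials as a secret in the vault.
payload = {
    "secret_name": "warehouse-db-creds",
    "type": "credentials",
    "secret": {"username": "db_user", "password": "db_password"},
}

response = requests.post(
    f"{CPD_URL}/zen-data/v2/secrets",   # assumed path; confirm against the API reference
    json=payload,
    headers=headers,
    verify=False,                       # many clusters use self-signed certificates
)
response.raise_for_status()
print(response.json())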

Improved navigation The Cloud Pak for Data navigation menu is organized to focus on the objects that you need to access, such as:
  • Projects
  • Catalogs
  • Data
  • Services
  • Your task inbox

The items in the navigation depend on the services that are installed.

Manage connections more easily The Connections page makes it easier for administrators to define and manage connections and for users to find connections.

The Connections page is a catalog of connections that can be used by various services across the platform. Any user who has access to the platform can see the connections on this page. However, only users with the credentials for the underlying data source can use a connection.

Example of connections users might choose from

Users who have the Admin role on the connections catalog can create and manage these connections. Unlike in previous releases of Cloud Pak for Data, services can reference these connections rather than creating local copies. This means that any changes you make on the Connections page are automatically cascaded to the services that use the connection.

Common core services This feature is available only when the Cloud Pak for Data common core services are installed. The common core services are automatically installed by services that rely on them. If you don't see the Connections page, it's because none of the services that are installed on your environment rely on the common core services.

Workflows for managing business processes You can use workflows to manage your business processes. For example, when you install Watson Knowledge Catalog, the service includes predefined workflow templates that you can use to control the process of creating, updating, and deleting governance artifacts.

From the Workflow management page, you can define and configure the types of workflows that you need to support your business processes.

You can import and configure BPMN files from Flowable.

This feature is available only if Watson Knowledge Catalog is installed.

For details, see Workflows.

Connect to storage volumes In Cloud Pak for Data Version 3.5.0, you can connect to storage volumes from the Connections page or from services that support storage volume connections.

The storage volumes can be on external Network File System (NFS) storage or persistent volume claims (PVCs). This feature lets you access the files that are stored in these volumes from Jupyter Notebooks, Spark jobs, projects, and more. For details, see Connecting to data sources.

You can also create and manage volumes from the Storage volumes page. For more information, see Managing storage volumes.

Improved backup and restore process The backup and restore utility can now call hooks that are provided by Cloud Pak for Data services to perform the quiesce operation. Quiesce hooks offer optimizations and other enhancements compared to scaling down all Kubernetes resources. For example, services can be quiesced and unquiesced in a specific order, or suspended without bringing down their pods, which reduces the time it takes to bring applications down and back up. For more information, see Backing up the file system to a local repository or object store.
Audit service enhancements The Audit Logging Service in Cloud Pak for Data now supports monitoring additional events through the zen-audit-config configmap.

If you updated the zen-audit-config configmap to forward auditable events to an external security information and event management (SIEM) solution using the Cloud Pak for Data Audit Logging Service, you must update the zen-audit-config configmap to continue forwarding auditable events.

From:

<match export export.**>

To:

<match export export.** records records.** syslog syslog.**>

You can also use the oc patch configmap command to update the zen-audit-config configmap. For more information, see Export IBM Cloud Pak for Data audit records to your security information and event management solution.

Configure the idle web session timeout A Cloud Pak for Data administrator can configure the idle web session timeout in accordance with your security and compliance requirements. If a user leaves their session idle in a web browser for the specified length of time, the user is automatically logged out of the web client.
Auditing assets with IBM Guardium® The method for integrating with IBM Guardium has changed. IBM Guardium is no longer available as an option from the Connections page. Instead, you can connect to your IBM Guardium appliances from the Platform configuration page.

For details, see Auditing your sensitive data with IBM Guardium.

Common core services Common core services can be installed once and used by multiple services. The common core services support:
  • Connections
  • Deployment management
  • Job management
  • Notifications
  • Search
  • Projects
  • Metadata repositories

The common core services are automatically installed by services that rely on them. If you don't see these features in the web client, it's because the common core services are not supported by any of the services that are installed on your environment.

New cpd-cli commands You can use the Cloud Pak for Data command line interface to:
  • Manage service instances
  • Back up and restore the project where Cloud Pak for Data is deployed
  • Export and import Cloud Pak for Data metadata
Use your Cloud Pak for Data credentials to authenticate to a data source Some data sources now allow you to use your Cloud Pak for Data credentials for authentication. After you log in to Cloud Pak for Data, you don't need to enter separate credentials for the data source connection. If you change your Cloud Pak for Data password, you don't need to change the password for each data source connection. Data sources that support Cloud Pak for Data credentials have the selection Use your Cloud Pak for Data credentials to authenticate to the data source on the data source connection page. When you add a new connection to a project, the selection is available under Personal credentials.

The following data sources support Cloud Pak for Data credentials:

  • HDFS via Execution Engine for Hadoop *
  • Hive via Execution Engine for Hadoop *
  • IBM Cognos® Analytics
  • IBM Data Virtualization
  • IBM Db2®
  • Storage volume *

* HDFS via Execution Engine for Hadoop, Hive via Execution Engine for Hadoop, and Storage volume support only Cloud Pak for Data credentials. 

Service enhancements

The following table lists the new features that are introduced for existing services in Cloud Pak for Data Version 3.5.0:

What's new | What does it mean for me?
Analytics Engine Powered by Apache Spark
Spark 3.0
Analytics Engine Powered by Apache Spark now supports Spark 3.0. You can select:
  • The Spark 3.0 template to run Spark jobs or applications on your Cloud Pak for Data cluster by using the Spark jobs REST APIs.
  • A Spark 3 environment to run analytical assets in Watson Studio analytics projects.
Data Refinery
Use personal credentials for connections
If you create a connection and select the Personal credentials option, other users can use that connection only if they supply their own credentials for the data source.
Users who have credentials for the underlying data source can:
  • Select the connection to create a Data Refinery flow
  • Edit or change a location when modifying a Data Refinery flow
  • Select a data source for the Join operation

For information about creating a project-level connection with personal credentials, see Adding connections to analytics projects.

Use the Union operation to combine rows from two data sets that share the same schema

Union operation

The Union operation is in the ORGANIZE category. For more information, see GUI operations in Data Refinery.

Perform aggregate calculations on multiple columns in Data Refinery
You can now select multiple columns in the Aggregate operation. Previously all aggregate calculations applied to one column.

Aggregate operation

The Aggregate operation is in the ORGANIZE category. For more information, see Aggregate in GUI operations in Data Refinery.

Automatically detect and convert date and timestamp data types
When you open a file in Data Refinery, the Convert column type GUI operation is automatically applied as the first step if it detects any non-string data types in the data. In this release, date and timestamp data are detected and are automatically converted to inferred data types. You can change the automatic conversion for selected columns or undo the step. For information about the supported inferred date and timestamp formats, see the FREQUENTLY USED category in Convert column type in GUI operations in Data Refinery.
Change the decimal and thousands grouping symbols in all applicable columns
When you use the Convert column type GUI operation to detect and convert the data types for all the columns in a data asset, you can now also choose the decimal symbol and the thousands grouping symbol if the data is converted to an Integer data type or to a Decimal data type. Previously you had to select individual columns to specify the symbols.

For more information, see the FREQUENTLY USED category in Convert column type in GUI operations in Data Refinery.

Filter values in a Boolean column
You can now use the following operators in the Filter GUI operation to filter Boolean (logical) data:
  • Is false
  • Is true

Filter operation

For more information, see the FREQUENTLY USED category in Filter in GUI operations in Data Refinery.

In addition, Data Refinery includes a new template for filtering by Boolean values in the filter coding operation:
filter(`<column>`== <logical>)

For more information about the filter templates, see Interactive code templates in Data Refinery.

Data Refinery flows are supported in deployment spaces
You can now promote a Data Refinery flow from a project to a deployment space. Deployment spaces are used to manage a set of related assets in a separate environment from your projects. You can promote Data Refinery flows from multiple projects to a space. You run a job for the Data Refinery flow in the space and then use the shaped output as input for deployment jobs in Watson Machine Learning.

For instructions, see Promote a Data Refinery flow to a space in Managing Data Refinery flows.

Support for TSV files
You can now refine data in files that use the tab-separated values (TSV) format. TSV files are read-only.
SJIS encoding available for input and output
SJIS (Shift JIS, or Shift Japanese Industrial Standards) is a character encoding for the Japanese language. SJIS encoding is supported only for CSV and delimited files.

You can change the encoding of input files and output files.

To change the encoding of the input file, click the "Specify data format" icon when you open the file in Data Refinery. See Specifying the format of your data in Data Refinery.

To change the encoding of the output (target) file in Data Refinery, open the Information pane and click the Details tab. Click the Edit button. In the DATA REFINERY FLOW OUTPUT pane, click the Edit icon.

New jobs user interface for running and scheduling flows
For more information, see the What's new entry for Watson Studio.
New visualization charts
For more information, see the What's new entry for Watson Studio.
Data Virtualization
Improve query performance by using cache recommendations
If your queries take a long time to run but your data doesn't change constantly, you can cache query results. Data Virtualization analyzes your queries and provides cache recommendations to improve query performance.

For details, see Cache recommendations.

Optimize query performance by using distributed processing
Data Virtualization can determine the optimal number of worker nodes required to process a query. The number of worker nodes is determined based on the number of data sources connected to the service, available service resources, and the estimated size of the query result.
Manage your virtual data by using Data Virtualization APIs
With the Data Virtualization REST API, you can manage your virtual data, data sources, and user roles. Additionally, you can use the API to virtualize and publish data to the catalog.

For details, see Data Virtualization REST API.
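
As a rough illustration, the following Python sketch calls the Data Virtualization REST API with the requests library to list connected data sources. The endpoint path and response structure are placeholder assumptions; the actual routes are documented in the Data Virtualization REST API reference.

import os
import requests

CPD_URL = os.environ["CPD_URL"]
CPD_TOKEN = os.environ["CPD_TOKEN"]

headers = {"Authorization": f"Bearer {CPD_TOKEN}"}

# Hypothetical request: list the data sources that are connected to Data Virtualization.
response = requests.get(
    f"{CPD_URL}/v2/datasource_connections",   # assumed path for illustration
    headers=headers,
    verify=False,
)
response.raise_for_status()

# The response structure is assumed; inspect the real payload for your release.
for datasource in response.json().get("resources", []):
    print(datasource.get("connection_name"), datasource.get("database_type"))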

Governance and security enhancements for virtual objects
When Watson Knowledge Catalog is installed, you can use policies and data protection rules from Watson Knowledge Catalog to govern your virtual data. Data asset owners are now exempt from data protection rules and policy enforcement in Data Virtualization.

You can also publish your virtual objects to the catalog more easily and efficiently. For example, when you create your virtual objects by using the Data Virtualization user interface, your virtual objects are published automatically to the default catalog in Watson Knowledge Catalog.

Optionally, you can now publish your virtual objects by using the Data Virtualization REST APIs.

For details, see Governing virtual data.

Support for single sign-on and JWT authentication
You can now authenticate to Data Virtualization by using the same credentials you use for the Cloud Pak for Data platform. Additionally, Data Virtualization now supports authentication by using a JSON Web Token (JWT).

For details, see User credentials and authentication methods.

Support for additional data sources
You can now connect to the following data sources:
  • Greenplum
  • Salesforce.com
  • SAP OData

For details, see Adding data sources.

Scale your deployment
You can use the cpd-cli scale command to adjust the number of worker nodes that the Data Virtualization service is running on. When you scale up the service, the service becomes highly available and its processing capacity increases.

For details, see Provisioning Data Virtualization.

Monitor the service by using Db2 Data Management Console
You can use the integrated monitoring dashboard to ensure that the Data Virtualization service is working correctly. The monitoring dashboard is powered by Db2 Data Management Console. Additionally, the monitoring dashboard provides useful information about databases connected to Data Virtualization.

For details, see Monitoring Data Virtualization.

DataStage®
Support for additional connectors
You can now connect to the following data sources:
  • Microsoft Azure Data Lake Store
  • Amazon Redshift
  • Unstructured Data
  • SAP Packs
    • A license is required to use SAP Packs in Cloud Pak for Data. SAP Packs require the legacy Windows DataStage Client to design the jobs. Jobs can then be run in Cloud Pak for Data.
    • User documentation is provided with the license for SAP Packs.

For more details, see Supported connectors.

Additional improvements and updates
  • You can now access DataStage from your projects page. You can create a DataStage project by following the path Projects > All Projects, then creating a new project of type Data transform.
  • Project creation and deletion are now asynchronous. Previously, the DataStage UI was blocked during the time that is required to create or delete a project. Now, you see a notification that says that the request to create or delete the project is submitted. The project appears after the creation or deletion process completes successfully.
  • You can now set up an NFS mount in DataStage pods to pass data files such as CSV and XML between DataStage and source or target systems.
  • You can now use dynamic configuration files without enabling PXRuntime. With this support, the nodes or pods that are used in a job are chosen dynamically based on the resources that are available on them when the job runs. Jobs automatically use the nodes with the most available resources, which increases speed and performance.
  • You can change the resource allocation for the number of CPUs and memory to be used in your jobs.
  • Support is provided for SSL/TLS communication over RPC connections by using Nginx as a proxy server. This support provides greater security for connecting the legacy DataStage Designer client to Cloud Pak for Data. You can then use the Designer client to edit jobs in Cloud Pak for Data.
  • You can create custom images to support third-party drivers. Custom images have the benefit of being immutable after they are built and reliably consistent across different environments. You can also scan the images for vulnerabilities.
  • You can use a PersistentVolume (PV) to support third-party libraries and drivers.
  • The Operations Console is enabled for stand-alone DataStage installation on Cloud Pak for Data.
  • Non en-US language packs are now supported.
  • Notification with mailx is supported. Notifications can be sent out by mailx after an activity completes in a job sequence.
  • The FileConnector heap size setting and the message handler settings are now persistent and will not be lost if pods are restarted.
  • You can now add parameters and parameter sets in the transformer dialog box.
  • LongVarChar lengths of up to 3,000,000 characters are now supported in the Transformer stage.
Db2
Deployment with operator
Db2 is now deployed by using an operator, providing better consistency and predictability and faster deployment times. You can also deploy multiple Db2 databases on the same worker node.
Reduced footprint
Db2 consumes fewer resources than in previous releases. The minimum requirement is now 1.5 virtual processor cores (VPCs) per Db2 database.
Db2 REST and Db2 Graph support
You can set up your Db2 service so that application programmers can create Representational State Transfer (REST) endpoints that can be used to interact with Db2 and run most SQL statements, including DDL and DML. You can also set up the service to use Db2 Graph, so that you can query your Db2 data to perform graph analytics without requiring any changes to the underlying database structure.
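
For example, the following Python sketch shows how an application might call a Db2 REST endpoint to run an SQL statement. The host, the /v1/auth and /v1/services/execsql paths, and the payload fields are assumptions used for illustration; refer to the Db2 REST documentation for the exact API.

import requests

DB2_REST_URL = "https://db2rest.example.com:50050"   # placeholder host for the Db2 REST service

# Hypothetical request: obtain an authentication token from the Db2 REST service.
auth = requests.post(
    f"{DB2_REST_URL}/v1/auth",                       # assumed path
    json={
        "dbParms": {
            "dbHost": "db2.example.com", "dbName": "BLUDB", "dbPort": 50000,
            "isSSLConnection": False, "username": "db2inst1", "password": "password",
        },
        "expiryTime": "1h",
    },
    verify=False,
)
auth.raise_for_status()
token = auth.json()["token"]                         # response field name assumed

# Hypothetical request: run an SQL statement through the REST endpoint.
result = requests.post(
    f"{DB2_REST_URL}/v1/services/execsql",           # assumed path
    json={"isQuery": True, "sqlStatement": "SELECT COUNT(*) FROM SYSCAT.TABLES", "sync": True},
    headers={"authorization": token},
    verify=False,
)
result.raise_for_status()
print(result.json())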
Run on zLinux
You can deploy Db2 on Red Hat® OpenShift® clusters that run on the zLinux (s390x) operating system.
Version upgrade
The Db2 service runs Db2 Version 11.5.5.
Storage enhancements
Db2 now supports the following storage options:
  • IBM Spectrum® Scale CSI 2.0
  • Red Hat OpenShift Container Storage 4.5
  • Portworx 2.5.5
More backup and restore options
You can back up or restore by using remote storage such as IBM Cloud Object Storage or Amazon S3. Db2 also now offers the option of restoring an encrypted database.
Security enhancements
You can directly authenticate with the Db2® service by using your Cloud Pak for Data user ID and password. The Db2 service uses Cloud Pak for Data authentication and authorization and supports TLS certificates. You can also authenticate with JWT tokens and API keys, and you can download the Db2 SSL certificate directly from the web console.
Db2 Data Management Console support
You can use the Db2 Data Management Console service on Cloud Pak for Data to administer, monitor, manage, and optimize the performance of Db2 databases.
Db2 Big SQL
Support for Cloudera Hadoop clusters
You can query data that is stored in remote CDP and CDH clusters, in addition to HDP clusters, which were already supported. For details, see Remote Hadoop cluster or public or private object store.
Improved integration with the web client
When logged on as an administrator, you can now use the Cloud Pak for Data web client to complete the following tasks:
  • After you install the Db2 Big SQL service, you can use the web client to provision one or more instances of the service. Each instance can use a different resource configuration, be accessed by different users, or point to a different Hadoop cluster.
  • Update an instance configuration. For each instance, you can optionally:
    • Scale the instance up or down by allocating additional or fewer resources.
    • Scale the instance out or in by adding or removing workers.
  • Track Db2 Big SQL resource usage at the instance level.
  • Gather diagnostic information, such as logs.

For details, see Provisioning Db2 Big SQL, Using the Cloud Pak for Data web client to administer the Db2 Big SQL service, and Gathering diagnostic information.

Monitor the service by using Db2 Data Management Console
You can use the integrated monitoring dashboard to ensure that the Db2 Big SQL service is working correctly. The monitoring dashboard is powered by Db2 Data Management Console.

For details, see Monitoring Db2 Big SQL.

Db2 Data Gate
Improved installation experience
It's now easier to install and configure Db2 Data Gate with simplified security setup and certificate generation on z/OS®.

It's also easier to provision instances of Db2 Data Gate with a streamlined process.

Improved performance
The Db2 Data Gate service has increased throughput and lower latency when loading and synchronizing data from Db2 for z/OS to the target database.
Run on zLinux
You can deploy Db2 Data Gate on Red Hat OpenShift clusters that run on the zLinux (s390x) operating system.
Db2 Event Store
Run on larger clusters
Db2 Event Store can run on Red Hat OpenShift clusters with more than 3 worker nodes for increased performance and scalability.
Support for new data types
Db2 Event Store now supports the decimal data type.
Support for Apache Spark 2.4.6
Db2 Event Store supports the Apache Spark 2.4.6 unified analytics engine for big data processing.
Db2 Warehouse
Deployment with operator
Db2 Warehouse is now deployed by using an operator, providing better consistency and predictability and faster deployment times. You can also deploy multiple Db2 Warehouse databases on the same worker node.
Reduced footprint
Db2 Warehouse consumes fewer resources than in previous releases. The minimum requirement is now 1.5 VPCs per Db2 Warehouse database.
Db2 REST and Db2 Graph support
You can set up your Db2 Warehouse service so that application programmers can create Representational State Transfer (REST) endpoints that can be used to interact with Db2 Warehouse and run most SQL statements, including DDL and DML. You can also set up the service to use Db2 Graph, so that you can query your Db2 Warehouse data to perform graph analytics without requiring any changes to the underlying database structure.
Support for object storage providers (MPP only)
The Db2 Warehouse service in a massively parallel processing (MPP) configuration can work with data in external tables in cloud object storage providers such as Amazon S3 and Microsoft Azure Blob Storage, or any other S3 compatible storage such as IBM® Cloud Object Storage or MinIO. This option is available for Db2 Warehouse MPP deployments.
Db2 Data Management Console support
You can use the Db2 Data Management Console service on Cloud Pak for Data to administer, monitor, manage, and optimize the performance of Db2 Warehouse databases.
Run on zLinux
You can deploy Db2 Warehouse on Red Hat OpenShift clusters that run on the zLinux (s390x) operating system.
Version upgrade
The Db2 Warehouse service runs Db2 Warehouse Version 11.5.5.
Storage enhancements
Db2 Warehouse now supports the following storage options:
  • IBM Spectrum Scale CSI 2.0
  • Microsoft Azure Blob Storage (object storage)
  • Amazon S3 Cloud object storage
  • Red Hat OpenShift Container Storage 4.5
  • Portworx 2.5.5
More backup and restore options
You can back up or restore by using remote storage such as IBM Cloud Object Storage or Amazon S3. Db2 Warehouse also now offers the option of restoring an encrypted database.
Security enhancements
You can directly authenticate with the Db2 Warehouse service by using your Cloud Pak for Data user ID and password. The Db2 Warehouse service uses Cloud Pak for Data authentication and authorization and supports TLS certificates. You can also authenticate with JWT tokens and API keys, and you can download the Db2 Warehouse SSL certificate directly from the web console.
Decision Optimization
Overview pane in the model builder
The overview pane provides you with model, data and solution summary information for all your scenarios at a glance. From this view you can also open an information pane where you can create or choose your deployment space.

For details, see the Overview section in Decision Optimization model builder views and scenarios.

Enhanced Explore solution view in the model builder
The Explore solution view of the model builder shows you more information about the objectives (or KPIs), solution tables, constraint or bounds relaxations or conflicts, engine statistics, and log files.

For details, see the Explore solution view section in Decision Optimization model builder views and scenarios.

Gantt charts available for any type of data
From the Visualization view of the model builder, you can create Gantt charts for any type of data where it is meaningful. Gantt charts are no longer restricted to scheduling models.

Visualization view with Gantt chart

For details, see the Gantt chart widget section in Visualization view.

Support for Python 3.7
The Decision Optimization model builder now targets the Python 3.7 runtime when generating notebooks from scenarios. In Watson Machine Learning, the Decision Optimization runtime now runs Python 3.7.
Improved data schema editing in the Modeling Assistant
You can now define data types for table columns and edit data schema when you use the Modeling Assistant.
Delegation of CPLEX engine solve to Watson Machine Learning
You can now delegate the Decision Optimization solve to run on Watson Machine Learning from your Java CPLEX or CPO models.
Language support

The Decision Optimization interface is now translated into multiple languages.

Execution Engine for Apache Hadoop
Integration with IBM Spectrum Conductor with Spark clusters
IBM Spectrum Conductor with Spark is now supported. You can integrate IBM Spectrum Conductor with Spark and Watson Studio by using Jupyter Enterprise Gateway endpoints. Users can open a notebook in Watson Studio to access Jupyter Enterprise Gateway instances that are running on IBM Spectrum Conductor with Spark. For details, see Spectrum environments.
New configurations that allow you to use your own certificates
The new configurations allow DSXHI to support the following customizations:
  • Provide a custom Keystore to generate the required .crt.
  • Provide any custom truststore (CACERTS), where DSXHI certificates will be added.
  • Provide options to either add the host certificate to the truststore yourself or have DSXHI add it.

For details, see Installing the Execution Engine for Apache Hadoop service on Apache Hadoop clusters or on Spectrum Conductor clusters.

Support for additional types of security
Execution Engine for Apache Hadoop supports:
  • The JSON Web Tokens to Kerberos delegation token provider, which provides authentication to HiveServer2, HDFS, and HMS resources. For details, see Using delegation token endpoints.
  • Updated versions of Jupyter Enterprise Gateway (2.3) and Knox (1.4).
Improved validation
The system_check.py scripts were introduced to validate your Hadoop configuration.
Guardium External S-TAP®
Improved integration with the Cloud Pak for Data web client
You can now create and manage your Guardium External S-TAP instances from the Cloud Pak for Data web client.
Support for new target databases
You can use the Guardium External S-TAP to monitor additional databases. For details, see External S-TAP supported platforms on the IBM Support portal.
Jupyter Notebooks with Python 3.7 for GPU This service now provides environments for Python 3.7 instead of Python 3.6.
Jupyter Notebooks with R 3.6
Support for loading data from database connections
You can use the insert to code function to load data to a notebook from the following database connections:
  • Cognos Analytics
  • HTTP
  • Apache Cassandra
  • Amazon RDS for PostgreSQL
  • Amazon RDS for MySQL
  • Mounted storage volumes
  • IBM Cloud Object Storage

For details, see Data load support for database connections.

RStudio® Server with R 3.6
Configure RStudio idle timeout
A Cloud Pak for Data administrator can disable or change the idle timeout of RStudio runtimes.

For details, see Disabling or changing RStudio idle timeout.

Support for RMySQL library functions
You can connect to a MySQL database and use MySQL library functions in RStudio.

For details, see Using RMySQL library functions.

SPSS® Modeler
SPSS Analytic Server
A new SPSS Analytic Server connection type is available for SPSS Modeler. With this connection type, you can import and run SPSS Modeler streams (.str) that were created in SPSS Modeler classic to run on SPSS Analytic Server. See Supported data sources for SPSS Modeler for more information.
Jobs
You can now create and schedule jobs as a way of running SPSS Modeler flows. Click the Jobs icon from the SPSS Modeler toolbar and select Create a job. See Creating and scheduling jobs for more information.
New and changed nodes
SPSS Modeler includes the following new and changed nodes:
  • CPLEX® Optimization node: With this new node, you can use complex mathematical (CPLEX) based optimization via an Optimization Programming Language (OPL) model file.

    CPLEX Optimization node

  • Kernel Density Estimation (KDE) Simulation node: This new node uses the Ball Tree or KD Tree algorithms for efficient queries, and walks the line between unsupervised learning, feature engineering, and data modeling.

    KDE Simulation node

  • Data Asset Export node: This node has been redesigned. Use the node to write to remote data sources using connections, write to a data file on your local computer, or write data to your project.

    Data Asset Export node

Support for database functions
You can run SPSS Modeler desktop stream files (STR) that contain database functions.
New visualization charts
For more information, see the What's new entry for Watson Studio.
Deploy Text Analytics models to Watson Machine Learning Server
You can now deploy Text Analytics models to a Watson Machine Learning Server as you can with other model types. Deployment is the final stage of the lifecycle of a model: making a copy of the model available to test and use. For example, you can create a deployment for a model so you can submit new data to it and get a score or prediction back.
Python 3.7
SPSS Modeler now uses Python 3.7.9. Note that the Python schema has changed, so you may need to review and adjust any Python scripts you use in SPSS Modeler.
Streams
Application resource customization
You can customize the resources that are used by your application. You can:
  • Create a custom application image for dependencies, such as software packages or libraries, that are not included in the default application image.
  • Customize the resources, such as CPU or memory, that your Streams applications use by creating custom application resource templates.
For more information, see Customizing the application image.
Production workload management and monitoring
The Deployment spaces page provides a dashboard that you can use to monitor and manage Streams jobs in multiple deployment spaces.
Edge Analytics 1.1.0 beta integration
OpenShift image build integration
You can build a Docker image loaded with your streaming data application, ready for Edge Analytics deployment. For more information, see Packaging an Edge Analytics application or service for deployment.
Enhanced development environments
Use your favorite Streams development environment (streamsx Python API, notebooks, Visual Studio Code, or Streams Flows) to build your edge application and image. For more information, see Developing edge applications with IBM Edge Analytics.
Enhanced Streams standalone applications
Application metrics, such as data tuple counters, operator costs, and user-defined metrics, can be exposed and the default threading model can be specified for standalone Streams applications.
Edge-aware samples
Explore new application samples designed for the edge.
Streams jobs as Cloud Pak for Data services
This release introduces the ability to enable a Streams job as a Cloud Pak for Data service. A streams-application service can be used to insert data into and retrieve data from a Streams job. A streams-application service is created by inserting one or more endpoint operators into an application and submitting the application to run as a job. Exchanging data with the job is done by using a REST API. The streams-application service instances are included in the Services > Instances page of the Cloud Pak for Data web client. Selecting a service entry in the list opens the REST API documentation for the service.

For additional information, restrictions, and sample applications, see Resources for Streams developers in the IBM Community.
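
As a sketch of the idea, the Python snippet below uses the requests library to send data to and read data from a streams-application service through its REST API. The endpoint names and paths are placeholders; the actual paths are shown in the REST API documentation that opens when you select the service instance.

import os
import requests

SERVICE_URL = os.environ["STREAMS_APP_SERVICE_URL"]   # placeholder: from Services > Instances
CPD_TOKEN = os.environ["CPD_TOKEN"]

headers = {"Authorization": f"Bearer {CPD_TOKEN}", "Content-Type": "application/json"}

# Hypothetical request: insert a data tuple into the job through an injection endpoint.
requests.post(
    f"{SERVICE_URL}/endpoints/sensor-input",          # assumed endpoint name
    json={"device_id": "sensor-42", "temperature": 21.5},
    headers=headers,
    verify=False,
).raise_for_status()

# Hypothetical request: retrieve processed tuples from a retrieval endpoint.
response = requests.get(
    f"{SERVICE_URL}/endpoints/alerts-output",         # assumed endpoint name
    headers=headers,
    verify=False,
)
response.raise_for_status()
print(response.json())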

Streams Flows
  • Support for class style in code operators
  • Support for flat map that allows returning multiple events
  • New window and aggregation operators
Watson Knowledge Catalog
Reference data set enhancements
You can customize your reference data sets in the following ways:
  • Configure hierarchies between reference data sets and between values within a reference data set.
  • Add custom columns.
  • Create values mappings, or crosswalks, between values of multiple reference data sets in 1:1, n:1, and 1:n relationships.

For details, see Reference data sets.

Catalog enhancements
Catalogs are enhanced in the following ways:
  • Additional information is shown on the new Overview page for assets, such as the asset's path and related assets.
  • More activities are shown on the Activities page for assets.
  • COBOL copybook is now a supported asset type. You can preview the contents of copybooks.
  • You can add more types of assets and metadata to catalogs by coding custom attributes for assets and custom asset types with APIs.
New connections
Watson Knowledge Catalog can connect to:
  • Amazon RDS for MySQL
  • Amazon RDS for PostgreSQL
  • Apache Cassandra
  • Apache Derby
  • Box
  • Elasticsearch
  • HTTP
  • IBM Data Virtualization Manager for z/OS
  • IBM Db2 Event Store
  • IBM SPSS Analytic Server
  • MariaDB
  • Microsoft Azure Blob Storage
  • Microsoft Azure Cosmos DB
  • MongoDB
  • SAP HANA
  • Storage volume
In addition, the following connection names have changed:
  • PureData System for Analytics is now called Netezza® (PureData® System for Analytics)

    Your previous settings for the connection remain the same. Only the name for the connection type changed.

New SSL encryption support for connections
The following connections now support SSL encryption in Watson Knowledge Catalog:
  • Amazon Redshift
  • Cloudera Impala
  • IBM Db2 for z/OS
  • IBM Db2 Warehouse
  • IBM Informix®
  • IBM Netezza (PureData System for Analytics)
  • Microsoft Azure SQL Database
  • Microsoft SQL Server
  • Pivotal Greenplum
  • PostgreSQL
  • Sybase
Category roles control governance artifacts
The permissions to view and manage all types of governance artifacts, except for data protection rules, are now controlled by collaborator roles in the categories that are assigned to the artifacts.
To view or manage governance artifacts, users must meet these conditions:
  • Have a user role with one of the following permissions:
    • Access governance artifacts
    • Manage governance categories
  • Be a collaborator in a category

Category collaborators have roles with permissions that control whether they can view artifacts, manage artifacts, manage categories, and manage category collaborators. Subcategories inherit collaborators from their parent categories. Subcategories can have other collaborators, and their collaborators can accumulate more roles. The predefined collaborator, All users, includes everyone with permission to access governance artifacts.

For details, see Categories.

Changes to user permissions
If you upgraded from Cloud Pak for Data Version 3.0.1, the following user permissions are automatically migrated as part of the upgrade:
  • Users who had the Manage governance categories permission continue to have that permission and also have the Owner role for all top-level categories.
  • Users who had the Manage governance artifacts permission now have the Access governance artifacts permission, the Editor role in all categories, and the new Manage data protection rules permission.
  • All users now have the Access governance artifacts permission. However, when you add new users, the Access governance artifacts permission is not included in all of the predefined roles. It is included in the Administrator, Data Engineer, Data Steward, and Data Quality Analyst roles.
  • All users who were listed as Authors in a governance workflow now have the Access governance artifacts permission and also the Editor role in all categories.
Workflows for governance artifacts support categories
Workflow configurations for governance artifacts now require categories to identify the governance artifacts and users for the workflow:
  • When you create a new workflow configuration for governance artifacts, you must select either one category or all categories as part of the triggering condition for the workflow, along with governance artifact types and events.
  • You no longer specify artifact authors in a workflow configuration. Artifact authors are all users who have permission to edit artifacts in a category that is specified in the workflow configuration.
  • You now specify one or more of these types of assignees to approve and review artifacts: the workflow requestor, users with specified roles in the categories for the workflow, users with the Data Steward role, or selected users.

For details, see Managing workflows for governance artifacts.

Discovery enhancements
Watson Knowledge Catalog includes the following changes for discovering data:
Automated discovery
The sample size is 1,000 records by default. Changes require specific permissions.
Quick scan
With the improved version, you can perform more scalable data discovery with richer analysis results that can be published to one or more catalogs directly from the quick scan results.

For details, see Running a quick scan.

Import metadata from an analytics project
You can use the metadata import asset type to import data assets from a connection so that you can analyze and enrich the assets later.

For details, see Importing metadata.

Import additional artifacts and properties
You can now import reference data sets. When you import a reference data set, you can also import secondary categories, effective dates, and custom attribute values for most artifacts.
For business terms, you can import:
  • Type of terms relationships
  • Assigned data classes
  • Synonyms

For details, see Importing governance artifacts.

Watson Machine Learning
Support for the V4 REST APIs and Python client
Watson Machine Learning supports the generally available releases of the Watson Machine Learning V4 REST APIs and the V4 Watson Machine Learning Python client, which give you programmatic access to all of the current machine learning features.

For details, see Watson Machine Learning APIs and Watson Machine Learning Python library.
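
For example, here is a minimal sketch that uses the V4 Python client to connect to Watson Machine Learning on the cluster and list assets. The URL and credentials are placeholders, and the method names reflect the generally available ibm-watson-machine-learning package; adjust the details to your environment.

from ibm_watson_machine_learning import APIClient

wml_credentials = {
    "url": "https://cpd.example.com",   # placeholder Cloud Pak for Data URL
    "username": "admin",                # placeholder credentials
    "password": "password",
    "instance_id": "openshift",
    "version": "3.5",
}

client = APIClient(wml_credentials)

# List the deployment spaces you can access, then scope the client to one of them.
client.spaces.list()
space_id = "<your-space-id>"
client.set.default_space(space_id)

# List the models and deployments that are stored in that space.
client.repository.list_models()
client.deployments.list()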

Support for Data Refinery flows
You can run Data Refinery flow jobs in a deployment space and use the resulting data as input for deployment jobs.

For details, see Deployment spaces.

Use data from NFS
You can use data from Network File System (NFS) to train models and as input data for deployment jobs. For example, you can use a CSV file from a storage volume as the training data for an AutoAI model, and use a payload file from the volume to deploy and score the trained model.

For details, see Adding data sources.

Support for additional connections
Support for more types of data connections for use in training and deploying models gives you greater flexibility when you create deployment jobs.

For details, see Batch deployment details.

Support for Python 3.7
Train and deploy models and functions using new frameworks and software specifications built with Python 3.7.

For details, see Supported frameworks.

Create batch deployments for R Scripts
In addition to Python scripts, you can now deploy R scripts as a means of working with Watson Machine Learning assets.

For details, see Batch deployment details.

Deployment spaces dashboard
View deployment activity across all spaces you can access in a new deployment spaces dashboard. Use the dashboard to monitor activity for all of your spaces and view visualizations to give you insights into deployments and jobs.

For details, see Deployments dashboard.

Federated learning
Tech preview This is a technology preview and is not supported for use in production environments.

Use federated learning to train a common model using remote, secure data sets. The data sets are not shared so full data security is maintained, while the resulting model gets the benefit of the expanded training.

For details, see Federated learning.

Multiple data sources for AutoAI experiments
Tech preview This is a technology preview and is not supported for use in production environments.

AutoAI experiments support multiple data sources as input for training an experiment. Use the data join canvas to combine the data sets based on common columns, or keys, to build a unified data set. Deploy a data join model using multiple data sets as input for your jobs.

For details, see Joining data.

Save AutoAI as a Watson Machine Learning notebook
Tech preview This is a technology preview and is not supported for use in production environments.

Save an AutoAI experiment as a Watson Machine Learning notebook so you can review the code for developing the pipelines.

For details, see Saving as a notebook.

Watson OpenScale
Enhanced explainability
The updated explainability panel is based on extensive customer feedback and focus groups. It includes the ability to run "What if" scenarios.

For details, see Explaining transactions.

Indirect bias
Watson OpenScale analyzes indirect bias, which occurs when one feature can be used to stand for another. For example, one feature in a model might approximate another feature that is a protected attribute. Although it is illegal to discriminate based on race, race can sometimes correlate closely with postal code, which might be the cause of indirect bias.

For details, see Indirect bias.

Dashboard filtering for large deployments
For Watson OpenScale dashboards with a large number of deployments, you can use the new controls to filter and sort tiles.
Role-based access control
You can assign users varying levels of permissions based on the actions they need to perform.

For details, see User roles.

Support for multiple instances
You can now deploy multiple instances of the Watson OpenScale service in a single namespace.

For details, see Setting up multiple instances.

Auto config
When you use resources that are already part of the Cloud Pak for Data cluster, such as Watson Machine Learning, many of the values are supplied for you when you configure Watson OpenScale.
The Drift Monitor also completes many of the values for you during configuration and setup.
New version of the Python SDK
This release includes a new, more integrated version of the Watson OpenScale Python SDK.

The new Python SDK replaces the Version 1 SDK, eliminates separate APIs for each monitor, and standardizes many of the classes and methods used for monitor configuration and subscription to machine learning providers.

For details, see the Python SDK documentation.
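
For instance, a minimal sketch of connecting with the new Python SDK might look like the following. The host name and credentials are placeholders, and the class and method names reflect the ibm-watson-openscale package, so verify them against the Python SDK documentation.

from ibm_cloud_sdk_core.authenticators import CloudPakForDataAuthenticator
from ibm_watson_openscale import APIClient

# Placeholder host and credentials; replace with your cluster values.
authenticator = CloudPakForDataAuthenticator(
    url="https://cpd.example.com",
    username="admin",
    password="password",
    disable_ssl_verification=True,
)

client = APIClient(
    service_url="https://cpd.example.com",
    authenticator=authenticator,
)

# Show the configured data mart and the machine learning provider subscriptions.
client.data_marts.show()
client.subscriptions.show()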

Streamlined Fairness user interface
The Fairness Monitor has undergone extensive redesign based on your feedback! Now you can use the enhanced charts to determine balanced data and perfect equality at a glance. You can even do what-if scenarios with scoring in real time.

Fairness monitor page

Model Risk Management notifications
There are several enhancements to Model Risk Management. You can set thresholds for receiving email notifications of violations, and the PDF reports are enhanced. In addition, when Watson OpenScale is integrated with IBM OpenPages, you can now set when to send metrics (immediately, daily, or weekly).
Debiasing support for regression models
Along with classification models, Watson OpenScale now detects bias in regression models. You can both detect and mitigate bias.
Watson Studio
New jobs interface for running and scheduling notebooks and Data Refinery flows
The user interface gives you a unified view of the job information.

You can create the jobs from either of the following locations:

  • The user interface for the service
  • The Assets page of a project

For details, see Jobs in a project.

New visualization charts
You can use the following visualization charts with Data Refinery and SPSS Modeler:
Evaluation charts
Evaluation charts are combination charts that measure the quality of a binary classifier. You need three columns for input: the actual (target) value, the predicted value, and the confidence (0 or 1). Move the slider in the Cutoff chart to dynamically update the other charts. The ROC and other charts are standard measurements of the classifier.

Evaluation chart

Math curve charts
Math curve charts display a group of curves based on equations that you enter. You do not use a data set with this chart. Instead, you use it to compare the results with the data set in another chart, like the scatter plot chart.

Math curve chart

Sunburst charts
Sunburst charts display different depths of hierarchical groups. The Sunburst chart was formerly an option in the Treemap chart.

Sunburst chart

Tree charts
Tree charts represent a hierarchy in a tree-like structure. The Tree chart consists of a root node, line connections called branches that represent the relationships and connections between the members, and leaf nodes that do not have child nodes. The Tree chart was formerly an option in the Treemap chart.

Tree chart

For the full list of available charts, see Visualizing your data.

New project settings
When you create a project, you can select the following options:
Mark the project as sensitive
Marking a project as sensitive prevents members of a project from moving data assets out of the project.

For details, see Marking a project as sensitive.

Log all project activities
Logging all project activity tracks detailed project activity and creates a full activities log, which you can download to view.

For details, see Logging project activity.

New connections
Watson Studio can connect to:
  • Amazon RDS for MySQL
  • Amazon RDS for PostgreSQL
  • Apache Cassandra
  • Apache Derby
  • Box
  • Elasticsearch
  • HTTP
  • IBM Data Virtualization Manager for z/OS
  • IBM Db2 Event Store
  • IBM SPSS Analytic Server
  • MariaDB
  • Microsoft Azure Blob Storage
  • Microsoft Azure Cosmos DB
  • MongoDB
  • SAP HANA
  • Storage volume
In addition, the following connection names have changed:
  • PureData System for Analytics is now called Netezza (PureData System for Analytics).

    Your previous settings for the connection remain the same. Only the name for the connection type changed.

New SSL encryption support for connections
The following connections now support SSL encryption in Watson Studio:
  • Amazon Redshift
  • Cloudera Impala
  • IBM Db2 for z/OS
  • IBM Db2 Warehouse
  • IBM Informix
  • IBM Netezza (PureData System for Analytics)
  • Microsoft Azure SQL Database
  • Microsoft SQL Server
  • Pivotal Greenplum
  • PostgreSQL
  • Sybase
Support for Python 3.7
The default Python environment version in Watson Studio is now Python 3.7.

Python 3.6 is being deprecated. You can continue to use the Python 3.6 environments; however, you will be notified that you should move to a Python 3.7 environment.

When you switch from Python 3.6 to Python 3.7, you might need to update your code if the versions of open source libraries that you use are different in Python 3.7.

Spark 3.0
You can run analytical assets from Watson Studio analytics projects in a Spark 3 environment.

If you use the Spark Jobs REST APIs, provided by Analytics Engine Powered by Apache Spark, to run Spark jobs or applications on your Cloud Pak for Data cluster, you can use the Spark 3.0 template.
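
For example, a job submission with the Spark 3.0 template might look like the following Python sketch. The jobs endpoint, template ID, and payload fields are placeholder assumptions; copy the actual endpoint and template name from your Analytics Engine Powered by Apache Spark instance details and the Spark jobs REST API reference.

import os
import requests

JOBS_ENDPOINT = os.environ["SPARK_JOBS_ENDPOINT"]   # placeholder: copied from the instance details page
CPD_TOKEN = os.environ["CPD_TOKEN"]

payload = {
    "template_id": "spark-3.0.0-cp4d-template",     # assumed name of the Spark 3.0 template
    "application_details": {
        "application": "/opt/ibm/spark/examples/src/main/python/pi.py",
        "application_arguments": ["10"],
    },
}

response = requests.post(
    JOBS_ENDPOINT,
    json=payload,
    headers={"Authorization": f"Bearer {CPD_TOKEN}"},
    verify=False,
)
response.raise_for_status()
print(response.json())   # job ID and state (response structure assumed)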

Notebook execution progress restored
If you accidentally close the browser window while your notebook is still running, or if you are logged out by the system during a long-running job, the notebook continues running and all output cells are restored when you open the notebook again. The execution progress of a notebook can be restored only for notebooks that run in a local kernel. If your notebook runs on a Spark or Hadoop cluster and you open the notebook again, any notebook changes that were not saved are lost.
Use a self-signed certificate to authenticate to enterprise Git repositories
If you want to store your analytics project in an enterprise-grade instance of Git, such as GitHub Enterprise, and your instance uses a self-signed certificate for authentication, you can specify the self-signed certificate in PEM format when you add your personal access token to Cloud Pak for Data.

New services

The following table lists the new services that are introduced in Cloud Pak for Data Version 3.5.0:

Category | Service | Pricing | What does it mean for me?
Data source Db2 Data Management Console Included with Cloud Pak for Data Use Db2 Data Management Console to administer, monitor, manage, and optimize your integrated Db2 databases, including Db2 Big SQL and Data Virtualization, from a single user interface. The console helps you improve your productivity by providing a simplified process for managing and maintaining your complex database ecosystem across Cloud Pak for Data.

The console home page provides an overview of all of the databases that you are monitoring. The home page includes the status of database connections and monitoring metrics that you can use to analyze and improve the performance of your databases.

Db2 Data Management Console home page

From the console, you can also:
  • Administer databases
  • Work with database objects and utilities
  • Develop and run SQL scripts
  • Move and load large amounts of data into databases for in-depth analysis
  • Monitor the performance of your Db2 databases

Learn more about Db2 Data Management Console.

Industry solutions OpenPages® Separately priced You can use OpenPages to manage risk and regulatory challenges across your organization. OpenPages is an integrated governance, risk, and compliance (GRC) suite that can help your organization identify, manage, monitor, and report on risk and compliance initiatives that span your enterprise. The service provides a powerful, scalable, and dynamic set of tools that can help you with:
  • Business continuity management
  • Financial controls management
  • Internal audit management
  • IT governance
  • Model risk governance
  • Operational risk management
  • Policy management
  • Regulatory compliance management
  • Third-party risk management

OpenPages dashboard

Learn more about OpenPages.

Industry solutions

IBM Open Data for Industries Separately priced Collect, describe, and provide your data according to Oil & Gas industry standards.

IBM Open Data for Industries provides a toolset that supports an industry-standard methodology for collecting and describing Oil & Gas data and serving that data to various applications and services that consume it.

IBM Open Data for Industries provides a reference implementation for a data platform to integrate silos and simplify access to this data for stakeholders. It standardizes the data schemas and provides a set of unified APIs for bringing data into Cloud Pak for Data, describing, validating, finding, and retrieving data elements and their metadata. Effectively, Open Data for Industries becomes a system of record for subsurface and wells data.

Application developers can use these APIs to create applications that are directly connected to the stakeholder's data sets. After the application is developed, it requires minimal or no customization to deploy it for multiple stakeholders that adhere to the same APIs and data schemas.

In addition, stakeholders can use these APIs to connect their applications with the platform and take advantage of the seamless data lifecycle in Cloud Pak for Data.

Learn more about IBM Open Data for Industries.

AI Watson Machine Learning Accelerator Included with Cloud Pak for Data Watson Machine Learning Accelerator is a deep learning platform that data scientists can use to optimize training models and monitor deep learning workloads.

Watson Machine Learning Accelerator can be connected to Watson Machine Learning to take advantage of the multi-tenant resource plans that manage resource sharing across Watson Machine Learning projects. With this integration, data scientists can use the Watson Machine Learning Experiment Builder and Watson Machine Learning Accelerator hyperparameter optimization.

Learn more about Watson Machine Learning Accelerator.

Installation enhancements

What's new | What does it mean for me?
Red Hat OpenShift support You can deploy Cloud Pak for Data Version 3.5 on the following versions of Red Hat OpenShift:
  • Version 3.11
  • Version 4.5
Support for zLinux You can deploy the following Cloud Pak for Data software on zLinux (s390x):
  • The Cloud Pak for Data control plane
  • Db2
  • Db2 Warehouse
  • Db2 for z/OS Connector
  • Db2 Data Gate
Simplified and updated installation commands The Cloud Pak for Data command-line interface uses a simplified syntax. The cpd-Operating_System command is replaced by the cpd-cli command.

When you download the installation files, you must select the appropriate package for the operating system where you will run the commands. For details, see Obtaining the installation files.

Many of the cpd-cli commands have different syntax. Review the installation documentation carefully to ensure that you use the correct syntax.

For example:
  • On air-gapped clusters, the cpd-Operating_System preloadImages command is now cpd-cli preload-images.
  • When you run the install or upgrade commands, you specify the --latest-dependency flag to ensure that the latest prerequisite components are installed.
Upgrading the Cloud Pak for Data metadata Before you can upgrade to Cloud Pak for Data Version 3.5, you must upgrade the Cloud Pak for Data metadata by running the cpd-cli operator-upgrade command.

For details, see Preparing to upgrade the control plane.

New service account The Cloud Pak for Data control plane requires an additional service account: cpd-norbac-sa, which is bound to a restricted security context constraint (SCC).

This service account is specified in the cpd-cli adm command for the control plane.

Simplified storage overrides If an assembly requires an override for Portworx or OpenShift Container Storage, the assembly includes predefined override files. The instructions for the assembly will include information on how to install the service with the appropriate override file for your environment.
Rolling back patches Whether a patch succeeded or failed, you can now revert a service to the state before the patch was applied by running the cpd-cli patch rollback command.

For details, see Applying patches.

Operator-based installation on the Red Hat Marketplace If you want to install Cloud Pak for Data from the Red Hat Marketplace, you can use the Cloud Pak for Data operator. You can use the operator to install, scale, and upgrade the Cloud Pak for Data control plane and services using a custom resource (CR).

The operator will be available through the Red Hat Marketplace and is compatible with the Red Hat Operator Lifecycle Manager.

For details, see Installing the Cloud Pak for Data control plane from the OpenShift Console.

Deprecated features

What's changed | What does it mean for me?
Open Source Management This service is deprecated and cannot be deployed on Cloud Pak for Data Version 3.5.
Regulatory Accelerator This service is deprecated and cannot be deployed on Cloud Pak for Data Version 3.5.
Extracting business terms and governance rules from PDF files This feature was provided as a technology preview in Watson Knowledge Catalog and is no longer supported.
Generating terms from assets This feature was provided as a technology preview in Watson Knowledge Catalog and is no longer supported.

LDAP group roles You can no longer map an LDAP group directly to a Cloud Pak for Data role. Instead, you can create user groups and add an LDAP group to the user group. When you create a user group, you can assign one or more roles to the user group.

Previous releases

Looking for information about what we've done in previous releases? See the what's new topics for previous releases in IBM Knowledge Center.