Configure Watson OpenScale to work in batch mode by
connecting a custom Watson OpenScale machine learning
engine to an Apache Hive database and an Apache Spark analytics engine. Unlike online scoring, where
the scoring is done in real time and the payload data can be logged into the Watson OpenScale data mart, batch processing is done
asynchronously. The batch processor reads the scored data and derives various model metrics. This
means that millions of transactions can be processed without bringing the data into the data mart. To
enable batch processing, you must apply the cpd-aiopenscale-3.5.0-patch-1
patch.
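For orientation, the following PySpark sketch approximates the kind of batch read and aggregation that the processor performs. It is illustrative only: the Hive table name scored_payload and its prediction and label columns are hypothetical, and the actual metrics computation is handled by Watson OpenScale.
    # Illustrative sketch only: read scored transactions from Hive with Spark and
    # derive a simple quality metric without copying the data into the data mart.
    # The table name and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("openscale-batch-metrics-sketch")
             .enableHiveSupport()
             .getOrCreate())

    scored = spark.table("scored_payload")

    accuracy = (scored
                .withColumn("correct", (F.col("prediction") == F.col("label")).cast("int"))
                .agg(F.avg("correct").alias("accuracy")))

    accuracy.show()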
IBM Cloud Pak for Data 3.5.0 introduces a home page
that is now customizable, simpler navigation, improved platform and production workload management,
and broader support for connections from the platform with easier connection management. The release
also includes support for zLinux, a vault to store sensitive data, several new services, and
numerous updates to existing services.
The following table lists the new features that were introduced in Cloud Pak for Data Version 3.5.0.
What's new
What does it mean for me?
Customize the home page
In Cloud Pak for Data Version 3.5.0, you can
customize the home page in two ways:
Platform-level customization
A Cloud Pak for Data administrator can specify which
cards and links to display on the home page.
Cards
The cards that are available from the home page are determined by the services that are
installed on the platform.
You can disable cards if you don't want users to see them. The changes
apply to all users. However, the cards that an individual user sees are determined by their
permissions and the services that they have access to.
Resource links
You can customize the links that are displayed in the Resources section
of the home page.
Each user can specify the cards that are displayed on their home page. (However, the list of
cards that they can choose from is determined by the Cloud Pak for Data administrator.)
In addition, each user can
specify which links to display in the Quick navigation section of their home
page.
These features are offered in addition to the branding features that were
introduced in Cloud Pak for Data
3.0.1.
Create user groups
A Cloud Pak for Data administrator can create
user groups to make it easier to manage large numbers of users who need similar permissions.
When
you create a user group, you specify the roles that all of the members of the group have.
If
you configure a connection to an LDAP server, user groups can include:
Existing platform users
LDAP users
LDAP groups
You can assign a user group access to various assets on the platform in the same way
that you assign an individual user access. The benefit of a group is that it is easier to:
Give many users access to an asset.
Remove a user's access to assets by removing them from the user group.
Cloud Pak for Data Version 3.5.0 makes it easier
to manage and monitor your Cloud Pak for Data deployment.
The Platform management page gives you a quick overview of the services,
service instances, environments, and pods running in your Cloud Pak for Data deployment. The Platform
management page also shows any unhealthy or pending pods. If you see an issue, you can
use the cards on the page to drill down to get more information about the problem.
In addition, you can see your current vCPU and memory use. You can
optionally set quotas to help you track your actual use against your target use. When you set
quotas, you specify alert thresholds for vCPU and memory use. When you reach the alert threshold,
the platform sends you an alert so that you aren't surprised by unexpected spikes in resource use.
The Deployment spaces page gives you a dashboard that you can use to
monitor and manage production workloads in multiple deployment spaces.
This page makes it easier
for Operations Engineers to manage jobs and online deployments, regardless of where they are
running. The dashboard helps you assess the status of workloads, identify issues, and manage
workloads. You can use this page to:
Compare jobs.
Identify issues as they surface.
Accelerate problem resolution.
Common core services This feature is available only when
the Cloud Pak for Data common core services are installed. The common core services are automatically installed by services
that rely on them. If you don't see the Deployment spaces page, it's because
none of the services that are installed on your environment rely on the common core services.
Store secrets in a secure vault
Cloud Pak for Data introduces a new set of APIs
that you can use to protect access to sensitive data. You can create a vault that you can use to
store secrets.
The Cloud Pak for Data navigation menu is
organized to focus on the objects that you need to access, such as:
Projects
Catalogs
Data
Services
Your task inbox
The items in the navigation depend on the services that are installed.
Manage connections more easily
The Connections page makes it easier for administrators to define
and manage connections and for users to find connections.
The Connections
page is a catalog of connections that can be used by various services across the platform. Any user
who has access to the platform can see the connections on this page. However, only users with the
credentials for the underlying data source can use a connection.
Users who have the Admin role on the connections
catalog can create and manage these connections. Unlike in previous releases of Cloud Pak for Data, services can refer to these connections
rather than creating local copies. This means that any changes you make on the
Connections page are automatically cascaded to the services that use the
connection.
Common core services This feature is available only when
the Cloud Pak for Data common core services are installed. The common core services are automatically installed by services
that rely on them. If you don't see the Connections page, it's because none
of the services that are installed on your environment rely on the common core services.
You can use workflows to manage your business processes. For example, when you install
Watson Knowledge
Catalog, the service includes
predefined workflow templates that you can use to control the process of creating, updating, and
deleting governance artifacts.
From the Workflow management page, you can
define and configure the types of workflows that you need to support your business
processes.
You can import and configure BPMN files from
Flowable.
Service The feature is available only if
Watson Knowledge
Catalog is installed.
In Cloud Pak for Data Version 3.5.0, you can
connect to storage volumes from the Connections page or from services that
support storage volume connections.
The storage volumes can be on external Network File System
(NFS) storage or persistent volume claims (PVCs). This feature lets you access the files that are
stored in these volumes from Jupyter Notebooks, Spark jobs, projects, and more. For details, see
Connecting to data sources.
You can also create and manage volumes
from the Storage volumes page. For more information, see Managing storage
volumes.
Improved backup and restore process
The backup and restore utility can now call hooks provided by Cloud Pak for Data services to perform the quiesce operation.
Quiesce hooks offer optimizations and other enhancements compared to scaling down all Kubernetes resources. For example, services can be
quiesced and unquiesced in a specific order, or suspended without bringing down their pods, which
reduces the time that it takes to bring applications down and back up. For more information,
see Backing up the file system to a local
repository or object store.
Audit service enhancements
The Audit Logging Service in Cloud Pak for Data
now supports increased events monitoring in the zen-audit-config
configmap.
If you updated the zen-audit-config configmap to forward auditable
events to an external security information and event management (SIEM) solution using the Cloud Pak for Data Audit Logging Service, you must update the
zen-audit-config configmap to continue forwarding auditable
events.
From:
<match export export.**>
To:
<match export export.** records records.** syslog syslog.**>
A Cloud Pak for Data administrator can configure
the idle web session timeout in accordance with your security and compliance requirements. If a user
leaves their session idle in a web browser for the specified length of time, the user is
automatically logged out of the web client.
Auditing assets with IBM Guardium®
The method for integrating with IBM Guardium has changed. IBM Guardium is no longer available as an option from
the Connections page. Instead, you can connect to your IBM Guardium appliances from the Platform
configuration page.
Common core services can be installed once
and used by multiple services. The common core services support:
Connections
Deployment management
Job management
Notifications
Search
Projects
Metadata repositories
The common core services are automatically
installed by services that rely on them. If you don't see these features in the web client, it's
because none of the services that are installed on your environment rely on
the common core services.
New cpd-cli commands
You can use the Cloud Pak for Data command line
interface to:
Manage service instances
Back up and restore the project where Cloud Pak for Data is deployed
Export and import Cloud Pak for Data metadata
Use your Cloud Pak for Data credentials to
authenticate to a data source
Some data sources now allow you to use your Cloud Pak for Data credentials for authentication. After you log in to
Cloud Pak for Data, you don't need to enter separate credentials for the
data source connection. If you change your Cloud Pak for Data password, you don't need to change the password
for each data source connection. Data sources that support Cloud Pak for Data credentials have the selection Use
your Cloud Pak for Data credentials to authenticate to
the data source on the data source connection page. When you add a new connection to a
project, the selection is available under Personal credentials.
The
following data sources support Cloud Pak for Data
credentials:
HDFS via Execution Engine for Hadoop *
Hive via Execution Engine for Hadoop *
IBM Cognos® Analytics
IBM Data Virtualization
IBM
Db2®
Storage volume *
* HDFS via Execution Engine for Hadoop, Hive via Execution Engine for Hadoop, and Storage volume support only Cloud Pak for Data credentials.
Service enhancements
The following table lists the new features that are introduced for existing services in Cloud Pak for Data Version 3.5.0:
What's new
What does it mean for me?
Analytics Engine Powered by Apache Spark
Spark 3.0
Analytics Engine Powered by Apache Spark now supports Spark 3.0.
You can select:
The Spark 3.0 template to run Spark jobs or applications that run on your Cloud Pak for Data cluster by using the Spark jobs REST APIs.
A Spark 3 environment to run analytical assets in Watson Studio analytics projects.
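For example, in a notebook that runs in a Spark 3 environment, you can confirm the runtime version before you move existing assets. This is a minimal check, not a required step.
    # Minimal check in a Watson Studio notebook that uses a Spark 3 environment.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    print(spark.version)  # expected to report a 3.0.x version

    # A trivial DataFrame operation to confirm that the session behaves as expected.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.show()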
Data Refinery
Use personal credentials for connections
If you create a connection and select the Personal credentials option,
other users can use that connection only if they supply their own credentials for the data source.
Users who have credentials for the underlying data source can:
Select the connection to create a Data Refinery flow
Edit or change a location when modifying a Data Refinery flow
Automatically detect and convert date and timestamp data types
When you open a file in Data Refinery,
the Convert column type GUI operation is automatically applied as the first
step if it detects any non-string data types in the data. In this release, date and timestamp data
are detected and are automatically converted to inferred data types. You can change the automatic
conversion for selected columns or undo the step. For information about the supported inferred date
and timestamp formats, see the FREQUENTLY USED category in Convert
column type in GUI operations in Data Refinery.
Change the decimal and thousands grouping symbols in all applicable columns
When you use the Convert column type GUI operation to detect and convert
the data types for all the columns in a data asset, you can now also choose the decimal symbol and
the thousands grouping symbol if the data is converted to an Integer data
type or to a Decimal data type. Previously you had to select individual
columns to specify the symbols.
Data Refinery flows are supported in
deployment spaces
You can now promote a Data Refinery
flow from a project to a deployment space. Deployment spaces are used to manage a set of related
assets in a separate environment from your projects. You can promote Data Refinery flows from multiple projects to a
space. You run a job for the Data Refinery
flow in the space and then use the shaped output as input for deployment jobs in Watson Machine Learning.
You can now refine data in files that use the tab-separated-value (TSV) format. TSV files are
read-only.
SJIS encoding available for input and output
SJIS (short for Shift JIS or Shift Japanese Industrial Standards)
encoding is an encoding for the Japanese language. SJIS encoding is supported only for CSV and
delimited files.
You can change the encoding of input files and output files.
To change the encoding of the output (target)
file in Data Refinery, open the Information pane and click the Details tab.
Click the Edit button. In the DATA REFINERY FLOW OUTPUT pane, click the Edit
icon.
New jobs user interface for running and scheduling flows
For more information, see the What's new entry for Watson Studio.
New visualization charts
For more information, see the What's new entry for Watson Studio.
Data Virtualization
Improve query performance by using cache recommendations
If your queries take a long time to run but your data doesn't change constantly, you can cache
the results of queries to make your queries more performant. Data Virtualization analyzes your queries and provides cache
recommendations to improve query performance.
Optimize query performance by using distributed processing
Data Virtualization can determine the optimal
number of worker nodes required to process a query. The number of worker nodes is determined based
on the number of data sources connected to the service, available service resources, and the
estimated size of the query result.
Manage your virtual data by using Data Virtualization APIs
With the Data Virtualization REST API, you can
manage your virtual data, data sources, and user roles. Additionally, you can use the API to
virtualize and publish data to the catalog.
Governance and security enhancements for virtual objects
When Watson Knowledge
Catalog is installed, you can use
policies and data protection rules from Watson Knowledge
Catalog to govern your virtual data. Data asset
owners are now exempt from data protection rules and policy enforcement in Data Virtualization.
You can also publish your virtual
objects to the catalog more easily and efficiently. For example, when you create your virtual
objects by using the Data Virtualization user
interface, your virtual objects are published automatically to the default catalog in Watson Knowledge
Catalog.
Optionally, you can now publish your
virtual objects by using the Data Virtualization REST
APIs.
You can now authenticate to Data Virtualization by
using the same credentials you use for the Cloud Pak for Data platform. Additionally, Data Virtualization now supports authentication by using a JSON
Web Token (JWT).
You can use the cpd-cli scale command to adjust the number of worker nodes that
the Data Virtualization service is running on. When
you scale the service up, the service becomes highly available and its processing
capacity increases.
Monitor the service by using Db2 Data Management Console
You can use the integrated monitoring dashboard to ensure that the Data Virtualization service is working correctly. The
monitoring dashboard is powered by Db2 Data Management
Console. Additionally, the monitoring dashboard provides useful information about databases
connected to Data Virtualization.
You can now connect to the following data sources:
Microsoft Azure Data Lake Store
Amazon Redshift
Unstructured Data
SAP Packs
A license is required to use SAP Packs in Cloud Pak for Data. SAP Packs require the legacy Windows DataStage Client to design the jobs. Jobs can then
be run in Cloud Pak for Data.
User documentation is provided with the license for SAP Packs.
You can now access DataStage from your
projects page. You can create a DataStage
project by following the path Projects > All
Projects, then creating a new project of type Data
transform.
Project creation and deletion are now asynchronous. Previously, the DataStage UI was blocked while a project was being created or
deleted. Now, you see a notification that confirms that the request to create or delete the project was
submitted. The project appears after the creation or deletion process completes successfully.
You can now set up an NFS mount in DataStage pods to
pass data files such as CSV and XML between DataStage
and source or target systems.
You can now use dynamic configuration files without enabling PXRuntime. With this support, the
nodes or pods that are used in a job are chosen dynamically based on the resources that are
available on them when the job runs. Jobs automatically run on the nodes that have the most
resources available, which increases speed and performance.
You can change the resource allocation for the number of CPUs and memory to be used in your
jobs.
Support is provided for SSL/TLS communication over RPC connections by using Nginx as a proxy
server. This support provides greater security for connecting the legacy DataStage Designer client to Cloud Pak for Data. You can then use the Designer client to edit
jobs in Cloud Pak for Data.
You can create custom images to support third-party drivers. Custom images have the benefits of
being unchangeable after they are built and reliably consistent across different environments. You
can also scan the images for vulnerabilities.
You can use a PersistentVolume (PV) to support third-party libraries and drivers.
The Operations Console is enabled for stand-alone DataStage installation on Cloud Pak for Data.
Non-en-US language packs are now supported.
Notification with mailx is supported. Notifications can be sent out by mailx after an activity
completes in a job sequence.
The FileConnector heap size setting and the message handler settings are now persistent and will
not be lost if pods are restarted.
You can now add parameters and parameter sets in the transformer dialog box.
LongVarChar lengths of up to 3,000,000 characters are now supported in the Transformer
stage.
Db2
Deployment with operator
Db2 is now deployed by using an
operator, providing better consistency and predictability and faster deployment times. You can also
deploy multiple Db2 databases on the
same worker node.
Reduced footprint
Db2 consumes fewer resources
than in previous releases. The minimum requirement is now 1.5 VPCs per Db2 database.
Db2 REST and Db2 Graph support
You can set up your Db2 service
so that application programmers can create Representational State Transfer (REST) endpoints that can
be used to interact with Db2 and run
most SQL statements, including DDL and DML. You can also set up the service to use Db2 Graph, so that you can query your
Db2 data to perform graph analytics
without requiring any changes to the underlying database structure.
Run on zLinux
You can deploy Db2 on Red Hat® OpenShift® clusters that run on the zLinux (s390x)
operating system.
Version upgrade
The Db2 service runs Db2 Version 11.5.5.
Storage enhancements
Db2 now supports the following
storage options:
IBM Spectrum®
Scale CSI 2.0
Red Hat OpenShift Container Storage 4.5
Portworx 2.5.5
More backup and restore options
You can back up or restore by using remote storage such as IBM Cloud Object Storage or Amazon S3. Db2 also now offers the option of restoring
an encrypted database.
Security enhancements
You can directly authenticate with the Db2® service by using your Cloud Pak for Data user ID and password. The Db2 service uses Cloud Pak for Data authentication and authorization and supports
TLS certificates. You can also authenticate with JWT tokens and API keys, and you can download the
Db2 SSL certificate directly from
the web console.
Db2 Data Management Console support
You can use the Db2 Data Management Console service on Cloud Pak for Data to administer, monitor, manage, and optimize
the performance of Db2
databases.
When logged on as an administrator, you can now use the Cloud Pak for Data web client to complete the following tasks:
Provision one or more instances of the service after you install the Db2 Big SQL
service. Each instance
can use a different resource configuration, be accessed by different users, or point to a different
Hadoop cluster.
Update an instance configuration. For each instance, you can optionally:
Scale the instance up or down by allocating more or fewer resources.
Scale the instance out or in by adding or removing workers.
Track Db2 Big SQL resource usage at the
instance level.
Monitor the service by using Db2 Data Management
Console
You can use the integrated monitoring dashboard to ensure that the Db2 Big SQL service is working correctly. The
monitoring dashboard is powered by Db2 Data Management
Console.
It's now easier to install and configure Db2 Data Gate with simplified security setup and
certificate generation on z/OS®.
It's also
easier to provision instances of Db2 Data Gate
with a streamlined process.
Improved performance
The Db2 Data Gate service has increased
throughput and lower latency when loading and synchronizing data from Db2 for z/OS to the target database.
Run on zLinux
You can deploy Db2 Data Gate on Red Hat OpenShift clusters that run on the zLinux (s390x)
operating system.
Db2 Event Store
Run on larger clusters
Db2 Event Store can run on Red Hat OpenShift clusters with more than 3 worker nodes for
increased performance and scalability.
Support for new data types
Db2 Event Store now supports the decimal
data type.
Support for Apache Spark 2.4.6
Db2 Event Store supports the Apache Spark 2.4.6 unified analytics engine for big
data processing.
Db2
Warehouse
Deployment with operator
Db2
Warehouse is now
deployed by using an operator, providing better consistency and predictability and faster deployment
times. You can also deploy multiple Db2
Warehouse databases on the same
worker node.
Reduced footprint
Db2
Warehouse consumes fewer
resources than in previous releases. The minimum requirement is now 1.5 VPCs per Db2
Warehouse database.
Db2 REST and Db2 Graph support
You can set up your Db2
Warehouse service so that application
programmers can create Representational State Transfer (REST) endpoints that can be used to interact
with Db2
Warehouse and run most
SQL statements, including DDL and DML. You can also set up the service to use Db2 Graph, so that you can query your
Db2
Warehouse data to perform
graph analytics without requiring any changes to the underlying database structure.
Support for object storage providers (MPP only)
The Db2
Warehouse service in
a massively parallel processing (MPP) configuration can work with data in external tables in cloud
object storage providers such as Amazon S3 and
Microsoft Azure Blob Storage, or any other S3 compatible
storage such as IBM® Cloud Object Storage or MinIO. This option is available for Db2
Warehouse MPP deployments.
Db2 Data Management Console support
You can use the Db2 Data Management Console service on Cloud Pak for Data to administer, monitor, manage, and optimize
the performance of Db2
Warehouse
databases.
Run on zLinux
You can deploy Db2
Warehouse
on Red Hat OpenShift clusters that run on the
zLinux (s390x) operating system.
Version upgrade
The Db2
Warehouse service
runs Db2
Warehouse Version
11.5.5.
Storage enhancements
Db2
Warehouse now supports
the following storage options:
IBM Spectrum
Scale CSI 2.0
Microsoft Azure Blob Storage (object storage)
Amazon S3 Cloud object storage
Red Hat OpenShift Container Storage 4.5
Portworx 2.5.5
More backup and restore options
You can back up or restore by using remote storage such as IBM Cloud Object Storage or Amazon S3. Db2
Warehouse also now offers the option
of restoring an encrypted database.
Security enhancements
You can directly authenticate with the Db2 Warehouse service by using your Cloud Pak for Data user ID and password. The Db2
Warehouse service uses Cloud Pak for Data authentication and authorization and supports
TLS certificates. You can also authenticate with JWT tokens and API keys, and you can download the
Db2
Warehouse SSL certificate
directly from the web console.
Decision Optimization
Overview pane in the model builder
The overview pane provides you with model, data and solution summary information for all your
scenarios at a glance. From this view you can also open an information pane where you can create or
choose your deployment space.
Enhanced Explore solution view in the model builder
The Explore solution view of the model builder shows you more information
about the objectives (or KPIs), solution tables, constraint or bounds relaxations or conflicts,
engine statistics, and log files.
From the Visualization view of the model builder, you can create Gantt
charts for any type of data, where it is meaningful. Gantt charts are no longer restricted to
scheduling models only.
The Decision Optimization model builder now targets the
Python 3.7 runtime when generating notebooks from scenarios. In Watson Machine Learning, the Decision Optimization runtime now runs Python 3.7.
Improved data schema editing in the Modeling Assistant
You can now define data types for table columns and edit data schema when you use the Modeling
Assistant.
Delegation of CPLEX engine solve to Watson Machine Learning
You can now delegate the Decision Optimization solve to
run on Watson Machine Learning from your Java CPLEX or CPO
models.
Language support
The Decision Optimization interface is now translated into
multiple languages.
Execution Engine for Apache Hadoop
Integration with IBM Spectrum Conductor with Spark
clusters
IBM Spectrum Conductor with Spark is now supported. You can
integrate IBM Spectrum Conductor with Spark and Watson Studio by using Jupyter Endpoint Gateway
endpoints. Users can open a notebook in Watson Studio to access Jupyter Endpoint Gateway
instances that are running on IBM Spectrum Conductor with Spark.
For details, see Spectrum environments.
New configurations that allow you to use your own certificates
With these configurations, you can customize DSXHI in the following ways:
Provide a custom Keystore to generate the required .crt.
Provide any custom truststore (CACERTS), where DSXHI certificates will be added.
Provide options to either add the host certificate to the truststore yourself or have DSXHI add
it.
A JSON Web Token to Kerberos delegation token provider is now available. It provides authentication to
HiveServer2, HDFS, and HMS resources. For details, see Using delegation token
endpoints.
Updated versions of Jupyter Endpoint Gateway (2.3) and Knox (1.4) are included.
Improved validation
The system_check.py scripts were introduced to validate your Hadoop configuration.
Guardium External S-TAP®
Improved integration with the Cloud Pak for Data web
client
You can now create and manage your Guardium External S-TAP instances from the Cloud Pak for Data web client.
Support for new target databases
You can use the Guardium External S-TAP to monitor
additional databases. For details, see External S-TAP supported platforms on the IBM Support portal.
Jupyter Notebooks with Python 3.7 for GPU
This service now provides environments for Python 3.7 instead of Python 3.6.
Jupyter Notebooks with R 3.6
Support for loading data from database connections
You can use the insert to code function to load data to a notebook from supported database connections.
A new SPSS Analytic Server connection type is available for SPSS Modeler. With this connection type, you can import and
run SPSS Modeler streams
(.str) that were created in SPSS Modeler classic to run on SPSS Analytic Server. See
Supported data sources for
SPSS Modeler for more information.
Jobs
You can now create and schedule jobs as a way of running SPSS Modeler flows. Click the Jobs icon from the SPSS Modeler toolbar and select Create a
job. See Creating and
scheduling jobs for more information.
New and changed nodes
SPSS Modeler includes the following new and
changed nodes:
CPLEX® Optimization node: With this new node, you
can use complex mathematical (CPLEX) based optimization via an Optimization Programming Language
(OPL) model file.
Kernel Density Estimation (KDE) Simulation node:
This new node uses the Ball Tree or KD Tree algorithms for efficient queries, and walks the line
between unsupervised learning, feature engineering, and data modeling.
Data Asset Export node: This node has been redesigned. Use the node to write to remote
data sources using connections, write to a data file on your local computer, or write data to your
project.
Support for database functions
You can run SPSS Modeler desktop stream files
(STR) that contain database functions.
New visualization charts
For more information, see the What's new entry for Watson Studio.
Deploy Text Analytics models to Watson Machine Learning Server
You can now deploy Text Analytics models to a Watson Machine Learning Server as you can with other model types.
Deployment is the final stage of the lifecycle of a model - making a copy of the model available to
test and use. For example, you can create a deployment for a model so you can submit new data to it
and get a score or prediction back.
Python 3.7
SPSS Modeler now uses Python 3.7.9. Note that
the Python schema has changed, so you may need to review and adjust any Python scripts you use in
SPSS Modeler.
Streams
Application resource customization
You can customize the resources that are used by your application. You can:
Create a custom application image for dependencies, such as software packages or libraries, that
are not included in the default application image.
Customize the resources, such as CPU or memory, that your Streams applications use by creating custom
application resource templates.
Use your favorite Streams
development environment (streamsx Python API, notebooks, Visual Studio Code, or Streams Flows) to build your edge application and
image. For more information, see Developing edge applications with IBM Edge Analytics.
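For orientation, the following streamsx Python API sketch shows the shape of a minimal application that you might build in one of these environments before packaging it for the edge. The application name and data are placeholders, and the submission step is shown only as a comment because its configuration depends on your Streams instance.
    # Minimal streamsx sketch (illustrative only).
    from streamsx.topology.topology import Topology
    from streamsx.topology import context

    topo = Topology("edge_sketch")                 # placeholder application name
    readings = topo.source(range(10))              # stand-in for a real data source
    evens = readings.filter(lambda x: x % 2 == 0)  # simple per-tuple processing
    evens.print()

    # Submission depends on your Streams instance and build configuration, for example:
    # context.submit(context.ContextTypes.DISTRIBUTED, topo, config={...})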
Enhanced Streams standalone applications
Application metrics, such as data tuple counters, operator costs, and user-defined metrics, can
be exposed and the default threading model can be specified for standalone Streams applications.
Edge-aware samples
Explore new application samples designed for the edge.
Streams jobs as Cloud Pak for Data services
This release introduces the ability to enable a Streams job as a Cloud Pak for Data service. A streams-application
service can be used to insert data into and retrieve data from a Streams job. A
streams-application service is created by inserting one or more endpoint operators
into an application and submitting the application to run as a job. Exchanging data with the job is
done by using a REST API. The streams-application service instances are included in
the Services > Instances page of the Cloud Pak for Data web client.
Selecting a service entry in the list opens the REST API documentation for the service.
Additional information is shown on the new Overview page for assets, such
as the asset's path and related assets.
More activities are shown on the Activities page for assets.
COBOL copybook is now a supported asset type. You can preview the contents of copybooks.
You can add more types of assets and metadata to catalogs by coding custom attributes for assets
and custom asset types with APIs.
New connections
Watson Knowledge
Catalog can connect to:
Amazon RDS for MySQL
Amazon RDS for PostgreSQL
Apache Cassandra
Apache Derby
Box
Elasticsearch
HTTP
IBM Data Virtualization Manager for
z/OS
IBM Db2 Event Store
IBM SPSS Analytic Server
MariaDB
Microsoft Azure Blob Storage
Microsoft Azure Cosmos DB
MongoDB
SAP HANA
Storage volume
In addition, the following connection names have changed:
PureData System for Analytics is now called Netezza® (PureData® System for Analytics)
Your previous settings for the
connection remain the same. Only the name for the connection type changed.
New SSL encryption support for connections
The following connections now support SSL encryption in Watson Knowledge
Catalog:
Amazon Redshift
Cloudera Impala
IBM Db2 for z/OS
IBM
Db2 Warehouse
IBM Informix®
IBM Netezza (PureData System for Analytics)
Microsoft Azure SQL
Database
Microsoft SQL Server
Pivotal Greenplum
PostgreSQL
Sybase
Category roles control governance artifacts
The permissions to view and manage all types of governance artifacts, except for data protection
rules, are now controlled by collaborator roles in the categories that are assigned to the
artifacts.
To view or manage governance artifacts, users must meet these conditions:
Have a user role with one of the following permissions:
Access governance artifacts
Manage governance categories
Be a collaborator in a category
Category collaborators have roles with permissions that control whether they can view
artifacts, manage artifacts, manage categories, and manage category collaborators. Subcategories
inherit collaborators from their parent categories. Subcategories can have other collaborators, and
their collaborators can accumulate more roles. The predefined collaborator, All users, includes
everyone with permission to access governance artifacts.
If you upgraded from Cloud Pak for Data Version
3.0.1, the following user permissions are automatically migrated as part of the upgrade:
Users who had the Manage governance categories permission continue to
have that permission and also have the Owner role for all top-level
categories.
Users who had the Manage governance artifacts permission now have the
Access governance artifacts permission, the Editor
role in all categories, and the new Manage data protection rules
permission.
All users now have the Access governance artifacts permission. However,
when you add new users, the Access governance artifacts permission is not
included in all of the predefined roles. It is included in the Administrator,
Data Engineer, Data Steward, and Data
Quality Analyst roles.
All users who were listed as Authors in a governance workflow now have
the Access governance artifacts permission and also the
Editor role in all categories.
Workflows for governance artifacts support categories
Workflow configurations for governance artifacts now require categories to identify the
governance artifacts and users for the workflow:
When you create a new workflow configuration for governance artifacts, you must select either
one category or all categories as part of the triggering condition for the workflow, along with
governance artifact types and events.
You no longer specify artifact authors in a workflow configuration. Artifact authors are all
users who have permission to edit artifacts in a category that is specified in the workflow
configuration.
You now specify one or more of these types of assignees to approve and review artifacts: the
workflow requestor, users with specified roles in the categories for the workflow, users with the
Data Steward role, or selected users.
Watson Knowledge
Catalog includes the following
changes for discovering data:
Automated discovery
The sample size is 1,000 records by default. Changes require specific permissions.
Quick scan
With the improved version, you can perform more scalable data discovery with richer analysis
results that can be published to one or more catalogs directly from the quick scan results.
You can now import reference data sets. When you import a reference data set, you can also
import secondary categories, effective dates, and custom attribute values for most artifacts.
Watson Machine Learning supports the generally available
releases of the Watson Machine Learning V4 REST APIs and the V4
Watson Machine Learning Python client, which give you
programmatic access to all of the current machine learning features.
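For example, a minimal connection with the V4 Python client (the ibm-watson-machine-learning package) looks roughly like the following. The credential values are placeholders, and the exact fields to supply for your deployment are described in the client documentation.
    # Rough sketch of connecting with the V4 Python client. Credential values are placeholders.
    from ibm_watson_machine_learning import APIClient

    wml_credentials = {
        "url": "https://<cluster-url>",
        "username": "<username>",
        "password": "<password>",
        "instance_id": "openshift",
        "version": "3.5",
    }

    client = APIClient(wml_credentials)
    client.spaces.list()  # list the deployment spaces that you can access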
You can use data from Network File System (NFS) to train models and as input data for deployment
jobs. For example, you can use a CSV file from a storage volume as the training data for an AutoAI
model, and use a payload file from the volume to deploy and score the trained model.
View deployment activity across all spaces you can access in a new deployment spaces dashboard.
Use the dashboard to monitor activity for all of your spaces and view visualizations to give you
insights into deployments and jobs.
Tech preview This is a technology preview and is not supported for
use in production environments.
Use federated
learning to train a common model using remote, secure data sets. The data sets are not shared so
full data security is maintained, while the resulting model gets the benefit of the expanded
training.
Tech preview This is a technology preview and is not supported for
use in production environments.
AutoAI
experiments support multiple data sources as input for training an experiment. Use the data join
canvas to combine the data sets based on common columns, or keys, to build a unified data set.
Deploy a data join model using multiple data sets as input for your jobs.
Watson OpenScale analyzes indirect bias, which
occurs when one feature can be used to stand for another. For example, one feature in a model might
approximate another feature that is a protected attribute. Although it is illegal to discriminate
based on race, race can sometimes correlate closely with postal code, which might be the cause of
indirect bias.
When you use resources that are already part of the Cloud Pak for Data cluster, such as Watson Machine Learning, many of the values are supplied for you when
you configure Watson OpenScale.
The Drift Monitor also completes many of the values for you during configuration and setup.
New version of the Python SDK
This release includes a new, more integrated version of the Watson OpenScale Python SDK.
The new Python SDK replaces the
Version 1 SDK, eliminates separate APIs for each monitor, and standardizes many of the classes and
methods used for monitor configuration and subscription to machine learning
providers.
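As a rough illustration, connecting with the new SDK looks like the following. The class names come from the public ibm-watson-openscale and ibm-cloud-sdk-core packages, and the credential values are placeholders.
    # Rough sketch of connecting with the new Watson OpenScale Python SDK.
    # Credential values are placeholders.
    from ibm_cloud_sdk_core.authenticators import CloudPakForDataAuthenticator
    from ibm_watson_openscale import APIClient

    authenticator = CloudPakForDataAuthenticator(
        url="https://<cluster-url>",
        username="<username>",
        password="<password>",
        disable_ssl_verification=True,
    )

    client = APIClient(service_url="https://<cluster-url>", authenticator=authenticator)

    # One client object covers data marts, subscriptions, and monitor configuration.
    client.data_marts.show()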
The Fairness Monitor has undergone extensive redesign based on your feedback! Now you can use
the enhanced charts to determine balanced data and perfect equality at a glance. You can even do
what-if scenarios with scoring in real time.
Model Risk Management notifications
There are several enhancements to Model Risk Management. You can set thresholds for receiving
email notifications of violations. There are enhanced PDF reports. And, when integrated with IBM
OpenPages, you can now set when to send metrics (immediately, daily, or weekly).
Debiasing support for regression models
Along with classification models, Watson OpenScale
now detects bias in regression models. You can both detect and mitigate bias.
Watson Studio
New jobs interface for running and scheduling notebooks and Data Refinery flows
The user interface gives you a unified view of the job information.
You can create the jobs
from more than one location in the web client.
You can use the following visualization charts with Data Refinery and SPSS Modeler:
Evaluation charts
Evaluation charts are combination charts that measure the quality of a binary classifier. You
need three columns for input: actual (target) value, predicted value, and confidence (0 or 1). Move
the slider in the Cutoff chart to dynamically update the other charts. The ROC and other charts are
standard measurements of the classifier.
Math curve charts
Math curve charts display a group of curves based on equations that you enter. You do not use a
data set with this chart. Instead, you use it to compare the results with the data set in another
chart, like the scatter plot chart.
Sunburst charts
Sunburst charts display different depths of hierarchical groups. The Sunburst chart was formerly
an option in the Treemap chart.
Tree charts
Tree charts represent a hierarchy in a tree-like structure. The Tree chart consists of a root
node, line connections called branches that represent the relationships and connections between the
members, and leaf nodes that do not have child nodes. The Tree chart was formerly an option in the
Treemap chart.
In addition, the following connection names have changed:
PureData System for Analytics is now called Netezza (PureData System for Analytics).
Your previous settings for the
connection remain the same. Only the name for the connection type changed.
New SSL encryption support for connections
The following
connections now support SSL encryption in Watson Studio:
Amazon Redshift
Cloudera Impala
IBM Db2 for z/OS
IBM
Db2 Warehouse
IBM Informix
IBM Netezza (PureData System for Analytics)
Microsoft Azure SQL
Database
Microsoft SQL Server
Pivotal Greenplum
PostgreSQL
Sybase
Support for Python 3.7
The default Python environment version in Watson Studio is now Python 3.7.
Python 3.6 is being
deprecated. You can continue to use the Python 3.6 environments; however, you will be notified that
you should move to a Python 3.7 environment.
When you switch from Python 3.6 to Python 3.7,
you might need to update your code if the versions of open source libraries that you use are
different in Python 3.7.
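One quick way to spot such differences is to print the interpreter and library versions in both environments and compare them. The packages listed here are only examples of libraries that are commonly used in notebooks.
    # Compare a Python 3.6 environment with its Python 3.7 replacement by printing
    # the interpreter version and the versions of the libraries that you rely on.
    import sys
    import importlib

    print("Python:", sys.version)

    for name in ("numpy", "pandas", "sklearn"):
        try:
            module = importlib.import_module(name)
            print(name, module.__version__)
        except ImportError:
            print(name, "is not installed in this environment")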
Spark 3.0
You can run analytical assets from Watson Studio analytics projects in a Spark 3
environment.
If you use the Spark Jobs REST APIs, provided by Analytics Engine Powered by Apache Spark, to run Spark jobs or applications on
your Cloud Pak for Data cluster, you can use the Spark
3.0 template.
Notebook execution progress restored
If you accidentally close the browser window while your notebook is still running, or if you are
logged out by the system during a long running job, the notebook will continue running and all
output cells are restored when you open the notebook again. The execution progress of a notebook can
be restored only for notebooks that run in a local kernel. If your notebook runs on a Spark or
Hadoop cluster, and you open the notebook again, any notebook changes that were not saved are
lost.
Use a self-signed certificate to authenticate to enterprise Git repositories
If you want to store your analytics project in an enterprise-grade instance of Git, such as
GitHub Enterprise, and your instance uses a self-signed certificate for authentication, you can
specify the self-signed certificate in PEM format when you add your personal access token to
Cloud Pak for Data.
New services
The following table lists the new services that are introduced in Cloud Pak for Data Version 3.5.0:
Category
Service
Pricing
What does it mean for me?
Data source
Db2 Data Management Console
Included with Cloud Pak for Data
Use Db2 Data Management Console to administer,
monitor, manage, and optimize your integrated Db2 databases, including Db2 Big SQL and Data Virtualization, from a single user interface. The console
helps you improve your productivity by providing a simplified process for managing and maintaining
your complex database ecosystem across Cloud Pak for Data.
The console home page provides an overview of all of the databases that you are monitoring.
The home page includes the status of database connections and monitoring metrics that you can use to
analyze and improve the performance of your databases.
From the console, you can also:
Administer databases
Work with database objects and utilities
Develop and run SQL scripts
Move and load large amounts of data into databases for in-depth analysis
You can use OpenPages to manage
risk and regulatory challenges across your organization. OpenPages is an integrated governance, risk, and
compliance (GRC) suite that can help your organization identify, manage, monitor, and report on risk
and compliance initiatives that span your enterprise. The service provides a powerful, scalable, and
dynamic set of tools that can help you with these initiatives.
Collect, describe, and provide your data according to Oil & Gas industry
standards.
IBM Open Data for Industries provides a toolset
that supports an industry-standard methodology for collecting and describing Oil & Gas data and
serving that data to various applications and services that consume it.
IBM Open Data for Industries provides a reference implementation for a
data platform to integrate silos and simplify access to this data for stakeholders. It standardizes
the data schemas and provides a set of unified APIs for bringing data into Cloud Pak for Data, describing, validating, finding, and
retrieving data elements and their metadata. Effectively, Open Data for Industries becomes a system of record for subsurface and
wells data.
Application developers can use these APIs to create applications that are
directly connected to the stakeholder's data sets. After the application is developed, it requires
minimal or no customization to deploy it for multiple stakeholders that adhere to the same APIs and
data schemas.
In addition, stakeholders can use these APIs to connect their
applications with the platform and take advantage of the seamless data lifecycle in Cloud Pak for Data.
Watson Machine Learning
Accelerator is a deep learning platform
that data scientists can use to optimize training models and monitor deep learning workloads.
Watson Machine Learning
Accelerator can be connected to Watson Machine Learning to take advantage of the multi-tenant resource
plans that manage resource sharing across Watson Machine Learning projects. With this integration, data scientists can use the Watson Machine Learning Experiment
Builder and Watson Machine Learning Accelerator hyperparameter optimization.
You can deploy Cloud Pak for Data Version 3.5 on
the following versions of Red Hat OpenShift:
Version 3.11
Version 4.5
Support for zLinux
You can deploy the following Cloud Pak for Data
software on zLinux (s390x):
The Cloud Pak for Data control plane
Db2
Db2
Warehouse
Db2 for z/OS Connector
Db2 Data Gate
Simplified and updated installation commands
The Cloud Pak for Data command-line interface
uses a simplified syntax. The cpd-Operating_System command is
replaced by the cpd-cli command.
When you download the installation files, you
must select the appropriate package for the operating system where you will run the commands. For
details, see Obtaining the installation files.
Many of the
cpd-cli commands have different syntax. Review the installation documentation
carefully to ensure that you use the correct syntax.
For example:
On air-gapped clusters, the cpd-Operating_System
preloadImages command is now cpd-cli preload-images.
When you run the install or upgrade commands, you specify the
--latest-dependency flag to ensure that the latest prerequisite components are
installed.
Upgrading the Cloud Pak for Data
metadata
Before you can upgrade to Cloud Pak for Data
Version 3.5, you must upgrade the Cloud Pak for Data
metadata by running the cpd-cli operator-upgrade command.
The Cloud Pak for Data control plane requires an
additional service account: cpd-norbac-sa, which is bound to a restricted security
context constraint (SCC).
This service account is specified in the cpd-cli adm
command for the control plane.
If an assembly requires an override for Portworx or OpenShift Container Storage, the assembly includes predefined
override files. The instructions for the assembly will include information on how to install the
service with the appropriate override file for your environment.
Rolling back patches
Whether a patch succeeded or failed, you can now revert a service to the state before the
patch was applied by running the cpd-cli patch rollback command.
Operator-based installation on the Red Hat Marketplace
If you want to install Cloud Pak for Data from
the Red Hat Marketplace, you can use the
Cloud Pak for Data operator. You can use the operator to
install, scale, and upgrade the Cloud Pak for Data
control plane and services using a custom resource (CR).
The operator will be available through
the Red Hat Marketplace and is compatible
with the Red Hat Operator Lifecycle Manager.
This service is deprecated and cannot be deployed on Cloud Pak for Data Version 3.5.
Regulatory Accelerator
This service is deprecated and cannot be deployed on Cloud Pak for Data Version 3.5.
Extracting business terms and governance rules from PDF files
This feature was provided as a technology preview in Watson Knowledge
Catalog and is no longer supported.
Generating terms from assets
This feature was provided as a technology preview in Watson Knowledge
Catalog and is no longer supported.
LDAP group roles
You can no longer map an LDAP group directly to a Cloud Pak for Data role.
Instead, you can create user groups
and add an LDAP group to the user group. When you create a user group, you can assign one or more
roles to the user group.
Previous releases
Looking for information about what we've done in previous releases? See the following topics in
IBM Knowledge Center: