Overview of Cloud Pak for Data

IBM Cloud Pak for Data is a set of services on IBM Software Hub that you can use to accomplish your data governance, data engineering, data analysis, and AI lifecycle tasks. Cloud Pak for Data implements a data fabric solution so that you can provide your organization with instant and secure access to trusted data, automate processes and compliance, and deliver trustworthy AI in your applications.

A data fabric architecture implements active metadata management to automate metadata processing with AI. The outcomes of the metadata analysis facilitate automated data discovery, improve confidence in data, and enable data protection and data governance at scale.

Cloud Pak for Data provides integrated tools for your organization to work with your data to improve your business. Your data engineers need tools to manage, prepare, integrate, and virtualize data. Your data quality analysts need tools to measure the quality of the data. Your governance team needs tools to control, protect, and enrich your data. Your data consumers, such as business analysts and data scientists, need tools to collaboratively develop insights and models.

For more information on the data fabric solution, see Use cases. To try implementing a data fabric, take the data fabric tutorials.


Watch this video to see an overview of Cloud Pak for Data

This video provides a visual way to learn the concepts and tasks in this documentation.


Platform architecture

Cloud Pak for Data includes a set of integrated services on IBM Software Hub. The IBM Software Hub platform has multiple integrated experiences that share services and workspaces. The experiences that you can access depend on which services are installed on your IBM Software Hub cluster. An experience provides focused access to the tools for specific tasks.

The IBM Software Hub platform includes these integrated experiences:

  • watsonx, which contains the Watson Studio, Watson Machine Learning, and IBM watsonx.governance services for building and governing AI solutions.
  • Data Fabric, which contains the watsonx.data intelligence service for preparing and sharing high-quality, trusted data products.
  • watsonx.data, which contains the watsonx.data Premium, watsonx.data intelligence, watsonx.ai, and related services for preparing unstructured data for AI.
  • Cloud Pak for Data, which contains many of the same services as the other experiences but without generative AI or unstructured data processing capabilities.
  • Data Product Hub, which contains the Data Product Hub service for sharing data products without the rest of the Data Fabric capabilities.

Projects are shared between the experiences so that users with different tasks can work together. You can switch between experiences that you have permission to access to use different tools. Users who are collaborating in the same project can work in different experiences. For example, suppose a data engineer and an AI engineer are collaborators in the same project. The data engineer, who is working in the Data Fabric experience, prepares a data asset. The AI engineer, who is working in the watsonx experience, uses the data asset to train a model. See Switching between experiences.

The following illustration shows the architecture of the integrated experiences on the IBM Software Hub platform, the services and capabilities for each experience, and the shared functionality that provides an integrated user experience.

alt=""

Cloud Pak for Data services

Cloud Pak for Data services provide tools for managing data, integrating data, analyzing data and building machine learning models, and governing data. The tools and the hardware and software resources that you have access to depend on which services are installed on your system.

  • Storing and managing data: Database administrators can manage the data sources that are installed on your cluster and create connections to many other types of data sources.

  • Preparing and integrating data: Data engineers can transform data, virtualize data, replicate data, and manage master data.

  • Analyzing data and building models: Data scientists can analyze and visualize data, train machine learning models, and govern AI solutions.

  • Governing data: Data stewards can curate data, manage data quality, protect data, and share data in catalogs.

Common Core Services

Many of the services across the experiences on IBM Software Hub require similar features and interfaces. These features are provided by the IBM Software Hub common core services. The common core services provide data source connections, workspaces such as projects and deployment spaces, job management, notifications, and search.

Connectivity

You can create connections to remote data sources and import connected data. You can configure connections with personal or shared credentials. For a list of supported connectors, see Supported data sources.

You can share connections with others across the platform in the Platform assets catalog.
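
To make this concrete, here is a minimal Python sketch of how a connection might be created outside the UI, assuming the /icp4d-api/v1/authorize and /v2/connections endpoints of the platform REST API and using the requests library. The host name, credentials, project ID, and data source type ID are placeholders, and payload fields can vary by release, so treat this as an outline rather than a definitive reference.

```python
import requests

CPD_URL = "https://cpd.example.com"   # placeholder IBM Software Hub route
PROJECT_ID = "<project-id>"           # placeholder project GUID

# Obtain a bearer token (endpoint and payload are assumptions about the platform REST API).
auth = requests.post(
    f"{CPD_URL}/icp4d-api/v1/authorize",
    json={"username": "<username>", "api_key": "<api-key>"},  # or username and password
    verify=False,  # cluster routes often use self-signed certificates; supply a CA bundle in practice
)
auth.raise_for_status()
headers = {"Authorization": f"Bearer {auth.json()['token']}"}

# Register a PostgreSQL connection in the project with personal credentials.
connection = {
    "name": "warehouse-postgres",
    "datasource_type": "<postgresql-datasource-type-id>",  # placeholder data source type ID
    "properties": {
        "host": "db.example.com",
        "port": "5432",
        "database": "warehouse",
        "username": "<db-user>",
        "password": "<db-password>",
    },
}
response = requests.post(
    f"{CPD_URL}/v2/connections",
    params={"project_id": PROJECT_ID},
    json=connection,
    headers=headers,
    verify=False,
)
response.raise_for_status()
print(response.json())
```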

Administration

Your cluster administrators manage Cloud Pak for Data through IBM Software Hub. Administrators can perform the following types of tasks:

  • Installing, upgrading, or migrating the software
  • Backing up or restoring the software
  • Monitoring the platform
  • Securing the environment
  • Auditing events
  • Forwarding alerts, notifications, and announcements
  • Setting up services
  • Managing resources
  • Managing users

See Administering IBM Software Hub in the IBM Software Hub documentation.

Storage

IBM watsonx and Cloud Pak for Data require a persistent storage solution that is accessible to your Red Hat OpenShift cluster. All the assets that you create with the services on the platform are stored in that persistent storage solution.

See Storage requirements in the IBM Software Hub documentation.

Workspaces and assets

Cloud Pak for Data is organized as a set of collaborative workspaces where you can work with your team or organization. Each workspace has a set of members with roles that provide permissions to perform actions. Most users work with assets, which are the items that users add to the platform. Data assets contain metadata that represents data, while assets that you create in tools, such as data pipelines and models, run code to work with data. The following diagram shows the main workspaces, their purposes, and how assets and other items move around the platform.

The main workspaces are projects, catalogs, deployment spaces, and categories. Assets move between projects, deployment spaces, and catalogs. Governance artifacts are created in categories and are added as metadata to assets in catalogs.

You can work in these types of workspaces in Cloud Pak for Data:

  • Projects
  • Deployment spaces
  • Catalogs
  • Categories
  • Other workspaces for specific services

You can search for assets across all workspaces that you belong to.
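
As a rough sketch of how these workspaces can be enumerated programmatically, the following Python example lists the projects, catalogs, and deployment spaces that the authenticated user can see, reusing the token pattern from the connectivity sketch earlier. The /v2/projects, /v2/catalogs, and /v2/spaces paths and the response shape are assumptions about the platform REST API and can differ between releases.

```python
import requests

CPD_URL = "https://cpd.example.com"            # placeholder IBM Software Hub route
HEADERS = {"Authorization": "Bearer <token>"}  # token from the authorization sketch above


def list_workspaces():
    """Print the workspaces that the current user can see (endpoints assumed)."""
    for label, path in [
        ("Projects", "/v2/projects"),
        ("Catalogs", "/v2/catalogs"),
        ("Deployment spaces", "/v2/spaces"),
    ]:
        resp = requests.get(f"{CPD_URL}{path}", headers=HEADERS, verify=False)
        resp.raise_for_status()
        names = [r.get("entity", {}).get("name") for r in resp.json().get("resources", [])]
        print(f"{label}: {names}")


list_workspaces()
```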

Projects

Projects are where your data science, data engineering, or data curation teams work with data to create assets such as notebooks, dashboards, models, data pipelines, or enriched data assets.

If you have the watsonx experience, your projects appear in both experiences. However, you can view and run only the assets that are valid in the current experience. For example, in the Cloud Pak for Data experience, you can't run inference on a foundation model.

The following image shows what the Overview page of a project might look like.

A project contains assets and collaborators.

Catalogs

Catalogs are where your organization finds and stores high-quality, trusted data and other assets, such as model factsheets. You can find data assets in a catalog and move them into a project to work with the data. Or you can curate data in projects and publish the high-quality data assets to a catalog for others to use. Catalogs require the IBM Knowledge Catalog service.

The following image shows what the Assets page of a catalog might look like.

A catalog contains a view of assets.
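
As an illustration of how a catalog might be queried outside the UI, the following sketch searches a catalog for data assets and prints their names, assuming the /v2/asset_types/{type}/search endpoint of the platform REST API. The catalog ID, query syntax, and response fields are assumptions that may differ in your release; the catalog UI and the IBM Knowledge Catalog documentation remain the authoritative references.

```python
import requests

CPD_URL = "https://cpd.example.com"            # placeholder IBM Software Hub route
CATALOG_ID = "<catalog-id>"                    # placeholder catalog GUID
HEADERS = {"Authorization": "Bearer <token>"}  # token from the authorization sketch above

# Search the catalog for data assets (endpoint and query syntax are assumptions).
resp = requests.post(
    f"{CPD_URL}/v2/asset_types/data_asset/search",
    params={"catalog_id": CATALOG_ID},
    json={"query": "*:*", "limit": 20},
    headers=HEADERS,
    verify=False,
)
resp.raise_for_status()
for asset in resp.json().get("results", []):
    metadata = asset.get("metadata", {})
    print(metadata.get("name"), metadata.get("asset_id"))
```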

Deployment spaces

Deployment spaces are where your ModelOps team deploys models and other deployable assets to production and then tests and manages deployments in production. After you build models and deployable assets in projects, you promote them to deployment spaces.

The following image shows what the Overview page of a deployment space might look like.

A deployment space contains assets and collaborators.
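
To illustrate the promote-then-deploy flow in code, here is a minimal sketch that stores a scikit-learn model in a deployment space and creates an online deployment with the ibm-watson-machine-learning Python client. The credentials, space ID, software specification name, model type string, and version value are placeholders that depend on your cluster and installed runtimes, so verify them against the Watson Machine Learning documentation for your release.

```python
from ibm_watson_machine_learning import APIClient
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a small example model (stands in for a model built in a project).
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

# Placeholder credentials for a Cloud Pak for Data cluster.
credentials = {
    "url": "https://cpd.example.com",
    "username": "<username>",
    "apikey": "<api-key>",
    "instance_id": "openshift",
    "version": "5.0",          # match your installed release
}
client = APIClient(credentials)
client.set.default_space("<space-id>")  # the deployment space to work in

# Store the model in the space; the software spec and type strings depend on your runtimes.
software_spec_id = client.software_specifications.get_id_by_name("runtime-24.1-py3.11")
model_details = client.repository.store_model(
    model=model,
    meta_props={
        client.repository.ModelMetaNames.NAME: "iris-classifier",
        client.repository.ModelMetaNames.TYPE: "scikit-learn_1.3",
        client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: software_spec_id,
    },
)
model_id = client.repository.get_model_id(model_details)

# Create an online deployment and send a test scoring request.
deployment = client.deployments.create(
    model_id,
    meta_props={
        client.deployments.ConfigurationMetaNames.NAME: "iris-classifier-online",
        client.deployments.ConfigurationMetaNames.ONLINE: {},
    },
)
deployment_id = client.deployments.get_id(deployment)
print(client.deployments.score(
    deployment_id,
    {"input_data": [{"values": [[5.1, 3.5, 1.4, 0.2]]}]},
))
```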

Categories

Categories are where your governance team creates and manages governance artifacts that enrich data assets in catalogs. Categories require the IBM Knowledge Catalog service.

The following image shows what a category might look like.

A category contains governance artifacts.

Other workspaces

You can create specialized data assets in other workspaces and move them to projects and catalogs:

  • The Data Virtualization service provides a workspace to virtualize data assets over many data sources.
  • The Match360 service provides a workspace to configure and explore a 360-degree view of customer data.
  • The Data lineage service provides a workspace to configure and explore lineage.

Learn more