Architecture for IBM Cloud Pak for Data

IBM® Cloud Pak for Data is a modular platform for running integrated data and AI services. Cloud Pak for Data is composed of integrated microservices that run on a multi-node Red Hat® OpenShift® cluster, which manages resources elastically and runs with minimal downtime.

Cloud-native design

Many companies are embracing cloud concepts because they need reliable, scalable applications. Additionally, companies need to modernize their data workloads to use hardware effectively and efficiently.

By bringing together numerous data and AI services, Cloud Pak for Data reduces the cost and burden of maintaining multiple applications on disparate hardware. It also gives you the ability to assign resources to workloads as needed and reclaim those resources when not in use.

With a single, managed platform, Cloud Pak for Data makes it easier for your enterprise to adopt modern DevOps practices while simplifying your IT operations and reducing time to value.

Run on Red Hat OpenShift

Cloud Pak for Data runs on Red Hat OpenShift, which means that you can run Cloud Pak for Data on:

An on-premises, private cloud cluster
Any public cloud infrastructure that supports Red Hat OpenShift

For specific information about supported Red Hat OpenShift versions and installation types, see System requirements for Cloud Pak for Data.

Cloud Pak for Data uses the Kubernetes cluster within Red Hat OpenShift for container management.

Cluster architecture

Cloud Pak for Data is deployed on a multi-node cluster. Although you can deploy Cloud Pak for Data on a 3-node cluster for development or proof of concept environments, it is recommended that you deploy your production environment on a larger, highly available cluster with multiple dedicated master and worker nodes. This configuration provides better performance, better cluster stability, and increased ease of scaling the cluster to support workload growth. The specific requirements for a production-level cluster are identified in System requirements.

In a production-level cluster, there are three master + infrastructure nodes and three or more worker nodes. Using dedicated worker nodes means that resources on those nodes are used only for application workloads, which improves the performance of the cluster.

It is also easier to expand a 6-node cluster because each node has a specific role in the cluster. If you expand a 3-node cluster, the cluster has a mix of dedicated nodes and mixed-use nodes, which can cause some of the same issues that occur in a 3-node cluster. If you expand a 6-node cluster, each node has a dedicated purpose, which simplifies cluster management and workload management.

The following diagram illustrates the typical topology of a production-level cluster.

This illustration depicts the relationship between the nodes in the cluster. The diagram is explained in the subsequent text.

In this example, the load balancer can either be in the cluster or external to the cluster. However, in a production-level cluster, an enterprise-grade external load balancer is recommended. The load balancer distributes requests between the three master + infra nodes. The master nodes schedule workloads on the worker nodes that are available in the cluster. A production-level cluster must have at least three worker nodes, but you might need to deploy extra worker nodes to support your workload.

This topology is based on the minimum recommended requirements for a production-level cluster. However, you can implement a different topology. For example, you might want to separate the master nodes and the infrastructure nodes. Refer to the Red Hat OpenShift Container Platform documentation for other supported cluster configurations.
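The minimum recommended topology that is described in this section can be expressed as a simple check. The following Python sketch is for illustration only and is not part of Cloud Pak for Data. The Node class, the role labels, and the is_production_topology function are hypothetical names, and the check encodes only the minimums that are stated above: three master + infra nodes and at least three dedicated worker nodes.

```python
from dataclasses import dataclass

# Hypothetical role labels that are used only for this illustration.
MASTER_INFRA = "master+infra"
WORKER = "worker"
MIXED = "mixed"


@dataclass(frozen=True)
class Node:
    name: str
    role: str


def is_production_topology(nodes):
    """Return True if the nodes meet the minimums described in this section:
    three master + infra nodes and at least three dedicated worker nodes."""
    masters = sum(1 for n in nodes if n.role == MASTER_INFRA)
    workers = sum(1 for n in nodes if n.role == WORKER)
    return masters == 3 and workers >= 3


# A 6-node cluster in which every node has a dedicated role.
six_node = [Node(f"master-{i}", MASTER_INFRA) for i in range(3)] + [
    Node(f"worker-{i}", WORKER) for i in range(3)
]

# A 3-node cluster in which every node is mixed use.
three_node = [Node(f"node-{i}", MIXED) for i in range(3)]

print(is_production_topology(six_node))
print(is_production_topology(three_node))
```

If you separate the master nodes and the infrastructure nodes, as mentioned above, the role labels and the counts in this sketch would need to change accordingly.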

Operator installation architecture

The way that Cloud Pak for Data software operators are installed on your cluster depends on whether you want to enable the IBM Cloud Pak® for Data platform operator to complete specific tasks (express installation) or whether you want more control over how components are deployed (specialized installation).

Express installations

An express installation requires elevated permissions and does not enforce strict division between Red Hat OpenShift Container Platform projects (Kubernetes namespaces).

In an express installation, the IBM Cloud Pak foundational services operators and the Cloud Pak for Data operators are in the same project. The operators are included in the same operator group and use the same NamespaceScope Operator. Therefore, the settings that you use for IBM Cloud Pak foundational services are also used by the Cloud Pak for Data operators.

Specialized installations

A specialized installation allows a user with project administrator permissions to install the software after a cluster administrator completes the initial cluster setup.

A specialized installation also facilitates strict division between Red Hat OpenShift Container Platform projects (Kubernetes namespaces).

In a specialized installation, the IBM Cloud Pak foundational services operators are installed in the ibm-common-services project and the Cloud Pak for Data operators are installed in a separate project (typically cpd-operators). Each project has a dedicated:

Operator group, which specifies the OwnNamespace installation mode.
NamespaceScope Operator, which allows the operators in the project to manage operators and service workloads in specific projects.

In this way, you can specify different settings for the IBM Cloud Pak foundational services and for the Cloud Pak for Data operators.
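To make the per-project operator group concrete, the following Python sketch builds two OperatorGroup manifests, one for each project in a specialized installation. It assumes the Operator Lifecycle Manager convention that an operator group whose targetNamespaces list contains only its own namespace uses the OwnNamespace installation mode. The operator group names are hypothetical, and the actual installation procedure might create these objects differently.

```python
import json


def operator_group(name, project):
    """Build an OperatorGroup manifest that targets only its own project."""
    return {
        "apiVersion": "operators.coreos.com/v1",
        "kind": "OperatorGroup",
        "metadata": {"name": name, "namespace": project},
        # Listing only the group's own namespace corresponds to OwnNamespace mode.
        "spec": {"targetNamespaces": [project]},
    }


# The project names come from the text. The group names are hypothetical.
groups = [
    operator_group("common-services-group", "ibm-common-services"),
    operator_group("cpd-operators-group", "cpd-operators"),
]

print(json.dumps(groups, indent=2))
```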

Storage architecture

Cloud Pak for Data supports NFS, Portworx, Red Hat OpenShift Container Storage, IBM Spectrum Scale Container Native, and IBM Cloud File Storage.

If possible, choose a storage provider that is supported by all of the services that you plan to install. If that is not possible, your cluster can contain a mix of storage types. However, each service can target only one type of storage.
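One way to find a storage provider that all of your planned services support is to intersect the storage types that each service supports. The following Python sketch shows the idea. The service names and the support matrix are hypothetical placeholders, not actual service requirements; check the documentation for each service for the storage that it supports.

```python
def common_storage(requirements):
    """Return the storage types that every service in `requirements` supports."""
    supported_sets = [set(types) for types in requirements.values()]
    if not supported_sets:
        return set()
    return set.intersection(*supported_sets)


# Hypothetical support matrix, for illustration only.
planned = {
    "service-a": {"NFS", "Portworx", "OpenShift Container Storage"},
    "service-b": {"NFS", "Portworx"},
    "service-c": {"Portworx", "IBM Cloud File Storage"},
}

shared = common_storage(planned)
if shared:
    print("Storage supported by all planned services:", sorted(shared))
else:
    print("No common storage type; the cluster needs a mix of storage types.")
```

An empty result corresponds to the case where the cluster must contain a mix of storage types, with each service targeting one of them.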

NFS storage

If you are using NFS storage, you can use either of the following configurations.

You can use an external NFS server. In this configuration, you must have a sufficiently fast network connection to reduce latency and ensure performance.
You can install NFS on a dedicated node in the same VLAN as the cluster.

OpenShift Container Storage

If you are using OpenShift Container Storage, you can use either of the following configurations.

You can use dedicated storage nodes.
Your storage nodes can co-exist with your worker nodes.

Because OpenShift Container Storage uses three replicas, it is recommended that you deploy OpenShift Container Storage on multiples of three. This makes it easier to scale up your storage capacity.
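The multiples-of-three recommendation amounts to rounding a node count up to the next multiple of three. The following Python sketch shows the arithmetic. The function name is hypothetical, and the sketch does not account for capacity or performance sizing.

```python
import math

# OpenShift Container Storage uses three replicas, as stated above.
REPLICAS = 3


def storage_nodes_to_deploy(desired):
    """Round a desired storage node count up to the next multiple of three."""
    if desired < 1:
        raise ValueError("desired must be at least 1")
    return math.ceil(desired / REPLICAS) * REPLICAS


for desired in (1, 3, 4, 7):
    print(desired, "->", storage_nodes_to_deploy(desired))
```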

IBM Spectrum® Scale Container Native

IBM Spectrum Scale Container Native connects to your Spectrum Scale Storage Cluster through a remote network mount to provide access to the high-performance General Parallel File System (GPFS). IBM Spectrum Scale Container Native provides persistent data storage through the IBM Spectrum Scale Container Storage Interface Driver.

Both IBM Spectrum Scale Container Native and IBM Spectrum Scale Container Storage Interface Driver are deployed on the worker nodes of your OpenShift cluster.

Portworx storage

If you are using Portworx storage, you must add raw disks on the OpenShift worker nodes that you intend to use for storage. (These can be the same nodes as the worker nodes where you run services.) When you install Portworx on your cluster, the Portworx service will take over those disks automatically and use them for dynamic storage provisioning.

IBM Cloud File Storage

The ibmc-file-gold-gid and ibm-file-custom-gold-gid storage classes are supported. The relative location of the storage is managed by your Red Hat OpenShift deployment on IBM Cloud.

For specific requirements and considerations, see:

Modular platform

The platform consists of a light-weight installation that is called the Cloud Pak for Data control plane. The control plane provides a command-line interface, an administration interface, a services catalog, and the central user experience.

Image depicting the components that are available when you install the Cloud Pak for Data control plane. The components are listed in the preceding text.

If you plan to install multiple instances of Cloud Pak for Data, you must install the control plane in each project (namespace) where you want to install Cloud Pak for Data. The control plane enables you to coordinate and interact with the services that are deployed in the project.

Common core services

Several Cloud Pak for Data services require similar features and interfaces. To streamline the platform, these features are provided by the Cloud Pak for Data common core services. These services are installed once in a project (namespace) and can be used by any service that requires one or more of the features.

The common core services provide data source connections, deployment management, job management, notifications, projects, and search.

Image depicting the features provided by the Cloud Pak for Data common core services. The features are listed in the preceding text.

The common core services are automatically installed when you install a service that relies on them. If the common core services are already installed in the project (namespace), the service will use the existing installation.

If a particular feature is provided by the common core services, it is indicated in the documentation by the following label:

Common core services

Integrated data and AI services

The services catalog includes a broad range of offerings from IBM and from third-party vendors. The catalog contains the following types of services:

AI
Analytics
Dashboards
Data governance
Data sources
Developer tools
Industry solutions
Storage

Image depicting the different types of services in the catalog.

You can select the services that you want to install on the control plane. For more information, see Services and integrations.

For example, if you are concerned with data governance and data science, you might install several AI services and analytics services, data governance services, and developer tools that support the developers and data scientists who are using Cloud Pak for Data. In addition, you might want to deploy an integrated database to store the data science assets that you generate using Cloud Pak for Data.

Illustration showing several types of services installed on the Cloud Pak for Data control plane

The number of services that you install on the Cloud Pak for Data control plane and the workloads that you run for each service determine the resources that you need. At a minimum, you must have the resources that each service requires. In addition, work with your IBM Sales representative to ensure that you have sufficient resources for your expected workloads.
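A baseline for the cluster can be estimated by adding up the minimum resources for each component that you plan to install. The following Python sketch shows the calculation. The Minimum class, the component names, and all of the numbers are hypothetical placeholders, not actual requirements; use the values from the system requirements for each service, and treat the total as a lower bound that does not include workload growth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Minimum:
    vcpu: float
    memory_gb: float


def total_minimum(components):
    """Sum the minimum vCPU and memory across all planned components."""
    return Minimum(
        vcpu=sum(m.vcpu for m in components.values()),
        memory_gb=sum(m.memory_gb for m in components.values()),
    )


# Placeholder values for illustration only. These are NOT real requirements.
planned = {
    "control-plane": Minimum(vcpu=4, memory_gb=16),
    "service-a": Minimum(vcpu=8, memory_gb=32),
    "service-b": Minimum(vcpu=2, memory_gb=8),
}

total = total_minimum(planned)
print(f"Minimum baseline: {total.vcpu} vCPU, {total.memory_gb} GB memory")
```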

Support for multitenancy

According to Gartner, multitenancy is:

Multitenancy is a reference to the mode of operation of software where multiple independent instances of one or multiple applications operate in a shared environment. The instances (tenants) are logically isolated, but physically integrated. The degree of logical isolation must be complete, but the degree of physical integration will vary.
https://www.gartner.com/it-glossary/multitenancy

Achieving multitenancy with multiple instances of Cloud Pak for Data (recommended)

In this pattern, you install multiple instances of Cloud Pak for Data on a single Red Hat OpenShift cluster. In this scenario, each instance of Cloud Pak for Data is installed in a separate Red Hat OpenShift project (namespace).

This configuration offers complete logical isolation of each instance of Cloud Pak for Data with limited physical integration between the instances.

When you set up your cluster, a Red Hat OpenShift cluster administrator can create multiple projects (Kubernetes namespaces) to partition your cluster. Within each project, you can assign resource quotas. Each project acts as a virtual cluster with its own security and network policies. In addition to being logically separated, each Cloud Pak for Data deployment can use a different authentication mechanism.
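As an illustration of per-project quotas, the following Python sketch builds a Kubernetes ResourceQuota manifest for each tenant project. The project names and the quota values are hypothetical, and the sketch only prints the manifests; a cluster administrator would still need to review and apply them.

```python
import json


def resource_quota(project, cpu, memory):
    """Build a ResourceQuota manifest that caps CPU and memory for one project."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{project}-quota", "namespace": project},
        "spec": {
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                "limits.cpu": cpu,
                "limits.memory": memory,
            }
        },
    }


# Hypothetical tenant projects and quota values, for illustration only.
tenants = {
    "cpd-dev": ("16", "64Gi"),
    "cpd-prod": ("48", "192Gi"),
}

quotas = [resource_quota(project, cpu, mem) for project, (cpu, mem) in tenants.items()]
print(json.dumps(quotas, indent=2))
```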

This tenancy model addresses the following use cases:

Partitioning your nonproduction environment from your production environment in a continuous integration, continuous delivery (CI/CD) pipeline. In this model, tenants work in discrete, isolated units with a clear separation of duties.
Creating instances for different departments or business units that have distinct roles and responsibilities within your enterprise. In this model, each tenant has their own authentication mechanism, resource quotas, and assets.

This tenancy model also offers several advantages:

You can minimize your overhead costs by deploying multiple instances on the same cluster.
The cluster administrator can establish tenant-specific quality of service characteristics in each instance.
The cluster administrator can assign project administrators to manage an instance of Cloud Pak for Data.
The project administrator can control which services are deployed in the project and can manage the resources that are associated with the project. However, the project administrator does not have access to cluster-level settings and cannot change the resource quotas for their project.

Achieving multitenancy within a single instance of Cloud Pak for Data

In this pattern, you install a single instance of Cloud Pak for Data on your Red Hat OpenShift cluster. The instance uses a single authentication mechanism for all users, and each user is assigned to the appropriate role within the instance.

In this configuration, tenancy occurs at the resource level and users can see only resources that they are given access to. The following types of resources support logical isolation.

Analytics projects

Users must be explicitly added as collaborators to access the contents of a project. In this way, you can enforce logical isolation between projects. For example, you can create analytics projects to support specific teams or departments within your organization.

Analytics deployment spaces

Users must be explicitly added as collaborators to access the contents of an analytics deployment space. In this way, you can enforce logical isolation between deployment spaces.

Services that support service instances

Some services, such as integrated databases, can be deployed multiple times within a single deployment of Cloud Pak for Data. These deployments are called service instances. Users must be given explicit access to a service instance to interact with it. In this way, you can enforce logical isolation between service instances.

For information about services that support service instances, see Multitenancy support.

For an extra layer of isolation, service instances can be deployed to separate projects, called tethered projects.

However, some services do not support service instances. The resources that are associated with those services are available to any users who have access to the service. In some cases, all of the users who have access to the instance of Cloud Pak for Data have access to the service.

This configuration is physically integrated but does not support complete logical isolation. Additionally, you cannot partition the system to isolate tenant workloads or establish tenant-level resource quotas.
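The explicit-access model that is described in this section can be pictured as a membership check on each resource. The following Python sketch is a simplified illustration, not the Cloud Pak for Data access control implementation. The Workspace class stands in for an analytics project, a deployment space, or a service instance, and the user and resource names are hypothetical.

```python
class Workspace:
    """Hypothetical stand-in for a resource that supports logical isolation."""

    def __init__(self, name):
        self.name = name
        self._collaborators = set()

    def add_collaborator(self, user):
        """Explicitly grant a user access to this resource."""
        self._collaborators.add(user)

    def can_access(self, user):
        """A user can see the resource only if they were explicitly added."""
        return user in self._collaborators


marketing = Workspace("marketing-analytics")
finance = Workspace("finance-analytics")

marketing.add_collaborator("alice")
finance.add_collaborator("bob")

print(marketing.can_access("alice"))
print(finance.can_access("alice"))
```

Both resources share the same instance and the same cluster resources, which is why this pattern provides logical isolation between resources but not tenant-level quotas.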

Tethered projects

If you need to isolate a service instance by deploying it to a separate project (namespace), you can create a tethered project for the service instance during installation. The service instance in the tethered project can be managed by Cloud Pak for Data but is otherwise isolated from Cloud Pak for Data and the other services that run in the Cloud Pak for Data project.

You might want to deploy a service instance to a tethered project in the following cases:

You are running a custom application that needs to access a specific service instance, but for security reasons, you don't want the application to access other services that are running in Cloud Pak for Data.
You are running a custom application or service instance that requires specific compute resources or a particular quality of service.

Because the tethered project is logically isolated from the main Cloud Pak for Data project, the tethered project can have its own network policies, security contexts, and quotas.

Restriction: Not all services support tethered projects. For details, see Multitenancy support.

Additionally, if a service supports tethered projects, the documentation for installing the service includes information on how to set up a tethered project before installation.