Architecture for IBM Cloud Pak for Data
IBM Cloud Pak® for Data is a modular platform for running integrated data and AI services. Cloud Pak for Data is composed of integrated microservices that run on a multi-node Red Hat® OpenShift® cluster, which manages resources elastically and runs with minimal downtime.
Cloud-native design
Many companies are embracing cloud concepts because they need reliable, scalable applications. Additionally, companies need to modernize their data workloads to use hardware effectively and efficiently.
By bringing together numerous data and AI services, Cloud Pak for Data enables you to reduce the cost and burden of maintaining multiple applications on disparate hardware. It also gives you the ability to assign resources to workloads as needed and reclaim those resources when not in use.
With a single, managed platform, Cloud Pak for Data makes it easier for your enterprise to adopt modern DevOps practices while simplifying your IT operations and reducing time to value.
Run on OpenShift
Cloud Pak for Data runs on Red Hat OpenShift Container Platform, which means that you can deploy the platform on:
- An on-premises, private cloud cluster
- Any public cloud infrastructure that supports Red Hat OpenShift
For specific information about supported Red Hat OpenShift versions and installation types, see System requirements for Cloud Pak for Data.
Cloud Pak for Data uses the Kubernetes orchestration layer in Red Hat OpenShift for container management.
Cluster architecture
Cloud Pak for Data is deployed on a multi-node cluster. Although you can deploy Cloud Pak for Data on a 3-node cluster for development or proof of concept environments, it is strongly recommended that you deploy your production environment on a larger, highly available cluster with multiple dedicated master and worker nodes. This configuration provides better performance, better cluster stability, and increased ease of scaling the cluster to support workload growth. The specific requirements for a production-level cluster are identified in System requirements.
In a production-level cluster, there are three master + infrastructure nodes and three or more worker nodes. Using dedicated worker nodes means that resources on those nodes are used only for application workloads, which improves the performance of the cluster.
A 6-node cluster is also easier to expand because each node has a specific role in the cluster. If you expand a 3-node cluster, the resulting cluster has a mix of dedicated nodes and mixed-use nodes, which can cause some of the same issues that occur in a 3-node cluster. When you expand a 6-node cluster, each node retains a dedicated purpose, which simplifies cluster management and workload management.
The load balancer can be either in the cluster or external to the cluster. However, in a production-level cluster, an enterprise-grade external load balancer is strongly recommended. The load balancer distributes requests between the three master + infrastructure nodes. The master nodes schedule workloads on the worker nodes that are available in the cluster. A production-level cluster must have at least 3 worker nodes, but you might need to deploy additional worker nodes to support your workload.
This topology is based on the minimum recommended requirements for a production-level cluster. However, you could implement a different topology. For example, you might want to separate the master nodes and the infrastructure nodes. Refer to the Red Hat OpenShift Container Platform documentation for other supported cluster configurations.
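For example, on a running cluster you can verify how roles are distributed across nodes with the OpenShift CLI. This is a minimal sketch; the exact role labels vary by OpenShift version, and a combined master + infrastructure node carries both labels:

```
# List nodes that hold the master (control plane) role
oc get nodes -l node-role.kubernetes.io/master

# List nodes that hold the infrastructure role
oc get nodes -l node-role.kubernetes.io/infra

# List the dedicated worker nodes that run application workloads
oc get nodes -l node-role.kubernetes.io/worker
```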
Storage architecture
Cloud Pak for Data supports NFS, Red Hat OpenShift Data Foundation, IBM Storage Scale Container Native, Portworx, and IBM® Cloud File Storage.
- NFS storage
- If you are using NFS storage, you can use either of the following configurations (see the sketch after this list):
- You can use an external NFS server. In this configuration, you must have a sufficiently fast network connection to reduce latency and ensure performance.
- You can install NFS on a dedicated node in the same VLAN as the cluster.
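For illustration, the external NFS server configuration can be backed by a statically provisioned Kubernetes PersistentVolume such as the following sketch. The volume name, capacity, server address, and export path are placeholders; many deployments instead use a dynamic NFS provisioner that creates volumes on demand:

```
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: cpd-nfs-pv               # placeholder volume name
spec:
  capacity:
    storage: 200Gi               # placeholder capacity
  accessModes:
    - ReadWriteMany              # shared access across pods
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.0.2.10           # placeholder NFS server address
    path: /exports/cpd           # placeholder export path
EOF
```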
- OpenShift Data Foundation
- If you are using OpenShift Data Foundation, you can use either of the following configurations:
- You can use dedicated storage nodes.
- Your storage nodes can co-exist with your worker nodes.
Because OpenShift Data Foundation uses 3 replicas, it is recommended that you deploy OpenShift Data Foundation across multiples of three nodes, which makes it easier to scale up your storage capacity.
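As a sketch, a workload that needs shared file storage can claim it from OpenShift Data Foundation as follows. The claim name and capacity are placeholders, and ocs-storagecluster-cephfs is the file storage class that an OpenShift Data Foundation deployment typically creates; confirm the class names on your cluster with oc get storageclass:

```
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cpd-shared-pvc                          # placeholder claim name
spec:
  accessModes:
    - ReadWriteMany                             # shared file access
  resources:
    requests:
      storage: 100Gi                            # placeholder capacity
  storageClassName: ocs-storagecluster-cephfs   # typical ODF file storage class
EOF
```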
- IBM Storage Scale Container Native
- IBM Storage Scale Container Native connects to your IBM Storage Scale storage cluster through a remote network mount to provide access to the high-performance General Parallel File System (GPFS). IBM Storage Scale Container Native provides persistent data storage through the IBM Storage Scale Container Storage Interface Driver.
Both IBM Storage Scale Container Native and IBM Storage Scale Container Storage Interface Driver are deployed on the worker nodes of your OpenShift cluster.
- Portworx storage
- If you are using Portworx storage, you must add raw disks on the OpenShift worker nodes that you intend to use for storage. (These can be the same nodes as the worker nodes where you run services.) When you install Portworx on your cluster, the Portworx service automatically takes over those disks and uses them for dynamic storage provisioning.
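For illustration, a StorageClass that provisions replicated Portworx volumes from those disks might look like the following sketch. The class name is a placeholder, and repl (the number of replicas that Portworx keeps of each volume) is shown with a typical value of 3; consult the Portworx documentation for the full set of supported parameters:

```
cat <<EOF | oc apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: portworx-cpd-rwx          # placeholder class name
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "3"                       # keep 3 replicas of each volume
EOF
```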
- IBM Cloud File Storage
- The ibmc-file-gold-gid and ibm-file-custom-gold-gid storage classes are supported. The relative location of the storage is managed by your Red Hat OpenShift deployment on IBM Cloud.
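For example, a claim against one of the supported storage classes might look like the following sketch; the claim name and capacity are placeholders:

```
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cpd-file-pvc                      # placeholder claim name
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi                      # placeholder capacity
  storageClassName: ibmc-file-gold-gid    # supported storage class from the list above
EOF
```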
Modular platform
The foundation of the platform is a lightweight installation called the Cloud Pak for Data control plane. The control plane provides a command-line interface, an administration interface, a services catalog, and the central user experience.
If you plan to install multiple instances of Cloud Pak for Data, you must install the control plane in each project (namespace) where you want to install Cloud Pak for Data. The control plane enables you to coordinate and interact with the services that are deployed in the project.
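For example, to prepare two independent instances, you might create a separate project for each installation before installing the control plane into it. The project names here are placeholders:

```
# Each Cloud Pak for Data instance gets its own project (namespace)
oc new-project cpd-instance-1   # placeholder project name
oc new-project cpd-instance-2   # placeholder project name
```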
Common core services
Several Cloud Pak for Data services require similar features and interfaces. To streamline the platform, these features are provided by the Cloud Pak for Data common core services. These services are installed once in a given project (namespace) and can be used by any service that requires one or more of the features.
The common core services provide data source connections, deployment management, job management, notifications, projects, and search.
The common core services are automatically installed when you install a service that relies on them. If the common core services are already installed in the project (namespace), the service will use the existing installation.
If a particular feature is provided by the common core services, it is indicated in the documentation by the following label:
Common core services
Integrated data and AI services
The services catalog spans the following categories:
- AI
- Analytics
- Dashboards
- Data governance
- Data sources
- Developer tools
- Industry solutions
- Storage
You can select the services that you want to install on the control plane. For more information, see Services and integrations.
For example, if you are concerned with data governance and data science, you might install several AI services and analytics services, data governance services, and developer tools that support the developers and data scientists who are using Cloud Pak for Data. In addition, you might want to deploy an integrated database to store the data science assets that you generate using Cloud Pak for Data.
The number of services that you install on the Cloud Pak for Data control plane and the workloads that you run for each service determine the resources that you need. You must ensure that you have the minimum resources required by each service. However, you should work with your IBM Sales representative to ensure that you have sufficient resources for your expected workloads.
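As a starting point for that sizing exercise, you can check what each node can allocate and what it is currently consuming by using standard OpenShift CLI commands; note that oc adm top requires cluster metrics to be available:

```
# Show allocatable capacity and current requests on each node
oc describe nodes | grep -A 8 "Allocated resources"

# Show live CPU and memory usage per node (requires cluster metrics)
oc adm top nodes
```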