Hardware requirements
Learn about the hardware requirements for a deployment of IBM Cloud Pak® for AIOps on Red Hat® OpenShift® Container Platform. Your hardware must be able to support Red Hat OpenShift, IBM Cloud Pak for AIOps, and your chosen storage solution.
Before you begin
- You cannot change your selected deployment size after installation.
- Multi-zone high availability disaster recovery (HADR) is available as a nonproduction technology preview. For more information, see Installing IBM Cloud Pak for AIOps on a multi-zone architecture (multi-zone HADR).
- A vCPU is a virtual core that is created when an x86 CPU splits each of its physical cores into virtual cores. These requirements assume that each physical x86 core provides two logical vCPUs; for example, a 16-core x86 CPU provides 32 vCPUs.
- If Red Hat OpenShift is installed on VMware virtual machines, set the value of the `sched.cpu.latencySensitivity` parameter to high (see the example after this list).
- If you plan to install IBM Cognos® Analytics after you install IBM Cloud Pak for AIOps, you require extra resources. For more information, see Before you begin.
- Persistent storage is also required. For more information, see Storage requirements.
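For reference, a minimal sketch of the VMware setting, assuming that you edit the virtual machine's advanced configuration (`.vmx`) directly; the same parameter can typically also be set in the vSphere client under the VM's advanced configuration parameters:

```
sched.cpu.latencySensitivity = "high"
```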
You can deploy a starter, production, or custom-sized deployment of IBM Cloud Pak for AIOps. The processing abilities and hardware requirements vary for each deployment size. Review the information in the following sections:
Hardware requirements - Red Hat OpenShift
A Red Hat OpenShift cluster has master nodes and worker nodes. Tables 1 and 2 show the minimum node requirements for Red Hat OpenShift to withstand the demands of starter and production deployments of IBM Cloud Pak for AIOps.
IBM Cloud Pak for AIOps requires your Red Hat OpenShift cluster to have at least three master nodes. IBM Cloud Pak for AIOps has many Kubernetes operators that interface with the API server and etcd storage, and so the master nodes must be adequately sized.
The number of worker nodes that are required varies, and depends on the storage and processing requirements of your IBM Cloud Pak for AIOps deployment. The size of each of your worker nodes can vary, but the combined resources of your worker nodes must comply with the resource totals in Tables 3 and 4. Each worker node must meet the minimum requirements in Table 2 to accommodate the placement of larger IBM Cloud Pak for AIOps pods.
A higher recommended vCPU is also given in Table 2. When you are selecting the size of your worker nodes, you must balance cost considerations against the benefits of improved resiliency and the ease of scheduling workloads. Over-allocating resources can improve resiliency by enabling sufficient resources to be available if a worker node fails, and having well-sized worker nodes makes the placement of workloads easier. If your largest worker node becomes unavailable, resiliency is improved if your smallest worker node is able to handle the largest worker node's workloads. The degree of over-allocation of resources is correlated with the extent of failure scenarios that can be accommodated.
Table 1. Red Hat OpenShift master node requirements

| Resource | Starter | Production |
|---|---|---|
| Master node count | 3 | 3 |
| vCPU per node | 4 | 4 |
| Memory per node (GB) | 16 | 16 |
| Disk per node (GB) | 120 | 120 |
Table 2. Red Hat OpenShift worker node requirements

| Resource | Starter | Production |
|---|---|---|
| Minimum vCPU per node | 8 | 16 |
| Recommended vCPU per node | 16 | 16 |
| Memory per node (GB) | 12 | 20 |
| Disk per node (GB) | 120 | 120 |
Note: The specified numbers are minimums. For heavy production workloads, you might need more disk per node (GB) for your workloads.
Hardware requirements - IBM Cloud Pak for AIOps
Tables 3 and 4 show the hardware requirements for starter and production deployments of IBM Cloud Pak for AIOps, for both base and extended deployments. For more information about the differences between a base deployment and an extended deployment, see Incremental adoption.
The resource requirements are given for the following scenarios:
- Table 3: IBM Cloud Pak for AIOps only - you are deploying IBM Cloud Pak for AIOps on an existing Red Hat OpenShift cluster.
- Table 4: IBM Cloud Pak for AIOps and Red Hat OpenShift - you are creating a new Red Hat OpenShift cluster and deploying IBM Cloud Pak for AIOps onto it.
Table 3. Hardware requirements for IBM Cloud Pak for AIOps only

| Resource | Base deployment (Starter) | Extended deployment (Starter) | Base deployment (Production) | Extended deployment (Production) |
|---|---|---|---|---|
| Master node count | N/A | N/A | N/A | N/A |
| Minimum worker node count | 3 | 3 | 6 | 6 |
| Total vCPU | 47 | 55 | 130 | 156 |
| Total memory (GB) | 123 | 136 | 310 | 368 |
Table 4. Hardware requirements for IBM Cloud Pak for AIOps and Red Hat OpenShift

| Resource | Base deployment (Starter) | Extended deployment (Starter) | Base deployment (Production) | Extended deployment (Production) |
|---|---|---|---|---|
| Master node count | 3 | 3 | 3 | 3 |
| Minimum worker node count | 3 | 3 | 6 | 6 |
| Total vCPU | 59 | 67 | 142 | 168 |
| Total memory (GB) | 171 | 184 | 358 | 416 |
The Red Hat OpenShift master and worker nodes must meet the minimum size requirements in Hardware requirements - Red Hat OpenShift.
Important: These values do not include CPU and memory resources for hosting a storage provider, such as Red Hat® OpenShift® Data Foundation or Portworx. Storage providers can require more nodes, more resources, or both, to run. The extra resources that are needed can vary based on your selected storage provider. A general recommendation is for one extra worker node for a starter deployment, and for three extra worker nodes for a production deployment. Consult your storage provider's documentation for exact requirements.
In addition to the default starter and production size deployments, you can choose to deploy a custom-sized deployment of IBM Cloud Pak for AIOps. For more information about customizing the size of your IBM Cloud Pak for AIOps deployment, see Custom sizing.
Requests and limits
In Kubernetes, requests define the resources that are guaranteed to a workload, and limits define the upper boundary of resources that the workload can consume. Resource quotas can optionally be used to dictate the allowed totals for all requests and limits in a namespace. If a resource quota is set, it must be at an adequate level to ensure that workload placement is not inhibited. If you do not deploy resource quotas, skip this section.
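As a generic Kubernetes illustration (not an IBM Cloud Pak for AIOps manifest), requests and limits are declared per container, and a `ResourceQuota` constrains their totals across a namespace:

```yaml
# Generic illustration: the scheduler guarantees the request;
# the limit is a hard cap that the container cannot exceed.
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest  # placeholder image
      resources:
        requests:
          cpu: "1"      # guaranteed CPU (counted against the requests.cpu quota)
          memory: 2Gi   # guaranteed memory
        limits:
          cpu: "2"      # maximum CPU (counted against the limits.cpu quota)
          memory: 4Gi   # maximum memory
```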
The following table shows the requests and limits that the IBM Cloud Pak for AIOps installation namespace requires for a typical starter or production deployment.
Table 5. Requests and limits for the IBM Cloud Pak for AIOps installation namespace

| Resource name | Installation namespace (Starter) | Installation namespace (Production) |
|---|---|---|
| CPU request | 75 | 227 |
| Memory request (Gi) | 195 | 466 |
| Ephemeral storage request (Gi) | 105 | 200 |
| CPU limit | 210 | 436 |
| Memory limit (Gi) | 260 | 533 |
| Ephemeral storage limit (Gi) | 340 | 600 |
Important: Deployment of additional services impacts these values. Review these values regularly to ensure that they still meet your requirements. Extra resources are required for upgrade; for more information, see Ensure cluster readiness.
Use the Red Hat OpenShift command line to create or edit a `ResourceQuota` object with these values. For more information, see Quotas in the Red Hat OpenShift documentation.
The following example YAML creates a `ResourceQuota` object for the IBM Cloud Pak for AIOps installation namespace in a production deployment of IBM Cloud Pak for AIOps:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: aiops-large
  namespace: cp4aiops
spec:
  hard:
    requests.cpu: "227"
    requests.memory: 466Gi
    requests.ephemeral-storage: 200Gi
    limits.cpu: "436"
    limits.memory: 533Gi
    limits.ephemeral-storage: 600Gi
    pods: '275'
```
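For a starter deployment, a corresponding `ResourceQuota` might look like the following sketch, which uses the starter values from Table 5. The object name `aiops-small` is illustrative, and no pod quota is included because a starter pod count is not stated in this section:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: aiops-small     # illustrative name; use any name that suits your conventions
  namespace: cp4aiops
spec:
  hard:
    requests.cpu: "75"
    requests.memory: 195Gi
    requests.ephemeral-storage: 105Gi
    limits.cpu: "210"
    limits.memory: 260Gi
    limits.ephemeral-storage: 340Gi
```

In either case, save the YAML to a file and create the object with `oc apply -f <filename>`.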
Processing abilities
The following sections describe the processing abilities of starter-sized and production-sized deployments of IBM Cloud Pak for AIOps. Higher rates are supported by custom sizes. For more information about customizing the size of your IBM Cloud Pak for AIOps deployment according to your processing and footprint requirements, see Custom sizing.
Supported resource numbers and throughput rates for starter and production deployments
The following table details the number of records, log messages, events, key performance indicators (KPIs), and resources that IBM Cloud Pak for AIOps can process for each deployment size, including resource and throughput values for the AI algorithms.
Table 6. Supported resource numbers and throughput rates

| Component | Resource | Starter | Production |
|---|---|---|---|
| Change risk | Incidents and change request records per second | 30 | 30 |
| Metric anomaly detection | Maximum throughput (KPIs) for all metric integrations | 30,000 | 120,000 |
| Log anomaly detection (non-Kafka integration) | Maximum throughput (log messages per second) for non-Kafka log integrations | 1,000 | 8,000 |
| Log anomaly detection (Kafka integration) | Maximum throughput (log messages per second) for Kafka log integrations | 1,000 | 25,000 |
| Events (through Netcool integration) | Steady state event rate throughput per second | 20 | 150 |
| Events (through Netcool integration) | Burst rate event throughput per second | 100 | 250 |
| Automation runbooks | Fully automated runbooks run per second | 1 | 2 |
| Topology management | Maximum number of topology resources | 200,000 | 5,000,000 |
| UI users | Active users supported | 5 | 20 |
| Standing alert count | Number of alerts stored in the system at any time | 20,000 | 200,000 |
Notes:
- Event rates in the preceding table assume a deduplication rate of 10 to 1 (10% unique events). For example, a rate of 100 alerts per second sent to IBM Cloud Pak for AIOps can be the end result of an initial 1,000 events per second before deduplication and other filtering are applied.
- For metric anomaly detection, the number of key performance indicators (KPIs) that can be processed for each deployment size is shown for an aggregation period of 5 minutes and a training period of 2 weeks. IBM Cloud Pak for AIOps can process 120,000 metrics in a 5-minute interval.
- If you are using additional integrations for metric anomaly detection and log anomaly detection with IBM Cloud Pak for AIOps, you can use default available policies to further refine the volume of data that is routed for issue resolution lifecycle actions by your users. You can also create custom policies that are tailored for your environment. For instance, you can use custom suppression policies to help determine which anomalies should be raised as alerts for user action. For more information about custom policies, see Suppress alerts.
- The events (through Netcool integration) throughput rates represent a refined volume of alerts that corresponds to a worst-case scenario in which the ratio of IBM Tivoli Netcool/OMNIbus events to IBM Cloud Pak for AIOps alerts has no deduplication, and is essentially a 1:1 mapping of events to alerts. However, in most production deployments, the correlation and deduplication on the IBM Tivoli Netcool/OMNIbus server side reduces the volume of alert data that requires processing within IBM Cloud Pak for AIOps. To further optimize the workload of data presented to IBM Cloud Pak for AIOps, additional IBM Tivoli Netcool/OMNIbus probe rules can filter out events of no interest to IBM Cloud Pak for AIOps. For instance, typical IBM Tivoli Netcool/OMNIbus maintenance events are filtered out because they are not relevant on the IBM Cloud Pak for AIOps side.
- The number of alerts for your system varies based on alerts being cleared or expiring. In addition, alerts include a variety of event types, so you might not always see the same alerts when you view the Alert Viewer in the UI.
Important:
- If you are using the File observer for more than 600,000 resources, additional resources are required. For more information, see Configuring the File observer.
- For 200,000 stored alerts, set `IR_UI_MAX_ALERT_FETCH_LIMIT` to a maximum value of 10,000 to avoid performance impacts. For more information, see Restricting the number of alerts returned by the data layer to the Alert Viewer.
Event, alert, and incident rates
IBM Cloud Pak for AIOps includes robust capabilities for managing events from your various applications, services, and devices. If you integrate IBM Cloud Pak for AIOps with IBM Tivoli Netcool/OMNIbus, the benefits that you can leverage for event management are significantly increased. This integration can give you end-to-end alert processing with an on-premises IBM Tivoli Netcool/OMNIbus server, so that you can complete part of the event and incident management lifecycle on the IBM Tivoli Netcool/OMNIbus server before events are processed and delivered for action in IBM Cloud Pak for AIOps.
By default, IBM Tivoli Netcool/OMNIbus policies and triggers, such as correlation and deduplication activities, can run to "pre-process" event workloads, thereby reducing the overall volume of active events on the IBM Tivoli Netcool/OMNIbus server. This presents a refined event workload for subsequent processing within the overall incident resolution (IR) lifecycle. On the IBM Cloud Pak for AIOps side, automation policies run on the remaining events that flow from the IBM Tivoli Netcool/OMNIbus server. IBM Cloud Pak for AIOps applies additional suppression and grouping filters to minimize effort, runs runbooks to automatically resolve events where warranted, and promotes the remaining events to alerts and carefully refined incidents so that ITOps can take action on the most critical concerns.
To help you understand the end-to-end event processing benefits of this deployment pattern in your environment, and where to invest in policies to optimize throughput and response time, review the following event management and impact scenarios:
- As a basic example, a small production IBM Tivoli Netcool/OMNIbus environment with an average incoming event rate of 50 events per second, and a correlation and deduplication ratio of 10:1 raw to correlated events, can result in a refined volume of 5 alerts per second being sent to IBM Cloud Pak for AIOps for subsequent processing. With a combination of default available issue resolution (IR) policies and analytics, the alerts can be further reduced (by 90% noise reduction) to less than 1 incident per second over time on the IBM Cloud Pak for AIOps side.
- As a second, larger example, a production IBM Tivoli Netcool/OMNIbus environment with an average event rate of 500 events per second (with the same correlation and deduplication ratio of 10:1) can in turn present a refined volume of 50 alerts per second to IBM Cloud Pak for AIOps. By using the same combination of default available issue resolution (IR) policies and analytics, the alerts can be further reduced by 90% noise reduction, with a resultant 5 incidents per second raised in IBM Cloud Pak for AIOps. Additional issue resolution (IR) policies can be authored to further reduce and refine incident creation. By leveraging other advanced capabilities within IBM Cloud Pak for AIOps, such as fully automated runbooks, the volume of actionable incidents that are presented for user interaction can be further reduced.
Custom sizing
The default starter and production deployment sizes enable the full capabilities of IBM Cloud Pak for AIOps to be used with the workload volumes that are stated in the Processing abilities section. If different workload volumes are required or resource constraints are an issue, then specific capabilities such as Metric Anomaly Detection, Log Anomaly Detection, and Runbook Automation can be sized accordingly. IBM Sales representatives and Business Partners have access to a custom sizing tool that can assess your runtime requirements and provide a custom profile that scales IBM Cloud Pak for AIOps components. The custom profile is applied when you install IBM Cloud Pak for AIOps.
Table 7 shows the processing abilities of some example custom-sized deployments of IBM Cloud Pak for AIOps.
Example 1 (Minimum scale) represents a minimally sized deployment of IBM Cloud Pak for AIOps for the evaluation of event management. It demonstrates event analytics, noise reduction, and the automation of issue resolution on a small topology. Metric and log anomaly detection and change risk assessment are de-emphasized. Example 1 requires 49 vCPU and 143 GB memory.
Example 2 (Event management focused) represents a production deployment of IBM Cloud Pak for AIOps which is focused on event management capabilities. It supports event analytics, noise reduction, and the automation of issue resolution on a large topology. Metric and log anomaly detection and change risk assessment are de-emphasized. Example 2 requires 176 vCPU and 366 GB memory.
Example 3 (Maximum scale) shows the upper limits that IBM Cloud Pak for AIOps can be scaled to, across all of its capabilities.
Table 7. Processing abilities of example custom-sized deployments

| Component | Resource | Example 1 | Example 2 | Example 3 |
|---|---|---|---|---|
| Change risk | Incidents and change request records per second | 0 | 0 | N/A |
| Metric anomaly detection | Maximum throughput (KPIs) for all metric integrations | 0 | 0 | 5,000,000 |
| Log anomaly detection | Maximum throughput (log messages per second) for all log integrations | 0 | 0 | 25,000 |
| Events (through Netcool integration) | Steady state and burst event rate throughput per second | 10 | 600 | 700 (steady), 1,000 (burst) |
| Automation runbooks | Fully automated runbooks run per second | 1 | 2 | 4 |
| Topology management | Maximum number of topology resources | 5,000 | 5,000,000 | 15,000,000 |
| UI users | Active users supported | 5 | 40 | 40 |
Infrastructure Automation
The following sections detail the minimum resource requirements for installing Infrastructure Automation. Use this information to size your environment when you deploy Infrastructure Automation. Infrastructure Automation can be installed with IBM Cloud Pak for AIOps or stand-alone (Infrastructure Automation only). Multi-region and multi-zone clusters are not supported.
Tip: Work with your IBM Sales representative (or Business Partner) to ensure that the allocated hardware resources for your environment are sufficient to best fit your business needs. For requirements for larger production deployments and to make sure that you meet the right level of resiliency, an IBM representative can work with you on your deployment plans. For Infrastructure Automation, you need to provide the following information:
- The number of cloud environments that you connect to.
- The number of OpenStack environments that you manage.
- The average number of VMs in each of the cloud environments.
- The number of on-premises virtualized environments that you manage. Identify whether it is VMware or KVM, and the number of VMs in each virtualized environment.
Resource allocation for the Red Hat OpenShift Container Platform control plane master nodes
The following table details the minimum requirements for the Red Hat OpenShift control plane (master) nodes that are necessary for a Red Hat OpenShift cluster to withstand the demands of an Infrastructure Automation deployment. IBM Cloud Pak for AIOps has many Kubernetes operators that interface with the API server and etcd storage, and so the master nodes must be adequately sized.
| Node count | vCPU (per node) | Memory (GB) (per node) | vCPU (all nodes) | Memory (GB) (all nodes) |
|---|---|---|---|---|
| 3 | 8 | 16 | 24 | 48 |
Resource allocation for the Red Hat OpenShift Container Platform worker nodes
The following table provides the required Red Hat OpenShift worker resources per node for a sample Infrastructure Automation deployment.
| Node count | vCPU (per node) | Memory (GB) (per node) | vCPU (all nodes) | Memory (GB) (all nodes) | Disk space (GB) | vCPU (extra capacity) | Memory (GB) (extra capacity) |
|---|---|---|---|---|---|---|---|
| 3 | 16 | 32 | 48 | 96 | 212 | 17 | 18 |
The following table lists the hardware requirements for Infrastructure Automation only, and does not include the requirements for the Red Hat OpenShift control plane.
| Resource | Managed Services and Infrastructure Management deployment, no infrastructure management providers (not under load) | Managed Services and Infrastructure Management deployment, 1 infrastructure management provider (not under load) |
|---|---|---|
| Worker node count | 3 | 3 |
| Total vCPUs | 14 | 14 |
| Total memory (GB) | 35 | 38 |
| Persistent storage (Gi) | 70 | 70 |
Notes:
- These estimates are for an Infrastructure Automation installation with both Managed services and Infrastructure Management deployed.
- The requirements for Infrastructure Automation are in addition to the requirements for IBM Cloud Pak for AIOps.
- You might need extra worker nodes, resource capacity, or both, depending on your own requirements. The CPU, memory, and disk resources that are required for Infrastructure Automation depend on the number of provider instances and the total number of instances and VMs that are managed by the providers. Other worker node counts and sizes can be used based on your available hardware configurations, provided that the following conditions are met:
  - The combined available vCPU and memory of the worker nodes must equal or exceed the total requirements.
  - Each worker node must have at least the minimum vCPU and memory that is given in the preceding tables.
- Warning: These requirements do not include CPU and memory resources for hosting a storage provider. Storage providers can require more nodes, resources, or both, to run. The extra resources that are needed can vary based on your selected storage provider. A general recommendation is one extra worker node for a starter deployment, and three extra worker nodes for a production deployment where HA is required. Consult your storage provider's documentation for exact requirements.
Additional requirements for offline (airgap) deployments
If you are installing in an air-gapped environment (offline), you must ensure that you have adequate space to download the Infrastructure Automation images to a local registry in your offline environment. The Infrastructure Automation images total 87 GB.
- Bastion host method - the local registry and the bastion host must each have at least 87 GB of storage space.
- Portable compute device method - the local registry and the portable compute device must each have at least 87 GB of storage space.
- Portable storage device method - the local registry, connected compute device, portable storage device, and local compute device must each have at least 87 GB of storage space.