Hardware requirements
Learn about the hardware requirements for a deployment of IBM Cloud Pak® for AIOps on Red Hat® OpenShift® Container Platform. Your hardware must be able to support Red Hat OpenShift, IBM Cloud Pak for AIOps, and your chosen storage solution.
Before you begin
- You cannot change your selected deployment size after installation.
- Multi-zone high availability disaster recovery (HADR) is available as a nonproduction technology preview. For more information, see Installing IBM Cloud Pak for AIOps on a multi-zone architecture (multi-zone HADR).
- A vCPU is a virtual core that is obtained by splitting a physical core of an x86 CPU. It is assumed that each physical core of an x86 CPU can be split into two logical vCPUs.
- If Red Hat OpenShift is installed on VMware virtual machines, set the value of the `sched.cpu.latencySensitivity` parameter to high, as illustrated after this list.
- Persistent storage is also required. For more information, see Storage requirements.
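For VMware deployments, the following sketch shows one possible way to set that parameter with the `govc` CLI. The use of `govc`, the VM path, and the timing of the change are assumptions; consult your VMware administrator or the VMware documentation for the procedure that applies to your environment.

```sh
# Illustrative only: set sched.cpu.latencySensitivity as a VM advanced
# (ExtraConfig) parameter by using the govc CLI. The VM path is a placeholder.
govc vm.change -vm /MyDatacenter/vm/ocp-worker-1 \
  -e "sched.cpu.latencySensitivity=high"
# The change is assumed to take effect at the next power-on; verify with VMware.
```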
You can deploy a starter, production, or custom-sized deployment of IBM Cloud Pak for AIOps. The processing abilities and hardware requirements vary for each deployment size. Review the information in the following sections:
Hardware requirements - Red Hat OpenShift
A Red Hat OpenShift cluster has master nodes and worker nodes. Tables 1 and 2 show the minimum node requirements for Red Hat OpenShift to withstand the demands of starter and production deployments of IBM Cloud Pak for AIOps.
IBM Cloud Pak for AIOps requires your Red Hat OpenShift cluster to have at least three master nodes. IBM Cloud Pak for AIOps has many Kubernetes operators that interface with the API server and `etcd` storage, so master nodes must be adequately sized.
The number of worker nodes that are required varies, and depends on the storage and processing requirements of your IBM Cloud Pak for AIOps deployment. The size of each of your worker nodes can vary, but the combined resources of your worker nodes must comply with the resource totals in Tables 3 and 4. Each worker node must meet the minimum requirements in Table 2 to accommodate the placement of larger IBM Cloud Pak for AIOps pods.
Table 2 also gives a higher recommended vCPU value. When you select the size of your worker nodes, balance cost considerations against the benefits of improved resiliency and easier workload scheduling. Over-allocating resources improves resiliency by ensuring that sufficient resources remain available if a worker node fails, and well-sized worker nodes make the placement of workloads easier. For example, if your largest worker node becomes unavailable, resiliency is improved if your smallest worker node can handle the largest node's workloads. The degree of over-allocation correlates with the extent of failure scenarios that can be accommodated.
Table 1. Master node requirements

| Resource | Starter | Production |
|---|---|---|
| Master node count | 3 | 3 |
| vCPU per node | 4 | 4 |
| Memory per node (GB) | 16 | 16 |
| Disk per node (GB) | 120 | 120 |
Table 2. Worker node requirements

| Resource | Starter | Production |
|---|---|---|
| Minimum vCPU per node | 8 | 16 |
| Recommended vCPU per node | 16 | 16 |
| Memory per node (GB) | 12 | 20 |
| Disk per node (GB) | 120 | 120 |
Note: The numbers that are specified are minimums. For high production workloads, you might need more disk per node (GB) for your workloads.
Hardware requirements - IBM Cloud Pak for AIOps
Tables 3 and 4 show the hardware requirements for starter and production deployments of IBM Cloud Pak for AIOps, for a base deployment and an extended deployment. For more information about the differences between a base deployment and an extended deployment, see Incremental adoption.
The resource requirements are given for the following scenarios:
- Table 3: IBM Cloud Pak for AIOps only - you are deploying IBM Cloud Pak for AIOps on an existing Red Hat OpenShift cluster.
- Table 4: IBM Cloud Pak for AIOps and Red Hat OpenShift - you are creating a new Red Hat OpenShift cluster and deploying IBM Cloud Pak for AIOps onto it.
Table 3. IBM Cloud Pak for AIOps only (deployed on an existing Red Hat OpenShift cluster)

| Resource | Base deployment (Starter) | Extended deployment (Starter) | Base deployment (Production) | Extended deployment (Production) |
|---|---|---|---|---|
| Master node count | | | | |
| Minimum worker node count | 3 | 3 | 6 | 6 |
| Total vCPU | 47 | 55 | 136 | 162 |
| Total memory (GB) | 123 | 136 | 310 | 368 |
Table 4. IBM Cloud Pak for AIOps and Red Hat OpenShift (new cluster)

| Resource | Base deployment (Starter) | Extended deployment (Starter) | Base deployment (Production) | Extended deployment (Production) |
|---|---|---|---|---|
| Master node count | 3 | 3 | 3 | 3 |
| Minimum worker node count | 3 | 3 | 6 | 6 |
| Total vCPU | 59 | 67 | 148 | 174 |
| Total memory (GB) | 171 | 184 | 358 | 416 |
The Red Hat OpenShift master and worker nodes must meet the minimum size requirements in Hardware requirements - Red Hat OpenShift.
Note:
- These values do not include CPU and memory resources for hosting a storage provider, such as Red Hat® OpenShift® Data Foundation or Portworx. Storage providers can require more nodes, more resources, or both, to run. The extra resources that are needed can vary based on your selected storage provider. A general recommendation is for one extra worker node for a starter deployment, and for three extra worker nodes for a production deployment. Consult your storage provider's documentation for exact requirements.
- An additional 1 vCPU and 3 GB of memory are required for each integration that you configure. For example, if you configure two Netcool integrations, you require an additional 2 vCPU and 6 GB of memory.
Warning: Insufficient hardware results in product instability and loss of function. Verify that your hardware is sufficiently sized for your expected workloads. For more information, see Processing abilities.

In addition to the default starter and production deployment sizes, you can choose to deploy a custom-sized deployment of IBM Cloud Pak for AIOps. For more information, see Custom sizing.
Extra hardware requirements for deployments on a multi-zone cluster
If you are installing on a multi-zone cluster, each zone must have extra resources available so that if a zone outage occurs, the nodes in the unaffected zones can take on the workload of the nodes in the failed zone.
For example, in a three-zone cluster, the nodes in any two of the zones must be able to take on the workload of the nodes in a failed zone. Therefore, each zone's nodes must have 50% extra resources available.
If you have more than three zones, you can calculate the resource requirement for the nodes in a zone as follows:
resource_per_zone = (resource / uniqueZones) + (resource / ((uniqueZones - 1) * uniqueZones)) + 2
For example, a production base deployment on a non-multizone cluster requires 136 vCPU. You can work out the vCPU that is required per zone in a four-zone cluster as:
resource_per_zone = (136 / 4) + (136 / ((4 - 1) * 4)) + 2
resource_per_zone = 34 + 11.33 + 2 = 47.33
resource_per_zone = 48 vCPU per zone (rounded up)
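The following Bash sketch automates this calculation. The function name and the round-up behavior are assumptions based on the worked example above.

```sh
# Hypothetical helper: compute the per-zone resource requirement.
# Usage: per_zone <total_resource> <unique_zones>
per_zone() {
  awk -v r="$1" -v z="$2" 'BEGIN {
    v = (r / z) + (r / ((z - 1) * z)) + 2
    v = (v == int(v)) ? v : int(v) + 1   # round up, matching the worked example
    print v
  }'
}

per_zone 136 4   # prints 48
```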
During the installation process, you run a script to verify that you have sufficient hardware before the installation proceeds. For more information, see Verify cluster readiness.
Integrations
When you are configuring an integration in IBM Cloud Pak for AIOps, it is important to consider its performance and footprint. If the selected target system holds large amounts of data, you can expect relatively higher resource usage from collecting data from that system than from a system with less data. Additionally, the size of the IBM Cloud Pak for AIOps installation affects the amount of resources that are available for running integrations. Review the following considerations for each integration category, and the integration installation section, before you configure a new integration.
Integration installation
IBM Cloud Pak for AIOps integrations for metrics, events, and logs are configured on the Integrations UI page, which can be used to create, edit, delete, and track the status of integrations. Integrations run as pods and have minimum and maximum resource allocations in terms of CPU, memory, and storage. Some integrations offer a choice of how to install the integration in the UI, but others offer only one option.
You can use the following two options to install integrations:
- Local: Install the integration in the same cluster and namespace where IBM Cloud Pak for AIOps is installed. The integration’s status is displayed in the Integrations UI page and is automatically managed by IBM Cloud Pak for AIOps.
- Remote: Install the integration anywhere you choose, for example, in a different network region, on SaaS, or on-premises (VM, cluster, or container). After you add the integration, you can use the provided script to run the integration pod with podman.
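The run script itself is generated when you add the integration, so it is not reproduced here. The following generic podman commands can be used on the remote host to confirm that the integration container is running; the container identifier is a placeholder.

```sh
podman --version            # confirm that podman is available on the remote host
podman ps                   # list running containers, including integration pods
podman logs <container-id>  # inspect the output of a running integration container
```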
Regardless of whether an integration is a local or remote installation, the data collected by IBM Cloud Pak for AIOps is stored for some time in Kafka.
Note: Before you set up your first integration, make sure that Kafka storage is correctly configured and scaled for the production environment. If the Kafka persistent volume claims (PVCs) are near capacity, increase each of the PVCs by an extra 60 GB. For more information, see Increasing Kafka PVC.
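As a hedged illustration of that resize, the following commands grow a Kafka PVC by editing its storage request. The namespace, PVC name, and sizes are assumptions; follow Increasing Kafka PVC for the supported procedure, and note that the storage class must support volume expansion.

```sh
# List the Kafka PVCs in the installation namespace (namespace is an assumption).
oc get pvc -n cp4aiops | grep kafka

# Illustrative resize: raise the storage request by 60 GB (here, 50Gi -> 110Gi).
oc patch pvc data-0-iaf-system-kafka-0 -n cp4aiops --type=merge \
  -p '{"spec":{"resources":{"requests":{"storage":"110Gi"}}}}'
```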
Hardware requirements - Integrations
Before you create an integration, for both local and remote installation, make sure that the IBM Cloud Pak for AIOps cluster or the remote environment contains the hardware resources that are required to run the integration.
The minimum resource requirements for the integrations are listed in the following table:
| Integration name | Memory limit (MB) | CPU limit | Ephemeral storage limit (MB) |
|---|---|---|---|
| AppDynamics, New Relic, Splunk | 2500 | 1 | 500 |
| AWS CloudWatch, Dynatrace (metrics only), Infrastructure Management, Generic Webhook, Zabbix | 4096 | 1 | 500 |
| Dynatrace (metrics, events, topology) | 10000 | 4 | 8000 |
| DB2, GitHub, ServiceNow | 800 | 1 | 1000 |
| Email Notifications, IMPACT, Jira | 512 | 1 | 1000 |
| Instana | 4096 | 4 | 1000 |
| Netcool ObjectServer | 4096 | 2 | 2000 |
| Custom, ELK, Falcon LogScale, Mezmo | 1536 | 1 | 1000 |
Notes:
- Dynatrace (metrics, events, topology) supports up to 5 integrations.
- The hardware sizing in the preceding table does not include resource slots for logs integrations or observer jobs. For more information about resource slots for logs integrations, see Performance considerations for logs data collection. For more information about defining observer jobs, see Observer jobs.
For performance considerations for integrations, see the following sections:
Requests and limits
In Kubernetes, requests define the resources that must be provided, and limits define the upper boundary of extra resources that can be allocated. Resource quotas can optionally be used to dictate the allowed totals for all requests and limits in a namespace. If a resource quota is set, then it must be at an adequate level to ensure that workload placement is not inhibited. If you do not deploy resource quotas, then skip this section.
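As a generic Kubernetes illustration (not an IBM Cloud Pak for AIOps manifest), the following snippet shows how requests and limits are declared on a container; a namespace-level ResourceQuota, like the one later in this section, caps the totals of these values.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app          # hypothetical pod for illustration
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest
    resources:
      requests:              # guaranteed by the scheduler
        cpu: 500m
        memory: 256Mi
      limits:                # enforced upper boundary
        cpu: "1"
        memory: 512Mi
```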
The following table shows the requests and limits that the IBM Cloud Pak for AIOps installation namespace requires for a typical starter or production deployment.
| Resource name | IBM Cloud Pak for AIOps installation namespace (Starter) | IBM Cloud Pak for AIOps installation namespace (Production) |
|---|---|---|
| CPU request | 75 | 227 |
| Memory request (Gi) | 195 | 466 |
| Ephemeral storage request (Gi) | 105 | 200 |
| CPU limit | 210 | 436 |
| Memory limit (Gi) | 260 | 533 |
| Ephemeral storage limit (Gi) | 340 | 600 |
Important: Deployment of additional services impacts these values. Review these values regularly to ensure that they still meet your requirements. Extra resources are required for upgrade. For more information, see Ensure cluster readiness.
Use the Red Hat OpenShift command line to create or edit a `ResourceQuota` object with these values. For more information, see Quotas in the Red Hat OpenShift documentation.
The following example YAML creates a `ResourceQuota` object for the IBM Cloud Pak for AIOps installation namespace in a production deployment of IBM Cloud Pak for AIOps:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: aiops-large
  namespace: cp4aiops
spec:
  hard:
    requests.cpu: "227"
    requests.memory: 466Gi
    requests.ephemeral-storage: 200Gi
    limits.cpu: "436"
    limits.memory: 533Gi
    limits.ephemeral-storage: 600Gi
    pods: "275"
```
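Assuming that the YAML is saved as aiops-quota.yaml (a hypothetical file name), you can apply it and confirm the accounted usage as follows:

```sh
# Create or update the quota in the installation namespace.
oc apply -f aiops-quota.yaml

# Show current usage against the hard limits that the quota defines.
oc describe resourcequota aiops-large -n cp4aiops
```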
Processing abilities
The following sections describe the processing abilities of starter and production sized deployments of IBM Cloud Pak for AIOps. Higher rates are supported by custom sizes. For more information about customizing the size of your IBM Cloud Pak for AIOps deployment according to your processing and footprint requirements, see Custom sizing.
Supported resource number and throughput rates for starter and production deployments
The following table details the number of records, log messages, events, Key Performance Indicators (KPIs), and resources that can be processed by IBM Cloud Pak for AIOps for each of the deployment sizes. This includes resource and throughput values for the AI algorithms.
| Component | Resource | Starter | Production |
|---|---|---|---|
| Change risk | Incidents and change request records per second | 30 | 30 |
| Metric anomaly detection | Maximum throughput (KPIs) for all metric integrations | 30,000 | 120,000 |
| Log anomaly detection (non-Kafka integration) | Maximum throughput (log messages per second) for non-Kafka log integrations | 1000 | 8000 |
| Log anomaly detection (Kafka integration) | Maximum throughput (log messages per second) for Kafka log integrations | 1000 | 25,000 |
| Events (through Netcool integration) | Steady state event rate throughput per second | 20 | 150 |
| Events (through Netcool integration) | Burst rate event throughput per second | 100 | 250 |
| Automation runbooks | Fully automated runbooks run per second | 1 | 2 |
| Topology management | Maximum number of topology resources | 200,000 | 5,000,000 |
| UI users | Active users supported | 5 | 20 |
| Standing alert count | Number of alerts stored in the system at any point in time | 20,000 | 200,000 |
Notes:
- Event rates in the preceding table assume a deduplication rate of 10 to 1 (10% unique events). For example, a rate of 100 alerts per second sent to IBM Cloud Pak for AIOps can be the end result of an initial 1,000 events per second before deduplication and other filtering is applied.
- For metric anomaly detection, the number of key performance indicators (KPIs) that can be processed for each deployment size is shown for an aggregation period of 5 minutes and a training period of 2 weeks. IBM Cloud Pak for AIOps can process 120,000 metrics in a 5-minute interval.
- If you are using additional integrations for metric anomaly detection and log anomaly detection with IBM Cloud Pak for AIOps, you can use the default available policies to further refine the volume of data that is routed for issue resolution lifecycle actions by your users. You can also create custom policies that are tailored for your environment. For instance, you can use custom suppression policies to help determine which anomalies are raised as alerts for user action. For more information about custom policies, see Suppress alerts.
- The events (through Netcool integration) throughput rates represent a refined volume of alerts that corresponds to a worst-case scenario where the ratio of IBM Tivoli Netcool/OMNIbus events to IBM Cloud Pak for AIOps alerts has no deduplication, and is essentially a 1:1 mapping of events to alerts. However, in most production deployments, the correlation and deduplication on the IBM Tivoli Netcool/OMNIbus server side reduces the volume of alert data that requires processing within IBM Cloud Pak for AIOps. To further optimize the workload of data that is presented to IBM Cloud Pak for AIOps, additional IBM Tivoli Netcool/OMNIbus probe rules can filter out events of no interest to IBM Cloud Pak for AIOps. For instance, typical IBM Tivoli Netcool/OMNIbus maintenance events are filtered out because they are not relevant on the IBM Cloud Pak for AIOps side.
- The number of alerts for your system varies based on alerts being cleared or expiring. In addition, alerts include a variety of event types, so you might not always see the same alerts when you view the Alert Viewer in the UI.
Important:
- If you are using the File observer for more than 600,000 resources, additional resources are required. For more information, see Configuring the File observer.
- For 200,000 stored alerts, set `IR_UI_MAX_ALERT_FETCH_LIMIT` to a maximum value of 10,000 to avoid performance impacts. For more information, see Restricting the number of alerts returned by the data layer to the Alert Viewer.
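One conceivable way to apply that variable is shown below. The deployment name is a placeholder, and because an operator can reconcile direct edits to managed deployments, follow the linked topic for the supported procedure.

```sh
# Illustrative only: set the fetch limit on the data-layer deployment.
# Replace <datalayer-deployment> with the name given in the linked topic.
oc set env deployment/<datalayer-deployment> \
  IR_UI_MAX_ALERT_FETCH_LIMIT=10000 -n cp4aiops
```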
Event, alert, and incident rates
IBM Cloud Pak for AIOps includes robust capabilities for managing events from your various applications, services, and devices. If you integrate IBM Cloud Pak for AIOps with IBM Tivoli Netcool/OMNIbus, the event management benefits that you can leverage are significantly increased. This integration can give you end-to-end alert processing with an on-premises IBM Tivoli Netcool/OMNIbus server so that you can complete part of the event and incident management lifecycle on the IBM Tivoli Netcool/OMNIbus server before events are processed and delivered for action in IBM Cloud Pak for AIOps.
By default, IBM Tivoli Netcool/OMNIbus policies and triggers, such as correlation and deduplication activities, can execute to "pre-process" event workloads, thereby reducing the overall volume of active events on the IBM Tivoli Netcool/OMNIbus server. This overall volume presents a refined event workload for subsequent processing within the overall incident resolution (IR) lifecycle. On the IBM Cloud Pak for AIOps side, automation policies run on the remaining events that flow from the IBM Tivoli Netcool/OMNIbus server. IBM Cloud Pak for AIOps applies additional suppression and grouping filters to minimize effort, executes runbooks to automatically resolve events where warranted, and promotes the remaining events to alerts and carefully refined incidents so that ITOps can take action on the most critical concerns.
To help you understand the end-to-end event processing benefits of this deployment pattern in your environment, and where to invest in policies to optimize throughput and response time, review the following event management and impact scenarios:
- As a basic example, a small production IBM Tivoli Netcool/OMNIbus environment with an average incoming event rate of 50 events per second, and a correlation and deduplication ratio of 10:1 raw to correlated events (incidents), can result in a refined volume of 5 alerts per second being sent to IBM Cloud Pak for AIOps for subsequent processing. With a combination of default available issue resolution (IR) policies and analytics, the alerts can be further reduced (by 90% noise reduction) to less than 1 incident per second over time on the IBM Cloud Pak for AIOps side.
- As a second, larger example, a production IBM Tivoli Netcool/OMNIbus environment with an average event rate of 500 events per second (with the same correlation and deduplication ratio of 10:1) can in turn present a refined volume of 50 alerts per second to IBM Cloud Pak for AIOps. By using the same combination of default available issue resolution (IR) policies and analytics, the alerts can be further reduced by 90% noise reduction, with a resultant 5 incidents per second raised in IBM Cloud Pak for AIOps. Additional issue resolution (IR) policies can be authored to further reduce and refine incident creation. By leveraging other advanced capabilities within IBM Cloud Pak for AIOps, such as fully automated runbooks, the volume of actionable incidents that are presented for user interaction can be further reduced.
Custom sizing
The default starter and production deployment sizes enable the full capabilities of IBM Cloud Pak for AIOps to be used with the workload volumes that are stated in the Processing abilities section. If different workload volumes are required or resource constraints are an issue, then specific capabilities such as Metric Anomaly Detection, Log Anomaly Detection, and Runbook Automation can be sized accordingly. IBM Sales representatives and Business Partners have access to a custom sizing tool that can assess your runtime requirements and provide a custom profile that scales IBM Cloud Pak for AIOps components. The custom profile is applied when you install IBM Cloud Pak for AIOps. This custom profile cannot be applied after installation, and attempting to do so can break your IBM Cloud Pak for AIOps deployment. If you require custom sizing, contact IBM Sales representatives and Business Partners with details of your intended workloads.
Table 7 shows the processing abilities of some custom-sized deployments of IBM Cloud Pak for AIOps.
Example 1 (Minimum scale) represents a minimally sized deployment of IBM Cloud Pak for AIOps for the evaluation of event management. It demonstrates event analytics, noise reduction, and the automation of issue resolution on a small topology. Metric and log anomaly detection and change risk assessment are de-emphasized. Example 1 requires 49 vCPU and 143 GB memory.
Example 2 (Event management focused) represents a production deployment of IBM Cloud Pak for AIOps which is focused on event management capabilities. It supports event analytics, noise reduction, and the automation of issue resolution on a large topology. Metric and log anomaly detection and change risk assessment are de-emphasized. Example 2 requires 176 vCPU and 366 GB memory.
Maximum scale shows the upper limits that IBM Cloud Pak for AIOps can be scaled to, across all of its capabilities.
Table 7. Processing abilities of example custom-sized deployments

| Component | Resource | Example 1 | Example 2 | Maximum scale |
|---|---|---|---|---|
| Change risk | Incidents and change request records per second | 0 | 0 | N/A |
| Metric anomaly detection | Maximum throughput (KPIs) for all metric integrations | 0 | 0 | 5,000,000 |
| Log anomaly detection | Maximum throughput (log messages per second) for all log integrations | 0 | 0 | 25,000 |
| Events (through Netcool integration) | Steady state / burst event rate throughput per second | 10 | 600 | 700 / 1000 |
| Automation runbooks | Fully automated runbooks run per second | 1 | 2 | 4 |
| Topology management | Maximum number of topology resources | 5,000 | 5,000,000 | 15,000,000 |
| UI users | Active users supported | 5 | 40 | 40 |
Infrastructure Automation
For more information about the hardware requirements for your Infrastructure Automation deployment on Red Hat® OpenShift® Container Platform, see Hardware requirements.