Planning an installation of IBM Cloud Pak for AIOps on Linux
Learn about the system requirements for an installation of IBM Cloud Pak for AIOps on Linux.
Before you begin
- There is no small starter size option for installing IBM Cloud Pak® for AIOps on Linux®; only a production size deployment is available.
- A vCPU (virtual CPU) is created when an x86 CPU splits each of its physical cores into virtual cores. It is assumed that each physical core of an x86 CPU can be split into two logical vCPUs.
- If Linux is installed on VMware virtual machines, set the value of the sched.cpu.latencySensitivity parameter to high.
- You can deploy a base deployment of IBM Cloud Pak for AIOps that does not have log anomaly or ticket analysis capabilities, or an extended deployment of IBM Cloud Pak for AIOps with these capabilities.
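On VMware, this parameter is set per virtual machine. As a sketch, the corresponding entry in the VM's advanced configuration (the .vmx file) looks like the following; setting it through the vSphere client's advanced VM options achieves the same result.

```
sched.cpu.latencySensitivity = "high"
```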
Review the information in the following sections:
Supported platforms
IBM Cloud Pak for AIOps can be installed on the following platform versions:
- Red Hat® Enterprise Linux® 8.10 only
- Red Hat® Enterprise Linux® 9.4 only
- Ubuntu 24.04 LTS
The hardware architecture for installing IBM Cloud Pak for AIOps must be x86_64 (amd64).
If you want to install IBM Cloud Pak for AIOps on Red Hat® OpenShift® Container Platform or on a cloud platform, see Deploying IBM Cloud Pak for AIOps on OpenShift.
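A quick way to confirm that a node meets the platform and architecture requirements is to inspect /etc/os-release and the machine architecture; a minimal sketch:

```shell
#!/bin/sh
# Print this node's distribution, version, and hardware architecture.
. /etc/os-release
echo "Distribution: ${NAME} ${VERSION_ID}"
echo "Architecture: $(uname -m)"   # must report x86_64
```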
Hardware requirements
Multiple Linux nodes are needed, and IBM Cloud Pak for AIOps is installed on this cluster of nodes. The cluster must be reserved for the sole use of IBM Cloud Pak for AIOps.
Three of these nodes must be control plane nodes. The control plane nodes coordinate the running of IBM Cloud Pak for AIOps across the other nodes, and are the entry point into the product. Worker or agent nodes provide more compute resources to run IBM Cloud Pak for AIOps services. All the nodes can be assigned as agent nodes, but three of the nodes must also be assigned as control plane nodes.
The size of each of your nodes can vary, but each worker node must meet the minimum requirements in Table 1 to accommodate the placement of larger IBM Cloud Pak for AIOps pods, and the combined resources of your nodes must comply with the resource totals in Table 2.
You must ensure that the clocks on your Linux cluster are synchronized. Each node in the cluster must have access to an NTP server to synchronize their clocks. Discrepancies between the clocks on the nodes can cause IBM Cloud Pak for AIOps to experience operational issues.
Warning: Insufficient hardware will result in product instability and loss of function. Verify that your hardware is sufficiently sized for your expected workloads. For more information, see Processing abilities. In addition to the default production size deployment, you can choose to deploy a custom-sized deployment of IBM Cloud Pak for AIOps. For more information, see Custom sizing.
Minimum node requirements
Each of the worker and control plane nodes in your cluster must meet the minimum size requirements in the following table:
Resource | Requirement |
---|---|
vCPU per node | 16 |
Memory per node (GB) | 20 |
Disk per node (GB) | 120 |
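As a quick sanity check, the following sketch reports a node's vCPU, memory, and disk figures against the Table 1 minimums. The use of the root filesystem for the disk check is an assumption; point df at whichever mount backs your storage paths.

```shell
#!/bin/sh
# Report this node's resources against the per-node minimums (16 vCPU, 20 GB memory, 120 GB disk).
vcpu=$(nproc)
mem_gb=$(awk '/MemTotal/ {printf "%d", $2 / 1024 / 1024}' /proc/meminfo)
disk_gb=$(df -BG --output=size / | tail -n 1 | tr -dc '0-9')
echo "vCPU=${vcpu} memory=${mem_gb}GB disk=${disk_gb}GB"
[ "$vcpu" -ge 16 ]     && echo "vCPU: OK"   || echo "vCPU: below minimum"
[ "$mem_gb" -ge 20 ]   && echo "memory: OK" || echo "memory: below minimum"
[ "$disk_gb" -ge 120 ] && echo "disk: OK"   || echo "disk: below minimum"
```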
Overall cluster requirements
Your cluster must meet the overall size requirements in the following table.
Resource | Base deployment requirement | Extended deployment requirement |
---|---|---|
Node count | 9 | 9 |
Total vCPU | 136 | 162 |
Total memory (GB) | 322 | 380 |
Table 2 shows the hardware requirements for a deployment of IBM Cloud Pak for AIOps on a Linux cluster, for a base deployment and an extended deployment. For more information about the differences between a base deployment and an extended deployment, see Incremental adoption.
Important:
- Three of these nodes must be control plane nodes.
- An extra 1 vCPU and 3 GB of memory are required for each integration that you configure. For example, if you configure two Netcool integrations, then you require an additional 2 vCPUs and 6 GB of memory.
- You will also require an additional server for a load balancer. For more information, see Load balancing.
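The per-integration overhead translates into simple arithmetic; for example, a base deployment with two configured integrations (the integration count is illustrative):

```shell
#!/bin/sh
# Total cluster vCPU/memory for a base deployment plus N integrations,
# assuming the Table 2 base figures (136 vCPU, 322 GB) and 1 vCPU / 3 GB per integration.
integrations=2
total_vcpu=$(( 136 + integrations ))
total_mem_gb=$(( 322 + integrations * 3 ))
echo "total vCPU=${total_vcpu}, total memory=${total_mem_gb} GB"
```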
Integrations
When you are configuring an integration in IBM Cloud Pak for AIOps, it is important to consider the performance and footprint. If you have large amounts of data on the selected target system, you can expect relatively higher resource usage from collecting data from that system when compared to a system with less data. Additionally, the sizing of the IBM Cloud Pak for AIOps installation affects the number of resources that are available for running integrations. Review the following considerations for each integration category and the integration installation section before you configure a new integration.
Integration installation
IBM Cloud Pak for AIOps integrations for metrics, events, and logs are configured in the Integrations UI page, which can be used to create, edit, delete, and track the status of integrations. Integrations run as pods and have minimum and maximum resource allocations in terms of CPU, memory and storage. Some integrations offer flexibility in terms of how to install the integration in the UI, but some integrations only offer one option.
You can use the following two options to install integrations:
- Local: Install the integration in the same cluster and namespace where IBM Cloud Pak for AIOps is installed. The integration’s status is displayed in the Integrations UI page and is automatically managed by IBM Cloud Pak for AIOps.
- Remote: Install the integration anywhere you choose, for example, a different network region, on SaaS, or remote on-premises (VM, cluster or container). After adding the integration, you can use the script to run the integration pod using podman.
Regardless of whether an integration is a local or remote installation, the data collected by IBM Cloud Pak for AIOps is stored for some time in Kafka.
Hardware requirements - Integrations
Before you create an integration, whether for local or remote installation, make sure that the IBM Cloud Pak for AIOps cluster or the remote environment contains the hardware resources that are required to run the integration.
The minimum resource requirements for the integrations are listed in the following table:
Integration name | Memory limit (MB) | CPU limit | Ephemeral storage limit (MB) |
---|---|---|---|
AppDynamics, New Relic, Splunk | 2500 | 1 | 500 |
AWS CloudWatch, Dynatrace (metrics-only), Infrastructure Management, Generic Webhook, Zabbix | 4096 | 1 | 500 |
Dynatrace (metrics, events, topology) | 10000 | 4 | 8000 |
DB2, GitHub, ServiceNow | 800 | 1 | 1000 |
Email Notifications, IMPACT, Jira | 512 | 1 | 1000 |
Instana | 4096 | 4 | 1000 |
Netcool ObjectServer | 4096 | 2 | 2000 |
Custom, ELK, Falcon LogScale, Mezmo | 1536 | 1 | 1000 |
Notes:
- Dynatrace (metrics, events, topology) supports up to 5 integrations.
- The hardware sizing in the preceding table does not include resource slots for logs integrations or observer jobs. For more information about resource slots for logs integrations, see Performance considerations for logs data collection. For more information about defining observer jobs, see Observer jobs.
For performance considerations for integrations, see the sections that follow.
Processing abilities
Expand the following sections to find out about the processing abilities of a production deployment of IBM Cloud Pak for AIOps. Higher rates are supported by custom sizes. For more information about customizing the size of your IBM Cloud Pak for AIOps deployment according to your processing and footprint requirements, see Custom sizing.
Supported resource number and throughput rates for production deployments
The following table details the number of records, events, Key Performance Indicators (KPIs), and resources that can be processed by a production deployment of IBM Cloud Pak for AIOps. This includes resource and throughput values for the AI algorithms.
Component | Resource | Production deployment |
---|---|---|
Change risk | Incidents and change request records per second | 30 |
Metric anomaly detection | Maximum throughput (KPIs) for all metric integrations | 120,000 |
Log anomaly detection (non-Kafka integration) | Maximum throughput (log messages per second) for non-Kafka log integrations | 8000 |
Log anomaly detection (Kafka integration) | Maximum throughput (log messages per second) for Kafka log integrations | 25,000 |
Events (through Netcool integration) | Steady state / burst event rate throughput per second | 150 / 250 |
Automation runbooks | Fully automated runbooks run per second | 2 |
Topology management | Maximum number of topology resources | 5,000,000 |
UI users | Active users supported | 20 |
Standing alert count | Number of stored alerts | 200,000 |
Notes:
- Event rates in the preceding table assume a deduplication rate of 10 to 1 (10% unique events). For example, a rate of 100 alerts per second sent to IBM Cloud Pak for AIOps can be the end result of an initial 1,000 alerts per second before deduplication and other filtering is applied.
- For metric anomaly detection, the number of key performance indicators (KPIs) that can be processed for each deployment size is shown, for an aggregation period of 5 minutes and a training period of 4 weeks.
- If you are using additional integrations for metric anomaly detection with IBM Cloud Pak for AIOps, you can use default available policies to further refine the volume of data routed for issue resolution lifecycle actions by your users.
You can also create custom policies tailored for your environment. For instance, you can use custom suppression policies to help determine which anomalies should be raised as alerts for user action. For more information about custom policies, see Suppress alerts.
- The events (through Netcool integration) throughput rates represent a refined volume of alerts that corresponds to a worst-case scenario, in which the ratio of IBM Tivoli Netcool/OMNIbus events to IBM Cloud Pak for AIOps alerts has no deduplication and is essentially a 1:1 mapping of events to alerts. In most production deployments, however, correlation and deduplication on the IBM Tivoli Netcool/OMNIbus server side reduces the volume of alert data that requires processing within IBM Cloud Pak for AIOps. To further optimize the workload of data presented to IBM Cloud Pak for AIOps, additional IBM Tivoli Netcool/OMNIbus probe rules can filter out events of no interest. For instance, typical IBM Tivoli Netcool/OMNIbus maintenance events are filtered out because they are not relevant on the IBM Cloud Pak for AIOps side.
- The number of alerts for your system varies based on alerts being cleared or expiring. In addition, alerts include a variety of event types, so you might not always see the same alerts when you view the Alert Viewer in the UI.
Important:
- If you are using the File observer for more than 600,000 resources, then additional resources are required. For more information, see Configuring the File observer.
- For 200,000 stored alerts, it is recommended to set *IR_UI_MAX_ALERT_FETCH_LIMIT* to a maximum value of 10,000 to avoid performance impacts. For more information, see Restricting the number of alerts returned by the data layer to the Alert Viewer.
Event, alert, and incident rates
IBM Cloud Pak for AIOps includes robust capabilities for managing events from your various applications, services, and devices. If you integrate IBM Cloud Pak for AIOps with IBM Tivoli Netcool/OMNIbus, the event management benefits that you can leverage increase significantly. This integration gives you end-to-end alert processing with an on-premises IBM Tivoli Netcool/OMNIbus server, so that you can complete part of the event and incident management lifecycle on the IBM Tivoli Netcool/OMNIbus server before events are processed and delivered for action in IBM Cloud Pak for AIOps.
By default, IBM Tivoli Netcool/OMNIbus policies and triggers, such as correlation and deduplication activities, can execute to "pre-process" event workloads, thereby reducing the overall volume of active events on the IBM Tivoli Netcool/OMNIbus server. This presents a refined event workload for subsequent processing within the overall incident resolution (IR) lifecycle. On the IBM Cloud Pak for AIOps side, automation policies run on the remaining events that flow from the IBM Tivoli Netcool/OMNIbus server. IBM Cloud Pak for AIOps applies additional suppression and grouping filters to minimize effort, executes runbooks to automatically resolve events where warranted, and promotes the remaining events to alerts and carefully refined incidents for ITOps to take action on the most critical concerns.
To help you understand the end-to-end event processing benefits of this deployment pattern in your environment, and where to invest in policies to optimize throughput and response time, review the following event management and impact scenarios:
- As a basic example, a small production IBM Tivoli Netcool/OMNIbus environment with an average incoming event rate of 50 events per second, with a correlation and deduplication ratio of 10:1 raw to correlated events (incidents), can result in a refined volume of 5 Alerts per second being sent to IBM Cloud Pak for AIOps for subsequent processing. With a combination of default available issue resolution (IR) policies and analytics, the alerts can be further reduced (by 90% noise reduction) to less than 1 Incident per second over time on the IBM Cloud Pak for AIOps side.
- As a secondary, larger example, a Production IBM Tivoli Netcool/OMNIbus environment with an average event rate of 500 events per second (with the same correlation and deduplication ratio of 10:1), can in turn present a refined volume of 50 Alerts per second being sent to IBM Cloud Pak for AIOps. By using the same combination of default available issue resolution (IR) policies and analytics, the alerts can be further reduced by 90% noise reduction, with a resultant 5 Incidents per second raised in IBM Cloud Pak for AIOps. Additional issue resolution (IR) policies can be authored to further reduce and refine Incident creation. By leveraging other advanced capabilities within IBM Cloud Pak for AIOps, such as fully automated Runbooks, the volume of actionable incidents that are presented for user interaction can be further reduced.
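The arithmetic in the second example can be sketched as follows (the rates and ratios are taken from the example above):

```shell
#!/bin/sh
# Refined alert and incident rates from a raw event rate, per the second example:
# 500 events/sec, 10:1 correlation and deduplication, then 90% noise reduction.
raw_eps=500
dedup_ratio=10
noise_reduction_pct=90
alerts_per_sec=$(( raw_eps / dedup_ratio ))
incidents_per_sec=$(( alerts_per_sec * (100 - noise_reduction_pct) / 100 ))
echo "alerts/sec=${alerts_per_sec} incidents/sec=${incidents_per_sec}"
```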
Custom sizing
The default production deployment size enables the full capabilities of IBM Cloud Pak for AIOps to be used with the workload volumes that are stated in the Processing abilities section. If different workload volumes are required or resource constraints are an issue, then specific capabilities such as Metric Anomaly Detection, Log Anomaly Detection, and Runbook Automation can be sized accordingly. IBM Sales representatives and Business Partners have access to a custom sizing tool that can assess your runtime requirements and provide a custom profile that scales IBM Cloud Pak for AIOps components. The custom profile is applied when you install IBM Cloud Pak for AIOps. This custom profile cannot be applied after installation, and attempting to do so can break your IBM Cloud Pak for AIOps deployment. If you require custom sizing, contact IBM Sales representatives and Business Partners with details of your intended workloads.
The following table shows the processing abilities of some custom-sized deployments of IBM Cloud Pak for AIOps.
Example 1 (Minimum scale) represents a minimally sized deployment of IBM Cloud Pak for AIOps for the evaluation of event management. It demonstrates event analytics, noise reduction, and the automation of issue resolution on a small topology. Metric and log anomaly detection and change risk assessment are de-emphasized. Example 1 requires 49.36 vCPU and 155 GB of memory to run on a 9-node Linux cluster; this includes an overhead of 12 GB of memory and 0.36 of a vCPU. If there are more than 9 nodes in the cluster, then extra resources are required as follows: 0.06 of a core and 3 GB of memory for each control plane node, and 0.03 of a core and 0.5 GB of memory for each worker node.
Example 2 (Event management focused) represents a production deployment of IBM Cloud Pak for AIOps that is focused on event management capabilities. It supports event analytics, noise reduction, and the automation of issue resolution on a large topology. Metric and log anomaly detection and change risk assessment are de-emphasized. Example 2 requires 176.36 vCPU and 378 GB of memory to run on a 9-node Linux cluster; this includes an overhead of 12 GB of memory and 0.36 of a vCPU. If there are more than 9 nodes in the cluster, then extra resources are required as follows: 0.06 of a core and 3 GB of memory for each control plane node, and 0.03 of a core and 0.5 GB of memory for each worker node.
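The extra-node arithmetic can be sketched as follows, using the Example 1 figures with a hypothetical 11-node cluster (3 control plane nodes and 8 workers, so 2 worker nodes beyond the 9-node baseline):

```shell
#!/bin/sh
# Example 1 baseline (49.36 vCPU, 155 GB) plus per-node overhead for nodes beyond 9:
# 0.06 vCPU / 3 GB per extra control plane node, 0.03 vCPU / 0.5 GB per extra worker node.
extra_cp=0
extra_workers=2
awk -v cp="$extra_cp" -v w="$extra_workers" 'BEGIN {
  vcpu = 49.36 + cp * 0.06 + w * 0.03
  mem  = 155   + cp * 3    + w * 0.5
  printf "total vCPU=%.2f, total memory=%.1f GB\n", vcpu, mem
}'
```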
Maximum scale shows the upper limits that IBM Cloud Pak for AIOps can be scaled to, across all of its capabilities.
Component | Resource | Example 1 | Example 2 | Maximum scale |
---|---|---|---|---|
Change risk | Incidents and change request records per second | 0 | 0 | N/A |
Metric anomaly detection | Maximum throughput (KPIs) for all metric integrations | 0 | 0 | 5,000,000 |
Log anomaly detection | Maximum throughput (log messages per second) for all log integrations | 0 | 0 | 25,000 |
Events (through Netcool integration) | Steady state / burst event rate throughput per second | 10 | 600 | 700 / 1000 |
Automation runbooks | Fully automated runbooks run per second | 1 | 2 | 4 |
Topology management | Maximum number of topology resources | 5,000 | 5,000,000 | 15,000,000 |
UI users | Active users supported | 5 | 40 | 40 |
Storage requirements
IBM Cloud Pak for AIOps deployments on Linux have the following persistent local storage requirements:
Application storage - control plane nodes and worker nodes
Storage type | Total requirement | Minimum per node | Default storage path |
---|---|---|---|
Application storage | 3000 GB | 250 GB | /var/lib/aiops/storage |
Platform storage - control-plane nodes only
Storage type | Total requirement | Minimum per node | Default storage path |
---|---|---|---|
Platform storage | 100 GB | 25 GB | /var/lib/aiops/platform |
To meet the local storage requirements, you must configure distributed storage across the nodes in your Linux cluster, and dedicate specific disks and logical volumes for application and platform storage. Follow the instructions in Configuring local volumes. The default storage paths can be configured as described in the installation instructions.
Important: The file system used by MinIO must be XFS, not ext4. The ext4 file system has a limit on the number of inodes that can be created for each file system. If inode usage reaches 100%, the file system becomes read-only even if enough space is available, and MinIO is prevented from creating new files or directories. Refer to your storage provider's documentation for information about setting XFS as the file system.
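Before you point MinIO at a volume, a quick check of the filesystem type can catch an ext4 volume early; a minimal sketch (the path is the default application storage path, with a fallback to the root filesystem purely for illustration):

```shell
#!/bin/sh
# Report the filesystem type backing a path; "xfs" is required for MinIO storage.
path=/var/lib/aiops/storage
fstype=$(stat -f -c %T "$path" 2>/dev/null || stat -f -c %T /)
echo "filesystem type: ${fstype}"
```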
Additional requirements for offline deployments
If you are installing in an air-gapped environment (offline), you must also ensure that you have adequate space to download the IBM Cloud Pak for AIOps images to the target registry in your offline environment. The IBM Cloud Pak for AIOps images total 186 GB.
Storage performance requirements
Each node of the storage solution requires a minimum of one disk (SSD or high-performance storage array). The performance of your storage can vary depending on your exact usage, datasets, hardware, storage solution, and more.
The following table specifies the storage performance metrics that must be achieved to support a deployment of IBM Cloud Pak for AIOps. Storage is accessed from multiple nodes due to the distributed nature of IBM Cloud Pak for AIOps workloads, and the total IOPS across all nodes using storage must meet the minimum sequential IOPS requirements. If your deployment is custom-sized to support higher rates than the default production rates listed in Processing abilities, then your storage performance must exceed these metrics.
Metric | Read | Write |
---|---|---|
Minimum sequential IOPS (higher is better, lower is worse) | 5000 | 5000 |
Minimum sequential bandwidth (higher is better, lower is worse) | 20 MiB/sec | 20 MiB/sec |
Maximum average sequential latency (lower is better, higher is worse) | 500 usec | 1000 usec |
Using a network storage system typically entails higher performance requirements on the disks due to factors such as replication and network latency. Performance at the application layer can be tested after the cluster is provisioned. A benchmarking tool is supplied that can be used to compare your storage's performance with these metrics before you install IBM Cloud Pak for AIOps. For more information, see Evaluate storage performance.
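As a rough illustration of how such metrics can be measured, the following is a hypothetical fio job file approximating sequential read and write tests (the job names, block size, file size, and target directory are all assumptions; the supplied benchmarking tool referenced above is the recommended method).

```ini
; Hypothetical fio job file: sequential read/write against the application storage path
[global]
directory=/var/lib/aiops/storage
size=1g
bs=4k
ioengine=libaio
iodepth=16
runtime=60
time_based

[seq-read]
rw=read

[seq-write]
rw=write
```

Run it with `fio <jobfile>` and compare the reported IOPS, bandwidth, and latency figures with the table above.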
Network requirements
The control plane nodes require the following access:
Port number | Direction | Protocol | Description |
---|---|---|---|
80 | Inbound from load balancer | TCP | Application HTTP port |
443 | Inbound from load balancer | TCP | Application HTTPS port |
6443 | Inbound from cluster worker nodes; between control plane nodes | HTTPS | Control plane server API |
5001 | Between cluster nodes | TCP | Distributed registry |
8472 | Between cluster nodes | UDP | Virtual network |
Worker nodes require the following access:
Port number | Direction | Protocol | Description |
---|---|---|---|
6443 | Outgoing to control plane nodes | HTTPS | IBM Cloud Pak for AIOps components |
5001 | Between cluster nodes | TCP | Distributed registry |
8472 | Between cluster nodes | UDP | Virtual network |
Firewall rules
Some Linux distributions require extra firewall rules to avoid potential conflicts or restrictions.
- Check whether the firewall is enabled:

  systemctl status firewalld

  If the firewall is enabled, then the command output provides the status of the firewall, including any active rules. If the firewall is disabled, then the command outputs the message "No such file or directory".
- If the firewall is enabled, run the following commands:

  firewall-cmd --permanent --add-port=6443/tcp # apiserver
  firewall-cmd --permanent --zone=trusted --add-source=10.42.0.0/16 # pods
  firewall-cmd --permanent --zone=trusted --add-source=10.43.0.0/16 # services
  firewall-cmd --reload
- If you are permitting the collection of usage data, then ensure that outbound traffic to https://api.segment.io is allowed. For more information, see Updating usage data collection preferences.
nm-cloud-setup service
The nm-cloud-setup service must not be enabled. Run the following steps on each of your nodes.
- Check if nm-cloud-setup.service is enabled:

  systemctl is-enabled nm-cloud-setup.service

  If the service is enabled, the command returns enabled.
- If the service is enabled, then disable it and reboot the node:

  systemctl disable nm-cloud-setup.service nm-cloud-setup.timer
  reboot
High availability considerations
IBM Cloud Pak for AIOps uses a minimum of three control plane nodes and multiple worker nodes to improve resiliency if a node failure occurs.
A procedure is also provided to delete a failed and inaccessible node. For more information, see Recovering from node failure.
Load balancing
IBM Cloud Pak for AIOps requires a load balancer to distribute incoming traffic among the three control plane nodes. The load balancer is a mandatory component, and if you do not already have a load balancer then you must configure one of your choice.
The load balancer requires the following access:
Port number | Description |
---|---|
443 | Inbound IBM Cloud Pak for AIOps requests |
6443 | Inter-cluster communications: inbound from worker nodes, outbound to control plane nodes |
(optional) 80 | For inbound IBM Cloud Pak for AIOps requests, redirects to secure port 443. If you disable this port, users must type https:// as part of the URL when they first navigate to the IBM Cloud Pak for AIOps console. |
The load balancer is the entry point for accessing IBM Cloud Pak for AIOps. It is important to ensure the load balancer's high availability, because any downtime or outage renders IBM Cloud Pak for AIOps inaccessible. Consider adding the load balancer to the startup configuration (such as init.d) on the server where it is installed. This configuration enables the load balancer to launch automatically during the boot process, minimizing downtime if an outage occurs.
For more information about load balancers, see the IBM topic What is load balancing?
Expand the following section to view an example configuration for an HAProxy load balancer. This example is provided as a reference only; consider your own internal policies and practices when you create the configuration for your chosen load balancer.
Example load balancer configuration
#---- load balancer configuration start ---
frontend aiops-frontend
bind *:443
mode tcp
option tcplog
default_backend aiops-backend
backend aiops-backend
mode tcp
option tcp-check
balance roundrobin
default-server inter 10s downinter 5s
server server-1 10.11.12.105:443 check
server server-2 10.11.12.108:443 check
server server-3 10.11.12.111:443 check
frontend cncf-frontend
bind *:6443
mode tcp
option tcplog
default_backend cncf-backend
backend cncf-backend
mode tcp
option tcp-check
balance roundrobin
default-server inter 10s downinter 5s
server server-1 10.11.12.105:6443 check
server server-2 10.11.12.108:6443 check
server server-3 10.11.12.111:6443 check
frontend aiops-legacy-frontend
bind *:80
mode tcp
option tcplog
default_backend aiops-legacy-backend
backend aiops-legacy-backend
mode tcp
option tcp-check
balance roundrobin
default-server inter 10s downinter 5s
server server-1 10.11.12.105:80 check
server server-2 10.11.12.108:80 check
server server-3 10.11.12.111:80 check
DNS requirements
IBM Cloud Pak for AIOps requires a Domain Name System (DNS) server in the environment, with resolution for the following two hosts:
<load_balancer_ip> aiops-cpd.<your fully-qualified server domain name>
<load_balancer_ip> cp-console-aiops.<your fully-qualified server domain name>
Where <load_balancer_ip> is the IP address of your load balancer.
Important: The DNS server must provide name resolution for all of the nodes in the cluster.
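A quick way to verify the required resolution from a node is a lookup of both hosts; a minimal sketch (example.com is a placeholder for your fully-qualified server domain name):

```shell
#!/bin/sh
# Check that the two required hosts resolve (replace example.com with your domain).
for host in aiops-cpd.example.com cp-console-aiops.example.com; do
  if getent hosts "$host" > /dev/null 2>&1; then
    echo "$host: resolves"
  else
    echo "$host: does NOT resolve"
  fi
done
```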
Security considerations
Ensure that you keep up to date on any security-related news or vulnerabilities with IBM Cloud Pak for AIOps by subscribing to security bulletins.
The aiopsctl tool must run with root privileges.
For more information about the permissions that are required for Cloud Pak for AIOps, see Permissions.
You can use a custom certificate for IBM Cloud Pak for AIOps instead of the default cluster certificate. For more information, see Using a custom certificate.
Licensing
License usage tracking is required. The aiopsctl tool deploys the IBM Cloud Pak foundational services License Service on your Linux cluster. This background service collects and stores the license usage information for tracking license consumption and for audit purposes. For more information, see Licensing.