IBM Cloud Private logging and metrics capacity planning

Flexibility

Responsible planning prepares companies to maximize hardware resources for workloads, while minimizing the resources required for troubleshooting and historical analysis. Allocating sufficient resources to the capture, storage, and management of logging and metrics data is crucial, especially under stressful conditions. That data often provides the key to both analysis of past events and forecasting of future requirements.

No universal, cost-effective recommendation for the capture, storage, and management of logs and metrics exists, but the following guide provides some insights based on observations of workload behavior in IBM Cloud Private. You are encouraged to test workloads under both idle and stress conditions, and to use that information to predict the hardware resources that are needed for both short-term and long-term management.

Managed services

IBM Cloud Private provides a set of managed services that can be deployed to management nodes. The resources that are allocated to those nodes need to reflect whether those managed services must handle all logging and monitoring traffic for the entire IBM Cloud Private cluster. However, central management is not always mandatory.

Several managed services also have a similar, configurable Helm chart in the catalog. Through a combination of node labels and Helm chart options, users can deploy services that focus on specific workloads and namespaces. This approach requires more detailed consideration of how the cloud is used, but it has the potential benefit of reducing the overall load on the central services.
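
For example, a namespace-scoped logging chart from the catalog can be pinned to a dedicated set of worker nodes with a node label and a matching chart option. The following sketch assumes a hypothetical label key (logging-zone), release name, and nodeSelector option; verify the actual chart name and values that your release exposes before using it.

```
# Label the worker nodes that should host the dedicated logging stack
# (the label key and value are illustrative).
kubectl label node worker-node-1 logging-zone=team-a

# Install a separate logging chart from the catalog, scoped to one namespace
# and scheduled onto the labeled nodes. Chart and option names vary by
# IBM Cloud Private release; check them with `helm inspect values`.
helm install ibm-charts/ibm-icplogging \
  --name team-a-logging \
  --namespace team-a \
  --set nodeSelector.logging-zone=team-a \
  --tls
```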

Logging and monitoring in Kubernetes

Workloads are logged and monitored at two levels. The default, and most common, level handles workloads as black boxes: the logging and monitoring services read and measure only the data that is visible from outside the Docker container itself, and no knowledge of the workload is required for them to function. The second level is a deeper workload integration.

Workloads as black boxes

Some metadata for pods, containers, namespaces, and other workloads is not available in Elasticsearch or Prometheus; in fact, most of the metadata that is visible to Kubernetes users is not visible to the managed logging and monitoring services. In the case of logging, for example, Elasticsearch queries return such fields as kubernetes.pod, kubernetes.container_name, and kubernetes.namespace. Those field values are not retrieved through an API; instead, they are extracted from the log file names themselves. Kubernetes helpfully creates symlinks to the underlying Docker logs, and the name of each symlink is built from the pod name, container name, container ID, and namespace. Without that information encoded in the file name, the logging service would not be able to populate those values.
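
As an illustration, the symlinks on a worker node encode that metadata directly in their names (the pod name, hash, and container ID shown here are invented):

```
# Kubernetes keeps one symlink per container under /var/log/containers,
# named <pod-name>_<namespace>_<container-name>-<container-id>.log,
# which is where the logging service derives kubernetes.pod,
# kubernetes.namespace, and kubernetes.container_name from.
ls /var/log/containers/
# Example output (names are illustrative):
# my-app-5d8c7b9f4-x2x7q_default_web-0a1b2c3d4e5f.log
```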

The managed monitoring service has even more constraints. Metrics collectors extract information from the running node about which processes (including Docker containers) are using what resources and to what degree. In most cases, collectors do not have deeper insight into either Kubernetes or the workload to detect metadata about the origin of the collected data, which limits the degree to which filtering and correlation can be performed. As a result, the smallest scope that can be configured for collection by the managed monitoring stack is the IBM Cloud Private node.

Deeper workload integration

Despite the black box limitation for many applications, some workloads do integrate collection APIs and log-sharing features. For example, they might use a Filebeat sidecar to send a container's internal log files, or the middleware might implement an API that exposes detailed metrics. You can also combine these techniques with black box containers to enable richer metadata collection.
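
The following sketch shows one common shape for that kind of integration: the application container writes its log files to a shared emptyDir volume, and a Filebeat sidecar ships them to Logstash. The image tags, config map name, and log path are assumptions to adapt to your environment.

```
# Minimal sketch of a Filebeat sidecar. It assumes the application writes
# plain-text logs to /var/log/app inside the container and that a config map
# named app-filebeat-config holds a filebeat.yml that ships those files to
# Logstash. Adjust names, images, and paths before applying.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: app-with-filebeat
spec:
  containers:
  - name: app
    image: my-app:latest                      # hypothetical application image
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app                 # the app writes its log files here
  - name: filebeat
    image: docker.elastic.co/beats/filebeat:5.5.1   # match the Filebeat version of your stack
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
      readOnly: true
    - name: filebeat-config
      mountPath: /usr/share/filebeat/filebeat.yml   # default config location in the Filebeat image
      subPath: filebeat.yml
  volumes:
  - name: app-logs
    emptyDir: {}
  - name: filebeat-config
    configMap:
      name: app-filebeat-config               # hypothetical config map with the filebeat.yml
EOF
```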

Hardware impact summary

ELK stack

IBM Cloud Private deploys the ELK stack as follows:

  1. A Filebeat daemonset that runs on every node
  2. A single Logstash pod, which can be scaled out
  3. An Elasticsearch master pod, which coordinates the management of the Elasticsearch cluster
  4. An Elasticsearch client pod that implements the REST interface for all incoming logs from Logstash and queries from Kibana
  5. Two Elasticsearch data pods to process and store all of the log data
  6. An optional Kibana pod
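
You can see these components on a running cluster. In most releases they are deployed to the kube-system namespace on the management nodes, although the exact pod names vary by version:

```
# List the managed ELK components (pod names and namespace can differ
# between IBM Cloud Private releases).
kubectl -n kube-system get pods -o wide | grep -E 'filebeat|logstash|elasticsearch|kibana'
```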

In general, the parts of the stack that require the most resources are Logstash and the Elasticsearch data nodes. The Elasticsearch master and client nodes are able to handle high volumes of traffic with minimal resource use. Filebeat is also very efficient, consuming only trivial resources.

The default, single-instance Logstash configuration can handle hundreds of log entries per second, with CPU usage that grows at a rate of about one core per 150 - 200 records per second. However, at a certain point, which depends on network capacity and is potentially around 700 records per second, the volume of log traffic begins to degrade network performance. That degradation has a corresponding effect on applications that run on the affected nodes. In general, if you expect high rates of log traffic, either distribute it across as many nodes as possible, or break up the workloads into multiple IBM Cloud Private clusters. Fortunately, Filebeat and Logstash are excellent at tracking and recovering from connectivity errors with minimal data loss when normal traffic rates resume.

The Elasticsearch data nodes usually use less CPU than Logstash but require more attention to disk and RAM. According to Elastic, logs stored in Elasticsearch typically require about as much storage as the raw log files themselves. Memory consumption might rise as high as 15-20% of the stored log volume. Adjustments to the Elasticsearch configuration can affect those numbers, but it is important to emphasize that JVM heap represents only one part of the total memory that a data node uses.
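
One way to see the difference between JVM heap and total memory is the Elasticsearch _cat API. The host, port, and security options below depend on how your ELK stack is exposed, so treat this as a sketch:

```
# Compare JVM heap usage with overall memory on each Elasticsearch node.
# Adjust the host, port, and any TLS or authentication options for your deployment.
curl -s 'http://elasticsearch:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent,ram.max'
```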

Encryption naturally puts a heavier load on the CPU, particularly as log and query traffic increases. Newer CPU models handle encryption more efficiently and might reduce the amount of additional hardware that is needed, but some extra capacity is still required. Adding extra memory is also recommended, because the plug-ins that provide TLS encryption might impose some overhead.

Prometheus

For various reasons, Prometheus retains all collected metrics in memory for a 2-hour period. The amount of RAM that Prometheus requires depends on a number of factors, including:

  1. The number of nodes in the IBM Cloud Private cluster
  2. The number of workloads during peak operating conditions
  3. The frequency with which metrics are collected

The third factor is the one that needs the most careful consideration. Doubling the time between metrics collections (for example, increasing to every 30 seconds from every 15 seconds) roughly halves the memory usage of Prometheus, but it also reduces the granularity of those metrics. One key element of the planning process is an evaluation of the requirements for metrics collection, both for troubleshooting and for predictive analysis.
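
The collection frequency is controlled by the Prometheus scrape interval. In the managed monitoring stack it typically lives in the Prometheus config map or in the monitoring chart's values; the config map name below is an assumption to verify against your release.

```
# Locate and inspect the Prometheus scrape configuration; the config map
# name varies by release, so list the candidates first.
kubectl -n kube-system get configmap | grep -i prometheus
kubectl -n kube-system get configmap monitoring-prometheus -o yaml | grep scrape_interval

# In standard Prometheus configuration, the setting looks like this:
#   global:
#     scrape_interval: 30s   # doubling from 15s roughly halves in-memory metric volume
```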

Detailed impact

Planners need to consider the following factors when they estimate the resources for managing logging and monitoring data:

  1. Whether data-in-motion encryption is required
  2. Whether to collect logs and metrics centrally in the managed logging and monitoring services
  3. The number of Filebeat instances (node daemonset or sidecar) streaming logs to the Logstash cluster
  4. The volume of logs that is generated by the workloads
  5. Anticipated bursts of load, resulting in higher log volume
  6. The granularity of metrics to collect
  7. Logging and metrics retention requirements
  8. Elasticsearch query performance

Encryption

While TLS is not supported for many of the collectors that Prometheus uses, all other data-in-motion traffic for monitoring and logging can be encrypted. Encryption increases CPU usage in general; recent CPUs handle it more efficiently, but they still work harder than they would without it. The ELK stack, in particular, might incur memory overhead as a result of the plug-in that enables encryption. Nodes that have tighter RAM restrictions might encounter stability and performance issues.

In general, if you plan to enable encryption, consider increasing the number of allocated CPUs by an amount proportional to overall log volume.

Centralized collection

The managed logging service is configured by default to handle relatively small loads, though it can scale to much larger workloads. As the load increases, however, usage of CPU, disk, and RAM also increases.

Monitoring resources often require more RAM than CPU. Prometheus retains all metrics in memory for a non-configurable period of 2 hours, for reasons that include responsiveness to time-sensitive queries and more efficient bulk disk operations. The result is that more workloads generate more metrics over those 2 hours, which in turn requires more memory to meter them. If memory availability is a concern on the management nodes, and IBM Cloud Private is running many workloads, you can restrict the nodes from which the managed monitoring stack collects metrics.
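
One way to restrict collection, assuming your release deploys its metrics collector as a node-exporter daemonset, is to constrain that daemonset to labeled nodes. The daemonset name and label key here are illustrative:

```
# Label only the nodes whose metrics the managed stack should collect.
kubectl label node worker-node-1 monitoring=enabled

# Constrain the collector daemonset (the name varies by release) to those nodes.
kubectl -n kube-system patch daemonset monitoring-prometheus-nodeexporter \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"monitoring":"enabled"}}}}}'
```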

Number of Filebeat instances

This factor largely impacts Logstash. By default, IBM Cloud Private deploys a Filebeat daemonset to every node, and each Filebeat instance streams logs back to the managed logging service. A Filebeat instance is also created for each pod that uses a Filebeat sidecar to stream out logs that are stored within the container. As the number of Filebeat instances grows, and as log traffic increases, review the Logstash CPU usage rate in Grafana. When the Logstash instance starts to use a full CPU core, it is a good time to consider adding another replica to the Logstash cluster.
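
Outside of Grafana, kubectl offers a quick check. The label selector and deployment name below are assumptions to adjust for your release:

```
# Check the current CPU use of the Logstash pod(s).
kubectl -n kube-system top pods -l app=logstash

# If Logstash is saturating roughly one core, add a replica
# (the deployment name varies by release).
kubectl -n kube-system scale deployment logging-elk-logstash --replicas=2
```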

Log volume

High log volume often impacts RAM and network performance more than other factors. For some environments, that might be tens of log entries per second; in others, it can rise to thousands of entries per second. Some measurements indicate that network performance might begin to degrade as log volume reaches roughly 1,000 entries per second, but individual results vary.

Log volume typically grows either through an increase in workload count, or an increase in the rate of output from workloads. As mentioned in other places, Prometheus requires more RAM for temporary metrics storage as the number of resources, including workloads, grows.

Traffic bursts

While the resources that metrics collection requires remain relatively stable through bursts of traffic, the volume of log output can grow significantly. Careful consideration must be given to CPU core, memory, and disk allocation to handle unexpected bursts in traffic.

Metrics granularity

See the Prometheus section.

Data retention

The default configuration for the managed logging and monitoring stacks retains data for only one day; every night around 24:00, the old data is deleted. You can modify these settings (see IBM Cloud Private logging), but retaining data for longer periods of time has some important implications that you must consider.
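
The nightly cleanup is typically handled by an Elasticsearch Curator job, so a retention change usually amounts to raising the number of days in its action configuration. The commands and the action file below are a generic Curator sketch, not the exact configuration that ships with the product:

```
# Find the Curator configuration used by the managed logging stack
# (the config map name differs between releases).
kubectl -n kube-system get configmap | grep -i curator

# A typical Curator delete_indices action keyed on index age looks like the
# following. Raising unit_count retains more days of logs, which also raises
# the disk and cache requirements on the Elasticsearch data nodes.
#
#   actions:
#     1:
#       action: delete_indices
#       options:
#         ignore_empty_list: True
#       filters:
#       - filtertype: pattern
#         kind: prefix
#         value: logstash-
#       - filtertype: age
#         source: name
#         direction: older
#         timestring: '%Y.%m.%d'
#         unit: days
#         unit_count: 1
```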

Elasticsearch breaks data down into chunks, which are known as indexes. Each index is composed of three parts: the data on disk, an Elasticsearch in-memory cache, and a Lucene (search engine) cache. The managed ELK stack defines each index as one day of logs. For each day's logs that are retained, the disk and cache requirements accumulate. The size of each cache correlates with the volume of logs that are stored in its index, sometimes reaching as much as 15% of that volume. In other words, for each 100 GB of logs that are stored, Grafana might report Elasticsearch data nodes using as much as 15 GB of RAM. This ratio is not a universal rule, but it demonstrates the need for testing to determine the resource load that is created by the workloads that you run on IBM Cloud Private.
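
Because the managed stack creates one index per day, the accumulation is easy to observe with the _cat API. Again, the endpoint and security options depend on your deployment:

```
# Show the daily logstash-* indices with their document counts and on-disk size.
curl -s 'http://elasticsearch:9200/_cat/indices/logstash-*?v&h=index,docs.count,store.size&s=index'
```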

Query performance

Some planning scenarios might put constraints on the time frame in which Elasticsearch queries (whether run through Kibana or directly through the Elasticsearch REST API) must complete. Some queries are complex, and others are time-sensitive, so the results must be available within a particular threshold. In these cases, it is even more important to plan not only for more memory but also for faster disks. Solid-state disks (SSDs) generally cost more, but they provide the I/O performance that enables systems to conform to rigid query thresholds.
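
When you test query performance, use queries that resemble your real time-sensitive workload, such as a bounded time-range search. The host, index pattern, and field name below are assumptions (the @timestamp field is the default for Logstash-formatted documents):

```
# Time a bounded time-range query. Adjust the host, index pattern, and any
# TLS or authentication options for your deployment.
time curl -s 'http://elasticsearch:9200/logstash-*/_search' \
  -H 'Content-Type: application/json' -d '
{
  "size": 100,
  "query": {
    "range": {
      "@timestamp": { "gte": "now-15m", "lte": "now" }
    }
  }
}'
```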

The combination of more RAM, which facilitates larger in-memory caches, and faster disks can help, but it might not tell the whole story. As described in other sections, extremely high log traffic might affect network quality, which in turn might affect query responsiveness. Other factors that are unrelated to the managed ELK stack, or even to IBM Cloud Private, might also affect query responsiveness. Early testing helps to identify any bottlenecks that might arise.

Plan for failure

In many cases, logging and monitoring data is archived for auditing but rarely for active review. It is tempting to allocate fewer costly resources to manage that data, and instead focus that hardware on the workloads. But one of the fundamental ideas behind Kubernetes is that application developers and system administrators should design their software to fail gracefully and quickly recover from the failure. That idea is why there's no command to restart pods: you can only delete the pod and wait for Kubernetes to re-create it.
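
For example, recovering a pod that is managed by a deployment or another controller looks like this (the pod name is illustrative):

```
# There is no restart verb; deleting the pod lets its controller schedule a replacement.
kubectl delete pod my-app-5d8c7b9f4-x2x7q
kubectl get pods -w   # watch the replacement pod start
```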

Accordingly, proper planning considers not only the standard day-to-day behavior of the workloads that run on IBM Cloud Private, but also what is required when one or more systems or workloads fail catastrophically. It is at those times that access to the logging and metric data is most crucial.