Logging and monitoring

Red Hat OpenShift Container Platform (RHOCP) has built-in capabilities for monitoring and logging. You can monitor the behavior of cluster resources and applications in real time and create alerts if resource consumption exceeds limits. For application behavior and operational purposes, RHOCP can monitor system and application logs, and display them or forward them to an external logging tool for analysis.

Monitoring in RHOCP

To learn more about monitoring, its components, and how to deploy and configure it, see Monitoring (Red Hat documentation).

Red Hat OpenShift Container Platform includes a preconfigured, preinstalled, and self-updating monitoring stack that is based on the Prometheus open source project and its wider ecosystem. It monitors cluster components and includes a set of alerts that immediately notify the cluster administrator about any problems that occur. The cluster monitoring stack is supported only for monitoring Red Hat OpenShift Container Platform clusters.

Note: To ensure compatibility with future Red Hat OpenShift Container Platform updates, configuring only the specified monitoring stack options is supported.

Figure 1. Monitoring Collection Alerting and Visualization of Metrics

Red Hat OpenShift Logging

As a cluster administrator, you can deploy Red Hat OpenShift Logging to aggregate all the logs from your RHOCP cluster, such as node system audit logs, application container logs, and infrastructure logs. Logging aggregates these logs from throughout your cluster and stores them in a default log store.

To learn more about Red Hat OpenShift Logging, see Red Hat OpenShift Logging (Red Hat documentation).

Red Hat OpenShift Logging aggregates the following types of logs:

Application - Container logs generated by user applications running in the cluster, except infrastructure container applications.
Receiver - The receiver input type enables the logging system to accept logs from external sources. It supports two formats for receiving logs: http and syslog.
Infrastructure - Logs generated by infrastructure components running in the cluster and Red Hat OpenShift Container Platform nodes, such as journal logs. Infrastructure components are pods that run in the openshift*, kube*, or default projects.
Audit - Logs generated by the node audit system (auditd), which are stored in the /var/log/audit/audit.log file, and the audit logs from the Kubernetes API server and the Red Hat OpenShift API server.
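
The distinction between application and infrastructure container logs can be sketched as a simple lookup on the project name. This is an illustration only, not the logging collector's actual implementation: the function name classify_container_log and the sample project shop-frontend are hypothetical, and the sketch covers container logs only (journal, audit, and receiver inputs are out of scope).

```python
from fnmatch import fnmatchcase

# Project name patterns that the list above gives for infrastructure components.
INFRA_PROJECT_PATTERNS = ("openshift*", "kube*", "default")


def classify_container_log(project: str) -> str:
    """Return "infrastructure" or "application" for a container log's project.

    Hypothetical helper: it only mirrors the rule stated in the text.
    """
    for pattern in INFRA_PROJECT_PATTERNS:
        if fnmatchcase(project, pattern):
            return "infrastructure"
    return "application"


if __name__ == "__main__":
    # "shop-frontend" is a made-up user project used for illustration.
    for project in ("openshift-monitoring", "kube-system", "default", "shop-frontend"):
        print(project, "->", classify_container_log(project))
```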

Sizing estimates for Monitoring and Logging

The monitoring stack imposes additional resource requirements. Consult the computing resources recommendations in Scaling the Cluster Monitoring Operator (Red Hat documentation) and verify that you have sufficient resources.

Number of Nodes	Number of Pods	Prometheus storage growth per day	Prometheus storage growth per 15 days	RAM Space (per scale size)	Network (per tsdb chunk)
50	1800	6.3 GB	94 GB	6 GB	16 MB
100	3600	13 GB	195 GB	10 GB	26 MB
150	5400	19 GB	283 GB	12 GB	36 MB
200	7200	25 GB	375 GB	14 GB	46 MB

Approximately 20% of the expected size is added as overhead to ensure that the storage requirements do not exceed the calculated value.

The calculation is based on the default Red Hat OpenShift Container Platform Cluster Monitoring Operator.

Note: CPU utilization has a minor impact. The ratio is approximately 1 core out of 40 per 50 nodes and 1800 pods.
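
For cluster sizes between the rows of the table, a rough estimate can be interpolated. The following is a minimal sketch, assuming that storage growth scales linearly between rows and that the pod density matches the table (36 pods per node). In the table, the 15-day column is close to 15 times the daily figure, so the sketch multiplies daily growth by the retention period. The function names are hypothetical, and the result is an estimate, not a Red Hat sizing guarantee.

```python
# Rows copied from the sizing table: (nodes, pods, storage growth in GB per day).
SIZING_ROWS = [
    (50, 1800, 6.3),
    (100, 3600, 13.0),
    (150, 5400, 19.0),
    (200, 7200, 25.0),
]


def daily_growth_gb(nodes: int) -> float:
    """Linearly interpolate daily Prometheus storage growth for a node count.

    Assumption: growth is linear between table rows. Values outside the
    table range are rejected rather than extrapolated.
    """
    if nodes < SIZING_ROWS[0][0] or nodes > SIZING_ROWS[-1][0]:
        raise ValueError("node count is outside the range covered by the table")
    for (n0, _, g0), (n1, _, g1) in zip(SIZING_ROWS, SIZING_ROWS[1:]):
        if n0 <= nodes <= n1:
            return g0 + (g1 - g0) * (nodes - n0) / (n1 - n0)
    raise AssertionError("unreachable: the range check above covers all rows")


def retention_storage_gb(nodes: int, retention_days: int = 15) -> float:
    """Estimate the storage consumed over the retention period."""
    return daily_growth_gb(nodes) * retention_days


if __name__ == "__main__":
    for nodes in (50, 75, 120, 200):
        print(f"{nodes} nodes: about {retention_storage_gb(nodes):.0f} GB over 15 days")
```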

Recommendations for Red Hat OpenShift Container Platform

Use at least three infrastructure (infra) nodes.
Use at least three RHOCP container storage nodes with fast storage drives.

Configuration of Monitoring and Logging

The supported way of configuring Red Hat OpenShift Container Platform Monitoring is documented in About OpenShift Container Platform monitoring (Red Hat documentation). Do not use other configurations, as they are unsupported. Configuration paradigms might change across Prometheus releases, and such cases can be handled gracefully only if all configuration possibilities are controlled. If you use configurations other than the ones described in this section, your changes disappear because the cluster-monitoring-operator reconciles any differences. By default and by design, the operator resets everything to the defined state.

Explicitly unsupported cases include:

Creating extra ServiceMonitor objects in the openshift-* namespaces. This extends the set of targets that the cluster monitoring Prometheus instance scrapes, which can cause collisions and load differences that cannot be accounted for. These factors might make the Prometheus setup unstable.
Creating unexpected ConfigMap objects or PrometheusRule objects. This causes the cluster monitoring Prometheus instance to include extra alerting and recording rules.
Modifying resources of the stack. The Prometheus Monitoring Stack ensures that its resources are always in the state it expects them to be. If they are modified, the stack resets them.
Using the resources of the stack for your own purposes. The resources that are created by the Prometheus Cluster Monitoring stack are not meant to be used by any other resources, as there are no guarantees about their backward compatibility.
Stopping the Cluster Monitoring Operator from reconciling the monitoring stack.
Modifying the monitoring stack Grafana instance.

Configuring persistent storage

Running cluster monitoring with persistent storage means that your metrics are stored on a persistent volume (PV) and can survive a pod being restarted or re-created. This is ideal if you need to protect your metrics or alerting data from data loss. For production environments, configuring persistent storage is highly recommended. Because of the high I/O demands, it is advantageous to use local storage.

For details, see Optimizing storage.

Prerequisites

Dedicate sufficient local persistent storage to ensure that the disk does not become full. How much storage you need depends on the number of pods.
Make sure that you have a persistent volume (PV) ready to be claimed by the persistent volume claim (PVC), one PV for each replica. Because Prometheus has two replicas and Alertmanager has three replicas, you need five PVs to support the entire monitoring stack. The PVs should be available from the Local Storage Operator. This does not apply if you enable dynamically provisioned storage.
Use the block type of storage.
For setup details, see Persistent storage using local volumes.

By default, the Prometheus Cluster Monitoring stack configures the retention time for Prometheus data to be 15 days. You might want to modify the retention time to change how soon the data is deleted.
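
Persistent storage and retention are both set through the cluster-monitoring-config ConfigMap in the openshift-monitoring project. The sketch below renders such a manifest as text and counts the PVs that the replica numbers above imply. The storage class name local-storage and the volume sizes are placeholder assumptions, and render_monitoring_config is a hypothetical helper; check the field names against the Red Hat documentation for your RHOCP version before applying anything.

```python
# Replica counts as stated in the prerequisites above.
REPLICAS = {"prometheus": 2, "alertmanager": 3}


def required_pv_count() -> int:
    """One PV per replica when storage is not dynamically provisioned."""
    return sum(REPLICAS.values())


def render_monitoring_config(retention: str, storage_class: str,
                             prometheus_size: str, alertmanager_size: str) -> str:
    """Render a cluster-monitoring-config ConfigMap manifest as YAML text.

    Hypothetical helper for illustration; all argument values are supplied
    by the caller and are not recommendations.
    """
    return f"""\
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: {retention}
      volumeClaimTemplate:
        spec:
          storageClassName: {storage_class}
          resources:
            requests:
              storage: {prometheus_size}
    alertmanagerMain:
      volumeClaimTemplate:
        spec:
          storageClassName: {storage_class}
          resources:
            requests:
              storage: {alertmanager_size}
"""


if __name__ == "__main__":
    print(f"PVs needed: {required_pv_count()}")
    # Placeholder values: adjust the class name and sizes to your cluster.
    print(render_monitoring_config("15d", "local-storage", "400Gi", "10Gi"))
```

The rendered text could be saved to a file and applied with oc apply -f. If a cluster-monitoring-config ConfigMap already exists, merge the settings into it rather than overwriting it.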

Alertmanager

The Prometheus Alertmanager is a component that manages incoming alerts, including:

Alert silencing
Alert inhibition
Alert aggregation
Reliable deduplication of alerts
Grouping alerts
Sending grouped alerts as notifications through receivers such as email and PagerDuty

Red Hat OpenShift Container Platform monitoring includes the Watchdog alert, which fires continuously. Alertmanager repeatedly sends notifications for the Watchdog alert to the notification provider, for example, PagerDuty. The provider is configured to notify the administrator when it stops receiving the Watchdog alert. This mechanism helps ensure continuous operation of Prometheus as well as continuous communication between Alertmanager and the notification provider.
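
The Watchdog mechanism is a form of dead man's switch: the notification provider raises an alarm when the expected heartbeat stops. The following sketch shows the provider-side logic in isolation. The class DeadMansSwitch, the timeout value, and the choice not to alarm before the first heartbeat are all assumptions made for illustration; they do not describe how PagerDuty or any other provider implements the feature.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeadMansSwitch:
    """Trip when an expected heartbeat has not arrived within the timeout."""

    timeout_seconds: float
    last_seen: Optional[float] = None

    def heartbeat(self, now: float) -> None:
        """Record a received Watchdog notification at time `now` (seconds)."""
        self.last_seen = now

    def is_tripped(self, now: float) -> bool:
        """Return True if the heartbeat is overdue at time `now`."""
        if self.last_seen is None:
            # Assumption: stay quiet until the first heartbeat arrives.
            return False
        return now - self.last_seen > self.timeout_seconds


if __name__ == "__main__":
    switch = DeadMansSwitch(timeout_seconds=300)
    switch.heartbeat(now=0)
    print("after 200 s:", switch.is_tripped(now=200))
    print("after 301 s:", switch.is_tripped(now=301))
```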

Red Hat OpenShift Container Platform Cluster Monitoring by default ships with a set of pre-defined alerting rules.

The default alerting rules are used specifically for the Red Hat OpenShift Container Platform cluster and nothing else. For example, you get alerts for a persistent volume in the cluster, but you do not get them for a persistent volume in your custom namespace.
Some alerting rules have identical names. This is intentional. They send alerts about the same event with different thresholds, with different severity, or both.
With the inhibition rules, the lower severity is inhibited when the higher severity is firing.
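
The effect of inhibition on identically named alerts can be sketched as follows. This is a simplified model, not Alertmanager's implementation: real inhibition rules match on configurable label sets, whereas this sketch keys on the alert name only. The alert names in the example and the function apply_inhibition are hypothetical.

```python
from typing import Dict, List

# Assumed severity ordering, lowest to highest.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


def apply_inhibition(alerts: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep, per alert name, only the alerts at the highest firing severity."""
    highest: Dict[str, int] = {}
    for alert in alerts:
        rank = SEVERITY_RANK[alert["severity"]]
        name = alert["alertname"]
        highest[name] = max(highest.get(name, rank), rank)
    return [
        alert
        for alert in alerts
        if SEVERITY_RANK[alert["severity"]] == highest[alert["alertname"]]
    ]


if __name__ == "__main__":
    firing = [
        {"alertname": "ExampleDiskFilling", "severity": "warning"},
        {"alertname": "ExampleDiskFilling", "severity": "critical"},
        {"alertname": "ExampleHighLatency", "severity": "warning"},
    ]
    for alert in apply_inhibition(firing):
        print(alert["alertname"], alert["severity"])
```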

Monitoring your own services

You can use RHOCP monitoring for your own services in addition to monitoring the cluster. This way, you do not need to use an extra monitoring solution. This helps keep monitoring centralized. Additionally, you can extend the access to the metrics of your services beyond cluster administrators. This enables developers and arbitrary users to access these metrics. You can also export custom application metrics for the horizontal pod autoscaler.

Note: Opting into monitoring your own services is mutually exclusive with either a custom installation of Prometheus Operator or installing Prometheus Operator using Operator Lifecycle Manager (OLM).

To get the technical details on how to enable monitoring of your own services, see Configure user workload monitoring.
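
As a rough orientation, enabling monitoring for your own services typically involves two manifests: a cluster-monitoring-config ConfigMap that sets enableUserWorkload, and a ServiceMonitor in the project of your service. The sketch below renders both as text. The service name, project, label, and port name are hypothetical, render_service_monitor is an illustrative helper, and the guard against openshift-* projects only reflects the unsupported case listed earlier. Verify the exact fields against the Red Hat documentation for your RHOCP version.

```python
# Assumed manifest for opting in to user workload monitoring.
ENABLE_USER_WORKLOAD = """\
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
"""


def render_service_monitor(name: str, namespace: str, app_label: str,
                           port_name: str, interval: str = "30s") -> str:
    """Render a ServiceMonitor manifest for a user project as YAML text."""
    if namespace.startswith("openshift-"):
        # Extra ServiceMonitor objects in openshift-* projects are unsupported.
        raise ValueError("do not create ServiceMonitor objects in openshift-* projects")
    return f"""\
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {name}
  namespace: {namespace}
spec:
  endpoints:
  - port: {port_name}
    interval: {interval}
    scheme: http
  selector:
    matchLabels:
      app: {app_label}
"""


if __name__ == "__main__":
    print(ENABLE_USER_WORKLOAD)
    # All names below are placeholders for illustration.
    print(render_service_monitor("example-app-monitor", "example-project",
                                 "example-app", "web"))
```

If a cluster-monitoring-config ConfigMap already exists, add the enableUserWorkload setting to it instead of replacing the whole object.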