IBM Support

QRadar: Overloaded Hypervisor Causes Instability

Troubleshooting


Problem

A QRadar server is receiving events, but the events are not being processed through the system, and the server logs real-time clock (RTC) error messages that contain "rtc interrupts".

Symptom

The following symptoms might also be present in the environment:

  • A QRadar server might display the status "Unknown" in the Web Console UI under Admin > System and License Management.
    This status can occur when the hypervisor does not give the virtual machine enough processing time to respond to ping requests. It can also occur when the related errors fill the /var/log partition faster than log rotation can keep up, which causes the services to be shut down to protect the environment.
  • On the problem server, services ecs-ec, ecs-ep, ecs-ec-ingress, or hostcontext are not in an active state.
    When the hypervisor does not give QRadar enough processing time, errors accumulate in critical services until the services fail.
  • In an HA environment, the excessive errors might cause the ha_manager service to go offline and unmount the /store partition.
    When you try to start a failed service, an error such as "cannot access PARTITION Input/output error" might be displayed:
    ecs-ec[PID]: chown: cannot access ‘/store/jheap’: Input/output error
    ecs-ec[PID]: chmod: cannot access ‘/store/jheap’: Input/output error
    ecs-ec[PID]: mkdir: cannot create directory ‘/store/jheap/ecs-ec.ecs-ec’: Input/output error
    ecs-ec[PID]: chmod: cannot access ‘/store/jheap/ecs-ec.ecs-ec’: Input/output error
    systemd[1]: ecs-ec.service: control process exited, code=exited status=1
    systemd[1]: Failed to start Event Correlation Services Event Collector.
    Further investigation shows that the partition is not mounted:
    df -h /store
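
The service and mount checks described in the symptoms above can be combined into one pass. The following is a minimal sketch, not an IBM-provided tool: the function names service_state and is_mounted are hypothetical, and the sketch assumes that systemctl and mountpoint are available on the host.

```shell
#!/usr/bin/env bash
# Sketch only: service_state and is_mounted are hypothetical helper names.

# Print the systemd state of a service, or "unknown" if it cannot be read.
service_state() {
    local state
    state=$(systemctl is-active "$1" 2>/dev/null) || true
    echo "${state:-unknown}"
}

# Succeed only when the given path is a mounted file system.
is_mounted() {
    mountpoint -q "$1" 2>/dev/null
}

# Services named in the symptom list.
for svc in ecs-ec ecs-ep ecs-ec-ingress hostcontext; do
    echo "$svc: $(service_state "$svc")"
done

if is_mounted /store; then
    echo "/store is mounted"
else
    echo "/store is NOT mounted"
fi
```

Any service that is reported with a state other than "active", or a /store partition that is reported as not mounted, matches the symptoms listed above.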

NOTE: The hypervisor can still be overloaded even when none of these symptoms are present.

Cause

RTC interrupt messages are generated when the virtual machine's internal clock misses interrupts from the hypervisor clock. A large number of rtc interrupt messages usually means that the hypervisor is overloaded.

Diagnosing The Problem

  1. Check the dmesg output or /var/log/messages for rtc errors:
    grep -i rtc /var/log/messages
    Example Output:
    blk_update_request: I/O error, dev sdb, sector N
    XFS (sdbN): metadata I/O error in "xlog_iodone" at daddr N len N error 5
    XFS (sdbN): xfs_do_force_shutdown(0x2) called from line N of file fs/xfs/xfs_log.c.  Return address = N
    kernel: hpet1: lost N rtc interrupts
  2. Check whether the most recent CPU load average is greater than the number of cores on the VM:
    lscpu | grep 'CPU(s):' | head -n 1 && uptime
  3. Check that more than 1 GB of RAM is listed as 'available':
    free -h
  4. Verify that '%iowait' (CPU time spent waiting on disk I/O) is not high:
    iostat -c
    NOTE: What is considered high varies by environment, but generally a value greater than 90% is considered high.
  5. Verify that the system meets recommended specifications for your event or flow load.
  6. If the system logs rtc interrupts, the operating system reports no CPU, RAM, or disk I/O issues, and the system meets the required specifications, then the issue is related to the hypervisor.
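
Steps 1 through 3 can be collected into a single report. The following is a minimal sketch under stated assumptions: the function names count_rtc_lines, load_exceeds_cores, and mem_available_ok are hypothetical, the sketch reads /proc/loadavg and the MemAvailable field of /proc/meminfo instead of parsing uptime and free output, and the 1 GiB threshold mirrors step 3. The '%iowait' check from step 4 is left to iostat -c.

```shell
#!/usr/bin/env bash
# Sketch only: all function names here are hypothetical helpers.

# Count log lines that mention rtc interrupts (default: /var/log/messages).
count_rtc_lines() {
    local log_file="${1:-/var/log/messages}"
    [ -r "$log_file" ] || { echo 0; return 0; }
    grep -ci 'rtc interrupts' "$log_file" || true
}

# Succeed when the load average ($1) is greater than the core count ($2).
load_exceeds_cores() {
    awk -v l="$1" -v c="$2" 'BEGIN { exit !(l > c) }'
}

# Succeed when available memory in kB ($1) is above 1 GiB (1048576 kB).
mem_available_ok() {
    [ "${1:-0}" -gt 1048576 ]
}

cores=$(nproc 2>/dev/null || echo 1)
load1=$(cut -d ' ' -f1 /proc/loadavg 2>/dev/null || echo 0)
avail_kb=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo 2>/dev/null || true)

echo "rtc interrupt lines in /var/log/messages: $(count_rtc_lines)"

if load_exceeds_cores "$load1" "$cores"; then
    echo "1-minute load $load1 is greater than $cores cores"
else
    echo "1-minute load $load1 is within $cores cores"
fi

if mem_available_ok "$avail_kb"; then
    echo "available RAM is above 1 GiB"
else
    echo "available RAM is at or below 1 GiB, or could not be read"
fi
```

If the rtc count is high while the load and memory checks pass, the result is consistent with step 6, where the operating system reports no resource issues and the hypervisor is the likely cause.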

Resolving The Problem

Contact your hypervisor administrator to troubleshoot performance, because the hypervisor is overloaded.
NOTE: Most hypervisors have default limits on resources so that virtual machines (VMs) do not consume all resources and some remain for hypervisor services and tasks. For example, some hypervisors limit CPU, RAM, and disk usage to 75%. A hypervisor can be overloaded even when its usage is below 100%.
NOTE: If the /var/log partition filled up as a result of a slow hypervisor, it can be cleaned up.
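
Before any cleanup of /var/log, it can help to see how full the partition is and which files take up the most space. The following is a minimal sketch that only reports and deletes nothing: the function name files_larger_than is hypothetical, and the 100 MiB threshold is an arbitrary assumption, not a value from IBM guidance.

```shell
#!/usr/bin/env bash
# Sketch only: files_larger_than is a hypothetical helper; nothing is deleted.

# List regular files under a directory ($1) larger than a size in KiB ($2),
# without crossing into other file systems.
files_larger_than() {
    local dir="$1" min_kb="$2"
    find "$dir" -xdev -type f -size +"${min_kb}k" 2>/dev/null || true
}

# Show how full the partition that holds /var/log is.
df -h /var/log 2>/dev/null || true

# List log files larger than 100 MiB (102400 KiB); the threshold is an assumption.
files_larger_than /var/log 102400
```

The output identifies candidate files for review. Which files are safe to remove depends on the environment and is not decided by this sketch.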

Document Location

Worldwide

Document Information

Modified date:
22 August 2022

UID

ibm16607781