
QRadar: Troubleshooting disk I/O performance issues

Troubleshooting


Problem

This article shares commands to troubleshoot slow disks, I/O-expensive processes, and competing tasks that cause disk I/O contention, all of which can negatively impact QRadar performance.

Symptom

  • Searches taking a long time to complete on a specific host
  • High system load but no expensive CPU processes can be found
  • Lucene Indexer falling behind

Cause

Disk contention occurs when multiple processes access the same resource at the same time, leading to an increase in CPU wait times, which in turn increases the system load.
On QRadar, bad disks, expensive searches, or too many indexed properties can contribute to this behavior.

Diagnosing The Problem

If the system is not performing as expected, administrators can follow these steps to determine whether disk contention is a problem. If you find test values outside of the recommended range, proceed to the Resolving The Problem section for steps to fix the issue.

How to find the load average

  1. SSH into the QRadar console as the root user.
  2. Optional. If the issue is on a host other than the console, SSH to that managed host.
  3. Review the load average and the wait time (wa). Use the top -c command to display that information:
    top -c
    In the output, look for the load average field and the wa value.
    (Screenshot: top output with the load average and wa fields highlighted)
    The wa value is a percentage.
    A wait time over 15% can be considered high and means that the CPU is waiting for the disk to finish reading or writing more than 15% of the time.
    The load average is shown as an average over the last 1, 5, and 15 minutes.
    Unlike the wait time, the load average is not a percentage and needs to stay under the number of CPU cores of the host. For example, on a 32-core host a system load of 20 is fine, but on a 16-core host it is considered overloaded.
  4. To determine how many CPU cores the QRadar host has, run the lscpu command and check the CPU(s) line.
    Output example:
    (Screenshot: lscpu output with the CPU(s) line highlighted)

    Result
    If the system is overloaded for a long period (for example, the load average over the last 15 minutes exceeds the number of CPU cores), the administrator needs to determine whether the cause is disk I/O contention.

    A high wa value (over 15%) while the host is overloaded is a sign of disk contention. In that case, monitor the performance of a particular disk as described in How to monitor the performance of a particular disk.

    Alternatively, if the system is overloaded but the wait times look good, the issue might have a different root cause. Review the following two articles for more information:
    QRadar: How to monitor and check whether the CPU is bound or overloaded
    QRadar: Performance issues caused by oversubscribed hardware resources
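    The checks above can also be captured non-interactively, which is convenient when collecting data for a support case. The following sketch is a minimal example that reads the same values from top in batch mode, /proc/loadavg, and lscpu; it adds nothing beyond the commands already described in this section.
    # Illustrative example: capture the wait time, load average, and CPU core count in one pass
    top -bn1 | head -n 5                               # summary lines include the load average and the wa value
    cores=$(lscpu | awk '/^CPU\(s\):/ {print $2}')     # number of CPU cores
    load15=$(awk '{print $3}' /proc/loadavg)           # 15-minute load average
    echo "CPU cores: $cores, 15-minute load average: $load15"
    If the 15-minute load average printed by the last command is consistently higher than the core count, continue with the disk contention checks that follow.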

How to monitor the performance of a particular disk

  1. The disk that most often exhibits this problem is the one that holds the /store partition. To confirm which disk it is, run the lsblk command. In this example, it is sda:
    (Screenshot: lsblk output showing the disk that holds the /store partition, sda in this example)
  2. To monitor the performance of a particular disk, run the iostat -dmx <diskName> 1 command.
    Command example:
    iostat -dmx sda 1
    Result
    Output example:
    (Screenshot: iostat output with the await, r_await, and w_await columns highlighted)
    •  await, r_await (read await), and w_await (write await) are the most important columns. Values greater than 15 ms must be investigated.

    If the values are outside the expected range, the administrator needs to identify any I/O-expensive processes that could be causing the problem. This process is described in the next section, How to identify the most I/O expensive processes.
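    To capture a fixed sample instead of watching the output scroll, the following sketch records ten one-second iostat samples for the disk identified with lsblk (sda here, as in the example above; substitute your own device name). The output file name under /tmp is arbitrary.
    # Illustrative example: record ten one-second samples for the /store disk and review them
    iostat -dmx sda 1 10 > /tmp/iostat_store_disk.txt
    less /tmp/iostat_store_disk.txt                    # check the await, r_await, and w_await columns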

How to identify the most I/O expensive processes

To identify the most I/O expensive processes, use the iotop -aoP command. 

Monitor the DISK READ, DISK WRITE, and IO> columns for any expensive processes:
(Screenshot: iotop output sorted by the IO> column)

Result
Note the processes with the highest IO percentage. If the Ariel service appears as an expensive service, follow the Expensive searches or too many concurrent searches steps in the Resolving The Problem section.

If there is an unrecognized expensive process, make sure that no third-party software is running on the QRadar host, as described in the following note:
Third-party software on QRadar appliances
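iotop can also run in batch mode, which is useful for capturing a sample that can be reviewed later or attached to a support case. The sketch below is illustrative; the five 10-second iterations and the output file name are only suggestions.
# Illustrative example: capture five accumulated iotop samples, 10 seconds apart, to a file
iotop -aoP -b -n 5 -d 10 > /tmp/iotop_sample.txt
less /tmp/iotop_sample.txt                         # look for processes with the highest accumulated DISK READ and DISK WRITE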

How to test the disk write speed

Another good indicator is the disk write speed. To test it, run the following command to write a 1 GB file; the average write speed is shown at the very end of the output. Make sure it meets the requirements established in the system requirements:

dd if=/dev/zero of=/store/persistent_queue/test1.img bs=1G count=1 oflag=dsync
Output example:
In this example, the disk is not performing as expected, with an average write speed of 51.9 MB/s.
(Screenshot: dd output showing an average write speed of 51.9 MB/s)

Result
Go to System requirements and check the Data transfer rate (MB/s) column in Table 3 of the Storage requirements section to determine what speed is appropriate for your system. Don't forget to remove the 1 GB test file when you are finished testing. If your speed is less than expected, follow the Slow disk write speed steps in the Resolving The Problem section.
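Because a single run can be skewed by transient load on the host, it can help to repeat the test a few times and compare the reported speeds. The sketch below is only a suggestion built around the same dd command shown above, and it removes the 1 GB test file when it finishes.
# Illustrative example: run the write test three times, then remove the test file
for i in 1 2 3; do
  dd if=/dev/zero of=/store/persistent_queue/test1.img bs=1G count=1 oflag=dsync
done
rm -f /store/persistent_queue/test1.img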

Resolving The Problem

Expensive searches or too many concurrent searches

When the Ariel service is I/O expensive, there might be expensive searches or too many of them in execution.

  1. Run the following command to determine how many concurrent searches are running on the affected host (Console, Event Processor, Flow Processor, or Data Node):
    grep 'Ariel Server is' /var/log/qradar.log | awk '($17>0){print "Date:",$1,$2,$3,"# of concurrent searches",$17 }' | tail -n10
  2. Use this technical note to identify and cancel any expensive searches:
    QRadar: How to find and cancel searches that are running in the background
  3. Optional. If it is acceptable to cancel all running searches, or an immediate recovery of performance is needed, use the following steps:
    1. SSH into the QRadar console as the root user.
    2. To stabilize the performance, restart the Ariel service. This action cancels all searches in execution. Run the following two commands:
      Note: Before the Ariel service is restarted, review the impact of restarting services to get more information about this action.
      systemctl restart ariel_proxy_server
      /opt/qradar/support/all_servers.sh "systemctl try-restart ariel_query_server"

      Result
      Expensive searches and excess concurrent searches are cancelled, which improves performance and gives the administrator time to address the expensive and concurrent searches so the issue does not recur.
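      After the restart, it is worth confirming that the Ariel services came back up before continuing. A quick check, using the same all_servers.sh helper as above, might look like the following; the expected state is "active" on each host.
      # Illustrative example: confirm the Ariel services are running again after the restart
      systemctl is-active ariel_proxy_server
      /opt/qradar/support/all_servers.sh "systemctl is-active ariel_query_server"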

Slow disk write speed

If the affected host is not a virtual machine, it is important to rule out a disk failure by using the following procedure: QRadar: Troubleshooting disk failure or predictive disk failure notifications

Result
The administrator confirms that the disk is not failing.
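As a quick preliminary indication on physical appliances, a SMART health query can be run against the suspect disk. This is only a rough check: it assumes the smartctl utility is available on the host and it does not replace the linked troubleshooting procedure. Replace /dev/sda with the disk identified earlier.
# Illustrative example: quick SMART health check on the suspect disk (assumes smartctl is installed)
smartctl -H /dev/sda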

If the performance does not improve, contact QRadar Support for assistance and provide all the details collected in the Diagnosing The Problem section of this technical note.

Document Location

Worldwide

[{"Type":"MASTER","Line of Business":{"code":"LOB24","label":"Security Software"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSBQAC","label":"IBM Security QRadar SIEM"},"ARM Category":[{"code":"a8m0z000000cwtiAAA","label":"Performance"}],"ARM Case Number":"","Platform":[{"code":"PF016","label":"Linux"}],"Version":"All Versions"}]

Document Information

Modified date:
13 November 2023

UID

ibm16841421