IBM Support

QRadar: Troubleshooting disk space usage problems

Troubleshooting


Problem

This article will guide you through troubleshooting high disk usage situations in QRadar, which can ultimately lead to services being stopped, resulting in an outage.

Cause

By default, the QRadar disk sentry check runs every 60 seconds and looks for high disk usage across the following partitions:
QRadar 7.4.x partition   Critical (services stop)   Article
/                        Yes, at 95%                Technote #0881470
/store                   Yes, at 95%                Technote #0882066
/transient               Yes, at 95%                Technote #0882064
/storetmp                Yes, at 95%                Technote #0882068
/opt                     Yes, at 95%                Technote #0882070
/var                     No                         -
/var/log                 No                         Technote #0882056
/var/log/audit           No                         -
/tmp                     No                         -
/home                    No                         -

If any of these partitions exceeds 90% usage, a warning notification is sent to the UI. You can also see a line logged to /var/log/qradar.log, such as the one seen below:

Apr 14 18:10:31 ::ffff:9.55.221.216 [hostcontext.hostcontext] [cb4eb5ec-2cae-4075-ab9b-48d2e63dafd5/SequentialEventDispatcher] com.q1labs.hostcontext.ds.DiskSpaceSentinel: [WARN] [NOT:0150064102][9.55.221.216/- -] [-/- -]System disk resources above warning threshold

Important: For the partitions listed in the table as critical for system functionality, system services are stopped when usage reaches 95% to prevent the partition from filling completely and causing further issues. A maximum threshold notification is sent to the UI and can also be seen in qradar.log, as shown below:

Apr 14 18:15:31 ::ffff:9.55.221.216 [hostcontext.hostcontext] [cb4eb5ec-2cae-4075-ab9b-48d2e63dafd5/SequentialEventDispatcher] com.q1labs.hostcontext.ds.DiskSpaceSentinel: [ERROR] [NOT:0150064100][9.55.221.216/- -] [-/- -]Disk usage on at least one disk has exceeded the maximum threshold level of 0.95. The following disks have exceeded the maximum threshold level: /transient, . Processes are being shut down to prevent data corruption. To minimize the disruption in service, reduce disk usage on this system.

For the other partitions, which are denoted as non-critical, the disk sentry check issues a warning when the threshold is met, but system processes are not stopped and no outage occurs.

For reference, when the system recovers back below the threshold, a notification is sent to the UI and the following message is seen in qradar.log:

Apr 14 18:18:31 ::ffff:9.55.221.216 [hostcontext.hostcontext] [cb4eb5ec-2cae-4075-ab9b-48d2e63dafd5/SequentialEventDispatcher] com.q1labs.hostcontext.ds.DiskSpaceSentinel: [INFO] [NOT:0150066100][9.55.221.216/- -] [-/- -]System disk resources back to normal levels
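
The three messages above all come from the DiskSpaceSentinel class, so they can be pulled out of the log with grep. The following is a minimal sketch, not an IBM-provided tool: the function names sentinel_events and sentinel_errors are hypothetical, and the default log path is the /var/log/qradar.log location named in this article.

```shell
# Hypothetical helper: print all disk sentry messages from a QRadar log file.
# Usage: sentinel_events [logfile]   (defaults to /var/log/qradar.log)
sentinel_events() {
    grep -F 'DiskSpaceSentinel' "${1:-/var/log/qradar.log}"
}

# Hypothetical helper: keep only the maximum-threshold errors,
# which name the partitions that exceeded the limit.
sentinel_errors() {
    sentinel_events "$1" | grep -F '[ERROR]'
}
```

For example, running sentinel_events | tail -n 20 shows the most recent warning, error, and recovery messages, which helps establish when usage crossed each threshold.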

Diagnosing The Problem

The first step in diagnosing the problem is determining which partition is affected. Run the df -h command to list each mounted partition with its size and percentage used. An example output is seen below:

Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/rootrhel-root          13G  2.9G  9.7G  23% /
devtmpfs                           16G     0   16G   0% /dev
tmpfs                              16G   20K   16G   1% /dev/shm
tmpfs                              16G  1.7G   15G  11% /run
tmpfs                              16G     0   16G   0% /sys/fs/cgroup
/dev/mapper/rootrhel-var          5.0G  208M  4.8G   5% /var
/dev/sda3                          32G  4.1G   28G  13% /recovery
/dev/mapper/rootrhel-home        1014M   33M  982M   4% /home
/dev/sda2                        1014M  224M  791M  23% /boot
/dev/mapper/rootrhel-tmp          3.0G   53M  3.0G   2% /tmp
/dev/mapper/rootrhel-opt           13G  5.1G  7.5G  41% /opt
/dev/mapper/rootrhel-storetmp      15G   34M   15G   1% /storetmp
/dev/mapper/rootrhel-varlog        15G  3.6G   12G  24% /var/log
/dev/mapper/storerhel-transient    40G   40G  236M 100% /transient
/dev/mapper/rootrhel-varlogaudit  3.0G  205M  2.8G   7% /var/log/audit
tmpfs                             3.2G     0  3.2G   0% /run/user/0
/dev/drbd0                        158G   78G   80G  50% /store

From here, you can see that the /transient partition is the one with the issue. Now that you have identified the partition having the issue, go to the Resolving The Problem section to find details about finding large files/directories on the partition. Also, be sure to review the linked article for your partition issue in the Cause section.
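
On a system with many mount points, the df output can be filtered so that only partitions at or above a given usage percentage are shown. This is a sketch under stated assumptions: over_threshold is a hypothetical function name, it expects the POSIX output format produced by df -P on standard input, and it assumes mount points do not contain spaces.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the mount point
# and usage of every file system at or above the given percentage (default 90).
over_threshold() {
    awk -v limit="${1:-90}" 'NR > 1 {      # skip the header line
        use = $5                           # Capacity column, e.g. "100%"
        sub(/%/, "", use)                  # strip the percent sign
        if (use + 0 >= limit + 0) print $6, $5
    }'
}
```

Run it as df -P | over_threshold 90 for the warning level, or df -P | over_threshold 95 for the level at which services stop on critical partitions.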

   

Resolving The Problem

General troubleshooting for large files or directories:

Generally speaking, there are a couple of reasons you may have high disk usage on your QRadar partition(s).

  • Large file(s) on the partition causing it to fill 

  • Lots of smaller files build up over time and cause a certain directory on the partition to grow excessively 

For the first situation, the find command can help. Run find /partition -xdev -type f -size +200M | xargs ls -lhSr, replacing /partition with the affected partition, to list all files over 200 MB on that partition, sorted from smallest to largest. An example output can be seen below:

# find /transient -xdev -type f -size +200M | xargs ls -lhSr
-rw-r--r-- 1 root root 39G Apr 14 19:25 /transient/bigfile.img

Note: You may need to modify the size threshold to a higher or lower value based on your output, but 200M is generally a good starting point.
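
If file names on the partition might contain spaces, or if no file matches the size threshold, piping find into xargs can give misleading results (with no matches, GNU xargs still runs ls once, which lists the current directory). One alternative, sketched here and assuming GNU find, is to let find call ls directly with -exec. The function name large_files is hypothetical.

```shell
# Hypothetical helper: list files larger than a given size on one file system,
# smallest first. Size uses find's syntax (for example 200M or 1G).
# Usage: large_files <directory> [size]
large_files() {
    # -xdev keeps find on the starting file system;
    # -exec ... {} + passes file names to ls safely and runs nothing on no match.
    find "$1" -xdev -type f -size +"${2:-200M}" -exec ls -lhSr {} +
}
```

For example, large_files /transient 1G lists only files over 1 GB. If a very large number of files match, find may invoke ls more than once, in which case each batch is sorted separately.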

For the second situation, you can utilize the du command to get recursive directory sizes for a specific partition or directory. Run:

du -xch /partition | sort -h

Or

du -chaxd1 | sort -h

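Both commands stay on a single file system (-x), print a grand total (-c), and sort the results from smallest to largest. The first reports every directory under /partition recursively. The second reports only the files and directories directly inside the current directory (-a, -d1), so change to the directory you want to inspect before running it.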

You can use this output to identify which directory is consuming the most disk space on the partition, and then you can look into that directory to see which file(s) are there consuming the space.
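
Because the recursive output can be long, it is often enough to look at the largest entries one level at a time and then repeat the command inside the biggest directory. The following sketch assumes GNU du and sort. The function name top_dirs is hypothetical.

```shell
# Hypothetical helper: show the largest immediate subdirectories of a directory,
# staying on one file system. The directory's own total is included in the list.
# Usage: top_dirs <directory> [count]
top_dirs() {
    du -xh --max-depth=1 "$1" | sort -h | tail -n "${2:-10}"
}
```

For example, run top_dirs /transient 5, then run it again on the largest subdirectory it reports until the directory holding the bulk of the data is found.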

For more information on finding large files consuming disk space in QRadar, see Technote 1988496 - QRadar: Finding files that use the most disk space.

   

   

Document Location

Worldwide

Document Information

Modified date:
12 June 2021

UID

ibm10881013