Purging data to resolve a full disk when the GUI is down

Learn how to identify a full disk, and what to do about it.

About this task

Two areas on a Guardium appliance can fill up and cause the GUI to stop:
  • The internal database
  • The filesystem itself (usually the /var partition)
One or both can become full. Usually it is the database that fills up, which then causes the filesystem to fill up, since the database files are held in the /var partition. If either gets to 90% full, the system automatically stops services, including the GUI.
Auto stop services: By default, the appliance stops services, including the GUI and the sniffer, when the database or the filesystem reaches 90% full. An internal 'nanny' process checks the status every 5 minutes and takes action. You can check the current setting in the CLI:
xxx.xxx.xxx.com> show auto_stop_services_when_full
See Configuration and control CLI commands
Important notes for auto stop services:
  • If auto_stop_services_when_full is switched off, the system might fill to 100%, preventing all access to the system.
  • Never set auto_stop_services_when_full to off unless you do so temporarily in the specific circumstance described in the Procedure section.
  • You must stop inspection-core before setting auto_stop_services_when_full to off. This prevents the system from filling any further.
  • If you attempt to restart stopped services before the space issue is resolved, then the services stop again after 5 minutes. The filesystem and database usage keep increasing in that time. Command to restart stopped services:
    restart stopped_services
    Warning: Do not use this command until you are sure that space has been recovered.

Diagnosing the problem

Internal database: As user cli, check whether the internal database is full with this command:
support show db-status used %
If the result is 90% or more, the GUI should be stopped automatically by auto stop services. The database can show more than 100% used. This happens when the database files consume more than the size limit defined on the system (50% of disk space for collectors, 75% for aggregators). It can occur if system services are not stopped when the database reaches 90%, or if they are restarted manually.
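The arithmetic behind a figure above 100% can be sketched as follows. This is only an illustration: it assumes that the reported percentage is the size of the database files divided by the configured limit, and the function name and sample sizes are hypothetical, not part of the product.

```python
def db_used_percent(db_files_gb, disk_gb, appliance_type):
    """Return database usage as a percentage of the configured size limit.

    Assumption: used % = database file size / (disk size * limit fraction).
    """
    # Limits quoted in the text: 50% of disk for collectors, 75% for aggregators.
    limit_fraction = {"collector": 0.50, "aggregator": 0.75}[appliance_type]
    allowed_gb = disk_gb * limit_fraction
    return 100.0 * db_files_gb / allowed_gb


# Hypothetical collector: 100 GB disk, so the limit is 50 GB.
# 55 GB of database files exceeds that limit, so the figure is above 100%.
print(db_used_percent(55, 100, "collector"))
```

Under the same assumption, 55 GB of database files on a 100 GB aggregator would stay below the 90% auto stop level, because the limit there is 75 GB.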
Internal filesystem: To check whether the /var partition (filesystem) is 90% full or more, run a must gather from the CLI:
support must_gather system_db_info
Use fileserver to check the df -k output in the system_output.txt file. You can view the file in fileserver at must_gather/system_logs/system_output.txt, or extract it from the system.<datetime>.tgz file after you download it.
The df output in system_output.txt shows the usage of each partition. In this example, /var is only 65% full:
==========2016-11-30 08:36:09 ... Output of df command:========== 
Filesystem      1024-blocks       Used       Available       Capacity Mounted on 
/dev/sda3          10154020    2272668       7357232             24% / 
/dev/sda2          28571320   17384504       9712052             65% /var 
/dev/sda1            505604      33476        446024              7% /boot 
tmpfs               6169768          0       6169768              0% /dev/shm 
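If you have downloaded system_output.txt to a workstation, a short script can pick out the partitions that are at or above the auto stop level. The following is a minimal sketch in Python, not a Guardium tool: the function names are hypothetical, and it assumes the df rows have the six-column layout shown in the example above.

```python
import re

# df rows copied from the example output above.
SAMPLE = """\
Filesystem      1024-blocks       Used       Available       Capacity Mounted on
/dev/sda3          10154020    2272668       7357232             24% /
/dev/sda2          28571320   17384504       9712052             65% /var
/dev/sda1            505604      33476        446024              7% /boot
tmpfs               6169768          0       6169768              0% /dev/shm
"""

# Columns: device, total blocks, used, available, capacity %, mount point.
# The header row does not match because its second column is not a number.
DF_ROW = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)%\s+(\S+)\s*$")


def parse_df(text):
    """Map each mount point to its capacity percentage."""
    usage = {}
    for line in text.splitlines():
        match = DF_ROW.match(line)
        if match:
            usage[match.group(6)] = int(match.group(5))
    return usage


def partitions_at_or_above(text, threshold=90):
    """Return the mount points whose usage is at or above the threshold."""
    return {m: pct for m, pct in parse_df(text).items() if pct >= threshold}


print(parse_df(SAMPLE)["/var"])
print(partitions_at_or_above(SAMPLE))
```

For the sample output, no partition reaches the default threshold of 90, so the second call returns an empty dictionary.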
Before the database or the filesystem fills to the "auto stop" level, you should receive warnings in the system log (messages file). To check the latest messages file, run the following must_gather command and look inside the compressed file that it creates:
support must_gather system_db_info

Sample messages for filesystem space problems

In this example, the messages file shows that the filesystem is full (the database might also be full):
Nov 23 12:00:13 xxx nanny:[2986]: Nanny is awake.
Nov 23 12:00:13 xxx nanny:[2986]: DB parameters - status 2 db warn level 75 db critical level 90 db auto stop 1.
Nov 23 12:00:13 xxx nanny:[2986]: It is in critical ..Used space on your system is almost full(currently at 93%). Please use CLI command 'show filesystem usage' to see which directories take too much space to target your clean up.
Nov 23 12:00:13 xxx nanny:[2986]: Email has been sent to admin (admin@admin.com) on the out-of-space issue.
Nov 23 12:00:13 xxx nanny:[2986]: Stopping Guardium Services until used space on your system has been cleaned up.
This example shows both the database and the filesystem (/var partition) nearly full, before services are automatically stopped:
Nov 23 14:13:12 xxx nanny:[10070]: TURBINE DB is configured after nap
Nov 23 14:13:12 xxx nanny:[10070]: Nanny is awake.
Nov 23 14:13:12 xxx nanny:[10070]: DB parameters - status 1 db warn level 75 db critical level 90 db auto stop 1.
Nov 23 14:13:12 xxx nanny:[10070]: Used space on your system is filling up (currently at 88%).  Please use CLI command 'show filesystem usage' to see which directories take too much space to target your clean up.
Nov 23 14:13:12 xxx nanny:[10070]: Email has been sent to admin (admin@admin.com) on the out-of-space issue.
Nov 23 14:13:12 xxx nanny:[10070]: A partition is rapidly filling up. Partition /dev/sda2 (/var) on xxx is on 88 percent usage. Doing preventive cleaning.
Nov 23 14:13:13 xxx root: 64 bit big mem 24554360 limit is 12277180
Nov 23 14:13:13 xxx nanny:[15110]: Hunting version 35, every 300, for more than 12277180 kb.
Nov 23 14:13:13 xxx nanny:[15110]: Also checking tomcat.
Nov 23 14:13:13 xxx nanny:[15110]: Nanny set memory limit to 12277180
Nov 23 14:13:13 xxx nanny:[15110]: TURBINE DB Already configured before nap
Nov 23 14:13:13 xxx nanny:[15110]: Going for my initial nap.
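When a messages file is long, the relevant nanny lines can be summarized with a small script on a workstation. The following Python sketch is illustrative only: the function name and the patterns are assumptions derived from the two sample excerpts above, and message wording on other Guardium versions might differ. It reports the status value as logged and does not interpret it.

```python
import re

PARAMS = re.compile(
    r"DB parameters - status (\d+) db warn level (\d+) "
    r"db critical level (\d+) db auto stop (\d+)"
)
USED = re.compile(r"currently at (\d+)%")
PARTITION = re.compile(r"Partition (\S+) \(([^)]+)\) on \S+ is on (\d+) percent usage")


def summarize_nanny(lines):
    """Collect thresholds, usage figures, and stop events from nanny lines."""
    summary = {"status": None, "warn_level": None, "critical_level": None,
               "auto_stop": None, "used_pct": None, "partitions": [],
               "services_stopped": False}
    for line in lines:
        if "nanny:" not in line:
            continue  # skip messages from other processes
        match = PARAMS.search(line)
        if match:
            (summary["status"], summary["warn_level"],
             summary["critical_level"], summary["auto_stop"]) = map(int, match.groups())
        match = USED.search(line)
        if match:
            summary["used_pct"] = int(match.group(1))
        match = PARTITION.search(line)
        if match:
            summary["partitions"].append(
                (match.group(1), match.group(2), int(match.group(3))))
        if "Stopping Guardium Services" in line:
            summary["services_stopped"] = True
    return summary


# Lines taken from the first sample above (long messages abridged).
CRITICAL_SAMPLE = [
    "Nov 23 12:00:13 xxx nanny:[2986]: DB parameters - status 2 db warn level 75 db critical level 90 db auto stop 1.",
    "Nov 23 12:00:13 xxx nanny:[2986]: It is in critical ..Used space on your system is almost full(currently at 93%).",
    "Nov 23 12:00:13 xxx nanny:[2986]: Stopping Guardium Services until used space on your system has been cleaned up.",
]

# Lines taken from the second sample above (long messages abridged).
WARNING_SAMPLE = [
    "Nov 23 14:13:12 xxx nanny:[10070]: DB parameters - status 1 db warn level 75 db critical level 90 db auto stop 1.",
    "Nov 23 14:13:12 xxx nanny:[10070]: Used space on your system is filling up (currently at 88%).",
    "Nov 23 14:13:12 xxx nanny:[10070]: A partition is rapidly filling up. Partition /dev/sda2 (/var) on xxx is on 88 percent usage. Doing preventive cleaning.",
]

print(summarize_nanny(CRITICAL_SAMPLE))
print(summarize_nanny(WARNING_SAMPLE))
```

For the first sample, the summary reports 93% used with services stopped. For the second, it reports 88% used on /dev/sda2 (/var) with services still running.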

Procedure

  1. If the database is 90% or more full but the filesystem is not 90% full yet:

    If auto stop has been triggered, services such as the GUI are stopped, which prevents you from running an emergency purge with the "Run Once Now" purge option. However, purging from the GUI is still the best way to reduce data in an emergency, provided that you follow these steps and considerations.

    1. Make sure that inspection-core is stopped on collectors so that no more data floods into the appliance. Check that no database commands are running except the show processlist query itself. If needed, let any running commands finish before the next step.
      stop inspection-core 
      xxx.xxx.xxx.com> support show db-processlist running
            Id|        User|       Host|       db|Command|Time| State|              Info|
         ------   ----------   ---------   -------   -----   -   ----   ----------------
        141791|  enchantedg|  localhost|  TURBINE|  Query|  0|  init|  show processlist|
      
      Total of running processes: 1
      Total of sleep processes: 44
    2. Run restart gui to regain access to the GUI and perform the "Run Once Now" purge.
      • Before starting the purge, ensure that neither Archive nor Export is selected, so that the system does not first create archive or export files.
      • If the GUI keeps going down every five minutes, consider switching auto_stop_services_when_full to off, only temporarily, so that you can restart the GUI and purge some data. If you only restart the GUI, it might stay running for just 5 minutes, and the nanny process might stop the services again before enough data is purged or before you have had time to start the purge.
      • If auto_stop_services_when_full is switched off, the appliance might go on to fill the system to 100%, preventing you from accessing the system at all. Never set auto_stop_services_when_full to off unless you are using it temporarily in the specific circumstance described here. As soon as you have resolved the space problem, switch it back to on.
    3. Keep checking the DB full percentage and the Aggregation Archive log to know when the purge process is finished.
    4. When the purge is finished, set auto_stop_services_when_full back to on, and then restart the stopped services:
      store auto_stop_services_when_full on 
      restart stopped_services
    5. Data should start to be collected again. Monitor the system carefully.
    6. Investigate the root cause to ensure the problem does not recur.
      • Purging the data does not resolve any root cause of a full database. Check the policy configuration or level of incoming traffic from S-TAPs.
  2. If the database size is fine but the filesystem (/var) is full, unneeded system files might have been left on the appliance. For example:
    • If daily exports or archives are failing, a temporary file might be left on the system for each day.
    • Some old, large patch files might be left in the /var/log/guard/patches directory.
    • The Tomcat service running on the system might be crashing and creating dump files.
    The following CLI commands can be used to identify large files.
      • show filesystem usage: Shows types of files (database, log, gim) and how much space is used by them. Log files usually can be deleted.
      • support show large_files 10 0: Shows files that are larger than 10 MB and older than 0 days. Consider the largest ones for removal first.
    You might need to work with IBM Technical Support to carefully check for large files and consider ones for deletion.
  3. Use either of these two options to delete files:
    • support clean log_files, for example:
      support clean log_files <file to delete, full path>
    • diag-> 4) Perform Maintenance Actions -> 3) Clean Disk Space
      1. Pick the directory where the large files reside.
      2. Carefully enter a filter term to isolate the specific files that will be removed. Work with IBM Support if needed.
      3. Check the list that is returned.
      4. Confirm that you want the files to be removed.