Performance issue: logger queue overflow
The logger part of the sniffer has a non-circular buffer that is held in sniffer memory. As the logger queue grows, the amount of memory used by the sniffer (mem sniffer) also grows. Once memory has been allocated by the sniffer, it is not released until the sniffer restarts. This means that the mem sniffer value increases as the logger queue increases, but never decreases unless the sniffer restarts.
Symptoms
Key columns in the Buffer Usage Monitor report are Logger Rate, Logger Queue, Mem Sniffer, and Sniffer Process ID. When the sniffer restarts, the Sniffer Process ID changes, indicating that a new sniffer process has started.
A high logger queue and a mem sniffer value that reaches its maximum, followed by a change of Sniffer Process ID, indicate a logger queue overflow problem.
Logger queue overflow is not the only possible cause of sniffer restarts. The sniffer can be restarted from the CLI.
You can use Alerting on Logger Queue Overflow to help identify symptoms.
The logger queue is tracked in the Unit Utilization Level and Deployment Health views. A high utilization status for the logger queue in these views indicates a likely logger queue overflow, which can be confirmed by checking the Buffer Usage Monitor report on the individual collector.
To alert directly on a high number of sniffer restarts, see Predefined alerts. By default, the alert is set to send to syslog only; add any receivers that are required and confirm that the alert is active.
Causes
After the sniffer allocates memory, it does not release it even if the logger queue recovers. It is therefore possible to see high sniffer memory usage even when the logger queues are not holding any data. Sniffer restarts caused by logger queue overflow are also recorded in the collector’s syslog file (/var/log/messages). These messages come in two varieties. The first is a sniffer Memory Allocation Problem, which happens when the logger queues grow quickly. The second type of logger queue overflow restart happens when the Guardium “nanny” process, which monitors sniffer memory usage, detects that the sniffer is dangerously close to the limit and restarts it.
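If you have command-line access to the collector's syslog, you can search it for these restart messages. The exact message text varies by Guardium version, so the broad, case-insensitive search below is a sketch and a starting point rather than the definitive message strings:

   grep -iE 'sniffer|nanny' /var/log/messages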
Usually, both types of restarts are caused by the same issues, the only difference being the speed at which the sniffer memory grows. Memory allocation problems happen when the sniffer memory grows quickly, before the nanny process can react. Common causes include:
- Too much traffic or an overly aggressive policy with many heavy rules, such as Log Full Details. Though the solutions for analyzer queue issues can also apply here, it is often sufficient to reduce the number of Log Full Details or policy violation rules in the policy, or to make such rules less inclusive.
- The logger might be competing for MySQL resources if an excessive number of reports, correlation alerts, or other internal processes are running in the background. If your environment includes an Aggregator, consider running daily reports on that appliance instead.
The logger queue is different from the analyzer queue in that it is not circular and continues to allocate memory until the sniffer reaches 33% of the total system memory, by default. This can be configured in the CLI with support store snif_memory_max.
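For example, a minimal CLI sketch for changing this limit (the value shown is assumed to be a percentage of total system memory; confirm the accepted values for your Guardium version before changing it):

   support store snif_memory_max 40

Raising the limit only delays an overflow if the underlying cause is not addressed, so treat it as a mitigation rather than a fix.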
If the logger queues stay high, the maximum memory is eventually reached. At that point the sniffer automatically restarts and the data in the queues is dropped.
Resolving the problem
Reducing the traffic, as for analyzer queue overflow, helps to some extent; however, the amount of data is not the most common cause of logger queue overflow. Reducing the amount of data logged with intensive logging actions in the policy has more impact. Sniffer patches are more likely to resolve specific issues that lead to high logger queues. Decreasing the workload on the internal database also improves the performance of the logger, for example by running Audit Processes on an aggregator where possible. If the logger queue overflow problem is correlated with a specific scheduled job, that job’s impact on database performance is the likely cause. Suggestions for handling logger queue overflow:
- Install the latest sniffer patch from Fix Central on the appliance.
- Reduce the amount of traffic logged with 'Log Full Details' or 'Alert per Match' policy actions. See Configuring your policy to prevent appliance problems for more details.
- Investigate any scheduled jobs that correlate with logger queue overflow.