Unit utilization and Unit utilization details reports

The Unit utilization report provides an enterprise-level view of collector usage. It uses a simple Low, Medium, or High indicator to show which collectors are overused or underused. The analysis is based mostly on Buffer Usage Monitor data, with a few parameters from the Guardium statistics that are collected internally on the collectors. The data is downloaded from the managed collectors. From the Unit utilization report, you can drill down to a collector's Unit utilization details report, which lists many utilization categories and their levels.

In the Unit Utilization report, each managed collector is displayed with its corresponding utilization level. The three utilization levels are: Low, Medium, and High. Right-click any Guardium system and select Unit Utilization Details to open the Unit Utilization Details report. It provides the statistics and levels of the individual parameters that are used to calculate the overall collector utilization in 1-hour increments over a 24-hour period.

The summary is always per hour. It summarizes the data starting from the last time it was extracted for each managed unit, up to the latest whole hour that was reported to the central manager. If data was never extracted for a specific unit, the summary starts 24 hours before the current time, rounded up to the next whole hour. For example, suppose data was never extracted for a specific unit, it is now 14:04 on a Tuesday, and the latest reported record for that unit has a timestamp of Tuesday (today) 13:05. Data is extracted for all whole hours starting from yesterday (Monday) at 15:00, the first whole hour after now minus 24 hours. The latest period that is extracted starts today (Tuesday) at 12:00, because the last whole hour that is fully reported is 12:00 to 13:00 (the last record is from 13:05).
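The window arithmetic in this example can be illustrated with a short Python sketch. This is not Guardium code; the function and parameter names are hypothetical and exist only to make the whole-hour rounding explicit.

    from datetime import datetime, timedelta

    def floor_hour(ts):
        # Round a timestamp down to the start of its hour.
        return ts.replace(minute=0, second=0, microsecond=0)

    def extraction_window(now, latest_reported, last_extracted=None):
        # Latest summarized period: the last whole hour that is fully covered
        # by data reported to the central manager (one hour before the hour
        # of the latest reported record).
        last_period_start = floor_hour(latest_reported) - timedelta(hours=1)

        if last_extracted is None:
            # Data was never extracted: start 24 hours back, at the first whole hour.
            first_period_start = floor_hour(now - timedelta(hours=24)) + timedelta(hours=1)
        else:
            # Otherwise resume at the hour of the last extraction.
            first_period_start = floor_hour(last_extracted)

        return first_period_start, last_period_start

    # Worked example from the text: Tuesday 14:04, data never extracted,
    # latest reported record stamped Tuesday (today) 13:05.
    now = datetime(2024, 1, 9, 14, 4)              # dates are arbitrary placeholders
    latest_reported = datetime(2024, 1, 9, 13, 5)
    first, last = extraction_window(now, latest_reported)
    print(first)  # Monday 15:00
    print(last)   # Tuesday 12:00, that is, the 12:00-13:00 hour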

The Low, Medium, and High utilization levels are defined by a predefined but configurable set of thresholds. The predefined thresholds are adequate in most cases. Threshold 1 defines the level at which a particular parameter goes from Low to Medium. Threshold 2 defines the level at which a particular parameter goes from Medium to High. An overall High utilization level in the Unit utilization details report is not necessarily indicative of an issue. If the Guardium system reaches this level for only one hour in an extended period, it is most likely an isolated event that can be ignored. If you are getting too many false positives, consider modifying the default thresholds; what counts as too many is up to you, for example one per day or one per week. To modify the thresholds, go to Manage > Reports > Unit Utilization > Utilization thresholds.
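As a rough illustration of how the two thresholds split an hourly statistic into Low, Medium, and High, here is a minimal Python sketch. The comparison direction and the example threshold values are assumptions for illustration only, not documented Guardium defaults.

    def utilization_level(value, threshold1, threshold2):
        # threshold1: boundary between Low and Medium.
        # threshold2: boundary at which the parameter is considered High.
        if value >= threshold2:
            return "High"
        if value >= threshold1:
            return "Medium"
        return "Low"

    # Example: a hypothetical number-of-restarts parameter with
    # threshold1 = 1 and threshold2 = 3 (illustrative values only).
    print(utilization_level(0, 1, 3))  # Low
    print(utilization_level(2, 1, 3))  # Medium
    print(utilization_level(5, 1, 3))  # High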

The Deployment Health Dashboard, by default, includes a unit utilization issues pane, which is the same as the Unit Utilization Details report.

The two most important statistics in the Unit utilization report are:
  • Number of flat logs: does not increase in a system that is working correctly.
  • Number of sniffer restarts: the sniffer does not restart in a system that is working correctly.
If either of these statistics increases, data is definitely being dropped; if they go over the threshold even once, investigate the reason. All other statistics can go over their thresholds and come back down without actual data loss.
Each measurement has two columns: the hourly statistic and the level, for example, Number of restarts and Number of restarts level. The hourly statistics are:
Table 1. Unit utilization report. For each parameter, the description is followed by guidance on how to interpret the value.
Hostname: Collector or aggregator hostname or IP.
Period start: Hour at which statistics collection started.
Overall unit utilization level: The highest level from all of the analyzed parameters. If the number of sniffer restarts reaches a level of Medium, for example, but all other parameters are Low, the overall unit utilization level for this period is Medium. (A sketch of this calculation follows the table.)
Number of restarts: Number of times the sniffer restarted. Sniffer restarts indicate that the sniffer is dropping packets; the sniffer does not restart in a system that is working correctly. If Number of restarts goes over the threshold even once, it is of concern. Common causes of sniffer restarts:
  • Crash of the sniffer process.
  • Engine buffers becoming full.
  • Logger queues filling up or the system running out of memory, which can be caused by either of:
    • Too much traffic coming in from the S-TAPs.
    • A high level of traffic being captured by policy rules, for example, Log Full Details.
Sniffer memory: Sniffer memory usage in kB. The value is always greater than 0 when the sniffer is running, and it increases as more data is held in the logger queue. Memory that is allocated to the sniffer is not released until the sniffer restarts.
Percent Mysql memory: The percentage of total system memory that is used by the MySQL database. Provides general background information; this value goes up or down depending on usage of the system, and the exact value is not important unless a problem is identified.
Free buffer space: The percentage of free buffer space for the sniffer process. The sniffer buffer engine is only used in implementations that use SPAN ports, Network TAPs, or S-TAP® PCAP. If the native S-TAP drivers are used, this value usually remains at 100%.
Analyzer queue: Indicates the amount of data that is in the Analyzer/Parser buffer. This value is one of the most direct indicators of sniffer performance. Ideally, the value remains at, or close to, zero. The analyzer queue might grow during temporary periods of high traffic, but it should never remain elevated for more than five or six rows (5 - 6 minutes) in the Buffer Usage Monitor report. The Analyzer/Parser buffer is circular: when the analyzer goes over 80% of queue capacity, it starts to drop data or put it into flat log, depending on the system configuration. For more information, see Flat log process.
Logger queue: The amount of SQL data that is in the logger buffer and waiting to be inserted into the collector’s database. Similar to the analyzer queue, a consistently high amount of data in the logger queue indicates that the appliance is unable to cope with the amount of traffic that is monitored. Temporary spikes in buffered data are normal, provided the buffer is flushed within several minutes.
Mysql disk usage: The current MySQL disk usage (percentage). High or increasing Mysql disk usage means that the appliance might be in danger of reaching or exceeding 90% disk usage, at which point the sniffer automatically stops.
System CPU load: A normalized representation of total system CPU usage, derived from % CPU Sniffer and % CPU Mysql, plus other loads on the CPU. Because it is derived from several measurements, it does not indicate a specific problem; when it is higher than normal, it can point to an underlying problem in many areas.
System var disk usage: The utilization of the /var partition. Most of the files that are generated by the appliance are stored in /var.
Number of requests: Number of SQL requests that were processed during the time period. From the internal Guardium statistics. This value is indicative of ‘normal’ traffic level on the system. The threshold needs to be tuned to the specific environment to be useful.
Number of full SQLs: Number of full SQL records logged. From the internal Guardium statistics. This value is indicative of ‘normal’ traffic level on the system. The threshold needs to be tuned to the specific environment to be useful.
Number of exceptions: Number of exceptions logged. From the internal Guardium statistics. This value is indicative of ‘normal’ traffic level on the system. The threshold needs to be tuned to the specific environment to be useful. For example, a massive spike in exceptions on one collector might indicate an issue.
Number of policy violations: Number of policy violations logged. From the internal Guardium statistics. This value is indicative of ‘normal’ traffic level on the system. The threshold needs to be tuned to the specific environment to be useful. For example, a massive spike in policy violations on one collector might indicate an issue.
Number of flat log requests: Number of requests that were flat logged. Flat log requests indicate that the sniffer is dropping packets, usually because of an analyzer queue overflow that is caused by high traffic. Flat log requests do not increase in a system that is working correctly; if this value goes over the threshold even once, it is a concern. Flat log, when configured, takes the overflow from the buffer, stores it in a flat log, and later feeds it back to the sniffer for full analysis according to the policies. For more information, see Flat log process.
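As noted in the Overall unit utilization level row, the overall level for an hour is simply the highest level among the individual parameter levels. The following minimal Python sketch shows that combination; the parameter names and level values are illustrative only, not the product's internal representation.

    # Rank the levels so that the highest one can be taken directly.
    LEVEL_RANK = {"Low": 0, "Medium": 1, "High": 2}

    def overall_level(parameter_levels):
        # Return the highest level among the per-parameter levels for one hour.
        return max(parameter_levels.values(), key=LEVEL_RANK.get)

    # One hypothetical hourly row: every parameter is Low except sniffer restarts.
    hourly_levels = {
        "Number of restarts": "Medium",
        "Sniffer memory": "Low",
        "Analyzer queue": "Low",
        "Logger queue": "Low",
    }
    print(overall_level(hourly_levels))  # Medium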