Basic set of system performance monitor elements
11 metrics of system performance provide a good basic set to use in an on-going operational monitoring effort.
- Easy to collect - You do not want to use complex or expensive tools for everyday monitoring, and you do not want the act of monitoring to significantly burden the system.
- Easy to understand - You do not want to look up the meaning of the metric each time you see it.
- Relevant to your system - Not all metrics provide meaningful information in all environments.
- Sensitive, but not too sensitive - A change in the metric should indicate a real change in the system; the metric should not fluctuate on its own.
This starter set includes 11 metrics:
- The number of transactions
This provides an excellent base level measurement of system activity.
- Buffer pool hit ratios, measured separately for data, index, XML storage object, and temporary
data:Note: The information that follows discusses buffer pools in environments other than Db2® pureScale® environments. Buffer pools work differently in Db2 pureScale environments. For more information, see Buffer pool monitoring in a Db2 pureScale environment.
Note: The formulas shown for hit ratios for data and index pages exclude any read activity by prefetchers.
- Data pages: ((pool_data_lbp_pages_found - pool_async_data_lbp_pages_found) / (pool_data_l_reads + pool_temp_data_l_reads)) × 100
- Index pages: ((pool_index_lbp_pages_found - pool_async_index_lbp_pages_found ) / (pool_index_l_reads + pool_temp_index_l_reads)) × 100
- XML storage object (XDA) pages: ((pool_xda_lbp_pages_found - pool_async_xda_lbp_pages_found ) / (pool_xda_l_reads + pool_temp_xda_l_reads)) × 100
Although these metrics are generally useful, for systems such as data warehouses that frequently perform large table scans, data hit ratios are often irretrievably low, because data is read into the buffer pool and then not used again before being evicted to make room for other data.
- Buffer pool physical reads and writes per transaction:Note: The information that follows discusses buffer pools in environments other than Db2 pureScale environments. Buffer pools work differently in Db2 pureScale environments. For more information, see Buffer pool monitoring in a Db2 pureScale environment.
These metrics are closely related to buffer pool hit ratios, but have a slightly different purpose. Although you can consider target values for hit ratios, there are no possible targets for reads and writes per transaction. Why bother with these calculations? Because disk I/O is such a major factor in database performance, it is useful to have multiple ways of looking at it. As well, these calculations include writes, whereas hit ratios only deal with reads. Lastly, in isolation, it is difficult to know, for example, whether a 94% index hit ratio is worth trying to improve. If there are only 100 logical index reads per hour, and 94 of them are in the buffer pool, working to keep those last 6 from turning into physical reads is not a good use of time. However, if a 94% index hit ratio were accompanied by a statistic that each transaction did twenty physical reads (which can be further broken down by data and index, regular and temporary), the buffer pool hit ratios might well deserve some investigation.
(POOL_DATA_P_READS + POOL_INDEX_P_READS + POOL_XDA_P_READS POOL_TEMP_DATA_P_READS + POOL_TEMP_INDEX_P_READS) / TOTAL_APP_COMMITS (POOL_DATA_WRITES + POOL_INDEX_WRITES + POOL_XDA_WRITES) / TOTAL_APP_COMMITS
The metrics are not just physical reads and writes, but are normalized per transaction. This trend is followed through many of the metrics. The purpose is to decouple metrics from the length of time data was collected, and from whether the system was very busy or less busy at that time. In general, this helps ensure that similar values for metrics are obtained, regardless of how and when monitoring data is collected. Some amount of consistency in the timing and duration of data collection is a good thing; however, normalization reduces it from being critical to being a good idea.
- The ratio of database rows read to rows selected:
This calculation gives an indication of the average number of rows that are read from database tables to find the rows that qualify. Low numbers are an indication of efficiency in locating data, and generally show that indexes are being used effectively. For example, this number can be very high in the case where the system does many table scans, and millions of rows have to be inspected to determine if they qualify for the result set. Alternatively, this statistic can be very low in the case of access to a table through a fully-qualified unique index. Index-only access plans (where no rows need to be read from the table) do not cause ROWS_READ to increase.
ROWS_READ / ROWS_RETURNED
In an OLTP environment, this metric is generally no higher than 2 or 3, indicating that most access is through indexes instead of table scans. This metric is a simple way to monitor plan stability over time - an unexpected increase is often an indication that an index is no longer being used and should be investigated.
- The amount of time spent sorting per transaction:
This is an efficient way to handle sort statistics, because any extra time due to spilled sorts automatically gets included here. That said, you might also want to collect TOTAL_SORTS and SORT_OVERFLOWS for ease of analysis, especially if your system has a history of sorting issues.
TOTAL_SORT_TIME / TOTAL_APP_COMMITS
- The amount of lock wait time accumulated per thousand transactions:
Excessive lock wait time often translates into poor response time, so it is important to monitor. The value is normalized to one thousand transactions because lock wait time on a single transaction is typically quite low. Scaling up to one thousand transactions provides measurements that are easier to handle.
1000 * LOCK_WAIT_TIME / TOTAL_APP_COMMITS
- The number of deadlocks and lock timeouts per thousand transactions:
Although deadlocks are comparatively rare in most production systems, lock timeouts can be more common. The application usually has to handle them in a similar way: re-executing the transaction from the beginning. Monitoring the rate at which this happens helps avoid the case where many deadlocks or lock timeouts drive significant extra load on the system without the DBA being aware.
1000 * (DEADLOCKS + LOCK_TIMEOUTS) / TOTAL_APP_COMMITS
- The number of dirty steal triggers per thousand transactions:
1000 * POOL_DRTY_PG_STEAL_CLNS / TOTAL_APP_COMMITS
dirty stealis the least preferred way to trigger buffer pool cleaning. Essentially, the processing of an SQL statement that is in need of a new buffer pool page is interrupted while updates on the victim page are written to disk. If dirty steals are allowed to happen frequently, they can have a significant affect on throughput and response time.
- The number of package cache inserts per thousand transactions:
Package cache insertions are part of normal execution of the system; however, in large numbers, they can represent a significant consumer of CPU time. In many well-designed systems, after the system is running at steady-state, very few package cache inserts occur, because the system is using or reusing static SQL or previously prepared dynamic SQL statements. In systems with a high traffic of ad hoc dynamic SQL statements, SQL compilation and package cache inserts are unavoidable. However, this metric is intended to watch for a third type of situation, one in which applications unintentionally cause package cache churn by not reusing prepared statements, or by not using parameter markers in their frequently executed SQL.
1000 * PKG_CACHE_INSERTS / TOTAL_APP_COMMITS
- The time an
agent waits for log records to be flushed to
The transaction log has significant potential to be a system bottleneck, whether due to high levels of activity, or to improper configuration, or other causes. By monitoring log activity, you can detect problems both from the Db2 side (meaning an increase in number of log requests driven by the application) and from the system side (often due to a decrease in log subsystem performance caused by hardware or configuration problems).
LOG_WRITE_TIME / TOTAL_APP_COMMITS
- In partitioned database environments, the number of fast communication
manager (FCM) buffers sent and received between partitions:
These give the rate of flow of data between different partitions in the cluster, and in particular, whether the flow is balanced. Significant differences in the numbers of buffers received from different partitions might indicate a skew in the amount of data that has been hashed to each partition.
Cross-partition monitoring in partitioned database environments
Almost all of the individual monitoring element values mentioned previously are reported on a per-partition basis.
- Logical and physical buffer pool reads for data, indexes, and temporary tables
- Rows read, at the partition level and for large tables
- Sort time and sort overflows
- FCM buffer sends and receives
- CPU and I/O utilization