To better diagnose performance issues, this article discusses a methodical process to help determine if the database is having performance issues and develop remedial actions. When a DB2® or e-business application isn't performing up to expectations, the entire organization and financial bottom line might suffer.
The objective of this article is to guide DBAs in defining and diagnosing performance problems on DB2 on Windows®. The article covers the following:
- Information to collect using Windows tools
- Information to collect using DB2 tools
- Problem investigation and resolution
This article assumes a basic understanding of databases, SQL, and DB2 LUW. A basic understanding of performance tuning and monitoring on UNIX® will help.
Performance problems cover a wide range of scenarios:
- SQL queries performing slower than expected
- Workload or batch job not completing in expected time, or a gradual drop in transaction rate or throughput over a time period
- Overall system slowdown
In most cases, the performance issues occur due to suboptimal use of system resources or over use of resources like CPU, IO, or memory, which often reveals a bottleneck in one of these system resources. In a properly tuned environment, system resources are used optimally without over-reliance on any of them.
The first step in diagnosing a performance issue is to identify any resource bottlenecks. Windows provides tools to help identify these bottlenecks.
If one or more CPUs on the system consistently shows more than 90-percent utilization, it usually means the system has CPU bottlenecks. Task Manager can help figure out if the system is CPU-bottlenecked. Other tools like perfmon.exe and Resource Monitor show CPU utilization, which can also help in identifying CPU bottlenecks.
Memory bottlenecks are not very common, mostly because database heaps and parameters are usually configured based on available memory. But, very low free memory seen in perfmon and Resource Monitor can indicate memory bottlenecks. Sometimes, when STMM is used, the free memory on the system might be very low, but it does not always mean that the system has memory bottleneck.
Very high network utilization as seen in Resource Monitor can indicate a network bottleneck. The Resource Monitor shows the network utilization in percentage used, which helps identify network bottlenecks quickly. If Resource Monitor shows network utilization above 80 percent, it usually indicates network bottleneck.
If one or more disks on the system are consistently busy more than 90 percent of the time, or the disk queue length shows high numbers constantly, it usually means that the system has I/O disk bottlenecks. Windows tools like Resource Monitor and perfmon can help identify I/O bottlenecks. Task Manager does show I/O activity, but Resource Monitor and perfmon show I/O details per disk and additionally show the percentage of active time, which helps identify whether there is a bottleneck on any particular disk.
There are various Windows tools to help determine whether the system has one or more resource bottlenecks.
Task Manager is a quickest way to get information about overall system usage. For example, Figure 2 is a screenshot of the Processes tab of Task Manager with columns providing CPU, memory, and I/O statistics per process (View > Select Columns). Task Manager gives a good summary of CPU, I/O, memory, and network utilization. Task Manager also provides process details, helping figure out which processes are consuming most of the CPU, which processes are doing the most I/O, etc.
If Task Manager shows that overall CPU utilization is more than 90 percent consistently, it is an indication of CPU bottleneck. Task Manager also shows individual CPU activity in the Performance tab. If any one of the CPUs is utilized close to 100 percent consistently, it could mean a CPU bottleneck as well. Usually, this means the workload on the database is single-threaded and unable to exploit all the available CPUs on the system. Figure 1 is an example of single-threaded application running. Even though only one CPU is busy and others are not utilized, it is still a CPU bottleneck.
Task Manager also shows details on the amount of data read/written from the disk by each process. This in itself is useful, but it doesn't show the percentage of utilization of individual disks. This makes it difficult to conclude whether the system is suffering from an I/O bottleneck from Task Manager alone.
Figure 1. Task Manager — Example of single-threaded workload
Figure 2. Task Manager showing processes with maximum memory utilization
Resource Monitor is another Windows tool available on Windows 2008 and Windows 7. It provides detailed information about I/O, CPU, memory, and network usage. The tool displays real-time information about all processes running on your system and provides the ability to filter data based on user requirements. This can be done based on memory, CPU, disk, and network. The Overview tab shows overall system activity and gives a snapshot of the bottlenecks in the system. I/O and network usage are shown in percentage utilization compared to available bandwidth. This helps identify whether the system has an I/O or network bottleneck, which is not possible to find out from Task Manager. Additionally, Resource Monitor shows the disk queue length for each disk, which is useful in determining if the disk has enough bandwidth to address the system's I/O demand.
Figure 3. Resource Monitor
Figure 4. Resource Monitor DB2 disk activity
Although both Task Manager and Resource Monitor are good tools to find out system
activity, they cannot be used to record system activities in a log for later
analysis. The perfmon tool can record system activity in a log file.
This provides flexibility to administrators and DBAs who can collect perfmon data
at various times of the day and use it for later analysis. The perfmon
tool, shipped with Windows, is used to capture performance data and statistics on
resource usage. Understanding how to set up and capture a perfmon log is crucial to
many types of problem investigation. Note for monitoring I/O: disk counters need to
be enabled by running
diskperf -y(-ye for stripe sets), followed by
a reboot. To run perfmon and capture the activity in a log file on Windows 2008 or
- Run perfmon from a command prompt.
- Select Performance Monitor from the left frame.
- Right-click on it and select New > Data Collector set. Create a suitable name and click Next.
- Provide the directory name where the logs will be saved.
- The Data Collector set appears in the left frame. Select Data Collector Set > User Defined in the left frame and select the name you chose in step 4. The status of it should be stopped because we want to add the required counter before collection begins.
- Right-click the Data collector set defined and select
New > Data Collector.
Provide a name and select Performance counter data
collector, then click Next.
Select the sampling frequency and add the performance counters. The
perfmon tool provides
lot of counters to monitor various parameters, but the most useful are mentioned
below. This is a good starting point for collecting data. Based on the
requirement and situation, users can collect and monitor additional counters.
Resource name Monitoring counters Comments Processor Idle Time, Processor time, User time Reported in percentage with respect to total time Memory Available Mbytes, Page Fault/sec, Pages/sec
Available Mbytes — The free memory available on the system.
Pages/sec is the rate at which pages are read from or written to disk to resolve hard page faults. This counter is a primary indicator of the kinds of faults that cause system-wide delays.
Page Fault/sec is the average number of pages faulted per second. Using both the Pages/sec and Page Fault/sec gives an indication on the paging activity on the system.
Logical disk %Disk Time, Avg. Disk Bytes/Read, Avg. Disk Bytes/Write, Avg. Disk queue length, Avg. Disk sec/Read, Avg. Disk sec/Write These counters give an indication of how busy the disks are and provides the average read/write response time. Network interface Bytes Received/sec, Bytes Sent/sec, Current bandwidth
Current bandwidth is an estimate of the current bandwidth of the network interface in bits per second (BPS).
Together with bytes received and sent, it is easy to find out how busy the network interface is at a given time.
- Once we are done with the selection of performance counters as shown in the
figure below, we can collect the data for the required time interval. This can be changed using the properties
section of the chosen data collector set. The data collected for the chosen
counters can be saved in a table or spreadsheet format. The format of the
output can be specified in the properties section under the
Figure 5. perfmon performance counters
- Once you are ready for the data collection, start your workload or query and immediately start the data collection by clicking Start on the chosen data collector set.
- After completion of the query/workload, stop the monitoring data collection and review the collected data.
A good tool for normal monitoring is perfmon. Its logs can be stored, so it's easier to compare system activity when the system was behaving as expected and when the system has performance issues. This can often give valuable clues to the problem at hand. However, at times, a quick look at the data from Task Manager and Resource Monitor might help pinpoint the bottleneck in the system in real time. Once the bottleneck is identified, steps can be taken to remove the bottlenecks.
If perfmon shows the disk time more than 80 percent for one or more disks, or Resource Monitor shows active time more than 80 percent on one or more disks, it usually means the system has an I/O bottleneck. From the perfmon or Resource Monitor, figure out the disk(s) that have high utilization. Once the disk(s) with heavy usage are identified, find out what is placed on the disk(s).
Are any DB2 tablespace containers placed on the disk(s)?
db2 list tablespace containers for <tbsid>
Repeat the command for all tablespaces in the database.
Or are DB2 log files placed on the heavily used disk(s)?
db2 get db cfg for <dbname>
Search for the newlogpath database configuration parameter.
Or do these disks contain utility files like backup destination or load files? Look at the backup/load commands executed. Depending on what lies on the heavily used disk(s), the solution will vary.
If the heavily used disk(s) are assigned to tablespace containers, find out the objects in the tablespace. If the tablespace corresponds to a data tablespace, find out the tables created in the tablespace.
db2 select tabname from syscat.tables where tbspaceid = <hot tablespace id>
MON_GET_TABLE table function, find out the hottest table. The following query will
list the tables with most rows read.
Listing 1. Most active tables in a tablespace
db2 "select varchar(tabschema,20) as tabschema, varchar(tabname,20) as tabname, table_scans, sum(rows_read) as total_rows_read, sum(rows_inserted) as total_rows_inserted, sum(rows_updated) as total_rows_updated, sum(rows_deleted) as total_rows_deleted FROM TABLE(MON_GET_TABLE('','',-2)) AS t WHERE TBSP_ID = hot tablespace id GROUP BY tabschema, tabname ORDER BY total_rows_read DESC"
provides lot of useful information. It tracks the
number of table scans on the table. If the table scans is high, it could
mean the table doesn't have proper indices or the queries are not using the existing
indices on the table.
Once the hot tables are identified, it's easy to find out the SQL statements executed on the
table using the
Listing 2. Finding queries on a given table
db2 "select section_type, executable_id, package_name,num_executions, char(stmt_text, 100) from table (MON_GET_PKG_CACHE_STMT ( 'D', NULL, NULL, -2)) as T where stmt_text like '%hot table name%' "
This way, it's easy to figure out the SQL statements causing the high reads/writes on the hot tables. Use Design Advisor to find out if SQL statements are using proper indices on the table. If the table has proper indices, but they are not used, check whether the statistics on the table are up to date. Incorrect or stale statistics can cause the optimizer to choose a sub-optimal access plan.
If the heavily used disk(s) are assigned to temporary tablespace, this is an
indication of lot of sort activity on the database. High levels of temp tablespace
I/O activity tend to go hand in hand with large queries, with large result
sets, or lots of sorting. In these cases, determine whether lots of sorts are spilling to disk.
MON_GET_PKG_CACHE_STMT can give SQL statements
that are causing sort overflows.
Listing 3. Queries with high sort activity
db2 "select section_type, package_name,num_executions, total_sorts, sort_overflows, char(stmt_text, 100) from table (MON_GET_PKG_CACHE_STMT ( 'D', NULL, NULL, -2)) as T where sort_overflows > 1 order by sort_overflows desc"
If there are too many sort overflows, check if
parameters are set to low values. If the sort related parameters are set properly,
determine whether it is possible to avoid large sorts by creating indices. Use Design Advisor
for sort heavy queries and see if the sort can be avoided or reduced by creating
Transaction logs performance is very sensitive in OLTP environment. The
MON_GET_TRANSACTION_LOG table function provides the detailed activity on the
Listing 4. Transaction log activity
db2 "select log_reads, log_read_time, log_write, log_write_time, num_log_buffer_full, num_log_data_found_in_buffer from table(MON_GET_TRANSACTION_LOG(-1)) as T "
num_log_buffer_full shows the number of times the log buffer was full and had to be
flushed to disk before new logs can be written to the buffer. If this number is
increasing over a period of time, it means the log buffer is small for the workload on
the database. Increasing the
parameter value will help improve log performance and reduce the I/O on the transaction log disks.
num_log_data_found_in_buffer shows the number of times an agent reads log data from
the buffer. Reading log data from the buffer is preferable to reading from the disk
because the latter is slower. Use this element in conjunction with the
num_log_read_io element to determine if the
LOGBUFSZ database configuration
parameter needs to be increased further.
If perfmon or Resource Monitor shows one or more CPU's usage is more than 90 percent, it usually means that the system has CPU bottlenecks. As with I/O bottlenecks, the first step is to identify the database operations that consume high CPU. Typically, there are few database operations known to consume lot of CPU:
- Statement compilation
- Utilities like LOAD, BACKUP, runstats etc
- Heavy sort activity
To find out if lot of CPU is being spent in query compilation, query the
MON_GET_WORKLOAD table function.
Listing 5. CPU time spent in different activities
db2 "select varchar(workload_name,30) as workload_name, sum(total_cpu_time),sum(total_compile_proc_time), sum(act_rqsts_total), sum(total_compilations), sum(total_act_time), sum(pkg_cache_inserts), sum(pkg_cache_lookups) from TABLE(MON_GET_WORKLOAD('',-2)) as T group by workload_name"
compile_proc_time is more than 5-10 percent
pkg_cache_inserts/pkg_cache_lookups is more than 4-5 percent, the database is spending too much time in
statement compilation. It could be because of queries getting compiled
repeatedly, or the package cache is too small and the queries have to be evicted to
make room for new queries. If the application uses string literals instead of
parameter markers, the SQL statements get compiled repeatedly. In DB2 9.7 and 10.1,
DB2 provides a database configuration parameter
STMT_CONC (statement concentrator).
When enabled, statement concentrator modifies dynamic statements to allow increased
sharing of package cache entries, reducing the statement compilation.
To find out if any utilities are causing high CPU utilization, query the
MON_GET_WORKLOAD table function The metrics
returned represent the accumulation of all metrics for requests submitted
by connections mapped to the identified workload object. Metrics are rolled up to a
workload on unit of work boundaries and periodically, during the execution of
requests. Values reported by this table function reflect the current
state of the system at the time of the most recent roll-up. Metrics are strictly
increasing in value. To determine time spent in utilities like
an interval of time, use
MON_GET_WORKLOAD to query the metric at
the start and end of the interval and compute the difference.
Listing 6. CPU time spent in
db2 "select varchar(workload_name,30) as workload_name, sum(total_loads), sum(total_load_proc_time), sum(total_runstats), sum(total_runstats_proc_time) from TABLE(MON_GET_WORKLOAD('',-2)) as T group by workload_name"
LOAD are designed to take
advantage of available resources to maximize performance. If these utilities are taking more CPU than
estimated, the utilities can be throttled by setting the
configuration parameter. Another database operation that takes significant CPU is
sort. As described in the I/O bottleneck section, find out the queries doing most of
the sort activities and use Design Advisor to figure out if the sorts can be avoided
by creating additional indices.
Memory bottlenecks are not very common, mostly because database heaps and parameters can be set based on available memory. Most DB2 heaps are automatic and values arrived based on available memory. STMM does a good job of utilizing the available memory and allocating memory to heaps that require it the most. But, when not using STMM, there could be some cases where, if the memory is over-committed (i.e., allocated more than available), it can result in high paging activity. If perfmon or Resource Monitor shows lot of paging activity, it usually is because the memory for the different heaps has been allocated over the actual memory. In such cases, it is better to turn on STMM and let DB2 tune the memory for bufferpools, sort heaps, and other heaps.
Network bottlenecks typically arise from high volumes of data being moved
around, like very large result sets, client load, etc. or applications manipulating
LOBs in client-server architecture. The
MON_DB_SUMMARY administrative view gives a
good indication of time spent waiting on different resources.
NETWORK_WAIT_TIME_PERCENT field gives the time in percentage spent waiting for
network response. Usually, the time spent waiting for network should be less than
1 percent. If it is more than a few percentage points, and perfmon and
Resource Monitor show heavy usage of network bandwidth, the system could be suffering from network
bottlenecks. In such cases, the application can move some application logic
to server in the form of stored procedure or UDFs. In case of client LOAD, breaking down
the LOAD in smaller parts and executing it at various times instead of loading it at
once will reduce the network traffic.
If the system does not have any resource bottlenecks and the performance is still
bad, it could be because of locking issues. The
LOCK_WAIT_TIME_ PERCENT field in
MON_DB_SUMMARY administrative view gives a
high-level overview of the time spent in
lock waits at database level. To get a detailed view of the workloads that spent
time on lock waits, query the
monitor table function.
Listing 7. Time spent in lock waits
db2 "SELECT varchar(workload_name,30) as workload_name, sum(lock_wait_time) as total_lock_wait_time, sum(lock_waits) as total_lock_waits, sum(lock_timeouts) as total_lock_timeouts, sum(lock_escals) as total_lock_escals FROM TABLE(MON_GET_WORKLOAD('',-2)) AS t GROUP BY workload_name ORDER BY total_lock_wait_time DESC"
Apart from looking at the system resource bottlenecks and locking issues, there are
a couple of important things to worry about in any database environment. Page cleaning
and prefetching are two important activities that need to be tuned properly for
optimal performance. In some cases, if page cleaning is not tuned properly, it might
show up as I/O bottleneck. The monitoring table
MON_GET_BUFFERPOOL provides metrics
to find out the page cleaning and prefetch activity.
Listing 8. Page cleaning activity
db2 "WITH BPMETRICS AS ( SELECT bp_name, pool_data_writes, pool_async_data_writes, pool_index_writes, pool_async_index_writes, pool_no_victim_buffer, pool_lsn_gap_clns, pool_drty_pg_steal_clns, pool_drty_pg_thrsh_clns FROM TABLE(MON_GET_BUFFERPOOL('',-2)) AS METRICS) SELECT VARCHAR(bp_name,20) AS bp_name, pool_data_writes, pool_async_data_writes, CASE WHEN pool_data_writes > 0 THEN DEC((FLOAT(pool_async_data_writes) / FLOAT(pool_data_writes)) * 100,5,2) ELSE NULL END AS PAGE_CLN_RATIO, pool_index_writes, pool_async_index_writes, CASE WHEN pool_index_writes > 0 THEN DEC((FLOAT(pool_async_index_writes) / FLOAT(pool_index_writes)) * 100,5,2) ELSE NULL END AS IND_CLN_RATIO, pool_no_victim_buffer, pool_lsn_gap_clns, pool_drty_pg_steal_clns, pool_drty_pg_thrsh_clns FROM BPMETRICS"
The page cleaning ratio for Data and Index from the above query should be close to 100 percent. If the values are less than 90 percent, page cleaning is not able to keep up with growth of dirty pages in the system.
dirty_page_steal_clns shows the dirty page steals. Ideally, this has to be
zero. If it is more than a fraction of number of logical reads, we need to clean
more. Decrease the
CHG_PGS_THRESHOLD and ensure
that there are enough I/O cleaners available.
pool_no_victim_buffer shows the number of times an agent could not find a free
page in the free page list. If this number is more than few percentage
points of logical
reads, it is harmful. Decrease the
CHG_PGS_THRESHOLD and ensure that there are
enough I/O cleaners available.
Similar to the page cleaning, we need to tune the prefetch activity as well. In a
true OLTP environment, prefetching may not be useful. But in a DSS kind of workload,
prefetching plays an important role. Ideally, we want the
to do all the reading, which is asynchronous in nature. The following query shows the
percentage of I/O reads done by
Listing 9. Prefetch activity
db2 "WITH BPMETRICS AS ( SELECT bp_name, pool_data_p_reads, pool_async_data_reads, pool_temp_data_p_reads, pool_index_p_reads, pool_async_index_reads FROM TABLE(MON_GET_BUFFERPOOL('',-2)) AS METRICS) SELECT VARCHAR(bp_name,20) AS bp_name, pool_data_p_reads, pool_async_data_reads, CASE WHEN pool_data_p_reads > 0 THEN DEC((FLOAT(pool_async_data_reads) / FLOAT(pool_data_p_reads + POOL_TEMP_DATA_P_READS)) * 100,5,2) ELSE NULL END AS PREFETCH_RATIO, pool_index_p_reads, pool_async_index_reads, CASE WHEN pool_index_p_reads > 0 THEN DEC((FLOAT(pool_async_index_reads) / FLOAT(pool_index_p_reads)) * 100,5,2) ELSE NULL END AS PREFETCH_IDX_RATIO FROM BPMETRICS"
A value of more than 90 percent is a good value for
Although the steps do not cover all possible performance issues likely to occur, but above approach focuses on using principle and strategy of solving performance problems. Following the steps will definitely help you narrow down the issue and eventually help you solve the issue.
- "Informix JDBC Driver Guide" is a step-by-step guide to learn JDBC driver
- In the Informix 11.70.Information center, get the resources you need to advance
your Informix skills.
- Read our IBM Informix
Developer's Handbook to get up to speed quickly on IBM Informix product
- Learn more about Information Management at the developerWorks Information Management
zone. Find technical documentation,
how-to articles, education, downloads, product information, and
- Stay current with
developerWorks technical events and webcasts.
- Follow developerWorks on
Get products and technologies
- Build your next
development project with
IBM trial software,
available for download directly from developerWorks.
- Now you can use
DB2 for free. Download DB2 Express-C, a no-charge
version of DB2 Express Edition for the community that offers the same core
data features as DB2 Express Edition and provides a solid base to build
and deploy applications.
- Participate in the discussion forum.
- Check out the
blogs and get involved in the