Tune DB2 database on Windows in eight simple steps

Performance issues can be difficult to diagnose and fix. Often issues crop up when least expected. Usually, they gradually develop and sometimes can cause outage of the application using the database. It is important from the DBA's point of view to be aware of a proactive approach to addressing performance even before it starts hurting the business. This article discusses an analytical process that will help DBAs to get to the root of the issue in a few simple steps.

Sandeep Sethia (sasethia@in.ibm.com), Staff Software Engineer, IBM

Sandeep SethiaSandeep Sethia is working as a staff software engineer with IBM India Software Lab. He joined IBM in 2006 and is currently working with the DB2 Performance team.



Vaibhav Kulkarni (vaibhaku@in.ibm.com), Advisory Software Engineer, IBM

Vaibhav KulkarniVaibhav Kulkarni is an advisory software engineer at IBM's India Software Labs. His area of expertise is DB2 performance. He is working with the DB2 Performance team.



13 September 2012

Also available in Chinese Portuguese

Overview

To better diagnose performance issues, this article discusses a methodical process to help determine if the database is having performance issues and develop remedial actions. When a DB2® or e-business application isn't performing up to expectations, the entire organization and financial bottom line might suffer.

The objective of this article is to guide DBAs in defining and diagnosing performance problems on DB2 on Windows®. The article covers the following:

  • Information to collect using Windows tools
  • Information to collect using DB2 tools
  • Problem investigation and resolution

Assumption

This article assumes a basic understanding of databases, SQL, and DB2 LUW. A basic understanding of performance tuning and monitoring on UNIX® will help.


Performance problems and correlation with system resources

Performance problems cover a wide range of scenarios:

  • SQL queries performing slower than expected
  • Workload or batch job not completing in expected time, or a gradual drop in transaction rate or throughput over a time period
  • Overall system slowdown

In most cases, the performance issues occur due to suboptimal use of system resources or over use of resources like CPU, IO, or memory, which often reveals a bottleneck in one of these system resources. In a properly tuned environment, system resources are used optimally without over-reliance on any of them.

The first step in diagnosing a performance issue is to identify any resource bottlenecks. Windows provides tools to help identify these bottlenecks.

CPU bottleneck

If one or more CPUs on the system consistently shows more than 90-percent utilization, it usually means the system has CPU bottlenecks. Task Manager can help figure out if the system is CPU-bottlenecked. Other tools like perfmon.exe and Resource Monitor show CPU utilization, which can also help in identifying CPU bottlenecks.

Memory bottleneck

Memory bottlenecks are not very common, mostly because database heaps and parameters are usually configured based on available memory. But, very low free memory seen in perfmon and Resource Monitor can indicate memory bottlenecks. Sometimes, when STMM is used, the free memory on the system might be very low, but it does not always mean that the system has memory bottleneck.

Network bottleneck

Very high network utilization as seen in Resource Monitor can indicate a network bottleneck. The Resource Monitor shows the network utilization in percentage used, which helps identify network bottlenecks quickly. If Resource Monitor shows network utilization above 80 percent, it usually indicates network bottleneck.

I/O bottleneck

If one or more disks on the system are consistently busy more than 90 percent of the time, or the disk queue length shows high numbers constantly, it usually means that the system has I/O disk bottlenecks. Windows tools like Resource Monitor and perfmon can help identify I/O bottlenecks. Task Manager does show I/O activity, but Resource Monitor and perfmon show I/O details per disk and additionally show the percentage of active time, which helps identify whether there is a bottleneck on any particular disk.

There are various Windows tools to help determine whether the system has one or more resource bottlenecks.


Step 1: Identifying bottlenecks using Window tools

Task Manager

Task Manager is a quickest way to get information about overall system usage. For example, Figure 2 is a screenshot of the Processes tab of Task Manager with columns providing CPU, memory, and I/O statistics per process (View > Select Columns). Task Manager gives a good summary of CPU, I/O, memory, and network utilization. Task Manager also provides process details, helping figure out which processes are consuming most of the CPU, which processes are doing the most I/O, etc.

If Task Manager shows that overall CPU utilization is more than 90 percent consistently, it is an indication of CPU bottleneck. Task Manager also shows individual CPU activity in the Performance tab. If any one of the CPUs is utilized close to 100 percent consistently, it could mean a CPU bottleneck as well. Usually, this means the workload on the database is single-threaded and unable to exploit all the available CPUs on the system. Figure 1 is an example of single-threaded application running. Even though only one CPU is busy and others are not utilized, it is still a CPU bottleneck.

Task Manager also shows details on the amount of data read/written from the disk by each process. This in itself is useful, but it doesn't show the percentage of utilization of individual disks. This makes it difficult to conclude whether the system is suffering from an I/O bottleneck from Task Manager alone.

Figure 1. Task Manager — Example of single-threaded workload
Image shows Task Manager example of single-threaded workload
Figure 2. Task Manager showing processes with maximum memory utilization
Image shows Task Manager processes sorted by memory utilized

Resource Monitor

Resource Monitor is another Windows tool available on Windows 2008 and Windows 7. It provides detailed information about I/O, CPU, memory, and network usage. The tool displays real-time information about all processes running on your system and provides the ability to filter data based on user requirements. This can be done based on memory, CPU, disk, and network. The Overview tab shows overall system activity and gives a snapshot of the bottlenecks in the system. I/O and network usage are shown in percentage utilization compared to available bandwidth. This helps identify whether the system has an I/O or network bottleneck, which is not possible to find out from Task Manager. Additionally, Resource Monitor shows the disk queue length for each disk, which is useful in determining if the disk has enough bandwidth to address the system's I/O demand.

Figure 3. Resource Monitor
Image shows that Resource Monitor shows overall system activity
Figure 4. Resource Monitor DB2 disk activity
Image shows that Resource Monitor shows disk activity by DB2 processes

perfmon

Although both Task Manager and Resource Monitor are good tools to find out system activity, they cannot be used to record system activities in a log for later analysis. The perfmon tool can record system activity in a log file. This provides flexibility to administrators and DBAs who can collect perfmon data at various times of the day and use it for later analysis. The perfmon tool, shipped with Windows, is used to capture performance data and statistics on resource usage. Understanding how to set up and capture a perfmon log is crucial to many types of problem investigation. Note for monitoring I/O: disk counters need to be enabled by running diskperf -y(-ye for stripe sets), followed by a reboot. To run perfmon and capture the activity in a log file on Windows 2008 or Windows 7:

  1. Run perfmon from a command prompt.
  2. Select Performance Monitor from the left frame.
  3. Right-click on it and select New > Data Collector set. Create a suitable name and click Next.
  4. Provide the directory name where the logs will be saved.
  5. The Data Collector set appears in the left frame. Select Data Collector Set > User Defined in the left frame and select the name you chose in step 4. The status of it should be stopped because we want to add the required counter before collection begins.
  6. Right-click the Data collector set defined and select New > Data Collector. Provide a name and select Performance counter data collector, then click Next. Select the sampling frequency and add the performance counters. The perfmon tool provides lot of counters to monitor various parameters, but the most useful are mentioned below. This is a good starting point for collecting data. Based on the requirement and situation, users can collect and monitor additional counters.
    Resource nameMonitoring countersComments
    ProcessorIdle Time, Processor time, User timeReported in percentage with respect to total time
    MemoryAvailable Mbytes, Page Fault/sec, Pages/sec

    Available Mbytes — The free memory available on the system.

    Pages/sec is the rate at which pages are read from or written to disk to resolve hard page faults. This counter is a primary indicator of the kinds of faults that cause system-wide delays.

    Page Fault/sec is the average number of pages faulted per second. Using both the Pages/sec and Page Fault/sec gives an indication on the paging activity on the system.

    Logical disk %Disk Time, Avg. Disk Bytes/Read, Avg. Disk Bytes/Write, Avg. Disk queue length, Avg. Disk sec/Read, Avg. Disk sec/Write These counters give an indication of how busy the disks are and provides the average read/write response time.
    Network interface Bytes Received/sec, Bytes Sent/sec, Current bandwidth

    Current bandwidth is an estimate of the current bandwidth of the network interface in bits per second (BPS).

    Together with bytes received and sent, it is easy to find out how busy the network interface is at a given time.

  7. Once we are done with the selection of performance counters as shown in the figure below, we can collect the data for the required time interval. This can be changed using the properties section of the chosen data collector set. The data collected for the chosen counters can be saved in a table or spreadsheet format. The format of the output can be specified in the properties section under the File tab.
    Figure 5. perfmon performance counters
    Image shows perfmon performance counters
  8. Once you are ready for the data collection, start your workload or query and immediately start the data collection by clicking Start on the chosen data collector set.
  9. After completion of the query/workload, stop the monitoring data collection and review the collected data.

Choosing the right diagnostic tool

A good tool for normal monitoring is perfmon. Its logs can be stored, so it's easier to compare system activity when the system was behaving as expected and when the system has performance issues. This can often give valuable clues to the problem at hand. However, at times, a quick look at the data from Task Manager and Resource Monitor might help pinpoint the bottleneck in the system in real time. Once the bottleneck is identified, steps can be taken to remove the bottlenecks.


Step 2: I/O bottlenecks — Detailed look

If perfmon shows the disk time more than 80 percent for one or more disks, or Resource Monitor shows active time more than 80 percent on one or more disks, it usually means the system has an I/O bottleneck. From the perfmon or Resource Monitor, figure out the disk(s) that have high utilization. Once the disk(s) with heavy usage are identified, find out what is placed on the disk(s).

Are any DB2 tablespace containers placed on the disk(s)?

db2 list tablespace containers for <tbsid>

Repeat the command for all tablespaces in the database.

Or are DB2 log files placed on the heavily used disk(s)?

db2 get db cfg for <dbname>

Search for the newlogpath database configuration parameter.

Or do these disks contain utility files like backup destination or load files? Look at the backup/load commands executed. Depending on what lies on the heavily used disk(s), the solution will vary.

Disk bottleneck on tablespace containers

If the heavily used disk(s) are assigned to tablespace containers, find out the objects in the tablespace. If the tablespace corresponds to a data tablespace, find out the tables created in the tablespace.

db2 select tabname from syscat.tables where tbspaceid = <hot tablespace id>

From the MON_GET_TABLE table function, find out the hottest table. The following query will list the tables with most rows read.

Listing 1. Most active tables in a tablespace
db2 "select varchar(tabschema,20) as tabschema, varchar(tabname,20) as tabname,
     table_scans, sum(rows_read) as total_rows_read, 
     sum(rows_inserted) as total_rows_inserted, 
     sum(rows_updated) as total_rows_updated, 
     sum(rows_deleted) as total_rows_deleted 
FROM TABLE(MON_GET_TABLE('','',-2)) AS t 
WHERE TBSP_ID = hot tablespace id
GROUP BY tabschema, tabname 
ORDER BY total_rows_read DESC"

NOTE: MON_GET_TABLE function provides lot of useful information. It tracks the number of table scans on the table. If the table scans is high, it could mean the table doesn't have proper indices or the queries are not using the existing indices on the table.

Once the hot tables are identified, it's easy to find out the SQL statements executed on the table using the MON_GET_PKG_CACHE_STMT table function.

Listing 2. Finding queries on a given table
db2 "select section_type, executable_id, package_name,num_executions,
char(stmt_text, 100) 
from table (MON_GET_PKG_CACHE_STMT ( 'D', NULL, NULL, -2)) as T
where stmt_text like '%hot table name%' "

This way, it's easy to figure out the SQL statements causing the high reads/writes on the hot tables. Use Design Advisor to find out if SQL statements are using proper indices on the table. If the table has proper indices, but they are not used, check whether the statistics on the table are up to date. Incorrect or stale statistics can cause the optimizer to choose a sub-optimal access plan.

Disk bottleneck on temp tablespace

If the heavily used disk(s) are assigned to temporary tablespace, this is an indication of lot of sort activity on the database. High levels of temp tablespace I/O activity tend to go hand in hand with large queries, with large result sets, or lots of sorting. In these cases, determine whether lots of sorts are spilling to disk. MON_GET_PKG_CACHE_STMT can give SQL statements that are causing sort overflows.

Listing 3. Queries with high sort activity
db2 "select section_type, package_name,num_executions, total_sorts, sort_overflows,
char(stmt_text, 100) 
from table (MON_GET_PKG_CACHE_STMT ( 'D', NULL, NULL, -2)) as T
where sort_overflows > 1 
order by sort_overflows desc"

If there are too many sort overflows, check if SORTHEAP or SHEAPTHRES configuration parameters are set to low values. If the sort related parameters are set properly, determine whether it is possible to avoid large sorts by creating indices. Use Design Advisor for sort heavy queries and see if the sort can be avoided or reduced by creating additional indices.

Disk bottleneck on transaction logs

Transaction logs performance is very sensitive in OLTP environment. The MON_GET_TRANSACTION_LOG table function provides the detailed activity on the transaction logs.

Listing 4. Transaction log activity
db2 "select log_reads, log_read_time, log_write, log_write_time, num_log_buffer_full,
num_log_data_found_in_buffer 
from table(MON_GET_TRANSACTION_LOG(-1)) as T "

num_log_buffer_full shows the number of times the log buffer was full and had to be flushed to disk before new logs can be written to the buffer. If this number is increasing over a period of time, it means the log buffer is small for the workload on the database. Increasing the LOGBUFSZ database parameter value will help improve log performance and reduce the I/O on the transaction log disks.

num_log_data_found_in_buffer shows the number of times an agent reads log data from the buffer. Reading log data from the buffer is preferable to reading from the disk because the latter is slower. Use this element in conjunction with the num_log_read_io element to determine if the LOGBUFSZ database configuration parameter needs to be increased further.


Step 3: CPU bottlenecks — detailed look

If perfmon or Resource Monitor shows one or more CPU's usage is more than 90 percent, it usually means that the system has CPU bottlenecks. As with I/O bottlenecks, the first step is to identify the database operations that consume high CPU. Typically, there are few database operations known to consume lot of CPU:

  • Statement compilation
  • Utilities like LOAD, BACKUP, runstats etc
  • Heavy sort activity

To find out if lot of CPU is being spent in query compilation, query the MON_GET_WORKLOAD table function.

Listing 5. CPU time spent in different activities
db2 "select varchar(workload_name,30) as workload_name,
sum(total_cpu_time),sum(total_compile_proc_time),
sum(act_rqsts_total), sum(total_compilations),
sum(total_act_time), sum(pkg_cache_inserts), 
sum(pkg_cache_lookups) 
from TABLE(MON_GET_WORKLOAD('',-2)) as T
group by workload_name"

If compile_proc_time is more than 5-10 percent of the total_cpu_time and pkg_cache_inserts/pkg_cache_lookups is more than 4-5 percent, the database is spending too much time in statement compilation. It could be because of queries getting compiled repeatedly, or the package cache is too small and the queries have to be evicted to make room for new queries. If the application uses string literals instead of parameter markers, the SQL statements get compiled repeatedly. In DB2 9.7 and 10.1, DB2 provides a database configuration parameter STMT_CONC (statement concentrator). When enabled, statement concentrator modifies dynamic statements to allow increased sharing of package cache entries, reducing the statement compilation.

To find out if any utilities are causing high CPU utilization, query the MON_GET_WORKLOAD table function The metrics returned represent the accumulation of all metrics for requests submitted by connections mapped to the identified workload object. Metrics are rolled up to a workload on unit of work boundaries and periodically, during the execution of requests. Values reported by this table function reflect the current state of the system at the time of the most recent roll-up. Metrics are strictly increasing in value. To determine time spent in utilities like LOAD and reorg for an interval of time, use MON_GET_WORKLOAD to query the metric at the start and end of the interval and compute the difference.

Listing 6. CPU time spent in LOAD and runstat utilities
db2 "select varchar(workload_name,30) as workload_name,
sum(total_loads), sum(total_load_proc_time),
sum(total_runstats), sum(total_runstats_proc_time)
from TABLE(MON_GET_WORKLOAD('',-2)) as T 
group by workload_name"

Utilities like LOAD are designed to take advantage of available resources to maximize performance. If these utilities are taking more CPU than estimated, the utilities can be throttled by setting the util_impact_limit database configuration parameter. Another database operation that takes significant CPU is sort. As described in the I/O bottleneck section, find out the queries doing most of the sort activities and use Design Advisor to figure out if the sorts can be avoided by creating additional indices.


Step 4: Memory bottlenecks

Memory bottlenecks are not very common, mostly because database heaps and parameters can be set based on available memory. Most DB2 heaps are automatic and values arrived based on available memory. STMM does a good job of utilizing the available memory and allocating memory to heaps that require it the most. But, when not using STMM, there could be some cases where, if the memory is over-committed (i.e., allocated more than available), it can result in high paging activity. If perfmon or Resource Monitor shows lot of paging activity, it usually is because the memory for the different heaps has been allocated over the actual memory. In such cases, it is better to turn on STMM and let DB2 tune the memory for bufferpools, sort heaps, and other heaps.


Step 5: Network bottleneck

Network bottlenecks typically arise from high volumes of data being moved around, like very large result sets, client load, etc. or applications manipulating LOBs in client-server architecture. The MON_DB_SUMMARY administrative view gives a good indication of time spent waiting on different resources. NETWORK_WAIT_TIME_PERCENT field gives the time in percentage spent waiting for network response. Usually, the time spent waiting for network should be less than 1 percent. If it is more than a few percentage points, and perfmon and Resource Monitor show heavy usage of network bandwidth, the system could be suffering from network bottlenecks. In such cases, the application can move some application logic to server in the form of stored procedure or UDFs. In case of client LOAD, breaking down the LOAD in smaller parts and executing it at various times instead of loading it at once will reduce the network traffic.


Step 6: Locking issues

If the system does not have any resource bottlenecks and the performance is still bad, it could be because of locking issues. The LOCK_WAIT_TIME_ PERCENT field in MON_DB_SUMMARY administrative view gives a high-level overview of the time spent in lock waits at database level. To get a detailed view of the workloads that spent time on lock waits, query the MON_GET_WORKLOAD monitor table function.

Listing 7. Time spent in lock waits
db2 "SELECT varchar(workload_name,30) as workload_name,
sum(lock_wait_time) as total_lock_wait_time,
sum(lock_waits) as total_lock_waits,
sum(lock_timeouts) as total_lock_timeouts, 
sum(lock_escals) as total_lock_escals 
FROM TABLE(MON_GET_WORKLOAD('',-2)) AS t
GROUP BY workload_name 
ORDER BY total_lock_wait_time DESC"

Step 7: Tuning page cleaning activity

Apart from looking at the system resource bottlenecks and locking issues, there are a couple of important things to worry about in any database environment. Page cleaning and prefetching are two important activities that need to be tuned properly for optimal performance. In some cases, if page cleaning is not tuned properly, it might show up as I/O bottleneck. The monitoring table MON_GET_BUFFERPOOL provides metrics to find out the page cleaning and prefetch activity.

Listing 8. Page cleaning activity
db2 "WITH BPMETRICS AS (
SELECT bp_name,
pool_data_writes, pool_async_data_writes, 
pool_index_writes, pool_async_index_writes,
pool_no_victim_buffer, pool_lsn_gap_clns,
pool_drty_pg_steal_clns, pool_drty_pg_thrsh_clns
FROM TABLE(MON_GET_BUFFERPOOL('',-2)) AS METRICS)
SELECT
   VARCHAR(bp_name,20) AS bp_name,
   pool_data_writes, pool_async_data_writes,
  CASE WHEN pool_data_writes > 0
  THEN DEC((FLOAT(pool_async_data_writes) / FLOAT(pool_data_writes)) * 100,5,2)
    ELSE NULL
    END AS PAGE_CLN_RATIO,
     pool_index_writes, pool_async_index_writes,
  CASE WHEN pool_index_writes > 0
  THEN DEC((FLOAT(pool_async_index_writes) / FLOAT(pool_index_writes)) * 100,5,2)
    ELSE NULL
    END AS IND_CLN_RATIO,
      pool_no_victim_buffer, pool_lsn_gap_clns,
      pool_drty_pg_steal_clns, pool_drty_pg_thrsh_clns
    FROM BPMETRICS"

The page cleaning ratio for Data and Index from the above query should be close to 100 percent. If the values are less than 90 percent, page cleaning is not able to keep up with growth of dirty pages in the system.

dirty_page_steal_clns shows the dirty page steals. Ideally, this has to be zero. If it is more than a fraction of number of logical reads, we need to clean more. Decrease the CHG_PGS_THRESHOLD and ensure that there are enough I/O cleaners available.

pool_no_victim_buffer shows the number of times an agent could not find a free page in the free page list. If this number is more than few percentage points of logical reads, it is harmful. Decrease the CHG_PGS_THRESHOLD and ensure that there are enough I/O cleaners available.


Step 8: Tuning prefetch activity

Similar to the page cleaning, we need to tune the prefetch activity as well. In a true OLTP environment, prefetching may not be useful. But in a DSS kind of workload, prefetching plays an important role. Ideally, we want the IO_SERVERS (prefetchers) to do all the reading, which is asynchronous in nature. The following query shows the percentage of I/O reads done by IO_SERVERS.

Listing 9. Prefetch activity
db2 "WITH BPMETRICS AS (
    SELECT bp_name,
      pool_data_p_reads, pool_async_data_reads,
      pool_temp_data_p_reads,
      pool_index_p_reads, pool_async_index_reads
    FROM TABLE(MON_GET_BUFFERPOOL('',-2)) AS METRICS)
    SELECT
      VARCHAR(bp_name,20) AS bp_name,
      pool_data_p_reads, pool_async_data_reads,
    CASE WHEN pool_data_p_reads > 0
    THEN DEC((FLOAT(pool_async_data_reads) / FLOAT(pool_data_p_reads + 
                POOL_TEMP_DATA_P_READS)) * 100,5,2)
    ELSE NULL
    END AS PREFETCH_RATIO,
      pool_index_p_reads, pool_async_index_reads,
    CASE WHEN pool_index_p_reads > 0
    THEN DEC((FLOAT(pool_async_index_reads) / FLOAT(pool_index_p_reads)) * 100,5,2)
    ELSE NULL
    END AS PREFETCH_IDX_RATIO
FROM BPMETRICS"

A value of more than 90 percent is a good value for PREFETCH_RATIO.


Summary

Although the steps do not cover all possible performance issues likely to occur, but above approach focuses on using principle and strategy of solving performance problems. Following the steps will definitely help you narrow down the issue and eventually help you solve the issue.

Resources

Learn

Get products and technologies

  • Build your next development project with IBM trial software, available for download directly from developerWorks.
  • Now you can use DB2 for free. Download DB2 Express-C, a no-charge version of DB2 Express Edition for the community that offers the same core data features as DB2 Express Edition and provides a solid base to build and deploy applications.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=834366
ArticleTitle=Tune DB2 database on Windows in eight simple steps
publish-date=09132012