Performance tuning IBM Tivoli Directory Server
In this series we're exploring techniques and tools to use when troubleshooting poor performance for IBM Tivoli Directory Server. In this article we're focusing on the underlying DB2 database and ways to identify and correct bottlenecks therein.
The IBM Tivoli Directory Server DB2 database
IBM Tivoli Directory Server uses IBM DB2 to store and access the data stored in the directory. Roughly, TDS translates LDAP queries into DB2 queries to access the data in the underlying datastore. DB2 is the one that executes the query and returns the requested data to the TDS process (ibmslapd) who responds to the initial request with the results.
In many cases DB2 can be considered a black box. The database is created, tuned, and managed by TDS processes and tools. Most TDS deployments require little, if any, DB2 knowledge to administer. There are the rare cases, however, when an LDAP administrator is unable to improve the performance of slow queries using the conventional TDS interfaces and there is a need to delve down into DB2 to resolve the problem. This article addresses some of the common scenarios when this investigation of the DB2 layer is necessary.
The remainder of this article assumes you have some experience with databases, ideally DB2, and are comfortable collecting DB2-level information in your environment.
DB2 monitoring and snapshots
DB2 supports monitoring of several different areas:
- buffer pools
- units of work
To troubleshoot TDS performance problems at the DB2 level it is strongly suggested that you enable monitoring for all areas except tables. It has been our experience that having table monitoring enabled can have a non-negligible performance impact. Enabling monitoring in the other areas does not have a noticeable performance impact and these monitors can, and should, be left enabled in production environments.
Information from these monitors can be obtained through the collection of database snapshots.
Enabling, disabling, and resetting monitoring
DB2 monitoring can be enabled at the connection or database level. If monitors are enabled at the connection level, any metrics gathered are cleared and gathering stops when the connection is terminated. Enabling monitors at the database level has the benefit of providing metrics beginning immediately after the database is started without requiring an active connection.
To get the current monitor states of the database, connect to the database as the database administrator and run the following command:
$ db2 get monitor switches Monitor Recording Switches Switch list for db partition number 0 Buffer Pool Activity Information (BUFFERPOOL) = OFF Lock Information (LOCK) = OFF Sorting Information (SORT) = OFF SQL Statement Information (STATEMENT) = OFF Table Activity Information (TABLE) = OFF Take Timestamp Information (TIMESTAMP) = ON 07/28/2008 09:46:44.988974 Unit of Work Information (UOW) = OFF
The monitor states above are the default for new DB2 databases. Notice that the only monitor flag enabled by default is the TIMESTAMP monitor. For TDS monitoring, we need all of the monitors except the TABLE monitor enabled.
To enable connection-level monitoring, use the
db2 update monitor switches command. Remember that with connection-level monitoring metrics are only gathered the connection is closed. Examples of how to enable connection-level monitoring are:
db2 update monitor switches using BUFFERPOOL on db2 update monitor switches using LOCK on db2 update monitor switches using SORT on db2 update monitor switches using STATEMENT on db2 update monitor switches using TIMESTAMP on db2 update monitor switches using UOW on
Note that we left the TABLE monitor turned off for performance reasons.
For TDS it is recommended that you enable monitoring at the database level. This ensures that monitoring metrics are gathered when the database is started. This makes it easier to spot performance problems as they occur instead of needing to enable connection-level monitoring and recreate the problem. Database-level monitoring is controlled by the DFT_MON_* database manager parameters:
$ db2 get dbm cfg | grep DFT_MON Buffer pool (DFT_MON_BUFPOOL) = OFF Lock (DFT_MON_LOCK) = OFF Sort (DFT_MON_SORT) = OFF Statement (DFT_MON_STMT) = OFF Table (DFT_MON_TABLE) = OFF Timestamp (DFT_MON_TIMESTAMP) = ON Unit of work (DFT_MON_UOW) = OFF
To enable database-level monitoring, update the desired DFT_MON_ switches:
db2 update dbm cfg using DFT_MON_BUFPOOL on db2 update dbm cfg using DFT_MON_LOCK on db2 update dbm cfg using DFT_MON_SORT on db2 update dbm cfg using DFT_MON_STMT on db2 update dbm cfg using DFT_MON_TIMESTAMP on db2 update dbm cfg using DFT_MON_UOW on
Note that we left the DFT_MON_TABLE monitor turned off for performance reasons. There's no need to enable DFT_MON_TIMESTAMP as it is enabled by default, but it is included here for completeness.
For the database-level monitoring to take effect the database must be stopped and restarted using
db2start. Before executing db2stop, the ibmslapd process must be stopped along with any other connected applications. After executing db2start the ibmslapd process must be restarted. Neglecting to restart DB2 will result in monitoring not being enabled which can be frustrating if a problem arises later for which you were hoping to have some metrics. This happens often enough that it bears repeating: be sure to restart DB2 after enabling database-level monitoring.
If you want to disable database-level monitoring, run the following commands as the database administrator:
db2 update dbm cfg using DFT_MON_STMT off db2 update dbm cfg using DFT_MON_BUFPOOL off db2 update dbm cfg using DFT_MON_LOCK off db2 update dbm cfg using DFT_MON_SORT off db2 update dbm cfg using DFT_MON_TABLE off db2 update dbm cfg using DFT_MON_UOW off
Note that we left DFT_MON_TIMESTAMP enabled as this is enabled by default. There should be no need to disable DFT_MON_TABLE as we didn't enable it, but it is included here for completeness.
After disabling monitoring, the database must be stopped and restarted, using
db2start, before the changes are picked up.
After monitoring is enabled, either at the connection or the database level, the collected values can be reset. This can be useful if you are wanting to pull performance metrics on a specific activity within the system. Resetting the database monitors always happens at a connection level, regardless if the monitoring was enabled at the connection or database level. For example, consider the case where database-level monitoring is enabled and the monitors are reset. Any subsequent snapshots for that connection would show only the metrics gathered since the reset but other connections to the database will still see all of the metrics gathered since the database was restarted. The only way to reset the metrics at the database level is to stop and restart DB2.
Monitor switches can be reset either for a specific database (if there is more than one database in the instance) or for all databases at once. If there is only one database in the instance, the two are equivalent:
# Individual database db2 reset monitor for database database_name # All databases db2 reset monitor for all
To pull the monitoring data out of the database after it has been enabled, use snapshots. Snapshots extract monitoring data at a point in time into a text file. Snapshots can be pulled for the following components:
- buffer pools
- dynamic SQL
- table spaces
You can collect snapshots on these components individually or all at once using the following commands as the database administrator:
# Individually db2 get snapshot for applications on database_name db2 get snapshot for database on database_name db2 get snapshot for dynamic sql on database_name db2 get snapshot for bufferpools on database_name db2 get snapshot for locks on database_name db2 get snapshot for tables on database_name db2 get snapshot for tablespaces on database_name # All at once db2 get snapshot for all on database_name
Usually the output of these commands are redirected to a file for processing. You can then use a script or viewer to read and analyze the data.
Here is the top section of a database snapshot to give you an idea of what one looks like:
Database Snapshot Database name = LDAPDB2 Database path = /home/ldapdb2db/ldapdb2/NODE0000/SQL00001/ Input database alias = LDAPDB2 Database status = Active Catalog database partition number = 0 Catalog network node name = Operating system running at database server= AIX 64BIT Location of the database = Local First database connect timestamp = 04/24/2008 19:02:16.542927 Last reset timestamp = Last backup timestamp = 04/25/2008 01:45:01.000000 Snapshot timestamp = 04/25/2008 16:12:25.914501
The snapshot includes some useful information about the monitoring itself such as when the first connection to the database occurred (effectively when monitoring began if database-level monitoring is enabled), the time the monitoring was last reset, and when the snapshot was taken. This lets us know that the metrics included in the snapshot was taken between the time of the last reset (or first database connect if empty) and the time the snapshot was taken.
Each snapshot provides a different set of information that can help in identifying performance problems. This section digs deeper into the snapshots and how they can be useful.
All scripts mentioned in this section can be found within the IBM Tivoli Directory Server performance scripts under Download.
A database snapshot provides some generally useful information about the state of the database. Of particular interest are such things like:
- buffer pool aggregate summary information
- number of transactions
- number of current and total connections
- number of locks held currently
- number of deadlocks detected
- number of lock escalations
- number of sorts and sort time
- number of package cache inserts, lookups, and overflows
Most of the data of interest provided by the database snapshot is easily accessible with a text editor. To make locating possible trouble areas easier a script, perfanalyze_database.pl, was created to process a database snapshot. The script also does some further analysis on the aggregate buffer pool summary and makes some buffer pool sizing recommendations, if necessary.
$ perfanalyze_database.pl -i database.snapshot Bufferpools: Bufferpool: ALL Hit ratios (the higher the better) Data: 53.453% Index: 99.918% Total: 99.678% Sync write ratios (the lower the better) Data: 2.542% ( 6 syncs, 230 asyncs) Index: 1.873% ( 5 syncs, 262 asyncs) Total: 2.187% ( 11 syncs, 492 asyncs) Direct reads: 46 Direct writes: 0 Database files closed: 482 NOTICE: Database files closed should be = 0 Increase MAXFILOPs until this counter stops growing. see http://www.db2mag.com/db_area/archives/2001/q1/hayes.shtml Connections: Total: 29 Current: 25 High Water Mark: 32 Transactions: Total: 313 Sorts: Total: 3 Average Time per Sort (ms): 0 Average Sorts per Transaction: 0.00958466453674121 Percentage Overflows: 0% Locks: Current: 36 Total: 0 Average Time per Lock (ms): 0 Average Lock per Transaction: 0 Deadlocks: 0 Escalations: 0 Package cache: Lookups: 7334 Inserts: 5 Overflows: 0
In the example above, the script is recommending that we increase the MAXFILOP database parameter to reduce the number of files DB2 is opening and closing. The number of sorts reported in the system is low, only 3. If this number was very large and the average sorts per transaction was very high, you should confirm that the calling application is not using server-side sorting unnecessarily as this can impact performance. These would show up in the dynamic SQL snapshot processing which we'll discuss later.
Note that the buffer pool output is reporting the aggregate information for all buffer pools in the database. More precise buffer pool recommendations can be seen by processing a buffer pool snapshot, which we'll cover next.
Buffer pool snapshots
Because the logic for processing buffer pool snapshots closely mirrors that for processing the buffer pool aggregate information in a database snapshot, both snapshots are processed by the perfanalyze_database.pl script.
perfanalyze_database.pl example for buffer pool snapshots
$ perfanalyze_database.pl -i bufferpool.snapshot Bufferpool: IBMDEFAULTBP Current size: 366211 pages Hit ratios (the higher the better) Data: 53.595% Index: 99.919% Total: 99.686% Sync write ratios (the lower the better) Data: 2.632% ( 6 syncs, 222 asyncs) Index: 1.880% ( 5 syncs, 261 asyncs) Total: 2.227% ( 11 syncs, 483 asyncs) Direct reads: 46 Direct writes: 0 Database files closed: 475 NOTICE: Database files closed should be = 0 Increase MAXFILOPs until this counter stops growing. see http://www.db2mag.com/db_area/archives/2001/q1/hayes.shtml Bufferpool: LDAPBP Current size: 15259 pages Hit ratios (the higher the better) Data: 48.675% Index: 99.654% Total: 95.748% Sync write ratios (the lower the better) Data: 0.000% ( 0 syncs, 8 asyncs) Index: 0.000% ( 0 syncs, 1 asyncs) Total: 0.000% ( 0 syncs, 9 asyncs) Direct reads: 0 Direct writes: 0 Database files closed: 7 NOTICE: Database files closed should be = 0 Increase MAXFILOPs until this counter stops growing. see http://www.db2mag.com/db_area/archives/2001/q1/hayes.shtml
The buffer pool snapshot analysis is most useful for determining if the buffer pools are sized adequately. The primary metric in evaluating this goal is the total buffer pool hit ratio. This metric highlights how frequently DB2 was able to find what it was looking for in a buffer pool rather than loading the data from the disk. The higher the buffer pool hit ratio, the fewer trips to the disk are necessary and the better the performance due to reduced I/O. If the total buffer pool hit ratio is below 90%, the script will recommend that you increase the size of the buffer pools.
While a good metric to use in improving DB2 performance, a high buffer pool hit ratio doesn't determine if DB2 is performing efficiently. With a large enough buffer pool, DB2 may be able to cache an entire table in memory and thus provide decent performance against that table despite there being a missing index. Adding the missing index would further improve performance and remove the need to cache the entire table. Don't assume that a high buffer pool hit ratio means that the system is tuned optimally.
Dynamic SQL snapshots
Dynamic SQL snapshots are one of the most useful pieces of information when tuning TDS. This snapshot provides information about the queries performed against the database and how long they took to execute. Because TDS uses parameter markers in the SQL there is not a one-to-one correlation between the LDAP query sent to TDS and the query from TDS to DB2. This results in the dynamic SQL snapshot having a single SQL query for many LDAP queries. Each SQL query will include the total time it took to complete and the number of executions. For a better picture of which queries are taking longer to complete than others, it is necessary to estimate a per-query execution time.
Here's a sample query from a dynamic SQL snapshot:
Number of executions = 319 Number of compilations = 1 Worst preparation time (ms) = 13 Best preparation time (ms) = 13 Internal rows deleted = 0 Internal rows inserted = 0 Rows read = 0 Internal rows updated = 0 Rows written = 0 Statement sorts = 0 Statement sort overflows = 0 Total sort time = 0 Buffer pool data logical reads = 0 Buffer pool data physical reads = 0 Buffer pool temporary data logical reads = 0 Buffer pool temporary data physical reads = 0 Buffer pool index logical reads = 9623904 Buffer pool index physical reads = 0 Buffer pool temporary index logical reads = 0 Buffer pool temporary index physical reads = 0 Total execution time (sec.ms) = 264.341145 Total user cpu time (sec.ms) = 263.680000 Total system cpu time (sec.ms) = 0.220000 Statement text = SELECT distinct E.EID FROM LDAPDB2.LDAP_ENTRY AS E, LDAPDB2.LDAP_ENTRY as pchild WHERE E.EID=pchild.EID AND pchild.PEID=? AND E.EID IN ((SELE CT EID FROM LDAPDB2.ERPARENT WHERE ERPARENT = ?) INTERSECT (SELECT EID FROM LDAPDB2.ERJAV ASCRIPT) INTERSECT (SELECT EID FROM LDAPDB2.OBJECTCLASS WHERE OBJECTCLASS = ?)) FOR FETC H ONLY
This query took 264 seconds to execute 319 times, or 0.828 seconds per execution. To avoid doing this calculation for each query in the snapshot, the perfanalyze_dynamicsql.pl script will do it for you.
If run without parameters, perfanalyze_dynamicsql.pl will show all queries that took more than 0.1 seconds to execute and will truncate the SQL at 80 characters. The
-t parameter can be used to control the truncation. In the example above the truncation was disabled by setting it to zero. The
-c parameter can be used to limit which queries are shown by setting a cutoff threshold in seconds. The default threshold is 0.1 seconds and any queries that complete faster than that are not shown. To show all queries use
In the above example one query is taking 19 seconds to complete on average. More investigation will be needed on this query to determine why that is. We'll tackle that point later.
Table space snapshots
The table space snapshot can provide some information about the configuration and current state of each table space in the system. The perfanalyze_tablespaces.pl script can be used to identify common problem areas for table spaces such as a prefetch/extent size mismatch or best practices on automatic DMS resizing and file system caching.
Here's a sample of the perfanalyze_tablespace.pl output for just the TDS tablespaces from a non-standard installation:
$ perfanalyze_tablespaces.pl -i tablespace.snapshot Tablespace: LDAPSPACE (id: 4) Tablespace Type: Database managed space Auto-resize enabled: No NOTICE: You may want to enable auto-resize for this tablespace to reduce administrative overhead although care should be taken to ensure the file system doesn't run out of disk space. File system caching: Yes NOTICE: You may want to disable file system caching for this tablespace if your bufferpool hit ratio is very high to take advantage of Direct I/O (DIO) or Concurrent I/O (CIO) on supported platforms. Total pages: 800000 Free pages: 249440 (31.180%) Extent size (pages): 32 Prefetch size (pages): 576 (18 extents) Tablespace: USERSPACE1 (id: 2) Tablespace Type: System managed space File system caching: Yes NOTICE: You may want to disable file system caching for this tablespace if your bufferpool hit ratio is very high to take advantage of Direct I/O (DIO) or Concurrent I/O (CIO) on supported platforms. Extent size (pages): 256 Prefetch size (pages): 256 (1 extents)
Putting it together: analyzing a problem
So far the information discussed has been about gathering information and doing some rough analysis on it. Lets put it all together and show how to troubleshoot a TDS problem at the database level.
Finding slow queries
The first place to start when troubleshooting a TDS problem from the DB2 side is to find slow queries. Much like the perfanalyze_audit.pl script does for the audit log (see Part 1) the perfanalyze_dynamicsql.pl script is the place to start in finding slow DB2 queries. Lets revisit the dynamic SQL analysis discussed previously:
The place to start digging is the query that takes the longest, or the 19-second query in this case. It is worth noting that although the query took 19 seconds to complete, it was only ever seen twice. If the performance problem being investigated (you must be investigating some specific issue or you wouldn't be digging in the DB2 snapshots) is thought to be related, then continue with the analysis. If, however, it doesn't appear related to your problem skip it and look elsewhere. There's no point in spending hours optimizing a query that is executed very infrequently and isn't manifesting itself as a problem somewhere.
Identifying missing indexes
Long query times are often the result of missing indexes. Finding missing indexes involves doing a comparison between all attributes that TDS knows are indexed via its schema and the columns being queried against by DB2. To do this we use the perfanalyze_indexes.pl script introduced in Part 1. It not only works with TDS audit.log files but tries its best with dynamic SQL snapshots as well.
perfanalyze_indexes.pl example for dynamic SQL snapshots
It appears that several of the attributes used in the long-running queries are unindexed. The perfanalyze_indexes.pl script can generate an LDIF for use in indexing these attributes using the
-l parameter. See Part 1 for more detailed information on how to use the perfanalyze_indexes.pl script.
This is a good time to discuss an important point: let TDS manage the attribute indexes, don't create them directly within DB2. While it is true that updating the TDS schema to index an attribute will, in the end, result in a new DB2 index, it is important to let TDS manage the indexes. Bypassing TDS and creating attribute indexes in DB2 directly can cause several problems:
- Creating indexes directly in DB2 will result in the TDS schema and the underlying database being out of sync. This may not show up until subsequent schema updates and it can lead to problems in future data migrations as often the LDAP schema is migrated and relied on to accurately reflect the state of the underlying database. If the indexes aren't in in the TDS schema, they will not be created resulting in poor performance.
- Depending on the attribute type and size, TDS may create more than just one index. If TDS determines that the attribute may be searched against using wildcard prefix searches (ie: attribute=*value) it may create an index not just on the ATTRIBUTE column but also on RATTRIBUTE as well. Similarly if the attribute is large enough, TDS will create and manage the index against the truncated attribute column (ATTRIBUTE_TRUNC) instead.
- Indexing done through TDS schema updates will be propagated to other TDS nodes in the replication topology if schema replication has been configured. This simplifies administration as the updates are pushed out to the other servers automatically ensuring index consistency between all peer nodes.
Generating explain plans
If no attributes of a slow-performing query show up as being unindexed, its time to determine how DB2 is fulfilling the query. This is done by generating an explain plan which will show the indexes and tables DB2 will use to satisfy the SQL query and in what order. Generating an explain plan requires that the explain tables be set up in the database. This is a one-time activity and is done by connecting to the database the problem query is running against and loading the EXPLAIN.DDL file located in the
db2_instance_home/sqllib/misc directory on Unix systems and
db2_install_directory/misc directory on Windows systems. The file is loaded with the command
db2 -tvf explain_ddl_path/EXPLAIN.DDL.
After the explain tables have been loaded, we need to create the query file. This file (query.sql in our examples) includes a single SQL we want to explain. The query should be all on one line (the following example is wrapped):
SELECT DISTINCT D.DEID FROM LDAPDB2.LDAP_DESC AS D WHERE D.AEID=? AND D.DEID IN ((SELECT EID FROM LDAPDB2.CN WHERE CN_T = ?) INTERSECT (SELECT EID FROM LDAPDB2.SN WHERE SN = ?)) FOR FETCH ONLY
We create an explain plan with the following commands:
db2 connect to database_name db2 set current explain mode explain db2 -f query.sql db2 commit work db2 connect reset db2 terminate db2exfmt -d database_name -g TIC -e instance_name -n % -s % \ -w -1 -# 0 -o query.sql.exfmt
The query.sql.exfmt file will contain the explain plan. To facilitate generating explain plans, the explainSQL.sh and explainSQL.bat scripts included with the TDS tuning scripts do all of the above for you.
The following is the Access Plan from an explain of the above query:
Access Plan: ----------- Total Cost: 90.1134 Query Degree: 1 Rows RETURN ( 1) Cost I/O | 1.0833e-11 UNIQUE ( 2) 90.1134 7 | 1.0833e-11 NLJOIN ( 3) 90.1133 7 /--------+-------\ 8.1273e-07 1.33292e-05 MSJOIN IXSCAN ( 4) ( 8) 77.2173 38.6004 6 3 /---+---\ | 3.96673 2.04887e-07 2.01315e+07 IXSCAN FILTER INDEX: LDAPDB2 ( 5) ( 6) SN 38.6055 38.6088 3 3 | | 2.01316e+07 5.45757 INDEX: LDAPDB2 IXSCAN CN ( 7) 38.6088 3 | 1.45373e+08 INDEX: LDAPDB2 LDAP_DESC_DEID
Based on this, first DB2 join rows from LDAP_DESC_DEID and CN indexes and then join the rows with the SN index. The number above each "INDEX: LDAPDB2" entry shows how many rows DB2 thinks are in that table based on the last runstats. In this case the LDAP_DESC_DEID index has ~100 million rows.
The LDAP_DESC table is used to create the hierarchical tree structure. Each LDAP entry has one row in this table for each ancestor in the tree all the way up to the suffix. If there are many objects under a branch of a tree, a subtree search against this tree can perform poorly. In the case above, if the values queried against CN and SN are selective, ie: few CN and SN records will match, DB2 would be better suited to first join the CN and SN index results and then the LDAP_DESC_DEID results. To force DB2 to do this we set the cardinality, or table size, for LDAP_DESC to 9E18. This changes the query plan to this:
Access Plan: ----------- Total Cost: 110.068 Query Degree: 1 Rows RETURN ( 1) Cost I/O | 5.28733e-05 UNIQUE ( 2) 110.068 7.50666 | 5.28733e-05 NLJOIN ( 3) 110.068 7.50666 /--------+-------\ 5.28733e-05 14249.4 MSJOIN IXSCAN ( 4) ( 8) 84.0942 58.4036 6.50666 3 /---+---\ | 3.96673 1.33292e-05 9e+18 IXSCAN FILTER INDEX: LDAPDB2 ( 5) ( 6) LDAP_DESC_DEID 38.6055 45.4103 3 3.50666 | | 2.01316e+07 268.337 INDEX: LDAPDB2 IXSCAN CN ( 7) 45.4103 3.50666 | 2.01315e+07 INDEX: LDAPDB2 SN
The above version will perform better for large subtree searches although it may perform slower for small subtree searches if LDAP_DESC will be more selective. Artificial cardinality tunings are a valuable, although large scope, tuning technique for TDS. TDS sets some artificial cardinalities on specific tables at start-up, as does the idsrunstats utility. If runstats is performed outside of the idsrunstats utility, including automatic runstats in DB2 v9 and later, it is important to take steps to ensure the artificial cardinalities are re-applied or poor performance my result. Note that other TDS clients, such as IBM Tivoli Access Manager and IBM Tivoli Identity Manager, include additional artificial cardinality tunings to ensure good performance for those products. Consult the product documentation and tuning guides for those products for more information.
Troubleshooting TDS performance at the database level is more involved than doing so at the audit log level and requires more experience with DB2. However, there are times it becomes necessary to tune the underlying DB2 datastore to ensure good performance and the results of further tuning can be well worth it.
|TDS audit scripts used in this article||tds_perf_scripts.tar.gz||38KB|
- Troubleshooting IBM Tivoli Directory Server performance, Part 1: Resolving slow queries using the TDS audit log
- IBM Tivoli Directory Server product documentation
- ITDS Performance Tuning and Capacity Planning Guide
- IBM DB2 V8 product documentation
- IBM DB2 V9 product documentation
- IBM DB2 V9.5 product documentation
- IBM Redbooks: Understanding LDAP - Design and Implementation
- Redpaper: Performance Tuning for IBM Tivoli Directory Server