In my daily work, I get escalations from customers whose backups have suddenly started to run slow. Recently I was working on such a case: the customer was a SAP shop, they had moved some tables out of a tablespace, and all of a sudden their backups started taking about 3x their normal time.
When debugging a backup issue, the best data to collect is generally db2trc, as it captures everything for all the threads involved in the backup and tells us exactly what is happening.
Generally, for any backup the following threads are created:
db2agent -- The main co-ordinator agent
db2bm -- Buffer Manipulator threads, which read tablespace data into buffers; once a buffer is full, they hand it over to the db2med thread
db2med -- Media thread, which takes a full buffer and writes it to the backup destination, wherever that is.
So, when a backup is slow, the bottleneck can be in one of two places:
1) db2bm is taking time to read the data into the buffer, OR
2) db2med is taking time to write a full buffer to the destination, which makes db2bm wait for an empty buffer.
Read and write throughput can be easily checked using the iostat tool.
An easy way to identify all the threads associated with the backup is to do the following:
db2pd -edus | grep db2med
The name of the db2med thread itself embeds the EDU ID of the main coordinator agent (123456 in this example).
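As a small sketch (using a made-up thread name in the usual `db2med.<eduid>.<n>` shape; the value 123456 is purely illustrative), the coordinator EDU ID can be pulled out of the thread name with standard text tools:

```shell
# Hypothetical db2med thread name as it would appear in db2pd -edus output;
# the middle dot-separated field is the EDU ID of the coordinator agent.
thread_name="db2med.123456.0"

# Split on "." and take the second field.
edu_id=$(echo "$thread_name" | cut -d. -f2)
echo "$edu_id"   # prints 123456
```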
You can now get the application handle by doing:
db2pd -age | grep 123456
The reason I wanted the application handle was to pass it to the db2trc command:
db2trc on -l 512m -apphdl <apphdl>
and get the trace of all the backup threads involved.
In this particular customer case, I collected the following:
1) db2trc on -l 512m -apphdl <apphdl>
2) db2pd -stack app=<apphdl> -rep 180 3
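The two collection steps above can be wrapped into a small script. This is only a sketch: it builds the command strings in dry-run mode so it is safe to run anywhere, the default apphdl value 1234 and the file names are placeholders, and on a real DB2 server you would execute the commands instead of collecting them (the trace would then be dumped with db2trc dmp and formatted with db2trc flw/fmt).

```shell
#!/bin/sh
# Sketch of the slow-backup data collection. The application handle of
# the backup is taken as the first argument (placeholder default: 1234).
APPHDL="${1:-1234}"

# Dry-run wrapper: collect the commands into a variable instead of
# executing them; on a real server, replace the body with "$@".
CMDS=""
run() { CMDS="$CMDS$*
"; }

run db2trc on -l 512m -apphdl "$APPHDL"      # start tracing this apphdl
run db2pd -stack app="$APPHDL" -rep 180 3    # stacks, repeated as in the text
run sleep 120                                # let the trace run ~2 minutes
run db2trc dmp trc_raw.dmp                   # dump the trace buffer to a file
run db2trc off                               # stop tracing
run db2trc flw trc_raw.dmp trc.flw           # format: per-thread flow view
run db2trc fmt trc_raw.dmp trc.fmt           # format: full detail view

printf '%s' "$CMDS"
```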
Once I got the trace, I found that in fact both db2bm and db2med were idle, which was a bit puzzling, as at least one of them has to be doing something to move the backup forward. So I went back to the customer and asked for a longer trace (the previous one covered only 30 secs; I asked them to run it for around 2 mins). The second trace was able to capture db2bm intermittently getting some data from a tablespace, which turned out to be a fairly large one.
Also, from the stacks I observed that only 1 db2bm thread was active, and most of the time it was waiting on the latch SQLO_LT_SQLB_POOL_CB__readLatch, which other agents were holding to read in some data from time to time. The reason was that this was an ONLINE backup, and this tablespace contained close to 12k tables, so other agents kept reading data from tables in it while the online backup ran.
The way ONLINE backup works is that it tries to read contiguous chunks of data from the tablespace in one go (while holding the latch in X mode) to fill up the buffer, but that was clearly not happening here: the backup was found to be waiting behind the other agents most of the time, which were holding this latch in S mode.
This gave me a clue: there was possibly a lot of fragmentation in the tablespace, so there were not enough contiguous pages to fill up the buffer in one go, and the backup had to acquire the X latch on the tablespace more frequently and do more read operations than it would on a relatively fragmentation-free tablespace.
I looked at the db2pd -tablespaces output and parsed it (script attached) to find that there was indeed huge fragmentation in the tablespace. The parse script showed:
As can be seen, the tablespaces had a huge difference between Used Pages and HWM pages.
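The attached parse script is essentially a comparison of Used Pages against the HWM. The following is a minimal sketch of the same idea on made-up numbers laid out in the shape of db2pd -tablespaces columns (tablespace id, used pages, HWM); the values and the 2x threshold are illustrative, not a rule. A large gap between UsedPgs and HWM means the used extents are scattered, so the backup cannot read long contiguous runs of pages.

```shell
# Hypothetical tablespace figures (TspId, UsedPgs, HWM).
cat <<'EOF' > tbsp_sample.txt
TspId UsedPgs HWM
4     120000  980000
5     50000   51000
EOF

# Flag tablespaces whose HWM is more than 2x the used pages.
awk 'NR > 1 && $3 > 2 * $2 {
    printf "tbsp %s: used=%s hwm=%s gap=%s pages\n", $1, $2, $3, $3 - $2
}' tbsp_sample.txt
# prints: tbsp 4: used=120000 hwm=980000 gap=860000 pages
```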
After talking to the customer, it turned out that tables had indeed been moved out of this tablespace, and also that the tablespace had been created in v9.1 (the customer was currently on v9.7).
Since the tablespace was created in v9.1, the new v9.7 feature to reduce the HWM dynamically (ALTER TABLESPACE <ts> REDUCE MAX) was unavailable. So the only way we could get around this was to ask the customer to create dummy tables in the tablespace to fill up the "gaps" (verifying it with db2pd -tablespaces -db <dbname>) and thereby reduce the fragmentation.
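As a sketch of the dummy-table trick, the filler DDL can simply be generated in a loop. The table names, column definition, count, and tablespace name below are all made up for illustration; on a real system you would feed these statements to the db2 CLP and then re-check the Used Pages vs HWM gap with db2pd -tablespaces.

```shell
# Hypothetical tablespace name; the filler tables exist only to occupy
# the empty extents below the HWM.
TBSP="TS_FRAG"

for i in 1 2 3; do
    echo "CREATE TABLE FILLER_$i (C1 CHAR(254)) IN $TBSP;"
done
```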
After the customer reduced the fragmentation using this trick, the backup times went back to normal and everything was running fine.
I am providing two scripts here:
1) To automate data collection for slow backups (if you open a PMR with this data, that would be awesome!!)
2) To check the fragmentation of tablespaces. (You can verify the fragmentation using db2dart <dbname> /dhwm /tsi <tbsp id> /rptn <file> -- the holes will show up as ==EMPTY==)
Also, the following query can quickly tell you whether a tablespace has RECLAIMABLE STORAGE enabled or not (it's enabled by default if the tablespace is created in v9.7):
db2 "select tbsp_name, RECLAIMABLE_SPACE_ENABLED FROM TABLE(MON_GET_TABLESPACE('',-2)) AS t"
NOTE: I should mention here that this problem would not have surfaced if it had been an offline backup. It's specific to online backups only.
Hope you find this post useful in debugging slow backup issues yourself.