Technical Blog Post
DB2 Backup and Restore basics
I have been asked about backup and restore (including performance) in the past few weeks and wanted to explore the backup and restore from a DB2 perspective. First and foremost, a filesystem backup is no substitute for a DB2 backup. Unless DB2 is paused (write suspended) for a filesystem snapshot of all the neccesary paths (typical SNAPSHOT backup) a standard filesystem backup is backing up a moving target, each file block will be copied in the state it's in when the OS backup reads that block from the file. There is no guarantee that a DB2 page will be consistent internally , much less in sync with it's neighbors on disk. The DB2 online backup and restore has the ability to back up that moving target and then using the transaction logs, bring it to a consistent point (the end of the backup or later) at the end of the restore and rollforward process.
There are two types of DB2 backups offline (no changes allowed in the database for the duration of the backup) and online (changes are allowed during the backup). The basic method for reading from the tablespaces and writing that data to the backup media is the same for both. The backup coordinating agent will allocate a number of tablespace readers (db2bm threads) based on the calculated parallelism or the user defined values. It will allocate a niumber of media writers based on the number of targets or sessions defined in the backup command. In addition, it will allocate a number of backup buffers from the utility heap (UTILHEAP) to buffer the data between the readers and writers which is again either calculated by DB2 or provided in the command. The backup will then start to read from the tablespaces largest to smallest (as we know the largest will take the longest to process) from the beginning to the end (high water mark) of the used extents in the tablespace. Each db2bm thread will process one tablespace at a time until all of the tablespaces are backed up. Please note, as the number of tablespaces left to be processed drops below the number of db2bm threads allocated at the start, they will start to go idle (there is currently no intra-tablespace parallelism).
A few additional nuggets of information related to online backups. The online backup will start by flushing the dirty pages in the bufferpool to disk to reduce the amount of log replay needed during the restore. The online backup has to coexist with the applications doing their work and share the I/O bandwidth for the tablespace access, so it will never be as fast as an offline backup. It walks the tablespaces latching ranges of pages to process (usually an extent at a time) and has to wait for the pages to be consistent before copying them to the backup buffer. It will not send empty ranges of pages to the backup, but it still has to process the range to make that determination.
On the writer side of things, the db2med threads will read from the backup buffers that are being populated by the db2bm threads and write those buffers out to the media or storage manager sessions defined in the command. As a general rule, we recommend a ratio of 5-8 readers to 1 writer for good performance if you have the resources available.
The restore is similar to the backup in that the same threads keep their same roles , but reverse direction. The db2med threads will read the data from the backup media and populate the buffers and the db2bm threads will read the buffers, organize the pages and write them out to the various tablespaces. There is not a specific tablespace to db2bm relationship during the restore as we are simply processing the buffers of pages and placing them at the correct offset in the various tablespace containers.
Now that we have laid out the basics of backup and restore, we can discuss some areas that can affect performance.
1. If all of the data is in a few large tablespaces the backup throughput will be quick at first (maximum parallelism) and begin to taper off as the number of tablespaces being processed quickly drops to the last one or two large ones.
2. The backup can only go as fast as the media writers can write out the data to the target(s). Ensure that you have adequate bandwith on the target side as well. You can have the fastest SAN storage in the world serving your database, but if you are trying to send it to one tape drive over the network the backup will be hampered by the network and tape speed.
3. The backup and restore processing on the db2bm side of things is generally managed in extent sized blocks of pages. So, extent size can be a factor in both backup and restore performance. There is no one-size-fits-all answer, but the reason I mention it is that there have been some classic database designs that use 2 or 4 page extents and we have seen that contribute to the poor performance.
4. Take advantage of the Reclaimable Storage feature in v97 and above to compact or defragment your tablespaces to reduce tablespace fragmentation as this can lead to poor online backup performance as we are walking the extents to locate used pages to include in the backup.
See this link for more details on the tablespace fragmentation issue: /support/pages/node/228393
5. Review the Performance Statistics information at the end of a backup or restore to see if there are any potential bottlenecks that could be addressed in a future recovery operation. The is feature is on by default in 10.1 and above and can be enabled in 9.7 with a registry variable (DB2_BAR_STATS).
See this link for more details on interpreting this output: https://www.ibm.com/developerworks/community/blogs/DB2PerfTips/entry/backup_statistics?lang=en
As always, please feel free to comment or ask questions