Today, customer data is considered to be among the most critical assets of a company. Many different mechanisms in both software and hardware layers are employed to provide multiple levels of high availability. Methods such as storage replication, standby disaster recovery site, and periodic incremental database backup are well-known techniques employed by database administrators to avoid data corruption and loss. These techniques, however, are generally targeted at preserving database directories only, which typically reside in their own file systems. This means the DB2 instance directory (that is, the sqllib directory in the instance's home directory) is often not backed up.
In a non-pureScale environment, the loss of the instance directory can be recovered by re-creating the instance, re-cataloging databases and manually setting relevant database parameters. The procedures and commands involved are all at the DB2 software level, where database administrators are already proficient.
In a pureScale environment, in addition to the instance directory, which is local to each member, there is an instance shared directory (called sqllib_shared) residing on the GPFS file system (<gpfs path>/<instance name>/sqllib_shared/) that stores instance configuration metadata shared by all members and CFs. Unlike the instance directory, the sqllib_shared directory cannot be recovered by simply re-creating the instance. The pureScale cluster configuration data and, in most cases, the original GPFS file system configuration must be restored together for a successful recovery. As such, losing the instance shared directory without a proper backup and recovery plan in place can cause a significant outage and damage to the business.
The focus of this article is to detail the backup and recovery procedures for the instance shared directory (sqllib_shared) in circumstances where some or all files or directories within it become unavailable or unusable. The cause of the unavailability can range from an errant removal of files or directories to GPFS file system-related problems, disk problems, or other issues. This article doesn't cover complete recovery instructions for potential GPFS-related failures. If the failure is indeed GPFS related, refer to the GPFS file system and disk recovery procedures in the GPFS Problem Determination Guide.
The following items must be backed up to enable the sqllib_shared directory recovery:
- The sqllib_shared directory and its contents
- The sqllib_shared GPFS file system configuration
Note that the backup files should be stored on a separate, permanent file system (that is, not on the same file system as the sqllib_shared directory). The examples below use /tmp as the repository for backup files for illustrative purposes only.
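The requirement in the note above can be checked before a backup is taken. The following is a minimal sketch, assuming a POSIX shell and the portable `df -P` output format; the function names `mount_point_of` and `same_filesystem` and the example paths are hypothetical and are not part of DB2 or GPFS.

```shell
#!/bin/sh
# Hypothetical helper (not a DB2 or GPFS command): print the mount point of
# the file system that holds the given path.
mount_point_of() {
    # Line 2 of "df -P" describes the file system; the last field is its
    # mount point (assumes the mount point contains no spaces).
    df -P "$1" | awk 'NR == 2 { print $NF }'
}

# Return 0 if both paths are on the same file system, 1 otherwise.
same_filesystem() {
    [ "$(mount_point_of "$1")" = "$(mount_point_of "$2")" ]
}

# Example use (both paths are placeholders):
# if same_filesystem /backup /db2sd/db2inst1/sqllib_shared; then
#     echo "Backup target shares a file system with sqllib_shared" >&2
# fi
```

Comparing mount points rather than device names is a deliberate choice in this sketch, because several distinct file systems can report the same device name.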
The sqllib_shared directory and its contents
In general, there are many commands available on UNIX to back up this directory, for example cp, tar, gtar, gnutar, and so on. Users are free to use their preferred tool. The minimum requirement is that the tool must be able to preserve the full path, directory structure, ownership, and permissions of all files and directories in sqllib_shared.
- Log in as root to perform the backup so that all permissions and ownership of files and directories are preserved.
- (Optional) To reduce the size of the backup, the contents in the sqllib_shared/db2dump/ directory can be excluded from the backup if disk space consumption is a concern. (Sometimes there can be core files that are huge in the db2dump directory.) However, an empty db2dump directory must be recreated after the sqllib_shared directory is restored.
- For illustration purposes, the following examples use the system command tar for backup and restore.
- Sample command on AIX:
tar -X <file> -cpf /tmp/<archive>.tar <full path>/sqllib_shared
-X <file> specifies a file that lists the files or directories to exclude from the archive. Each file name or directory name must be on its own line in <file>.
- Sample command on Linux:
tar --exclude=db2dump -Pcpf <archive>.tar <full path>/sqllib_shared
--exclude skips the db2dump directory and its contents.
-P prevents the removal of the first "/" in the path of the contents.
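The Linux form of the command can be wrapped in a small function that also confirms the archive is readable afterward. This is only a sketch, assuming GNU tar; the function name `backup_shared_dir` is hypothetical, and on AIX the -X <file> form shown above would be needed instead of --exclude.

```shell
#!/bin/sh
# Hypothetical helper: archive a directory with its full path, permissions,
# and ownership, skipping any db2dump directory (GNU tar assumed).
# $1 = full path to sqllib_shared, $2 = archive file to create
backup_shared_dir() {
    # -P keeps the leading "/" of each member, -p preserves permissions.
    tar --exclude=db2dump -Pcpf "$2" "$1" || return 1
    # Listing the archive is a cheap check that it can be read back.
    tar -tPf "$2" > /dev/null
}

# Example use as root (paths are placeholders):
# backup_shared_dir /db2sd/db2inst1/sqllib_shared /backup/sqllib_shared.tar
```

Because db2dump is excluded entirely, the restore procedure has to re-create that directory, as noted in the optional step above.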
The sqllib_shared GPFS file system configuration
The goal is to back up the sqllib_shared GPFS file system configuration information, including:
- Disk information (NSD names, sizes, failure groups)
- Storage pool layout
- Filesets and junction points
- Policy file rules
- Quota settings and current limits
- File system parameters (block size, replication factors, number of inodes, default mount point, and so on)
Note that this is not the same as backing up the entire file system. Only the configuration is backed up, so that if the file system is corrupted or a disk failure forces the file system to be re-created, this configuration backup can restore the original state of the file system on the new devices.
To back up the configuration:
- Log in as root user.
- Run the following command:
/usr/lpp/mmfs/bin/mmbackupconfig <target file system device name> -o <full path to backup file name>
Listing 1 shows an example.
Listing 1. Backing up the configuration
root@coralpib189:/> /usr/lpp/mmfs/bin/mmbackupconfig /dev/svtfs0 -o /tmp/backup.config.svtfs0.1
mmbackupconfig: Processing file system svtfs0 ...
mmbackupconfig: Command successfully completed
root@coralpib189:/> ls -al /tmp/backup.config.svtfs0.1
-rw-r--r-- 1 root system 6272 Jul 19 09:15 /tmp/backup.config.svtfs0.1
In general, the above backup items should be taken under the following conditions:
- Periodically. The goal is to have a recent copy of the backup files even when none of the actions listed in the next item has been performed. One suggestion is to put the backup commands in a script and run it as a daily or nightly cron job.
- Whenever any of the following actions is performed:
  - Database backup
  - Setting or unsetting DB2 registry variables at the instance level (that is, db2set -i <instname> <var>=<value> or simply db2set <var>=<value>). Setting a variable at the member level or the global level does not require a backup.
  - Changing any database manager configuration parameter
  - Making any DB2 cluster topology change, such as adding or deleting members and CFs, modifying cluster interconnect netnames for CFs, and so on.
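The suggestion to script the backup items and run them from cron could be sketched as follows. Everything named here is an assumption made for illustration: the `run` and `backup_shared_instance` functions, the DRY_RUN switch, the date-stamped file names, and the script location in the crontab comment. The tar line uses the Linux form of the command.

```shell
#!/bin/sh
# Hypothetical nightly wrapper; every path and name is a placeholder.
# With DRY_RUN=1 (the default in this sketch) commands are printed, not run.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Back up the GPFS configuration, then the sqllib_shared directory.
# $1 = GPFS device, $2 = full path to sqllib_shared, $3 = backup directory
backup_shared_instance() {
    stamp=$(date +%Y%m%d)
    run /usr/lpp/mmfs/bin/mmbackupconfig "$1" -o "$3/backup.config.$stamp" &&
    run tar --exclude=db2dump -Pcpf "$3/sqllib_shared.$stamp.tar" "$2"
}

# Example crontab entry for root (02:00 every night), assuming the script is
# saved as /usr/local/bin/backup_sqllib_shared.sh and calls the function:
# 0 2 * * * /usr/local/bin/backup_sqllib_shared.sh
```

Running the wrapper once with DRY_RUN=1 shows the exact commands before anything is executed; setting DRY_RUN=0 runs them for real.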
Prerequisite: Resolve any GPFS-related failure
As indicated in the first section of this article, it is paramount to determine the root cause of the sqllib_shared failure. If the cause is deemed to be related to the GPFS file system or its disks, refer to the GPFS Problem Determination Guide, linked from the Resources section, for a solution. Do not proceed to the next step until the root cause is remedied.
Assumption: Before proceeding to the next step, it is expected that:
- The GPFS cluster is operational without any issue.
- Either the old GPFS file system used by sqllib_shared has been cleared of corruption to re-host the sqllib_shared directory or a new GPFS file system has been created to host it.
Step 0: Clean up previous DB2 processes and resources
Depending on the state of the instance at the time the sqllib_shared directory failed, there might be some leftover processes on some or all hosts. They need to be cleaned up before restarting. However, because the sqllib_shared directory is gone, no DB2 command can work at the instance level. Therefore, you must use a system command such as kill -9 to remove the processes.
Before doing that, TSA must be put into maintenance mode to prevent it from restarting any DB2 processes. A quick way to achieve this is to take the peer domain offline.
- Log in as root on one of the hosts in the cluster and run the following command to put the domain offline:
<instance dir>/bin/db2cluster -cm -stop -domain <domain>
where <domain> is the cluster domain name which can be retrieved using
- Log in as the instance owner and run the following on each host:
  - Determine the list of DB2-related processes by using ps -ef | grep <instance ID>.
  - Use kill -9 <PID> [<PID> ...] to terminate those processes.
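When reviewing the process list, it can help to filter on the owner column instead of grepping the whole line. The sketch below assumes the usual `ps -ef` layout (owner in column 1, PID in column 2); the function name `pids_owned_by` and the user name db2inst1 are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "ps -ef" output on standard input and print the
# PIDs of processes whose owner (column 1) is the given user.
# Matching on the owner column avoids picking up unrelated processes whose
# command line merely mentions the instance name, such as the grep itself.
pids_owned_by() {
    awk -v user="$1" 'NR > 1 && $1 == user { print $2 }'
}

# Example use on each host (db2inst1 is a placeholder). The list includes
# the caller's own login shell, so review it before terminating anything:
# ps -ef | pids_owned_by db2inst1
# kill -9 <PID> [<PID> ...]
```

The function only prints PIDs. Passing them to kill -9 is left as a manual step so that the list can be checked first.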
Step 1: Restore the original settings of the sqllib_shared file system
The GPFS command mmrestoreconfig can be used to restore the settings. This command "converts" an existing file system to the one specified in the backup configuration file. Thus, the target file system must be unmounted from all hosts before performing the restore. Here are the steps for this process:
- Log in as root on one of the hosts in the cluster.
- To unmount the target file system on all hosts in the cluster, use this command:
/usr/lpp/mmfs/bin/mmumount <mount point> -a
where <mount point> does not need to be preceded by a "/".
- Restore the configuration:
/usr/lpp/mmfs/bin/mmrestoreconfig <device> -i <backup file name>
- <device> is the name of the device used to create the file system, for example, enter "/dev/<name>" or simply <name> (without the "/" at the beginning).
- <backup file name> is the full path to the backup configuration file name generated by mmbackupconfig.
Here is an example of running the command to restore the configuration:
/usr/lpp/mmfs/bin/mmrestoreconfig /garbage -i /tmp/backup.config.svtfs0.1
Note that this example passes in a different mount point (/garbage). The mmrestoreconfig command renames it back to the mount point recorded in the backup configuration file.
Listing 2. Sample output
--------------------------------------------------------
Configuration restore of svtfs0 begins at Tue Jul 19 09:51:21 EDT 2011.
--------------------------------------------------------
Checking disk settings for svtfs0:
Checking the number of storage pools defined for svtfs0.
Checking storage pool names defined for svtfs0.
Checking storage pool size for 'system'.
Checking filesystem attribute configuration for svtfs0:
Filesystem attribute value for stripeMethod restored.
Filesystem attribute value for logicalSectorSize restored.
Filesystem attribute value for minFragmentSize restored.
Filesystem attribute value for inodeSize restored.
Filesystem attribute value for indirectBlockSize restored.
Filesystem attribute value for defaultMetadataReplicas restored.
Filesystem attribute value for maxMetadataReplicas restored.
Filesystem attribute value for prefetchBuffers restored.
Filesystem attribute value for defaultDataReplicas restored.
Filesystem attribute value for maxDataReplicas restored.
Filesystem attribute value for blockAllocationType restored.
Filesystem attribute value for maxExpectedDiskI/OLatency restored.
Filesystem attribute value for fileLockingSemantics restored.
Filesystem attribute value for ACLSemantics restored.
Filesystem attribute value for estimatedAverageFilesize restored.
Filesystem attribute value for numNodes restored.
Filesystem attribute value for maxConcurrentI/OOperationsPerDisk restored.
Filesystem attribute value for blockSize restored.
Filesystem attribute value for quotasEnforced restored.
Filesystem attribute value for defaultQuotasEnabled restored.
Filesystem attribute value for maxNumberOfInodes restored.
Filesystem attribute value for filesystemVersion restored.
Filesystem attribute value for filesystemVersionLocal restored.
Filesystem attribute value for filesystemVersionManager restored.
Filesystem attribute value for filesystemVersionOriginal restored.
Filesystem attribute value for filesystemHighestSupported restored.
Filesystem attribute value for aggressivenessLevelOfTokensPrefetch restored.
Filesystem attribute value for supportForLargeLUNs restored.
Filesystem attribute value for DMAPIEnabled restored.
Filesystem attribute value for logfileSize restored.
Filesystem attribute value for exactMtime restored.
Filesystem attribute value for suppressAtime restored.
Filesystem attribute value for strictReplication restored.
Filesystem attribute value for storagePools restored.
Filesystem attribute value for filesetdfEnabled restored.
Filesystem attribute value for Maximum restored.
Filesystem attribute value for automaticMountOption restored.
Filesystem attribute value for additionalMountOptions restored.
Checking fileset configurations for svtfs0:
Checking policy rule configuration for svtfs0:
No policy rules installed in backed up filesystem svtfs0.
Checking quota settings for svtfs0:
Checking quota enablement for svtfs0.
mmrestoreconfig: Command successfully completed
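If the restore output is captured to a file, it can be scanned for the completion message and the restored attributes shown in the sample output. This is a sketch only: the function names and the log file path are hypothetical, and the text patterns are taken from the sample output above rather than from any documented format, so the command's exit status remains the primary check.

```shell
#!/bin/sh
# Hypothetical helper: read captured mmrestoreconfig output on standard input
# and succeed only if the completion message from the sample output appears.
restore_succeeded() {
    grep -q 'mmrestoreconfig: Command successfully completed'
}

# Count the "Filesystem attribute value for ... restored." lines.
count_restored_attributes() {
    grep -c '^ *Filesystem attribute value for .* restored\.$'
}

# Example use (the log path is a placeholder):
# /usr/lpp/mmfs/bin/mmrestoreconfig <device> -i <backup file name> > /tmp/restore.log 2>&1
# restore_succeeded < /tmp/restore.log || echo "restore did not complete" >&2
# count_restored_attributes < /tmp/restore.log
```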
Step 2: Untar the sqllib_shared archive
As instructed earlier in the backup procedure, the full path to sqllib_shared must be stored in the archive. Hence, the untar procedure is simply as follows:
- Log in as root.
- Go to the root directory:
cd /
- Untar the archive:
tar -xvpf <dir>/<archive>.tar
The ownership and permissions should have been preserved by the archive process. Verify them to avoid any permission problems later on.
- If the sqllib_shared/db2dump/ directory was excluded during the backup, it must be re-created now. Make sure its ownership and permissions are the same as those of the other subdirectories under sqllib_shared.
Listing 3. Recreating sqllib_shared/db2dump directory
mkdir <dir>/sqllib_shared/db2dump
chown <instance ID>:<instance group ID> <dir>/sqllib_shared/db2dump
chmod 2777 <dir>/sqllib_shared/db2dump
chmod o+t <dir>/sqllib_shared/db2dump
Verify that you can access the db2dump directory via the instance local directory.
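The ownership and permission checks mentioned in this step can be done by comparing the re-created directory with one of its siblings. The following sketch assumes the standard `ls -ld` column layout; the function names `dir_attrs` and `same_owner` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "<mode> <owner> <group>" for a directory, using
# the first ten mode characters and the owner and group columns of "ls -ld".
dir_attrs() {
    ls -ld "$1" | awk '{ print substr($1, 1, 10), $3, $4 }'
}

# Return 0 if two directories have the same owner and group.
same_owner() {
    [ "$(dir_attrs "$1" | cut -d' ' -f2-)" = "$(dir_attrs "$2" | cut -d' ' -f2-)" ]
}

# Example use (<other subdirectory> is any sibling of db2dump):
# dir_attrs <dir>/sqllib_shared/db2dump
# same_owner <dir>/sqllib_shared/db2dump <dir>/sqllib_shared/<other subdirectory>
```

After the commands in Listing 3, the mode string printed for db2dump would be expected to read drwxrwsrwt, which reflects the set-group-ID bit from 2777 and the sticky bit from o+t.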
Step 3: Remount the GPFS file system
Note that this step is only required if step 1 was run.
- Log in as root on one of the hosts in the cluster.
- Mount the target file system on all hosts using one command:
/usr/lpp/mmfs/bin/mmmount <mount point> -a
where <mount point> does not need to be preceded by a "/".
Step 4: Restart the peer domain
Note that this step is only required if the peer domain was stopped in step 0.
- Log in as root on one of the hosts in the cluster.
- Put the domain online:
<instance dir>/bin/db2cluster -cm -start -domain <domain>
where <domain> is the cluster domain name that can be retrieved using the
Step 5: Clear any alerts
- Log in as the instance owner.
- To determine if there are any alerts, run:
Alerts will be flagged with a "YES" under the ALERT column.
- To clear any alerts, run:
db2cluster -clear -alerts
Step 6: Restart the instance
The instance should be restarted automatically within a short period of time. If it doesn't, restart the instance manually by running db2start as the instance owner.
Just as a well-designed database and storage backup plan is critical to company success, a periodic backup of the instance shared directory and its relevant shared file system metadata is equally important to prevent a significant outage and a delay in recovery. This article has provided a step-by-step guide to backing up the instance shared directory and to recovering it when it becomes unavailable.
- Learn more about GPFS in the GPFS product library.
- Read more articles on developerWorks about DB2 pureScale.
- In the DB2 for Linux, UNIX, and Windows area on developerWorks, get the resources you need to advance your DB2 skills.
- Follow developerWorks on Twitter.
Get products and technologies
- Download a trial version of DB2 for Linux, UNIX, and Windows.