Backup and restore procedures for the DB2 instance shared directory in a DB2 pureScale environment

In this article, learn about the recovery procedure for an IBM® DB2® pureScale® instance shared directory (sqllib_shared). By following the procedures described here, you should be able to restore a pureScale instance to its last backup state in the case where some or all files or directories within it become unavailable or unusable.

Alan Lee (ykalee@ca.ibm.com), Software Developer, IBM

Alan Y. Lee has been part of the DB2 for Linux, UNIX, and Windows development team since version 6. His expertise is in the kernel area of DB2. For DB2 pureScale, he led the delivery of various high availability enhancements and assisted customers in designing and configuring their pureScale network topology.



06 June 2013

Introduction

Today, customer data is considered to be among the most critical assets of a company. Many different mechanisms in both software and hardware layers are employed to provide multiple levels of high availability. Methods such as storage replication, standby disaster recovery site, and periodic incremental database backup are well-known techniques employed by database administrators to avoid data corruption and loss. These techniques, however, are generally targeted at preserving database directories only, which typically reside in their own file systems. This means the DB2 instance directory (that is, the sqllib directory in the instance's home directory) is often not backed up.

In a non-pureScale environment, the loss of the instance directory can be recovered by re-creating the instance, re-cataloging databases and manually setting relevant database parameters. The procedures and commands involved are all at the DB2 software level, where database administrators are already proficient.

In a pureScale environment, in addition to the instance directory that is local to each member, there is an instance shared directory (called sqllib_shared) residing on the GPFS file system (<gpfs path>/<instance name>/sqllib_shared/) that stores instance configuration metadata shared by all members and CFs. Unlike the instance directory, the sqllib_shared directory cannot be recovered by simply re-creating the instance. The pureScale cluster configuration data and, in most cases, the original GPFS file system configuration must be restored together for a successful recovery. As such, losing the instance shared directory without a proper backup and recovery plan in place can cause a significant outage and damage to the business.

The focus of this article is to detail the backup and recovery procedures of the instance shared directory (sqllib_shared) in circumstances where some or all files or directories within it become unavailable or unusable. The reason for the unavailability can range from an errant removal of files or directories, to GPFS file system-related problems, disk problems, or other issues. This article doesn't cover complete recovery instructions for potential GPFS-related failures. If the failure is indeed GPFS related, refer to the GPFS file system and disk recovery procedures in the GPFS Problem Determination Guide.

Backup items

The following items must be backed up to enable the sqllib_shared directory recovery:

  • The sqllib_shared directory and its contents
  • The sqllib_shared GPFS file system configuration

Note that the backup files should be stored on a separate, permanent file system (that is, not on the same file system as the sqllib_shared directory). The examples below use /tmp as the repository for backup files for illustrative purposes only.

The sqllib_shared directory and its contents

In general, many UNIX commands are available to back up this directory, for example cp, tar, gtar, gnutar, and so on. Users are free to use their preferred tool. The minimum requirement is that the tool used must be able to preserve the full path, directory structure, ownership, and permissions of all files and directories in sqllib_shared.

  1. Log in as root to perform the backup so that all permissions and ownership of files and directories are preserved.
  2. (Optional) If disk space consumption is a concern, the contents of the sqllib_shared/db2dump/ directory can be excluded to reduce the size of the backup. (The db2dump directory sometimes contains very large core files.) However, an empty db2dump directory must be recreated after the sqllib_shared directory is restored.
  3. For illustration purposes, the following example uses the system command tar for backup and restore.
    • Sample command on AIX:

      tar -X <file> -cpf /tmp/<archive>.tar <full path>/sqllib_shared

      where:

      • -X <file> allows the specification of any file or directory not to be included in the archive. The filename or directory name needs to be on its own line in the <file>.
    • Sample command on Linux:

      tar --exclude=db2dump -Pcpf <archive>.tar <full path>/sqllib_shared

      where:

      • --exclude skips the inclusion of the db2dump directory and its contents.
      • -P prevents the removal of the first "/" in the path of the contents.
      The full path to the sqllib_shared directory (<full path>/sqllib_shared) is mandatory so that the full directory structure can be preserved after restore.
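As a minimal sketch of the Linux sample above, assuming GNU tar, the command can be wrapped in a small function that rejects a relative path before anything is archived. The function name backup_sqllib_shared and its two arguments are hypothetical and not part of DB2.

```shell
# Hypothetical helper, not part of DB2. Assumes GNU tar (Linux).
backup_sqllib_shared() {
    src=$1       # full path to sqllib_shared
    archive=$2   # archive file, ideally on a different file system
    case $src in
        /*) ;;   # full path given: the directory structure is kept in the archive
        *)  echo "error: '$src' is not a full path" >&2
            return 1 ;;
    esac
    # Same flags as the Linux sample: -P keeps the leading "/" in member
    # names, and --exclude skips the db2dump directory and its contents.
    tar --exclude=db2dump -Pcpf "$archive" "$src"
}
```

A call would look like backup_sqllib_shared <full path>/sqllib_shared /tmp/<archive>.tar, run as root so that ownership and permissions are readable for every file.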

The sqllib_shared GPFS file system configuration

The goal is to back up the sqllib_shared GPFS file system configuration information, including:

  • Disk information (NSD names, sizes, failure groups)
  • Storage pool layout
  • Filesets and junction points
  • Policy file rules
  • Quota settings and current limits
  • File system parameters (block size, replication factors, number of inodes, default mount point, and so on)

Note that this is not the same as backing up the entire file system. Only the configuration is backed up, so that if the file system has to be recreated because of corruption or a disk failure, this configuration backup can restore the original state of the file system on the new devices.

To back up the configuration:

  1. Log in as root user.
  2. Run the following command:

    /usr/lpp/mmfs/bin/mmbackupconfig <target file system device name> -o <full path to backup file name>

Listing 1 shows an example.

Listing 1. Backing up the configuration
root@coralpib189:/> /usr/lpp/mmfs/bin/mmbackupconfig /dev/svtfs0 -o /tmp/backup.config.svtfs0.1

mmbackupconfig: Processing file system svtfs0 ...
mmbackupconfig: Command successfully completed

root@coralpib189:/> ls -al /tmp/backup.config.svtfs0.1
-rw-r--r--    1 root     system         6272 Jul 19 09:15 /tmp/backup.config.svtfs0.1

Backup frequency

In general, backups of the above items should be taken under the following conditions:

  1. Regularly
    • The goal is to have a recent copy of the backup files even when none of the actions listed in item 2 has taken place.
    • A suggestion is to put the backup commands in a script and run it as a daily or nightly cron job.
  2. Whenever the following actions are performed:
    • Database backup
    • Setting and unsetting DB2 registry variables at the instance level (that is, db2set -i <instname> <var>=<value> or simply db2set <var>=<value>). Setting a variable at the member level or global level does not require a backup.
    • Changing any database manager configuration parameter
    • Making any DB2 cluster topology changes, such as adding or deleting members and CFs, modifying cluster interconnect netnames for CFs, and so on.
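The suggestion of a scripted daily backup could be sketched as follows for Linux with GNU tar. This is only an outline under stated assumptions: the function name backup_items, its arguments, the backup file naming, and the DRY_RUN switch are all hypothetical. Only the mmbackupconfig and tar invocations come from the commands shown earlier.

```shell
#!/bin/sh
# Sketch of a backup script covering both backup items (Linux, GNU tar).
# Function name, arguments, and file naming are placeholders.
# Set DRY_RUN to any non-empty value to print the commands instead of
# running them.
backup_items() {
    shared_dir=$1   # full path to sqllib_shared
    fs_device=$2    # GPFS device name of the sqllib_shared file system
    backup_dir=$3   # directory on a separate, permanent file system
    stamp=$4        # suffix for the backup files, for example a date

    # Item 1: the GPFS file system configuration.
    ${DRY_RUN:+echo} /usr/lpp/mmfs/bin/mmbackupconfig "$fs_device" \
        -o "$backup_dir/backup.config.$stamp" || return 1
    # Item 2: the sqllib_shared directory, without db2dump.
    ${DRY_RUN:+echo} tar --exclude=db2dump -Pcpf \
        "$backup_dir/sqllib_shared.$stamp.tar" "$shared_dir"
}
```

A script run from root's crontab might end with a call such as backup_items <full path>/sqllib_shared <device> <backup dir> "$(date +%Y%m%d)". Stamping the file names keeps earlier backups from being overwritten, and running once with DRY_RUN set shows the exact commands before they are trusted to cron.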

Recovery procedure

Prerequisite: Resolve any GPFS-related failure

As indicated in the first section of this article, it is paramount to determine the root cause of the sqllib_shared failure. If it is deemed to be related to the GPFS file system or a disk, refer to the GPFS Problem Determination Guide for a solution. Do not proceed to the next step until the root cause is remedied.

Assumption: Before proceeding to the next step, it is expected that:

  • The GPFS cluster is operational without any issue.
  • Either the old GPFS file system used by sqllib_shared has been cleared of corruption to re-host the sqllib_shared directory or a new GPFS file system has been created to host it.

Step 0: Clean up previous DB2 processes and resources

Depending on the state of the instance at the time the sqllib_shared directory failed, there might be leftover processes on some or all hosts. They need to be cleaned up before restarting. However, since the sqllib_shared directory is gone, no DB2 command can work at the instance level. Therefore, you must use a system command such as kill -9 to remove the processes. Before doing that, Tivoli System Automation (TSA) must be put into maintenance mode to prevent it from restarting any DB2 processes. A quick way to achieve this is to take the peer domain offline.

  1. Log in as root on one of the hosts in the cluster and run the following command to put the domain offline:

    <instance dir>/bin/db2cluster -cm -stop -domain <domain>

    where <domain> is the cluster domain name, which can be retrieved using the lsrpdomain system command.

  2. Log in as the instance owner and run the following on each host:
    1. Determine the list of DB2-related processes by using ps -ef | grep <instance ID>.
    2. Use kill -9 <PID> [<PID> ...] to terminate those processes.
    3. Issue ipclean -a.
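To make the process list from the ps -ef step easier to review, the PIDs can be extracted with a small filter. The sketch below is an illustration only: the function name list_instance_pids is hypothetical, and it assumes the usual ps -ef layout with a header line and the PID in the second column. Like the grep in the step above, it matches any line that mentions the instance ID, which can include your own login shell.

```shell
# Hypothetical helper: reads `ps -ef` output on stdin and prints the PID
# (column 2) of every line that mentions the given instance ID.
list_instance_pids() {
    inst=$1
    # NR > 1 skips the header line; index() does a plain substring match,
    # so the instance ID is never treated as a regular expression.
    awk -v inst="$inst" 'NR > 1 && index($0, inst) > 0 { print $2 }'
}
```

For example, ps -ef | list_instance_pids <instance ID> prints one PID per line. Review the list before passing any of the PIDs to kill -9.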

Step 1: Restore the original settings of the sqllib_shared file system

The GPFS command mmrestoreconfig can be used to restore the settings. This command "converts" an existing file system to the one specified in the backup configuration file. Thus the target file system must be unmounted from all hosts before performing the restore. Here are the steps for this process:

  1. Log in as root on one of the hosts in the cluster.
  2. To unmount the target file system on all hosts in the cluster, use this command:

    /usr/lpp/mmfs/bin/mmumount <mount point> -a

    where <mount point> does not need to be preceded by a "/".

  3. Restore the configuration:

    /usr/lpp/mmfs/bin/mmrestoreconfig <device> -i <backup file name>

    where:

    • <device> is the name of the device used to create the file system, for example, enter "/dev/<name>" or simply <name> (without the "/" at the beginning).
    • <backup file name> is the full path to the backup configuration file name generated by mmbackupconfig.

Here is an example of running the command to restore the configuration:

/usr/lpp/mmfs/bin/mmrestoreconfig /garbage -i /tmp/backup.config.svtfs0.1

Note that this example passes in a different mount point (/garbage). The mmrestoreconfig command can rename it back to the mount point encoded in the backup configuration file.

Listing 2. Sample output
--------------------------------------------------------
Configuration restore of svtfs0 begins at Tue Jul 19 09:51:21 EDT 2011.
--------------------------------------------------------
Checking disk settings for svtfs0:
Checking the number of storage pools defined for svtfs0.
Checking storage pool names defined for svtfs0.
Checking storage pool size for 'system'.

Checking filesystem attribute configuration for svtfs0:
Filesystem attribute value for stripeMethod restored.
Filesystem attribute value for logicalSectorSize restored.
Filesystem attribute value for minFragmentSize restored.
Filesystem attribute value for inodeSize restored.
Filesystem attribute value for indirectBlockSize restored.
Filesystem attribute value for defaultMetadataReplicas restored.
Filesystem attribute value for maxMetadataReplicas restored.
Filesystem attribute value for prefetchBuffers restored.
Filesystem attribute value for defaultDataReplicas restored.
Filesystem attribute value for maxDataReplicas restored.
Filesystem attribute value for blockAllocationType restored.
Filesystem attribute value for maxExpectedDiskI/OLatency restored.
Filesystem attribute value for fileLockingSemantics restored.
Filesystem attribute value for ACLSemantics restored.
Filesystem attribute value for estimatedAverageFilesize restored.
Filesystem attribute value for numNodes restored.
Filesystem attribute value for maxConcurrentI/OOperationsPerDisk restored.
Filesystem attribute value for blockSize restored.
Filesystem attribute value for quotasEnforced restored.
Filesystem attribute value for defaultQuotasEnabled restored.
Filesystem attribute value for maxNumberOfInodes restored.
Filesystem attribute value for filesystemVersion restored.
Filesystem attribute value for filesystemVersionLocal restored.
Filesystem attribute value for filesystemVersionManager restored.
Filesystem attribute value for filesystemVersionOriginal restored.
Filesystem attribute value for filesystemHighestSupported restored.
Filesystem attribute value for aggressivenessLevelOfTokensPrefetch restored.
Filesystem attribute value for supportForLargeLUNs restored.
Filesystem attribute value for DMAPIEnabled restored.
Filesystem attribute value for logfileSize restored.
Filesystem attribute value for exactMtime restored.
Filesystem attribute value for suppressAtime restored.
Filesystem attribute value for strictReplication restored.
Filesystem attribute value for storagePools restored.
Filesystem attribute value for filesetdfEnabled restored.
Filesystem attribute value for Maximum restored.
Filesystem attribute value for automaticMountOption restored.
Filesystem attribute value for additionalMountOptions restored.

Checking fileset configurations for svtfs0:

Checking policy rule configuration for svtfs0:
No policy rules installed in backed up filesystem svtfs0.

Checking quota settings for svtfs0:
Checking quota enablement for svtfs0.
mmrestoreconfig: Command successfully completed
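Because the file system is unmounted on all hosts before the restore, it may be worth confirming that the backup configuration file is readable first. The following sketch wraps the two GPFS commands of this step behind that check. The function name restore_fs_config, its arguments, and the DRY_RUN switch are hypothetical. The mmumount and mmrestoreconfig invocations are the ones shown above.

```shell
# Hypothetical wrapper around the two GPFS commands of this step.
# Set DRY_RUN to any non-empty value to print the commands instead of
# running them.
restore_fs_config() {
    mount_point=$1   # mount point of the target file system
    device=$2        # device name of the target file system
    backup_file=$3   # file produced earlier by mmbackupconfig

    # Refuse to unmount anything if the backup file cannot be read.
    if [ ! -r "$backup_file" ]; then
        echo "error: cannot read '$backup_file'" >&2
        return 1
    fi
    ${DRY_RUN:+echo} /usr/lpp/mmfs/bin/mmumount "$mount_point" -a || return 1
    ${DRY_RUN:+echo} /usr/lpp/mmfs/bin/mmrestoreconfig "$device" -i "$backup_file"
}
```

Run as root, a call takes the form restore_fs_config <mount point> <device> <backup file name>. The restore is attempted only if the unmount succeeds.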

Step 2: Untar the sqllib_shared archive

As instructed earlier in the backup procedure, the full path to sqllib_shared must be stored in the archive. Hence, the untar procedure is simply as follows:

  1. Log in as root.
  2. Go to the root directory: cd /
  3. Untar the archive:

    tar -xvpf <dir>/<archive>.tar

    The ownership and permissions should also be preserved by the archive process. Verify them to avoid any permission problems later on.

  4. If the sqllib_shared/db2dump/ directory was excluded during the backup, it must be recreated now. Make sure the ownership and permissions are the same as those of the other subdirectories under sqllib_shared.
    Listing 3. Recreating sqllib_shared/db2dump directory
    mkdir <dir>/sqllib_shared/db2dump
    chown <instance ID>:<instance group ID> <dir>/sqllib_shared/db2dump
    chmod 2777 <dir>/sqllib_shared/db2dump
    chmod o+t <dir>/sqllib_shared/db2dump

    Verify that you can access the db2dump directory via the instance local directory.
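Since the restore depends on the full path being stored in the archive, a quick check before extracting can catch an archive that was created with a relative or unexpected path. The sketch below assumes GNU tar on Linux, where -P makes the listing show member names exactly as stored. The function name archive_has_full_path is hypothetical.

```shell
# Hypothetical pre-restore check. Assumes GNU tar (Linux).
# Succeeds if the first member of the archive is the expected full path
# to sqllib_shared, or lies underneath it.
archive_has_full_path() {
    archive=$1
    expected=$2   # for example <full path>/sqllib_shared
    # -P lists member names as stored, including any leading "/".
    first=$(tar -Ptf "$archive" | head -n 1)
    case $first in
        "$expected"|"$expected"/*)
            return 0 ;;
        *)
            echo "error: archive starts with '$first', not '$expected'" >&2
            return 1 ;;
    esac
}
```

For example, archive_has_full_path <dir>/<archive>.tar <full path>/sqllib_shared returns a nonzero status for an archive that was created without the full path.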

Step 3: Remount the GPFS file system

Note that this step is only required if step 1 was run.

  1. Log in as root on one of the hosts in the cluster.
  2. Mount the target file system on all hosts using one command:

    /usr/lpp/mmfs/bin/mmmount <mount point> -a

    where <mount point> does not need to be preceded by a "/".

Step 4: Restart the peer domain

Note that this step is only required if item 1 of step 0 was run.

  1. Log in as root on one of the hosts in the cluster.
  2. Put the domain online:

    <instance dir>/bin/db2cluster -cm -start -domain <domain>

    where <domain> is the cluster domain name that can be retrieved using the lsrpdomain system command.

Step 5: Clear any alerts

  1. Log in as the instance owner.
  2. To determine if there are any alerts, run:

    db2instance -list

    Alerts will be flagged with a "YES" under the ALERT column.

  3. To clear any alerts, run:

    db2cluster -clear -alerts
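If this check is scripted, the alert test can be reduced to a one-line filter. This is a rough sketch that assumes the word YES appears in the db2instance -list output only under an ALERT column. The function name has_alerts is hypothetical.

```shell
# Hypothetical helper: reads `db2instance -list` output on stdin and
# succeeds if any line contains YES as a whole word.
# Assumption: YES appears only under an ALERT column.
has_alerts() {
    grep -qw YES
}
```

As the instance owner, it could be used as: if db2instance -list | has_alerts; then db2cluster -clear -alerts; fi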

Step 6: Restart the instance

The instance should be restarted automatically within a short period of time. If it doesn't, restart the instance manually:

db2start


Conclusion

Just as a well-designed database and storage backup plan is critical to company success, a periodic backup of the instance shared directory and its relevant shared file system metadata is equally important to prevent a significant outage and delays in recovery. This article has provided a step-by-step guide to backing up these items and to recovering them if the instance shared directory becomes unavailable or unusable.
