AIX disaster recovery

Resolving resource conflicts

Recovering AIX® IT resources after a disaster requires the personnel performing the recovery to concentrate on working the prepared plan. Unexpected deviations from the plan can cause immense delays in the disaster recovery project. Often, these unexpected deviations are due to a lack of adherence to business continuity policies, guidelines, standards, and procedures. This article identifies resource conflicts that typically occur during a disaster recovery implementation and provides suggestions for resolving these conflicts.

Dana French, President, Mt Xia Inc.

Mr. French's career in the IT industry has spanned three decades and numerous industries. His work focuses primarily on the fields of business continuity, disaster recovery, and high availability, and he has designed and written numerous software packages to automate the processes of disaster recovery and high availability. He is most noted for his approach to system administration as an automated, business-oriented process, rather than as a system-oriented, interactive process. He is also a noted authority on the subject of Korn Shell programming.



18 September 2007


Introduction

During a disaster recovery implementation effort, the last thing you want is unexpected hardware and software resource conflicts. These conflicts consume time and personnel resources, and they cause recovery time objectives to be missed. The goal of this article is to identify the most common causes of resource conflicts and to provide mechanisms to avoid or resolve them. The most desirable solution is to avoid these conflicts altogether so that resolution during a disaster recovery implementation effort is not required.

Many IT departments attempt to support multiple implementation standards: one for non-clustered systems, another for clustered systems, and yet another for disaster recovery-enabled systems. Maintaining multiple standards can itself be a source of conflicts during a disaster recovery implementation. Consolidation into a single set of standards should be a goal of your disaster recovery planning project and an overall strategy of business continuity.

Some of the typical issues encountered during an AIX® disaster recovery implementation include:

  • Systems provisioned for the purpose of disaster recovery (These systems are of a different type, size, and capacity than production; they are usually newer, with updated operating systems.)
  • User and group permission problems
  • Multiple instances of an application (Each instance is implemented on a separate system in production, and the instances are combined on a single system in the disaster recovery provisioning.)
  • Network naming and addressing issues
  • Production applications tied to a specific network address or network name during installation
  • Node name and host name conflicts between existing systems at the disaster recovery site and the new systems being implemented under the disaster recovery plan
  • Multiple implementation standards for various functional system types such as standalone, high availability, and disaster recovery

The resource conflicts and resolutions discussed here assume an organization has multiple data centers running production systems, with each data center serving as a disaster recovery site for one or more other data centers. However, the information presented here is applicable to any data center and disaster recovery plan.

Networking conflicts

The best solution for avoiding networking conflicts during a disaster recovery implementation is to always ensure that each network address (TCP/IP) or name has a unique value across the enterprise. In organizations with multiple active data centers, network addresses (TCP/IP) from the production data center should not be failed over to the disaster recovery site. To do so requires reconfiguration of routers and switches, and it could endanger the existing production systems running in the data center accepting the disaster recovery workload. Therefore, the production applications should never be tied to or dependent upon a specific network TCP/IP address because, in a disaster, those network TCP/IP addresses change, causing the applications not to work. Applications and regular users should never use or specify a network service by its TCP/IP address, and they should only use a symbolic name. Furthermore, the symbolic name used by applications and regular users should only be an alias and point to a host name.

In this context, a node refers to any system as a whole, whether it is part of a cluster or a standalone system, and the node name is a separate entity from the host name. The node name structure consists of alphanumeric characters only. One or more host names can be derived from the node name of the system. Table 1 illustrates a node named abcdefgh00 with five host names.

Table 1: Node names and host names
Node name | Host name | Description
abcdefgh00 | abcdefgh00-boot | These node and host names refer to an IP address assigned to a network adapter at boot time.
abcdefgh00 | abcdefgh00-pers | These node and host names refer to a persistent IP address assigned to a network adapter that will always be present.
abcdefgh00 | abcdefgh00-rg01 | These node and host names refer to a service IP address assigned to a network adapter that is available when the application services are running.
abcdefgh00 | abcdefgh00-mgmt | These node and host names refer to a system management IP address assigned to a network adapter that is always present.
abcdefgh00 | abcdefgh00 | These node and host names refer to a service IP address assigned to a network adapter that is available when the system is ready to provide application services.

Table 1 also shows a host name called abcdefgh00, which happens to correspond with the node name of the system. Recognize that even though they have the same name, their purposes are different.
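
The derivation of host names from a node name can be sketched as a small shell function. This is only an illustration, assuming the five suffixes shown in Table 1 and a single resource group numbered 01; the function name derive_host_names is hypothetical. It uses only POSIX shell features, so it should also run under the Korn shell.

```shell
# Hypothetical helper: list the host names derived from a node name,
# using the suffixes from Table 1. The resource group number (01) is an
# assumption for illustration; a real node may offer several.
derive_host_names() (
    node=$1
    # The node name structure is alphanumeric characters only.
    case $node in
        ''|*[!A-Za-z0-9]*)
            printf 'invalid node name: %s\n' "$node" >&2
            exit 1 ;;
    esac
    for suffix in -boot -pers -rg01 -mgmt ''; do
        printf '%s%s\n' "$node" "$suffix"
    done
)

derive_host_names abcdefgh00
```

The last name printed is the bare node name, which matches the final row of Table 1.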

The symbolic name used by applications and regular users should not be any of the host names referenced in Table 1, and they should only use an alias to these host names, as shown in Table 2.

Table 2: Host names and aliases
Host name | Alias
abcdefgh00-rg01 | myappl5
ijklmnop02-rg22 | db2sys
qrstuvwx17-rg05 | mqseries5

Notice in Table 2 that all host names have an extension of -rg##. These are references to a resource group service address. Even though this is an HACMP concept, it can be used on any clustered or standalone system to refer to a network service offered by a system. Any applications or regular users requiring access to a system's network services should only refer to the alias, which redirects them to the resource group host name, which in turn redirects them to the resource group service IP address.

In the event of a disaster, the production applications are restarted at the disaster recovery site on systems with different TCP/IP addresses, different node names, and different host names. To provide access to the applications now restarted at the disaster recovery site, the only change that is necessary is to repoint the alias names in the DNS to the new resource group host name in the disaster recovery site. No changes are necessary for the applications and regular users, and all applications are automatically rerouted to the correct location and application server.
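
As a rough sketch of what repointing an alias involves, the function below prints a zone-file style CNAME record that maps an alias to a resource group host name. The function name alias_record and the recovery-site host name zyxwvuts09-rg01 are made up for illustration; how records are actually loaded into your DNS depends on the name server in use and is not shown.

```shell
# Hypothetical helper: print a zone-file style CNAME record that points
# an application alias at a resource group host name.
alias_record() (
    alias_name=$1
    rg_host=$2
    # In this naming scheme, aliases point only at -rg## host names.
    case $rg_host in
        *-rg[0-9][0-9]) ;;
        *)  printf 'not a resource group host name: %s\n' "$rg_host" >&2
            exit 1 ;;
    esac
    printf '%s IN CNAME %s\n' "$alias_name" "$rg_host"
)

# Normal operation: the alias points at the production host name (Table 2).
alias_record myappl5 abcdefgh00-rg01

# After a disaster: the same alias is repointed at the recovery site.
# zyxwvuts09-rg01 is an invented recovery-site host name.
alias_record myappl5 zyxwvuts09-rg01
```

Only the right-hand side of the record changes; the alias that applications and users refer to stays the same.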

User names

Each person in an organization should be assigned a unique identifier across the enterprise that is only assigned to that person, and when that person leaves the organization, it is retired. This ensures a seamless audit trail when evaluating problems, issues, and actions. The user name should consist of alphanumeric characters and be a valid structure for all systems within an organization so that each person only has one user name. Specifying a user name structure that works on all systems and provides enough variability can be a daunting task, because most organizations today run a wide variety of operating systems, each with its own requirements for user name structures. To devise a common user name structure, you must take into consideration the requirements of Microsoft® Windows®, multiple variants of UNIX®, Linux®, OS/400®, RACF®, and others.

Some user name structures that seem to work across these environments are:

  • Four to seven lowercase and alphanumeric characters, beginning with a letter
  • Seven lowercase and alphanumeric characters, first three characters being alphabetic and the last four characters being digits
  • Seven lowercase and alphanumeric characters, first four characters being alphabetic and the last three characters being digits

There are certainly other user name structures that work, but these seem to be the most commonly used structures. These structures provide the commonality necessary to satisfy most requirements, as well as enough flexibility to be useful in both large and small organizations.
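
A candidate user name can be checked against these three structures with a shell case statement. The following is a minimal sketch; the function name user_name_format and the labels it prints are hypothetical. Note that the two seven-character structures are special cases of the first one, so they are tested first.

```shell
# Hypothetical helper: report which of the three user name structures a
# name satisfies, or fail if it satisfies none of them.
user_name_format() (
    LC_ALL=C    # make [a-z] mean ASCII lowercase letters only
    name=$1
    case $name in
        [a-z][a-z][a-z][0-9][0-9][0-9][0-9])
            printf 'three letters, four digits\n' ;;
        [a-z][a-z][a-z][a-z][0-9][0-9][0-9])
            printf 'four letters, three digits\n' ;;
        ''|[!a-z]*|*[!a-z0-9]*)
            exit 1 ;;   # empty, bad first character, or bad character
        ????|?????|??????|???????)
            printf 'four to seven characters\n' ;;
        *)
            exit 1 ;;   # too short or too long
    esac
)

user_name_format abc1234
user_name_format abcd123
user_name_format dfrench
```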

User ID and group ID numbers

In disaster recovery planning, it is important to recognize that user ID (UID) and group ID (GID) numbers should be uniquely assigned to users and groups on an enterprise-wide basis in order to avoid conflicts or security breaches during a disaster recovery implementation.

To avoid the maintenance and support issues of keeping a database of UID and GID values and their associated user and group names, a reproducible algorithm should be used to calculate the UID and GID values. An easy way to perform the calculation is to use the sum command with the -r option, which generates the Berkeley (BSD) checksum of the name. Here's an example of using this technique:

$ print "abc1234" | sum -r
  29247  1

All commands shown in this series of articles are expressed in Korn shell syntax.

A variation on the above command using Korn Shell 93 syntax:

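#!/usr/bin/ksh93
  # sum -r prints the checksum followed by a block count
  UID=$( print "abc1234" | sum -r )
  UID=${UID##+([[:space:]])}   # strip any leading white space
  UID=${UID%%[!0-9]*}          # keep the checksum, drop the block count
  print "UID=${UID}"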

A limitation of this technique is that it only calculates numbers between 600 and 65,000 for the specified user name format of three lowercase letters followed by four digits. The number of users and groups is therefore limited to fewer than 65,000 across the enterprise, and duplicates are possible with this algorithm.
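
Because duplicates are possible, it can help to check a list of names for colliding values before assigning them. The sketch below is one way to do that, not part of the original technique: the function names name_to_id and find_id_collisions are hypothetical, and awk is used to keep only the checksum field and to drop the leading zeros that some sum implementations print.

```shell
# Hypothetical helper: derive a numeric ID from a user or group name with
# the BSD checksum printed by sum -r. awk keeps the first field only, and
# "+ 0" removes any leading zeros.
name_to_id() (
    printf '%s\n' "$1" | sum -r | awk '{ print $1 + 0 }'
)

# Report pairs of names whose calculated IDs collide. Prints nothing
# when every name maps to a different ID.
find_id_collisions() (
    for name in "$@"; do
        printf '%s %s\n' "$(name_to_id "$name")" "$name"
    done | sort -n | awk '
        NR > 1 && $1 == prev_id { printf "%s: %s and %s\n", $1, prev_name, $2 }
        { prev_id = $1; prev_name = $2 }'
)

name_to_id abc1234                      # prints 29247, as in the example above
find_id_collisions abc1234 abc1235      # prints nothing: the IDs differ
```

Any line printed by find_id_collisions identifies two names that would need a manually assigned value for one of them.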

Resource group names

The concept of a resource group is used here in a larger sense than just as a high-availability entity. Here a resource group is used to define any logical collection of resources, which might include disk, I/O, users, applications, and so forth, regardless of whether or not a node participates in a high-availability cluster or disaster recovery provisioning scheme. In this context, a resource group should be viewed as independent from any machine, server, or data center.

The following provides a standard for defining the resource group name:

Table 3: Resource group name structure definition
Application code + Environment + Function + Customer or client ID + Sequence ID
Three characters + One character + One character + Two characters + One digit

The detailed information for each component of the resource group name is described in Table 4.

Table 4: Resource group name structure definition
Resource group name component | Number of characters | Values
Application code | 3 | db2 = DB2®; nim = NIM; mqm = MQSeries®; tsm = Tivoli® Storage Manager; vio = Virtual I/O
Environment | 1 | a = acceptance; g = pre-production/Gold; d = test/development; p = production; t = test; x = disaster recovery
Function | 1 | a = application; c = combination/multi-purpose; d = database; m = management; u = utility
Company or other identifier | 2 | ac = Acme; mx = Mt Xia; ib = IBM
Sequence ID | 1 | 0-9, A-Z, a-z

Table 5: Example resource group names
Application code | Environment | Function | Customer or client ID | Sequence ID | Example resource group name
db2 | a | p | mx | 0 | db2apmx0
nim | d | d | mx | 1 | nimddmx1
mqm | t | a | mx | 2 | mqmtamx2

Volume group names

In order to facilitate normal maintenance, disaster recovery, and business continuity, it is recommended that each volume group name be a unique value across the enterprise. A single AIX system might contain multiple resource groups, and there typically is one volume group defined per resource group. However, a resource group might contain several volume groups, depending upon the requirements of the application.

The volume group name should be based on the previously defined resource group name, followed by a two-digit sequence number that uniquely identifies the volume group, and then the literal characters vg.

Using this standard, the volume group name consists of exactly 12 characters with the following structure:

Table 6: Volume group name structure definition
Resource group name component | Volume group sequence identifier | Literal characters | Volume group name
db2apmx0 | 00 | vg | db2apmx000vg
db2apmx0 | 01 | vg | db2apmx001vg
db2apmx0 | 02 | vg | db2apmx002vg
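
The structure in Table 6 can be sketched as a function that builds the volume group name and confirms the 12-character length. The function name vg_name is hypothetical; the sequence number is passed as a two-digit string so that values such as 08 are kept as written.

```shell
# Hypothetical helper: build a volume group name from a resource group
# name and a two-digit sequence identifier, as in Table 6.
vg_name() (
    rg=$1 seq=$2
    case $seq in [0-9][0-9]) ;; *) exit 1 ;; esac
    name="${rg}${seq}vg"
    # The standard calls for exactly 12 characters.
    [ "${#name}" -eq 12 ] || exit 1
    printf '%s\n' "$name"
)

vg_name db2apmx0 00    # prints db2apmx000vg
vg_name db2apmx0 02    # prints db2apmx002vg
```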

Logical volume names

Extending the principle of enterprise-wide uniqueness to logical volume names ensures that naming conflicts are avoided while performing system failovers during planned or unplanned outages. The logical volume name should also be based upon the previously defined resource group name. Each logical volume is associated with a volume group, and each volume group typically contains several logical volumes.

To define a logical volume name, first determine which volume group and resource group the logical volume belongs to, and start with that resource group name. Then add a four-character alphanumeric identifier that uniquely identifies the logical volume, followed by the characters lv.

Table 7: Logical volume name structure definition
Resource group name component | Logical volume sequence identifier | Logical volume identifier | Logical volume name
db2apmx0 | db20 | lv | db2apmx0db20lv
db2apmx0 | db21 | lv | db2apmx0db21lv
db2apmx0 | db22 | lv | db2apmx0db22lv

Log logical volume names

JFS and JFS2 file systems require a logical volume for the JFS log, and its name must also be unique across the enterprise. The log logical volume name structure should be the same as previously defined for a normal logical volume; however, the logical volume sequence identifier should consist of the literal characters jfs followed by a single digit to uniquely identify this name. Typically, one log logical volume is defined per volume group; however, multiple log logical volumes might be defined. As an example, a resource group named db2apmx0 might have a volume group named db2apmx000vg. This volume group has multiple JFS log logical volumes associated with it:

Table 8: Log logical volume name structure definition
Resource group name component | JFS log logical volume sequence ID | JFS log logical volume ID | JFS log logical volume name
db2apmx0 | jfs0 | lv | db2apmx0jfs0lv
db2apmx0 | jfs1 | lv | db2apmx0jfs1lv
db2apmx0 | jfs2 | lv | db2apmx0jfs2lv
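
Because logical volume names and JFS log names share one fixed layout (Tables 7 and 8), a name can also be split back into its parts, which is useful when checking existing systems against the standard. The sketch below assumes an eight-character resource group name and a lowercase alphanumeric identifier; the function name lv_name_parts and the kind labels it prints are hypothetical.

```shell
# Hypothetical helper: split a logical volume name into its resource
# group name and four-character identifier, and flag JFS log names.
lv_name_parts() (
    LC_ALL=C    # make the bracket ranges mean ASCII characters only
    lv=$1
    case $lv in
        ????????[a-z0-9][a-z0-9][a-z0-9][a-z0-9]lv) ;;
        *)  printf 'unexpected logical volume name: %s\n' "$lv" >&2
            exit 1 ;;
    esac
    rg=${lv%??????}      # keep the first eight characters
    id=${lv#????????}    # drop the resource group name ...
    id=${id%lv}          # ... and the trailing literal lv
    case $id in
        jfs[0-9]) kind=jfslog ;;
        *)        kind=data ;;
    esac
    printf 'resource_group=%s identifier=%s kind=%s\n' "$rg" "$id" "$kind"
)

lv_name_parts db2apmx0db20lv
lv_name_parts db2apmx0jfs1lv
```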

File system mount point directory names

To ensure the ability to recover multiple instances of an application onto a single system in a disaster recovery scenario, each file system containing application files should have a unique mount point directory across the enterprise. The best way to achieve this is to use the resource group name or a substring of the logical volume name as the top-level directory, since typically a file system mount point is required for each logical volume.

As an example, a resource group named db2apmx0 might have multiple file systems associated with it:

Table 9: File system mount point directory name definition
Resource group name component | Optional logical volume sequence ID | Optional sub-directories | File system mount point
db2apmx0 | (none) | db2_08_01/bin | /db2apmx0/db2_08_01/bin
db2apmx0 | (none) | db2_08_01/data | /db2apmx0/db2_08_01/data
db2apmx1 | mq01 | (none) | /db2apmx1mq01
db2apmx1 | mq02 | (none) | /db2apmx1mq02
db2apmx1 | mq03 | (none) | /db2apmx1mq03
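
The mount points in Table 9 can be produced by joining the three components. The function below is a minimal sketch of that rule, assuming the sub-directory column belongs to the rows without a logical volume sequence ID; the function name mount_point is hypothetical, and an empty string stands for an omitted optional component.

```shell
# Hypothetical helper: build a file system mount point from a resource
# group name, an optional logical volume sequence ID, and optional
# sub-directories. Pass '' for a component that is not used.
mount_point() (
    rg=$1 lv_id=$2 subdir=$3
    path="/${rg}${lv_id}"
    if [ -n "$subdir" ]; then
        path="${path}/${subdir}"
    fi
    printf '%s\n' "$path"
)

mount_point db2apmx0 '' db2_08_01/bin    # prints /db2apmx0/db2_08_01/bin
mount_point db2apmx1 mq01 ''             # prints /db2apmx1mq01
```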

Conclusion

The unique naming methodology for resources across the enterprise provides an effective method of avoiding conflicts during a disaster recovery or high-availability failover. The use of a resource group name as the basis of this methodology can be expanded beyond the scope presented here. Additional potential resource conflicts include:

  • File names for application start and stop scripts
  • Workload Manager classes and subclasses
  • Performance monitoring
  • Job scheduling
  • Printer queues
  • Tivoli Storage Manager
  • MQSeries

The standards described in this article might or might not suit the particular needs and requirements of your organization; however, the overriding principle of this article is that your organization should have similar standards in place to help resolve or eliminate these potential conflicts. The worst time to decide that standards are needed is when you need them most, such as during a planned or unplanned outage.

Avoiding resource conflicts requires an organization to adopt an enterprise-wide mentality of business continuity. Furthermore, business continuity must be the beginning point in systems design, not the end point. Unfortunately, very few systems are built from the business continuity perspective backwards.
