Troubleshooting
Oftentimes, a customer will encounter a problem, but by the time they open a PMR they have lost the opportunity to gather time- and scenario-specific data. The purpose of this section is to help you identify what problem category you may be experiencing so that you can gather the appropriate diagnostic information. This will help IBM Support address the issue with greater efficiency. In some cases, it may even help you resolve the problem on your own.
After studying customer issues encountered over the past couple years, we have come up with four major categories of problems:
The following sections examine each of these areas.
In an unexpected failover, an event occurs on the primary that should not result in the standby taking over. However, failover does occur, resulting in potential data loss and outage.
Considerations:
- Is the solution behaving as designed?
- Scenarios where a failover is expected:
- Primary machine goes down (for example, power off)
- Primary machine public network failure
- Primary instance failure
- Scenarios where a failover is expected:
- Scenarios where failover should not occur:
- Killing the instance (for example, db2_kill)
- Loss of the private network
- Why was there a failover?
- Original state of the HADR database resource is not primary
- Instance and HADR start script are not invoked
- Instance and HADR start script are unsuccessful
Consider the original role of the HADR database resource
TSA/RSCT guarantees that the HADR database resource will be online (primary) on one host and offline (standby) on the other. If this is not the case, what seems like an unexpected failover may actually be TSA's attempt to ensure the resource is online on an available host. There are several ways to check the HADR database role:
- Check the db2diag.log on the original primary. (See the "db2diag.log" section for details on this.) Is the HADR state primary to begin with? Does it remain this way leading up to the failover event?
- Check the database configuration parameters on the original
primary. (See the "Database
configuration parameters" section.). What is the database
role and does this match
lssamdatabase resource OpState? - Check db2pd output on the original primary. (See the "db2pd utility" section
for details on this.) What is the database role, and does this
match
lssamdatabase resource OpState?
The original role was primary, now what?
The first and most obvious question you must consider is: were any start scripts invoked? Keep in mind that if the instance died, its HADR database would also be unavailable as a result. Thus, TSA/RSCT would invoke both the instance and HADR start scripts. If only the HADR database became unavailable, then just the HADR start script would be invoked.
- Open up the syslog on the original primary.
- Do you see any script invocations?
- Yes - Continue debugging.
- No - This is very suspicious. As mentioned in the "syslog" section, the syslog contains a history of resource management by TSA. You should see the monitor script being called at specific intervals. Ask yourself, "Is the cluster configured correctly? Is logging enabled and at the correct level?"
- Do you see a start script invocation? Look for the instance
monitor script "Returning 2." Shortly after this, you should see
the instance start script invocation.
- Yes - Continue debugging.
- No - This is likely a TSA/RSCT problem. Contact IBM Support.
- Did the start script return successfully (in other words,
"Returning 0")?
- Yes - This means that the instance started correctly. However, something occurred shortly after this point to cause a failover. Look in the db2diag.log around this time frame to see if there are any further errors.
Once the instance has started, the database must be activated in order for HADR to continue replication. Activation of an inconsistent database would fail. A database would become inconsistent if the instance crashed (for example, killing the instance with db2_kill). If the database has AUTORESTART enabled, a restart would be triggered and crash recovery performed, returning the database to a consistent state.
- Check that AUTORESTART is enabled and try activating manually.
db2 get db cfg for <database> | grep AUTORESTART db2 activate db <database>
- No - Let's take a look at some of the possible root causes.
The instance start script uses the utility db2gcf to start the instance. If db2gcf timed out, you would see db2diag.log message indicating this.
Listing 13. db2diag.log message
2010-06-05-02.05.33.824563-240 I18731842A257 LEVEL: Error PID : 1831554 TID : 1 FUNCTION: DB2 Common, Generic Control Facility, GcfCaller::start, probe:40 MESSAGE : ECF=0x9000028C Timeout occured while calling a GCF interface function |
Try starting the instance manually, and see how long this takes (db2start).
Try starting the instance using db2gcf without specifying a timeout, and see how long this takes. The start time varies, depending on the number of partitions in your environment; however, it should not be more than several seconds per partition.
db2gcf -u -i <instance> |
If the instance started successfully, there would definitely be messages in the db2diag.log indicating this.
Listing 14. db2diag.log message
2010-08-04-22.05.57.453549-240 E10236E305 LEVEL: Event PID : 9620 TID : 47284539421760PROC : db2star2 INSTANCE: db2inst1 NODE : 000 FUNCTION: DB2 UDB, base sys utilities, DB2StartMain, probe:911 MESSAGE : ADM7513W Database manager has started. START : DB2 DBM |
Try starting the instance manually to see if this is successful.
DB2 interacts with the cluster manager by issuing TSA/RSCT commands. If these commands are unsuccessful, there would be a log entry with a specific identifier in db2diag.log. (See the "db2diag.log" section for details on this.)
The role was not primary, now what?
The main question here is why was the original role not primary? Is there a bigger problem at play?
When you cluster an instance using db2haicu, TSA/RSCT resources are
created as well as dependencies between them. To see dependencies in
your cluster you can use the lsrel
command.
The utility can be run as follows:
lsrel -Ab |
Listing 15.
lsrel command
db2inst1@host1:~> lsrel -Ab
Displaying Managed Relationship Information:
All Attributes
Managed Relationship 1:
Class:Resource:Node[Source] = IBM.Application:db2_db2inst1_host1_0-rs
Class:Resource:Node[Target] = {IBM.Equivalency:db2_public_network_0}
Relationship = DependsOn
Conditional = NoCondition
Name = db2_db2inst1_host1_0-
rs_DependsOn_db2_public_network_0-rel
ActivePeerDomain = hadr_domain
ConfigValidity =
|
An example of a dependency is the one created between the instance resource and the public network. If the public network goes down, clients can no longer connect to the database so a failover is necessary. Check if the public network is up.
If after going through this section your problem has not been solved, this would be a good point to open a PMR. Include in your description all your findings based on the guidance given above. In addition to this, please provide the following:
- Output of any commands run manually
- db2support
- getsadata
This section describes how to troubleshoot scenarios where a failover is expected but did not occur. There are a few questions that you should ask yourself before diving deep into further diagnostic steps:
- Should a failover occur?
- Scenarios where we expect a failover are:
- Primary machine goes down (for example,. power off)
- Primary machine public network failure
- Primary instance failure
- Scenarios where we expect a failover are:
- Scenarios where failover should not occur:
- Killing the instance (for example,. db2_kill)
- Loss of the private network
- If failover is expected but it did not occur, what actually
happened?
- Failure was not detected, hence no action
- Failure was detected, but failover was not initiated
- Failover was attempted but failed
To pick the correct category, let's review the action TSA/RSCT would take in an event of a failure.
There is a common misconception that a failover should always take place when there is a failure on primary. This is not always true. In scenarios where the primary can be brought up quickly, a failover may not be required. First, TSA/RSCT would try to restart the instance, the HADR database, or both on the same host. If that fails, then it would start up the HADR database on the standby host.
For example, when the db2sysc process is killed, no failover would take place if the HADR primary database is able to restart in the allotted time. However, in a prolonged failure scenario (for example, the instance cannot come back up or takes too long to restart), a failover is expected.
A clustered resource is monitored by invoking its monitor script and evaluating its return code. Around the time of failure, you should see a change in the return code from the resource monitor script.
- Open up the syslog on the original primary host.
- Do you see any HADR monitor script entries?
- Yes - Continue debugging.
- No - In a correct setup, the monitor script should be run periodically to monitor the state of the HADR database resource. Is logging enabled and at the correct level? Verify that the cluster domain is configured correctly.
- Scan through the HADR monitor script entries and look for a time
frame when the return code changes from "Returning 1" to
"Returning 2." Do you see this change?
- Yes - This is the time when TSA/RSCT first detected that the primary database is down. The failure was detected correctly. Go to the question "Was a failover attempt initiated?"
- No - Continue debugging.
- Do you see any timeout errors?
- Yes - The monitor script is taking too long to obtain the
status of the resource for the cluster manager. Run the script
in verbose mode to see where it is spending the most time or
if the script hangs.
Listing 16. Running the script in the verbose moderoot@host2:/usr/sbin/rsct/sapolicies/db2# ./hadrV97_monitor.ksh db2inst1 db2inst1 HADRDB verbose + CT_MANAGEMENT_SCOPE=2 + export CT_MANAGEMENT_SCOPE + CLUSTER_APP_LABEL=db2_db2inst1_db2inst1_HADRDB + [[ verbose == verbose ]] + typeset +f + typeset -ft main monitorhadr set_candidate_P_instance + main db2inst1 db2inst1 HADRDB verbose + set_candidate_P_instance + candidate_P_instance=db2inst1 + candidate_S_instance=db2inst1 + + hostname + tr . + awk {print $1} localShorthost=host2 + [[ db2inst1 != db2inst1 ]] + return 0 + + lsrsrc -s Name = 'db2_HADRDB_host2_Reintegrate_db2inst1_db2inst1' IBM.Test Name + grep -c Name nn=0 + [[ 0 -eq 0 ]] + [[ db2inst1 != db2inst1 ]] + [[ 0 -ne 0 ]] + monitorhadr
- No - If the monitor script is always "Returning 1," then TSA/RSCT thinks the resource is still online and no action will be taken.
- Does this match the status from DB2 perspective?
- Check db2diag.log, and use db2pd for clues on the database instance health and the last HADR role changes.
Was a failover attempt initiated?
In a scenario where failover is expected (for example, the resource cannot be restarted on the original host), once the failure is detected, a failover should take place. For DB2 HADR resource, this means an invocation of the HADR start script on the old Standby host.
- Open up the syslog on the original standby host.
- Focus in on the relevant time frame.
- Do you see any script invocations?
- Yes - Continue debugging.
- No - This is very suspicious. You should see the monitor script being called at a certain time interval in the syslog. Ask yourself, "Is the cluster configured correctly? Is logging to syslog enabled?"
- Do you see any HADR start script invocations?
- Yes - This means TSA/RSCT attempted to start the resource on the standby host. This is the correct behavior so far. Go to the question "Was the failover successful?"
- No - This is likely a TSA/RSCT problem. Contact IBM Support.
- Did the HADR start script return successfully (in other words,
"Returning 0")?
- Yes - This means the database was started successfully as
Primary on this originally Standby host.
At this point in time, if you look at the
lssamoutput, you would see a role switch, where the HADR database resource is online on the old standby host and failed offline on the old primary host.
Listing 17.lssamoutputroot@host1:/# lssam Online IBM.ResourceGroup:db2_db2inst1_host1_0-rg Nominal=Online '- Online IBM.Application:db2_db2inst1_host1_0-rs '- Online IBM.Application:db2_db2inst1_host1_0-rs:host1 Failed offline IBM.ResourceGroup:db2_db2inst1_host2_0-rg Nominal=Online '- Failed offline IBM.Application:db2_db2inst1_host2_0-rs '- Failed offline IBM.Application:db2_db2inst1_host2_0-rs:host2 Online IBM.ResourceGroup:db2_db2inst1_db2inst1_HADRDB-rg Control=MemberInProblemState Nominal=Online '- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs Control=MemberInProblemState |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:host1 '- Failed offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:host2 Online IBM.Equivalency:db2_db2inst1_host1_0-rg_group-equ '- Online IBM.PeerNode:host1:host1 Online IBM.Equivalency:db2_db2inst1_host2_0-rg_group-equ '- Online IBM.PeerNode:host2:host2 Online IBM.Equivalency:db2_db2inst1_db2inst1_HADRDB-rg_group-equ |- Online IBM.PeerNode:host2:host2 '- Online IBM.PeerNode:host1:host1
Continue troubleshooting. A takeover did occur successfully, so something must have happened later on, causing an apparent failure to failover. Follow through syslog entries to piece together the next sequence of events.
Note: Look at the HADR monitor script return codes. Do they initially return 1 and then later on return 2 or time out? If the HADR monitor script returns 2 later on, then a second failure occurred. Verify this with the db2diag.log.
- No - Using HADR start script, TSA/RSCT is not able to start up the database as primary on the standby host. Examine the db2diag.log to diagnose further. Always match the timestamps in the syslog with the timestamps in the db2diag.log file. Are there any db2diag.log error messages around the same time frame that the HADR start script fails? Let's take a look at some possible root causes.
- Yes - This means the database was started successfully as
Primary on this originally Standby host.
If the HADR_PEER_WINDOW configuration
parameter is too small, it may expire by the time the cluster manager
tries to issue a takeover from the standby. In Listing 18 below,
HADR_PEER_WINDOW is set to 5.
Listing 18. Standby db2diag.log
2010-08-05-10.30.21.185557-240 I609665A427 LEVEL: Severe
PID : 741466 TID : 4502 PROC : db2sysc
INSTANCE: db2inst1 NODE : 000
EDUID : 4502 EDUNAME: db2hadrs (HADRDB)
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20215
RETCODE : ZRC=0x8280001B=-2105540581=HDR_ZRC_COMM_CLOSED
"Communication with HADR partner was lost"
2010-08-05-10.30.21.185758-240 E610093A373 LEVEL: Event
PID : 741466 TID : 4502 PROC : db2sysc
INSTANCE: db2inst1 NODE : 000
EDUID : 4502 EDUNAME: db2hadrs (HADRDB)
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE : HADR state set to S-DisconnectedPeer (was S-Peer)
2010-08-05-10.30.21.186057-240 I610467A356 LEVEL: Warning
PID : 741466 TID : 4502 PROC : db2sysc
INSTANCE: db2inst1 NODE : 000
EDUID : 4502 EDUNAME: db2hadrs (HADRDB)
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrCloseConn, probe:30595
MESSAGE : Peer window end time :1281018623
2010-08-05-10.30.24.198297-240 I612083A367 LEVEL: Warning
PID : 741466 TID : 4502 PROC : db2sysc
INSTANCE: db2inst1 NODE : 000
EDUID : 4502 EDUNAME: db2hadrs (HADRDB)
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20202
MESSAGE : Peer window ends. Peer window expired.
2010-08-05-10.30.24.198729-240 E612451A389 LEVEL: Event
PID : 741466 TID : 4502 PROC : db2sysc
INSTANCE: db2inst1 NODE : 000
EDUID : 4502 EDUNAME: db2hadrs (HADRDB)
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE : HADR state set to S-RemoteCatchupPending (was S-DisconnectedPeer)
|
By the time the failure on the primary database has been detected by TSA/RSCT and a failover is deemed necessary, the peer window has already expired. This would result in the HADR start script failing.
Listing 19. Standby syslog
Aug 5 10:31:15 host1 user:notice /usr/sbin/rsct/sapolicies/db2/hadrV97_monitor.ksh
[663766]: Returning 2 : db2inst1 db2inst1 HADRDB
Aug 5 10:31:17 host1 user:debug /usr/sbin/rsct/sapolicies/db2/db2V97_monitor.ksh
[692326]: Returning 1 (db2inst1, 0)
Aug 5 10:31:27 host1 user:debug /usr/sbin/rsct/sapolicies/db2/db2V97_monitor.ksh
[471160]: Returning 1 (db2inst1, 0)
Aug 5 10:31:34 host1 user:notice /usr/sbin/rsct/sapolicies/db2/hadrV97_start.ksh
[471162]: Entering : db2inst1 db2inst1 HADRDB
Aug 5 10:31:36 host1 user:notice /usr/sbin/rsct/sapolicies/db2/hadrV97_monitor.ksh
[598066]: Entering : db2inst1 db2inst1 HADRDB
Aug 5 10:31:37 host1 user:debug /usr/sbin/rsct/sapolicies/db2/db2V97_monitor.ksh
[700494]: Returning 1 (db2inst1, 0)
Aug 5 10:31:39 host1 user:notice /usr/sbin/rsct/sapolicies/db2/hadrV97_monitor.ksh
[598080]: Returning 2 : db2inst1 db2inst1 HADRDB
Aug 5 10:31:44 host1 user:notice /usr/sbin/rsct/sapolicies/db2/hadrV97_start.ksh
[471192]: Returning 3 : db2inst1 db2inst1 HADRDB |
Listing 20. Standby db2diag.log around the HADR start script invocation
2010-08-05-10.31.36.489581-240 I728601A485 LEVEL: Error
PID : 741466 TID : 4502 PROC : db2sysc
INSTANCE: db2inst1 NODE : 000
EDUID : 4502 EDUNAME: db2hadrs (HADRDB)
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSDoTakeover, probe:47004
RETCODE : ZRC=0x8280001D=-2105540579=HDR_ZRC_NOT_TAKEOVER_CANDIDATE_FORCED
"Forced takeover rejected as standby is in the wrong state or peer window
has expired"
2010-08-05-10.31.36.489773-240 I729087A363 LEVEL: Warning
PID : 741466 TID : 4502 PROC : db2sysc
INSTANCE: db2inst1 NODE : 000
EDUID : 4502 EDUNAME: db2hadrs (HADRDB)
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSDoTakeover, probe:47008
MESSAGE : Info: Standby has failed to takeover.
|
Try increasing the HADR_PEER_WINDOW value.
In the db2diag.log error message, do you find errors with the identifier "###---####-###"? This may indicate an error directly returned by TSA/RSCT or TSA/RSCT being non responsive at the time. Refer to the RSCT information center for further details (see Resources for a link).
As you may have gathered, the main step in the old standby assuming a primary role is a "db2 takeover." Look through the db2diag.log to see if there are any messages having to do with this. Listing 21 illustrates what a successful takeover message would look:
Listing 21. Successful takeover message
2010-08-06-10.31.20.530184-240 I8024A377 LEVEL: Warning PID : 622806 TID : 4405 PROC : db2sysc INSTANCE: db2inst1 NODE : 000 EDUID : 4405 EDUNAME: db2hadrp (HADRDB) FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSDoTakeover, probe:47003 MESSAGE : Info: Standby has completed takeover (now primary). |
Do you see this?
Check the current HADR state of your database (by db2pd). If it is not primary, try to manually issue a db2 takeover.
Listing 22. Issue a db2 takeover
db2inst1@host2) /vbs/engn/ha/salinux $ db2 takeover hadr on db HADRDB DB20000I The TAKEOVER HADR ON DATABASE command completed successfully. |
If after going through this section your problem has not been solved, this would be a good point to open a PMR. Include in your description all your findings based on the guidance given above. In addition to this, provide:
- Output of any commands run manually
- db2support
- getsadata
After a failover, the old primary must come back and function as the new standby, which is known as "reintegration." In some cases reintegration may fail, sending the cluster domain into an undesirable state. This section explores the possible causes for an integration failure and expands on the steps to further debug.
Questions to ask yourself:
- Did HADR start successfully on the new primary?
- Did the instance start successfully on the old primary?
Did HADR start successfully on the new primary?
As previously mentioned, reintegration is when the old primary restarts as standby. What that means is TSA/RSCT called the HADR start script on the old standby, which in turn resulted in a "db2 takeover" command being issued. If the takeover is not successful, you are in a "No failover" situation. Refer to "No failover" section for assistance on this.
Did the instance start successfully on the old primary?
The first step in reintegration is to start the instance on the old primary, if it is not already started.
- Open up the syslog on the old primary.
- Do you see any script invocations?
- Yes - Continue debugging.
- No - This is very suspicious. You should see the monitor script being called at a certain time interval in the syslog. Ask yourself, "Is the cluster configured correctly? Is logging to syslog enabled?"
- Do you see an instance start script invocation? (Look for the instance monitor script "Returning 2." Shortly after this you should see the instance start script invocation.)
- Yes - Continue debugging.
- No - This is likely a TSA/RSCT problem. Contact IBM Support.
- Did the instance start script return successfully ("Returning 0")?
- Yes - This means the instance started successfully and should have reintegrated. Check the database configuration parameters or db2pd to confirm HADR database role.
- No - Look at the following possible root causes.
The instance start script uses the utility db2gcf to start the instance. If db2gcf timed out, you would see db2diag.log message indicating this.
Listing 23. db2diag.log message - db2gcf timed out
2010-06-05-02.05.33.824563-240 I18731842A257 LEVEL: Error PID : 1831554 TID : 1 FUNCTION: DB2 Common, Generic Control Facility, GcfCaller::start, probe:40 MESSAGE : ECF=0x9000028C Timeout occurred while calling a GCF interface function |
Try starting the instance manually and see how long this takes (db2start). You could also try starting the instance using db2gcf without specifying a timeout and see how long this takes:
db2gcf -u -i db2inst1 |
The start time varies, depending on the number of partitions in your environment. However, it should not be more than several seconds per partition.
If the instance started successfully, there would definitely be a message in the db2diag.log indicating this.
Listing 24. db2diag.log message - successful start message
2010-08-04-22.05.57.453549-240 E10236E305 LEVEL: Event PID : 9620 TID : 47284539421760PROC : db2star2 INSTANCE: db2inst1 NODE : 000 FUNCTION: DB2 UDB, base sys utilities, DB2StartMain, probe:911 MESSAGE : ADM7513W Database manager has started. START : DB2 DBM |
Try starting the instance manually to see if this is successful.
In this case, the instance has started successfully but the old primary
HADR database has failed to start as standby. As mentioned in the "lsrsrc"
section, flags are a way for DB2 to communicate with TSA/RSCT. In the
case of reintegration, the new primary would create a reintegration
flag. This is a signal for the old primary to start itself as standby.
Once the reintegration is complete, the flag is deleted. Check if the
reintegration flag still exists. If the flag still exists, this is a
strong indication that failure to reintegrate had something to do with
the flag detection.
Table 5 lists some APARS relating to reintegration failure. Read through each one and see if it applies to your scenario.
Table 5. Reintegration APARS
| APAR | Fixed in |
|---|---|
| IC65836 / IC65837 | V95FP6 / V97FP2 |
| IC64142 / IC64666 | V95FP6 / V97FP2 |
| IC67393 | V97FP2 |
If after going through this section your problem has not been solved, this would be a good point to open a PMR. Include in your description all your findings based on the guidance given. In addition to this, provide the following:
- Output of any commands run manually
- db2support
- getsadata
Configuration issues with db2haicu
In order to cluster an HADR setup, the db2haicu tool is used. This tool interacts with TSA/RSCT under the covers to create resources to manage various DB2 entities. When run in interactive mode, the user would be prompted with a series of questions. In most cases, if the user enters an incorrect/improper answer, he would be prompted as such.
In this section, find guidance on how to identify those cases where incorrect input is not corrected immediately (in certain versions of db2haicu), resulting in improper functioning of the cluster.
There are two ways db2haicu can fail during setup:
To distinguish between issues due to invalid inputs and TSA/RSCT errors, look through the db2diag.log. Focus on logs for the functions in the family "sqlha." Do you see the ####-### identifier in the data field? If so, this is an error directly from the cluster manager. Otherwise, this may be a configuration error. The data section should indicate the resource it is trying to action on (for example, the name of the resource or resource group that it is attempting to create or find).
Note: The configuration parameter requirements for DB2 HADR are different from those of the db2haicu tool. Due to this difference, if the HADR database pair is working perfectly, the db2haicu tool may still fail when creating resources in the cluster domain. Refer to Table 2 for details.
In most cases, an obvious invalid user input will be rejected from the db2haicu interactive interface (for example, using a host name as a VIP). However, some configuration errors cannot be immediately detected, so the setup process may fail at the end with a generic message. You will need to examine the db2diag.log for further debugging.
Let's take a look at a few common areas that are prone to invalid inputs:
The integrated HA solution relies on various sources for host name. As a result, you should verify that the formatting of all user inputs and system settings are consistent with one another.
To find out how TSA/RSCT views the host name, issue
lsrpnode.
To find out how DB2 views the host name, issue
uname. (Note: It is good
practice to keep uname host name consistent with the output of host
name.)
Listing 25 shows a few places where the host name is requested from users.
Listing 25. Host name requests
Create a unique name for the new domain: ha Nodes must now be added to the new domain. How many cluster nodes will the domain ha contain? 2 Enter the host name of a machine to add to the domain: host1 Enter the host name of a machine to add to the domain: host2 db2haicu can now create a new domain containing the 2 machines that you specified. If you choose not to create a domain now, db2haicu will exit. Create the domain now? [1] 1. Yes 2. No 1 Creating domain ha in the cluster... Creating domain ha in the cluster was successful. |
For the HADR database, the host names are specified in
HADR_LOCAL_HOST and
HADR_REMOTE_HOST parameters. If an IP
address is specified here, the db2haicu interactive tool will then
prompt for the actual host name.
Listing 26. Prompt for host name
Setting a high availability configuration parameter for instance db2inst1 to TSA. Adding DB2 database partition 0 to the cluster... Adding DB2 database partition 0 to the cluster was successful. Do you want to validate and automate HADR failover for the HADR database HADRDB? [1] 1. Yes 2. No 1 Adding HADR database HADRDB to the domain... The cluster node 9.26.53.5 was not found in the domain. Please re-enter the host name. host1 The cluster node 9.26.53.50 was not found in the domain. Please re-enter the host name. host2 Adding HADR database HADRDB to the domain... The HADR database HADRDB has been determined to be valid for high availability. However, the database cannot be added to the cluster from this node because db2haicu detected this node is the standby for the HADR database HADRDB. Run db2haicu on the primary for the HADR database HADRDB to configure the database for automated failover. All cluster configurations have been completed successfully. db2haicu exiting... |
The value specified in parameters
HADR_LOCAL_HOST and
HADR_REMOTE_HOST will be used as-is to
generate the clustered resource names. If there is inconsistency in
formatting between these values for host name sources, this will
result in failure. Failure to keep this consistency will lead to
db2haicu configuration failure. Best practice is to use canonical host
names in all areas.
Listings 27 and 28 show error messages that may be seen in a situation where there is host name inconsistency.
Listing 27. Error message from db2haicu
The HADR database HADRDB cannot be added to the cluster because the standby instance is not configured in the domain. Run db2haicu on the standby instance to configure it into the cluster. All cluster configurations have been completed successfully. db2haicu exiting ... |
Listing 28. Error message from db2diag.log
2010-08-04-13.45.17.792165+720 E258375A757 LEVEL: Error
PID : 23901 TID : 2199089142032PROC : db2haicu
INSTANCE: db2inst3 NODE : 000
FUNCTION: DB2 Common, SQLHA APIs for DB2 HA Infrastructure, sqlhaAddResourceGroup,
probe:300
MESSAGE : ECF=0x90000557=-1879046825=ECF_SQLHA_CLUSTER_ERROR
Error reported from Cluster
DATA #1 : String, 35 bytes
Error during vendor call invocation
DATA #2 : unsigned integer, 4 bytes
46
DATA #3 : String, 45 bytes
db2_db2inst3_lildb202.nz.thenational.com_0-rg
DATA #4 : unsigned integer, 4 bytes
1
DATA #5 : unsigned integer, 8 bytes
1
DATA #6 : signed integer, 4 bytes
0
DATA #7 : String, 0 bytes
Object not dumped: Address: 0x00000000800D59FC Size: 0 Reason: Zero-length data |
Another source of error for HADR configuration is
HADR_REMOTE_INST, which is case-sensitive
for db2haicu.
Listing 29. Error message from db2diag.log
2010-08-05-09.16.57.317614-240 E49090A665 LEVEL: Error
PID : 717032 TID : 1 PROC : db2haicu
INSTANCE: db2inst1 NODE : 000
EDUID : 1
FUNCTION: DB2 Common, SQLHA APIs for DB2 HA Infrastructure, sqlhaUICreateHADR, probe:895
RETCODE : ECF=0x9000056F=-1879046801=ECF_SQLHA_HADR_VALIDATION_FAILED
The HADR DB failed validation before being added to the cluster
MESSAGE : Please verify that HADR_REMOTE_INST and HADR_REMOTE_HOST are correct
and in the exact format and case as the Standby instance name and
host name.
DATA #1 : String, 7 bytes
Db2inst1
DATA #2 : String, 10 bytes
host2
2010-08-05-09.16.57.317983-240 E49756A594 LEVEL: Error
PID : 717032 TID : 1 PROC : db2haicu
INSTANCE: db2inst1 NODE : 000
EDUID : 1
FUNCTION: DB2 Common, SQLHA APIs for DB2 HA Infrastructure, sqlhaUICreateHADR, probe:900
RETCODE : ECF=0x9000056F=-1879046801=ECF_SQLHA_HADR_VALIDATION_FAILED
The HADR DB failed validation before being added to the cluster
DATA #1 : String, 7 bytes
db2inst1
DATA #2 : String, 7 bytes
Db2inst1
DATA #3 : String, 10 bytes
host1
DATA #4 : String, 10 bytes
host2
DATA #5 : String, 7 bytes
HADRDB
|
The instance name cannot contain an underscore (_) because it is used as a delimiter for the TSA/RSCT resource. It may lead to incorrect parsing. Improvements have been made in V95FP6 (IC62705) and V97FP2 (IC64856) to allow instance names containing underscores. However, it is advisable to steer away from such naming.
It is possible for db2haicu configuration to fail even if all inputs are valid. During the configuration process, db2haicu would make TSA/RSCT calls to create and register various resources. If the vendor call invocation returns with an error, this would lead to db2haicu exiting unsuccessfully. To verify this, examine the db2diag.log for any error messages containing the "### --- ####-###" indicator. Listings 30 - 34 show typical error messages from db2haicu and db2diag.log to illustrate common failures.
Listing 30. Typical error message from db2haicu
Create the domain now? [1]
1. Yes
2. No
1
Creating domain HA in the cluster ...
Creating domain failed. Refer to db2diag.log and the DB2 Information Center for details. |
Listing 31. Typical error message from db2haicu
Adding DB2 database partition 0 to the cluster ...
There was an error with one of the issued cluster manager commands. Refer to db2diag.log
and the DB2 Information Center for details. |
Listing 32. Typical error message in db2diag.log
2010-01-06-14.31.50.656997-420 E3426E903 LEVEL: Error
PID : 3485 TID : 48008924417584PROC : db2haicu
INSTANCE: db2inst1 NODE : 000
FUNCTION: DB2 Common, SQLHA APIs for DB2 HA Infrastructure, sqlhaCreateCluster, probe:900
MESSAGE : ECF=0x90000544=-1879046844=ECF_SQLHA_CREATE_CLUSTER_FAILED
Create cluster failed
DATA #1 : String, 24 bytes
Error creating HA Domain
DATA #2 : String, 6 bytes
UC4GHA
DATA #3 : unsigned integer, 4 bytes
2
DATA #4 : signed integer, 4 bytes
21
DATA #5 : String, 350 bytes
Line # : 8839---2632-044 The domain cannot be created due to the following errors that
were detected while harvesting information from the target nodes:
db200: 2610-441 Permission is denied to access the resource class specified
in this command. Network Identity UNAUTHENT requires 's' permission for the
resource class IBM.PeerDomain on node db200.
|
Solution:
preprpnode was not run on each node. As
root, issue TSA/RSCT command
preprpnode <host1> <host2> on
each node.
Listing 33. Typical error message in db2diag.log
2010-07-20-14.52.28.537943-240 E15731289A765 LEVEL: Error
PID : 974914 TID : 1 PROC : db2haicu
INSTANCE: db2inst1 NODE : 000
EDUID : 1
FUNCTION: DB2 Common, SQLHA APIs for DB2 HA Infrastructure,
sqlhaAddResource, probe:1600
MESSAGE : ECF=0x90000542=-1879046846=ECF_SQLHA_CREATE_GROUP_FAILED
Create group failed
DATA #1 : String, 35 bytes
Error during vendor call invocation
DATA #2 : unsigned integer, 4 bytes
0
DATA #3 : String, 0 bytes
Object not dumped: Address: 0x0000000110165890 Size: 0 Reason: Zero-
length data
DATA #4 : unsigned integer, 8 bytes
1
DATA #5 : signed integer, 4 bytes
309
DATA #6 : String, 86 bytes
Line # : 7227---2621-309 Command not allowed as daemon does not have a
valid license. |
Solution: This is a TSA/RSCT licensing issue. Obtain
the license from Passport Advantage and apply it using the
samlic command.
Listing 34. Typical error message in db2diag.log
2010-07-20-14.52.28.538365-240 E15732055A369 LEVEL: Error
PID : 974914 TID : 1 PROC : db2haicu
INSTANCE: db2inst1 NODE : 000
EDUID : 1
FUNCTION: DB2 Common, SQLHA APIs for DB2 HA Infrastructure,
sqlhaCreateNetwork, probe:50
RETCODE : ECF=0x90000542=- RETCODE :
ECF=0x90000542=-1879046846=ECF_SQLHA_CREATE_GROUP_FAILED
Create group failed.
Line # : 6531---host1: 2661-011 The command specified for attribute MonitorCommand is
NULL, not a absolute path, does not exist or has insufficient permissions to
be run.
|
Solution: This indicates the sapolicies scripts are not found in the /usr/sbin/rsct/sapolicies/db2 location. See the technote "Error 'The command specified for attribute MonitorCommand is NULL' reported by db2haicu" (IBM, March 2010) for a resolution. (See Resources for a link.)
For other cluster manager errors, refer to the RSCT information center (see Resources for a link) or contact IBM Support.






