Configuring the cluster for high availability in a GDPC environment
The configuration procedure detailed in this topic is specific to the geographically dispersed Db2® pureScale® cluster (GDPC). This procedure must be performed after the initial instance creation, and again after any subsequent repair or deletion operation on the resource model or peer domain.
Before you begin
Ensure that you have Spectrum Scale replication set up (see Setting up IBM Spectrum Scale replication in a GDPC environment). If you are running on an AIX operating system on a RoCE network, ensure that you have set up the RoCE network (see Configuring public network equivalency in a GDPC environment for high availability).
Procedure
- Update storage failure time-outs.
- Ensure that in the case of a storage controller or site failure, an error is returned quickly to Spectrum Scale by setting the relevant device driver parameters. Note that the relevant parameters differ for different device drivers. Check the storage controller documentation or consult a storage expert on site to ensure that errors are returned within 20 seconds.
For example, on DS8K using the default AIX® SDDPCM, the updates are:
chdev -l hdiskX -a 'cntl_delay_time=20 cntl_hcheck_int=2' -P
(repeat for every hdiskX device)
chdev -l fscsiY -a dyntrk=yes -a fc_err_recov=fast_fail -P
(repeat for every fscsiY adapter)
reboot the host
(repeat the chdev commands on every host in the cluster)
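If the cluster has many disks, a small loop can apply the same attributes to all of them. The following is a minimal sketch only, assuming AIX with SDDPCM-managed hdisk devices and that every disk and fscsi device listed by lsdev should receive the new values; review the device lists before running it, and still reboot and repeat on every host:
# Sketch: apply the controller delay/health-check attributes to every hdisk (deferred with -P).
for d in $(lsdev -Cc disk -F name); do
    chdev -l ${d} -a 'cntl_delay_time=20 cntl_hcheck_int=2' -P
done
# Sketch: apply dynamic tracking and fast-fail recovery to every fscsi device.
for f in $(lsdev -C -F name | grep '^fscsi'); do
    chdev -l ${f} -a dyntrk=yes -a fc_err_recov=fast_fail -P
done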
- Verify the attributes have been set correctly on every computer:
root> lsattr -El fscsi0
attach          switch                  How this adapter is CONNECTED         False
dyntrk          yes                     Dynamic Tracking of FC Devices        True
fc_err_recov    fast_fail               FC Fabric Event Error RECOVERY Policy True

root> lsattr -El hdiskA1
PCM             PCM/friend/otherapdisk  Path Control Module                   False
PR_key_value    none                    Persistent Reserve Key Value          True
algorithm       fail_over               Algorithm                             True
autorecovery    no                      Path/Ownership Autorecovery           True
clr_q           no                      Device CLEARS its Queue on error      True
cntl_delay_time 20                      Controller Delay Time                 True
cntl_hcheck_int 2                       Controller Health Check Interval      True
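To spot any disk that did not pick up the new values, you can scan all hdisks in one pass. This is a quick-check sketch under the same SDDPCM assumptions, to be run on every host after the reboot:
# Sketch: report any hdisk whose controller time-out attributes differ from the expected values.
for d in $(lsdev -Cc disk -F name); do
    delay=$(lsattr -El ${d} -a cntl_delay_time -F value 2>/dev/null)
    hchk=$(lsattr -El ${d} -a cntl_hcheck_int -F value 2>/dev/null)
    [ "${delay}" = "20" ] && [ "${hchk}" = "2" ] || \
        echo "check ${d}: cntl_delay_time=${delay} cntl_hcheck_int=${hchk}"
done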
- Update the resource time-outs. Due to Spectrum Scale replication recovery requirements, recovery times for certain failures can be slightly longer in a geographically dispersed Db2 pureScale cluster (GDPC) environment than in a single-site Db2 pureScale environment. To account for this, some of the IBM Tivoli® System Automation for Multiplatforms resources need to have their timeout values adjusted. To adjust the time-outs, run the following commands once as root on any of the hosts in the cluster:
root> export CT_MANAGEMENT_SCOPE=2

# Update 2 member-specific timeouts. For these, the resource
# names to update will look like db2_<instance>_<member_id>-rs.
# In this example we have members 0-4, and our instance name is
# db2inst1:
root> chrsrc -s "Name like 'db2_db2inst1_%-rs'" IBM.Application CleanupCommandTimeout=600
root> chrsrc -s "Name like 'db2_db2inst1_%-rs'" IBM.Application MonitorCommandTimeout=600

# In the next two commands, replace 'db2inst1' with your instance
# owning ID
root> chrsrc -s "Name like 'primary_db2inst1_900-rs'" IBM.Application CleanupCommandTimeout=600
root> chrsrc -s "Name like 'ca_db2inst1_0-rs'" IBM.Application CleanupCommandTimeout=600

# In the following commands, replace 'db2inst1' with your
# instance owning ID, and repeat for each host in your cluster,
# except the tiebreaker host T
root> chrsrc -s "Name like 'instancehost_db2inst1_hostA1'" IBM.Application MonitorCommandTimeout=600
root> chrsrc -s "Name like 'instancehost_db2inst1_hostA2'" IBM.Application MonitorCommandTimeout=600
root> chrsrc -s "Name like 'instancehost_db2inst1_hostA3'" IBM.Application MonitorCommandTimeout=600
root> chrsrc -s "Name like 'instancehost_db2inst1_hostB1'" IBM.Application MonitorCommandTimeout=600
root> chrsrc -s "Name like 'instancehost_db2inst1_hostB2'" IBM.Application MonitorCommandTimeout=600
root> chrsrc -s "Name like 'instancehost_db2inst1_hostB3'" IBM.Application MonitorCommandTimeout=600

# In the last two commands, replace 'db2inst1' with your instance
# owning ID, 'hostA3' with the hostname of the first CF added
# to the cluster, and 'hostB3' with the hostname of the second
# CF added to the cluster.
root> chrsrc -s "Name like 'cacontrol_db2inst1_128_hostA3'" IBM.Application MonitorCommandTimeout=600
root> chrsrc -s "Name like 'cacontrol_db2inst1_129_hostB3'" IBM.Application MonitorCommandTimeout=600
To show the updated time-outs, run the following command as root:
lsrsrc -t IBM.Application Name MonitorCommandTimeout CleanupCommandTimeout
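If the full lsrsrc listing is long, the same selection strings used in the chrsrc commands can narrow the output to just the resources that were changed. For example (assuming the instance name db2inst1 as in the sample commands, with CT_MANAGEMENT_SCOPE=2 still exported):
# Show only the member resources that were updated, with both time-out values.
lsrsrc -s "Name like 'db2_db2inst1_%-rs'" IBM.Application Name MonitorCommandTimeout CleanupCommandTimeout
lsrsrc -s "Name like 'instancehost_db2inst1_%'" IBM.Application Name MonitorCommandTimeout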
- Verify the network resiliency scripts.
List the network resiliency scripts:
root> /home/db2inst1/sqllib/bin/db2cluster -cfs -list -network_resiliency -resources
For every host, a condition is listed and looks as follows:
condition 6:
        Name                        = "condrespV10_hostA1_condition_en2"
        Node                        = "hostA1.torolab.ibm.com"
        MonitorStatus               = "Monitored"
        ResourceClass               = "IBM.NetworkInterface"
        EventExpression             = "OpState != 1"
        EventDescription            = "Adapter is not online"
        RearmExpression             = "OpState = 1"
        RearmDescription            = "Adapter is online"
        SelectionString             = "IPAddress == '9.26.82.X'"
        Severity                    = "c"
        NodeNames                   = {}
        MgtScope                    = "l"
        Toggle                      = "Yes"
        EventBatchingInterval       = 0
        EventBatchingMaxEvents      = 0
        BatchedEventRetentionPeriod = 0
        BatchedEventMaxTotalSize    = 0
        RecordAuditLog              = "ALL"
The SelectionString must match the IB, RoCE, or TCP/IP private Ethernet IP address for the host, except on the tiebreaker host. On configurations with RDMA (AIX IB or Linux® RoCE), it is the IB or RoCE IP address. On configurations without RDMA, or on AIX RoCE, it is the IP address of the private Ethernet network. If the SelectionString on any host does not match the expected address, the IP address is not correct. In this case, run:
root> /home/db2inst1/sqllib/bin/db2cluster -cfs -repair -network_resiliency
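A quick way to review the SelectionString of every condition without reading the full listing is to filter the output; the grep pattern here is only an illustration:
# List each condition name together with its SelectionString for a quick visual check.
/home/db2inst1/sqllib/bin/db2cluster -cfs -list -network_resiliency -resources | grep -E 'Name|SelectionString'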
- In a non-GDPC Db2 pureScale environment, Spectrum Scale uses the public Ethernet IP subnet (usually associated with the hostname) for heartbeating among all hosts. A failure in this network is detected by Spectrum Scale directly through the loss of heartbeat, leading to the shutdown of Spectrum Scale on the impacted hosts. With GDPC on AIX 10GE RoCE, the Spectrum Scale heartbeat IP subnet is changed to the second private Ethernet network. To preserve the existing automatic shutdown of Spectrum Scale when the public Ethernet is down, a new condition response must be set up as instructed below:
- Run the following commands on a host other than the tiebreaker host to create the new condition response pair on the public Ethernet:
IP_ADDRESS="<IP address for the node>"
COND_NAME="condrespV105_<node name>_condition_en0"
RESP_NAME="condrespV105_<node name>_response"
/bin/mkcondition -r IBM.NetworkInterface -d 'Adapter is not online' -e 'OpState != 1' \
    -D 'Adapter is online' -E 'OpState = 1' -m l -S c \
    -s "IPAddress == '${IP_ADDRESS}'" ${COND_NAME}
/bin/chcondition -L ${COND_NAME}
/bin/mkcondresp ${COND_NAME} ${RESP_NAME}
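For illustration only, on a hypothetical host hostA1 whose public Ethernet (en0) IP address is 10.1.1.101, the variables would be filled in as follows; the host name and address are assumptions for this example:
IP_ADDRESS="10.1.1.101"                           # assumed public en0 address of hostA1
COND_NAME="condrespV105_hostA1_condition_en0"
RESP_NAME="condrespV105_hostA1_response"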
- Run the following commands to activate and lock the new condition response:
/bin/startcondresp ${COND_NAME} ${RESP_NAME}
/bin/rmcondresp -L ${COND_NAME} ${RESP_NAME}
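To confirm that the new association is active before continuing, you can list it; a minimal check (output formatting can vary with the RSCT level):
# The condition-response link should be reported with State "Active".
/bin/lscondresp ${COND_NAME}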
- Validate the network resiliency:
/home/<Db2 instance ID>/sqllib/bin/db2cluster -cfs -list -network_resiliency -resources
The output should be similar to:
====> Conditions <====

Displaying condition information:
condition 1:
        Name                        = "condrespV105_node1_condition_en0"
        Node                        = "node1.torolab.ibm.com"
        MonitorStatus               = "Monitored"
        ResourceClass               = "IBM.NetworkInterface"
        EventExpression             = "OpState != 1"
        EventDescription            = "Adapter is not online"
        RearmExpression             = "OpState = 1"
        RearmDescription           = "Adapter is online"
        SelectionString             = "IPAddress == '<IP ADDRESS>'"
        Severity                    = "c"
        NodeNames                   = {}
        MgtScope                    = "l"
        Toggle                      = "Yes"
        EventBatchingInterval       = 0
        EventBatchingMaxEvents      = 0
        BatchedEventRetentionPeriod = 0
        BatchedEventMaxTotalSize    = 0
        RecordAuditLog              = "ALL"

condition 2:
        Name                        = "condrespV105_node1_condition_en2"
        Node                        = "node2.torolab.ibm.com"
        MonitorStatus               = "Monitored"
        ResourceClass               = "IBM.NetworkInterface"
        EventExpression             = "OpState != 1"
        EventDescription            = "Adapter is not online"
        RearmExpression             = "OpState = 1"
        RearmDescription            = "Adapter is online"
        SelectionString             = "IPAddress == '<IP ADDRESS>'"
        Severity                    = "c"
        NodeNames                   = {}
        MgtScope                    = "l"
        Toggle                      = "Yes"
        EventBatchingInterval       = 0
        EventBatchingMaxEvents      = 0
        BatchedEventRetentionPeriod = 0
        BatchedEventMaxTotalSize    = 0
        RecordAuditLog              = "ALL"
.
.
.
====> Responses <====

Displaying response information:
ResponseName    = "condrespV105_node1_response"
Node            = "node1.torolab.ibm.com"
Action          = "condrespV105_node1_response event handler"
DaysOfWeek      = 1-7
TimeOfDay       = 0000-2400
ActionScript    = "/usr/sbin/rsct/sapolicies/db2/condrespV105_resp.ksh"
ReturnCode      = 0
CheckReturnCode = "y"
EventType       = "A"
StandardOut     = "n"
EnvironmentVars = ""
UndefRes        = "y"
EventBatching   = "n"
.
.
.
====> Associations <====

Displaying condition with response information:
condition-response link 1:
        Condition = "condrespV105_node1_condition_en0"
        Response  = "condrespV105_node1_response"
        Node      = "node1.torolab.ibm.com"
        State     = "Active"
condition-response link 2:
        Condition = "condrespV105_node1_condition_en2"
        Response  = "condrespV105_node1_response"
        Node      = "node1.torolab.ibm.com"
        State     = "Active"
- Repeat steps 'a' to 'd' on other hosts except the tiebreaker host.
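If you prefer to script the repetition across hosts, the following is a rough sketch only: the host names and public IP addresses are placeholders, passwordless root ssh between the hosts is assumed, and the tiebreaker host is deliberately left out. Verify each value against your own cluster before running it.
# Sketch: create, lock, link, and start the public-Ethernet condition response on each host.
HOSTS="hostA1:10.1.1.101 hostA2:10.1.1.102 hostB1:10.1.2.101 hostB2:10.1.2.102"   # placeholders
for entry in ${HOSTS}; do
    host=${entry%%:*}
    ip=${entry##*:}
    cond="condrespV105_${host}_condition_en0"
    resp="condrespV105_${host}_response"
    ssh root@${host} "/bin/mkcondition -r IBM.NetworkInterface \
          -d 'Adapter is not online' -e 'OpState != 1' \
          -D 'Adapter is online' -E 'OpState = 1' \
          -m l -S c -s \"IPAddress == '${ip}'\" ${cond} && \
        /bin/chcondition -L ${cond} && \
        /bin/mkcondresp ${cond} ${resp} && \
        /bin/startcondresp ${cond} ${resp} && \
        /bin/rmcondresp -L ${cond} ${resp}"
done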
What to do next
You can create the database.