Configuring the cluster for high availability in a GDPC environment

The configuration procedure detailed in this topic is specific to the geographically dispersed Db2® pureScale® cluster (GDPC). This procedure must be performed after the initial instance creation and after any subsequent repair or deletion operation on the resource model or the peer domain.

Before you begin

Ensure that you have Spectrum Scale replication set up (see Setting up IBM Spectrum Scale replication in a GDPC environment). If you are running on an AIX operating system on a RoCE network, ensure that you have set up a RoCE network (see Configuring public network equivalency in a GDPC environment for high availability).

Procedure

  1. Update storage failure time-outs.
    1. Ensure that, in the case of a storage controller or site failure, an error is returned quickly to Spectrum Scale by setting the relevant device driver parameters. Note that the relevant parameters differ between device drivers. Check the storage controller documentation or consult an on-site storage expert to ensure that errors are returned within 20 seconds.
      For example, on DS8K storage using the default AIX® SDDPCM driver, the updates are:
      chdev -l hdiskX -a 'cntl_delay_time=20 cntl_hcheck_int=2' -P
      
      Repeat for every hdiskX device.
      
      chdev -l fscsiY -a dyntrk=yes -a fc_err_recov=fast_fail -P
      
      Repeat for every fscsiY adapter.
      
      Reboot the host.
      
      Repeat the chdev commands on every host in the cluster.
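      Rather than running chdev by hand for each device, the updates can be scripted. The following shell sketch is illustrative only: it assumes AIX with SDDPCM and applies the settings to every hdisk and fscsi device that lsdev reports, so trim the device lists if only some of the disks belong to the Spectrum Scale cluster:
      # Apply the controller time-out settings to every disk (deferred to the next reboot by -P)
      for d in $(lsdev -Cc disk -F name); do
          chdev -l $d -a 'cntl_delay_time=20 cntl_hcheck_int=2' -P
      done
      # Apply the FC error-recovery settings to every fscsi device
      for a in $(lsdev -C -F name | grep '^fscsi'); do
          chdev -l $a -a dyntrk=yes -a fc_err_recov=fast_fail -P
      done
      # Reboot the host afterward, and repeat on every host in the cluster.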
    2. Verify that the attributes are set correctly on every host:
      root> lsattr -El fscsi0  
      attach          switch     How this adapter is CONNECTED           False
      dyntrk          yes        Dynamic Tracking of FC Devices          True
      fc_err_recov    fast_fail  FC Fabric Event Error RECOVERY Policy   True
      
      root> lsattr -El hdiskA1
      PCM             PCM/friend/otherapdisk  Path Control Module              False
      PR_key_value    none                    Persistent Reserve Key Value     True
      Algorithm       fail_over               Algorithm                        True
      autorecovery    no                      Path/Ownership Autorecovery      True
      clr_q           no                      Device CLEARS its Queue on error True
      cntl_delay_time 20                      Controller Delay Time            True
      cntl_hcheck_int 2                       Controller Health Check Interval True
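      To spot-check these attributes on every device at once, a loop similar to the following can be used (a convenience sketch; disks that do not expose the SDDPCM attributes simply report an error for the missing attribute, which can be ignored):
      # List the FC error-recovery attributes of every fscsi device
      for a in $(lsdev -C -F name | grep '^fscsi'); do
          echo "== $a =="; lsattr -El $a -a dyntrk -a fc_err_recov
      done
      # List the controller time-out attributes of every disk
      for d in $(lsdev -Cc disk -F name); do
          echo "== $d =="; lsattr -El $d -a cntl_delay_time -a cntl_hcheck_int
      done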
  2. Update the resource time-outs.
    Due to Spectrum Scale replication recovery requirements, recovery times for certain failures can be slightly longer in a geographically dispersed Db2 pureScale cluster (GDPC) environment than in a single-site Db2 pureScale environment. To account for this, some of the IBM Tivoli® System Automation for Multiplatforms resources need to have their timeout values adjusted. To adjust the time-outs, run the following commands once as root on any of the hosts in the cluster:
      root> export CT_MANAGEMENT_SCOPE=2
      # Update 2 member-specific timeouts. For these, the resource
      # names to update look like db2_<instance>_<member_id>-rs.
      # In this example we have members 0-4, and our instance name is
      # db2inst1:
      root> chrsrc -s "Name like 'db2_db2inst1_%-rs'" IBM.Application CleanupCommandTimeout=600
      root> chrsrc -s "Name like 'db2_db2inst1_%-rs'" IBM.Application MonitorCommandTimeout=600
      
      # In the next two commands, replace 'db2inst1' with your instance
      # owning ID
      root> chrsrc -s "Name like 'primary_db2inst1_900-rs'" IBM.Application CleanupCommandTimeout=600
      root> chrsrc -s "Name like 'ca_db2inst1_0-rs'" IBM.Application CleanupCommandTimeout=600
      
      # In the following commands, replace 'db2inst1' with your
      # instance owning ID, and repeat for each host in your cluster,
      # except the tiebreaker host T
      root> chrsrc -s "Name like 'instancehost_db2inst1_hostA1'" IBM.Application MonitorCommandTimeout=600
      root> chrsrc -s "Name like 'instancehost_db2inst1_hostA2'" IBM.Application MonitorCommandTimeout=600
      root> chrsrc -s "Name like 'instancehost_db2inst1_hostA3'" IBM.Application MonitorCommandTimeout=600
      root> chrsrc -s "Name like 'instancehost_db2inst1_hostB1'" IBM.Application MonitorCommandTimeout=600
      root> chrsrc -s "Name like 'instancehost_db2inst1_hostB2'" IBM.Application MonitorCommandTimeout=600
      root> chrsrc -s "Name like 'instancehost_db2inst1_hostB3'" IBM.Application MonitorCommandTimeout=600
      
      # In the last two commands, replace 'db2inst1' with your instance
      # owning ID, 'hostA3' with the hostname of the first CF added
      # to the cluster, and 'hostB3' with the hostname of the second
      # CF added to the cluster.
      root> chrsrc -s "Name like 'cacontrol_db2inst1_128_hostA3'" IBM.Application MonitorCommandTimeout=600
      root> chrsrc -s "Name like 'cacontrol_db2inst1_129_hostB3'" IBM.Application MonitorCommandTimeout=600
    To show the updated time-outs, run the following command as root:
    lsrsrc -t IBM.Application Name MonitorCommandTimeout CleanupCommandTimeout
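    This command lists every IBM.Application resource in the domain. To restrict the output to the resources of one instance (db2inst1 in this example; substitute your own instance name), a selection string can be added, as in this optional sketch:
    lsrsrc -s "Name like '%db2inst1%'" -t IBM.Application Name MonitorCommandTimeout CleanupCommandTimeout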
  3. Verify the network resiliency scripts.
    List the network resiliency scripts:
    root> /home/db2inst1/sqllib/bin/db2cluster -cfs -list -network_resiliency -resources
    For every host, a condition similar to the following is listed:
    condition 6:
            Name                        = "condrespV10_hostA1_condition_en2"
            Node                        = "hostA1.torolab.ibm.com"
            MonitorStatus               = "Monitored"
            ResourceClass               = "IBM.NetworkInterface"
            EventExpression             = "OpState != 1"
            EventDescription            = "Adapter is not online"
            RearmExpression             = "OpState = 1"
            RearmDescription            = "Adapter is online"
            SelectionString             = "IPAddress == '9.26.82.X'"
            Severity                    = "c"
            NodeNames                   = {}
            MgtScope                    = "l"
            Toggle                      = "Yes"
            EventBatchingInterval       = 0
            EventBatchingMaxEvents      = 0
            BatchedEventRetentionPeriod = 0
            BatchedEventMaxTotalSize    = 0
            RecordAuditLog              = "ALL"
    The SelectionString must match the IB, RoCE, or TCP/IP private Ethernet IP address of the host, except on the tiebreaker host. On configurations with RDMA (AIX IB or Linux® RoCE), it is the IB or RoCE IP address. On configurations without RDMA, or with AIX RoCE, it is the IP address of the private Ethernet network. If the SelectionString for any host does not match the expected IP address, the IP address is not correct. In this case, run:
    root> /home/db2inst1/sqllib/bin/db2cluster -cfs -repair -network_resiliency
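    To compare the SelectionString values against the addresses that are actually configured on a host, the relevant fields can be filtered out of the listing and checked against the interface table; this is an optional convenience sketch:
    root> /home/db2inst1/sqllib/bin/db2cluster -cfs -list -network_resiliency -resources | grep -E 'Name|Node|SelectionString'
    root> netstat -in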
  4. In a non-GDPC Db2 pureScale environment, Spectrum Scale uses the public Ethernet IP subnet (usually associated with the hostname) for heartbeating among all hosts. A failure in this network is detected by Spectrum Scale directly through the loss of heartbeat, leading to the shutdown of Spectrum Scale on the affected hosts. With GDPC on an AIX 10GE RoCE network, the Spectrum Scale heartbeat IP subnet is changed to the second private Ethernet network. To preserve the existing automatic shutdown of Spectrum Scale when the public Ethernet is down, a new condition/response pair must be set up as follows:
    1. Run the following on a host other than the tiebreaker host to create the new condition response pair on the public Ethernet:
      IP_ADDRESS="<IP address for the node>"
      COND_NAME="condrespV105_<node name>_condition_en0"
      RESP_NAME="condrespV105_<node name>_response"
      
      /bin/mkcondition -r IBM.NetworkInterface -d 'Adapter is not online' -e 'OpState !=1' -D 'Adapter is online' -E 'OpState = 1' -m l -S c -s "IPAddress =='${IP_ADDRESS}'" ${COND_NAME}
      /bin/chcondition -L ${COND_NAME}
      /bin/mkcondresp ${COND_NAME} ${RESP_NAME}
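      For example, on a host named node1 whose public Ethernet address is 9.26.82.11 (both values are placeholders for illustration only; substitute your own host name and IP address), the variables would be set as follows before running the three commands above:
      # Placeholder values only - replace with the host name and public IP address of your host
      IP_ADDRESS="9.26.82.11"
      COND_NAME="condrespV105_node1_condition_en0"
      RESP_NAME="condrespV105_node1_response"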
    2. Run the following command to activate and lock the new condition response:
      /bin/startcondresp ${COND_NAME} ${RESP_NAME}
      /bin/rmcondresp -L ${COND_NAME} ${RESP_NAME}
    3. Validate the network resiliency:
      /home/<Db2 instance ID>/sqllib/bin/db2cluster -cfs -list -network_resiliency -resources
      The output should be similar to:
      ====> Conditions <====
      Displaying condition information:
      condition 1:
             Name                        = "condrespV105_node1_condition_en0"
             Node                        = "node1.torolab.ibm.com"
             MonitorStatus               = "Monitored"
             ResourceClass               = "IBM.NetworkInterface"
             EventExpression             = "OpState != 1"
             EventDescription            = "Adapter is not online"
             RearmExpression             = "OpState = 1"
             RearmDescription            = "Adapter is online"
             SelectionString             = "IPAddress == '<IP ADDRESS>'"
             Severity                    = "c"
             NodeNames                   = {}
             MgtScope                    = "l"
             Toggle                      = "Yes"
             EventBatchingInterval       = 0
             EventBatchingMaxEvents      = 0
             BatchedEventRetentionPeriod = 0
             BatchedEventMaxTotalSize    = 0
             RecordAuditLog              = "ALL"
      
      condition 2:
             Name                        = "condrespV105_node1_condition_en2"
             Node                        = "node1.torolab.ibm.com"
             MonitorStatus               = "Monitored"
             ResourceClass               = "IBM.NetworkInterface"
             EventExpression             = "OpState != 1"
             EventDescription            = "Adapter is not online"
             RearmExpression             = "OpState = 1"
             RearmDescription            = "Adapter is online"
             SelectionString             = "IPAddress == '<IP ADDRESS>'"
             Severity                    = "c"
             NodeNames                   = {}
             MgtScope                    = "l"
             Toggle                      = "Yes"
             EventBatchingInterval       = 0
             EventBatchingMaxEvents      = 0
             BatchedEventRetentionPeriod = 0
             BatchedEventMaxTotalSize    = 0
             RecordAuditLog              = "ALL"
      .
      .
      .
      
      ====> Responses <====
      Displaying response information:
             ResponseName    = "condrespV105_node1_response"
             Node            = "node1.torolab.ibm.com"
             Action          = "condrespV105_node1_response event handler"
             DaysOfWeek      = 1-7
             TimeOfDay       = 0000-2400
             ActionScript    = "/usr/sbin/rsct/sapolicies/db2/condrespV105_resp.ksh"
             ReturnCode      = 0
             CheckReturnCode = "y"
             EventType       = "A"
             StandardOut     = "n"
             EnvironmentVars = ""
             UndefRes        = "y"
             EventBatching   = "n"
      .
      .
      .
      
      ====> Associations <====
       Displaying condition with response information:
      condition-response link 1:
             Condition = "condrespV105_node1_condition_en0"
             Response  = "condrespV105_node1_response"
             Node      = "node1.torolab.ibm.com"
             State     = "Active"
      condition-response link 2:
             Condition = "condrespV105_node1_condition_en2"
             Response  = "condrespV105_node1_response"
             Node      = "node1.torolab.ibm.com"
             State     = "Active"
    4. Repeat steps 'a' to 'c' on the other hosts, except the tiebreaker host.

Results

Your GDPC environment is installed and configured.

What to do next

You can create the database.