IBM Support

Maintenance Host Reboot Process of PureData Systems for Analytics

Question & Answer


Question

How do you properly reboot hosts when needed for maintenance, or before performing an upgrade or a part replacement?

Cause

This procedure ensures the hosts are healthy and able to boot back up after upgrades. Before rebooting a host, check that the cluster is healthy and that a migrate test has completed with no problems. Testing a host properly before any shutdown avoids many problems. It is also a good idea to complete a host backup before any type of update or hardware replacement.

Answer


I - Verify the health of the cluster
As root, carry out the following tasks to verify the cluster is healthy before rebooting a host:

    1 - Check the DRBD status and make sure it looks like the output below. Otherwise, contact IBM PureData Systems for Analytics technical support so a systems specialist can take a look.
    [root@nzhostHA1 ~/?]# service drbd status
    drbd driver loaded OK; device status:
    m:res cs st ds p mounted fstype
    0:r1 Connected Primary/Secondary UpToDate/UpToDate
    1:r0 Connected Primary/Secondary UpToDate/UpToDate

    [root@nzhostHA2 ~/?]# service drbd status
    drbd driver loaded OK; device status:
    m:res cs st ds p mounted fstype
    0:r1 Connected Secondary/Primary UpToDate/UpToDate C /export/home ext3
    1:r0 Connected Secondary/Primary UpToDate/UpToDate C /nz ext3


    2 - Check Heartbeat status
    [root@nzhostHA1 ~/?]# service heartbeat status
    heartbeat OK [pid 1234 et al] is running on server [nzhostHA1]...

    [root@nzhostHA2 ~/?]# service heartbeat status
    heartbeat OK [pid 5678 et al] is running on server [nzhostHA2]...


    3 - Check the current state of the active cluster using the crm_mon utility
    [root@nzhostHA1 ~/?]# crm_mon

    ============
    Last updated: Sun Jul 7 00:42:33 2013
    Current DC: p150-81e-d (ec5f70e6-368a-4415-9e7f-f97c4865135d)
    2 Nodes configured.
    3 Resources configured.
    ============

    Node: nzhostHA1 (f6a8cb61-555f-476b-9d23-82f179c1b973): online
    Node: nzhostHA2 (ec5f70e6-368a-4415-9e7f-f97c4865135d): online

    Resource Group: nps
    drbd_exphome_device (heartbeat:drbddisk): Started nzhostHA1
    drbd_nz_device (heartbeat:drbddisk): Started nzhostHA1
    exphome_filesystem (heartbeat::ocf:Filesystem): Started nzhostHA1
    nz_filesystem (heartbeat::ocf:Filesystem): Started nzhostHA1
    fabric_ip (heartbeat::ocf:IPaddr): Started nzhostHA1
    wall_ip (heartbeat::ocf:IPaddr): Started nzhostHA1
    nz_dnsmasq (lsb:nz_dnsmasq): Started nzhostHA1
    nzinit (lsb:nzinit): Started nzhostHA1
    fencing_route_to_ha1 (stonith:apcmastersnmp): Started nzhostHA2
    fencing_route_to_ha2 (stonith:apcmastersnmp): Started nzhostHA1
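
The checks in this section can also be scripted. The following is a minimal sketch, not part of the appliance tooling: the function names drbd_status_ok and crm_status_ok are hypothetical, and the parsing assumes the output layouts shown above (one DRBD resource per line, and crm_mon node and resource lines in the format of step 3).

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not IBM-supplied scripts.

# Reads `service drbd status` output on stdin.
# Succeeds only if at least one resource line is present and every
# resource line reports cs=Connected and ds=UpToDate/UpToDate.
drbd_status_ok() {
    awk '
        $1 ~ /^[0-9]+:/ {
            seen++
            if ($2 != "Connected" || $4 != "UpToDate/UpToDate") bad++
        }
        END { exit ((seen > 0 && bad == 0) ? 0 : 1) }
    '
}

# Reads crm_mon output on stdin.
# Succeeds only if every "Node:" line ends in "online" and every
# resource line ends in "Started <node>".
crm_status_ok() {
    awk '
        $1 == "Node:" { nodes++; if ($NF != "online") bad++ }
        /\): / && $1 != "Node:" { res++; if ($(NF-1) != "Started") bad++ }
        END { exit ((nodes > 0 && res > 0 && bad == 0) ? 0 : 1) }
    '
}
```

For example, `service drbd status | drbd_status_ok && echo "DRBD looks healthy"`. Feeding crm_mon output to crm_status_ok assumes the installed crm_mon can print its status once and exit (commonly `crm_mon -1`); verify that on the host before relying on it. A failed check is a reason to stop and contact support, not to proceed.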


II - Confirm you are connected to the correct host and migrate the database
    4 - Confirm you are connected to the active host:
      4.1 - First determine which host you are connected to using whichHost
      [root@nzhostHA1~/?]# /nzlocal/scripts/whichHost
      ha1


      4.2 - Check the DRBD output above and confirm that the host you are connected to shows Primary/Secondary, which confirms it is the active host

    5 - As nz, run nzstop. If the appliance is in production, it is *strongly* recommended to manually stop NPS using `nzstop` before running the heartbeat_admin.sh --migrate command. This allows the NPS processing to quiesce before the migration is attempted, and eliminates a potential race condition on hosts that are under high I/O load.
    [nz@nzhostHA1~/?]# nzstop

    6 - Fail over the database using the following command from the active host:
    [root@nzhostHA1~/?]# /nzlocal/scripts/heartbeat_admin.sh --migrate
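
The active-host check from step 4.2 can be expressed as a small guard to run before nzstop and the migrate. This is only a sketch: is_active_host is a hypothetical name, and it assumes the `service drbd status` layout shown in section I, where the third column lists the local role first.

```shell
#!/bin/sh
# Sketch only: is_active_host is a hypothetical helper, not an IBM script.

# Reads `service drbd status` output on stdin.
# Succeeds only if at least one resource line is present and every
# resource line lists the local role as Primary (for example
# Primary/Secondary), which is how step 4.2 identifies the active host.
is_active_host() {
    awk '
        $1 ~ /^[0-9]+:/ { seen++; if ($3 !~ /^Primary\//) bad++ }
        END { exit ((seen > 0 && bad == 0) ? 0 : 1) }
    '
}
```

For example, `service drbd status | is_active_host || echo "Not the active host: do not migrate from here"`. The guard only reports the role; it does not run nzstop or heartbeat_admin.sh itself.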

III - Shut down services and hosts
    7 - Once the migrate is done, re-check the cluster using the steps in section I and make sure everything worked.

    8 - Stop heartbeat and DRBD, and disable both at boot
    [root@nzhostHA1~/?]# service heartbeat stop
    Stopping High-Availability services:
    [ OK ]
    [root@nzhostHA1~/?]# service drbd stop
    Stopping all DRBD resources.

    [root@nzhostHA1~/?]# chkconfig heartbeat off
    [root@nzhostHA1~/?]# chkconfig drbd off


    9 - Verify everything was stopped correctly. If failures are noticed, STOP and contact IBM PureData Systems for Analytics technical support for a systems specialist to take a look.
    [root@nzhostHA1~/?]# service drbd status
    drbd not loaded
    [root@nzhostHA1~/?]# service heartbeat status
    heartbeat is stopped. No process


    10 - If everything behaved as expected, reboot the host.
    [root@nzhostHA1~/?]# shutdown -r now

    11 - Verify everything was stopped correctly. If failures are noticed, STOP and contact IBM PureData Systems for Analytics technical support for a systems specialist to take a look.

    12 - Ping the rebooting host from the other host to determine when it is back up.
    [root@nzhostHA2~/?]# ping ha1
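
Instead of watching ping by hand, the wait in step 12 can be wrapped in a loop. This is a minimal sketch: wait_for_host is a hypothetical name, and the default of 60 attempts at 10-second intervals is an assumption, not an IBM recommendation.

```shell
#!/bin/sh
# Sketch only: wait_for_host is a hypothetical helper, not an IBM script.

# Usage: wait_for_host HOST [ATTEMPTS] [DELAY_SECONDS]
# Sends one ping per attempt and returns 0 as soon as the host answers,
# or 1 if it never answers within the given number of attempts.
wait_for_host() {
    host=$1
    attempts=${2:-60}
    delay=${3:-10}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if ping -c 1 "$host" >/dev/null 2>&1; then
            echo "$host is answering ping"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "$host did not answer after $attempts attempts" >&2
    return 1
}
```

For example, `wait_for_host ha1` run from the other host. A ping reply only shows that the network stack is up; DRBD and heartbeat still need to be started and checked as described in the next section.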

IV - Restart services and verify cluster is healthy
    13 - Once the system is up, turn the services that were turned off back on as follows:
    [root@nzhostHA1~/?]# service drbd start
    Starting DRBD resources: [ d(r0) d(r1) s(r0) s(r1) n(r0) n(r1) ]
    [root@nzhostHA1~/?]# service heartbeat start
    Starting High-Availability services:
    [ OK ]


    14 - Once the services come back up, check that the cluster is healthy as in section I.
    If failures are noticed, STOP and contact IBM PureData Systems for Analytics technical support for a systems specialist to take a look.

    15 - Once everything looks healthy, re-enable heartbeat and drbd at boot with chkconfig:
    [root@nzhostHA1~/?]# chkconfig heartbeat on
    [root@nzhostHA1~/?]# chkconfig drbd on
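
To confirm that the chkconfig changes took effect, the runlevel listing can be checked. This is a sketch under stated assumptions: enabled_at_boot is a hypothetical name, the input is the single line printed by `chkconfig --list <service>` for one service, and treating runlevels 3 and 5 as the ones that matter is an assumption about the host.

```shell
#!/bin/sh
# Sketch only: enabled_at_boot is a hypothetical helper, not an IBM script.

# Reads the one-line output of `chkconfig --list <service>` on stdin,
# for example:  heartbeat  0:off 1:off 2:on 3:on 4:on 5:on 6:off
# Succeeds only if the service is on in both runlevel 3 and runlevel 5
# (assumed here to be the runlevels the host boots into).
enabled_at_boot() {
    awk '
        { for (i = 2; i <= NF; i++) if ($i == "3:on" || $i == "5:on") on++ }
        END { exit ((on == 2) ? 0 : 1) }
    '
}
```

For example, `chkconfig --list heartbeat | enabled_at_boot && chkconfig --list drbd | enabled_at_boot`. If either check fails, the service will not start on the next boot and the cluster will come up degraded.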

V - Migrate back to the other side and follow the procedure again to reboot the other host


Document Information

Modified date:
17 October 2019

UID

swg21652916