• 2 replies
  • Latest Post - 2012-08-31T17:45:40Z by osc

Pinned topic Replace NSD Server OS

2012-08-31T08:55:21Z
Dear GPFS Experts,

I'd like to replace the operating system running across our cluster (currently CentOS 6.0) with Red Hat 6.3. We have just acquired a site license for Red Hat that comes with all the nice technical support, and Red Hat is also officially supported by IBM for GPFS.

We have two NSD servers (both quorum-managers) and six client nodes. I can easily remove the client nodes, rebuild them and re-join them one at a time with a new OS, but I can't remove one of the NSD servers and rebuild it without taking the cluster offline, which I don't want to do!

I've got an idea that I could just back up all the directories containing the GPFS binaries, the source for the portability layer and the GPFS node configuration, then wipe the node. After that I would rebuild it, re-install the rpm packages, restore the directories, rebuild the portability layer and start GPFS. Either the node re-joins the cluster with no problems... or it crashes the cluster with hideous errors.

Does this sound like the ravings of a madman? Or, could I just get on and do this on the NSD servers whilst the cluster remains on-line?

Any help greatly appreciated,
  • SystemAdmin
    2092 Posts

    Re: Replace NSD Server OS

    Hi Luke,

    Depending on your hardware setup, you should be able to do what you want: upgrade one server while the other one is providing NSD services. The essential condition is that all your LUNs are attached to both NSD servers. If this is the case, then I propose the following action plan:
    • Move the quorum and manager function from NSD1 temporarily to another node (it does not need to be an NSD server)
    • Remove NSD1 from the cluster
    • Reinstall NSD1
    • Reintegrate NSD1 into the cluster
    • Move the quorum and manager function back to NSD1

    Then use the same procedure for the second NSD server.
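
As a rough illustration of the action plan above, here is a small Python sketch that only builds and prints the command sequence; it runs nothing. The node names (`nsd1`, `nsd2`, `client1`) and NSD names (`nsd_a`, `nsd_b`) are made up. The `mmchnsd` steps are an assumption added here, on the understanding that a node still listed as an NSD server cannot be deleted with `mmdelnode`. Depending on the GPFS release, `mmchnsd` may require the file system to be unmounted, and the temporary quorum node may need a server license via `mmchlicense`, so check the documentation for your version before relying on any of this.

```python
from typing import List


def plan_nsd_server_swap(nsd_server: str, temp_node: str,
                         other_server: str, nsd_names: List[str]) -> List[str]:
    """Return, in order, the commands for taking one NSD server out of the
    cluster and bringing it back. Nothing is executed; all names are
    hypothetical placeholders supplied by the caller."""
    # NSD descriptors in the "DiskName:ServerList" form accepted by mmchnsd.
    only_other = ";".join(f"{nsd}:{other_server}" for nsd in nsd_names)
    # Assumption: the rebuilt server goes back first in each server list.
    # Use the order reported by mmlsnsd before the change instead.
    both = ";".join(f"{nsd}:{nsd_server},{other_server}" for nsd in nsd_names)
    return [
        # 1. Move the quorum and manager roles to a temporary node.
        f"mmchnode --quorum --manager -N {temp_node}",
        f"mmchnode --nonquorum --client -N {nsd_server}",
        # 2. Remove the server from the cluster (assumed to require dropping
        #    it from the NSD server lists first).
        f'mmchnsd "{only_other}"',
        f"mmshutdown -N {nsd_server}",
        f"mmdelnode -N {nsd_server}",
        # 3. Manual step, not a command.
        "# reinstall the OS and GPFS packages, rebuild the portability layer",
        # 4. Reintegrate the server.
        f"mmaddnode -N {nsd_server}",
        f"mmchlicense server --accept -N {nsd_server}",
        f"mmstartup -N {nsd_server}",
        f'mmchnsd "{both}"',
        # 5. Move the quorum and manager roles back.
        f"mmchnode --quorum --manager -N {nsd_server}",
        f"mmchnode --nonquorum --client -N {temp_node}",
    ]


if __name__ == "__main__":
    for command in plan_nsd_server_swap("nsd1", "client1", "nsd2",
                                        ["nsd_a", "nsd_b"]):
        print(command)
```

Printing the plan first makes it easy to review each step against `mmlscluster` and `mmlsnsd` output before typing anything on a live cluster.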

  • osc
    11 Posts

    Re: Replace NSD Server OS

    If you wipe or reload a node that is part of a larger cluster, or have an OS or disk failure, there isn't much you need to do to get back up and running (unless it's the management node; we've seen odd issues with that). Really it's just reloading the node, rebuilding the source/portability layer for GPFS (this only takes a minute), and restoring the /var/mmfs/gen/mmsdrfs file from a working node or a backup. Then start GPFS and run 'mmrefresh -f' once you've gotten things named the same as before. After a little poking around to make sure it has rejoined properly, you will be up and running. We've done this a good bit with RHEL 5. You should probably back up the logs as well, but it's not an absolute necessity.
    This assumes a great deal, of course, from the basics (taking down an NSD server doesn't lose any disks) to the more advanced (supplemental network configs, snapshots, etc.). It sounds like the disk part might be an issue for you? With two servers it could possibly be done if both see all the disks, but I have my doubts about that from what you've said already.
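
For reference, the reload-and-restore route described above might look roughly like the sequence below. This is again a Python sketch that only builds the list of commands, and it assumes the rebuilt node already has its old hostname and IP address and the GPFS rpm packages installed. The healthy node name `nsd2` is a placeholder, and the `make Autoconfig` / `make World` / `make InstallImages` targets are the usual portability-layer build for GPFS 3.x on Linux, which may differ on your release.

```python
from typing import List


def plan_node_restore(healthy_node: str) -> List[str]:
    """Return, in order, the commands to run on a freshly reloaded node.
    Nothing is executed; healthy_node is a hypothetical working cluster
    member that still holds a current copy of mmsdrfs."""
    return [
        # Rebuild the portability layer against the new kernel.
        "cd /usr/lpp/mmfs/src && make Autoconfig && make World && make InstallImages",
        # Restore the cluster configuration file from a working node.
        f"scp {healthy_node}:/var/mmfs/gen/mmsdrfs /var/mmfs/gen/mmsdrfs",
        # Start GPFS on this node, then refresh its configuration.
        "mmstartup",
        "mmrefresh -f",
        # The "poking around": confirm the node shows as active.
        "mmgetstate -a",
    ]


if __name__ == "__main__":
    for command in plan_node_restore("nsd2"):
        print(command)
```

GPFS also ships an `mmsdrrestore` command intended for restoring the configuration on a node, which may be worth a look as an alternative to copying mmsdrfs by hand.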