pureScale on Linux
I am referring to the standard form of database disaster recovery, whereby the production system is duplicated at a second "standby" site. Changes made at the production site are also applied at the standby site, keeping it more or less up to date with production. This hardware and software, often costing as much as the production system, only gets used in the case of a pretty major disaster, when it takes over the function of the production site. Fortunately these events are quite rare. In my experience, the rarity of the disaster scenario makes it psychologically difficult to spend all of that money on stuff that will "probably never get used". The decision can be even more onerous when the database is clustered and several servers are involved. So what can we do?
Those clever folks at the IBM labs have done it again. Why not simply stretch the cluster out so that half is at one data center and half is at another? If one data center fails for whatever reason, the application keeps going. The best ideas are usually simple! Of course we retain the existing features of pureScale: high availability, capacity on demand, etc. This is the Geographically Dispersed pureScale Cluster (GDPC). More details are available in this white paper.
The question of how to prevent a split-brain scenario comes up again and again with regard to pureScale (PS).
The scenario is as follows:
In a standard PS setup we have a primary and a standby CF (cluster caching facility). If the connection between these two machines fails but both keep running, the standby CF would "think" that the primary has failed and perform a failover. Both CFs would then take control of the shared data (the database), and the database would end up in a big mess. This could happen if the networking between the two machines broke down, or if one got really busy and couldn't respond to the other fast enough.
Of course, if this were true then we would be in big trouble, but fortunately it is not. A technology called I/O fencing is used to ensure the above scenario can't happen.
I/O fencing is implemented via SCSI-3 Persistent Reserve technology. The core of the technology involves "registration" and "reservation" rights to disk partitions. Registration allows access to data. Many nodes (members and CFs) can have "registration" access, but only one can hold the "reservation" on a partition. Registered nodes can eject others. Ejection is a final and atomic action: an ejected node is no longer registered, so it cannot eject another node.
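The registration and ejection rules above can be sketched as a toy model. This is only an illustration of the rules as described here, not the SCSI-3 command set or anything pureScale actually runs; the class `FencedDisk`, its method names, and the node names are all hypothetical.

```python
class FencedDisk:
    """Toy model of a shared partition protected by registration rights."""

    def __init__(self):
        self.registered = set()          # nodes allowed to access the data
        self.reservation_holder = None   # at most one node holds the reservation

    def register(self, node):
        self.registered.add(node)

    def reserve(self, node):
        """Try to take the single reservation; only one node can hold it."""
        if node not in self.registered:
            raise PermissionError(f"{node} is not registered")
        if self.reservation_holder not in (None, node):
            return False
        self.reservation_holder = node
        return True

    def eject(self, by, target):
        """A registered node removes another node's registration."""
        if by not in self.registered:
            # An ejected node has lost its registration, so it cannot eject.
            raise PermissionError(f"{by} is not registered and cannot eject")
        self.registered.discard(target)
        # Assumption made for this sketch: ejecting the holder frees the reservation.
        if self.reservation_holder == target:
            self.reservation_holder = None

    def can_write(self, node):
        return node in self.registered


disk = FencedDisk()
disk.register("cf1")
disk.register("cf2")
disk.eject("cf1", "cf2")       # cf1 fences cf2
print(disk.can_write("cf2"))   # cf2 can no longer touch the shared data
```

Whichever node ejects first wins: the loser is already unregistered by the time it tries to eject in return, which is why two CFs cannot both end up in control of the shared data.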
Cluster services software on each node manages the various failover scenarios in the cluster. There are numerous failover scenarios, and these things are worked out to the nth degree. In outline, if any failure is detected then all nodes work out what to do in a similar way.
First of all, what is a quorum? A quorum is a group of nodes in a cluster that can communicate with each other. The number of nodes in a quorum must be more than half of the total in the cluster or, if it is exactly half, the group must hold the "reservation" on the tiebreaker partition. If I am part of a quorum, I can continue and take part in failover and recovery, the first part of which is to eject (fence) any nodes that are not part of the quorum. This prevents the "bad" nodes from updating the shared data. If I am a "bad" node, i.e. not part of a quorum, I wait to regain access to the other nodes, and when I regain access I must undo anything I have done locally since the problem started (tidy up). I can then rejoin the cluster.
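As a minimal sketch, the quorum rule described above comes down to a few lines of arithmetic. The function name `has_quorum` and its parameters are made up for illustration; the real cluster services software handles many more cases than this.

```python
def has_quorum(reachable: int, total: int, holds_tiebreaker: bool = False) -> bool:
    """Decide whether a group of nodes forms a quorum.

    reachable        -- nodes in my group, counting myself
    total            -- nodes configured in the whole cluster
    holds_tiebreaker -- True if my group holds the reservation on the
                        tiebreaker partition
    """
    if not 1 <= reachable <= total:
        raise ValueError("reachable must be between 1 and total")
    if 2 * reachable > total:       # more than half: quorum
        return True
    if 2 * reachable == total:      # exactly half: tiebreaker decides
        return holds_tiebreaker
    return False                    # less than half: never a quorum


# A four-node cluster split evenly across two sites:
print(has_quorum(2, 4, holds_tiebreaker=True))   # this half continues and fences the other
print(has_quorum(2, 4, holds_tiebreaker=False))  # this half waits to rejoin
```

Because only one group can hold the tiebreaker reservation, at most one half of an evenly split cluster can ever decide it has a quorum.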