See the WebSphere eXtreme Scale Wiki for links to eXtreme Scale Version 7.0 documentation.
If you log in
with your developerWorks ID, you can leave comments and feedback for the development team.
ObjectGrid automatically repairs failures. For example, a partition has a primary and synchronous replica. The primary fails. ObjectGrid promotes the synchronous replica to be a primary, then places the synchronous replica on a compatible surviving container. Eventually, the synchronous replica enters peer mode and the system is back at the required level of fault tolerance.
This operation has a cost. Extra CPU and memory load was placed on the surviving shards to create the replacement replica. This extra load is sometimes not desirable, such as in HTTP or Session Initiation Protocol (SIP) sessions. These sessions are short lived, usually less than 10 minutes. The probability of the Java virtual machine (JVM) with the new primary failing within the short time period is very low. The task of repairing, or adding a new replica, might overload the surviving JVMs, especially when using the real-time JVMs.
You can disable disable this automatic repair. When a failure occurs, the primary fails over as before, but no new replicas are created. Following is a scenario that illustrates the behavior differences. This scenario has three servers (A, B, C) and three partitions with one replica (P0,P1,P2,R0,R1,R2):
A -> P0, R2
B -> P1, R0
C -> P2, R1
Auto repair on server A fails. ObjectGrid promotes server B to a primary and adds a new replica on server C, leaving the following state.
B -> P1, P0, R2
C -> P2, R1, R0
The R0 replica was upgraded to the P0 primary and replaced the R2 replica on the B server and added a replica R0 on server C. This operation increases the memory load on the survivor servers and adds CPU load while the replicas are being recreated.
Auto repair off
Server A fails again. ObjectGrid upgrades the R0 replica to a primary replica, but does nothing to repair. The P0 and P2 partitions are left with no replicas. This operation works if the probability of another failure happening is unlikely. The system is single fault tolerant but is not double fault tolerant.
B -> P1, P0
C -> P2, R1
Besides upgrading R0 to P0, no extra CPU or memory load occurs from repair. Server B has additional load because the primary is now running on it, but it had replication load before the failure.
© Copyright IBM Corporation 2007,2009. All Rights Reserved.