ObjectGrid high availability
Current by saif.patel@us.ibm.com
on Feb 09, 2009 12:27.


 {include:pageTitle=OG_HEADER}
 {excerpt}With high availability, ObjectGrid provides redundancy and detection of failures.
 {excerpt}
  
 ObjectGrid self-organizes grids of Java virtual machines (JVMs) into a loosely federated tree, with the catalog service at the root and core groups holding containers at the leaves. See the [topology|ObjectGrid architecture] topic for more information. The catalog service automatically creates the core groups, each containing about 20 servers. The members of a core group monitor one another for health, and one member is elected leader to communicate group information to the catalog service. Limiting the core group size keeps health monitoring effective and the environment highly scalable.
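
The grouping described above can be pictured with a small sketch. The class `CoreGroupSketch`, its `partition` method, and the rule that the first member acts as leader are all assumptions made for illustration; the actual grouping and election algorithms of the catalog service are not described in this topic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not the ObjectGrid catalog service algorithm.
final class CoreGroupSketch {
    // Target size taken from the text ("about 20 servers").
    static final int TARGET_GROUP_SIZE = 20;

    // Splits the server names into consecutive groups of at most groupSize members.
    static List<List<String>> partition(List<String> servers, int groupSize) {
        if (groupSize <= 0) {
            throw new IllegalArgumentException("groupSize must be positive");
        }
        List<List<String>> groups = new ArrayList<>();
        for (int start = 0; start < servers.size(); start += groupSize) {
            int end = Math.min(start + groupSize, servers.size());
            groups.add(new ArrayList<>(servers.subList(start, end)));
        }
        return groups;
    }

    // Assumed election rule for the sketch: the first member of a group is its leader.
    static String leaderOf(List<String> group) {
        return group.get(0);
    }

    public static void main(String[] args) {
        List<String> servers = new ArrayList<>();
        for (int i = 0; i < 45; i++) {
            servers.add("container" + i);
        }
        for (List<String> group : partition(servers, TARGET_GROUP_SIZE)) {
            System.out.println(group.size() + " members, leader " + leaderOf(group));
        }
    }
}
```

Because each member only monitors the peers in its own group, the monitoring cost per member stays bounded as more servers join the grid.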
  
 This section covers the following topics:
 {toc:minLevel=3}
  
 h3. Failures
 There are several ways that a process can fail. The process could fail because a resource limit was reached, such as the maximum heap size, or because some process control logic terminated it. The operating system could fail, causing all of the processes running on the system to be lost. Less frequently, hardware such as the network interface card (NIC) can fail, disconnecting the operating system from the network. Many more points of failure exist, each causing the process to become unavailable. In this context, all of these failures can be categorized into one of two failure types: _process failure_ and _loss of connectivity_.
  
 h4. Process failure
 When a process fails, the operating system is responsible for cleaning up any leftover resources that the process was using. This cleanup includes port allocation and connectivity. A signal is sent over each connection that the failed process was using to close that connection. With these signals, any other process that is connected to the failed process can detect the failure almost immediately, so ObjectGrid reacts to process failures very quickly.
  
 h4. Loss of connectivity
 Loss of connectivity occurs when the operating system becomes disconnected from the network. As a result, the operating system cannot send signals to other processes. Loss of connectivity can occur for several reasons, which fall into two categories: host failure and islanding.
  
 h5. Host failure
 If the machine is unplugged from the power outlet, then it is gone instantly, and its operating system has no chance to signal the other processes.
  
 h5. Islanding
 Islanding is the most complicated failure condition for software to handle correctly. It is difficult to handle because the process is presumed to be unavailable, but is in fact still running.
  
 h3. ObjectGrid container failure
 ObjectGrid container failures are generally discovered by peer containers through the core group mechanism. When one or more containers fail, the catalog service migrates the shards that those containers hosted. For each lost primary shard, the catalog service looks for a synchronous replica first and falls back to an asynchronous replica only if no synchronous replica is available. After the primary shards are migrated to new host containers, the catalog service looks for new host containers for the replicas that are now missing.
  
 {info:title=Container Islanding}
 The catalog service migrates shards off of containers when the container is discovered to be unavailable. If those containers then become available, the catalog service considers the containers eligible for placement just like in the normal startup flow.
 {info}
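
The replica preference used during shard migration can be sketched as follows. The types `PromotionSketch`, `Replica`, and `ReplicaType` are hypothetical and are not part of the ObjectGrid API; the sketch only illustrates the ordering stated above, synchronous replicas before asynchronous ones.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only: these types are not part of the ObjectGrid API.
final class PromotionSketch {
    enum ReplicaType { SYNCHRONOUS, ASYNCHRONOUS }

    static final class Replica {
        final String container;
        final ReplicaType type;

        Replica(String container, ReplicaType type) {
            this.container = container;
            this.type = type;
        }
    }

    // Returns the first surviving replica of the given type, if any.
    private static Optional<Replica> firstOfType(List<Replica> replicas, ReplicaType type) {
        return replicas.stream().filter(r -> r.type == type).findFirst();
    }

    // Prefers a synchronous replica; falls back to an asynchronous one.
    static Optional<Replica> chooseNewPrimary(List<Replica> survivingReplicas) {
        Optional<Replica> sync = firstOfType(survivingReplicas, ReplicaType.SYNCHRONOUS);
        return sync.isPresent() ? sync : firstOfType(survivingReplicas, ReplicaType.ASYNCHRONOUS);
    }

    public static void main(String[] args) {
        List<Replica> survivors = Arrays.asList(
                new Replica("containerB", ReplicaType.ASYNCHRONOUS),
                new Replica("containerC", ReplicaType.SYNCHRONOUS));
        // Prints "containerC": the synchronous replica wins even though it is listed second.
        System.out.println(chooseNewPrimary(survivors).map(r -> r.container).orElse("none"));
    }
}
```

An empty result models the case in which no replica survived, so there is nothing to promote.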
  
 h4. Container failover detection latency
 Failures can be categorized into soft and hard failures. Soft failures are typically caused when a process fails. Such failures are detected by the operating system, which can recover used resources, such as network sockets, very quickly. Typical failure detection for soft failures is less than one second.
  
 Hard failures might take up to 200 seconds to detect with the default heartbeat tuning. Such failures include physical machine crashes, network cable disconnects, and operating system failures. ObjectGrid must rely on heartbeating, which can be configured, to detect hard failures. See [Configuring failover detection] for details on lowering the time it takes to detect a hard failure.
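
A minimal sketch of timeout-based heartbeat detection shows why hard failures take longer to notice than soft ones: with no close signal from the operating system, a peer is only suspected after it has been silent for the whole timeout. The class `HeartbeatSketch` and its methods are hypothetical, and the 200-second timeout in `main` simply mirrors the worst case quoted above; it is not a documented ObjectGrid setting.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: not the ObjectGrid heartbeat implementation.
final class HeartbeatSketch {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Records that a heartbeat from the member arrived at the given time.
    void recordHeartbeat(String member, long nowMillis) {
        lastSeen.put(member, nowMillis);
    }

    // A member is suspected if it was never seen or has been silent longer than the timeout.
    boolean isSuspected(String member, long nowMillis) {
        Long last = lastSeen.get(member);
        return last == null || nowMillis - last > timeoutMillis;
    }

    public static void main(String[] args) {
        HeartbeatSketch detector = new HeartbeatSketch(200_000L); // assumed value for the sketch
        detector.recordHeartbeat("containerA", 0L);
        System.out.println(detector.isSuspected("containerA", 150_000L)); // false
        System.out.println(detector.isSuspected("containerA", 200_001L)); // true
    }
}
```

Lowering the timeout shortens detection time, at the cost of a higher chance of suspecting a member that is merely slow.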
  
 h3. Catalog service failure
 The catalog service cluster is an ObjectGrid cluster. The catalog service cluster uses the core grouping mechanism in the same way as the container failure process. The primary difference is that the catalog service cluster uses a peer election process for defining the primary shard instead of the catalog service algorithm that is used for the containers.
  
 Note that the placement service and the core grouping service are one-of-N services, but the location service and administration run everywhere. The placement service and core grouping service are singletons because they are responsible for laying out the system. The location service and administration are read-only services and run everywhere to provide scalability.
  
 {note:title=Placement of the catalog server}
 The catalog service uses replication to make itself fault tolerant. If a catalog service process fails, restart it to restore the system to the desired level of availability. If all of the processes that host the catalog service fail, ObjectGrid loses critical data, and all of the containers must be restarted. Because the catalog service can run on many processes, this failure is an unlikely event. However, if you run all of the processes on a single box, within a single blade chassis, or behind a single network switch, a failure is more likely to occur. Try to remove common failure modes from the boxes that host the catalog service to reduce the possibility of failure.
 {note}
  
 h3. Multiple container failures
 A replica is never placed in the same process as its primary because if the process is lost, it would result in a loss of both the primary and the replica. The deployment policy defines an attribute that the catalog service uses to determine whether a replica can be placed on the same machine as a primary. In a development environment on a single machine, you might want to have two containers and replicate between them. However, in production, using a single machine is not a good idea because loss of that host results in the loss of both containers. To change between development mode on a single machine and a production mode with multiple machines, use the [development mode flag|Deployment policy configuration reference].
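
The placement rule described above can be expressed as a small predicate. The names `PlacementSketch`, `Location`, and `canHostReplica` are hypothetical, and the sketch assumes that turning on development mode is what permits a replica on the same machine as its primary; see the deployment policy reference for the actual attribute.

```java
// Hypothetical illustration only: not the ObjectGrid placement algorithm.
final class PlacementSketch {
    static final class Location {
        final String host;
        final String process;

        Location(String host, String process) {
            this.host = host;
            this.process = process;
        }
    }

    // Decides whether a replica may be placed at candidate, given where the primary runs.
    static boolean canHostReplica(Location primary, Location candidate, boolean developmentMode) {
        boolean sameHost = primary.host.equals(candidate.host);
        boolean sameProcess = sameHost && primary.process.equals(candidate.process);
        if (sameProcess) {
            return false; // never in the same process as the primary
        }
        if (sameHost) {
            return developmentMode; // same machine only in development mode (assumed semantics)
        }
        return true; // a different machine is always acceptable
    }

    public static void main(String[] args) {
        Location primary = new Location("host1", "jvm1");
        Location sameMachine = new Location("host1", "jvm2");
        System.out.println(canHostReplica(primary, sameMachine, true));  // true
        System.out.println(canHostReplica(primary, sameMachine, false)); // false
    }
}
```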
  
 h3. Additional information
 ObjectGrid high availability uses replication. A shard can replicate from one JVM to another. For more details about how high availability can be achieved by using replication, refer to the following links:
 * [Replication architecture]
 * [Replication programming]
 * [How ObjectGrid places shards on a grid]
 * [Replication events]
 * [Replication automatic repair mode]
 * [Client-side replication]
  * [Peer to peer Replication with JMS]
 * [Configuring failover detection]
  
  
  
 {include:pageTitle=OG_FOOTER}

 