Legacy platform

Ensuring against node failure

There are multiple ways to protect your system from node failure.

The term "node" refers to the physical computing hardware on which Sterling™ Order Management System Software runs.

Fortunately, due to advancements in hardware design, component redundancy, and automatic fault detection and correction, node failures caused by hardware faults are rare events. Take, for example, memory on an industrial-strength computer. Error Checking and Correcting (ECC) codes are built into the memory to correct single-bit errors and to detect double-bit errors. If needed, parts of the memory can be selectively disabled. Through techniques such as bit-scattering, memory chips are organized such that the failure of an entire memory module affects only a single bit within the ECC word. In addition, with techniques such as bit-steering, bits can be dynamically routed to spare memory chips. (Source: IBM® eServer p5 590 and 595 System Handbook, SG24-9119-00, IBM Corp., March 17, 2005.)
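The "correct single-bit errors, detect double-bit errors" behavior can be illustrated with a toy extended Hamming code. The following is a minimal sketch that protects 4 data bits with 4 check bits. It is not the ECC scheme used by any particular server, where ECC words are much wider, and the function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit word (toy extended Hamming code)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity; 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # overall parity over positions 1..7
    return code


def decode(code):
    """Return (nibble, status); nibble is None when the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of set positions is 0 for a valid word
    overall_odd = sum(code) % 2 == 1    # True when an odd number of bits flipped
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        code[syndrome] ^= 1             # single-bit error: syndrome is its position
        status = "corrected"
    else:
        return None, "double-error"     # even number of flips, nonzero syndrome
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
word[6] ^= 1                            # simulate one flipped memory bit
print(decode(word))
```

A single flipped bit is located by the syndrome and repaired. Two flipped bits leave the overall parity even but the syndrome nonzero, so the error is reported rather than silently miscorrected.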

Similarly, nodes are typically configured with multiple instances of critical components, such as power supplies and fans, so that they can continue to run after one or more components fail. Most of these components are also hot-swappable, so failed components can be replaced without shutting down the node.

Unfortunately, if the node does fail, the mean time to repair (MTTR) could be very high. In the best case, you only have to restart the node, restart the services, initiate recovery, and make the service available again. Depending on the size of the configuration, this could take 20 minutes or more. In the worst case, for example if the fault was due to a hardware failure, you may have to wait for replacement parts. In those situations, the MTTR could be days.
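To see how strongly MTTR drives downtime, the standard steady-state relation availability = MTBF / (MTBF + MTTR) can be applied to the two cases above. The following is a rough sketch only. The mean time between failures (MTBF) of one failure per year is an assumed figure chosen for illustration, not a measured value for any platform, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the node is in service."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(mtbf_hours, mttr_hours):
    """Expected hours of downtime per year for the given MTBF and MTTR."""
    return (1 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


ASSUMED_MTBF = 8760.0  # assumption: one node failure per year, on average

# Best case from the text (restart and recover, about 20 minutes)
# versus a worst case of waiting two days for replacement parts.
for label, mttr in [("restart, 20 minutes", 20 / 60), ("replacement parts, 2 days", 48.0)]:
    print(f"{label}: availability {availability(ASSUMED_MTBF, mttr):.5f}, "
          f"downtime {annual_downtime_hours(ASSUMED_MTBF, mttr):.2f} hours/year")
```

With the same failure rate, the expected downtime scales almost directly with MTTR, which is why the failover options that follow focus on shortening recovery rather than on preventing every fault.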

The impact of a node outage depends on the services that run on that node. If the node was running a few agent servers, the impact could be isolated to the services provided by those agents. In contrast, if the outage is on the database server node (and the database is not clustered), the entire application is unavailable.

If your tolerance for downtime is low, you have the following options:

  • Ensure that your nodes are composed of highly redundant servers (as described above) to reduce the likelihood of a node outage caused by hardware faults.
  • Use an active/passive (primary/standby) failover configuration, in which one or more passive or standby nodes are available to take over for failed nodes. See "Active/Passive Cluster Failover Configurations" for more information. Subsequent sections use this approach to protect critical components such as message queues and the application and agent servers. "Active/Passive Failover Configurations" presents active/passive configurations for database servers.
  • Use the clustering capabilities built into application servers and Oracle Real Application Clusters to protect against an outage caused by a single node failure. "Active/Active Failover Configurations" describes the use of an active/active clustered database failover configuration.
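
The active/passive option can be pictured as a small control loop: the active node serves requests until a health check fails, at which point a healthy standby is promoted. The following is a conceptual sketch only, not how any cluster manager or Sterling Order Management component implements failover. The class `FailoverGroup`, the node names, and the health-check callable are all hypothetical.

```python
class FailoverGroup:
    """Toy active/passive group: one active node, the rest are standbys."""

    def __init__(self, nodes, is_healthy):
        self.nodes = list(nodes)        # the first node starts as the active node
        self.is_healthy = is_healthy    # callable: node name -> bool
        self.active = self.nodes[0]

    def check(self):
        """Return the node that should serve requests, promoting a standby if needed."""
        if self.is_healthy(self.active):
            return self.active
        for node in self.nodes:
            if node != self.active and self.is_healthy(node):
                self.active = node      # promote the first healthy standby
                return self.active
        raise RuntimeError("no healthy node available")


# Hypothetical two-node group with a simulated health table.
health = {"node-a": True, "node-b": True}
group = FailoverGroup(["node-a", "node-b"], lambda name: health[name])
print(group.check())
health["node-a"] = False                # simulate a failure of the active node
print(group.check())
```

In a real deployment the promotion step also has to move shared resources, such as storage and network addresses, to the standby node. That work accounts for most of the failover time and is omitted here.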