Reliability of Clustered Installations

A reliable Sterling B2B Integrator environment requires a variety of other reliable components, including the network, disk storage, and database.

Mission-critical business processes require reliable systems. As organizations rely more on automated processes, the consequences of even small, unintentional outages can be severe:

Payments not made
Orders not received
Products not delivered

Redundancy improves reliability, as measured by Mean Time Between Failures (MTBF).
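
As a rough illustration of why redundancy helps, the following Python sketch estimates the availability of a single component from its MTBF and then of a redundant pair. It rests on assumptions that are not stated in this topic: a mean time to repair (MTTR) figure, failures that are independent of each other, and made-up numbers chosen only to keep the arithmetic readable.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time one component is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent components when any one suffices."""
    return 1.0 - (1.0 - single) ** copies


# Illustrative numbers only (assumed, not measured).
single_link = availability(mtbf_hours=999.0, mttr_hours=1.0)
dual_link = redundant_availability(single_link, copies=2)

print(f"one link:  {single_link:.6f}")
print(f"two links: {dual_link:.6f}")
```

Real components rarely fail independently (a shared conduit or power feed can take out both), which is why the items that follow stress widely separated routes and duplicated upstream components.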

A reliable installation makes use of the following:

Redundant network connections
More than one connection from the server to the network is necessary. This includes the connections from the data center to the point of presence, where a communications service provider takes over. More than one connection must leave the premises, and the connections must follow widely separated routes.
Redundant network paths
Upstream network components are duplicated, including DNS (Domain Name System) servers, firewalls, HTTP servers, and any other components with which the application server communicates.

The following illustration shows both redundant network connections and end-to-end paths:
Reliable server hardware
Reliable server hardware includes error-detecting and error-correcting memory, such as ECC memory that uses SECDED (single error correction, double error detection), and hot-swappable components. Although the basic components of servers are very reliable, this hardware prevents data corruption by diagnosing internal failures and allowing repair without interrupting operation. This increases the reliability of the entire environment, because these kinds of errors are not easily detectable by other means.
Mirrored or RAID disk storage
Disk storage must be either mirrored or configured as a Redundant Array of Inexpensive Disks (RAID), including any temporary storage that is used. The risk of data loss or corruption is minimized by storing data on at least two disks, duplicating the interfaces, and cross-connecting the controllers.

Mirrored disks write each byte of data to two disks. If one disk fails, data is read from the other disk. A RAID array stores data on some number of disks plus parity on another. If a disk fails, data is regenerated from the remaining disks plus parity.
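
The parity-based recovery described above can be sketched in a few lines of Python. This is a minimal illustration that assumes simple XOR parity over equal-sized blocks, with in-memory byte strings standing in for disks; real RAID controllers add striping, parity rotation, and much more.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data "disks", each holding one 4-byte block.
data_disks = [b"PAYM", b"ORDR", b"SHIP"]

# The parity "disk" holds the XOR of all data blocks.
parity = xor_blocks(data_disks)

# Simulate losing the second disk: rebuild it from the survivors plus parity.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_disks[1])
```

Because XOR is its own inverse, combining the surviving blocks with the parity block cancels everything except the missing block.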

The following illustration shows the mirrored and RAID techniques:
High availability database server
A database server environment must be configured for high availability, either through clustering or failover. Load balancing and failover are provided entirely by the database vendor's solution. For Sterling B2B Integrator, any JDBC database driver that already supports transparent failover should work, but might require a small amount of customization. If database redundancy is achieved through failover (including IP address takeover), then Sterling B2B Integrator rolls back any transactions that are in progress during the time the database is unavailable.
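
Work that is rolled back during a failover generally has to be submitted again once the database is reachable. The following Python sketch shows one generic way a calling program might retry such work. It is not Sterling B2B Integrator code and does not describe how the product behaves internally; `TransientDatabaseError`, `run_with_retry`, and the retry limits are hypothetical names and values chosen for illustration.

```python
import time


class TransientDatabaseError(Exception):
    """Hypothetical stand-in for a driver's 'connection lost' error."""


def run_with_retry(transaction, attempts=3, delay_seconds=0.0):
    """Call `transaction` until it succeeds or `attempts` is exhausted.

    The pause grows linearly with the attempt number so that a database
    in the middle of a failover has time to come back.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return transaction()
        except TransientDatabaseError as error:
            last_error = error
            time.sleep(delay_seconds * attempt)
    raise last_error


# Simulated transaction that fails twice (as if rolled back), then commits.
calls = {"count": 0}


def flaky_transaction():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientDatabaseError("database unavailable")
    return "committed"


result = run_with_retry(flaky_transaction)
print(result, "after", calls["count"], "attempts")
```

Retrying is only safe when the transaction is idempotent or was fully rolled back, so a design along these lines needs to be checked against the behavior of the specific database and driver.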
High availability auxiliary components
If other facilities are being incorporated into the Sterling B2B Integrator environment, like middleware, ERP systems, or web servers, these additional components must be included in the high availability evaluation and design. If these auxiliary components fail, a substantial amount of the business functionality for the environment might be unavailable, even if Sterling B2B Integrator is fully functional.
Reliable power
Reliable power is required, ideally with full redundancy all the way from the power substations, through electrical panels, to the components themselves. Uninterruptible power supplies (UPS), generators, or both are also necessary. Single points of failure in this area lead to outages, such as those caused by cable cuts, that often require hours to remedy.
Technical and procedural methods
Business process developers must account for possible failure modes, both technical and procedural. You should focus on detecting problems and ensuring that correct and timely notification occurs. Not every kind of failure can be handled through technical solutions. Providing for human intervention in troubleshooting and correcting high-level process problems can be the critical factor in deploying a truly reliable process.
Monitoring
Sterling B2B Integrator detects certain kinds of system errors, but it is vitally important that you monitor the network, participating servers, and database environment to provide detailed diagnostic information and to detect and locate failures quickly in any domain. This monitoring infrastructure is outside the scope of Sterling B2B Integrator. Although Sterling B2B Integrator is not packaged with components to integrate specifically with such environments, it is generally possible to integrate with them by using standard Sterling B2B Integrator functionality (OPS Command Line, HTTP, FTP, file system, or database adapters).
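
As a small example of the kind of external check that a monitoring system performs, the following Python sketch tests whether a TCP port on a participating server accepts connections. It uses only the Python standard library and is independent of Sterling B2B Integrator; the function name `port_is_open` is hypothetical, and the loopback listener exists only so that the example is self-contained.

```python
import socket


def port_is_open(host, port, timeout_seconds=2.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False


# Self-contained demonstration: listen on a free loopback port and probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
listener.listen(1)
demo_port = listener.getsockname()[1]

reachable = port_is_open("127.0.0.1", demo_port)
listener.close()

print("reachable while listening:", reachable)
```

A port that accepts connections shows only that something is listening, not that the application behind it is healthy, so a production check would normally go on to exercise the service itself.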

Many monitoring packages are available, including some from network or server vendors and others from third parties. IBM®, HP, Sun, Computer Associates, and many others provide monitoring products. The open source products Nagios and Big Brother are also widely used.

At a higher level, Sterling B2B Integrator provides an alert mechanism that sends notification when a business process fails. On-fault processing is also available to take corrective action, which can be different for each individual business process. These mechanisms can be integrated with other monitoring or help desk software that is already in use.
Documentation and automation
From an operational standpoint, documentation and automation are the two biggest improvements that can be made to overall system reliability. Documentation dramatically reduces the risk of staff not understanding what to do, and automation helps ensure that complex procedures are executed consistently and reliably. Automation is unique to each installation and is probably the most important element of any reliable business process.