Planning for network availability

High availability networks provide redundant infrastructure that can be switched on when the primary network resources experience performance problems or failures of any kind. Learn how to plan for network availability by taking a closer look at your network.

When planning for network availability, you must first determine your necessary degree of uptime. Systems with better than 99% uptime are considered fault tolerant. As the availability percentage approaches 100%, you move into the range of high availability networks. The closer you get to 100% uptime, the more expensive that availability becomes. Therefore, you need to develop a good business case for high availability networks. For example, application service providers need high availability (99.9999% uptime), but your corporate Web site might only need 99.9% uptime. The difference in cost can be substantial, depending on the size and scale of your network.
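To put those percentages in concrete terms, a short calculation (illustrative only; it assumes a 365-day year) converts an availability percentage into the downtime it permits per year:

```python
# Convert an availability percentage into the maximum downtime it allows.
# Illustrative arithmetic only; assumes a 365-day year.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes_per_year(availability_percent):
    """Return the minutes of downtime per year that a given
    availability percentage permits."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for pct in (99.0, 99.9, 99.999, 99.9999):
    print(f"{pct}% uptime allows {downtime_minutes_per_year(pct):.2f} minutes of downtime per year")
```

At 99% uptime a system can be down more than 87 hours a year; at 99.9999%, barely half a minute. That gap is what drives the cost difference described above.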

Before you begin
__ Create a table of applications that require fault tolerant or high-availability networks.
__ Identify the parts of the network topology that are used by those applications.
Network availability planning tasks
__ Identify single points of failure

The easiest and most economical way to improve network availability is to remove single points of failure. A single point of failure exists wherever there is only one physical connection between parts of a network. Many different network topologies can help you remove single points of failure. The basic principle is to provide multiple connections to individual servers and other network resources, so that if one node or link fails, traffic can be rerouted around the failed component.
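As an illustration of the idea (not any particular product's tooling), a topology can be modeled as a graph and checked for nodes whose failure would disconnect it. The node names below are hypothetical; the algorithm is the standard depth-first search for articulation points:

```python
# Model a network topology as an undirected graph and find articulation
# points: nodes that are single points of failure because removing them
# disconnects the network. Node names are hypothetical examples.

def articulation_points(graph):
    """Return the set of nodes whose removal disconnects the graph
    (standard DFS low-link algorithm)."""
    disc, low, points = {}, {}, set()
    counter = [0]

    def dfs(node, parent):
        disc[node] = low[node] = counter[0]
        counter[0] += 1
        children = 0
        for nbr in graph[node]:
            if nbr not in disc:
                children += 1
                dfs(nbr, node)
                low[node] = min(low[node], low[nbr])
                # A non-root node is a cut vertex if some child subtree
                # cannot reach any of the node's ancestors.
                if parent is not None and low[nbr] >= disc[node]:
                    points.add(node)
            elif nbr != parent:
                low[node] = min(low[node], disc[nbr])
        # The DFS root is a cut vertex if it has two or more children.
        if parent is None and children > 1:
            points.add(node)

    for node in graph:
        if node not in disc:
            dfs(node, None)
    return points

# Hypothetical topology: every path between the offices runs through 'core'.
topology = {
    "office_a": {"core"},
    "office_b": {"core"},
    "core": {"office_a", "office_b", "backbone"},
    "backbone": {"core"},
}
print(articulation_points(topology))  # prints {'core'}
```

Adding a second link between any office and the backbone would remove `core` from the result, which is exactly the redundancy principle described above.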

__ Plan for fault tolerance
Fault-tolerant networks have very few single points of failure, if any. In addition, fault-tolerant networks have disaster-recovery hardware at each node. Typical hardware measures per node include:
  • Replicated hardware subsystems

    If a network is important enough, a second server, router, or other device is available at each node in case of system failure of the primary device.

  • Standby hardware

    An example of standby hardware is a redundant array of independent disks (RAID), which provides redundant storage and typically supports hot-swappable storage media.

  • Fast boot methods

    You need to be able to dump and reboot in the shortest possible time to maximize uptime.

  • Backup power

    Plan to connect as many nodes as you can to uninterruptible power supplies. Large data centers should have backup generators as well.

  • Total remote management

    You should be able to remotely diagnose and reboot servers regardless of their state.

  • Concurrent backup and restore

    Make sure that you can switch to the backup system as soon as a failure is detected, and that backups can resume in real time.
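The payoff of replicated hardware subsystems can be quantified with a simple model. Assuming component failures are independent (a simplification), the availability of a redundant set where any one copy can carry the load is 1 - (1 - a)^n:

```python
# Estimate the availability gained by replicating a hardware subsystem.
# Assumes independent failures, which is a simplifying assumption.

def combined_availability(single_availability, copies):
    """Availability of `copies` redundant components, any one of which
    can carry the load, assuming independent failures."""
    return 1 - (1 - single_availability) ** copies

# A single 99% component versus a replicated pair and a triple:
for n in (1, 2, 3):
    print(f"{n} copies: {combined_availability(0.99, n):.4%} available")
```

Duplicating a 99% component yields roughly 99.99% availability, which is why replicated subsystems and standby hardware are the first measures listed above.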

To learn how to plan for high availability and clusters, see Planning for availability.

__ Plan for clustering

Clustering is the process of connecting a large number of servers to achieve continuous (100%) uptime. Many server families support clustering, as do several software packages, such as WebSphere® application and Web server software. Clustering can be relatively straightforward for continuous or steady-state usage. The challenge is to maintain uptime during routine maintenance or while upgrading systems within a cluster.

The basic principle behind clustering is virtualization: although the servers in a group are physically distinct, they are logically indistinguishable. Part of the virtualization process is virtual IP addressing, which assigns IP addresses to a pool of servers rather than to each physical server. In this way, no rerouting is required when one server goes down; a backup server in the same cluster simply takes over its workload.

In System i®, you can use virtual IP addresses to provide redundancy of physical adapters by not having a given virtual IP address assigned to a single physical adapter.
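The failover idea behind virtual IP addressing can be sketched in a few lines. This is an illustrative model, not System i or WebSphere behavior; the server names and the pool class are hypothetical:

```python
# Sketch of failover behind a virtual IP address: clients use one logical
# endpoint while work is steered to any live server in the pool.
# The class and server names are hypothetical, for illustration only.

import random

class VirtualAddressPool:
    """One logical endpoint fronting a pool of physically distinct servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.down = set()

    def mark_down(self, server):
        """Record that a server has failed or is under maintenance."""
        self.down.add(server)

    def mark_up(self, server):
        """Return a recovered server to service."""
        self.down.discard(server)

    def route(self):
        """Pick any live server; callers never see which one failed."""
        live = [s for s in self.servers if s not in self.down]
        if not live:
            raise RuntimeError("no servers available behind the virtual address")
        return random.choice(live)

pool = VirtualAddressPool(["node1", "node2", "node3"])
pool.mark_down("node1")   # simulate a failure of the primary
print(pool.route())       # traffic continues on node2 or node3
```

Because clients only ever address the pool, taking a node down for the routine maintenance or upgrades mentioned above is the same operation as surviving a failure.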

After you finish

When you have completed these tasks, you should have a network availability plan that identifies these elements:
__ Record a list of all single points of failure and plan to create redundancy.
__ Record a list of hardware that requires backup and disaster recovery measures.
__ Record a list of servers that will be part of a cluster, and select clustering software that enables you to implement your plan.