Recovery and avoidance strategies
When making purchasing and system design decisions about availability, it is tempting to dive into long lists of high availability features and technologies. However, best practices for making and keeping your system highly available are as much about making good design and configuration choices, and about designing and practicing sound administrative procedures and emergency plans, as they are about buying technology.
You will get the most comprehensive availability for your investment by first identifying the high availability strategies that best suit your business demands. Then you can implement your strategies, choosing the most appropriate technology.
When designing or configuring your database solution for high availability, consider how to avoid outages, how to minimize their impact, and how to recover your system quickly.
- Avoid outages
Whenever possible, avoid outages. For example, remove single points of failure to avoid unplanned outages, or investigate methods for performing maintenance activities online to avoid planned outages. Monitor your database system to identify trends in system behavior that indicate problems, and resolve the problems before they cause an outage.
- Minimize the impact of outages
You can design and configure your database solution to minimize the impact of planned and unplanned outages. For example, distribute your database solution so that components and functionality are localized, allowing some user applications to continue processing transactions even when one component is offline.
- Recover quickly from unplanned outages
Make a recovery plan. Create clear, well-documented procedures that administrators can follow easily and quickly in the event of an unplanned outage. Create clear architectural documents that describe all components of the systems involved. Keep service agreements and contact information well organized and close at hand. While recovering quickly is vitally important, also know what diagnostic information to collect so that you can identify the root cause of the outage and avoid it in the future.
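As a rough illustration of why removing single points of failure helps avoid unplanned outages, the following sketch compares the estimated availability of a solution with one database server against the same solution with a redundant pair. It is a simplified model, not a sizing method: it assumes component failures are independent and that failover is instantaneous, and the 0.99 availability figures and the function names are hypothetical values chosen for the example.

```python
from math import prod


def series_availability(components):
    """Availability of a chain in which every component must be up.

    Assumes failures are independent, so the availabilities multiply.
    """
    return prod(components)


def redundant_availability(replicas):
    """Availability of a group that is up while at least one replica is up.

    Assumes independent failures and instantaneous failover.
    """
    return 1 - prod(1 - a for a in replicas)


# Hypothetical figures: an application tier and a database server,
# each available 99% of the time.
app_tier = 0.99
db_server = 0.99

# The database server is a single point of failure.
single = series_availability([app_tier, db_server])

# The database server is replaced by a redundant pair.
paired = series_availability(
    [app_tier, redundant_availability([db_server, db_server])]
)

print(f"single database server:  {single:.4f}")
print(f"redundant database pair: {paired:.4f}")
```

Under these assumptions the redundant pair raises the estimated availability of the whole solution, and the application tier becomes the limiting component. Real systems rarely satisfy the independence assumption, so treat the numbers as a way to compare designs rather than as a prediction.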