Outage signatures

An outage signature is a collection of symptoms and behaviors that characterize an outage. The signature of an outage may vary from temporary performance issues that result in slow response times for end users to complete site failure. Consider how these variations impact your business when devising strategies for avoiding, minimizing, and recovering from outages.

Blackout

A blackout type of outage is experienced when a system is completely unavailable to its end users. This type of outage may be caused by problems at the hardware, operating system, or database level. When a blackout occurs, it is imperative that you identify the scope of the outage immediately. Is the outage purely at the database level? Is the outage at the instance level? Or is it at the operating system or hardware level?
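One way to make these scope questions concrete is to probe each layer from the bottom up and report the lowest layer that fails. The following Python sketch is illustrative only: the layer names come from the paragraph above, while find_outage_scope and the probe functions are hypothetical placeholders that you would replace with real health checks for your environment.

```python
from typing import Callable, Dict, Optional

# Layers ordered from lowest to highest. A failure at a lower layer
# explains failures at every layer above it.
LAYERS = ["hardware", "operating system", "instance", "database"]


def find_outage_scope(probes: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Return the lowest layer whose probe fails, or None if all pass.

    `probes` maps each layer name to a hypothetical health check that
    returns True when that layer is healthy.
    """
    for layer in LAYERS:
        if not probes[layer]():
            return layer
    return None


# Stubbed probes: the host and operating system respond, the instance does not.
probes = {
    "hardware": lambda: True,
    "operating system": lambda: True,
    "instance": lambda: False,
    "database": lambda: False,
}
scope = find_outage_scope(probes)
print("Outage scope:", scope)
```

Checking the lowest layer first avoids misreading a hardware or operating system failure as a database problem.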

Brownout

A brownout type of outage is experienced when system performance slows to a point where end users cannot effectively get their work done. The system as a whole may be up and running, but from the point of view of the end users it is not working as expected. This type of outage may occur during system maintenance windows and peak usage periods. Typically, the CPU and memory are near capacity during such outages. Poorly tuned or overutilized servers often contribute to brownouts.
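The distinction between the two signatures can be sketched as a simple classification of monitoring samples. This is a minimal Python illustration, not a prescribed method: the Sample fields, the 2-second response limit, and the 90% saturation threshold are all assumptions chosen for the example, and you would set them from your own service-level targets.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    reachable: bool            # did the system answer at all?
    response_seconds: float    # observed end-user response time
    cpu_utilization: float     # fraction of CPU in use, 0.0 to 1.0
    memory_utilization: float  # fraction of memory in use, 0.0 to 1.0


def classify(sample: Sample, max_response_seconds: float = 2.0) -> str:
    """Label a sample as 'blackout', 'brownout', or 'normal'.

    The response-time limit is an assumed threshold, not a standard value.
    """
    if not sample.reachable:
        return "blackout"
    if sample.response_seconds > max_response_seconds:
        return "brownout"
    return "normal"


def near_capacity(sample: Sample, saturation: float = 0.9) -> bool:
    """Report whether CPU and memory are both at or above the assumed threshold."""
    return (sample.cpu_utilization >= saturation
            and sample.memory_utilization >= saturation)


slow = Sample(reachable=True, response_seconds=8.5,
              cpu_utilization=0.97, memory_utilization=0.93)
print(classify(slow), near_capacity(slow))
```

Resource saturation is kept as a separate check because it is a typical companion of a brownout rather than its definition: what makes the outage a brownout is that end users cannot get their work done.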

Frequency and duration of outages

In conversations about database availability, the focus is often on the total amount or the percentage of downtime (or conversely the amount of time the database system is available) for a given time period. However, the frequency and duration of planned or unplanned outages make a significant difference to the impact that those outages have on your business.

Consider a situation in which you have to make some upgrades to your database system that will take seven hours to perform, and you can choose between taking the database system offline for an hour every day during a period of low user activity or taking the database system offline for seven hours during the busiest part of your busiest day. Clearly, several small outages would be less costly and harmful to your business activities than the single, seven-hour outage. Now consider a situation in which you have intermittent network failures, possibly for a total of a few minutes every week, which cause a small number of transactions to fail with regular frequency. Those very short outages might end up costing you a great deal of revenue and irreparably damage the confidence of your customers in your business, resulting in even greater losses of future revenue.

Don't focus exclusively on the total outage (or available) time. Weigh the cost of fewer, longer outages against the cost of multiple, smaller outages when making decisions about maintenance activities or when responding to an unplanned outage. In the middle of an outage, it can be difficult to make such judgments, so create a formula or method in advance to calculate the cost to your business of these outage signatures so that you can make the best choices.
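As one possible starting point for such a formula, the following Python sketch compares two maintenance schedules that have the same total downtime. It is a deliberately simple model under stated assumptions: each outage costs the revenue normally earned during that window plus a fixed per-outage overhead. All monetary figures are invented for illustration, and the model ignores harder-to-quantify effects such as lost customer confidence.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Outage:
    duration_hours: float
    revenue_per_hour: float  # revenue normally earned during this time window
    fixed_cost: float = 0.0  # assumed per-outage overhead, such as restart effort


def outage_cost(outage: Outage) -> float:
    """Cost of one outage: lost revenue plus the fixed overhead."""
    return outage.duration_hours * outage.revenue_per_hour + outage.fixed_cost


def total_cost(outages: Iterable[Outage]) -> float:
    """Sum the cost of every outage in a schedule."""
    return sum(outage_cost(o) for o in outages)


# Seven one-hour outages during low activity (illustrative figures only).
off_peak = [Outage(1.0, 500.0, 100.0) for _ in range(7)]
# One seven-hour outage at the busiest time (illustrative figures only).
peak = [Outage(7.0, 20000.0, 100.0)]

print("Off-peak schedule:", total_cost(off_peak))
print("Peak schedule:", total_cost(peak))
```

Both schedules total seven hours of downtime, so an availability percentage alone would rate them equally. Only the cost calculation shows the difference between them.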

Multiple and cascading failures

When you are designing your database solution to avoid, minimize, and recover from outages, keep in mind the possibility that multiple components might fail at the same time, or even that the failure of one component might cause another component to fail.
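To reason about cascading failures during design, you can model which components depend on which and trace how one failure propagates. The following Python sketch assumes a worst case in which a component fails whenever anything it depends on fails, with no redundancy. The cascade function and the component names are hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def cascade(dependents: Dict[str, List[str]],
            initial_failures: Iterable[str]) -> Set[str]:
    """Return every component that fails, directly or through a cascade.

    `dependents` maps a component to the components that rely on it.
    Worst-case assumption: a component fails if any dependency fails.
    """
    failed = set(initial_failures)
    queue = deque(failed)
    while queue:
        component = queue.popleft()
        for dependent in dependents.get(component, []):
            if dependent not in failed:
                failed.add(dependent)
                queue.append(dependent)
    return failed


# Hypothetical dependency map for illustration.
dependents = {
    "power supply": ["storage array", "database server"],
    "storage array": ["database server"],
    "database server": ["application"],
}
affected = cascade(dependents, ["storage array"])
print(sorted(affected))
```

Passing more than one component as the initial failure models components failing at the same time. Components that take down many others when they fail are candidates for redundancy.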