Independent versus correlated failures

Failures are usually modeled as randomly occurring events. This holds even outside the domain of storage systems.

Independent system failures

Consider a car that has some probability of breaking down during its commute. Failures are also often assumed to be statistically independent events: an engine breakdown in one car does not increase the likelihood that another car on the road breaks down.

A correlated failure is an event where this assumption of statistical independence does not hold.

Correlated system failures

If the first car failed because of bad motor oil, you can no longer assume that the failure is independent, because there is some chance that other cars on the road were given the same bad oil. Many vehicles might then fail together, within a short period, and for the same underlying reason. These failures are no longer independent, but correlated.
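The effect of a shared cause can be made concrete with a small calculation. The following is a minimal sketch, not a measured model: the probabilities `P_BAD_BATCH`, `P_FAIL_BAD_OIL`, and `P_FAIL_GOOD_OIL` are made-up illustrative values, and the sketch assumes that two cars receive oil from the same batch and, given the batch quality, fail independently of each other.

```python
# Illustrative, assumed numbers (not taken from any real data set).
P_BAD_BATCH = 0.10      # chance that the shared oil batch is bad
P_FAIL_BAD_OIL = 0.50   # chance a car fails if it received bad oil
P_FAIL_GOOD_OIL = 0.01  # chance a car fails if it received good oil


def p_one_car_fails() -> float:
    """Marginal probability that a single car fails."""
    return (P_BAD_BATCH * P_FAIL_BAD_OIL
            + (1 - P_BAD_BATCH) * P_FAIL_GOOD_OIL)


def p_both_cars_fail() -> float:
    """Probability that two cars sharing the same oil batch both fail."""
    return (P_BAD_BATCH * P_FAIL_BAD_OIL ** 2
            + (1 - P_BAD_BATCH) * P_FAIL_GOOD_OIL ** 2)


p_a = p_one_car_fails()
p_ab = p_both_cars_fail()
# Conditional probability that the second car fails, given the first failed.
p_b_given_a = p_ab / p_a

print(f"P(car fails)                    = {p_a:.4f}")
print(f"P(both fail), shared batch      = {p_ab:.4f}")
print(f"P(both fail), if independent    = {p_a * p_a:.4f}")
print(f"P(second fails | first failed)  = {p_b_given_a:.4f}")
```

If the failures were independent, the probability that the second car fails given that the first failed would equal the marginal probability of a single failure. Under these assumed numbers it is several times higher, which is what correlation means here.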

Correlated failures can cause a storage system to fail even when it has a high degree of redundancy or replication.
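To see why redundancy alone does not protect against correlated failures, consider a rough sketch of a replicated system. All names and numbers below are hypothetical: `p_drive` is an assumed per-drive failure probability over some period, and `p_common` is an assumed probability of a shared event (for example, a faulty firmware update or a power event) that takes out every replica at once.

```python
def loss_probability_independent(p_drive: float, replicas: int) -> float:
    """Probability that all replicas fail, assuming independent failures."""
    return p_drive ** replicas


def loss_probability_with_common_cause(
    p_drive: float, replicas: int, p_common: float
) -> float:
    """Probability that all replicas fail when a shared event can occur.

    With probability p_common the shared event fails every replica;
    otherwise the drives fail independently of each other.
    """
    return p_common + (1.0 - p_common) * p_drive ** replicas


# Illustrative, assumed values.
p_drive = 0.02    # per-drive failure probability in the period
replicas = 3      # number of copies of the data
p_common = 0.001  # probability of the shared, correlated event

independent = loss_probability_independent(p_drive, replicas)
correlated = loss_probability_with_common_cause(p_drive, replicas, p_common)

print(f"data loss, independent failures: {independent:.2e}")
print(f"data loss, with common cause:    {correlated:.2e}")
```

Under these assumptions, even a rare shared event dominates the overall loss probability, because adding replicas only shrinks the independent term and leaves the common-cause term unchanged.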

In the storage industry, the failure of one hard disk drive, if statistically independent, does not increase the chance that other drives fail.