Predicting disk failures for reliable clouds

When IT equipment in a data center fails, valuable information and economic resources are at stake. Downtime is notoriously expensive: according to the Emerson Network Power study, nearly US$9,000 is lost for every minute a data center is out of action.

Four scientists at IBM Research – Zurich have devised an algorithm to combat disk failure preemptively. By applying machine learning techniques, they can predict with up to 98 percent accuracy whether a disk needs to be replaced, saving data center operators a lot of anguish.

One of those four scientists, Mirela Botezatu, the lead author of “Predicting Disk Replacements towards Reliable Data Centers,” will present this work at this week’s ACM SIGKDD Conference in San Francisco. I caught up with Mirela to ask about her research before she left for the conference.

Why did you choose to focus on hard drive disks in your paper?

Mirela Botezatu, pre-doctoral researcher at IBM Research – Zurich, works on performance analysis and optimization of workflow graphs

Mirela Botezatu: In the cloud-based world, we store most of our personal data in data centers containing thousands of disks. Even though storage systems run several defense mechanisms to prevent data loss, there are still threats the systems cannot cope with, such as simultaneous disk failures on RAID 5 arrays. Predicting well in advance when a disk should be replaced could mitigate this risk.

Why are disks susceptible to failure?

MB: Unlike most other components, which simply send electrical signals to one another, disks have many moving parts, which make them more prone to mechanical failure. Furthermore, as opposed to power supplies or batteries, which also fail frequently but only cause temporary data unavailability, disk failures can lead to permanent data loss.

What are the steps from data collection to a prediction of disk failure?

MB: To begin, we cast the replacement prediction problem as a binary classification task, discriminating between disks that are healthy and disks that need to be replaced.

For this we devised an analysis pipeline consisting of several steps. First, we identified which of the SMART attributes (i.e., monitoring data collected from the disk’s sensors) correlate with disk failure. Not only does this give the community the subset of SMART attributes that are indicative of failure, it also lets our prediction pipeline rule out the “noisy” parameters. We identified these attributes via change-point detection in the time series.
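The interview doesn’t specify which change-point detector the team used, but the idea can be sketched with a simple CUSUM-style mean-shift test: an attribute whose statistics shift before failure shows a strong change point, while a noisy-but-stationary attribute does not. Function name and threshold below are illustrative, not from the paper.

```python
def cusum_change_point(series, threshold=5.0):
    """Detect a mean shift in a time series with a simple CUSUM statistic.

    Returns the index of the strongest change point, or None if the
    cumulative standardized deviation never exceeds `threshold`.
    """
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series) / n
    if var == 0:
        return None  # constant signal: nothing to detect
    std = var ** 0.5

    s, best, best_k = 0.0, 0.0, None
    for i, v in enumerate(series):
        s += (v - mean) / std  # running standardized deviation from the mean
        if abs(s) > best:
            best, best_k = abs(s), i
    return best_k if best > threshold else None

# A SMART attribute that is flat while healthy, then jumps before failure,
# yields a change point at the shift; a constant attribute yields none.
print(cusum_change_point([0] * 50))            # -> None
print(cusum_change_point([0] * 40 + [5] * 10)) # -> 39
```

An attribute for which no change point precedes failures across many disks can then be dropped as uninformative.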

Next, we aggregated the disk’s sensor data into a compact representation that is stable and highly informative. We accomplished this via exponential smoothing over a time window.
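Exponential smoothing over a window is a standard technique: recent readings get exponentially more weight than old ones, collapsing a window of raw sensor values into one stable feature. A minimal sketch (the smoothing factor `alpha` is an assumed value, not taken from the paper):

```python
def exponential_smoothing(window, alpha=0.5):
    """Collapse a window of raw sensor readings into one smoothed value.

    Each step blends the newest reading with the running estimate, so
    recent values dominate while transient spikes are damped.
    """
    smoothed = window[0]
    for v in window[1:]:
        smoothed = alpha * v + (1 - alpha) * smoothed
    return smoothed

# Three quiet readings followed by a spike: the spike is reflected,
# but attenuated rather than passed through at full strength.
print(exponential_smoothing([0, 0, 0, 8]))  # -> 4.0
```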

As we had many more data points labeled “healthy” than those labeled “replace,” we performed informed down-sampling of the healthy class to keep only the most informative data points.
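The interview doesn’t spell out the selection criterion behind “informed” down-sampling. One plausible notion of “most informative,” sketched below, is to keep the healthy samples that lie closest to the failed ones, since points near the failure region do the most to pin down the decision boundary (names and distance choice are illustrative):

```python
def downsample_healthy(healthy, failed, keep):
    """Keep only the `keep` healthy points closest to any failed point.

    Healthy samples near the failure region are the hardest to classify,
    so they carry the most signal; easy, far-away healthy points are
    discarded to rebalance the classes.
    """
    def dist_to_failed(p):
        # squared Euclidean distance to the nearest failed sample
        return min(sum((a - b) ** 2 for a, b in zip(p, f)) for f in failed)

    return sorted(healthy, key=dist_to_failed)[:keep]

healthy = [(0.0, 0.0), (10.0, 10.0), (5.0, 5.0)]
failed = [(6.0, 6.0)]
print(downsample_healthy(healthy, failed, 2))  # keeps the two nearest points
```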

Lastly, we explored and tuned a classification algorithm for high-accuracy identification of disks that need to be replaced.
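The interview doesn’t name the final classifier. To make this last step concrete, here is the simplest possible stand-in, a decision stump, which scans every feature/threshold pair and keeps the one with the fewest training errors (a sketch only, not the method from the paper):

```python
def train_stump(samples, labels):
    """Fit a one-feature threshold classifier ('decision stump').

    Returns a predictor that labels a sample True ('replace') when its
    best feature exceeds the learned threshold.
    """
    best = None  # (training errors, feature index, threshold)
    n_features = len(samples[0])
    for j in range(n_features):
        for t in sorted(set(s[j] for s in samples)):
            errors = sum(
                1 for s, y in zip(samples, labels) if (s[j] > t) != y
            )
            if best is None or errors < best[0]:
                best = (errors, j, t)
    _, j, t = best
    return lambda s: s[j] > t

# Toy data: one smoothed SMART feature; high values indicate failure.
clf = train_stump([(1,), (2,), (8,), (9,)], [False, False, True, True])
print(clf((0,)), clf((10,)))  # -> False True
```

In practice a stronger ensemble model would replace the stump, but the train/predict interface stays the same.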

In which other scenarios can one apply the methodology used in your research?

MB: Broadly speaking, our analysis pipeline constructs a model for predicting the occurrence of rare events. Rare-event prediction is common in the IT industry, but our model could in principle be applied to anything from predicting car engine failure from periodically collected sensor data to forecasting a person’s health outcome from medical monitoring. Although we haven’t tested it yet in these scenarios, I would welcome the opportunity.

What conclusion did your research reach? What are the implications of these conclusions?

MB: We have shown that, for specific disk manufacturers such as Seagate, SMART parameters are a useful determinant in predicting when a disk needs to be replaced. Reasonable disk replacement prediction accuracy has several practical benefits: it mitigates reliability issues by allowing administrators to back up the data and schedule replacement tasks in advance.

Where is your algorithm already in use? Where else could it be applied?

MB: Our algorithm has not yet been deployed. Currently, we are exploring the possibility of applying these techniques to the data produced by IBM DS8000 storage systems.

About the author: Millian Gehrer is a summer intern at IBM Research – Zurich, where he is interviewing scientists to learn more about their work and motivations. In the fall, he will begin studying Computer Science as an undergraduate at Princeton University. 
