Math vs. Massive Data Overload

This year, digital information will grow to 988 exabytes, the equivalent of a stack of books stretching from the sun to Pluto and back.

Sure, lots of data is great for predicting the future and producing models, but how do you know if the data is any good? IBM scientists have developed an algorithm that can tell you.

Data, data everywhere

Much of this data is gathered by sensors, actuators, RFID tags, and GPS tracking devices, which measure everything from ocean-water pollution to traffic patterns to food supply chains.

But the question remains: how do you know whether the data is good, and not filled with errors and anomalies or produced by a faulty sensor? For example, if a scientist attempts to predict climate change using data from a broken sensor that is off by 25 degrees for an entire year, the model will reflect that error. As the saying goes, “garbage in, garbage out.”
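
As a purely hypothetical sketch of that failure mode, and not IBM's patented method, the short Python example below flags a sensor whose readings are systematically offset from those of its neighbors:

```python
# Hypothetical illustration (not the IBM algorithm): flag a sensor whose
# readings are systematically offset from those of its peers.
import numpy as np

rng = np.random.default_rng(0)
days = 365

# Simulated daily temperatures: three healthy sensors plus one that reads
# 25 degrees too high for the entire year.
signal = 15 + 10 * np.sin(2 * np.pi * np.arange(days) / 365)
healthy = signal + rng.normal(0, 2, size=(3, days))
broken = signal + 25 + rng.normal(0, 2, size=days)

def bias_against_peers(sensor, peers):
    """Average offset of one sensor from the mean of its peer sensors."""
    return float(np.mean(sensor - peers.mean(axis=0)))

print(bias_against_peers(broken, healthy))          # roughly +25: suspect
print(bias_against_peers(healthy[0], healthy[1:]))  # roughly 0: plausible
```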

“In a world with already one billion transistors per human and growing daily, data is exploding at an unprecedented pace,” said Dr. Alessandro Curioni, manager of the Computational Sciences team at IBM Research – Zurich. “Analyzing these vast volumes of continuously accumulating data is a huge computational challenge in numerous applications of science, engineering and business.”

Lines of efficiency

To solve this challenge, IBM scientists in Zurich have patented a mathematical algorithm, implemented in fewer than 1,000 lines of code, that reduces the computational complexity, cost, and energy usage of analyzing the quality of massive amounts of data by two orders of magnitude.

To confirm their method, the scientists validated nine terabytes of data, nine trillion bytes (a number with 12 zeros), on the fourth-largest supercomputer in the world, a Blue Gene/P system at Forschungszentrum Jülich in Germany.
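
The post doesn't spell out the mathematics, but randomized estimation is a classic way to trade exact computation for a small number of cheap samples. The sketch below shows one textbook example, Hutchinson's stochastic trace estimator, purely to illustrate the idea of sampling-based analysis; it is not the patented algorithm:

```python
# Illustration only: Hutchinson's randomized trace estimator. It approximates
# trace(A) from a handful of matrix-vector products instead of touching every
# entry of A, which is the general spirit of sampling-based data analysis.
import numpy as np

def hutchinson_trace(matvec, n, num_samples=30, seed=0):
    """Estimate trace(A) using only products A @ v with random +/-1 vectors."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_samples):
        v = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe vector
        total += v @ matvec(v)                # E[v^T A v] = trace(A)
    return total / num_samples

# Toy check on a small dense matrix (in practice A would be huge and implicit).
rng = np.random.default_rng(1)
A = rng.normal(size=(500, 500))
A = A @ A.T                                   # symmetric, positive semi-definite
print(hutchinson_trace(lambda v: A @ v, n=500), np.trace(A))
```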

The result: what would normally have taken a day was crunched in 20 minutes. In terms of energy savings, the JuGene supercomputer at Forschungszentrum Jülich requires about 52,800 kWh for one day of operation on the full machine; the IBM demonstration required an estimated 700 kWh, only about 1 percent of what was previously needed.
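
Those figures are easy to sanity-check from the numbers quoted above: a day compressed into 20 minutes is roughly a 72-fold speedup, and 700 kWh is about 1.3 percent of 52,800 kWh:

```python
# Back-of-the-envelope check of the savings quoted in the post.
baseline_minutes = 24 * 60      # one full day of computation
demo_minutes = 20
baseline_kwh = 52_800           # full machine, one day
demo_kwh = 700

print(baseline_minutes / demo_minutes)   # 72.0   -> roughly a 72x speedup
print(demo_kwh / baseline_kwh)           # 0.0132 -> about 1 percent of the energy
```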

“Determining how typical or how statistically relevant the data is helps us to measure the quality of the overall analysis and reveals flaws in the model or hidden relations in the data,” explains Dr. Costas Bekas of IBM Research – Zurich. “Efficient analysis of huge data sets requires the development of a new generation of mathematical techniques that target both reducing computational complexity and, at the same time, allowing for efficient deployment on modern massively parallel resources.”
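
As one more illustration, and again not the method described in the post, a simple textbook score for how “typical” an observation is within a data set is its Mahalanobis distance from the sample mean:

```python
# Illustration only (not the method in the post): Mahalanobis distance as a
# simple score of how "typical" an observation is within a data set.
import numpy as np

def mahalanobis(x, data):
    """Distance of x from the data mean, scaled by the sample covariance."""
    mean = data.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(0)
data = rng.normal(0, 1, size=(10_000, 3))             # well-behaved measurements
print(mahalanobis(np.array([0.1, -0.2, 0.0]), data))  # small: typical
print(mahalanobis(np.array([8.0, 8.0, 8.0]), data))   # large: statistically suspect
```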
