Tachyon for ultra-fast Big Data processing

Editor’s note: This article is by cloud analytics infrastructure expert Gil Vernik, IBM Research-Haifa.

Today’s massive growth in data sets means that storage is increasingly becoming a critical bottleneck for system workloads. My storage team in Haifa, Israel wants to analyze and understand these massive volumes of data, and we need to store them somewhere reliable. Although disk is an option, it’s too slow for fast Big Data processing. In-memory computing, which keeps the data in a server’s RAM for fast access and processing, offers a good solution for Big Data workloads, but memory is limited and expensive.
Enter Tachyon, a memory-centric distributed storage system that offers processing at memory speed and reliable storage. Its software works with servers in clusters so there’s plenty of room for storage, and a unique feature eliminates the need for replication to ensure fault tolerance. Now, we’ve connected Tachyon to Swift, so it works effortlessly with both Swift and SoftLayer. The result? Tachyon is even more flexible and efficient.
Building efficiency into Big Data analytics
Let’s take something like Facebook’s data. Tons of data need to be stored and analyzed, such as logs, activities, connections, media, locations, messages, and so forth. A good solution would be to store them as objects or files in an object store like Swift. Why as objects? Object storage offers two important things: low-cost storage and reliability, so even if my computer fails, I know my data is safe.
Several applications can analyze this data, including Apache Spark and Apache Flink. Often, while one set of analytics is being done, say to find out what ads to display on your news page, another user might analyze the same data set to find out which geographies you have visited most often. In short, we can have different instances of analytics workloads all reading and writing the same data. Tachyon uses memory aggressively and can serve up the results to the users so the work doesn’t have to be done separately, multiple times.
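To make the sharing idea concrete, here is a minimal Python sketch of a memory tier that sits in front of a slower object store and serves the same data to two different analytics jobs. It is only an illustration under stated assumptions: `SharedMemoryTier`, the dictionary standing in for the object store, and the two job functions are hypothetical names, not Tachyon's or Swift's actual API.

```python
from typing import Callable, Dict


class SharedMemoryTier:
    """Toy memory tier: reads the backing store once, then serves from RAM."""

    def __init__(self, backing_read: Callable[[str], bytes]) -> None:
        self._backing_read = backing_read
        self._cache: Dict[str, bytes] = {}
        self.backing_reads = 0  # how often we had to go to the slow store

    def read(self, key: str) -> bytes:
        if key not in self._cache:
            self.backing_reads += 1
            self._cache[key] = self._backing_read(key)
        return self._cache[key]


# Hypothetical object store, modeled as a plain dictionary.
object_store = {"logs/visits": b"us,fr,us,il"}
tier = SharedMemoryTier(object_store.__getitem__)


def count_visits(tier: SharedMemoryTier, key: str, country: str) -> int:
    """First job: how often was one country visited?"""
    return tier.read(key).decode().split(",").count(country)


def distinct_countries(tier: SharedMemoryTier, key: str) -> int:
    """Second job: how many different countries appear in the same data?"""
    return len(set(tier.read(key).decode().split(",")))


us_visits = count_visits(tier, "logs/visits", "us")
countries = distinct_countries(tier, "logs/visits")
```

Both jobs read the same key, but only the first read reaches the backing store; the second is served from memory. A real deployment would also have to handle writes, eviction, and many machines, which this sketch leaves out.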
The latest evaluations show that Tachyon outperforms in-memory HDFS by 110x for writes. It also improves the end-to-end latency of a realistic workflow by 4x.
Faster, cheaper, and reliable, Tachyon solves the problem
Tachyon, a project out of UC-Berkeley’s AMPLab, is intended to help organizations quickly store and access all that information. (The term tachyon refers to a particle that moves faster than light.) And it has been gaining momentum. We certainly recognize its tremendous potential for improving the efficiency and fault tolerance of computation frameworks, such as Spark and Flink. With the help of Tachyon Nexus founder Haoyuan Li and the Tachyon community, we were able to turn this potential into reality.
Many frameworks like Spark take advantage of memory. When they share data between different frameworks or jobs, they need to write the data to different systems, which takes time. Making sure they stay synchronized for writes is even more difficult. Tachyon helps achieve memory throughput without unnecessary replication and still provides reliability.
If a computer fails, the system re-computes the lost data using lineage, and in this way provides reliability through fault tolerance. Data lineage is generally defined as a kind of data life cycle that includes the data’s origins and its transformations. Because it’s a distributed system, Tachyon works as a cluster, using the memory of many computers, so the failure of one computer is not a problem. Even if the entire cluster fails, the data is backed up, since Tachyon periodically saves it to disk.
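The lineage idea can be sketched in a few lines of Python. This is a toy model, not Tachyon's implementation: `LineageStore` and its methods are hypothetical names chosen for illustration. The store remembers where each dataset came from and which transformation produced it, so a dataset lost from memory can be rebuilt instead of being restored from a replica.

```python
from typing import Callable, Dict, List, Optional, Tuple


class LineageStore:
    """Toy in-memory store that records how each dataset was derived."""

    def __init__(self) -> None:
        self._memory: Dict[str, List[int]] = {}
        # name -> (parent name or None, function that rebuilds the data)
        self._lineage: Dict[str, Tuple[Optional[str], Callable]] = {}

    def put_source(self, name: str, loader: Callable[[], List[int]]) -> None:
        """Register an origin dataset together with the way to reload it."""
        self._lineage[name] = (None, loader)
        self._memory[name] = loader()

    def derive(self, name: str, parent: str,
               transform: Callable[[List[int]], List[int]]) -> None:
        """Create a dataset from a parent and remember the transformation."""
        self._lineage[name] = (parent, transform)
        self._memory[name] = transform(self.get(parent))

    def evict(self, name: str) -> None:
        """Simulate losing the in-memory copy, e.g. after a node failure."""
        self._memory.pop(name, None)

    def get(self, name: str) -> List[int]:
        """Return the data, re-computing it from lineage if it was lost."""
        if name not in self._memory:
            parent, fn = self._lineage[name]
            self._memory[name] = fn() if parent is None else fn(self.get(parent))
        return self._memory[name]


store = LineageStore()
store.put_source("clicks", lambda: [3, 1, 4, 1, 5])
store.derive("doubled", "clicks", lambda xs: [2 * x for x in xs])

# Lose both in-memory copies, then ask for the derived data again.
store.evict("doubled")
store.evict("clicks")
recovered = store.get("doubled")  # rebuilt as [6, 2, 8, 2, 10]
```

The recovery walks back up the chain: the derived dataset is missing, so its parent is reloaded from its origin and the recorded transformation is applied again. The trade-off is compute time at recovery instead of extra copies kept at write time.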
Tachyon is open source and already deployed in production at multiple companies. In addition, the project has more than 100 contributors from more than 30 institutions, including Yahoo, Tachyon Nexus, Red Hat, Baidu, Intel, and of course, IBM. The project is the storage layer of the Berkeley Data Analytics Stack (BDAS) and also part of the Fedora distribution.
What next?
We’ll continue investing effort in Tachyon so that more organizations can take advantage of the performance boost it offers. This collaboration goes a long way towards preventing repetitive work, improving memory utilization, and letting processing make fuller use of memory, especially when it comes to Big Data analytics frameworks like Spark and others.