Editor’s note: This article is by cloud analytics infrastructure expert Gil Vernik, IBM Research-Haifa.
August 17, 2015
Tachyon for ultra-fast Big Data processing
Today’s massive growth in data sets means that storage is increasingly becoming a critical bottleneck for system workloads. My storage team in Haifa, Israel wants to analyze and understand these massive volumes of data, and we need to store them somewhere reliable. Although disk space is an option, it’s too slow to carry out fast Big Data processing. In-memory computing, which keeps the data in a server’s RAM for fast access and processing, offers a good solution for processing Big Data workloads – but it’s limited and expensive.
Enter Tachyon, a memory-centric distributed storage system that offers processing at memory speed together with reliable storage. Its software runs across servers in clusters so there’s plenty of room for storage, and a distinctive feature, recovery based on data lineage, eliminates the need for replication to ensure fault tolerance. Now, we’ve connected Tachyon to Swift so it can work effortlessly with Swift and SoftLayer. The result? Tachyon is even more flexible and efficient.
Building efficiency into Big Data analytics
Let’s take something like Facebook’s data. Tons of data need to be stored and analyzed, such as logs, activities, connections, media, locations, messages, and so forth. A good solution would be to store them as objects or files in an object store like Swift. Why as objects? Object storage offers two important things: low-cost storage and reliability. So even if my computer fails, I know my data is safe.
Several applications can analyze this data, including Apache Spark and Apache Flink. Often, while one set of analytics is being done, say to find out what ads to display on your news page, another user might analyze the same data set to find out which geographies you have visited most often. In short, we can have different instances of analytics workloads all reading and writing the same data. Tachyon uses memory aggressively and can serve up the results to the users so the work doesn’t have to be done separately, multiple times.
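To make the idea of sharing concrete, here is a minimal sketch in Python. It is not Tachyon’s API; the class `SharedMemoryCache`, the function `load_from_object_store`, the data set name, and the sample records are all hypothetical stand-ins. It only illustrates the principle that two analytics jobs reading the same data set can be served from one in-memory copy, so the slow read happens once.

```python
from collections import Counter
from typing import Callable, Dict, List


class SharedMemoryCache:
    """Toy shared cache: the first read loads the data, later reads hit memory."""

    def __init__(self, loader: Callable[[str], List[dict]]) -> None:
        self._loader = loader
        self._cache: Dict[str, List[dict]] = {}
        self.loads = 0  # how many times the slow backing store was read

    def read(self, name: str) -> List[dict]:
        if name not in self._cache:
            self.loads += 1
            self._cache[name] = self._loader(name)
        return self._cache[name]


def load_from_object_store(name: str) -> List[dict]:
    # Hypothetical stand-in for a slow read from an object store such as Swift.
    return [
        {"user": "a", "action": "click", "country": "IL"},
        {"user": "a", "action": "view", "country": "US"},
        {"user": "b", "action": "click", "country": "IL"},
    ]


cache = SharedMemoryCache(load_from_object_store)

# Job 1: which actions are most common (for example, to choose ads).
actions = Counter(event["action"] for event in cache.read("activity-log"))

# Job 2: which countries appear most often, using the same data set.
countries = Counter(event["country"] for event in cache.read("activity-log"))
```

Both jobs see the same records, and `cache.loads` stays at 1 because the second job is served from memory. A real deployment spreads that memory across a cluster, which this single-process sketch does not attempt to show.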
The latest evaluations show that Tachyon outperforms in-memory HDFS by 110x for writes. It also improves the end-to-end latency of a realistic workflow by 4x.
Faster, cheaper, and reliable, Tachyon solves the problem
Tachyon, a project out of UC Berkeley’s AMPLab, is intended to help organizations quickly store and access all that information. (The term tachyon refers to a hypothetical particle that moves faster than light.) And it has been gaining momentum. We certainly recognize its tremendous potential for improving the efficiency and fault tolerance of computation frameworks, such as Spark and Flink. With the help of Tachyon Nexus founder Haoyuan Li and the Tachyon community, we were able to turn this potential into reality.
Many frameworks like Spark take advantage of memory. But when data has to be shared between different frameworks or jobs, it must be written out to a separate system, which takes time. And keeping those copies synchronized for writes is even more difficult. Tachyon helps achieve memory throughput without unnecessary replication and still provides reliability.
If a computer fails, the system re-computes the lost data using lineage, and in this way provides reliability through fault tolerance. Data lineage is generally defined as a kind of data life cycle that includes the data’s origins and its transformations. Because it’s a distributed system, Tachyon works as a cluster, using the memory of a whole bunch of computers. So if one computer fails, there’s no problem. Even if the entire cluster fails, the data can be recovered, since Tachyon saves it to disk from time to time.
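The recovery idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Tachyon’s implementation or API: the class `LineageStore`, its methods, and the data set names are hypothetical. The store remembers, for every derived data set, the function and inputs that produced it. When the in-memory copy is lost, it replays that recipe, starting from source data kept in durable storage.

```python
from typing import Callable, Dict, List, Tuple


class LineageStore:
    """Toy store that can rebuild lost in-memory data from recorded lineage."""

    def __init__(self) -> None:
        self.memory: Dict[str, list] = {}   # fast but volatile copies
        self.durable: Dict[str, list] = {}  # stand-in for reliable storage
        # For each derived data set: (transformation, names of its inputs).
        self.lineage: Dict[str, Tuple[Callable[..., list], List[str]]] = {}

    def put_source(self, name: str, data: list) -> None:
        """Store source data durably and keep a copy in memory."""
        self.durable[name] = list(data)
        self.memory[name] = list(data)

    def derive(self, name: str, fn: Callable[..., list], inputs: List[str]) -> list:
        """Compute a data set in memory and record how it was produced."""
        self.lineage[name] = (fn, list(inputs))
        self.memory[name] = fn(*[self.get(i) for i in inputs])
        return self.memory[name]

    def get(self, name: str) -> list:
        """Return data from memory, reloading or re-computing it if lost."""
        if name in self.memory:
            return self.memory[name]
        if name in self.durable:
            self.memory[name] = list(self.durable[name])
        else:
            fn, inputs = self.lineage[name]
            self.memory[name] = fn(*[self.get(i) for i in inputs])
        return self.memory[name]

    def simulate_failure(self) -> None:
        """Lose everything held in memory, as if the machine had crashed."""
        self.memory.clear()


store = LineageStore()
store.put_source("logs", [3, 1, 4, 1, 5])
store.derive("doubled", lambda xs: [2 * x for x in xs], ["logs"])
store.derive("total", lambda xs: [sum(xs)], ["doubled"])

store.simulate_failure()
recovered = store.get("total")  # rebuilt by replaying logs -> doubled -> total
```

Only the source data was ever written durably. The derived results were never replicated, yet they are available again after the simulated failure because the recipe for producing them was kept.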
Tachyon is open source and already deployed in production at multiple companies. In addition, the project has more than 100 contributors from more than 30 institutions, including Yahoo, Tachyon Nexus, Red Hat, Baidu, Intel, and of course, IBM. The project is the storage layer of the Berkeley Data Analytics Stack (BDAS) and is also part of the Fedora distribution.
We’ll continue investing effort in Tachyon so that more organizations can take advantage of the performance boost offered. This collaboration goes a long way towards preventing repetitive work, improving memory utilization, and allowing processing to reach new levels of memory usage – especially when it comes to Big Data analytics frameworks like Spark and others.