Since its inaugural event in 2013, thousands of developers, scientists, analysts, researchers and executives from around the globe have trekked to the Spark Summit to talk about how the open source processing engine known as Apache Spark can be applied to big data, machine learning and data science to deliver new insights.
With more than 1,000 contributors from 250+ organizations, Apache Spark has also become the largest open source community, and this week its members will gather in San Francisco. Among those attending this Sparkfest are several scientists from IBM’s research labs in Almaden, Haifa, Tokyo and Zurich. If you can’t make it, here is a sneak peek.
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash (presents on 6 June)
The name may sound like a brutal medieval torture device, but Crail may just be the answer to the nightmare many CIOs and data center operators face: how to get the best out of modern hardware for their big data and high-performance analytics.
Started in 2014 by a team of software engineers at IBM’s Zurich lab, Crail is an open source distributed storage system designed for very fast network and storage hardware. It integrates both DRAM and NVMe flash across a remote direct memory access (RDMA) fabric to provide storage for performance-critical temporary data in analytics workloads. The challenge it addresses is simple to state: given a set of very fast devices such as flash, DRAM and an RDMA network, each operating at different speeds and densities, how do you make sure data is accessed efficiently? Think of it like this: you have just bought the best engine, tires and chassis for your car, and Crail is the blueprint that shows how to integrate them to post the best lap times in various races.
Since launching Crail and putting it on GitHub, the team has benchmarked it on Spark, sorting 12.8 TB of data in 98 seconds, which works out to a sorting rate of 3.13 GB/min/core. That is about a factor of five faster than the sorting performance of the 2014 Spark benchmark winner. More recently the team has done additional benchmarking, including SQL and machine learning workloads, and the results will be presented at the Summit by IBM researcher Patrick Stuedi. The team’s work on integrating high-performance network and storage devices was recognized last week with the best paper award at the 10th ACM International Systems and Storage Conference (SYSTOR’17) in Haifa, Israel.
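The quoted numbers are easy to sanity-check. A short sketch below derives the aggregate throughput from the 12.8 TB / 98 s figure and, from the per-core rate, the implied core count of the test cluster; note the cluster size is not stated in this article, so the derived core count is an inference, not a quoted fact.

```python
# Sanity-check the Crail-on-Spark sorting figures quoted above.

TOTAL_TB = 12.8        # data sorted
SECONDS = 98           # wall-clock sorting time
RATE_PER_CORE = 3.13   # GB/min/core, as quoted

total_gb = TOTAL_TB * 1000                      # decimal TB -> GB
throughput_gb_per_min = total_gb / (SECONDS / 60)
# The per-core rate then implies how many cores the benchmark used
# (an inference; the article does not state the cluster size).
implied_cores = throughput_gb_per_min / RATE_PER_CORE

print(f"aggregate throughput: {throughput_gb_per_min:.0f} GB/min")
print(f"implied core count:   {implied_cores:.0f}")
```

Running this gives an aggregate rate of roughly 7,800 GB/min, consistent with a cluster of about 2,500 cores at the quoted per-core rate.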
Very Large Data Files, Object Stores and Deep Learning: Lessons Learned While Looking for Signs of Extra-Terrestrial Life (presents on 7 June)
IBM scientist Gil Vernik has been fascinated by computers ever since he was a boy, when his grandfather bought him his first programmable computer, a Sharp PC-1150 that could be programmed in BASIC and had a whopping 16K of memory. That same fascination is now helping an IBM team enable NASA’s SETI project to listen for signs of alien life in outer space, on the IBM Cloud platform. Vernik believes that we need a strong connection between Spark and object stores, and at the Spark Summit he will present his Stocator technology, which enhances the way large data files are stored and analyzed. Together with fellow IBMer Graham Mackintosh, Vernik will present details on how Stocator is being applied in a collaborative project with the SETI Institute, NASA, Swinburne University, Stanford University and IBM.
More specifically, the Allen Telescope Array in northern California has been continuously scanning the skies for over two decades, generating data archives with over 200 million signal events. Today, astronomers and researchers are using Apache Spark to train neural net models for signal classification, and to perform computationally intensive Spark workloads on multi-terabyte binary signal files. Stocator, now an open source (Apache License 2.0) object store connector for Hadoop and Apache Spark, is specifically designed to optimize their performance with object stores. It is helping the project by greatly improving performance and reducing the quantity of resources used, both for ground-to-cloud uploads of very large signal files, and for subsequent access of radio data for analysis using Spark.
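Because Stocator implements the Hadoop file system interface, wiring it into a Spark deployment is a matter of configuration rather than application code. A minimal sketch follows, based on the property and class names documented in the Stocator project for its IBM Cloud Object Storage backend; the service name `myservice`, the endpoint and the credential values are placeholders, not values from this article:

```
# spark-defaults.conf (sketch; "myservice" is a placeholder service name)
spark.hadoop.fs.stocator.scheme.list       cos
spark.hadoop.fs.cos.impl                   com.ibm.stocator.fs.ObjectStoreFileSystem
spark.hadoop.fs.stocator.cos.impl          com.ibm.stocator.fs.cos.COSAPIClient
spark.hadoop.fs.stocator.cos.scheme        cos
spark.hadoop.fs.cos.myservice.endpoint     <OBJECT_STORE_ENDPOINT>
spark.hadoop.fs.cos.myservice.access.key   <ACCESS_KEY>
spark.hadoop.fs.cos.myservice.secret.key   <SECRET_KEY>
```

With such a configuration in place, Spark jobs can read and write objects through URIs of the form `cos://bucket.myservice/path`, and Stocator handles the object-store interaction without the rename-heavy temporary-file protocol that the stock Hadoop connectors use.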
Demystifying DataFrame and Dataset (presents on 7 June)
During his college years, Kazuaki Ishizaki was into computer hardware research. The more he studied hardware, the more he became interested in how to make computing faster, and his research focus shifted from hardware to programming. When the early Spark releases came out back in 2014, not many people paid attention, but they certainly grabbed Ishizaki’s attention. He was drawn to Spark because he saw potential in making parallel distributed processing easier with it, and he believed his expertise in optimizing programs could make that processing much faster.
As one of the key open source contributors to Spark, Ishizaki will focus his Summit presentation on the machine learning library framework and its aging internal APIs, which no longer accommodate the latest technologies behind Spark SQL’s performance improvements. Thankfully, he has made several improvements to bring the code up to “free-lunch” speed. That is welcome news for developers who write apps for the IoT, autonomous driving or weather forecasting: improved real-time processing and learning precision let such apps deliver more timely information, helping users avoid a sudden thunderstorm, for example.