I work with customers and business partners every day on planning, architecting, testing, and deploying Big Data solutions. For the last two years my work has mostly centered on Apache Hadoop, the parallel, distributed environment for processing mountains of data, in almost any format, quickly. Many companies have implemented Hadoop to improve their analytical capabilities and, ultimately, their decision-making processes.
Hadoop does a lot of things really well. It has been evolving and gradually maturing, with new features and capabilities that make it easier to set up and use, and there is now a large ecosystem of applications that leverage it. So, if Hadoop has been so successful, why do we need something called Spark?
Using Hadoop comes at a cost. It requires lots of servers, large amounts of storage (especially when data replication is used), extensive networking infrastructure, and potentially a large data center. Customers who have already deployed Hadoop, and those who are evaluating it, are asking how they can get a more cost-effective solution for their environment. They tell us they want to use the same environment for multiple purposes, or to use Hadoop in new ways. They want to process more data with the infrastructure they already have in place. How can we make Hadoop even faster?
The answer is Spark!
Spark at its core was developed to improve the performance of data processing. It doesn't just eke out a little more performance; Spark boasts a 10x-100x improvement. That definitely catches your attention! In general, Spark can improve performance across the board for most workloads, while drastically improving it for a subset of them. So what is the magic sauce inside Spark?
The Hadoop premise is that you get good performance by keeping the processor close to the data it needs to process. But Hadoop processes data on disk, while Spark processes data in memory and only goes to disk when it needs to. Spark keeps data in memory between the Map and Reduce phases, avoiding the need to go back to disk (i.e., HDFS in many cases) after every step. If an application performs many iterative computations on the same data, the performance boost is magnified. For example, Spark's in-memory approach is especially beneficial for machine learning algorithms, which are iterative in nature.
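To make that contrast concrete, here is a minimal, Spark-free Python sketch (all file names and helper functions are illustrative, not Spark APIs) of why caching pays off for iterative work: the disk-backed loop rereads its input on every pass, the way chained MapReduce jobs round-trip through HDFS, while the cached version loads the data once and iterates in memory.

```python
import os
import tempfile

# Write a small "dataset" to disk to stand in for files on HDFS.
datafile = os.path.join(tempfile.mkdtemp(), "numbers.txt")
with open(datafile, "w") as f:
    f.write("\n".join(str(i) for i in range(1000)))

reads_from_disk = 0

def load_from_disk():
    """Simulates a job stage that rereads its input from storage."""
    global reads_from_disk
    reads_from_disk += 1
    with open(datafile) as f:
        return [int(line) for line in f]

# Hadoop-style: each of 10 iterations goes back to disk.
for _ in range(10):
    total = sum(load_from_disk())

# Spark-style: load once, keep the dataset in memory across iterations.
cached = load_from_disk()
for _ in range(10):
    total = sum(cached)

print(reads_from_disk)  # 11: ten disk-bound passes plus one load for the cache
```

The second loop does the same ten iterations but touches storage only once, which is the effect Spark's in-memory caching achieves for iterative algorithms.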
Apache Spark consists of the Spark core and a set of libraries similar to those available for Hadoop. The core is the distributed execution engine, and a set of languages (Java, Scala, Python, and R) is supported for distributed application development. Additional libraries built on top of the core enable streaming, SQL, graph, and machine learning workloads. While some of these workloads are also supported on Hadoop, Spark offers unprecedented ease of development and the ability to combine all of them seamlessly in the same application. Spark is also unique in that it can be used interactively from the Scala, Python, or R shell environments. Unlike Hadoop, Spark can run in a variety of modes, each offering a different way of managing the cluster and its resources: standalone, Apache Mesos, Hadoop YARN, and in the cloud.
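The programming model behind that ease of development is a chain of functional transformations. In Spark's Python API the classic word count is a pipeline of flatMap, map, and reduceByKey calls on an RDD; the same shape can be sketched in plain Python without a Spark cluster (the input lines below are made up for illustration):

```python
from collections import Counter

lines = ["spark makes hadoop faster", "spark keeps data in memory"]

# "Map" side: split lines into (word, 1) pairs, as Spark's
# flatMap/map transformations would produce.
pairs = [(word, 1) for line in lines for word in line.split()]

# "Reduce" side: sum the counts per key, as reduceByKey would.
counts = Counter()
for word, n in pairs:
    counts[word] += n

print(counts["spark"])  # 2
```

In an interactive PySpark shell the equivalent pipeline is a few chained method calls, which is why combining SQL, streaming, and machine learning steps in one application feels so natural in Spark.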
IBM is a big contributor and a big player in the Spark community. IBM has announced a major commitment to Spark and created a new Spark Technology Center, whose focus will be on accelerating open source innovation across the Spark ecosystem and educating a massive number of data scientists and engineers on Spark through extensive partnerships.
IBM Power Systems and OpenPOWER will be key platforms for hosting Spark workloads. Today, the new IBM Open Platform with Apache Hadoop, which is based on the Open Data Platform, supports Spark. Work is under way to extend the capabilities and benefits of Spark even further by enabling POWER8 accelerators such as CAPI Flash, CAPI FPGAs, and GPUs. Get involved with the IBM Spark Technology Center to be the first to hear updates as new announcements are made.
The good news is that Apache Hadoop and Apache Spark are available to everyone. You don't have to choose just one; you can use both, and you can use them together. If you are already using Hadoop, start learning about Spark now so that you are well positioned to take advantage of the innovations coming out.
And, join Power Systems on October 5th for a webcast highlighting new capabilities and product announcements that will help you go faster than ever before! http://bit.ly/1OcrNru