Apache Spark is an in-memory analytics engine that can be viewed as both a library of analytics functionality and a platform. Unlike Hadoop (with its HDFS component), Spark doesn’t have distributed storage of its own; it is just an analytics engine, and a very capable one at that. In the Hadoop world, it can replace the traditional MapReduce engine while bringing much more out of the box. With Hadoop, you would need a huge cluster and have to run your jobs in batch because of processing latency, but with Spark, you get near-real-time processing potential because of its in-memory analytics capabilities. This lets you take your analytics to the data and not the other way round.
There are five main reasons that Spark is important:
1- Speed: We’ve seen certain algorithms run over 100 times faster than Hadoop MapReduce, so speed is certainly one reason Spark has gained traction in the analytics ecosystem. Time to insight is one of the KPIs we use in differentiating analytic systems, and Spark definitely wins in both batch and near-real-time scenarios.
2- Polyglot: Everyone has their own go-to language, and Spark lets you choose among Java, Scala, Python, and, most recently, R to implement core analytics pipelines. With great REST support from plugins like Spark Job Server, you can essentially leverage Spark analytics from any REST-enabled language, even the bash shell (through curl!).
3- Platform: It doesn’t just do batch analytics. It does that, and does it well, but it also lets you do streaming analytics for your low-response-time needs, and offers great out-of-the-box graph analytics capabilities through GraphX. I’ve had good success combining GraphX with graph databases like Titan, Neo4j, etc.
4- Easy ML capability: In the Hadoop world, you would typically need a complex pipeline involving Pig, Hive, Mahout, etc. to get a decent ML pipeline. With Spark, you get Spark MLlib, which gives you most of the basic ML building blocks needed for analytics. Plug that into NiFi and you have a fairly sophisticated streaming pipeline with the same ML code used for batch analytics.
5- Community: The Spark codebase has been evolving rapidly since its initial release in 2012, and Databricks has been doing a great job shepherding the community. Spark 1.5 brought some great performance updates from Catalyst and Tungsten, and 1.6 adds some neat Parquet and query-execution improvements (among other things) and my personal favorite, ML Pipeline persistence.
The combination of all these factors is what makes Spark such an enabler of rapid innovation.
Why Spark is Important to IBM
IBM is ushering in the cognitive era. While cognitive computing is far more than just ‘plain old analytics’ of structured and unstructured data, Spark is a building block that’s helping make cognitive systems a reality. Several of our financial, retail, healthcare, government, and infrastructure services rely on Spark-based solutions, as Spark helps take analytics to the source, which could be anything from the relational DB2 to the NoSQL database Cloudant and several databases in between. A little-known fact is that IBM is one of four founding members of the UC Berkeley AMPLab, the origin of Spark, and IBM has always provided guidance for the evolution of BDAS (the Berkeley Data Analytics Stack).
Spark, IBM z Systems and IBM LinuxONE
IBM supports Spark on Linux on z as well as on z/OS. In fact, we’ve showcased demos at some of the biggest technical conferences of 2015
(https://www.youtube.com/watch?v=VWBNoIwGEjo at LinuxCon and https://www.youtube.com/watch?v=sDmWcuO5Rk8 at Insight) where Spark was a key enabler of innovation. The former showcases the nondisruptive vertical scalability and unmatched reliability of the Linux on z ecosystem, with Spark performing a mix of unstructured and structured analytics across various open source NoSQL and relational data sources. More than 80% of the world’s business data resides on our platform, and taking analytics to the data provides new insights that may have been technically infeasible in the past. In fact, the latter demo shows how taking analytics to typical systems of record such as CICS, IMS, DB2, and VSAM (and Cloudant) can lead to right-time insights at the source that weren’t possible before.
IBM recently did a competitive analysis of Spark on LinuxONE versus Spark on a distributed system, with Spark connected to a competitor’s database in both environments. Spark performed aggregations up to three times faster on LinuxONE. Even Databricks’ Spark performance tests run faster on z Systems than on distributed platforms.
These are amazing numbers and a testament to 50 years of co-designing hardware and software on the z platform. This allows us to run new and popular open source projects on our platforms with expectations of high performance and scalability, especially for data-intensive workloads. Other benefits of Spark on LinuxONE and z Systems come from co-location. You need a data source to plug into Spark, and on these platforms, with 10 TB of shared memory and all the data residing on the platform itself, you can have Spark and your databases co-located, so you don’t have to worry about security, network latency, or keeping your data caches consistent. Data transfer over the network happens at the speed of an in-memory copy. In fact, once we got rid of network latency, we saw some interesting opportunities for optimization in other areas that can only be exploited on a diagonally scalable platform such as z Systems.
Spark runs on the Java Virtual Machine, and at IBM we have 15+ years of investment in the Java runtime and ecosystem. We have our own enterprise-grade cleanroom environment, independent of the typical Java consumer’s use. We’ve broken performance records in industry-standard benchmarks, and running on a processor with the industry’s highest clock speeds (5 GHz+) and largest caches certainly helps. We also have deep platform exploitation of features like Transactional Memory and Runtime Instrumentation, and we are proud to be the only production JVM out there with auto-vectorization of code enabled by default. This lets us convert serial operations into parallel operations that exploit our processors’ SIMD capabilities, which is especially useful in machine learning algorithms with matrix-intensive operations at their core.
What is next for IBM and Spark?
Spark is important to our business as an enabler of rapid innovation, so evolving and supporting the vibrant Spark community is a priority. Last year, IBM announced the goal of educating more than 1 million data scientists and data engineers on Spark through extensive partnerships with AMPLab, DataCamp, MetiStream, Galvanize, and the Big Data University MOOC. This was a very aggressive goal, and we’re making great progress toward it.
We also actively contribute to the Spark codebase and related projects, and recently donated the SystemML language as an Apache incubating project. We will also continue to invest in exploiting the unique and differentiating features and accelerators of IBM platforms in the Spark runtime, so customers can expect the best analytics experience running on our systems.