Apache Spark has emerged as a leading analytics framework, supplanting MapReduce in many cases. With Spark’s surging popularity, many firms’ IT groups want to avoid the cost, complexity and inefficiency of deploying separate Spark clusters. They are looking to deploy Spark as a centralized service, yet they raise the same concern they had when deploying MapReduce in a multi-tenant environment. In a 2014 survey, large banks said they were concerned about maintaining service levels in the presence of “noisy neighbors.”
Resource managers for Spark, including Apache Hadoop YARN, Apache Mesos and IBM Platform Conductor for Spark, claim to maintain service levels by allocating resources appropriately to each job. Just as a Gemological Institute of America (GIA) condition report can help you assess a diamond, independent technology reviews and benchmarks help firms evaluate technology. To help firms evaluate Spark resource managers, IBM introduces the Spark Multi-User Benchmark v1 (SMB-1). SMB-1 is available for download at no charge. We welcome your feedback at firstname.lastname@example.org.
Using SMB-1 to evaluate Spark workloads on YARN, Mesos and IBM Platform Conductor
SMB-1 measures cluster throughput, average job runtime and job fairness for resource managers in a scenario where multiple users, starting at different times, concurrently submit short-duration jobs. Future revisions will cover more complex scenarios. IBM contracted STAC, a technology benchmark developer, to review SMB-1 and to benchmark the three managers above using it. To ensure that all three managers were properly and fairly configured for multi-user Spark workloads, STAC invited an industry-recognized Mesos expert to review the benchmark and test configurations. Key results from the benchmark:
- IBM Platform Conductor for Spark has 57 percent higher throughput than Mesos and 41 percent higher than YARN.
- IBM Platform Conductor for Spark is 59 percent faster than both YARN and Mesos, at 59 seconds.
- IBM Platform Conductor for Spark and Spark on YARN demonstrate much better fairness than Spark on Mesos. “Fairness” is defined as the standard deviation of the job duration divided by the average job duration, so a lower percentage means more consistent job durations. IBM Platform Conductor for Spark has the best fairness at 22 percent, with Spark on YARN at 27 percent and Spark on Mesos at 235 percent.
Click here for the full STAC benchmark report.
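To make the fairness definition above concrete, here is a minimal sketch in Python. It is not code from SMB-1 or the STAC report: the function name `fairness`, the sample durations, and the choice of the population standard deviation (rather than the sample one) are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def fairness(durations_s):
    """Standard deviation of job durations divided by their average.

    Lower values mean jobs finished in more similar times.
    Assumption: population standard deviation; the source does not say
    which variant the benchmark uses.
    """
    if not durations_s:
        raise ValueError("need at least one job duration")
    avg = mean(durations_s)
    if avg <= 0:
        raise ValueError("average job duration must be positive")
    return pstdev(durations_s) / avg


# Hypothetical job durations in seconds, not benchmark data.
even = [60.0, 60.0, 60.0, 60.0]      # every job takes the same time
skewed = [10.0, 10.0, 10.0, 210.0]   # one job is starved by its neighbors

print(f"even:   {fairness(even):.0%}")
print(f"skewed: {fairness(skewed):.0%}")
```

Both made-up workloads have the same 60-second average, so average runtime alone would not tell them apart. The ratio separates them: it is zero when every job takes the same time and rises above 100 percent when a few jobs run far longer than the rest.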
The community now has a tool to explore performance improvements and evaluate the workloads that matter to them. This tooling will be a valuable addition to the community and can be extended and enhanced as Spark grows.
Collaborating with open source Spark communities including Apache Mesos
Spark and, in particular, Apache Mesos are open, promising foundations for analytics and distributed systems. IBM actively supports the open source communities, providing both expertise and contributions in multiple areas. These include resource scheduling, where IBM resource management software is used by more than 2,000 customers worldwide (including 23 of the 30 largest enterprises) to manage over 5 million CPUs. In addition to contributing SMB-1, IBM is leading multiple workstreams to improve the Spark ecosystem, including Apache Mesos-focused workstreams ranging from task resizing to optimistic offers. To read more about SMB-1 and IBM contributions, please click here. IBM Platform Conductor for Spark is available as a trial at no charge and as a service.
Choosing the right Spark resource manager made simpler
As firms increasingly adopt analytics for insight and as data volumes continue to grow rapidly, new paradigms and technologies are needed to drive down the cost of analysis while accelerating time to insight. Advanced resource managers such as IBM Platform Conductor for Spark allow organizations to create a well-behaved, high-performance, high-throughput analytics platform with Spark. To help you choose the right resource manager, IBM has developed SMB-1, an open source Spark multi-user benchmark suite that simplifies the evaluation.