New tool to drive Spark performance advancement and adoption

Apache Spark has emerged as a leading analytics framework, supplanting MapReduce in many cases. With Spark’s surging popularity, many firms’ IT groups want to avoid the cost, complexity and inefficiency of deploying separate Spark clusters. They want to deploy Spark as a centralized service, yet they raise the same concern they had when deploying MapReduce in a multi-tenant environment: in a 2014 survey, large banks said they were concerned about maintaining service levels in the presence of “noisy neighbors.”

Resource managers for Spark, including Apache Hadoop YARN, Apache Mesos and IBM Platform Conductor for Spark, claim to maintain service levels by allocating resources appropriately to each job. Just as a Gemological Institute of America (GIA) report can help you assess a diamond, independent technology reviews and benchmarks help firms evaluate technology. To help firms evaluate Spark resource managers, IBM is introducing the Spark Multi-User Benchmark v1 (SMB-1). SMB-1 is available for download at no charge. We welcome your feedback at

Using SMB-1 to evaluate Spark workloads on YARN, Mesos and IBM Platform Conductor
SMB-1 measures cluster throughput, average job runtime and job fairness for resource managers in a scenario where multiple users, starting at different times, concurrently submit short-duration jobs. Future revisions will cover more complex scenarios. IBM contracted STAC, a technology benchmark developer, to review SMB-1 and to benchmark the three resource managers with it. To ensure that all three managers were properly and fairly configured for multi-user Spark workloads, STAC invited an industry-recognized Mesos expert to review the benchmark and test configurations.
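
To make the first two metrics concrete, here is a minimal Python sketch of how cluster throughput and average job runtime could be computed from per-job timestamps. It is not SMB-1 code: the `JobRecord` type, its field names, the sample values and the definition of throughput as completed jobs per hour of wall-clock time are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical per-job record; not a structure taken from SMB-1."""
    user: str
    submit_time: float  # seconds since the start of the run
    finish_time: float  # seconds since the start of the run


def throughput_jobs_per_hour(jobs):
    """Completed jobs per hour, from first submission to last completion.

    Assumed definition for illustration; SMB-1 may define throughput differently.
    """
    if not jobs:
        raise ValueError("no jobs recorded")
    start = min(j.submit_time for j in jobs)
    end = max(j.finish_time for j in jobs)
    return len(jobs) / (end - start) * 3600.0


def average_runtime_seconds(jobs):
    """Mean time from submission to completion across all jobs."""
    if not jobs:
        raise ValueError("no jobs recorded")
    return sum(j.finish_time - j.submit_time for j in jobs) / len(jobs)


# Made-up sample: three users submit short jobs at different times.
jobs = [
    JobRecord("alice", 0.0, 60.0),
    JobRecord("bob", 10.0, 100.0),
    JobRecord("carol", 30.0, 120.0),
]

print(throughput_jobs_per_hour(jobs))   # 3 jobs in 120 s -> 90.0 jobs/hour
print(average_runtime_seconds(jobs))    # (60 + 90 + 90) / 3 -> 80.0 seconds
```

Measuring runtime from submission rather than from first task launch means queueing delay counts against the resource manager, which seems appropriate when the question is how well it shares a cluster.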

The results are:

  • IBM Platform Conductor for Spark has 57 percent higher throughput than Mesos and 41 percent higher than YARN.
  • IBM Platform Conductor for Spark completed jobs in 59 seconds on average, 59 percent faster than both YARN and Mesos.
  • IBM Platform Conductor for Spark and Spark on YARN demonstrate much better fairness than Spark on Mesos. “Fairness” is defined as the standard deviation of the job duration divided by the average job duration, so a lower value is better. IBM Platform Conductor for Spark has the best fairness at 22 percent, with Spark on YARN at 27 percent and Spark on Mesos at 235 percent.

The full results are available in the STAC benchmark report.
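
The fairness metric defined above is the coefficient of variation of job durations. A minimal Python sketch follows; the function name and the sample durations are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the definition above does not say which one the benchmark uses.

```python
from statistics import mean, pstdev


def fairness(durations):
    """Standard deviation of job durations divided by their mean.

    Lower is better: 0.0 means every job took exactly the same time.
    Uses the population standard deviation, which is an assumption.
    """
    if not durations:
        raise ValueError("no durations recorded")
    return pstdev(durations) / mean(durations)


# Made-up durations in seconds, not benchmark data.
even_durations = [48.0, 72.0, 48.0, 72.0]    # mean 60, std dev 12
skewed_durations = [10.0, 10.0, 10.0, 210.0]  # one job starved of resources

print(f"{fairness(even_durations):.0%}")    # 20%
print(f"{fairness(skewed_durations):.0%}")  # much higher: durations vary widely
```

Because the metric is normalized by the mean, it can be compared across resource managers whose average job durations differ.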

The community now has a tool to explore performance improvements and evaluate the workloads that matter to them. It can be extended and enhanced as Spark grows.

Collaborating with open source Spark communities including Apache Mesos
Spark and Apache Mesos in particular are open, promising foundations for analytics and distributed systems. IBM actively supports the open source communities, providing both expertise and contributions in multiple areas. These include resource scheduling, where IBM resource management software is used by more than 2,000 customers worldwide (including 23 of the 30 largest enterprises) to manage over 5 million CPUs. In addition to contributing SMB-1, IBM is leading multiple workstreams to improve the Spark ecosystem, including Apache Mesos-focused workstreams ranging from task resizing to optimistic offers. To read more about SMB-1 and IBM contributions, please click here. IBM Platform Conductor for Spark is available as a trial at no charge and as a service.

Choosing the right Spark resource manager made simpler
As firms increasingly adopt analytics for insight and as data volumes continue to grow rapidly, new paradigms and technologies are needed to drive down the cost of analysis while accelerating time to insight. Advanced resource managers such as IBM Platform Conductor for Spark allow organizations to create a well-behaved, high-performance, high-throughput analytics platform with Spark. To help firms choose the right resource manager, IBM has developed SMB-1, an open source Spark multi-user benchmark suite that simplifies the evaluation.
