Powering up Apache Spark for the enterprise

The Apache Spark community and enterprise Spark adoption have both been growing rapidly. Many enterprises are now experimenting with Spark as an in-memory engine to accelerate common analytics workloads. However, early adoption often results in individual, isolated Spark clusters, as different lines of business or functional groups set up their own infrastructure to learn how to take advantage of Spark. As initial adoption moves into production, IT organizations need to respond with systems that can efficiently host these different groups on shared resources. They must also provide an optimized infrastructure that delivers even faster time-to-insights on a manageable platform.

How IBM Spectrum Conductor with Spark provides outstanding management of Spark applications

IBM Spectrum Conductor with Spark integrates the Apache Spark framework to run multiple instances of Apache Spark, including different versions, simultaneously in a shared multi-tenant environment. This capability helps reduce complexity and helps users manage Apache Spark in the face of frequent updates to open-source Spark distributions. IBM Spectrum Conductor with Spark uses high-efficiency resource scheduling technology to put idle resources to work running Spark jobs, speeding time to results and optimizing resource utilization. In addition, it provides the critical monitoring, alerting, reporting and diagnostic capabilities required to run Spark in the enterprise.

How Power Systems provides optimal time-to-insights for key Spark workloads

IBM Power Systems with the POWER8 processor provides an ideal environment for deploying Spark applications and accelerating big data workloads thanks to industry-leading memory bandwidth, cache size and processor performance. Some of the most popular Spark workloads utilize Spark SQL, Spark Streaming, Spark MLlib machine learning and Spark GraphX analytics. SQL and streaming workloads benefit from POWER8’s simultaneous multithreading (SMT) density of up to 8 threads per core, which is 4X what Intel x86 processors offer[ref]POWER8 supports 8 threads per core, x86 supports 2 threads per core[/ref]. Machine learning and graph workloads involve complex computation, often iterating over the same data set; such workloads benefit from POWER8’s large memory bandwidth[ref]Up to 4X depending on specific x86 and POWER8 servers being compared[/ref] and caches[ref]Up to 4.5X more cache comparing Intel E7-8890 servers to 12 core POWER8 servers[/ref], which per the footnoted comparisons are up to 4X and up to 4.5X those of x86 servers, respectively. The balanced system design of the POWER8 servers ensures maximum utilization across the compute, memory, cache and I/O resources of the individual servers. The net result is a 2X[ref]All results are based on IBM internal testing of 3 SparkBench benchmarks: SQL RDD Relation, Logistic Regression and SVM.
POWER8 cluster: 6 Data Nodes and 1 Management Node. Each node is an IBM Power System S812LC, 10 cores / 80 threads, POWER8, 2.92GHz, 256 GB memory, RedHat 7.2, Spark 1.5.1, OpenJDK 1.8
x86 cluster: 6 Data Nodes and 1 Management Node. Each node is x86 E5-2620 V3, 12 cores / 24 threads, 2.4GHz, 256 GB memory, RedHat 7.1, Spark 1.5.1, OpenJDK 1.8
[/ref] per-core average performance advantage across key Spark workloads, which translates to faster insights and more efficient clusters capable of hosting multi-tenant environments.
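
To make the arithmetic behind these ratios easier to check, here is a minimal Python sketch. It uses only the per-node core and thread counts from the test footnote above; the `Node` class and both helper functions are hypothetical names introduced for illustration and are not part of any IBM or Spark tooling. The node-level estimate additionally assumes that a per-core advantage scales linearly with core count, which is an assumption for illustration and not a result reported in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Per-node core and thread counts (hypothetical helper type)."""
    name: str
    cores: int
    threads: int

    @property
    def threads_per_core(self) -> float:
        return self.threads / self.cores


# Node configurations as listed in the benchmark footnote.
POWER8_NODE = Node("IBM Power System S812LC (POWER8)", cores=10, threads=80)
X86_NODE = Node("x86 E5-2620 V3", cores=12, threads=24)


def smt_ratio(a: Node, b: Node) -> float:
    """Ratio of hardware threads per core between two node types."""
    return a.threads_per_core / b.threads_per_core


def implied_node_ratio(per_core_advantage: float, a: Node, b: Node) -> float:
    """Node-level ratio implied by a per-core advantage.

    Assumption (not from the source): performance scales linearly
    with the number of cores in each node.
    """
    return per_core_advantage * a.cores / b.cores


if __name__ == "__main__":
    print(f"Threads per core: {POWER8_NODE.threads_per_core:.0f} vs "
          f"{X86_NODE.threads_per_core:.0f}")
    print(f"SMT ratio: {smt_ratio(POWER8_NODE, X86_NODE):.1f}X")
    print("Implied node-level ratio at 2X per core (linear-scaling assumption): "
          f"{implied_node_ratio(2.0, POWER8_NODE, X86_NODE):.2f}X")
```

Because the two test nodes have different core counts (10 versus 12), a per-core figure and a per-node figure are not interchangeable; the second helper only illustrates how one could convert between them under the stated assumption.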

The combined solution value

Together, IBM Spectrum Conductor with Spark and Power Systems offer an integrated solution ideal for running multi-tenant enterprise Spark deployments with blazing speed and efficiency. The IBM Data Engine for Hadoop and Spark, built with storage-dense S812LC POWER8 servers, offers an integrated cluster configuration that delivers a ready-to-use environment for deploying IBM Spectrum Conductor with Spark.