Powering up Apache Spark for the enterprise

The Apache Spark community and enterprise Spark adoption have both been growing rapidly. Many enterprises are now experimenting with Spark as an in-memory engine to accelerate common analytics workloads. However, early adoption often results in individual, isolated Spark clusters, as different lines of business or functional groups set up their own infrastructure to learn how to take advantage of the power of Spark. As initial adoption moves into production, IT organizations need to respond with systems that can efficiently host these different groups on shared resources. They must also provide an optimized infrastructure that delivers even faster time-to-insights on a manageable platform.

How IBM Spectrum Conductor with Spark provides outstanding management of Spark applications

IBM Spectrum Conductor with Spark integrates the Apache Spark framework to run multiple instances of Apache Spark, including different versions, simultaneously in a shared multi-tenant environment. This capability helps reduce complexity and helps users manage Apache Spark in the face of frequent updates to open-source Spark distributions. IBM Spectrum Conductor with Spark uses high-efficiency resource scheduling technology to put idle resources to work running Spark jobs, speeding time to results and optimizing resource utilization. In addition, it provides the critical monitoring, alerting, reporting and diagnostic capabilities required to run Spark in the enterprise.

How Power Systems provides optimal time-to-insights for key Spark workloads

IBM Power Systems with the POWER8 processor provides an ideal environment for deploying Spark applications and accelerating big data workloads thanks to industry-leading memory bandwidth, cache size and processor performance. Some of the most popular Spark workloads utilize Spark SQL, Spark Streaming, Spark MLlib machine learning and Spark GraphX analytics. SQL and streaming workloads benefit from POWER8’s simultaneous multithreading (SMT) density of up to 8 threads per core, which is 4X as many as Intel offers[ref]POWER8 supports 8 threads per core; x86 supports 2 threads per core[/ref]. Machine learning and graph workloads involve complex computation that often iterates over the same data set; such workloads benefit from POWER8’s large memory bandwidth[ref]Up to 4X, depending on the specific x86 and POWER8 servers being compared[/ref] and caches[ref]Up to 4.5X more cache comparing Intel E7-8890 servers to 12-core POWER8 servers[/ref], each also up to about 4X what Intel offers. The balanced system design of the POWER8 servers ensures maximum utilization across the compute, memory, cache and I/O resources of the individual servers. The net result is a 2X[ref]All results are based on IBM internal testing of 3 SparkBench benchmarks: SQL RDD Relation, Logistic Regression and SVM.
6 Data Nodes and 1 Management Node. Each node is IBM Power System S812LC 10 cores / 80 threads, POWER8; 2.92GHz, 256 GB memory, RedHat 7.2, Spark 1.5.1, OpenJDK 1.8
6 Data Nodes and 1 Management Node. Each node is x86 E5-2620V3 12 cores / 24 threads, E5-2620 V3; 2.4GHz, 256 GB memory, RedHat 7.1, Spark 1.5.1, OpenJDK 1.8
[/ref] average per-core performance advantage across key Spark workloads, which translates to faster insights and more efficient clusters capable of hosting multi-tenant environments.
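
To make the thread-density comparison concrete, here is a minimal Python sketch that recomputes threads per core from the node specifications quoted in the test-configuration footnote above. The `Node` class and its field names are illustrative assumptions, not part of any IBM or Spark tooling; only the core counts, thread counts and node counts come from the footnote.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical helper holding the per-node figures from the footnote."""
    name: str
    cores: int
    threads: int

    @property
    def threads_per_core(self) -> float:
        return self.threads / self.cores


# Per-node figures as stated in the test-configuration footnote.
power8 = Node("IBM Power System S812LC (POWER8)", cores=10, threads=80)
x86 = Node("x86 E5-2620 V3", cores=12, threads=24)

DATA_NODES = 6  # each tested cluster had 6 data nodes plus 1 management node

# SMT density: hardware threads available per core on each platform.
smt_ratio = power8.threads_per_core / x86.threads_per_core

for node in (power8, x86):
    print(
        f"{node.name}: {node.threads_per_core:.0f} threads/core, "
        f"{node.cores * DATA_NODES} cores and "
        f"{node.threads * DATA_NODES} threads across data nodes"
    )
print(f"SMT density ratio (POWER8 / x86): {smt_ratio:.0f}X")
```

Note that the tested POWER8 data nodes have fewer cores in total than the x86 data nodes, which is why the headline result is stated per core rather than per cluster.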

The combined solution value

Together, IBM Spectrum Conductor with Spark and Power Systems offer an integrated solution ideal for running multi-tenant enterprise Spark deployments with blazing speed and efficiency. The IBM Data Engine for Hadoop and Spark, built with storage-dense S812LC POWER8 servers, offers an integrated cluster configuration that delivers a ready-to-use environment for deploying IBM Spectrum Conductor with Spark.
