(Chief Architect, Analytics Stack Design and Performance, IBM Systems Group)
(Distinguished Engineer, IBM Power Systems Software & Solutions)
Apache Spark is an in-memory distributed compute engine that can analyze large-scale data up to 100X faster than existing technologies. IBM’s POWER8 processor has been touted as the first processor designed for big data, with its industry-leading memory bandwidth, thread density, and cache architecture. We set out to answer the question: how much better can POWER8 run Spark compared to existing platforms?
To do this, we wanted to select a set of workloads representative of a wide range of Spark use cases. IBM Research has developed a benchmark suite for Spark, called SparkBench, whose initial version consists of 10 benchmarks covering Machine Learning, Graph, SQL and Streaming use cases. The benchmark is publicly available on GitHub. SparkBench 1.0 is further described in the paper cited below and uses real-world datasets as input. Our benchmark runs use datasets from the Google Web Graph, Wikipedia data dumps, an e-commerce transaction dataset and a “live” Twitter stream.
We compared two systems: (i) an IBM Power Systems S822L with POWER8 and (ii) an Intel Xeon E5-2690 v3 server (referred to as the “x86” platform hereafter), each configured with 24 cores and 256 GB of RAM. Both systems have similar storage controller and disk configurations: three 300 GB disks in a RAID 0 configuration, presented as a single logical volume to the OS. Both systems run the same software stack: Spark 1.4, OpenJDK 1.8 and Ubuntu 15.04 LE Linux. POWER8 supports 8 SMT (Simultaneous Multi-Threading) threads per core, giving Spark 192 logical CPUs across the 24 physical cores. The x86 platform supports 2 SMT threads per core, giving Spark only 48 logical CPUs across its 24 physical cores. The following illustrates an individual POWER8 core with 8 threads per core (SMT-8):
POWER8 Processor with 12 cores:
S822L 2-socket server with 2 POWER8 Processors:
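On the Linux side, the SMT mode and the resulting logical CPU count are easy to verify. A minimal sketch (the `ppc64_cpu` utility comes from the powerpc-utils package on Linux on Power; the commands shown in comments assume a POWER8 host):

```shell
# On a POWER8 Linux host, inspect the SMT configuration:
#   ppc64_cpu --smt      # reports the current SMT mode, e.g. SMT=8
#   nproc                # logical CPUs the OS presents to Spark
# With 24 physical cores running in SMT-8 mode, the OS should report:
echo $((24 * 8))         # 192 logical CPUs
```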
Our approach was to tune Spark and the JVM to fully benefit from POWER8’s strengths: memory bandwidth, cache structure and thread density. Note that there were no alterations or changes to the Spark code itself; it was 100% open-source Apache Spark 1.4 as included in the BigInsights 4.1 IBM Open Platform. Overall, we found that the average performance of POWER8 across these 10 representative Spark workloads was 2.32X that of the x86 platform.
Average Performance Across 10 Spark Workloads
Now we’ll take a deeper look at the individual workloads and highlight which POWER8 attributes each one exploited. The Spark Streaming and Spark SQL workloads both benefited heavily from POWER8’s thread density (8-way SMT per core, 192 threads across 24 cores). The Twitter streaming benchmark uses a “live” Twitter stream to calculate the most popular Twitter tag every 60 seconds. PageView streaming uses a synthetic stream of generated user clicks and computes statistics such as active user counts and page view counts every 60 seconds. These streaming workloads benefit from a collection of threads operating on different packets in a stream simultaneously, and from threads operating on different stages of a message stream pipeline.

The SQL RDD Relation and SQL Hive benchmarks perform select, aggregate and join functions representative of most SQL-based workloads. SQL RDD Relation uses native Spark RDD caching, whereas SQL Hive executes SQL queries with Spark for orchestration. The threaded-execution argument above also holds for SQL workloads, where rows returned from a query can be acted on by multiple threads simultaneously. Spark launches tasks to process data, and being able to distribute those tasks across up to 192 independent logical CPUs was a huge benefit to the streaming and SQL workloads. We configured Spark to use all of POWER8’s SMT-8 threads by setting the SPARK_WORKER_CORES configuration variable, which let us run more Spark tasks concurrently and achieve higher stream and query processing parallelism than the x86 platform’s 48 SMT-2 threads. Here is an illustration of the Spark Workers configuration for these workloads:
Detailed performance results for these workloads are below, normalized to the x86 platform results (red bar).
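The streaming and SQL worker setup described above can be sketched in `spark-env.sh` for a Spark standalone deployment. This is a minimal sketch; SPARK_WORKER_CORES is the actual Spark variable, while the memory value is an illustrative assumption, not our exact setting:

```shell
# spark-env.sh sketch: a single Spark Worker exposing all 192 SMT-8
# logical CPUs of the S822L to the Spark scheduler.
export SPARK_WORKER_CORES=192
export SPARK_WORKER_MEMORY=230g   # assumption: leave headroom for the OS
```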
The next category of workloads we looked at was Spark Machine Learning. These algorithms tend to be iterative in nature, and Spark accelerates them by caching frequently used data items in its RDD caches. Each POWER8 core has a private 64 KB L1 cache and a private 512 KB L2 cache, and shares a 96 MB L3 cache. Fitting each Spark Worker instance’s working set in cache reduces core pipeline stalls and drives overall processing throughput higher. A POWER8 socket supports 160 GB/s of bandwidth to RAM. Spark writes intermediate data to local disk storage, but we steered intermediate data reads/writes to RAMfs for higher performance, which allowed us to fully utilize the high memory bandwidth of the POWER8 socket. The x86 platform also had RAMfs enabled for a fair comparison.

Logistic Regression, Matrix Factorization and SVM (Support Vector Machine) are three benchmarks in the machine learning category, commonly used for data classification, prediction and recommendation systems. We varied the number of Spark Workers and SPARK_WORKER_CORES to find the optimal combination for each workload. For instance, with Logistic Regression the optimal configuration was 4 Spark Workers with 24 threads each, running the cores in SMT-4 mode. This balances cache sharing and affinity for the individual workers while exploiting the memory bandwidth for inter-worker parallelism. For Matrix Factorization, the optimal configuration was 2 Spark Workers with 24 threads each (SMT-2), while SVM benefited from 12 Spark Workers with 16 threads each, consuming all 192 threads:
Logistic Regression Configuration
Matrix Factorization Configuration
Detailed performance results for these workloads are below, normalized to the x86 platform results (red bar).
Machine Learning Workloads
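The RAMfs steering and the Logistic Regression worker layout described above can be sketched as follows. This is a sketch for a Spark standalone deployment: the environment variables are real Spark 1.x settings, but the mount point and tmpfs size are illustrative assumptions, not our exact values:

```shell
# Steer Spark's intermediate (shuffle) data to a RAM-backed filesystem.
# One-time setup (requires root); mount point and size are assumptions:
#   sudo mkdir -p /mnt/ramdisk
#   sudo mount -t tmpfs -o size=64g tmpfs /mnt/ramdisk

# spark-env.sh sketch for the Logistic Regression run: 4 Workers with
# 24 threads each, i.e. 96 active logical CPUs (24 cores in SMT-4 mode).
export SPARK_LOCAL_DIRS=/mnt/ramdisk   # intermediate data lands in RAMfs
export SPARK_WORKER_INSTANCES=4
export SPARK_WORKER_CORES=24
```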
The last set of workloads we explored were the Spark GraphX workloads. Graph algorithms tend to be memory- and cache-bound and run well on the POWER8 system architecture. All three graph-oriented benchmarks in SparkBench 1.0 (PageRank, SVD++ based on Singular Value Decomposition, and TriangleCount) are popular graph computation algorithms. PageRank is used by web search engines to rank pages based on links to the page, SVD++ provides quality recommendations, and TriangleCount discovers and characterizes relationships, which is common in social networking applications. The data consumed by these workloads is based on the Google Web Graph data set; the graph has 875,713 nodes and 5,105,039 edges. We steered all intermediate reads/writes from Spark to RAMfs to use POWER8’s memory bandwidth efficiently; for comparison purposes, RAMfs was also used on the x86 platform. Since the SVD++ and TriangleCount graph workloads are more computationally intensive, we used 2 Spark Workers with 24 threads each (SMT-2), just like the Matrix Factorization configuration earlier, giving stronger individual thread performance and reducing cache misses by effectively providing more cache per logical CPU. For PageRank, we created 12 Spark Worker instances with 8 threads per worker (SMT-4).
This gave us optimal memory bandwidth utilization and allowed us to balance utilized threads in each physical core for higher performance. The resulting performance for these workloads is below:
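The graph workload configurations described above can be sketched in `spark-env.sh` (a sketch for a standalone deployment; worker and thread counts come from the text, memory settings are omitted):

```shell
# spark-env.sh sketch for PageRank: 12 Workers, 8 threads each
# (96 active logical CPUs, i.e. the 24 cores in SMT-4 mode).
export SPARK_WORKER_INSTANCES=12
export SPARK_WORKER_CORES=8

# SVD++ / TriangleCount instead used 2 Workers, 24 threads each (SMT-2):
#   export SPARK_WORKER_INSTANCES=2
#   export SPARK_WORKER_CORES=24
```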
In conclusion, our results show that across all 10 SparkBench 1.0 benchmarks, the POWER8 system achieved a mean speedup of 2.32X over the x86 platform. We achieved this by exploiting POWER8’s thread density, memory bandwidth, and cache hierarchy advantages over the x86 platform. Spark achieves its data parallelism through multiple Spark Workers and associated in-memory map-reduce algorithms; POWER8 is a great fit for this paradigm because of its efficiency in executing multi-JVM workloads with high memory bandwidth requirements. Spark also creates numerous tasks within each Worker to process different stages of the workload DAG (Directed Acyclic Graph), and this again is a great fit for POWER8 because of its SMT-8 thread density, allowing more tasks to execute concurrently. As shown above, POWER8 also has the flexibility to cover cases needing fewer, computationally stronger threads as well as cases where more threads deliver better throughput. Finally, Spark requires I/O bandwidth for intermediate data reads/writes, and this is abundantly available on POWER8 systems.
All in all, POWER8 is a great fit for Spark because of its balanced system design across processor, memory, network and I/O, and it is able to accelerate a broad range of Spark workloads by more than 2X over x86 platforms.
Join Power Systems on October 5th for a webcast highlighting new capabilities and product announcements that will help you go faster than ever before: http://bit.ly/1OcrNru
Note: Please post any questions about the performance data or methodology by contacting the authors or as comments to this blog post.
Min Li, Jian Tan, Yandong Wang, Li Zhang, and Valentina Salapura. 2015. SparkBench: a comprehensive benchmarking suite for in memory data analytic platform Spark. In Proceedings of the 12th ACM International Conference on Computing Frontiers (CF '15). ACM, New York, NY, USA, Article 53, 8 pages. DOI: 10.1145/2742854.2747283. http://doi.acm.org/10.1145/2742854.2747283