Understanding multi-user Spark application performance differences

Technical Blog Post



In an earlier blog I discussed differences among popular resource managers (Apache Mesos, Apache Hadoop YARN, and IBM Spectrum Conductor with Spark) and how these can affect the performance of Spark applications. Since the original Phase 1 study was published, we made enhancements to the Spark Multi-User Benchmark (SMB) intended to make it more realistic. The result was a new version of the benchmark (SMB-2) that submits pseudo-random sequences of Spark SQL queries and machine learning jobs. Results of Phase 2 of the study have recently been released by the Securities Technology Analysis Center (STAC). Phase 2, like Phase 1, evaluated the impact of resource manager software on Spark applications, and demonstrated strong advantages for IBM Spectrum Conductor with Spark. In this blog I expand on the reasons for these advantages reported by STAC.

 

As in Phase 1, during the Phase 2 audit STAC reported significant throughput advantages for IBM Spectrum Conductor with Spark. I will first focus on throughput performance, and then talk about Relative Standard Deviation (RSD, termed Fairness in my Phase 1 blog).

 

Although the workload in SMB-2 is quite different from SMB-1 and includes three use cases that are not covered by SMB-1, the same considerations apply. Granularity of resource allocation and the ability to fully utilise all available resources on the cluster play a major role in the overall performance equation. When we executed SMB-2 Use Case 1, we were executing sequences of Spark SQL queries. For each query, the Spark SQL engine parsed the query and translated it into one or more elemental Spark jobs and operations, depending on the complexity of the query. These jobs then interacted with the resource manager to obtain the necessary resources. Thus, at the run-time level, the SMB-2 Use Case 1 workload was actually quite similar to SMB-1. The main difference is that in the SMB-2 case there are more jobs of even shorter duration. This placed greater stress on the resource manager.

 

As discussed in my first blog on this subject, IBM Spectrum Conductor with Spark uses the slot as the most granular element of resource sharing. The slot can be defined in various ways, but most commonly, and by default, it represents a single virtual CPU as seen by the operating system. On a machine with 32 virtual CPUs (16 physical Intel cores for example), the resource manager will use 32 slots. By comparison, YARN has a multi-dimensional approach. If the server has 96 GB RAM and Spark has been configured to use 4 GB per executor, YARN will essentially use 96/4 = 24 'slots'. So in this case, YARN will disburse resources in larger chunks. If a given job is unable to use all the resources in this chunk, the unused resources will be effectively lost, because no tasks will be scheduled on them by the Spark scheduler.
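The arithmetic above can be sketched in a few lines. This is an illustrative model only, not Conductor or YARN code; the function names are invented, and the numbers simply mirror the example in the text (32 virtual CPUs, 96 GB RAM, 4 GB per executor):

```python
def conductor_slots(vcpus):
    """Slot-based model: by default, one slot per virtual CPU."""
    return vcpus

def yarn_executor_slots(ram_gb, executor_mem_gb):
    """Memory-driven model: executors sized by the configured memory,
    so resources are disbursed in larger chunks."""
    return ram_gb // executor_mem_gb

print(conductor_slots(32))          # 32 schedulable slots
print(yarn_executor_slots(96, 4))   # 24 executor 'slots'
```

The key point is not the absolute counts but the chunk size: with 4 GB chunks, a job that cannot fill a whole executor strands the remainder of that chunk, whereas one-vCPU slots leave far less to strand.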

 

In earlier versions of Mesos, the granularity of resource disbursement was coarser – essentially Mesos would offer resources at the granularity of one server at a time. In version 1.0.1 this was changed, and now Mesos uses an approach similar to that of YARN to offer multiple executors per server, based on the ratio of the dominant resource. Similarly to YARN, the resulting granularity of resource sharing is coarser than for IBM Spectrum Conductor with Spark, and resulting throughput performance is lower.

 

For SMB-2 Use Case 2, which measured the performance of a pure batch workload, STAC was also able to measure a significant throughput advantage of 1.3x. This was a bit unexpected, because when the jobs were long-running it was Spark, rather than the resource manager, that did most of the work, and Spark was the same in all cases, with the exception of the scheduler backend integration piece of course. Again, though, it is important to remember that the Spark SQL engine decomposes the SQL queries into many shorter jobs. This places stress on the resource manager, which has to arbitrate resources among these many jobs. Just as with Use Case 1, greater granularity of resource sharing allows for more complete utilisation of cluster resources and consequently greater throughput.

 

Use cases 3 and 4 are simply a mix of the interactive and batch workloads that were used in use cases 1 and 2. It stands to reason that the throughput advantages STAC reported for those use cases are similar to the advantages reported for use cases 1 and 2, and for the same reasons.

 

Another key measure of performance, discussed in the STAC report, is the RSD. During Phase 1 of our study, we called this measure “Fairness”; however, in Phase 2 we decided to adopt the more common statistical term in order to avoid confusion with other definitions of Fairness that are published in the scientific literature. RSD is defined as the standard deviation of a population of data points divided by the mean of that population, multiplied by 100. Essentially it measures the degree of scatter among the data points. In our case, since the duration of each job of a given type is directly related to the resources allocated to that job, it measures the consistency of resource allocation.
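The definition above is easy to express directly. A minimal sketch, using Python's standard `statistics` module and invented sample durations (these numbers are not from the STAC report):

```python
import statistics

def rsd(durations):
    """Relative Standard Deviation: stdev / mean * 100."""
    return statistics.stdev(durations) / statistics.mean(durations) * 100

# Durations in seconds for repeated runs of one job type;
# tighter scatter around the mean means a lower RSD.
print(round(rsd([10.0, 11.0, 9.5, 10.5]), 1))  # → 6.3
```

A lower RSD thus indicates that the resource manager gave each run of the job a more consistent share of the cluster.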

 


 

 

Figure 1. Plot of query duration vs. elapsed time for the three configurations for SMB-2 Use Case 1 (figure taken from STAC report).

 

Figure 1 shows a plot of query durations vs. time for SMB-2 Use Case 1. You can see that in SMB-2, unlike SMB-1, the jobs are different, and the pattern of data points on the plot shows stratification. This is because different queries will naturally have different durations, driven by the amount of data the query is pulling in, the number of joins and scans that have to be performed, and the amount of filtering that has to be done. Because of this, RSD is calculated not across all queries together, but for each query individually. To get the overall value of RSD, these individual query RSD values are combined as a weighted average.
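The per-query aggregation described above can be sketched as follows. The query names, durations, and the choice of run counts as weights are all hypothetical, chosen only to illustrate the idea of grouping by query before computing RSD:

```python
import statistics
from collections import defaultdict

def rsd(durations):
    """Relative Standard Deviation: stdev / mean * 100."""
    return statistics.stdev(durations) / statistics.mean(durations) * 100

# (query name, duration in seconds) — invented sample runs.
runs = [("q1", 10.0), ("q1", 11.0), ("q1", 9.0),
        ("q2", 100.0), ("q2", 95.0), ("q2", 105.0)]

# Group durations by query so each stratum is measured against its own mean.
by_query = defaultdict(list)
for name, duration in runs:
    by_query[name].append(duration)

# Overall RSD as a weighted average of per-query RSDs,
# here weighted by the number of runs of each query.
total_runs = sum(len(d) for d in by_query.values())
overall = sum(rsd(d) * len(d) for d in by_query.values()) / total_runs
print(round(overall, 1))  # → 7.5
```

Note that computing RSD over the pooled data instead would mostly measure the stratification between queries, not the scheduling consistency within each query, which is why the per-query grouping matters.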

 

STAC reported significant advantages, up to 2.2x, for IBM Spectrum Conductor with Spark over YARN and Mesos. There are two main reasons for this. The first, just as with the throughput performance discussed above, is that the finer granularity of resource sharing allows IBM Spectrum Conductor with Spark to arbitrate resources among queries more evenly. The second reason is a feature called preemption.

 

IBM Spectrum Conductor with Spark can preempt (take away) resources that are assigned to jobs that are already in progress and give them to jobs that are just starting. YARN has a similar feature. However, greater granularity of resource sharing combined with preemption still gives IBM Spectrum Conductor with Spark a clear edge. Mesos v1.0.1 lacks this feature, and thus we see that Mesos posts the worst results for RSD among the three configurations.  

 

I hope that this blog helps you understand why IBM Spectrum Conductor with Spark was able to post these significant performance advantages, as verified by the independent auditing organization STAC. In subsequent blogs, I will do deeper dives into the SMB-2 design, as well as the Spark SQL engine and MLlib architecture.

 
