Spark Multitenancy with IBM Platform Conductor
Apache Spark, one of the hottest general-purpose cluster-computing systems, known for its speed, ease of use, and sophisticated analytics, is increasingly being accepted as a mainstream platform for analytics workloads. More and more applications are being developed on Apache Spark, but with different Spark versions, different tools, and for different users. These “Spark multitenancy” challenges have become a nightmare for the IT departments of many modern enterprises.
The easiest way to handle these challenges is to set up isolated clusters for different lines of business, but that approach is far too expensive given the ever-growing demands on modern IT infrastructure. What IT needs is a software product that provides Spark multitenancy on a shared physical cluster.
We recently announced the General Availability of IBM Platform Conductor for Spark (Platform Conductor). This new product offering combines Apache Spark with Platform Enterprise Grid Orchestrator (EGO) to address many of the challenges that enterprises face, including Spark multitenancy.
Platform Conductor introduces a new concept, the “Spark Instance Group”, to represent the notion of a Spark tenant. Each Spark instance group is an installation of Apache Spark that runs the Spark core services (Master, Shuffle, and History) and any configured notebooks.
What is required for a Spark instance group?
Spark core services
Each Apache Spark community release provides the core Spark platform. In addition to the Spark runtime, this includes several core services, such as the Shuffle service and the History service; to tie everything together inside the cluster, there is also the Spark Master service. The Spark core platform has frequent releases, so running each application against the right Spark version is key to keeping it running reliably.
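For example, a minimal guard like the following lets an application fail fast when it is launched against the wrong Spark runtime. This is a sketch only, not Platform Conductor code; the expected version prefix is a hypothetical value:

    # Sketch: fail fast if this application is launched on an unexpected
    # Spark version. EXPECTED_VERSION_PREFIX is a hypothetical value.
    from pyspark import SparkContext

    EXPECTED_VERSION_PREFIX = "1.6"  # version this application was built against

    sc = SparkContext(appName="version-guard")
    if not sc.version.startswith(EXPECTED_VERSION_PREFIX):
        sc.stop()
        raise RuntimeError("Expected Spark %s.x but got %s"
                           % (EXPECTED_VERSION_PREFIX, sc.version))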
Spark tools/notebooks
Notebooks are a great innovation for interactive analytics and programming with Spark. Like other great innovations, there are different notebook implementations for different language bindings, each with its own release versions. Providing the right tools and notebooks, at the right versions, is also critical for data analysis.
Users and applications
Different user roles are required for a Spark instance group. Some basic roles include (but are not limited to):
- Administrator, who must be able to manage services inside a Spark instance group.
- User, who must be able to submit Spark applications to the Spark instance group for execution.
Some services can be stopped and restarted anywhere in the cluster, but others require HA support, so that a long-running job can survive an outage and resume afterward.
Basic isolation
Security can be an endless debate in multitenancy, but we consider the following common isolation requirements basic:
- Authentication isolation: The users or user groups who can access a Spark instance group must be configurable.
- Runtime isolation: All Spark instance group processes must run with credentials specific to that Spark instance group, so that the group is protected from other users and their processes.
- Data at rest isolation: All the data used or generated within the Spark instance group must be protected at least by ACLs. Data encryption can be added where a higher security level is required (see the sketch after this list).
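As an illustration of what runtime and data-at-rest isolation mean in practice, here is a minimal sketch (not Platform Conductor code; the directory and user name are hypothetical) that checks that files in an instance group's data directory are owned by its execution user and are not world-readable:

    import os
    import pwd
    import stat

    DATA_DIR = "/var/sparkdata/lobc"   # hypothetical instance group data directory
    EXECUTION_USER = "lobc"            # hypothetical execution user

    uid = pwd.getpwnam(EXECUTION_USER).pw_uid
    for root, _dirs, files in os.walk(DATA_DIR):
        for name in files:
            path = os.path.join(root, name)
            info = os.stat(path)
            if info.st_uid != uid:
                print("WARNING: %s is not owned by %s" % (path, EXECUTION_USER))
            if info.st_mode & stat.S_IROTH:
                print("WARNING: %s is world-readable" % path)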
How to create a Spark instance group with Platform Conductor
To create a Spark instance group, log in to the Platform Management Console (PMC) as an administrator, navigate to Workload > Spark > Spark Instance Groups, and click New.
This will launch the three-step Spark Instance Group creation wizard.
Step 1: Spark settings
This is where the administrator chooses the right Spark core services and notebook versions. All supported community Apache Spark versions, and the available notebooks, can be selected here.
Administrators can also drill down into the Spark and notebook configurations by clicking the default configuration link. In the dialog that opens, you can view all Apache Spark settings, organized the same way the community organizes them (with a link to the community help documentation). Pointing storage-related parameters at a shared storage location (such as the IBM Spectrum Scale storage provided by Platform Conductor itself) is the simplest way to provide HA for some services, as the sketch below shows.
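For instance, the standard Apache Spark event-log properties can point at such a shared location so that application history remains available across failovers. A sketch, assuming a hypothetical IBM Spectrum Scale mount at /gpfs/sparkshare:

    spark.eventLog.enabled           true
    spark.eventLog.dir               file:///gpfs/sparkshare/eventlogs
    spark.history.fs.logDirectory    file:///gpfs/sparkshare/eventlogs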

You can also configure any notebooks that you select for this Spark instance group here.
Step 2: EGO settings
This is where the administrator can leverage all enterprise features that Platform EGO provides to achieve maximum resource utilization for the cluster.
The settings are extremely simple: in addition to an “execution user”, you only need to configure a “Consumer” and a “Resource Group”. The EGO Consumer is the basic unit for resource scheduling, and the EGO Resource Group is the collection of physical hosts to deploy to. For more details on EGO concepts, refer to the IBM Knowledge Center.
There are many ways to configure consumers for various purposes. For best performance, all services inside the Spark instance group (Spark core services and notebooks) can share the same consumer; in a highly competitive environment, different consumers from the priority hierarchy can be assigned to different services, so that all Spark instance groups on the same physical cluster compete in an organized way.
It is also acceptable to keep the default settings and use the root consumer and the ComputeHosts resource group for a new Spark instance group.
The “execution user” setting is how Platform Conductor achieves runtime and data-at-rest isolation for a Spark instance group. Once the execution user is specified here, all Spark instance group processes run under that user on every host. All data generated by those processes is naturally owned by that execution user, isolating it from other execution users.
Step 3: Deployment settings
This is where the administrator configures how the Spark instance group resides on each host. This step also provides a way to:
- Upload other Spark instance group-specific packages (such as applications and required third-party libraries) to be deployed together with Spark core services.
- Specify the location on each host where the packages are to be deployed.
- Automatically start the deployment after configuration.
- Enable high availability for the Spark Master and configure the shared location that stores recovery data (see the sketch after this list).
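This HA option corresponds to the standalone Apache Spark Master recovery settings. How Platform Conductor applies them internally is product-specific, but in plain Apache Spark they are typically set through SPARK_MASTER_OPTS in spark-env.sh; a sketch with a hypothetical shared directory:

    # Sketch only; /gpfs/sparkshare is a hypothetical shared mount.
    SPARK_MASTER_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
    -Dspark.deploy.recoveryDirectory=/gpfs/sparkshare/master-recovery"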
After everything is set up, click Create, and the new Spark instance group is created. If you selected the auto-deploy option, Platform Conductor starts the deployment, and after a while (depending on the size of your cluster) the new Spark instance group is ready to be started.
How to deploy and start using a Spark instance group
Start a Spark instance group
Once a Spark instance group is created, it is in the “Registered” state. If the auto-deploy option was selected, it moves to the “Deploying” state automatically; otherwise, an administrator can deploy the Spark instance group manually. Soon after, the state becomes “Ready”. Once you click Start, all services move to the “Started” state and are ready to use.
Here’s a sample view of a host in the cluster based on the preceding configuration: the Spark instance groups are deployed to directories owned by their different execution users, lobb and lobc (LOBA is absent because that Spark instance group is not yet deployed):
Also, all processes run under the corresponding execution users (only LOBC is running):
Submit a Spark job
Once a Spark instance group is started, a user with the required permissions can go to Workload > Spark > Applications & Notebooks and submit a Spark application as a batch job by clicking Run a batch application:
Enter all the parameters for spark-submit and select the right Spark Master service to submit the application.
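As an illustration, here is a minimal batch application of the kind you might submit this way. Every name and path in it is hypothetical, and the equivalent command-line parameters are shown in the comment:

    # Equivalent spark-submit parameters (hypothetical master host and path):
    #   --master spark://<master-host>:7077 wordcount.py
    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount")
    counts = (sc.textFile("/shared/input.txt")      # hypothetical input path
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    for word, count in counts.take(10):             # print a small sample
        print(word, count)
    sc.stop()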
After the job is submitted, you can continue to monitor its status in the application status list and drill down to the details of every single application.
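Outside the PMC, the same status information is also exposed by the standard Apache Spark monitoring REST API. A minimal sketch, assuming a hypothetical History service host and the default port:

    import json
    from urllib.request import urlopen

    # Hypothetical History service endpoint; /api/v1 is the standard Spark REST API.
    URL = "http://history-host:18080/api/v1/applications"

    for app in json.load(urlopen(URL)):
        latest = app["attempts"][-1]          # most recent attempt
        state = "completed" if latest["completed"] else "running"
        print(app["id"], app["name"], state)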
Use notebooks
If a notebook was selected during Spark instance group creation, the administrator can assign users to the notebook. For more details on working with notebooks, watch for a separate blog post later.
Go try it
Now that you understand a Spark instance group in Platform Conductor, how it works with users and applications, and how we provide basic HA and isolation for the Spark instance group, try it out! Download an evaluation version of Platform Conductor from our Service Management Connect page. If you have any questions, post them in our forum!
For more information on Platform Conductor, visit the IBM Knowledge Center.