When your instances (instance groups, Anaconda distribution instances, and application instances) are deployed to a local file system, the shuffle service is enabled by default. When your instances are deployed to a shared file system, the shuffle service is disabled by default to improve application performance.
About this task
When your instances (instance groups, Anaconda distribution instances, and application instances) are deployed to a shared file system, data that is written to the shared file system does not need to be redistributed across the network, so the shuffle service is not enabled by default. You can optionally enable the shuffle service to run on each host in your cluster, independently of your Spark applications and their executors. When the shuffle service is enabled, Spark executors fetch shuffle files from the service instead of from each other, so any shuffle state that is written by an executor continues to be served beyond the executor's lifetime.
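For reference, this behavior maps onto standard Apache Spark properties. The following excerpt is a minimal sketch, not a complete instance group configuration (spark.shuffle.service.enabled is the standard Spark switch for the external shuffle service; the port value matches the documented default in the table below):

spark.shuffle.service.enabled  true    # executors fetch shuffle files from the service, not from each other
spark.shuffle.service.port     7337    # port that the shuffle service listens on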
Procedure
- In the Basic Settings, select the Spark version that the instance group must use.
- Click Configuration to define the following shuffle service settings:
| Spark parameter | Description |
| --- | --- |
| spark.shuffle.service.port | Define an exclusive port for use by the Spark shuffle service (default 7337). To ensure a unique environment for each instance group, the default port number increments by 1 for each instance group that you subsequently create: the shuffle service for your second instance group runs by default on port 7338, the next on port 7339, and so on. However, when you create an instance group by using RESTful APIs, the shuffle service uses port 7337 by default for all instance groups. Ensure that the shuffle service port is not already in use by the system or by processes external to IBM® Spectrum Conductor. The port is not used on every host, only on hosts that start the shuffle service, so you need to check only the hosts in the shuffle service resource group. |
| spark.local.dir | When your instances (instance groups, Anaconda distribution instances, and application instances) are deployed to a local file system (with the shuffle service enabled), specify a directory for scratch space (including map output files and RDDs) that is on a fast, local disk in your system (default /tmp). The value can also be a comma-separated list of directories on different disks. If spark.local.dir is set to an existing directory, ensure that the execution user for the instance group and the execution user for the Spark shuffle service consumer have read, write, and execute (rwx) permissions on that directory. If you change the default consumer for a component, the execution users of the driver and executor processes for any submitted applications, the Spark batch master service, and the Spark notebook master service must have execute (x) permission on the spark.local.dir directory. Note: With the shuffle service enabled, you must manually clean up data under the spark.local.dir location when the instance group is removed. |
| SPARK_EGO_LOGSERVICE_PORT | Specify the UI port for the EGO log service (default 28082). To ensure a unique environment for each instance group, the default port number increments by 1 for each instance group that you subsequently create: the log service for your second instance group runs by default on port 28083, the next on port 28084, and so on. However, when you create an instance group by using RESTful APIs, the log service uses port 28082 by default for all instance groups. Ensure that the log service UI port is not already in use by the system or by processes external to IBM Spectrum Conductor. The port is not used on every host, only on hosts that start the log service, so you need to check only the hosts in the shuffle service resource group. |
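Before you save these settings, you can sanity-check them from a shell on the affected hosts. The following is a minimal sketch that assumes a hypothetical scratch directory (/data/spark-local), a hypothetical group (sparkusers) that contains both execution users, and the documented default ports; substitute the values for your site:

# On hosts in the shuffle service resource group, confirm that the
# shuffle service and log service ports are free
ss -ltn | grep -E ':(7337|28082)\b' || echo "ports are free"

# Create the spark.local.dir scratch directory and grant rwx permissions
# to the instance group and shuffle service execution users
mkdir -p /data/spark-local
chgrp sparkusers /data/spark-local
chmod 770 /data/spark-local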
- If you want to change the default shuffle service consumer, click Edit detailed consumer assignments, and then click the consumer for the shuffle service.
- To select an existing child consumer, click Select an existing child
consumer, select the consumer under the top-level consumer, and click
Select.
- To create a new consumer, click Create a new consumer under
top_level_consumer, enter a consumer name, and click
Create.
If you plan to use exclusive slot allocation, the shuffle service consumer and the Spark
executors consumer must be different.
- If you want the shuffle service and executors to use different resource groups, in the Resource Groups and Plans section, select a different resource group for the shuffle service. Both resource groups must use the same set of hosts; otherwise, applications that are submitted to this instance group fail.
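To confirm that both resource groups resolve to the same set of hosts, you can inspect them from the command console. A sketch using the EGO CLI, assuming your EGO version supports the egosh rg subcommand and using hypothetical resource group names:

egosh rg ComputeHosts       # resource group used by the Spark executors
egosh rg ShuffleServiceRG   # resource group selected for the shuffle service

Compare the member hosts reported for each group; the lists must match.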
What to do next
- Finish configuring the instance group. See Defining basic settings for an instance group.
- Create and deploy the instance group.
- Click Create and Deploy Instance Group to create the instance group and deploy its packages
simultaneously. In this case, the new instance group appears on the Instance Groups page in the Ready state. Verify your deployment and
then start the instance group.
- Click Create Only to create the instance group but manually deploy its packages
later. In this case, the new instance group appears on the Instance Groups page in the Registered state. When you are ready to
deploy packages, deploy the instance group and verify the deployment. Then,
start the instance group.
- Set up the SparkCleanup service to run on all hosts in the cluster.
- Edit the SparkCleanup service profile at
$EGO_CONFDIR/../../eservice/esc/conf/services/sparkcleanup_service.xml.
- Modify the consumer. Find the <ego:ConsumerID> element and change it to use the /ClusterServices/EGOClusterServices consumer:
<ego:ConsumerID>/ClusterServices/EGOClusterServices</ego:ConsumerID>
- Modify the resource group: Find the
<ego:ResourceGroupName> element and
change it to use the InternalResourceGroup resource group:
<ego:ResourceGroupName>InternalResourceGroup</ego:ResourceGroupName>
- Modify the number of service instances: Find the <sc:MaxInstances> element and change it to use 5000 instances:
<sc:MaxInstances>5000</sc:MaxInstances>
- Save the changes. (A quick way to verify these three edits appears after this list.)
- From the command console, restart the
cluster:
egosh service stop all
egosh ego shutdown all
egosh ego start all
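You can verify the three profile edits at any point by inspecting the profile file; a quick check using the path given above:

grep -E 'ConsumerID|ResourceGroupName|MaxInstances' $EGO_CONFDIR/../../eservice/esc/conf/services/sparkcleanup_service.xml

The output should show /ClusterServices/EGOClusterServices, InternalResourceGroup, and 5000 in the corresponding elements.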