Complete this task to configure the port access and other networking customization that Apache Spark requires.
About this task
Apache Spark makes heavy use
of the network for communication between various processes, as shown in Figure 1.
Figure 1. Network ports used in a typical Apache Spark environment
These ports are further described in Table 1 and Table 2, which list the ports that Spark uses, both on the cluster side
and on the driver side.
Table 1. Network ports used by the Spark cluster

| Port name | Default port number | Configuration property* | Notes |
|---|---|---|---|
| Master web UI | 8080 | spark.master.ui.port or SPARK_MASTER_WEBUI_PORT | The value set by the spark.master.ui.port property takes precedence. |
| Worker web UI | 8081 | spark.worker.ui.port or SPARK_WORKER_WEBUI_PORT | The value set by the spark.worker.ui.port property takes precedence. |
| History server web UI | 18080 | spark.history.ui.port | Optional; applies only if you use the history server. |
| Master port | 7077 | SPARK_MASTER_PORT | |
| Master REST port | 6066 | spark.master.rest.port | Not needed if you disable the REST service. |
| Worker port | (random) | SPARK_WORKER_PORT | |
| Executor port | (random) | spark.executor.port | For Spark 1.5.2 only. |
| Block manager port | (random) | spark.blockManager.port | |
| Shuffle server | 7337 | spark.shuffle.service.port | Optional; applies only if you use the external shuffle service. |
Table 2. Network ports used by the Spark driver

| Port name | Default port number | Configuration property* | Notes |
|---|---|---|---|
| Application web UI | 4040 | spark.ui.port | |
| Driver port | (random) | spark.driver.port | |
| Block manager port | (random) | spark.blockManager.port | |
| File server | (random) | spark.fileserver.port | For Spark 1.5.2 only. |
| HTTP broadcast | (random) | spark.broadcast.port | For Spark 1.5.2 only. Not used if spark.broadcast.factory is set to TorrentBroadcastFactory (the default). |
| Class file server | (random) | spark.replClassServer.port | For Spark 1.5.2 only, and used only in Spark shells. |

*The Spark properties in the Configuration property column can be set either in the spark-defaults.conf file (if listed in lowercase) or in the spark-env.sh file (if listed in uppercase).
Spark must be able to bind to all the required ports. If Spark cannot bind to a specific port, it tries again with the next port number. By default, Spark retries up to 16 times; this limit is controlled by the spark.port.maxRetries property in the spark-defaults.conf file.
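For example, to pin normally random ports to fixed values (so that firewall rules can reference them) and raise the retry limit, you might add lines like the following to spark-defaults.conf. The port numbers here are illustrative choices, not defaults:

```
# spark-defaults.conf -- illustrative values; pick ports open on your firewall
spark.driver.port        36000
spark.blockManager.port  36001
spark.port.maxRetries    32
```

With spark.port.maxRetries set to 32, a port that cannot be bound is retried at 36001, 36002, and so on, up to 32 times.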
- For your planned deployment and ecosystem, consider any port access and firewall implications
for the ports listed in Table 1 and
Table 2, and configure specific port
settings, as needed. For instance, if your application developers need to access the Spark application web UI from
outside the firewall, the application web UI port must be open on the firewall.
Each time a Spark process starts, it creates a number of listening ports that are specific to the intended function of that process. Depending on your site networking policies, consider limiting access to all ports and permitting access only for specific users or applications.
In z/OS®, you can use settings in z/OS Communications
Server and RACF® to enforce controls. For instance, you can
specify PORT UNRSV DENY in your TCPIP.PROFILE to deny all applications access to unreserved ports
for TCP or UDP. You can also specify PORT UNRSV SAF to grant specific access to specific users, such
as the user ID that starts the Spark cluster and the Spark users. For more information
about the PORT statement, see z/OS Communications Server: IP Configuration Reference.
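The PORT UNRSV controls described above might look like the following TCPIP.PROFILE fragment. This is a sketch: SPARKID and SPARKRES are hypothetical job-name and SAF resource names, and you should verify the exact statement syntax against z/OS Communications Server: IP Configuration Reference before using it:

```
PORT
   UNRSV TCP *       DENY          ; deny unreserved TCP ports to all users
   UNRSV UDP *       DENY          ; deny unreserved UDP ports to all users
   UNRSV TCP SPARKID SAF SPARKRES  ; permit the Spark cluster user ID via RACF
```

The SAF keyword defers the access decision to your security product (for example, RACF), so the permitted users are whoever has READ access to the named resource.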
- Consider disabling the REST server.
The REST server interface, which listens on port 6066 by default, is currently not included
in the Apache Spark documentation. The
REST server supports neither TLS nor client authentication, yet Spark applications can be submitted
through this interface. The REST server is used when applications are submitted using cluster deploy
mode (--deploy-mode cluster). Client deploy mode is the default behavior for
Spark, and is the way that
notebooks, like Jupyter Notebook, connect to a Spark cluster. Depending on your
planned deployment and environment, access to the REST server might be restricted by other controls.
However, if you want to disable it, you can do so by setting
spark.master.rest.enabled to false in
$SPARK_CONF_DIR/spark-defaults.conf.
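The setting described above is a single line in $SPARK_CONF_DIR/spark-defaults.conf:

```
# Disable the standalone master REST server (listens on port 6066 by default)
spark.master.rest.enabled  false
```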
- Configure Spark
environment variables for common enterprise networking configurations. You can set each of the following environment variables in the
spark-env.sh file:
- SPARK_PUBLIC_DNS
- For environments that use network address translation (NAT), set
SPARK_PUBLIC_DNS to the external host name to be used for the Spark web UIs.
SPARK_PUBLIC_DNS sets the public DNS name of the Spark master and workers, which
allows the Spark master to present in its logs a URL with a host name that is
visible to the outside world.
- SPARK_LOCAL_IP
- Set the SPARK_LOCAL_IP environment variable to configure Spark processes to bind to a
specific and consistent IP address when creating listening ports.
- SPARK_MASTER_HOST
On systems with multiple network adapters, Spark might attempt the default
setting and give up if it does not work. Set SPARK_MASTER_HOST (known as
SPARK_MASTER_IP prior to Spark 2.0) to avoid this.
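Taken together, the three environment variables might be set in spark-env.sh as follows. The host name and IP address are hypothetical placeholders; substitute the values for your site:

```shell
# spark-env.sh fragment -- placeholder values for illustration

# External host name to advertise in Spark web UI URLs (for NAT environments)
export SPARK_PUBLIC_DNS=spark.example.com

# Bind all Spark listening ports to one specific, consistent local IP address
export SPARK_LOCAL_IP=10.1.1.5

# Address for the master to bind to on a system with multiple network adapters
# (this variable was named SPARK_MASTER_IP prior to Spark 2.0)
export SPARK_MASTER_HOST=10.1.1.5
```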
What to do next
Continue with Configuring IBM Java.