Configuring networking for Apache Spark

Complete this task to configure port access and other networking customizations that Apache Spark requires.

About this task

Apache Spark makes heavy use of the network for communication between various processes, as shown in Figure 1.

Figure 1. Network ports used in a typical Apache Spark environment
The figure shows the network ports that are used to communicate between the Spark driver, master, worker, and executor processes in a Spark cluster. The surrounding text and tables describe the ports and their usage.

These ports are further described in Table 1 and Table 2, which list the ports that Spark uses, both on the cluster side and on the driver side.

Table 1. Network ports used by the Spark cluster
Port name Default port number Configuration property* Notes
Master web UI 8080 spark.master.ui.port or SPARK_MASTER_WEBUI_PORT The value set by the spark.master.ui.port property takes precedence.
Worker web UI 8081 spark.worker.ui.port or SPARK_WORKER_WEBUI_PORT The value set by the spark.worker.ui.port property takes precedence.
History server web UI 18080 spark.history.ui.port Optional; only applies if you use the history server.
Master port 7077 SPARK_MASTER_PORT SPARK_MASTER_PORT (or the default, 7077) is the starting point for connection attempts, not necessarily the port that is actually used. In addition, the value can be 0, which means that a random port number is used. Therefore, SPARK_MASTER_PORT (or the default, 7077) might not be the port that the master actually uses. This is true for all methods of starting the master, including BPXBATCH, the start*.sh scripts, and the started task procedure.
Note: You should not choose 0 for SPARK_MASTER_PORT if you intend to use client authentication.
Master REST port 6066 spark.master.rest.port Not needed if the REST service is disabled; note that the REST service is disabled by default. If you plan to use the REST service with authentication, configure AT-TLS authentication for this port.
Worker port (random) SPARK_WORKER_PORT  
Block manager port (random) spark.blockManager.port  
External shuffle server 7337 spark.shuffle.service.port Optional; only applies if you use the external shuffle service.
PySpark daemon (random) spark.python.daemon.port Optional; only applies if you use PySpark.
Table 2. Network ports used by the Spark driver
Port name Default port number Configuration property* Notes
Application web UI 4040 spark.ui.port  
Driver port (random) spark.driver.port  
Block manager port (random) spark.blockManager.port The value set by the spark.driver.blockManager.port property takes precedence.
Driver block manager port (Value of spark.blockManager.port) spark.driver.blockManager.port If spark.driver.blockManager.port is not set, the value of spark.blockManager.port is used.

*The Spark properties in the Configuration property column can either be set in the spark-defaults.conf file (if listed in lower case) or in the spark-env.sh file (if listed in upper case).
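For example, the lower-case properties go into spark-defaults.conf as key-value pairs, while the upper-case names are exported from spark-env.sh. The following fragment is only an illustration using the default port values:

```
# spark-defaults.conf (lower-case Spark properties)
spark.master.ui.port    8080
spark.history.ui.port   18080

# spark-env.sh (upper-case environment variables)
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_WEBUI_PORT=8081
```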

Spark must be able to bind to all the required ports. If Spark cannot bind to a specific port, it retries with the next port number (+1). The maximum number of retries is controlled by the spark.port.maxRetries property (default: 16) in the spark-defaults.conf file.

Note: The external shuffle server port does not support the port binding retries functionality.
The port binding retries functionality also implies a limit on the number of simultaneous instances of those ports across all Spark processes that use the same configuration. Assuming that the spark.port.maxRetries property is at its default (16), here are a few examples:
  • If the Spark application web UI is enabled, which it is by default, no more than 17 Spark applications can run at the same time, because the 18th Spark driver process will fail to bind to an application web UI port.
  • When both spark.blockManager.port and spark.driver.blockManager.port are set, there can be no more than 17 executor processes running at the same time, because the 18th executor process will fail to bind to a Block manager port.
  • When spark.blockManager.port is set but spark.driver.blockManager.port is not set, the combined total of executor and driver processes cannot exceed 17, as the 18th process will fail to bind to a Block manager port.

Consider increasing the spark.port.maxRetries value if you plan to run multiple Spark applications at the same time, or to run a high number of executors within the cluster simultaneously.
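The arithmetic behind these limits can be sketched as follows; the function is purely illustrative and is not a Spark API:

```python
# Sketch of the port-binding limit implied by spark.port.maxRetries.
# A process tries the base port and then up to max_retries successive
# ports (+1 per attempt), so max_retries + 1 processes can each hold
# one such port before the next process fails to bind.
def max_simultaneous_bindings(max_retries: int = 16) -> int:
    return max_retries + 1

# With the default of 16 retries, 17 driver processes can hold an
# application web UI port at the same time; the 18th fails.
print(max_simultaneous_bindings(16))  # 17
```

Raising spark.port.maxRetries to 32, for example, would allow up to 33 simultaneous bindings of a given port type.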

Procedure

  1. For your planned deployment and ecosystem, consider any port access and firewall implications for the ports listed in Table 1 and Table 2, and configure specific port settings, as needed.
    For instance, if your application developers need to access the Spark application web UI from outside the firewall, the application web UI port must be open on the firewall.

    Each time a Spark process is started, a number of listening ports are created that are specific to the intended function of that process. Depending on your site networking policies, limit access to all ports and permit access for specific users or applications.

    On z/OS®, you can use settings in z/OS Communications Server and RACF® to enforce controls. For instance, you can specify PORT UNRSV DENY in your TCPIP.PROFILE to deny all applications access to unreserved ports for TCP or UDP. You can also specify PORT UNRSV SAF to grant specific access to specific users, such as the user ID that starts the Spark cluster and the Spark users. For more information about the PORT statement, see z/OS Communications Server: IP Configuration Reference.
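A minimal sketch of such a TCPIP.PROFILE fragment follows. The job name SPARKID and the SAF resource name SPARKPRT are hypothetical, and the exact statement syntax should be verified against z/OS Communications Server: IP Configuration Reference:

```
PORT
   UNRSV TCP *       DENY          ; deny all applications access to unreserved TCP ports
   UNRSV TCP SPARKID SAF SPARKPRT  ; permit users authorized to the SPARKPRT SAF resource
```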

  2. Consider your planned usage of the REST server.

    The REST server interface, which listens on port 6066, is disabled by default. The REST server supports TLS client authentication, and Spark applications can be submitted through this interface.
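    If you decide to use the REST interface, a minimal spark-defaults.conf sketch to enable it might look like the following (the port shown is the default):

```
# spark-defaults.conf
spark.master.rest.enabled   true
spark.master.rest.port      6066
```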

  3. Configure Spark environment variables for common enterprise networking configurations.
    You can set each of the following environment variables in the spark-env.sh file:
    SPARK_PUBLIC_DNS
    For environments that use network address translation (NAT), set SPARK_PUBLIC_DNS to the external host name to be used for the Spark web UIs. SPARK_PUBLIC_DNS sets the public DNS name of the Spark master and workers, which allows the Spark master to log a URL with a host name that is visible outside the NAT environment.
    SPARK_LOCAL_IP
    Set the SPARK_LOCAL_IP environment variable to configure Spark processes to bind to a specific and consistent IP address when creating listening ports.
    SPARK_MASTER_HOST
    On systems with multiple network adapters, Spark might attempt to bind to the default interface and fail if that interface is not the correct one. Set SPARK_MASTER_HOST (known as SPARK_MASTER_IP before Spark 2.0) to avoid this.
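    Putting these together, a spark-env.sh sketch for a NAT environment with multiple network adapters might look like the following; the host name and addresses are hypothetical examples:

```shell
# spark-env.sh (illustrative values)
export SPARK_PUBLIC_DNS=spark.example.com   # external host name shown in web UI URLs
export SPARK_LOCAL_IP=10.1.1.5              # internal address to bind listening ports to
export SPARK_MASTER_HOST=10.1.1.5           # host/interface the master binds to
```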

What to do next

Continue with Configuring z/OS Spark client authentication.