Configuring networking for Apache Spark

Complete this task to configure port access and other networking customizations that Apache Spark requires.

About this task

Apache Spark makes heavy use of the network for communication between various processes, as shown in Figure 1.

Figure 1. Network ports used in a typical Apache Spark environment
The figure shows the network ports that are used to communicate between the Spark driver, master, worker, and executor processes in a Spark cluster. The surrounding text and tables describe the ports and their usage.

These ports are further described in Table 1 and Table 2, which list the ports that Spark uses, both on the cluster side and on the driver side.

Table 1. Network ports used by the Spark cluster
Port name Default port number Configuration property* Notes
Master web UI 8080 spark.master.ui.port or SPARK_MASTER_WEBUI_PORT The value set by the spark.master.ui.port property takes precedence.
Worker web UI 8081 spark.worker.ui.port or SPARK_WORKER_WEBUI_PORT The value set by the spark.worker.ui.port property takes precedence.
History server web UI 18080 spark.history.ui.port Optional; only applies if you use the history server.
Master port 7077 SPARK_MASTER_PORT SPARK_MASTER_PORT (or the default, 7077) is the starting point for connection attempts, not necessarily the port that is actually used. In addition, the value can be 0, which means that a random port number is used. Therefore, SPARK_MASTER_PORT (or the default, 7077) might not be the port that the master actually uses. This is true for all methods of starting the master, including BPXBATCH, the start*.sh scripts, and the started task procedure.
Note: You should not choose 0 for SPARK_MASTER_PORT if you intend to use client authentication.
Master REST port 6066 spark.master.rest.port Not needed if the REST service is disabled; note that the REST service is disabled by default. If you plan to use the REST service with authentication, configure AT-TLS authentication for this port.
Worker port (random) SPARK_WORKER_PORT  
Block manager port (random) spark.blockManager.port  
External shuffle server 7337 spark.shuffle.service.port Optional; only applies if you use the external shuffle service.
PySpark daemon (random) spark.python.daemon.port Optional; only applies if you use PySpark.
Table 2. Network ports used by the Spark driver
Port name Default port number Configuration property* Notes
Application web UI 4040 spark.ui.port  
Driver port (random) spark.driver.port  
Block manager port (random) spark.blockManager.port The value set by the spark.driver.blockManager.port property takes precedence.
Driver block manager port (Value of spark.blockManager.port) spark.driver.blockManager.port If spark.driver.blockManager.port is not set, the value of spark.blockManager.port is used.

*The Spark properties in the Configuration property column can either be set in the spark-defaults.conf file (if listed in lower case) or in the spark-env.sh file (if listed in upper case).
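For example, the lower-case properties go into spark-defaults.conf as key-value pairs, while the upper-case names are exported from spark-env.sh. The following fragment is only an illustration using the default port values:

```
# spark-defaults.conf (lower-case Spark properties)
spark.master.ui.port    8080
spark.history.ui.port   18080

# spark-env.sh (upper-case environment variables)
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_WEBUI_PORT=8081
```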

Spark must be able to bind to all the required ports. If Spark cannot bind to a specific port, it retries with the next port number (+1). The maximum number of retries is controlled by the spark.port.maxRetries property (default: 16) in the spark-defaults.conf file.

Note: The external shuffle server port does not support the port binding retries functionality.
The port binding retries functionality also implies a limit on the number of simultaneous instances of those ports across all Spark processes that use the same configuration. Assuming that the spark.port.maxRetries property is at its default (16), here are a few examples:
  • If the Spark application web UI is enabled, which it is by default, no more than 17 Spark applications can run at the same time, because the 18th Spark driver process will fail to bind to an application web UI port.
  • When both spark.blockManager.port and spark.driver.blockManager.port are set, there can be no more than 17 executor processes running at the same time, because the 18th executor process will fail to bind to a Block manager port.
  • When spark.blockManager.port is set but spark.driver.blockManager.port is not set, the combined total of executor and driver processes cannot exceed 17, as the 18th process will fail to bind to a Block manager port.

Consider increasing the spark.port.maxRetries value if you plan to run multiple Spark applications at the same time, or to run a high number of executors within the cluster simultaneously.
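The arithmetic behind these limits can be sketched as follows; the function is purely illustrative and is not a Spark API:

```python
# Sketch of the port-binding limit implied by spark.port.maxRetries.
# A process tries the base port and then up to max_retries successive
# ports (+1 per attempt), so max_retries + 1 processes can each hold
# one such port before the next process fails to bind.
def max_simultaneous_bindings(max_retries: int = 16) -> int:
    return max_retries + 1

# With the default of 16 retries, 17 driver processes can hold an
# application web UI port at the same time; the 18th fails.
print(max_simultaneous_bindings(16))  # 17
```

Raising spark.port.maxRetries to 32, for example, would allow up to 33 simultaneous bindings of a given port type.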

Procedure

  1. For your planned deployment and ecosystem, consider any port access and firewall implications for the ports listed in Table 1 and Table 2, and configure specific port settings, as needed.
    For instance, if your application developers need to access the Spark application web UI from outside the firewall, the application web UI port must be open on the firewall.

    Each time a Spark process is started, a number of listening ports are created that are specific to the intended function of that process. Depending on your site networking policies, limit access to all ports and permit access for specific users or applications.

    On z/OS®, you can use settings in z/OS Communications Server and RACF® to enforce controls. For instance, you can specify PORT UNRSV DENY in your TCPIP.PROFILE to deny all applications access to unreserved ports for TCP or UDP. You can also specify PORT UNRSV SAF to grant specific access to specific users, such as the user ID that starts the Spark cluster and the Spark users. For more information about the PORT statement, see z/OS Communications Server: IP Configuration Reference.
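A minimal sketch of such a TCPIP.PROFILE fragment follows. The job name SPARKID and the SAF resource name SPARKPRT are hypothetical, and the exact statement syntax should be verified against z/OS Communications Server: IP Configuration Reference:

```
PORT
   UNRSV TCP *       DENY          ; deny all applications access to unreserved TCP ports
   UNRSV TCP SPARKID SAF SPARKPRT  ; permit users authorized to the SPARKPRT SAF resource
```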

  2. Consider your planned usage of the REST server.

    The REST server interface, which listens on port 6066, is disabled by default. The REST server supports TLS client authentication, and Spark applications can be submitted through this interface.
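    If you decide to use the REST interface, a minimal spark-defaults.conf sketch to enable it might look like the following (the port shown is the default):

```
# spark-defaults.conf
spark.master.rest.enabled   true
spark.master.rest.port      6066
```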

  3. Configure Spark environment variables for common enterprise networking configurations.
    You can set each of the following environment variables in the spark-env.sh file:
    SPARK_PUBLIC_DNS
    For environments that use network address translation (NAT), set SPARK_PUBLIC_DNS to the external host name to be used for the Spark web UIs. SPARK_PUBLIC_DNS sets the public DNS name of the Spark master and workers, which allows the Spark master to log a URL with a host name that is visible outside the NAT environment.
    SPARK_LOCAL_IP
    Set the SPARK_LOCAL_IP environment variable to configure Spark processes to bind to a specific and consistent IP address when creating listening ports.
    SPARK_MASTER_HOST
    On systems with multiple network adapters, Spark might attempt to bind to the default interface and fail if that interface is not the correct one. Set SPARK_MASTER_HOST (known as SPARK_MASTER_IP before Spark 2.0) to avoid this.
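    Putting these together, a spark-env.sh sketch for a NAT environment with multiple network adapters might look like the following; the host name and addresses are hypothetical examples:

```shell
# spark-env.sh (illustrative values)
export SPARK_PUBLIC_DNS=spark.example.com   # external host name shown in web UI URLs
export SPARK_LOCAL_IP=10.1.1.5              # internal address to bind listening ports to
export SPARK_MASTER_HOST=10.1.1.5           # host/interface the master binds to
```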

What to do next

Continue with Configuring z/OS Spark client authentication.