Submitting Spark batch applications to Kerberos-enabled HDFS with keytab

Submit Spark workload to a Kerberos-enabled HDFS by using keytab authentication.

Before you begin

Spark versions not supported: 1.5.2, 2.0.1, and 2.1.0.
Kerberos authentication must be enabled. See Kerberos user authentication for Spark workload.
Hadoop security must be enabled on all hosts in the cluster. In the core-site.xml configuration file, ensure that the authorization and authentication properties are configured as follows:
```
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```
To enable access to Kerberos-enabled HDFS, the Spark instance group to which you submit Spark batch applications must reference the path to your Hadoop configuration. Modify the configuration of that Spark instance group and set the HADOOP_CONF_DIR environment variable for the Spark version to the path of your Hadoop configuration, for example: HADOOP_CONF_DIR=/opt/hadoop-2.6.5/etc/hadoop.
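
One way to confirm these prerequisites before you submit is to read the two properties back from the core-site.xml file under HADOOP_CONF_DIR. The following is a minimal sketch and not part of the product: the function names get_hadoop_prop and check_hadoop_security are hypothetical, and the parsing is plain text matching (not a real XML parser) that assumes each property lists <name> before <value>, as in the fragment above.

```shell
# Print the <value> of the named property from a Hadoop XML configuration file.
# Hypothetical helper: simple text matching, ignores XML comments.
get_hadoop_prop() (
  file=$1
  name=$2
  awk -v want="$name" '
    /<name>/  { n = $0; sub(/.*<name>/, "", n); sub(/<\/name>.*/, "", n) }
    /<value>/ { v = $0; sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v)
                if (n == want) { print v; exit } }
  ' "$file"
)

# Succeed only if core-site.xml enables Kerberos authentication and authorization.
# Takes the configuration directory as $1, or falls back to HADOOP_CONF_DIR.
check_hadoop_security() (
  conf_dir=${1:-$HADOOP_CONF_DIR}
  site="$conf_dir/core-site.xml"
  [ -r "$site" ] || { echo "cannot read $site" >&2; exit 1; }
  auth=$(get_hadoop_prop "$site" hadoop.security.authentication)
  authz=$(get_hadoop_prop "$site" hadoop.security.authorization)
  [ "$auth" = "kerberos" ] || { echo "hadoop.security.authentication is '$auth', expected 'kerberos'" >&2; exit 1; }
  [ "$authz" = "true" ] || { echo "hadoop.security.authorization is '$authz', expected 'true'" >&2; exit 1; }
)
```

For example, check_hadoop_security /opt/hadoop-2.6.5/etc/hadoop returns status 0 only when both properties match. Where the Hadoop client is installed, hdfs getconf -confKey hadoop.security.authentication reports the effective value instead.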

About this task

When you submit Spark workload with keytab to a Kerberos-enabled HDFS, specify the Kerberos principal and its keytab as options that are passed with the --conf flag. You must also specify HDFS paths as fully qualified URLs that include the hostname of the HDFS NameNode.

Limitations:

If impersonation (which runs Spark batch applications as the submission user) is not enabled for the Spark instance group, the keytab file of the workload submission user must be readable by the consumer execution user for the driver and executor.
When you submit by using the cluster management console or the ascd Spark application RESTful APIs, the keytab file must be on a shared file system.
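
The readability requirement can be checked ahead of time by running a small test as the consumer execution user. This is a sketch under assumptions: the function name check_keytab_readable is hypothetical, and it only tests file permissions on the host where it runs, not whether the keytab holds the expected principal.

```shell
# Succeed only if the keytab file exists and the current user can read it.
# Run this as the consumer execution user when impersonation is disabled.
check_keytab_readable() {
  if [ ! -f "$1" ]; then
    echo "keytab not found: $1" >&2
    return 1
  fi
  if [ ! -r "$1" ]; then
    echo "keytab not readable by $(id -un): $1" >&2
    return 1
  fi
}
```

For example, check_keytab_readable /home/test/test.keytab returns status 0 when the file is readable. With MIT Kerberos client tools installed, klist -kt /home/test/test.keytab additionally lists the principals stored in the keytab.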

Procedure

You can submit Spark batch applications from the cluster management console (on the My Applications & Notebooks page or the Spark Instance Groups page), by using ascd Spark RESTful APIs, or by using the spark-submit command in the Spark deployment directory.

Submit a Spark batch application using the following spark-submit syntax for keytab authentication:

spark-submit --master spark://Spark master_url --conf spark.yarn.keytab=path_to_keytab --conf spark.yarn.principal=principal@REALM.COM \
--class main-class application-jar hdfs://namenode:9000/path/to/input

where:

spark://Spark master_url identifies the Spark master URL of the Spark instance group to which the Spark batch application is submitted.
spark.yarn.keytab=path_to_keytab specifies the full path to the file that contains the keytab for the specified principal, for example, /home/test/test.keytab. Ensure that the execution user for the Spark driver consumer in the Spark instance group has access to the keytab file.
spark.yarn.principal=principal@REALM.COM specifies the principal used to log in to the KDC while running on Kerberos-enabled HDFS, for example, user@EXAMPLE.COM.
hdfs://namenode:9000/path/to/input specifies the fully qualified URL of the HDFS Namenode. Submitting workload with keytab enables the HDFS delegation token to be refreshed and generates the Spark YARN credential file in the home directory of the submission user in HDFS. Ensure that this directory already exists in HDFS.
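
Because the credential file is written to the HDFS home directory of the submission user, it can help to verify that the directory exists before you submit. The sketch below is illustrative only: the function names are hypothetical, it assumes the default HDFS home prefix /user, and it assumes that the caller already holds a valid Kerberos ticket (for example, from kinit) so that the hdfs command can authenticate.

```shell
# Succeed if the HDFS home directory of the given user exists.
# Assumes the default home directory prefix /user.
hdfs_home_exists() {
  hdfs dfs -test -d "/user/$1"
}

# Report a missing home directory instead of failing later at submission time.
require_hdfs_home() {
  if hdfs_home_exists "$1"; then
    return 0
  fi
  echo "HDFS home directory /user/$1 is missing; create it before submitting" >&2
  return 1
}
```

For example, require_hdfs_home test checks for /user/test. An HDFS administrator can create a missing directory with hdfs dfs -mkdir -p and assign ownership with hdfs dfs -chown.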

To submit workload in client or cluster mode, add the --deploy-mode client or --deploy-mode cluster options, respectively.

For example:

To submit the JavaWordCount example with keytab in client mode to Kerberos-enabled HDFS, enter:

spark-submit --master spark://test18.lab.example.com:7077 --deploy-mode client --conf spark.yarn.keytab=/home/test/test.keytab --conf spark.yarn.principal=user@EXAMPLE.COM \
--class org.apache.spark.examples.JavaWordCount $SPARK_HOME/spark-2.1.0-hadoop-2.7/examples/jars/spark-examples_2.11-2.1.0.jar hdfs://testNameNode:9000/user/test/input

To submit the JavaWordCount example with keytab in cluster mode to Kerberos-enabled HDFS, enter:

spark-submit --master spark://test18.lab.example.com:7077 --deploy-mode cluster --conf spark.yarn.keytab=/home/test/test.keytab --conf spark.yarn.principal=user@EXAMPLE.COM \
--class org.apache.spark.examples.JavaWordCount $SPARK_HOME/spark-2.1.0-hadoop-2.7/examples/jars/spark-examples_2.11-2.1.0.jar hdfs://testNameNode:9000/user/test/input
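
Since the client and cluster commands differ only in the --deploy-mode value, the command line can be assembled by a small helper. The following is a sketch, not a product utility: the function name build_keytab_submit is hypothetical, it only prints the command for review rather than running it, and it does not quote arguments that contain spaces.

```shell
# Print a spark-submit command line for keytab authentication.
# Arguments: master_url deploy_mode keytab principal main_class app_jar input_url
build_keytab_submit() (
  master=$1
  mode=$2
  keytab=$3
  principal=$4
  class=$5
  jar=$6
  input=$7
  # Only the two deploy modes described in this topic are accepted.
  case $mode in
    client|cluster) ;;
    *) echo "deploy mode must be client or cluster: $mode" >&2; exit 2 ;;
  esac
  # The input must be a fully qualified HDFS URL that names the NameNode.
  case $input in
    hdfs://*) ;;
    *) echo "input must be a fully qualified hdfs:// URL: $input" >&2; exit 2 ;;
  esac
  printf '%s ' spark-submit --master "$master" --deploy-mode "$mode" \
    --conf "spark.yarn.keytab=$keytab" --conf "spark.yarn.principal=$principal" \
    --class "$class" "$jar"
  printf '%s\n' "$input"
)
```

For example, build_keytab_submit spark://test18.lab.example.com:7077 cluster /home/test/test.keytab user@EXAMPLE.COM org.apache.spark.examples.JavaWordCount app.jar hdfs://testNameNode:9000/user/test/input prints the cluster-mode command on a single line, where app.jar stands in for the path to the application JAR file.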