Configuring Apache ZooKeeper HA for YARN resource manager failover

If you set up high availability (HA) through Apache ZooKeeper for your YARN resource manager, Apache ZooKeeper selects a standby resource manager to become active when the active resource manager fails.

Before you begin

To use Apache ZooKeeper for HA, ensure that an Apache ZooKeeper cluster is installed and configured for your YARN resource manager.

Procedure

  1. Modify the service configuration file ($EGO_ESRVDIR/esc/conf/services/EGOYARN.xml) on the resource manager host:
    1. Add the names of the hosts to be used as resource managers when the active resource manager fails. Do this by modifying the <ego:ResourceRequirement> property.

      Ensure that all resource managers run on management hosts (or in a dedicated resource group that consists of management hosts, not compute hosts).

      Surround each resource manager host name with single quotation marks (') and separate the host names with ||. The following is an example:

      <ego:ResourceRequirement>select('rm_host_1'||'rm_host_2')</ego:ResourceRequirement>
    2. Set the minimum and maximum number of resource manager instances. Do this by modifying the <sc:MinInstances> and <sc:MaxInstances> properties. Ensure that the maximum value is greater than the minimum value. The following is an example:
      <sc:MinInstances>1</sc:MinInstances>
      <sc:MaxInstances>2</sc:MaxInstances>
  2. Modify the YARN configuration file ($HADOOP_CONF_DIR/yarn-site.xml) on all resource manager hosts and node manager hosts. Do this by adding Apache ZooKeeper HA properties to the file. The following example is based on the open source YARN configuration:
    <property>
        <name>yarn.resourcemanager.ha.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.resourcemanager.zk-address</name>
        <value>zk1:port1,zk2:port2,...,zkN:portN</value>
    </property>
    <property>
        <name>yarn.resourcemanager.recovery.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
        <value>true</value>
    </property>
    
    Notes:
    1. If the ego.yarn.resourcemanager.ha.virtualip property exists in the file, ensure that it is set to false, because this property applies to virtual IP-based HA.
    2. If your yarn-site.xml file includes the yarn.resourcemanager.store.class and yarn.resourcemanager.cluster-id properties, the system uses these values. If the file does not contain these properties, the system adds the following default values when the resource manager and node manager start:
      <property>
          <name>yarn.resourcemanager.store.class</name>
          <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
      </property>
      <property>
          <name>yarn.resourcemanager.cluster-id</name>
          <value>defaultClusterID</value>
      </property>
    3. When the resource manager and node manager start, the system automatically generates the following values for the yarn.resourcemanager.ha.rm-ids, yarn.resourcemanager.hostname.rmhost2, and yarn.resourcemanager.hostname.rmhost1 properties in yarn-site.xml:
      <property>
          <name>yarn.resourcemanager.ha.rm-ids</name>
          <value>rm_host_1,rm_host_2</value>
      </property>
      <property>
          <name>yarn.resourcemanager.hostname.rmhost2</name>
          <value>rm_host_2</value>
      </property>
      <property>
          <name>yarn.resourcemanager.hostname.rmhost1</name>
          <value>rm_host_1</value>
      </property>
  3. Restart the EGO primary and the EGOYARN service:
    egosh ego restart primary_host_name
    egosh service stop EGOYARN
    egosh service start EGOYARN
  4. Verify that there are multiple resource manager instances in the cluster:
    egosh service list -l
    Note that one resource manager is active; the others are on standby for HA.
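
The yarn-site.xml settings from step 2 can be sanity-checked before you restart the services. The following Python sketch is illustrative only and is not part of the product: the helper name check_ha_properties and the sample content (including the ZooKeeper host names and port 2181) are hypothetical, and it assumes the file uses the usual Hadoop layout of a <configuration> root with <property> children.

```python
import xml.etree.ElementTree as ET

# Properties that step 2 sets to true for ZooKeeper-based HA.
REQUIRED_TRUE = (
    "yarn.resourcemanager.ha.enabled",
    "yarn.resourcemanager.recovery.enabled",
    "yarn.resourcemanager.ha.automatic-failover.enabled",
)
ZK_ADDRESS = "yarn.resourcemanager.zk-address"
VIRTUAL_IP = "ego.yarn.resourcemanager.ha.virtualip"


def check_ha_properties(xml_text):
    """Return a list of problems found in yarn-site.xml content (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # Collect name -> value pairs from every <property> element.
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value

    problems = []
    for name in REQUIRED_TRUE:
        if props.get(name, "").lower() != "true":
            problems.append(f"{name} must be set to true")
    if not props.get(ZK_ADDRESS):
        problems.append(f"{ZK_ADDRESS} must list the ZooKeeper hosts")
    # The virtual IP property is optional, but must be false when present.
    if VIRTUAL_IP in props and props[VIRTUAL_IP].lower() != "false":
        problems.append(f"{VIRTUAL_IP} must be set to false")
    return problems


# Made-up sample: automatic failover is missing and the virtual IP flag is wrong.
SAMPLE = """<configuration>
  <property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
  <property><name>yarn.resourcemanager.zk-address</name><value>zk1:2181,zk2:2181</value></property>
  <property><name>yarn.resourcemanager.recovery.enabled</name><value>true</value></property>
  <property><name>ego.yarn.resourcemanager.ha.virtualip</name><value>true</value></property>
</configuration>"""

if __name__ == "__main__":
    for problem in check_ha_properties(SAMPLE):
        print(problem)
```

In practice you would pass the contents of $HADOOP_CONF_DIR/yarn-site.xml from each resource manager and node manager host instead of the sample string. An empty result means that the checked properties match the values in step 2.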

Results

After you configure the YARN resource manager for HA, if the active resource manager goes down or is no longer on the list, one of the standby resource managers becomes active and takes over resource manager responsibilities. This way, jobs continue to run and complete successfully.
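
One way to observe this behavior is to ask each resource manager for its HA state before and after a failover. The Python sketch below is an assumption-laden illustration, not a documented product procedure: it assumes that the open source YARN ResourceManager REST endpoint /ws/v1/cluster/info, whose clusterInfo object includes an haState field, is reachable in your deployment. The web port 8088, the host names, and the helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def parse_ha_state(payload):
    """Extract haState (for example ACTIVE or STANDBY) from a cluster info JSON document."""
    return json.loads(payload)["clusterInfo"]["haState"]


def fetch_ha_state(host, port=8088, timeout=5):
    """Query one resource manager over HTTP; port 8088 is an assumed web port."""
    url = f"http://{host}:{port}/ws/v1/cluster/info"
    with urlopen(url, timeout=timeout) as response:
        return parse_ha_state(response.read().decode("utf-8"))


def find_active(states):
    """Given a host -> haState mapping, return the hosts that report ACTIVE."""
    return [host for host, state in states.items() if state == "ACTIVE"]


if __name__ == "__main__":
    # Offline demonstration with hand-written payloads; no network access is needed.
    payloads = {
        "rm_host_1": '{"clusterInfo": {"haState": "STANDBY"}}',
        "rm_host_2": '{"clusterInfo": {"haState": "ACTIVE"}}',
    }
    states = {host: parse_ha_state(body) for host, body in payloads.items()}
    print(find_active(states))
```

Against a live cluster, you would build the mapping with fetch_ha_state for each host named in <ego:ResourceRequirement>. Exactly one host is expected to report ACTIVE at any time.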