Host blocking

When hosts in an instance group have environment errors, Spark drivers and executors fail to start or run on those hosts. You can set host blocking rules in the cluster management console so that the Spark master does not allocate resources from these hosts.

When host environment errors occur (for example, Spark installation errors, Java home misconfiguration, or inaccessible working directories), drivers and executors on these hosts fail. With host blocking, you can set rules that block such hosts for an instance group from starting drivers and executors, so that resources are not allocated from these hosts when these errors occur. The cluster management console also provides error messages and details so that you can resolve these errors and unblock the blocked hosts. Hosts that are automatically blocked for Spark services because of missing or outdated deployments are automatically unblocked by the ASCD when deployment is completed.

When drivers or executors fail to start or run properly, you can view the Spark master log for the driver's exit reason and exit code, or view the driver logs for the executor's exit reason. For example, the log might show INFO EgoApplicationManager: Driver driver-20170717043349-0001-b4c6d750-3618-4fa2-bd0b-d1597b095bbb exit due to No reason, exitCode is 15 and INFO EGOMaster: Driver throw exception:my io exception. You can then add my io exception as a rule.
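For illustration, the following Python sketch scans a Spark master log for driver exit messages of the form shown above and prints each driver's exit reason and exit code. The log path and the regular expression are assumptions based only on the sample message; adjust them to match your deployment.

    import re

    # Hypothetical location of the Spark master log; adjust for your deployment.
    MASTER_LOG = "/var/log/spark/spark-master.log"

    # Pattern based on the sample message above:
    #   "Driver <id> exit due to <reason>, exitCode is <code>"
    EXIT_LINE = re.compile(
        r"Driver\s+(?P<driver>\S+)\s+exit due to\s+(?P<reason>.*?),\s*exitCode is\s+(?P<code>\d+)"
    )

    def find_driver_exits(log_path=MASTER_LOG):
        # Yield (driver_id, exit_reason, exit_code) for each matching log line.
        with open(log_path) as log:
            for line in log:
                match = EXIT_LINE.search(line)
                if match:
                    yield match.group("driver"), match.group("reason"), int(match.group("code"))

    if __name__ == "__main__":
        for driver_id, reason, code in find_driver_exits():
            print(f"{driver_id}: reason={reason!r}, exitCode={code}")

Exit reasons surfaced this way (such as my io exception in the example above) are candidates for host blocking rules.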

You can set host blocking rules only with certain Spark versions; Spark 1.5.2 and 3.0.0 are not supported.

Host blocking is enabled by default. If it is disabled, to enable host blocking for an instance group from the cluster management console, see Blocking hosts for an instance group.

Hosts are automatically blocked under the following default conditions:
  • When the driver or executor cannot start on a host for any of the following reasons:
    • User account does not exist (code: 6)
    • Failed to change the container CWD (code: 7)
    • Container failed to start (code: 16)
    • Start up command does not exist (code: 17)
    • Command cannot be executed (code: 18)
    • The stdout, stderr redirection files cannot be created (code: 25)
    • The deployment is incomplete or outdated (code: 99)
    • Not an executable (code: 126)
    • Command not found (code: 127)
  • When the driver or executor starts on a host but then fails because the application exit reason matches a host blocking rule that you specified.
The following exit codes are built into Spark for use in blocking hosts (a lookup sketch follows the list):
  • Code 0: No reason.
  • Code 1: Exit because of setup failure.
  • Code 2: Fork failure.
  • Code 3: Failed to set process group ID.
  • Code 4: Failed to set environment variables.
  • Code 5: Failed to set process limits.
  • Code 6: User account does not exist.
  • Code 7: Failed to change container CWD.
  • Code 8: Terminated by SIGKILL.
  • Code 9: Unknown reason.
  • Code 10: Failed to reach PEM host.
  • Code 11: VEMKD and PEM sync issue.
  • Code 12: Execution host is not a VEM host.
  • Code 13: Allocation does not exist.
  • Code 14: Host is not allocated.
  • Code 15: Client does not exist.
  • Code 16: Container start fails.
  • Code 17: Startup command does not exist.
  • Code 18: Command not executed.
  • Code 19: Terminated by job controller.
  • Code 20: Terminated by SIGKILL, job controller does not exist or failed.
  • Code 21: Failed to set process priority.
  • Code 22: Killed by job monitor.
  • Code 23: The IP address failed to apply to the target host.
  • Code 24: Killed by cgroup OOM Killer.
  • Code 25: Failed to create stdout, stderr redirection files.
  • Code 26: The Docker Controller stop command timed out.
  • Code 27: The Docker Controller run command timed out.
  • Code 28: Failed to read from the Docker Controller.
  • Code 29: Failed to connect to the Docker daemon.
  • Code 30: The Docker operation encountered errors. For more information, see the Docker controller logs.
  • Code 31: There is an internal error in egodocker. For more information, see the Docker controller logs.
  • Code 32: The Docker operation timed out. For more information, see the Docker controller logs.
  • Code 33: The requested GPU is not available (default: Undefined exit reason).
  • Code 99: The deployment is incomplete or outdated.
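The built-in codes can be treated as a simple lookup table. The following Python sketch is illustrative only: the descriptions are abridged from the list above, and the set of codes that block a host automatically is taken from the default conditions listed earlier.

    # Abridged descriptions of the built-in exit codes listed above.
    BUILT_IN_EXIT_CODES = {
        0: "No reason",
        6: "User account does not exist",
        7: "Failed to change container CWD",
        8: "Terminated by SIGKILL",
        16: "Container start fails",
        17: "Startup command does not exist",
        18: "Command not executed",
        24: "Killed by cgroup OOM Killer",
        25: "Failed to create stdout, stderr redirection files",
        99: "The deployment is incomplete or outdated",
        126: "Not an executable",
        127: "Command not found",
    }

    # Codes that block a host automatically under the default conditions described above.
    AUTO_BLOCK_CODES = {6, 7, 16, 17, 18, 25, 99, 126, 127}

    def describe_exit(code: int) -> str:
        # Summarize an exit code and whether it blocks the host automatically by default.
        description = BUILT_IN_EXIT_CODES.get(code, "Code not listed here")
        blocking = "blocks host automatically" if code in AUTO_BLOCK_CODES else "does not block host automatically"
        return f"code {code}: {description} ({blocking})"

    print(describe_exit(24))   # code 24: Killed by cgroup OOM Killer (does not block host automatically)
    print(describe_exit(127))  # code 127: Command not found (blocks host automatically)

Codes outside the automatic set block a host only if they match a rule that you define.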

Limitations

Consider the following rules and limitations when you are using host blocking:
  • Set host blocking rules only when host environment issues occur, to avoid unnecessarily blocking hosts.
  • When you block or unblock allocations from a host by using the CLI or RESTful APIs, you might see incorrect or inconsistent information about the blocked host in the cluster management console.
  • Drivers and executors are blocked only if host blocking is enabled (it is enabled by default for new instance groups).
  • Host blocking for drivers is not supported in client mode.
  • Host blocking does not support user-defined exit reason rules for executors.
  • Host blocking does not support user-defined exit code rules for executors with Python applications.
  • To define exit code rules for drivers with Python applications, you must define an exit reason rule with the following reason:

    User application exited with #exitcode.
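
As a minimal sketch of that last limitation, assuming that #exitcode in the documented reason stands for the numeric exit code of the Python driver application, an exit code can be expressed as an exit reason rule like this:

    def python_driver_exit_reason_rule(exit_code: int) -> str:
        # Exit code rules are not supported for drivers with Python applications,
        # so express the exit code as an exit reason rule instead (assumption:
        # #exitcode in the documented reason stands for the numeric code).
        return f"User application exited with {exit_code}"

    # For example, to block hosts where a Python driver exits with code 1:
    print(python_driver_exit_reason_rule(1))  # User application exited with 1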