Enabling multicluster session failover

Enable session failover by modifying the workload placement policy file on the multicluster primary host, or by setting environment variables on the client host. All client-side settings take precedence (that is, they override values in the primary host policy file). If a setting is not specified on the client host, the system uses the value from the policy file instead. This topic describes both methods.
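
The precedence between client host settings and the policy file can be pictured with a small shell sketch. This is an illustration only, not how IBM Spectrum Symphony resolves settings internally: the effective_setting function and the policy_failover variable are hypothetical names, and the policy value is hard-coded here rather than read from the policy file.

```shell
# Illustration only: mimics the documented precedence rule.
# effective_setting and policy_failover are hypothetical names.

# effective_setting CLIENT_VALUE POLICY_VALUE
# Prints the client value when it is set and non-empty; otherwise
# prints the value that would come from the policy file.
effective_setting() {
    printf '%s\n' "${1:-$2}"
}

# Stand-in for the workloadRedirectionFailover value in the policy file.
policy_failover="disabled"

# Client variable not set: the policy file value applies.
unset SMC_WORKLOAD_REDIRECTION_FAILOVER
effective_setting "${SMC_WORKLOAD_REDIRECTION_FAILOVER:-}" "$policy_failover"

# Client variable set: it overrides the policy file value.
export SMC_WORKLOAD_REDIRECTION_FAILOVER=enabled
effective_setting "${SMC_WORKLOAD_REDIRECTION_FAILOVER:-}" "$policy_failover"
```

The first call prints disabled and the second prints enabled, matching the rule that a client host variable wins whenever it is defined.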

Procedure

  1. To enable multicluster session-level failover, configure either the multicluster primary host or the client host:
    • Modify the workload placement policy file settings on your multicluster primary host:
      • workloadRedirection: Sets the level of workload redirection. Specify session (which is also the default value).
      • workloadRedirectionFailover: Enables or disables failover for workload redirection. When the workloadRedirection parameter is set to session, enables or disables task recovery. Specify enabled (the default is disabled).
      • topNClusterForTaskRedirection: Defines the first N clusters to distribute tasks from a session for task-level redirection. All tasks are submitted only to these N clusters. Valid values are 1 to 20. The default is 3.
      • topNClusterShareValues: Defines the share values for the first N clusters. The share values are used in a smooth weighted round-robin configuration to distribute submitted tasks across the clusters. Specify a comma-separated list of values from 1 to 100 (for example, 3,2,1). By default, all the highest-ranking clusters have the same share value.
      <?xml version="1.0" encoding="UTF-8"?>
      <Policy name="SessionFailover" description="Session Failover Policy" owner="" 
              xmlns="http://www.ibm.com/Symphony/schema" 
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
              xsi:schemaLocation="http://www.ibm.com/Symphony/schema ../7.2.0/schema/SmcPlacementPolicy.xsd">     
          <Clusters>
              <PrimaryGroup>
              </PrimaryGroup>
              <OverflowGroup enableOverflow="false">
              </OverflowGroup>    
          </Clusters>
          <Application
              workloadRedirection="session"
              workloadRedirectionFailover="enabled"/>
      </Policy>
    • Override the workload placement settings defined in the policy file by configuring the SMC_WORKLOAD_REDIRECTION and SMC_WORKLOAD_REDIRECTION_FAILOVER environment variables on the client host. By default, these variables are set as SMC_WORKLOAD_REDIRECTION=session and SMC_WORKLOAD_REDIRECTION_FAILOVER=disabled.
      When session-level redirection is enabled (SMC_WORKLOAD_REDIRECTION=session), set SMC_WORKLOAD_REDIRECTION_FAILOVER to enabled to enable session failover and task recovery. For example, for bash:
      export SMC_WORKLOAD_REDIRECTION_FAILOVER=enabled
      To enable session failover but not resubmit the session's unfinished tasks to other sessions, set SMC_WORKLOAD_REDIRECTION_FAILOVER to session_failover_without_task_recovery. For example, for bash:
      export SMC_WORKLOAD_REDIRECTION_FAILOVER=session_failover_without_task_recovery
  2. Optional: If you have sessions running on a cluster and IBM® Spectrum Symphony detects zero resources available (that is, resource starvation for the cluster), it closes the current session (and cancels active tasks), creates a new session on another cluster, and fails over to that cluster. To enable IBM Spectrum Symphony to check for resource starvation, configure either the multicluster primary host or the client host:
    • Modify the workload placement policy file to include the resubmitOnZeroResourcesTimeoutMinutes setting on your multicluster primary host.
      For example, to specify that if a cluster has zero allocation for an application for 5 minutes, the policy should fail active tasks and resubmit to an overflow cluster, specify:
      <Application resubmitOnZeroResourcesTimeoutMinutes="5" 
      topNClusterForTaskRedirection="2" 
      topNClusterShareValues="3,2" 
      workloadRedirection="task" 
      workloadRedirectionFailover="enabled"/>
      
    • Override the workload placement settings defined in the policy file by configuring the SMC_RESUBMIT_ON_ZERO_RESOURCES_TIMEOUT_MINUTES environment variable on the client host.

      By default, SMC_RESUBMIT_ON_ZERO_RESOURCES_TIMEOUT_MINUTES is disabled. All client host multicluster variables override the policy file on the multicluster primary host; therefore, if SMC_RESUBMIT_ON_ZERO_RESOURCES_TIMEOUT_MINUTES is not specified, the system uses the resubmitOnZeroResourcesTimeoutMinutes value from the policy file.

      For example, for bash, to specify that if a cluster has zero allocation for an application for 5 minutes, the policy should fail active tasks and resubmit them to an overflow cluster, set SMC_RESUBMIT_ON_ZERO_RESOURCES_TIMEOUT_MINUTES to 5:
      export SMC_RESUBMIT_ON_ZERO_RESOURCES_TIMEOUT_MINUTES=5
    Note that if both the resubmitOnZeroResourcesTimeoutMinutes (or SMC_RESUBMIT_ON_ZERO_RESOURCES_TIMEOUT_MINUTES) and the workloadRedirectionFailover (or SMC_WORKLOAD_REDIRECTION_FAILOVER) parameters are set, the system closes the current session, cancels active tasks, and resubmits the tasks to another cluster. If workloadRedirectionFailover (or SMC_WORKLOAD_REDIRECTION_FAILOVER) is not enabled, the policy only fails the active tasks and does not resubmit them to other clusters.
    Tip: To enable scalable clusters for workload overflow, see Overflowing workload to scalable clusters with IBM Spectrum Symphony multicluster.
  3. Optional: To allow migrated sessions to use their original creation times from their original clusters, and maintain a relative spot in the execution queue, you can enable the useInitialSessionCreationTime parameter in your multicluster workload placement policy (or set the SMC_USE_INITIAL_SESSION_CREATION_TIME environment variable on the client host):
    • Modify the workload placement policy file to enable the useInitialSessionCreationTime setting on your multicluster primary host.
      For example, to enable this setting:
      <Application rerankIntervalMinutes="0"
      resubmitOnZeroResourcesTimeoutMinutes="1"
      useInitialSessionCreationTime="enabled"
      topNClusterForTaskRedirection="3"
      topNClusterShareValues="1,1,1"
      workloadRedirection="session"
      workloadRedirectionFailover="enabled"/>
    • Override the workload placement settings defined in the policy file by configuring the SMC_USE_INITIAL_SESSION_CREATION_TIME environment variable on the client host.
      For example, for bash, to enable this setting:
      export SMC_USE_INITIAL_SESSION_CREATION_TIME=enabled
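
The topNClusterShareValues description in step 1 names smooth weighted round-robin as the way tasks are spread across the top N clusters. The source does not describe how IBM Spectrum Symphony implements it, so the following is only a sketch of the commonly used formulation of that technique, written as a shell function around awk. The swrr_sequence function name and its output format (cluster ranks separated by spaces) are assumptions made for illustration.

```shell
# Sketch of the common smooth weighted round-robin formulation.
# Assumption: this is NOT taken from IBM Spectrum Symphony; it only
# illustrates how share values such as 3,2,1 translate into a task order.

# swrr_sequence SHARES TASKS
#   SHARES: comma-separated share values, one per cluster rank (e.g. 3,2,1)
#   TASKS:  number of tasks to distribute
# Prints the rank (1-based) of the cluster chosen for each task.
swrr_sequence() {
    awk -v shares="$1" -v tasks="$2" 'BEGIN {
        n = split(shares, weight, ",")
        total = 0
        for (i = 1; i <= n; i++) {
            weight[i] += 0          # force numeric interpretation
            current[i] = 0
            total += weight[i]
        }
        for (t = 1; t <= tasks; t++) {
            best = 1
            # Raise every running score by its share, then pick the highest
            # (ties go to the higher-ranked, that is lower-numbered, cluster).
            for (i = 1; i <= n; i++) {
                current[i] += weight[i]
                if (current[i] > current[best]) best = i
            }
            # Penalize the winner so that it is not picked repeatedly.
            current[best] -= total
            printf "%s%d", (t > 1 ? " " : ""), best
        }
        printf "\n"
    }'
}

# Six tasks with share values 3,2,1
swrr_sequence 3,2,1 6
```

With share values 3,2,1 and six tasks, this sketch prints 1 2 1 3 2 1: the first-ranked cluster receives three tasks, the second two, and the third one, and the selections are interleaved rather than sent in bursts to a single cluster.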

Results

Multicluster session failover is enabled.
Additionally, if you also configured:
  • resubmitOnZeroResourcesTimeoutMinutes or SMC_RESUBMIT_ON_ZERO_RESOURCES_TIMEOUT_MINUTES: If set, the policy also checks for resource starvation and, if detected, fails active tasks and resubmits them to an overflow cluster.
  • useInitialSessionCreationTime or SMC_USE_INITIAL_SESSION_CREATION_TIME: If set, migrated sessions keep their original creation time from their original cluster, maintaining their relative order in the execution queue.

What to do next

To check whether multicluster session failover is enabled, follow these steps:
  1. From the multicluster management console, select Workload > Application Settings.
  2. Expand Application Settings to reveal the applications and policies table.
  3. Under Bins, select the information icon next to a workload bin to check whether workload redirection applies to sessions, and whether failover for session-level workload redirection is enabled for that bin.