Using standby services to reduce service start times

Standby services minimize the need to restart services at the time resources are allocated to an application by allowing these services to run idle when there is no workload. You can configure a default standby service and, additionally, non-default (most currently running) standby services. The standby service feature operates at the consumer (not cluster) level and is supported with both slot-based scheduling and multidimensional scheduling.

The standby service feature is not supported by IBM® Spectrum Symphony Developer Edition or for MapReduce workload. In these cases, IBM Spectrum Symphony ignores the standby service configuration.

About standby services

To maximize the utilization of resources, by default, IBM Spectrum Symphony releases resources as soon as there are no running or pending tasks. Each time the resources are released, service instances on these resources are terminated. When more tasks are received, new resources are allocated and the services start again. Sometimes starting up a service takes much longer than the actual run time of the workload; for time-critical workload, this may not be acceptable.

Standby services minimize the need to restart services when resources are allocated to an application by allowing these services to keep running. Standby services also allow other consumers to use these resources when there is no workload for the running service. Because standby services do not occupy slots, EGO can allocate these resources to other applications. Once the service instance is associated with a slot and is used to run tasks, it is no longer considered a standby service.

Standby services are only available after the application's services have processed some workload and those services have gone idle. Once a service is put into standby mode, it remains running until the application is unregistered or disabled.

Note: For multidimensional scheduling, if the standby service resource usage exceeds the active resource usage, IBM Spectrum Symphony will not create the service as a standby service, and will log a warning message in the SSM log file.

Standby services versus preloaded services

IBM Spectrum Symphony offers two ways to handle services with long start times: standby services and preloaded services. Preloaded services are started before workload is submitted by the client and require the resource to remain allocated to the application. Consequently, the resource cannot be shared with other consumers. A resource with standby services running, on the other hand, is only allocated to an application when there is workload to be processed. Once workload is finished, the resource is released (with the service running in standby mode) and made available to other consumers.

Standby services can be combined with preloaded services. In this case, an application can be configured to have a number of slots for preloaded services in addition to standby services. When the application receives workload, if the number of preloaded services cannot satisfy the demand, the SSM requests new resources from EGO and uses the standby services to supplement the requested resources.

When to use standby services

Standby services are recommended for environments where the tasks are sent intermittently and the service startup time is relatively long in comparison to the running time of the tasks. By using this feature, users can reduce the impact of starting services on the overall task turnaround time.
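The effect on turnaround time can be illustrated with simple arithmetic. The following is a minimal sketch, not part of IBM Spectrum Symphony: the function name and the 30-second and 2-second figures are hypothetical and chosen only to show a startup time that is long relative to the task run time.

```python
def turnaround_seconds(startup_s, run_s, service_already_running):
    """Toy model: a task pays the service startup cost only when
    no running (standby) service is available to reuse."""
    return run_s if service_already_running else startup_s + run_s


# Hypothetical figures: 30 s service startup, 2 s task run time.
cold = turnaround_seconds(30.0, 2.0, service_already_running=False)
warm = turnaround_seconds(30.0, 2.0, service_already_running=True)
print(cold, warm)  # prints "32.0 2.0"
```

In this illustration, the startup cost dominates the cold turnaround time, which is the situation in which standby services help most.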

Here are additional considerations when deciding if standby services are the best choice:

Because standby services are kept running and occupy host resources such as memory, they are not recommended for services with nominal startup times or for memory-intensive services, which can make the host unusable by other applications. In this case, it is recommended to use preloaded services to retain the resources and keep the services running.
Does this application's resource plan entitle it to own (ownership) or deserve (share ratio) a number of slots? If the answer is no, then it is not suitable for standby services.
Do other applications need the selective reclaim feature? If the answer is yes, do not use standby services, because the standby service configuration prevents selective reclaim from taking effect.

Default services in standby mode

For applications with multiple services and maxOtherInstances set to a value greater than 0 in the Service section of the profile, the service instance manager (SIM) can keep multiple services running concurrently. The number of services that can be kept running is the value specified for maxOtherInstances, plus one. If the default service is running on a resource unit (one slot if the service-to-slot ratio is 1) at the time it is evaluated to be put into standby mode, the default service along with all of the other service instances residing on that resource unit are put into standby mode. If the default service is not running on a resource unit at the time it is evaluated to be put into standby mode, the default service instance is started. At this time, maxOtherInstances is enforced, and the most idle service instance may be terminated to comply with its value. Then the default service instance plus all other remaining service instances on that resource unit are put into standby mode.
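The evaluation described above can be sketched as a small function. This is an illustrative model only, not product code: the function name, the service names, and the convention that the input list is ordered from most idle to least idle are all assumptions made for the example.

```python
def default_standby_set(running_most_idle_first, default_service,
                        max_other_instances):
    """Toy model of which services on one resource unit go into standby.

    running_most_idle_first: service names on the resource unit, ordered
    from most idle to least idle (an assumption of this sketch).
    """
    services = list(running_most_idle_first)
    if default_service in services:
        # Default service already running: everything goes into standby.
        return services
    # Default service must be started, so enforce maxOtherInstances first
    # by terminating the most idle instances.
    while len(services) > max_other_instances:
        services.pop(0)
    services.append(default_service)
    return services


# Hypothetical services; maxOtherInstances = 1 allows 1 + 1 = 2 services.
print(default_standby_set(["svcA", "svcB"], "svcDefault", 1))
# prints "['svcB', 'svcDefault']"
```

The result never holds more than maxOtherInstances plus one services when the default service has to be started, which matches the limit stated above.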

To configure a default service as the standby service, use the cluster management console to edit your application profile to set Consumer as the standby service scope.

Non-default (most currently running) services in standby mode

Instead of the default service, you can configure the last service used to run workload as the standby service. For applications with multiple services and maxOtherInstances set to a value greater than 0, all service instances residing on a resource unit are put into standby mode when that resource unit is released back to EGO, regardless of whether or not the default service is running on that resource unit. These standby services are beneficial to the next burst of workload if the next burst of workload uses the same service types as the services that were most recently running.

To configure a non-default service as the standby service, you must first set the standbyServiceScope attribute to Consumer in the Consumer section of the application profile, and then enable the enableStandbyCurrentService attribute. Note that this can only be done manually in the application profile; there is no cluster management console support for enableStandbyCurrentService.

Notes:

This feature requires you to set the standbyServiceScope attribute to Consumer; it does not work if standbyServiceScope is set to Cluster.
For customers who used releases before IBM Spectrum Symphony 7.3: setting standbyServiceScope to Consumer behaves the same as setting enableStandbyService in the Consumer section to enable the standby service feature in previous releases; setting standbyServiceScope to Cluster behaves the same as setting enableGlobalStandbyServices to enable the global standby services feature in previous releases.

When non-default standby services are enabled, different services on different hosts are put in standby mode. The SSM uses a preferred host list to select the standby hosts whose standby services best suit its sessions. The SSM collects the preferred standby hosts based on the service types of its open sessions and puts these hosts in the preferred host list. The SSM then sends the preferred host list to VEMKD (the EGO kernel daemon). When VEMKD assigns slots on standby hosts, it assigns the SSM slots on hosts in the preferred host list before other standby hosts.

There are two cases in which the preferred host list is not used:

The resReq attribute is specified in the Consumer section of the application profile and contains an order clause. The order clause is enforced for preferred resource selection, and the preferred standby host list is ignored.
The Balanced Slot Allocation Policy is enabled in the Resource Plan. The number of free slots on a host governs preferred resource selection, and the preferred standby host list is ignored. Use either the Stacked or Exclusive slot allocation policy with this feature.
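The preferred host list logic described above can be sketched as follows. This is a simplified model under stated assumptions, not the SSM or VEMKD implementation: the function names, host names, and service names are hypothetical, and returning None is just this sketch's way of signalling that the preferred host list is ignored and ordering is decided by the order clause or by free slot counts instead.

```python
def preferred_host_list(standby_hosts, open_session_service_types):
    """Hosts whose standby services match a service type of an open session.

    standby_hosts: dict mapping host name -> set of standby service names.
    """
    return [host for host, services in standby_hosts.items()
            if services & open_session_service_types]


def standby_assignment_order(standby_hosts, open_session_service_types,
                             resreq_has_order_clause=False,
                             balanced_policy=False):
    """Order in which standby hosts are considered in this toy model."""
    if resreq_has_order_clause or balanced_policy:
        # Preferred host list is ignored; ordering is not modeled here.
        return None
    preferred = preferred_host_list(standby_hosts, open_session_service_types)
    others = [host for host in standby_hosts if host not in preferred]
    return preferred + others


# Hypothetical cluster state.
hosts = {"hostA": {"svc1"}, "hostB": {"svc2"}, "hostC": {"svc2", "svc3"}}
print(standby_assignment_order(hosts, {"svc2"}))
# prints "['hostB', 'hostC', 'hostA']"
```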

System behavior when applications are configured with standby services

This section describes IBM Spectrum Symphony behavior during the lifecycle of a standby service.

The lifecycle begins when an application is registered and enabled.
The Session Director reads the application profile and starts the Session Manager (SSM) for the application.
When the SSM receives workload for the application, the SSM requests resources from EGO.
For each resource received, the SSM sends the service information to the SIM and the SIM starts the service instance.
When the workload is finished, the SSM releases the slot to EGO but keeps the service running.
For applications with multiple services where maxOtherInstances is set to 0 in the Service section of the application profile, the default service is always kept running (even if another service ran before the default service). When a non-default service completes, it is terminated and the default service is automatically started as the standby service.

For applications with multiple services where maxOtherInstances is set to a value greater than 0 in the Service section of the application profile, the service instance manager keeps the default service plus up to the number of services specified for maxOtherInstances (that is, at most maxOtherInstances plus one services in total) as standby services.
EGO deallocates the slot but keeps the standby service running.
If the SSM receives new workload, EGO allocates the resources that have the standby services running.
When the SSM receives the resource allocations from EGO, it associates the resource with the SIM already running on the resource, rather than starting a new SIM. The activities are reassociated with the allocations.
The standby services are shut down when the application is disabled or unregistered.
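The lifecycle steps above can be summarized in a small state model. This is a conceptual sketch of a single resource, not how the SSM, SIM, or EGO are implemented; the class, method, and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Toy model of one resource that can host a standby service."""
    service_running: bool = False
    slot_allocated: bool = False
    service_starts: int = 0

    def on_workload(self):
        # Slot is allocated; the service starts only if none is running.
        self.slot_allocated = True
        if not self.service_running:
            self.service_running = True
            self.service_starts += 1

    def on_workload_finished(self):
        # Slot is released to EGO, but the service keeps running.
        self.slot_allocated = False

    def on_application_disabled(self):
        # Standby services shut down on disable or unregister.
        self.slot_allocated = False
        self.service_running = False

    @property
    def is_standby(self):
        return self.service_running and not self.slot_allocated


r = Resource()
r.on_workload()           # first burst: service is started
r.on_workload_finished()  # service stays running in standby mode
r.on_workload()           # second burst: running service is reused
print(r.service_starts)   # prints "1"
```

In this model the second burst of workload does not trigger a second service start, which is the benefit that standby services are intended to provide.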

System behavior when applications with standby services share resources

When IBM Spectrum Symphony is optimized for standby services, EGO allocates resources to applications by first searching for resources with standby services running. Therefore, it is recommended that the consumer own all the resources in a resource group dedicated to standby services, as it guarantees that the resources with standby services running will be available to the consumer when they are needed. If the consumer has unsatisfied demand and it previously lent out resources with standby services to another consumer, EGO will reclaim them before allocating an idle resource that does not have standby services running on it. In this case, workload will be pending until the resource is reclaimed. However, if no standby service is running on the resources lent out, EGO will allocate an idle resource before reclaiming the resources that were lent out.

Note that if the reclamation period is longer than the service startup time, the optimization for standby services will not be beneficial. In this case, it would be better to disable the standby service optimization so that EGO allocates any idle resource to the application.
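The trade-off in this section can be expressed as a short sketch. It is an illustration under assumptions rather than EGO's actual allocation logic: the function names and return strings are hypothetical, the non-optimized branch simply assumes that any idle resource is taken first, and treating equal reclamation and startup times as not beneficial is a choice made for this example.

```python
def optimization_is_beneficial(reclaim_period_s, service_startup_s):
    """Waiting for a reclaim only pays off if it is shorter than a restart."""
    return reclaim_period_s < service_startup_s


def next_allocation(idle_with_standby, lent_with_standby,
                    idle_without_standby, lent_without_standby,
                    optimized=True):
    """Pick where the next resource comes from, given counts per pool."""
    if optimized:
        order = [
            ("idle resource with standby service", idle_with_standby),
            ("reclaim lent resource with standby service", lent_with_standby),
            ("idle resource without standby service", idle_without_standby),
            ("reclaim lent resource without standby service",
             lent_without_standby),
        ]
    else:
        order = [
            ("any idle resource", idle_with_standby + idle_without_standby),
            ("reclaim lent resource",
             lent_with_standby + lent_without_standby),
        ]
    for action, count in order:
        if count > 0:
            return action
    return "no resource available"


print(next_allocation(0, 1, 3, 0))
# prints "reclaim lent resource with standby service"
```

With the optimization enabled, the sketch reclaims a lent resource that has a standby service even though idle resources exist, so the workload waits for the reclamation period. That wait is why the comparison in optimization_is_beneficial matters.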

To enable reclaim optimization for standby services, select the Optimized for standby service setting using the cluster management console.

Failure recovery

In the event of an IBM Spectrum Symphony component failure, the system recovers standby services in the following ways:

EGO failure: After recovery, all the information related to standby services is restored by EGO. All the allocations and activities will be recovered including the activities without slots allocated.
SIM failure: If the SSM detects a standby SIM failure, the number of slots with standby services in the system decreases by the number of slots affected by the failure. There is no request to restart the standby SIM immediately. When workload is submitted, the SSM requests the necessary resources and then consumes the ones with standby services first. If the resources demanded by the workload consume all the existing standby services within the system, the SSM requests EGO to start new SIMs, which start new services. After the workload completes, the SSM returns the resources and keeps the SIMs and their service instances in standby mode.
SSM failure: If an SSM fails over, all standby services will be terminated. New standby services can be generated once workload comes in and the services are started.
Standby service failure: Because the SIM does not monitor the service instance while it is idle, the SIM is not aware if the standby service goes out of service. When the SIM is assigned to a session that wants to use the standby service, the SIM must start a new service before it can submit tasks to it.

Best practices for configuring standby services

You should follow the best practices outlined here to ensure that resources with standby services running are available when required by the application.

The application's consumer should own all the resources in the resource group.
Enable lending and disable borrowing in the consumer’s resource plan. Borrowing would not guarantee the availability of a resource with the standby service running when EGO allocates the resource to the application.