Configuration options for a dynamic requestor that uses the cws requestor plug-in

The policies for dynamic cloud resource provisioning and return are specified in the requestorname_config.json file for a dynamic requestor whose name is requestorname. The cws requestor plug-in loads this file and implements the policies that it specifies, calculating and generating requests to provision or return cloud resources.

Creating a configuration file

There are two options for creating a configuration file:
Use a sample configuration file
The first option is to copy the sample cws_config.json configuration file from $EGO_CONFDIR/../../hostfactory/conf/requestors/cws/ to the required directory, which is the directory configured in the confPath parameter of the respective requestor record in the hostRequestors.json file. Rename the configuration file to requestorname_config.json, where requestorname is the name of the dynamic requestor as specified in the name parameter of the respective requestor record configured in the hostRequestors.json file. Then, edit the parameters in the file as required and save the file.
Automatically generate a default configuration file
The second option is to enable the cws requestor plug-in to automatically generate a default configuration file. If the requestorname_config.json file does not exist in the desired directory, the cws requestor plug-in automatically generates the file with default values. The file is generated in the directory specified in the confPath parameter of the respective requestor record configured in the hostRequestors.json file. You can then modify the contents of the requestorname_config.json file and save the file. Your changes take effect dynamically.

Multiple dynamic requestors

You can configure multiple different dynamic requestors that use the cws requestor plug-in. These dynamic requestors use the same scripts and binaries of the cws requestor plug-in, but have different configurations specified in their respective records in the hostRequestors.json file and in their respective requestorname_config.json files. For example, different dynamic requestors can correspond to different lines of business, each running its dedicated applications, using its dedicated cloud provider accounts, and specifying dedicated policies for scale-out and scale-in requests.
Configuring a dynamic requestor consists of two steps:
  1. Add or update a record for the dynamic requestor in the hostRequestors.json configuration file. The record includes a unique name for the dynamic requestor, directories for configuration, work and logs, and a list of cloud providers to be used, where each cloud provider specifies a cloud account and associated settings. For more details, see Registering a dynamic requestor which uses the cwsinst provider instance.
  2. Create or reuse a requestorname_config.json configuration file, to configure the policies of the dynamic requestor. For details on how to create the file see Creating a configuration file. Edit the file with required configuration parameters and save the file. For details on the configuration parameters in the requestorname_config.json configuration file see Configuration parameters.

Configuration parameters

The requestorname_config.json file contains the following types of parameters:

Admin parameters
These parameters specify connectivity information to the IBM® Spectrum Conductor cluster and logging information.
General calculation parameters
These parameters specify general properties required for the scale-out and scale-in calculations.

Use the SlotsNumberCoresPerSlot and SlotsRamMBPerSlot parameters to specify the number of cores and the amount of RAM for a resource slot.

Use the SlotsPerHostCalcMode parameter to specify how the number of resource slots for a given host is calculated.
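For example, the general calculation parameters might be configured as in the following fragment. The parameter names are taken from this document; the values shown are illustrative assumptions, not recommended settings, and the meaning of each SlotsPerHostCalcMode mode value is documented in the requestorname_config.json reference:

```json
{
    "SlotsNumberCoresPerSlot": 1,
    "SlotsRamMBPerSlot": 4096,
    "SlotsPerHostCalcMode": 1
}
```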

Scale-out parameters
These parameters specify information to facilitate resource demand calculations that are performed by the cws requestor plug-in.

Use the DemandPolicyType parameter to specify the policy to use for resource demand calculations. You can select between a policy that calculates resource demand based on workload requirements, including required time for completion of workloads (parameter value is 1 for this policy), and a policy that calculates resource demand based on cluster utilization (parameter value is 2 for this policy). The default policy value is 1.

You can further specify parameters that apply to all policies. These parameters include specification of a maximum number of resource slots that can be allocated in a current time unit:
DemandMaxSlotsReqTimeUnitType
Use the DemandMaxSlotsReqTimeUnitType parameter to specify a time unit type (absolute, hour, day, week, or month).
DemandMaxSlotsReqPerTimeUnit
Use the DemandMaxSlotsReqPerTimeUnit parameter to specify the maximum number of resource slots that can be requested within the current time unit.
These two parameters enable you to regulate the resource slots provisioned within a time unit. For example, setting DemandMaxSlotsReqTimeUnitType=2 (hour) and DemandMaxSlotsReqPerTimeUnit=100 specifies that the total number of allocated, pending, and requested slots within each current hour cannot exceed 100 resource slots. As another example, setting DemandMaxSlotsReqTimeUnitType=1 (absolute) and DemandMaxSlotsReqPerTimeUnit=100 specifies that the total number of allocated, pending, and requested slots at any point in time cannot exceed 100 resource slots. Setting the DemandMaxSlotsReqTimeUnitType value to 0 (the default) disables this mechanism.
Remember: While these parameters set a limit on the number of requested resource slots, the number of resource slots actually allocated might slightly exceed that limit. Cloud hosts are allocated to satisfy the number of requested resource slots, and the last allocated cloud host might include more resource slots than the remaining requested amount. The excess is therefore limited to the number of slots in the last host allocated in the time unit.
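The hourly limit described above corresponds to the following scale-out fragment. The parameter names appear in this document; the values are illustrative:

```json
{
    "DemandPolicyType": 1,
    "DemandMaxSlotsReqTimeUnitType": 2,
    "DemandMaxSlotsReqPerTimeUnit": 100
}
```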
Cluster utilization based policy
With this policy, a request for additional cloud resources is generated when the aggregated cluster utilization is not lower than a configured threshold. Use the UtilizationThresholdPercent parameter to specify the utilization threshold percent against which the aggregated cluster utilization percent is compared. When you use the cluster utilization based policy, a value for the UtilizationThresholdPercent parameter is required, and it must be an integer between 1 and 100. If the value of this parameter is outside this range, the value is automatically set to a default value. When you use another policy, this parameter has no effect and its default value is zero.

You can specify a minimum duration of time, in seconds, for which the aggregated cluster utilization percent must remain not lower than the value of the UtilizationThresholdPercent parameter before the condition for generating a resource demand request is satisfied. Use the UtilizationCondMinDurationSec parameter to specify this minimum duration. The duration for this condition is measured from the initial time when the aggregated cluster utilization percent was not lower than the threshold until the current time. A new initial time is set when the aggregated cluster utilization percent reaches the threshold after having been below it in the previous observation. If the value of the UtilizationCondMinDurationSec parameter is zero (the default), the cluster utilization duration condition is disabled and is not required for the resource demand calculation.

You can specify a minimum duration of time, in seconds, that must elapse after the last resource demand request was issued before a new resource demand request can be issued. Use the UtilizationLastReqMinDurationSec parameter to specify this minimum duration. The duration for this condition is measured from the time when the last resource demand request was issued until the current time. If the value of the UtilizationLastReqMinDurationSec parameter is zero (the default), this condition is disabled and is not required for the resource demand calculation.

In this policy, if the cluster utilization condition for generating a resource demand request is satisfied according to the above parameters, it is also required that there is workload in the cluster that currently requires additional resources. Therefore, the cws requestor plug-in also scans the workloads in the cluster to determine their resource requirement. If both the cluster utilization condition is satisfied and there is workload in the cluster that currently requires additional resources, then the cws requestor plug-in proceeds to generate a resource demand request.

You can specify a list of resource groups that is used to select hosts whose utilization metrics are included in the resource demand calculations of the utilization-based policy, and to select instance groups whose workload metrics are included in those calculations. Use the UtilizationSelectionByResourceGroupNames parameter to specify this list. You can further specify whether your relevant hosts and instance groups are running Apache Spark applications or other types of applications. Set the value of the UtilizationAvoidWorkloadCheck parameter to false when your applications are Apache Spark applications. Set the value of the UtilizationAvoidWorkloadCheck parameter to true when your applications are any other type of applications.
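Putting these parameters together, a utilization-based policy configuration might look like the following fragment. The parameter names are described above; the threshold, durations, and the resource group name "ComputeHosts" are illustrative assumptions:

```json
{
    "DemandPolicyType": 2,
    "UtilizationThresholdPercent": 80,
    "UtilizationCondMinDurationSec": 300,
    "UtilizationLastReqMinDurationSec": 600,
    "UtilizationSelectionByResourceGroupNames": ["ComputeHosts"],
    "UtilizationAvoidWorkloadCheck": false
}
```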

For calculating the number of resource slots to request, the cws requestor plug-in calculates the difference between the maximum number of resource slots that can be requested within the current time unit (specified using the DemandMaxSlotsReqPerTimeUnit and DemandMaxSlotsReqTimeUnitType parameters), and the total amount of resource slots already allocated, pending and requested within the current time unit. This difference, if larger than zero, is the number of resource slots set in the resource demand request.

Workloads' requirements based policy
With this policy, a request for additional cloud resources is generated when there are workloads in the cluster that cannot meet their configured completion time requirement with the resources that are currently available in the cluster.
You can configure the method for selecting workloads for the resource demand calculation based on their processing state, using the WorkloadsSelectionByStateMode parameter in the requestorname_config.json file, which accepts the following values:
  • 1: Workloads that are waiting for processing and workloads that are processing (running) are considered for the resource demand calculation.
  • 2: Workloads that are waiting for processing are considered for the resource demand calculation.
  • 3: Workloads that are waiting for processing more than a specified amount of time are considered for the resource demand calculation.

You can further configure which instance groups are considered in the resource demand calculation by using the WorkloadsSelectionByInstanceGroupNames parameter. This parameter specifies a list of names of instance groups, such that only workloads of the specified instance groups are included in the resource demand calculation. If the value of this parameter is empty (the default), this selection criterion is disabled and no filter is applied to the selected instance groups.

You can also configure which instance groups are considered in the resource demand calculation by using the WorkloadsSelectionByResourceGroupNames parameter. This parameter specifies a list of names of resource groups, which is used to select instance groups whose workloads are included in the resource demand calculations. If a list of names is configured for this parameter, only instance groups that specify a resource group from the configured list for any of their drivers, executors, or shuffle services are included in the resource demand calculations. If the value of this parameter is empty (the default), this selection criterion is disabled and no filter is applied to the selected instance groups.
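The workload selection parameters described above might be combined as in the following fragment. The parameter names are from this document; the mode value and the instance group name "ig-analytics" are illustrative assumptions:

```json
{
    "DemandPolicyType": 1,
    "WorkloadsSelectionByStateMode": 2,
    "WorkloadsSelectionByInstanceGroupNames": ["ig-analytics"],
    "WorkloadsSelectionByResourceGroupNames": []
}
```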

To calculate resource requirements of workloads within the scale-out calculations, the cws requestor plug-in profiles workload processing behaviors on the granularity of workload classes. A workload class is a group of defined workloads that have similar characteristics and processing behaviors. For a submitted Spark application, the workload class is indicated in the workload's name with the #cls# suffix followed by the workload class name. If no workload class is specified, then the plug-in uses the full workload name as the workload class name. The workload class name can be specified by using the following methods, in order of precedence:
  1. Specify the class name in the Spark application source code, as the SparkSession name. If the workload class name is specified by using this method, it will override the --name option that you specify in the Spark submit command.
  2. Specify the class name by using --name in the Spark submit command.

    For example, ./spark-submit --name SparkPi#cls#Project1.

Attention: Consider the following notes and best practices when you submit workload to instance groups that are enabled for cloud bursting with host factory:
  • Notebook workloads are not included in the resource demand calculations of the workloads' requirements based policy. To also include notebook workloads, use the cluster utilization based policy.
  • Long-running Spark workload that is not expected to terminate, such as streaming applications, can reduce the accuracy of workload profiling. Either submit these long-running Spark workloads to instance groups that are not enabled for cloud bursting or use the cluster utilization based policy.

To configure requirements for specific workload classes or workloads for the resource demand calculation, use the WorkloadsInfoPerWorkload parameter. The value of this parameter is a list of records, where each record specifies the following information for an individual workload class or workload:
  • WorkloadName: the name of the workload class or workload.
  • WaitDurationLimitSec: the minimum duration of time, in seconds, that a workload must be in a waiting state before the workload is considered for the resource demand calculation.
  • DurationForSLASec: the maximum duration of time, in seconds, that is required for the workload to complete its processing.
  • SlotsRequiredForSLA: the number of resource slots that are required for the workload to meet its Service Level Agreement (SLA), used when profiling information is not available for the workload that is being considered.

The following is an example of the WorkloadsInfoPerWorkload parameter configuration:
{
    "WorkloadsInfoPerWorkload":[
      {
        "WorkloadName": "nameA",
        "WaitDurationLimitSec": 0,
        "DurationForSLASec": 3600,
        "SlotsRequiredForSLA": 1
      },
      {
        "WorkloadName": "nameB",
        "WaitDurationLimitSec": 0,
        "DurationForSLASec": 1200,
        "SlotsRequiredForSLA": 1
      }]
}

To specify whether workload-specific information is configured in the WorkloadsInfoPerWorkload parameter and, if configured, how this information is used, use the WorkloadsInfoPerWorkloadMode parameter:
  • 0: Workload-specific information is not configured.
  • 1: Workload-specific information is configured, but no filtering is applied using this information; all workloads of the instance groups selected for the resource demand calculations are considered.
  • 2: Workload-specific information is configured, and only the workloads specified in the workload-specific information are considered in the resource demand calculation.

To specify default workload information for workloads that have no associated record in the WorkloadsInfoPerWorkload parameter, for workloads that do not have a name, or for workloads that do not expose their name because of their processing state, use the WorkloadsInfoGlobal parameter.
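Assuming the WorkloadsInfoGlobal parameter takes the same fields as a WorkloadsInfoPerWorkload record without the workload name (the SlotsRequiredForSLA field is confirmed later in this document; the other fields and the object structure are assumptions based on the descriptions above), the defaults might be configured as:

```json
{
    "WorkloadsInfoGlobal": {
        "WaitDurationLimitSec": 0,
        "DurationForSLASec": 3600,
        "SlotsRequiredForSLA": 1
    }
}
```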

Scale-in parameters
These parameters specify information to facilitate the resource return calculations that are performed by the cws requestor plug-in.

The BillingCycleInfoPerProvider parameter specifies information about the billing cycle for specified cloud providers. The parameter specifies the name of the cloud provider, the duration of the billing cycle, and the start and end time of the return window relative to the billing cycle duration. For cloud providers that have no information specified in the BillingCycleInfoPerProvider parameter, default billing cycle information from the BillingCycleInfoGlobal parameter is used. You can use the BillingCycleAvoidUsage parameter to specify whether billing cycle information should be considered in the calculation of resources to be returned to the cloud providers.

You can specify a time duration, in seconds, after which a cloud host that is being removed gracefully from the cluster is returned to the cloud provider without waiting further for the removal process to complete, by using the ForceReturnAfterDurationSec parameter. This maximum time duration is measured from the time when the process of gracefully removing the cloud host from the cluster is initiated. A value of 0 (the default) disables this time limit mechanism. A value of 1 indicates immediate return after the first attempt to remove the cloud host from the cluster. A value higher than 1 specifies the time limit.

You can use the HostReturnIdleOnly parameter to specify that only idle hosts should be returned to the cloud providers, by setting this parameter value to true. A cloud host is defined as idle if the host currently does not run any applications. Alternatively, you can use the HostReturnUtilizationLimitPercent parameter to configure a maximum percentage utilization of a cloud host that is used to determine whether a cloud host is considered for return. If the current utilization of the cloud host does not exceed this limit, then the cloud host is considered for return to its cloud provider. If the current utilization of the cloud host exceeds this limit, then the cloud host is not considered for return. This parameter takes effect only when the HostReturnIdleOnly parameter is set to false.

Generally, cloud hosts are considered for return only when the resource demand from the currently running applications is lower than the number of free resource slots in the cloud hosts.
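A scale-in configuration combining these parameters might look like the following fragment. The parameter names are from this document; the values, including the boolean types, are illustrative assumptions:

```json
{
    "BillingCycleAvoidUsage": false,
    "ForceReturnAfterDurationSec": 0,
    "HostReturnIdleOnly": false,
    "HostReturnUtilizationLimitPercent": 10
}
```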

Cloud hosts that are closed by the cws requestor plug-in include the following comment:
Closed by IBM Spectrum Conductor Host Factory requestor plug-in instance_name at yyyy-mm-ddThh:mm:ssTZ
where:
  • yyyy is the year
  • mm is the month of the year
  • dd is the day of the month
  • hh is the hour of the day
  • mm is the minutes of the hour
  • ss is the seconds of the minute
  • TZ is the time zone

You can use the egosh resource view command to view this comment for each closed cloud host.

For more information, see requestorname_config.json reference.

Requestor plug-in metadata file

A dynamic requestor that uses the cws requestor plug-in maintains a metadata profiling record for each workload class, where the record aggregates statistics based on samples that are collected for workloads within that workload class. This metadata information is used in the bursting calculations. Specifically, this metadata information is used to calculate the number of resource slots that is required for a specific workload to complete processing while meeting its required time for completion.

You can use the WorkloadsProfilingMinSamples parameter to configure the minimum number of workload profiling samples that is required to consider a workload profiling record as valid for usage in the bursting calculations. For a specific workload, if no profiling samples exist or the number of existing profiling samples for the workload is lower than the configured minimum number of profiling samples, the configured number of resource slots that are required for this workload to meet its completion time is used. This number of resource slots is configured using the SlotsRequiredForSLA parameter in a record that is dedicated to that workload or its corresponding workload class in the WorkloadsInfoPerWorkload parameter in the configuration file. If no such record exists for the workload in the configuration file, then the value in the SlotsRequiredForSLA parameter in the WorkloadsInfoGlobal parameter is used.

The values in the WorkloadsInfoGlobal parameter serve as default values for all workloads for which specific configuration information does not exist. Workloads in the submitted state always use the values in the WorkloadsInfoGlobal parameter as workloads in this state do not expose a name. Similarly, workloads without a name always use the values in the WorkloadsInfoGlobal parameter.

The directory for storing the metadata file for a dynamic requestor is configured by using the workPath parameter in the hostRequestors.json file. You can use different directories for each dynamic requestor or the same directory for multiple dynamic requestors. If a value is not specified for the workPath parameter in the hostRequestors.json file, then the default directory of ${HF_WORKDIR}/requestors/cws/ is used.

Each metadata file follows the naming convention requestorname_md.json, where requestorname is the name of the requestor specified in the name parameter in the hostRequestors.json file.

You can use the WorkloadsProfilingMaxRecords parameter to set the maximum number of workload profile records that the cws requestor plug-in stores and maintains in its corresponding metadata file. You can further use the WorkloadsProfilingRemoveNumberAtMaxRecords parameter to set the number of workload profile records to remove when the total number of workload profile records reaches the value of the WorkloadsProfilingMaxRecords parameter. The records that are removed are those with the oldest access time.
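The profiling parameters described in this section might be configured as in the following fragment. The parameter names are from this document; the values are illustrative assumptions, not recommended settings:

```json
{
    "WorkloadsProfilingMinSamples": 5,
    "WorkloadsProfilingMaxRecords": 1000,
    "WorkloadsProfilingRemoveNumberAtMaxRecords": 100
}
```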

Requestor plug-in log file

The log directory for each dynamic requestor that uses the cws requestor plug-in is configured by using the logPath parameter in the hostRequestors.json file. You can use different directories for each dynamic requestor or the same directory for multiple dynamic requestors. If a value is not specified for the logPath parameter in the hostRequestors.json file, then the default directory of ${HF_LOGDIR}/requestors/cws/ is used.

Each log file follows the naming convention requestorname_log_number, where requestorname is the name of the requestor specified in the name parameter of the hostRequestors.json file, and number facilitates log file rotation. A value of 1 indicates the most recent log file.

The logging level for the dynamic requestor is configured by using the LoggingLevel parameter in the requestorname_config.json file. Valid values for this parameter are identical to the EGO log levels (refer to Table 2).