Data-aware scheduling

The data-aware scheduling (data affinity) framework and plug-in allow IBM® Spectrum Symphony to schedule application tasks intelligently and improve performance by taking data location into account when dispatching tasks. By directing tasks to resources that already contain the required data, application run times can be significantly reduced. This feature can also help meet the latency requirements of real-time applications.

Additionally, IBM Spectrum Symphony offers the IBM Spectrum Scale File Placement Optimizer (IBM Spectrum Scale-FPO) data-aware plug-in, which obtains location information for files located in a GPFS-FPO cluster.

Data-aware scheduling is available with IBM Spectrum Symphony Advanced Edition and does not require separate deployment.

Plug-in support

The data-aware scheduling plug-in is supported on all operating systems that IBM Spectrum Symphony supports.

The IBM Spectrum Scale-FPO data-aware scheduling plug-in is supported on Linux®.

About data-aware scheduling

Workload schedulers focus on dispatching tasks to compute hosts, either transferring data directly to the compute hosts or delegating data retrieval to the service. The time required to transfer data from various sources to where the work is processed can lead to inefficient use of CPU cycles and under-utilization of resources. With the data-aware scheduling feature, you can specify a preferential association between a task and a service instance or host that already possesses the data required to process the workload. This association is based on the evaluation of a user-defined expression containing data attributes that can be collected. The expression is evaluated against the data attributes of each service instance available to the session. Typically, a data attribute is an identifier for a data set that is already available to a service instance before it processes workload.

The following example illustrates the concept of data-aware scheduling at the task level. The data preference expression has been evaluated and it is determined that a task in the queue prefers to run on a service instance that already possesses Dataset1. The SSM collects metadata (service attributes) from all the resources available to the session at that moment. Service B, which has Dataset1, is currently available, and since it is the best match for that task according to the specified preference, the task is dispatched to Service B.

Data-aware scheduling at the task level

Expressing a preference for data

Specifying data attributes in a preference expression

Data attributes take the form of identifiers and can be combined with the +, -, *, and / operators within an expression. Each data attribute resolves to a number, and the operators are the standard arithmetic operators (add, subtract, multiply, and divide). For example, the expression DataSet1 + DataSet2 indicates a preference for sending workload to a service instance that possesses both DataSet1 and DataSet2. The operators can also be used to normalize disparate terms, such as data and memory, or to give weight to specific terms.

For example, in the flow of task-level data-aware scheduling, each task can have a preference expression composed of resource attributes and operators. Each host or service instance can have a different value for each resource attribute, so each task's preference expression can be evaluated against each host. The host or service instance that yields the lowest value for the preference expression is considered the best host or service instance for that task.
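
The following C++ fragment is a minimal conceptual sketch of this kind of evaluation. It is not the IBM Spectrum Symphony implementation; the attribute names, weights, per-instance values, and the stand-in value for a missing attribute are illustrative only.

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Conceptual sketch only: evaluate a weighted preference expression such as
// "DataSet1 + DataSet2" against the attribute values that each service
// instance has published, then rank the instances in ascending order
// (the lowest result is the most preferred).
int main() {
    const double kUnresolved = 1.0E+300;  // stand-in value for a missing attribute

    // Terms of the expression, as (attribute name, weight) pairs.
    std::vector<std::pair<std::string, double>> expression = {
        {"DataSet1", 1.0}, {"DataSet2", 1.0}};

    // Attribute values published by each service instance (illustrative data).
    std::map<std::string, std::map<std::string, double>> published = {
        {"ServiceA", {{"DataSet1", 0.0}}},
        {"ServiceB", {{"DataSet1", 0.0}, {"DataSet2", 0.0}}}};

    std::vector<std::pair<double, std::string>> ranking;
    for (const auto& svc : published) {
        double result = 0.0;
        for (const auto& term : expression) {
            auto it = svc.second.find(term.first);
            result += term.second * (it != svc.second.end() ? it->second : kUnresolved);
        }
        ranking.push_back({result, svc.first});
    }
    std::sort(ranking.begin(), ranking.end());  // ascending: lowest value first
    std::cout << "Most preferred: " << ranking.front().second << std::endl;  // ServiceB
    return 0;
}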

Attribute names have a 32-character limit and can contain only alphanumeric and underscore characters; if you want to use data attributes with names that do not comply with these rules, define aliases. A resource attribute definition can be used to define an alias and to override the default value for an attribute if the session-level default or system default is inappropriate.

How IBM Spectrum Symphony handles the result of the expression

The result of each expression is a numeric value that is obtained by applying the operators to the attributes in the expression. If the preferred data is available to a service, it should programmatically publish a value for the attribute. Alternatively, the value of the attribute may be collected from the data-aware scheduling plug-in, if present. Once the result is obtained for each resource being evaluated, it is used to sort the resources in ascending order. This means the resource that evaluates to the lowest value is the most preferred.

When no information is available for a resource attribute that is involved in a resource evaluation, resolution of the expression still proceeds. In such cases, IBM Spectrum Symphony substitutes a default value for each attribute that it cannot resolve. The value of the attribute is resolved by the system in the following order (a conceptual sketch follows the list):

  1. Attempt to find any published or collected value.
  2. Retrieve the current default for the alias (if defined).
  3. Retrieve the current default for the session (if defined).
  4. Retrieve the system default (1.0E+300).
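
The following sketch illustrates this resolution order. The lookup tables are illustrative stand-ins for wherever the published values and defaults actually live; they are not the SSM's internal data structures.

#include <map>
#include <string>

// Conceptual sketch of the attribute-resolution order described above.
double resolveAttribute(const std::string& name,
                        const std::map<std::string, double>& publishedOrCollected,
                        const std::map<std::string, double>& aliasDefaults,
                        const double* sessionDefault) {
    auto p = publishedOrCollected.find(name);   // 1. published or collected value
    if (p != publishedOrCollected.end()) return p->second;

    auto a = aliasDefaults.find(name);          // 2. default for the alias, if defined
    if (a != aliasDefaults.end()) return a->second;

    if (sessionDefault != nullptr)              // 3. default for the session, if defined
        return *sessionDefault;

    return 1.0E+300;                            // 4. system default
}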

How preference affects dispatch order

Once a preference is specified on a task within a session, the match between that task and every service instance currently assigned to the session is considered when the task is dispatched. This means that when a service instance becomes available, it is given the task that is the best match for that instance at that moment. Note that the next task to be dispatched may not be at the front of the queue within the session; that is, the order of task dispatch depends on the currently available service instances and the preferences associated with the tasks, not on the order in which tasks are submitted to the session.
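
As a rough illustration (a conceptual sketch, not the SSM's actual logic), when a service instance becomes available, the scheduler effectively scans the session's pending tasks and dispatches the one whose preference expression evaluates to the lowest value for that instance. The task and attribute representations below are assumptions made for the example.

#include <functional>
#include <limits>
#include <map>
#include <string>
#include <vector>

struct PendingTask {
    std::string id;
    // Evaluates this task's preference expression against one service
    // instance's attribute values; lower is better.
    std::function<double(const std::map<std::string, double>&)> preference;
};

// Conceptual sketch: pick the best-matching pending task for a service
// instance that has just become available, regardless of submission order.
int pickTaskForInstance(const std::vector<PendingTask>& pending,
                        const std::map<std::string, double>& instanceAttributes) {
    int best = -1;
    double bestValue = std::numeric_limits<double>::max();
    for (int i = 0; i < static_cast<int>(pending.size()); ++i) {
        double v = pending[i].preference(instanceAttributes);
        if (v < bestValue) { bestValue = v; best = i; }
    }
    return best;  // index of the task to dispatch, or -1 if the queue is empty
}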

Note that the data-aware scheduling feature does not affect the behavior of tasks with the PriorityTask setting enabled. Those tasks are still dispatched before other tasks.

Factoring the cost of data transfer

When you specify a preference for specific data, the expectation is that the attributes in the expression have their default values set to a number that represents the cost of moving that data to the resource.

For example, consider the case where a task within a session requires data sets Dataset1, Dataset2, and Dataset3. The preference for the task could be represented as Dataset1 + Dataset2 + Dataset3. Since expressions are evaluated and sorted in ascending order, you would expect the most preferred resource for this task to be the one with the lowest cost of obtaining the data that the task requires to execute.

Since a service instance is typically able to inform the middleware about which resource attributes (that is, data sets) it has access to, a fixed value can be substituted for any missing attribute when evaluating the task preference against that service instance. If you use the same fixed value for every missing attribute, the system assigns each missing data set the same cost, so service instances that have access to the most data sets are the most preferred, and so on. In reality, however, it is not generally true that all missing data sets have the same cost to acquire, especially when the data sets vary greatly in size or are located on hosts with disparate network topologies and network access speeds. To gain the most benefit from the data-aware scheduling feature, choose the default value for each data attribute carefully so that it is representative, in some way, of the cost of retrieving the missing data for the service instance. This default value is set through the resource attribute definition API.
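
For illustration, with hypothetical default values: suppose the defaults for Dataset1, Dataset2, and Dataset3 are set to 10, 50, and 500 respectively, roughly reflecting their transfer costs, and a published attribute has the value 0 (the data is already local). For the preference Dataset1 + Dataset2 + Dataset3, a service instance that holds Dataset1 and Dataset2 but is missing Dataset3 evaluates to 0 + 0 + 500 = 500, while an instance that holds only Dataset3 evaluates to 10 + 50 + 0 = 60. The second instance is preferred, because retrieving the two cheaper data sets costs less than retrieving the expensive one.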

Data-aware scheduling plug-in

This direct plug-in provides an additional way for IBM Spectrum Symphony to obtain the location and cost data associated with preferred resources. The plug-in enhances data-aware scheduling by allowing IBM Spectrum Symphony to get this data not only from the service-side publishing and subscription mechanism, but also from a direct call to a centralized process that can be customized for integration with third-party products.

The plug-in process requests the information from an external application and returns it to the SSM, which uses it as the preferred scheduling location and value for the attribute in the same way as if it were published by a service. Sample code for the plug-in process is provided in an appendix at the end of this document.

To improve performance, this feature also caches the results received from the plug-in, so subsequent calls for the same attribute return cached values.
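
Conceptually, this cache can be thought of as a per-attribute lookup table that is consulted before the plug-in is called again. The following C++ sketch only illustrates that idea; it is not the SSM's internal implementation, and the types are assumptions made for the example.

#include <map>
#include <string>
#include <utility>

// Conceptual sketch of result caching for plug-in lookups: per attribute,
// remember the host-to-value map returned by the plug-in and reuse it on
// subsequent requests instead of calling the plug-in again.
using HostValues = std::map<std::string, double>;  // host name -> attribute value

class AttributeCache {
public:
    const HostValues* lookup(const std::string& attribute) const {
        auto it = cache_.find(attribute);
        return it == cache_.end() ? nullptr : &it->second;
    }
    void store(const std::string& attribute, HostValues values) {
        cache_[attribute] = std::move(values);
    }
private:
    std::map<std::string, HostValues> cache_;
};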

Here is the sequence the SSM follows to obtain the resource attribute value when data-aware scheduling is enabled:
Data-aware scheduling flow

If the plug-in is configured and the preferred attributes do not have any published values during the current evaluation cycle, the SSM requests locations and values from the plug-in. The following describes the functional flow when the plug-in is used to retrieve resource attribute data for a task.

  1. A task is submitted with one or more preference attributes.
  2. Upon the task's submission, the SSM parses the preferences and puts the task in the pending queue.
  3. The preferences are evaluated and the SSM tries to find values for the corresponding attributes. Since, in this case, the attribute was not published by the service, the expression is not resolved.
  4. The SSM checks if the direct plug-in is configured for the application. The direct plug-in process is started using a startup command defined in the application profile.
  5. The SSM repeatedly calls the plug-in interface with the attribute names to retrieve the corresponding locations and values.
  6. Tasks are dispatched to currently available resources for the session that are the best match according to the preferred attributes.

IBM Spectrum Scale-FPO data-aware plug-in

The IBM Spectrum Scale-FPO data-aware plug-in obtains location information for files located in an IBM Spectrum Scale-FPO cluster. This location information is used to dispatch tasks to achieve data affinity. To use this plug-in, configure SSMResPubPluginCmd within the SOAM > SSM section of the application profile. GPFS_FPO_MMLSNSD and GPFS_FPO_MMLSDISK are mandatory values. All others (-d DIRECTORIES, -p POLLING_INTERVAL, -c CACHE_ENABLED, and -e CACHE_ENTRIES) are optional.

When using the IBM Spectrum Scale-FPO data-aware plug-in, also configure host mapping if multi-homed hosts are configured. To ensure that the IBM Spectrum Scale-FPO host name and the IBM Spectrum Symphony host name are the same, configure the EGO hosts file ($EGO_CONFDIR/hosts on Linux or %EGO_CONFDIR%\hosts on Windows) with the correct mapping information.
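
For example, a mapping entry in the EGO hosts file might look like the following. The IP address and host names are placeholders, and this assumes the file follows the usual /etc/hosts-style layout of IP address, official host name, and then any aliases:

# $EGO_CONFDIR/hosts  (illustrative entry)
192.0.2.10   hostA.example.com   hostA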

Configuring data-aware scheduling

Enabling data-aware scheduling for tasks

Data-aware scheduling is enabled at the application level with the schedulingAffinity attribute in the Consumer element of the application profile. When the attribute is set to DataAware, the SSM collects data attributes of service instances and hosts and evaluates them against a user-defined preference expression. Note that setting the attribute to DataAware automatically enables resource-aware scheduling. When the attribute is set to None (default), no metadata is collected by the SSM and no preference is applied.

Example:
<Consumer applicationName="SharingDataCPP" ...schedulingAffinity="DataAware" />

The schedulingAffinity attribute can be configured through the cluster management console or by manually editing the application profile.

Configuring the data-aware plug-ins

The data-aware scheduling plug-in is configured using the following attributes in the application profile (SOAM > SSM):
  • SSMResPubPluginCmd: Specifies a command line for starting the plug-in process, one per IBM Spectrum Symphony application, in the application profile.
  • SSMResPubPluginCmdTimeout: Timeout in seconds to detect hanging if the SSM does not receive an expected response from the plug-in. The default is 30 seconds.
  • SSMResPubCacheEnabled: Sets data caching on or off. The default is on.
Here is an example of the attributes configuration in the application profile:
<SSM resReq="" workDir="${EGO_SHARED_TOP}/soam/work"
 startUpTimeout="60" shutDownTimeout="300" SSMResPubPluginCmd="C:\extplugin.exe"
 SSMResPubPluginCmdTimeout="60" SSMResPubCacheEnabled="false"> 
</SSM>
Additionally, to configure the IBM Spectrum Scale-FPO data-aware plug-in, in the application profile:
  1. Configure SSMResPubPluginCmd.

    GPFS_FPO_MMLSNSD and GPFS_FPO_MMLSDISK are mandatory values. All others (-d DIRECTORIES, -p POLLING_INTERVAL, -c CACHE_ENABLED, and -e CACHE_ENTRIES) are optional.

  2. Set enableOptimizedHostQueue to true.
  3. If multi-homed hosts are configured on the hosts, configure the EGO hosts file ($EGO_CONFDIR/hosts (Linux) or %EGO_CONFDIR%\hosts (Windows)) with the correct mapping information. This ensures that the IBM Spectrum Scale-FPO host name and IBM Spectrum Symphony host name are the same.
For example, here is the configuration in the application profile for SSMResPubPluginCmd:
<SSM SSMResPubPluginCmd="<plugin_path 
GPFS_FPO_MMLSNSD GPFS_FPO_MMLSDISK 
[-d DIRECTORIES] [-p POLLING_INTERVAL] [-c CACHE_ENABLED] [-e CACHE_ENTRIES]>" 
enableOptimizedHostQueue="true"/>
Here is the same configuration, with example values:
<SSM SSMResPubPluginCmd="</…/gpfs_fpo_das_plugin 
/usr/lpp/mmfs/bin/mmlsnsd /usr/lpp/mmfs/bin/mmlsdisk 
-d /gpfs/f1 -p 120 -c true -e 10000>" 
enableOptimizedHostQueue="true"/>
Note: When running third-party applications with MapReduce and IBM Spectrum Scale, ensure that the libgpfshadoop.so file is in the LD_LIBRARY_PATH on your system. For example, run:
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH

Developing clients and services for data-aware applications

Data preferences for sessions and tasks can only be specified through the application's client and service APIs.
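
The exact calls are documented in the IBM Spectrum Symphony client and service development references. The following C++ fragment is only a hypothetical illustration of the division of responsibility; every type and method name is a placeholder, not the real SDK interface. The client attaches a preference expression to the workload it submits, and the service publishes the attributes describing the data it already holds.

// Hypothetical illustration only: every type and method below is a placeholder
// standing in for the real IBM Spectrum Symphony client/service SDK interfaces.
#include <iostream>
#include <map>
#include <string>

struct ExampleTaskAttributes {                 // placeholder for task submission attributes
    std::string preferenceExpression;
    void setPreferenceExpression(const std::string& expr) { preferenceExpression = expr; }
};

struct ExampleSession {                        // placeholder for a client-side session handle
    void sendTaskInput(const std::string& input, const ExampleTaskAttributes& attrs) {
        std::cout << "submit task (input=" << input
                  << ", preference=" << attrs.preferenceExpression << ")\n";
    }
};

struct ExampleServiceContext {                 // placeholder for a service-side context
    std::map<std::string, double> published;
    void publishAttribute(const std::string& name, double value) { published[name] = value; }
};

int main() {
    // Client side: attach a preference expression to the workload it submits.
    ExampleSession session;
    ExampleTaskAttributes attrs;
    attrs.setPreferenceExpression("Dataset1 + Dataset2");  // lower result = more preferred
    session.sendTaskInput("input-data", attrs);

    // Service side: publish the data attributes this instance already holds,
    // so the SSM can prefer it for tasks that name those data sets.
    ExampleServiceContext context;
    context.publishAttribute("Dataset1", 0.0);  // 0.0 = data already local
    return 0;
}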