Monitoring service for plug-ins

Plug-ins provide monitoring operations that collect and display deployment metrics for resource use and performance at the virtual machine, middleware, and application level.

If you are developing your own plug-ins for Cloud Pak System Software for x86, you can configure and register collectors for plug-in specific metrics at runtime and apply metadata to define the presentation of the monitoring metrics in the Instance Console deployment panel.

Collector

Cloud Pak System Software for x86 monitoring provides both specific, built-in collectors and generic, typed collectors. These collectors are based on an open, loosely coupled, collector-oriented framework. All collectors implement the interface com.ibm.maestro.monitor.ICollectorService, which includes the following methods:
// Creates the collector based on the given configuration values.
// @param config
// @return uniqueId for this collector instance, or null if the collector could not be created
String create(JSONObject config);

// Returns the metadata for the available metrics from the collector.
// @param uniqueId of the collector instance to query
// @return {"categories":     [{"categoryType":"<ROLE_CATEGORY>"}],
//          "updateInterval":  "<secs>",
//          "<ROLE_CATEGORY>": {<see IMetricService.getServerMetadata()>}}
JSONObject getMetadata(String uniqueId);

// Returns the current metric values from the collector.
// @param uniqueId of the collector instance to query
// @param metricType not used in this release and defaults to "all"
// @return {"<ROLE_CATEGORY>":[{"<METRIC_NAME>":"<METRIC_VALUE>"}, ...], ...}
JSONObject getMetrics(String uniqueId, String metricType);

// Shuts down the collector.
// @param uniqueId of the collector instance to shut down
void delete(String uniqueId);
Cloud Pak System Software for x86 monitoring has the following types of collectors:
Table 1. Monitoring collector types.
Name Type Usage
com.ibm.maestro.monitor.collector.script Script Collector for plug-ins that can supply metrics with shell scripts.
com.ibm.maestro.monitor.collector.http HTTP Collector for plug-ins that can supply metrics by HTTP request.

Monitoring also implements several collectors for itself that collect operating system metrics from the Monitoring Agent for IBM® Cloud Pak System Software and the hypervisor relevant to processor, memory, disk, and networking in virtual machines. These collectors are provided in all deployments and can be used by other components or plug-ins without needing to register a separate collector. For information about metrics that are collected by the agent, see Metrics collected by Monitoring Agent for IBM Cloud Pak System Software.

Registration

To use Cloud Pak System Software for x86 monitoring collectors, you must register the collectors with the plug-in configuration, providing the node, role, metrics, and collector facilities information.

Cloud Pak System Software for x86 provides a Python interface to register the collectors. The definition of the interface is as follows:
maestro.monitorAgent.register('{
  "version"  : Number,
  "node"     : String,
  "role"     : String,
  "collector": String,
  "config"   : JSONObject
}')
The single parameter that is passed to maestro.monitorAgent.register is a string containing a JSON object, which defines the following attributes:
version
The version number.
node
The name of the server that is running the collector.
role
The name of the role for which the collector works.
collector
The collector type.
config
The required and optional properties of instantiating the specified collector type for the specific node and role. Each type of collector has its own configuration properties.

Use maestro.monitorAgent.unregister(String) to unregister a collector. The parameter is a registration string.

To check whether a collector is registered, use maestro.monitorAgent.isRegistered(String). The parameter is the registration string for the collector that you want to check. The interface returns True to indicate that the collector with the specified registration exists or False to indicate that the collector with the specified registration does not exist.
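For example, a plug-in lifecycle script might combine these calls as follows. This is a minimal sketch; the collector type and configuration values are placeholders rather than a working configuration:
# Registration string; the config values below are placeholders.
registration = '''{
    "version": 1,
    "node": "${maestro.node}",
    "role": "${maestro.role}",
    "collector": "com.ibm.maestro.monitor.collector.script",
    "config": {
        "metafile": "/path/to/metadata.json",
        "executable": "/path/to/collect_metrics.sh"
    }
}'''

# Register only if an identical registration does not already exist.
if not maestro.monitorAgent.isRegistered(registration):
    maestro.monitorAgent.register(registration)

# Later, for example in a shutdown lifecycle script, remove the collector again.
maestro.monitorAgent.unregister(registration)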

Table 2. Collector configuration properties.
Collector config.properties
com.ibm.maestro.monitor.collector.script
{
 "metafile"  :"<meta-data file>",
 "executable":"<executable script>",
 "arguments" :"<script arguments>",
 "validRC"   :"<valid return>",
 "workdir"   :"<work dir >",
 "timeout"   :"<time out duration>"
} 
com.ibm.maestro.monitor.collector.http
{
 "metafile"   :"<meta-data file>",
 "url"        :"<URL>",
 "query"      :"<query arguments>",
 "timeout"    :"<time out duration>",
 "retry_delay":"<delay time to next retry>"
 "retry_times":"<retry_times>"
 "datahandler":"<utility jar properties >"
}
The following code example illustrates the registering script that is used in the script collector:
maestro.monitorAgent.register('{
	"node":"${maestro.node}",
	"role":"${maestro.role}",
	"collector":"com.ibm.maestro.monitor.collector.script",
	"config":{<config properties>}
}')

The registering scripts are typically put into appropriate scripts or directories of the plug-in lifecycle to ensure that the plug-in is ready to collect metrics. For example, for the WebSphere® Application Server collector, the registering script is placed under the installApp_post_handlers directory where all scripts are started after WebSphere Application Server is running.

Registration with each type of collector must provide a corresponding configuration. The values for config.properties are as follows:
Table 3. Configuration for the script collector
Property Required? Value
metafile Yes The full path string of the metadata file that contains the JSON object.
executable Yes The full path string of a shell script that provides plug-in metrics output.
arguments No Arguments for the script. The value can be a single string with arguments that are separated by spaces, or it can be an array of strings. Provide the arguments as an array of strings if an individual argument contains a space.
validRC No A code string for a valid return code from the script. The default value is 0. The value can be an integer or a string that converts into an integer.
workdir No The full path of the working directory for the script. The default value is java.io.tmpdir.
timeout No The amount of time to wait for the script to run, in seconds. The default value is 5. The value can be a number or a string that converts into a number.
Tip: To obtain the full path of the metadata file or script, the registration script can prepare the config.properties by referring to maestro variables, which keep the path and directory information of plug-in installation.
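For example, a registration script might assemble the full paths with standard Python before registering. This is a sketch; the script directory value below is a placeholder for whichever maestro variable your plug-in installation provides:
import json
import os

# Placeholder: obtain the plug-in script directory from the maestro
# variable that your plug-in installation provides.
scriptdir = "<plug-in script directory>"

config = {
    "metafile": os.path.join(scriptdir, "monitor_metadata.json"),
    "executable": os.path.join(scriptdir, "collect_metrics.sh"),
    "validRC": "0",    # integer or string that converts into an integer
    "timeout": "10"    # seconds to wait for the script to run
}

maestro.monitorAgent.register(json.dumps({
    "node": "${maestro.node}",
    "role": "${maestro.role}",
    "collector": "com.ibm.maestro.monitor.collector.script",
    "config": config
}))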
Table 4. Configuration for the HTTP collector
Property Required? Value
metafile Yes The full path string of the metadata file that contains the JSON object.
url Yes The URL string of the requesting plug-in metrics.
query No The arguments string of the query in the HTTP request.
validRC No A code string for a valid HTTP return. The default value is 200. The value can be an integer or a string that converts into an integer.
timeout No The amount of time to wait for the HTTP response, in seconds. The default value is 5. The value can be a number or a string that converts into a number.
retry_delay No The time interval, in seconds, between a failed call and the next retry attempt. The value can be a number or a string that converts into a number.
retry_times No The total number of retry attempts to make before entering a delay period. The value can be an integer or a string that converts into an integer.
datahandler No A JSON object with the properties of the utility JAR package that transforms the HTTP response into metrics.

For an example of each collector type, see Monitoring collector examples.

Metadata file

The metadata file is referenced when a collector is registered.

The plug-in provides a JSON-formatted file that includes collector metadata parameters, the metric category types that it wants to expose, and metadata that describes each exposed metric. The metadata file contains the following information:
  • The metadata file version.
  • The array of category names to register (1..n).
  • The interval time, in seconds, to poll for updated data.
  • Configuration parameters that are unique for each category (example: mbeanQuery).
  • The list of metric metadata objects:
    attributeName
    Specifies an attribute from the collector to associate to this metric.
    metricName
    Specifies a metric name to expose through the monitoring agent APIs.
    metricType
    Specifies the data type, such as range, counter, time, average, percent, and string. The metricType is not yet checked; any metricType from the list is accepted. Choose the metricType that best matches your data.
    description
    (optional) Specifies the string that defines the metric.
The format of the metadata file is as follows:
{
"Version" : <metadata file version>,
"update_interval": <interval time in seconds to poll for updated data>,
"Category": [
            <array of category names to register (1..n)>
            ],
"Metadata": [
            {
            "<category name from Category[]>": {
                "metrics": [
                  {
                   "attribute_name": <attribute from collector to associate to this metric>,
                   "metric_name": <metric name to expose through monitoring agent APIs>,
                   "metric_type": <metric value data type, one of "RANGE",
                                                                  "COUNTER",
                                                                  "PERCENT",
                                                                  "STRING",
                                                                  "AVAILABILITY",
                                                                  "STATUS">
                  },
                  ... ...
                ]
              }
            },
            ... ...
           ]
}
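As an illustration only (the category, attribute, and metric names are invented), a metadata file for a single category might look like this:
{
"Version": 2,
"update_interval": 15,
"Category": [
            "MY_CATEGORY"
            ],
"Metadata": [
            {
            "MY_CATEGORY": {
                "metrics": [
                  {
                   "attribute_name": "request_count",
                   "metric_name": "request_count",
                   "metric_type": "COUNTER",
                   "description": "Number of requests served since start"
                  },
                  {
                   "attribute_name": "heap_used",
                   "metric_name": "heap_used",
                   "metric_type": "PERCENT",
                   "description": "Percentage of the JVM heap in use"
                  }
                ]
              }
            }
           ]
}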

Metric format

The data entered by plug-ins into a collector must follow a specific format so that the monitoring agent can parse and transfer it as metrics. For example, a plug-in that uses the script collector must ensure that the script output is formatted, and a plug-in that uses the HTTP collector must ensure that the HTTP response or the data handler output is formatted. The metric format is in JSON:
{
    "version": <version number>,
    "category": [
        <category name>,
        ... ...
    ],
    "content": [
        {
            <category name>: {
                <metric name>: <metric value>,
                ... ...
            }
        },
        ... ...
    ]
}
The following example shows metrics that are formatted correctly for the collector:
{
"version": 2,
    "category": [
        "WAS_JVMRuntime",
        "WAS_TransactionManager",
        "WAS_JDBCConnectionPools",
        "WAS_WebApplications"
    ],
    "content": [
        {
            "WAS_JVMRuntime": {
                "jvm_heap_used": 86.28658,
                "used_memory": 176576,
                "heap_size": 204639
            }
        },
        {
            "WAS_TransactionManager": {
                "rolledback_count": 0,
                "active_count": 0,
                "committed_count": 0
            }
        },
        {
            "WAS_JDBCConnectionPools": {
                "max_percent_used": 0,
                "min_percent_used": 0,
                "percent_used": 0,
                "wait_time": 0,
                "min_wait_time": 0,
                "max_wait_time": 0
            }
        },
        {
            "WAS_WebApplications": {
                "max_service_time": 210662,
                "min_service_time": 0,
                "service_time": 8924,
                "request_count": 30
            }
        }
    ]
}

Error handling

Two types of errors can occur while a collector calls scripts (for the script collector) or a data handler (for the HTTP collector) to get formatted metrics:
  • At the script or data handler level, when errors occur in scripts and data handlers.
  • At the collector level, when errors are raised when the collector starts a script or invokes a data handler.
A collector handles collector-level errors directly, but it can handle errors at the script or data handler level only when the errors are returned by the scripts or data handlers. For the script collector, scripts should communicate with plug-ins, gather the metrics, and output formatted metrics. Scripts can handle errors as either expected or unexpected while communicating and formatting, and then expose the errors to the collector. For the HTTP collector, data handlers transform data in an HTTP response from plug-ins into formatted metrics. Data handlers can handle transformation errors and then expose them to the collector. To communicate errors to the collector, scripts and data handlers must wrap them in an FFDC object:
{
"FFDC": <error message >
}
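For example, a collector script might emit either formatted metrics or an FFDC object on its standard output. This Python sketch is illustrative: the category and metric names are placeholders, and the policy for which errors to report is up to the plug-in:
import json
import sys

def gather_metrics():
    # Placeholder for the plug-in specific logic that talks to the
    # middleware and returns {"<metric name>": <metric value>, ...}.
    return {"request_count": 30}

try:
    metrics = gather_metrics()
    # Normal path: print metrics in the format described above.
    print(json.dumps({
        "version": 2,
        "category": ["MY_CATEGORY"],
        "content": [{"MY_CATEGORY": metrics}]
    }))
except Exception as exc:
    # Wrap the error so that the collector can log and propagate it.
    print(json.dumps({"FFDC": str(exc)}))
    sys.exit(0)  # exit with the registered validRC so only the FFDC output is reported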

When a collector gets an FFDC object from scripts or data handlers, it logs the object in log and trace files for troubleshooting. It also propagates the object to the monitoring agent, which then clears the corresponding records from the monitoring cache so that the monitoring API no longer returns old metrics. As a result, the user interface does not display these error messages for plug-ins from which FFDC objects are being collected.

For collector-level errors that are raised when a collector runs scripts, sends HTTP requests, or invokes data transformers, the collector wraps errors in FFDC objects and then logs and propagates them the same way as FFDC objects from scripts and data handlers.

Script collector error messages include:
  • The collector has trouble calling the scripts that the plug-in registered for outputting metrics; for example, the script files are missing.
  • The collector gets an error return code (RC) when it runs the script files.
  • The collector gets nothing, or an empty string, from the script output.
  • The collector fails to parse metrics from the script output because of an unexpected ending or an incorrect JSON format.
  • The collector gets an error message from the script output.
  • The script run times out.
HTTP collector error messages include:
  • The collector times out while waiting for the HTTP response.
  • The collector gets nothing from the HTTP response.
  • The collector gets an error status code from the HTTP response, such as 4xx for a client error or 5xx for a server error.
  • The collector gets an error from the user transformer instead of metrics.
  • The collector fails to parse metrics from the user transformer because of an incorrect JSON format.
The HTTP collector might catch errors before it invokes the data handler, and as a result, no data is available to the data handler for further processing. In this situation, the collector passes in a null object, which makes the data handler aware that there is no data input. The data handler can determine how to generate the final data for the collector. For example, data handlers can either create FFDC objects with a message such as "No data collected" for null object input, or create plug-in metrics with their own specific values for null, such as "UNKNOWN" for availability.
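The actual data handler is packaged in the utility JAR that is named by the datahandler property; the decision logic just described might look like the following Python sketch, shown here only for brevity (names are placeholders):
import json

def transform(response_data):
    # The HTTP collector passes None (null) when it could not get any data.
    if response_data is None:
        # Option 1: report the failure as an FFDC object ...
        # return json.dumps({"FFDC": "No data collected"})
        # Option 2: ... or emit plug-in metrics with sentinel values.
        return json.dumps({
            "version": 2,
            "category": ["MY_AVAILABILITY"],
            "content": [{"MY_AVAILABILITY": {"availability": "UNKNOWN"}}]
        })
    # Normal path: transform the HTTP response into formatted metrics.
    payload = json.loads(response_data)
    return json.dumps({
        "version": 2,
        "category": ["MY_CATEGORY"],
        "content": [{"MY_CATEGORY": {"request_count": payload.get("requests", 0)}}]
    })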

User interface presentation

Plug-in metrics are displayed on the Middleware Monitoring tab of the Instance Console. Plug-ins provide metadata to describe the metric and category for displaying the metrics, and define the format for displaying metrics.

The monitoring_ui.json file is located under the plugin directory of a plug-in project, for example, plugin.com.ibm.was/plugin/monitoring_ui.json. Other JSON files are also in this directory, including config.json and config.meta.json.

Note: Middleware Monitoring does not display for invisible roles. A role is invisible if dashboard.visible is set to false for a role in the topology model. By default the value is set to true.
 "role" : {
        "name"  : "$roleName",
        "type"  : "$roleName",
        "dashboard.visible" : false,
The metadata is defined in monitoring_ui.json. Two versions of this file are supported:
Figure 1. monitoring_ui.json for a single role
[
{
        "version": 1,
        "category": <category name from Category[] defined in
                      metric metadata>,
        "label": <the content shown on the chart for the category>,
        "displays": [
            {
                "label": <string shown on the chart element for the
                            metric>,
                "monitorType": <time and type properties of the metric to
                                 display>,
                "chartType": <chart type for displaying the metric>,
                "metrics": [
                    {
                        "attributeName": <metric name defined in
                                            metric metadata>,
                        "label": <string shown on the chart element
                                    for the metric>
                    }
                ]
            }
        ]
},
... ...
]
It is assumed that monitoring_ui.json serves a single role that has the same name as the plug-in. It should be used with plug-ins that contain only a single role and no cross referencing to other plug-ins.
To support multiple roles within a plug-in, Version 2 has an extra array-type attribute displayRoles, which can associate one metric category with one or more roles.
Figure 2. Version 2 of monitoring_ui.json for multiple roles
[
{
        "version": 2,
        "displayRoles": [<role name>, ...],
        "category": <category name from Category[] defined in
                      metric metadata>,
        "label": <the content shown on the chart for the category>,
        "displays": [
            {
                "label": <string shown on the chart element for the
                            metric>,
                "monitorType": <time and type properties of the metric to
                                 display>,
                "chartType": <chart type for displaying the metric>,
                "metrics": [
                    {
                        "attributeName": <metric name defined in
                                            metric metadata>,
                        "label": <string shown on the chart element
                                    for the metric>
                    }
                ]
            }
        ]
},
... ...
]

For both versions of the monitoring_ui.json file, displays defines attributes for the appearance of the metric in the user interface. All metrics in one category are displayed the same way and share one chart. The monitorType and chartType attributes should be used together to define how the metrics look. For example, if monitorType is set to HistoricalNumber and chartType is set to Lines for a category of metrics, the metrics are displayed as a line graph with time on the X axis and metric values on the Y axis.
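For instance, a version 2 file that plots the WAS_JVMRuntime heap metrics from the earlier metric example as a line graph might look like the following; the role name and label strings are illustrative:
[
{
        "version": 2,
        "displayRoles": ["WAS"],
        "category": "WAS_JVMRuntime",
        "label": "JVM Runtime",
        "displays": [
            {
                "label": "Heap Usage",
                "monitorType": "HistoricalNumber",
                "chartType": "Lines",
                "metrics": [
                    {
                        "attributeName": "used_memory",
                        "label": "Used Memory (KB)"
                    },
                    {
                        "attributeName": "heap_size",
                        "label": "Heap Size (KB)"
                    }
                ]
            }
        ]
}
]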

Table 5. Monitor types
Monitor types (monitorType) Description
HistoricalNumber Metric data in simple number for historical timeline
HistoricalPercentage Metric data in percentage for historical timeline
RealtimeNumber Metric data in simple number for current temporality
RealtimePercentage Metric data in percentage for current temporality
Table 6. Chart types
Chart types (chartType) Presentation
Lines Line chart
StackedAreas Stacked line chart (area chart)
StackedColumns Column chart
As an alternative to monitorType and chartType, you can use chartWidgetName to define the appearance. The following example shows the use of chartWidgetName.
[
{
    ... ...
        "category": "DATABASE_DRILLDOWN_HEALTH",
        "label": "Database Health Indicator",
        "displays": [
            {
                "label": " Database Health Indicator ",
                "chartWidgetName": "paas.widgets.HealthStatusTrend",                
                "metrics": [
                    {
                        "attributeName": "data_server_status",
                        "label": " Data_Server_Status " 
                    },
                    {
                        "attributeName": "io",
                        "label": "I/O" 
                    },
                    {
                        "attributeName": "locking",
                        "label": " Locking " 
                    },                                         
                    {
                        "attributeName": "logging",
                        "label": "Logging " 
                    },
                    {
                        "attributeName": "memory",
                        "label": "Memory" 
                    },      
                    {
                        "attributeName": "recovery",
                        "label": "Recovery" 
                    }, 
                    {
                        "attributeName": "sorting",
                        "label": "Sorting" 
                    }, 
                    {
                        "attributeName": "storage",
                        "label": "Storage" 
                    },                                                                            
                    {
                        "attributeName": "workload",
                        "label": "Workload" 
                    } 
                ]
            }                                     
        ] 
    }
]
Metrics with this configuration display as a two-column list, with metric labels in the first column and a colored square indicator icon that shows the status in the second.

Availability

After a plug-in is deployed, its role has a special status that indicates its overall health, called availability. It has the following status values.
  • NORMAL
  • WARNING
  • CRITICAL
  • UNKNOWN
An icon is associated with each value.
To provide health status for a role, a plug-in can bind one of its metrics to availability so that the monitoring service can show the status and update the indicator icons based on the current value of the metric. Set metric_type to AVAILABILITY in the plug-in metadata file to make this association.
... ...
"metadata":[
    	{
            "DATABASE_AVAILABILITY":{
                "metrics":[{
                        "attribute_name":"database_availability",
                        "metric_name":"database_availability",
                        "metric_type":"AVAILABILITY"
                    }
                ]
            }
        },    
... ...

The metric that is associated with availability can belong to any category, but a plug-in can have only one metric for binding. If multiple bindings are defined, only the first one is effective and the rest are ignored. The metric for binding must be a String type and can accept only the supported values at run time: "NORMAL", "WARNING", "CRITICAL", and "UNKNOWN".

For a plug-in that binds a metric to availability, the "UNKNOWN" status should be set when the plug-in cannot retrieve metrics for availability. The status can be set in collector scripts (if the plug-in uses the script collector) or data handlers (if the plug-in uses the HTTP collector), as shown in the sketch after this list.
  • Because scripts retrieve metrics by their own mechanism, they can create the metric that is predefined for availability when they fail to retrieve metrics.
  • Because data handlers get data from the HTTP collector, they need the collector to tell them when nothing is retrieved from an HTTP response. By agreement, the HTTP collector passes a null object into data handlers when it fails to get data before it invokes the data handlers. Data handlers should create the metric that is predefined for availability when they receive a null object. For more information, see Troubleshooting monitoring collectors.
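For example, a collector script that cannot reach its middleware might emit the availability metric with the "UNKNOWN" value. In this sketch, the category and metric names match the metadata snippet shown earlier:
import json

# Emit the availability metric with the "UNKNOWN" value when the
# plug-in cannot retrieve metrics for availability.
print(json.dumps({
    "version": 2,
    "category": ["DATABASE_AVAILABILITY"],
    "content": [{"DATABASE_AVAILABILITY": {"database_availability": "UNKNOWN"}}]
}))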
If a plug-in does not bind a metric to availability, the monitoring service applies the following algorithm to generate availability for the plug-in role:
  • When a scaling policy is used for the plug-in, the state of scaling is used to determine the health status. When a threshold is reached, availability is set to "WARNING" until a new role instance is created. If the maximum number of role instances is reached or the created instance fails, availability is set to "CRITICAL".
  • When a scaling policy is not used, operating system metrics are used for availability. If processor usage is greater than 70%, availability is set to "WARNING", and if usage is greater than 85%, availability is set to "CRITICAL".

Troubleshooting monitoring collectors

Problem: I registered the collector for roles in my plug-in, but the roles are not listed in Middleware Monitoring View.

Resolution: There are several possible causes:
  • Wrong role name in "register" parameter.

    When a plug-in registers a collector for itself, it usually obtains the role name directly from maestro.role['name'] in its scripts. But when a plug-in registers a collector for another plug-in instead of itself, maestro.role['name'] returns the wrong role name, because it always returns the role name of the current plug-in. For example, when the plug-in OPMDB2 registers a collector for the plug-in DB2, maestro.role['name'] returns the OPMDB2 role name rather than the DB2 role name. Ensure that you pass the correct role name into the "register" parameter and that it is the role your collector is working for.

  • A version 1 monitoring_ui.json file is in the wrong plug-in.

    Because a version 1 monitoring_ui.json file lacks role information, the user interface assumes that monitoring_ui.json is serving the plug-in role that it is in. To avoid this issue, ensure that monitoring_ui.json is in the plug-in for which the metrics are collected or use a version 2 monitoring_ui.json file that explicitly defines roles.

  • Missing role name in a version 2 monitoring_ui.json file.

    A version 2 monitoring_ui.json file requires the "displayRoles" attribute. If a version 2 monitoring_ui.json file does not contain role information, the user interface cannot find the right monitoring_ui.json for the collected metrics. If you are updating a monitoring_ui.json file from version 1 to version 2, ensure that it includes the "displayRoles" attribute and that the attribute includes the role that your collector works with.

  • Metrics are collected for an invisible role.

    If a role is invisible, metrics do not display in the user interface, even if metrics are collected for it. If a role is not displaying, ensure that "dashboard.visible" is not set to false.

    There are two kinds of roles for visibility: visible roles and invisible roles. A visible role is displayed on the Virtual Application Instances page and usually represents a core component of the deployment, such as the WebSphere Application Server and DB2® roles. An invisible role cannot be seen on the Virtual Application Instances page. These roles usually function as back-end or assisting components; examples include MONITORING, SSH, and OPMDB2. Plug-in developers can specify the visibility of their plug-in roles by setting a Boolean value for the attribute "dashboard.visible" in the topology. The following topology snippet shows an example:
    {    
    "vm-templates": [        
    	... ...
        {
                "name": "database-db2",
                "roles": [
                    {   
                        "type": "DB2",
                        "name": "DB2"     
                        ... ...  
                    },
                    {
                        "global": true,
                        "plugin": "ssh/2.0.0.1",
                        "dashboard.visible": false,
                        "type": "SSH",
                        "name": "SSH"
     
                   },
                    {
                        "global": true,
                        "plugin": "opmdb2/1.0.0.0",
                        "depends": [
                            {
                                "role": "database-db2.DB2",
                                "type": "DB2"
                            }
                        ],
                        "dashboard.visible": false,
                        "type": "OPMDB2",
                        "name": "OPMDB2"
                    },
                       ... ...
                   ],
                    ... ...
            }
        ],
    
    }
    In this example, DB2 is a primary role and is visible. SSH and OPMDB2 are invisible roles because their "dashboard.visible" value is set to false.
  • Metrics are not collected.

    The monitoring service ignores roles without initial metrics even if there is a collector that is successfully registered for the role. Check the log and trace files to verify whether the collector is working properly.

Problem: I can see my roles listed in the Middleware Monitoring View, but I cannot see their metrics. The message CWZMO0040W: No real-time metric data is found for deployment is displayed.

Resolution: The error message is displayed when the monitoring service can no longer find metrics for a certain role. Check the log and trace files to verify that the collector is working properly.

Auto scaling

The elastic scaling, or auto scaling, feature in a plug-in uses monitoring. Auto scaling provides the automatic addition or removal of virtual application and shared service instances based on workload.

You can optionally turn on the auto scaling feature by attaching the scaling policy to a target application or shared service. The policy is also used to deliver the scaling requirements to the back-end engine. Requirements include trigger event, trigger time, and instance number, which drive the scaling procedure.

Cloud Pak System Software for x86 supports two types of scaling: horizontal scaling and vertical scaling.

Horizontal Scaling

Horizontal scaling expands or shrinks a deployment by adding nodes to the deployment (scale-out) or removing nodes from the deployment (scale-in).

Vertical Scaling

Vertical scaling increases or reduces node size by adding processor cores, increasing memory size, or by attaching new disks to the nodes (scale-up), or by removing processor cores, decreasing memory size, or by removing disks from the nodes (scale-down).

Scaling policy overview

The auto scaling policy can be attached to two kinds of components in Cloud Pak System Software for x86: a virtual application and a shared service. For a virtual application, you can explicitly add the scaling policy to one or more components of the application in the Pattern Builder. For a shared service, the scaling policy must be described in the application model by the plug-in developer if the service requires the auto scaling capability.

Plug-ins, whether for virtual applications or shared services, define the scaling policy, describe the policy in the application model, and provide transformers that interpret the policy and add scaling attributes to the topology document when the policy is deployed with the plug-in. The application build automatically generates the scaling policy segment in the application model only if you are using shared services. At run time, the back-end auto scaling engine first loads the scaling attributes and generates the rule set for the scaling trigger. Then, the back-end engine evaluates the rule set and decides whether the workload has reached a threshold for adding or removing application or shared service instances. The final step of the process is to complete the request.

To apply the auto scaling policy to a plug-in, ensure that the scaling policy is defined in the application model that the plug-in is associated with, which collects user-specific requirements for the scaling capability. Also, ensure that the policy is transformed into the topology document, which guides the back-end engine to inspect the trigger event and take scaling actions.

Scaling elements

A scaling policy specifies criteria for driving automatic behaviors and constraints on scaling actions. It contains three primary elements: trigger event, trigger time, and resource limit.
  • Trigger event

    Scaling actions are triggered based on the changing value of certain metrics. The trigger event specifies the type of monitoring metrics and threshold range for which different scaling actions are triggered.

    For each metric in the event definition, there are two thresholds: a scale-in threshold and a scale-out threshold. For example, if the processor use of the virtual machines that run WebSphere Application Server instances is the metric for the trigger event, and the thresholds for scale-in and scale-out are 20% and 80%, then when the processor use rises above 80%, a new WebSphere Application Server instance is started. When the processor use falls below 20%, an existing WebSphere Application Server instance is selected for removal.

  • Trigger time

    To prevent a transient spike from triggering an action, the system monitors a metric value over a time span, rather than acting on an instantaneous value, before it triggers a scaling action. Trigger time specifies how long a threshold condition must hold before the scaling action is taken. For example, at the moment that the processor use is monitored higher than 80%, a timer is started. If the trigger time is set to 120 seconds, then when the timer reaches 120 seconds, the scale-out operation is started. If the processor use falls outside the threshold while the timer runs (drops below 80% in this example), the timer stops. It restarts when the processor use crosses the threshold again. A sketch of this logic follows the list.

  • Resource limit

    A resource limit for scaling behaviors is required to prevent a deployment from using all of the system resources. For example, a scaling policy should specify the minimum and maximum number of instances that a plug-in can have at one time. When the cluster size of a plug-in reaches the boundary of its range, no instance is added to or removed from the cluster, even when the trigger event is met.
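The trigger-time behavior described above amounts to a debounce. The following Python sketch illustrates the rule for a scale-out threshold; it is an illustration only, not the back-end engine's actual code:
import time

SCALE_OUT_THRESHOLD = 80   # percent processor use (illustrative)
TRIGGER_TIME = 120         # seconds the condition must hold

def watch(read_cpu_percent, scale_out):
    # Debounce loop: the threshold must hold for TRIGGER_TIME seconds
    # before the scaling action is taken.
    breach_started = None
    while True:
        if read_cpu_percent() > SCALE_OUT_THRESHOLD:
            if breach_started is None:
                breach_started = time.time()       # start the timer
            elif time.time() - breach_started >= TRIGGER_TIME:
                scale_out()                        # threshold held long enough
                breach_started = None              # reset after acting
        else:
            breach_started = None                  # value left the range: stop the timer
        time.sleep(5)                              # polling interval (illustrative)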

The scaling policy includes elements for both horizontal and vertical scaling. There are three types of trigger events: horizontal scaling, vertical scaling of processor, and vertical scaling of memory. ScaleUpCPUThreshold specifies the threshold at which the processor count is increased on a node. ScaleUpMemoryThreshold specifies the threshold at which the memory size is increased on a node. The resource limit covers both processor and memory. The processor min and max values specify the maximum number of cores that a scale-up action can add to a node and the minimum number that a scale-down action can reduce a node to. The memory min and max values specify the maximum amount of memory that can be allocated to a node and the minimum amount that a node can be reduced to. This version also includes increment and decrement values for a single scaling action. The processor increment and decrement values specify how many cores are added or removed in one scale-up or scale-down action. The memory increment and decrement values specify how much memory is added or removed in one scale-up or scale-down action. In previous versions, these values did not exist because, by default, only one node is added or removed in one scale-in or scale-out action.

Application model

Auto scaling capability is embodied as a policy in the application model. The application model is used to describe the components, policies, and links in the virtual applications or shared services. For virtual applications, the model can be visually displayed and edited with the Pattern Builder.

Virtual application designers can customize components and policies, including the auto scaling policy, in the Pattern Builder. There is no tool to visualize shared services in the application model; auto scaling for a shared service can be customized only in the Instance Console when the service is deployed. The scaling policy that is described in the application model, for either a virtual application or a shared service, follows the application model specification. The policy is defined in the node with a group of attributes.

The three auto scaling elements, trigger event, trigger time, and instance number range, are described in the attribute set. There is no naming convention for the attribute keys, but the plug-in must understand them in order to transform them into the topology document. The following code is an example of the elements that are described in the plug-in:
"model": {
   "nodes": [
  	       {
			   ... ...
             },
             {
             "id": <policy id>
             "type":<policy type>
             "attributes": {
		
      	              <No.1 metric id for trigger event>: [
      	              < threshold for scale-in >,
      	              < threshold for scale-out >
      	               ],
      	              <No.2 metric for trigger event>: [
      	              < threshold for scale-in >,
      	              < threshold for scale-out >
      	               ],
      	              <... :[... ,... ]>
      	              <No.n metric for trigger event>: [
      	              < threshold for scale-in >,
      	              < threshold for scale-out >
      	                ],
      	              <trigger time id>: <trigger time value>
      	              <instance range number id": [
      	                 <min number>,
      	                 <max number>
      	                ],
					}
			},
			{
				... ...
			}
		]
}

The attributes describe the scaling policy in an application model. As the example JSON segment shows, the trigger event can include multiple metrics and thresholds for one scaling policy, which means that scaling operations on a plug-in can be triggered by different condition entries with different metrics. The relationship among these entries is explicitly defined by the plug-in transformer and marked in the topology document. It is not required to mark them in the application model, except that their labels can be used to define the relationship in the user interface. Cloud Pak System Software for x86 requires that the plug-in provide metadata that explains the components in the application model for user interface presentation. For the scaling policy, the plug-in can apply correct widget types and data types to the attributes for Trigger Event, Trigger Time, and Instance Number Scope.
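For illustration, a filled-in policy node might look like the following; the id, type, and attribute key names are hypothetical, because each plug-in chooses its own names:
{
    "id": "ScalingPolicyOfWAS",
    "type": "ScalingPolicyWAS",
    "attributes": {
        "scalingCPU": [20, 80],
        "triggerTime": 120,
        "instanceNumber": [1, 10]
    }
}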

Topology model

In the topology document, the scaling section is extended to contain the attributes from auto scaling. A plain scaling section contains only the min and max attributes, both of which typically have the same value. The value indicates the size of a fixed cluster on the plug-in template.
"vm-templates": [
		{
			...
      scaling :{
                 "min": <number>,
                 "max": <number>,
                  }
		},
		{
      ...
		}
]
When more attributes, such as triggerEvents and triggerTime, are included in the scaling section, it evolves into an auto scaling capability on the cluster. The values of min and max should no longer be the same: min is the lower limit of the Instance Number Scope and max is the upper limit. The attributes for auto scaling are shown in the following JSON code example.
"vm-templates": [
{
	...
  scaling :{
     "role" : <role type for the template>,
     "triggerEvents": [
         {
            "metric": <metric category and item linked by ".">,
            "scaleOutThreshold": {
                "value": <metric value with its data type>,
                "type": "CONSTANT",
                "relation": <comparison symbol including "<", 
                                                         ">", 
                                                         "<=",
                                                         ">=" >
                             },
            "conjunction": <conjunction type with other trigger 
                           events including "OR", "AND">
            "scaleInThreshold": {
                "value": <number>,
                "type": "CONSTANT",
                "relation": <comparison symbol>
                         }
        	},
         "triggerTime": <number>
         },
        {
            "metric": " metric category and item",
            "scaleOutThreshold": {
                "value":<number>,
                "type": "CONSTANT",
                "relation": <comparison symbol>
            },
            "conjunction": <conjunction type with other trigger 
                           events>
            "scaleInThreshold": {
                "value": <number>,
                "type": "CONSTANT",
                "relation": <comparison symbol>,
                "electMetricTimeliness": <"historical"|"instant">
            }
            "triggerTime": <number>
        },
	  {
       {
            "metric": " metric category and item",
            "scaleUpCPUThreshold": {
                "value":<number>,
                "type": "CONSTANT",
                "relation": <comparison symbol>
            },
            "conjunction": <conjunction type with other trigger 
                           events>
            "triggerTime": <number>
	  },
	  {
            "metric": " metric category and item",
            "scaleUpMemoryThreshold": {
                "value":<number>,
                "type": "CONSTANT",
                "relation": <comparison symbol>
            },
            "conjunction": <conjunction type with other trigger 
                           events>
            "triggerTime": <number>
		  },
		  {
      ...
        }
    ],
    "min": <number>,
    "max": <number>,
    "maxcpucount": <number>,
    "minmemory": <number>,
    "cpucountUpIncrement": <number>,
    "memoryUpIncrement": <number>,
    "triggerTime": <number>,
   }
   ...
 },
 {
  ...
 }
]

Cloud Pak System Software for x86 supports multiple trigger events for a scaling operation. The events are aggregated in one of two modes: OR and AND. The OR mode means that the scaling operation is triggered if any one event happens. The AND mode means that the scaling operation is triggered only if all events happen at the same time. Auto scaling depends on monitoring to collect metrics for inspection. To ensure that the right metrics are collected, the value of the metric key in each trigger event must be consistent with the category and attributeName attributes that are defined in the plug-in metadata for monitoring collectors. The values are joined by a period (.) to form the metric value. For example, CPU.Used represents the metric with a category of CPU and an attributeName of Used. Monitoring also provides a group of OS-level metrics, which plug-in developers can also select and use for auto scaling. For details, see Metrics collected by Monitoring Agent for IBM Cloud Pak System Software.

Some attributes are specific to a particular scaling type. For example, min and max are used with horizontal scaling. The following table lists attributes and their associated scaling type.
Table 7. Attributes
Type Key Description
Horizontal scaling min The minimum number of virtual machines that a role can have
max The maximum number of virtual machines that a role can have
scaleInThreshold Metric and its threshold for scale-in action
scaleOutThreshold Metric and its threshold for scale-out action
Vertical scaling maxcpucount The maximum number of cores that a virtual machine can have
scaleUpCPUThreshold Metric and its threshold for scale-up CPU action
cpucountUpIncrement Core count to increase by in one scale-up CPU action
minmemory The minimum memory size for a virtual machine
maxmemory The maximum memory size for a virtual machine
scaleUpMemoryThreshold Metric and its threshold for scale-up memory action
memoryUpIncrement Memory size to increase by in one scale-up memory action

The triggerTime attribute is shared by several scaling types and trigger events. It can be placed either inside or outside a triggerEvent object, and its placement determines its scope. If triggerTime is placed inside a triggerEvent object, it applies only to that triggerEvent object. If triggerTime is placed outside the triggerEvent objects, it applies globally to all triggerEvent objects. If there is a triggerTime attribute both inside and outside a triggerEvent object, the triggerTime that is inside the object takes precedence.
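For example, in the following fragment (the metric names are illustrative), the first trigger event uses its own 60-second trigger time, while the second inherits the global 120-second value:
"scaling": {
    "triggerEvents": [
        { "metric": "CPU.Used", ... , "triggerTime": 60 },
        { "metric": "Memory.Used", ... }
    ],
    "triggerTime": 120
}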

The transformer that is provided by the plug-in must define attributes of the scaling policy in the application model and map them to the named attributes in the topology document. The Trigger Event, Trigger Time, and Instance Number Scope autoscaling elements correspond to triggerEvents, triggerTime, and min and max.

The scale-in action is more complicated than scale-out because scale-in must select a candidate role instance. The selection logic of scale-in follows predefined rules, which try to make scale-in as reasonable as possible. The logic takes an instance as the scale-in candidate in priority order, from Rule 1 to Rule 4:
  1. A terminated or failed virtual machine, except for the master.
  2. A virtual machine that has a status other than RUNNING, such as LAUNCHING, INITIALIZING, or STARTING, except for the master.
  3. The virtual machine whose role has the least or greatest value of the specified metric in the cluster, except for the master.
  4. A random virtual machine, except for the master.
This selection logic ensures that unworkable virtual machines are removed first, and a random one is removed last. For Rule 3, users can define a group of scaling attributes in the topology to customize, for example, the metric type, value feature, and timeliness. For convenience, the customization reuses some of the described auto scaling attributes. Here is a scaling template example:
{
    "role" : "WAS",
    "triggerEvents": [
        {
           "metric": "CPU.Used",
            "scaleOutThreshold": {
                "value": 80,
                "type": "CONSTANT",
                "relation": ">="
            },
            "conjunction": "OR",
            "scaleInThreshold": {
                "value": 20,
                "type": "CONSTANT",
                "relation": "<",
                "electMetricTimeliness" : "historical"
            }
        }
    ],
    "min": 1,
    "max": 10,
    "triggerTime": 120
}

The values for "metric", "scaleInThreshold", "relation", and "electMetricTimeliness" are used to guide how to select a WebSphere Application Server instance for scale-in if plug-in provides manual scaling operations. In this example, "metric" specifies that processor utilization is the metric. The "<" for "relation" specifies that the candidate instance for scale-in is the one with lowest processor utilization in the cluster. A value of ">" would indicate the greatest processor utilization instead. For "electMetricTimeliness", the value can be "historical" or "instance". The "historical" value specifies that the scale-in instance is selected on an average of historical value in 5 minutes.

Manual scaling

Manual scaling provides virtual application administrators with a flexible and controllable way to add or remove instances of virtual applications or shared services. By using "autoscalingAgent.scale_out" and "autoscalingAgent.scale_in", manual scaling can run in an autoscaling-safe way. Customization of some manual scaling features is supported, typically focusing on scale-in, by using the manual scaling policy. When a plug-in exposes manual scaling operations, it transforms the policy into predefined attributes of the topology, which the scaling back-end uses to achieve the customized features.

In the topology document, manual scaling uses the same attributes as auto scaling.
"scaling": {
                   "role": "RTEMS"
                   "triggerEvents": [{
                          "metric": "RTEMS.ConnectionNumber ",
                          "scaleOutThreshold": { ... },
                          "conjunction": "OR",
                          "scaleInThreshold": {
                                    "value": 20,
                                    "type": "CONSTANT",
                                    "relation": "<",
                              "electMetricTimeliness" : "instant"
            }

             }
                       ],
                   "triggerTime": 120,
                   "min": 1,
                   "max": 10,
                   "manual": {
                             "scaleInMetric":"RTEMS.ConnectionNumber",
                             "metricType" : "instant",
                             "rule": "minimum"
                             }
                  }
This example shows a deployment with both an auto scaling and a manual scaling policy on a Remote Tivoli® Enterprise Monitoring Server (RTEMS). The scale-in is triggered automatically by using "triggerEvents" and "triggerTime", and can also be applied manually by users. For manual scale-in, the RTEMS instance with the lowest value of "ConnectionNumber" among all instances is selected as the one to destroy each time.
Use the following guidelines when you develop auto scaling and manual scaling policy and operations for plug-ins:
  • If an auto scaling policy is applied, nothing else is required in the Pattern Builder to support manual scaling. The plug-in must expose manual scale-in and manual scale-out operations. The scaling template that is provided by the plug-in must identify which metric to use for manual scale-in if multiple metrics are used; only the first metric is used by default. Typical plug-ins include WebSphere Application Server and shared services such as caching and monitoring.
  • A plug-in should provide a combination of auto and manual scaling templates for the user to choose from, but only one can be applied to a vmTemplate.
  • If you do not want or need to support auto scaling in your plug-in but want to enable manual scaling, you must still define the triggerEvent. The scaling template still provides min, max, role, and triggerEvent. However, the triggerEvent contains only the scale-in metric to use (with a < or > relation) along with the min and max attributes, which any plug-in or shared service, such as the load balancer, can use.

Scaling Interface

The autoscalingAgent utility defines a generic API for Python scripts to interact with the auto scaling agent on the virtual machine. The utility provides several functions.

Manual scale-out request for the specified role and template. The parameter is a JSON-like string. The "vmTemplate" and "roleType" values indicate the role and template to which "scale-out" applies. Calling the function fails if:
  • The maximum number of instances was reached
  • Scaling was paused or disabled
  • The deployment is being updated
  • There are insufficient resources in the cloud to fulfill the request
maestro.autoscalingAgent.scale_out('{
             "vmTemplate":  String, 
             "roleType":  String
         }') 
The following example shows usage of the function:
maestro.autoscalingAgent.scale_out('
 {"vmTemplate":"Web_Application-was",
  "roleType":"WAS"}
')
Manual scale-in request for the specified role and template. The parameter is a JSON-like string. The values for "vmTemplate" and "roleType" indicate the role and template to which "scale-in" applies. The "node" attribute is optional and indicates the name of the node to be removed. If no "node" is in the parameter, scale-in follows the predefined rules and the manual scaling policy to select a node for removal. Calling this function fails if:
  • The minimum number of instances was reached
  • Scaling was paused or disabled
  • The deployment is being updated
maestro.autoscalingAgent.scale_in('{
             "vmTemplate":  String, 
             "roleType":  String,
             ["node" : String]
         }')
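The following example shows usage of the function, mirroring the earlier scale-out call; the optional "node" attribute is omitted, so the predefined rules select the node to remove:
maestro.autoscalingAgent.scale_in('
 {"vmTemplate":"Web_Application-was",
  "roleType":"WAS"}
')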
Pause all scaling tasks that are running for the deployment. Calling this function fails if:
  • No scaling policy is provided in the application model
  • Scaling was paused or is disabled
  • The deployment is being updated
maestro.autoscalingAgent.pause_autoscaling()
Resume all scaling tasks that are running for the deployment. Calling this function fails if:
  • No scaling policy is provided in the application model
  • Scaling is resumed or is disabled
  • The deployment is being updated
maestro.autoscalingAgent.resume_autoscaling()
Enable the autoscaling agent function. Calling this function fails if:
  • No scaling policy is provided in the application model
  • Scaling is already enabled
  • The deployment is being updated
maestro.autoscalingAgent.enable_autoscaling()
Disable the autoscaling agent function. Calling this function fails if:
  • No scaling policy is provided in the application model
  • Scaling is already disabled
  • The deployment is being updated
maestro.autoscalingAgent.disable_autoscaling()
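For example, a maintenance script might pause scaling around a disruptive operation and resume it afterward. This is a sketch; perform_maintenance is a placeholder for that operation:
maestro.autoscalingAgent.pause_autoscaling()
try:
    perform_maintenance()   # placeholder for the disruptive operation
finally:
    # Resume scaling even if the maintenance step fails.
    maestro.autoscalingAgent.resume_autoscaling()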
Note: Although there are other ways to launch or destroy virtual machines, such as the kernel services APIs and a virtual machine action, they are not all safe for auto scaling, which means they can conflict or interfere with auto scaling tasks that are running in the deployment. For example, auto scaling tasks are suspended when a scaling operation is triggered, and they do not resume until the deployment returns to a steady state, meaning that a newly deployed virtual machine is running or a destroyed virtual machine has disappeared. This approach ensures that auto scaling correctly uses the metrics from running roles and virtual machines, ignoring the roles and virtual machines that are not working. Kernel services APIs and virtual machine actions do not notify auto scaling when they trigger launching or destroying actions, so auto scaling would use data from new virtual machines that are still launching or from destroyed virtual machines that are being removed. The extra data can lead to unnecessary new scaling actions. The autoscalingAgent.scale_out and autoscalingAgent.scale_in interfaces are safe for auto scaling. Using them helps to avoid these kinds of issues and other possible interference and conflict between auto scaling and manual scaling.