Monitoring service for plug-ins
Plug-ins provide monitoring operations that collect and display deployment metrics for resource use and performance at the virtual machine, middleware, and application level.
If you are developing your own plug-ins for Cloud Pak System Software for x86, you can configure and register collectors for plug-in specific metrics at runtime and apply metadata to define the presentation of the monitoring metrics in the Instance Console deployment panel.
Collector
Collectors implement the interface com.ibm.maestro.monitor.ICollectorService, which includes the following methods:
// Creates the collector based on the given configuration values.
// @param config
// @return uniqueId for this collector instance, or null if the collector could not be created
String create(JSONObject config);

// Returns the metadata for the available metrics from the collector.
// @param uniqueId of the collector instance to query
// @return {"categories": [{"categoryType":"<ROLE_CATEGORY>"}],
//          "updateInterval": "<secs>", "<ROLE_CATEGORY>": {<see IMetricService.getServerMetadata()>}}
JSONObject getMetadata(String uniqueId);

// Returns the current metrics from the collector.
// @param uniqueId of the collector instance to query
// @param metricType not used in this release and defaults to "all"
// @return {"<ROLE_CATEGORY>": [{"<METRIC_NAME>":"<METRIC_VALUE>"}, ...]}
JSONObject getMetrics(String uniqueId);

// Shuts down the collector instance.
// @param uniqueId of the collector instance to shut down
void delete(String uniqueId);

The monitoring service provides the following collector implementations:

| Name | Type | Usage |
|---|---|---|
| com.ibm.maestro.monitor.collector.script | Script | Collector for plug-ins that can supply metrics with shell scripts. |
| com.ibm.maestro.monitor.collector.http | HTTP | Collector for plug-ins that can supply metrics by HTTP request. |
Monitoring also implements several collectors for itself that collect operating system metrics from the Monitoring Agent for IBM® Cloud Pak System Software and the hypervisor relevant to processor, memory, disk, and networking in virtual machines. These collectors are provided in all deployments and can be used by other components or plug-ins without needing to register a separate collector. For information about metrics that are collected by the agent, see Metrics collected by Monitoring Agent for IBM Cloud Pak System Software.
Registration
To use Cloud Pak System Software for x86 monitoring collectors, you must register the collectors with the plug-in configuration, providing the node, role, metrics, and collector facilities information.
maestro.monitorAgent.register('{
"version" : Number,
"node" : String,
"role" : String,
"collector": String,
"config" : JSONObject
}')
The parameter of maestro.monitorAgent.register is a JSON string, which defines the following attributes:
- version
- The version number.
- node
- The name of the server that is running the collector.
- role
- The name of the role for which the collector works.
- collector
- The collector type.
- config
- The required and optional properties of instantiating the specified collector type for the specific node and role. Each type of collector has its own configuration properties.
Use maestro.monitorAgent.unregister(String) to
unregister a collector. The parameter is a registration string.
To check whether a collector
is registered, use maestro.monitorAgent.isRegistered(String).
The parameter is the registration string for the collector that you
want to check. The interface returns True to indicate
that the collector with the specified registration exists or False
to indicate that the collector with the specified registration does
not exist.
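The register, isRegistered, and unregister lifecycle can be sketched as follows. This is an illustrative Python model of the agent-side bookkeeping, not the actual maestro implementation; the function names and the example node, role, and path values are hypothetical.

```python
import json

# Hypothetical in-memory model of the monitoring agent's registration table.
_registrations = set()

def register(registration_json):
    """Validate and record a collector registration string."""
    reg = json.loads(registration_json)
    # The registration object must name the node, role, collector type, and config.
    for key in ("node", "role", "collector", "config"):
        if key not in reg:
            raise ValueError("missing required attribute: " + key)
    _registrations.add(registration_json)

def is_registered(registration_json):
    """Mirror maestro.monitorAgent.isRegistered: True if this exact registration string exists."""
    return registration_json in _registrations

def unregister(registration_json):
    """Mirror maestro.monitorAgent.unregister: remove the registration string."""
    _registrations.discard(registration_json)

registration = json.dumps({
    "version": 2,
    "node": "database-db2",            # example values
    "role": "DB2",
    "collector": "com.ibm.maestro.monitor.collector.script",
    "config": {"metafile": "/opt/collector/meta.json",
               "executable": "/opt/collector/metrics.sh"},
})
register(registration)
```

Note that the check works on the exact registration string, which matches the documented behavior of passing the registration string to isRegistered and unregister.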
Each collector type has its own config properties, described in the tables that follow.
maestro.monitorAgent.register('{
    "node":"${maestro.node}",
    "role":"${maestro.role}",
    "collector":"com.ibm.maestro.monitor.collector.script",
    "config":{<config properties>}
}')

The registering scripts are typically placed in the appropriate scripts or directories of the plug-in lifecycle to ensure that the plug-in is ready to collect metrics. For example, for the WebSphere® Application Server collector, the registering script is placed under the installApp_post_handlers directory, where all scripts are started after WebSphere Application Server is running.

The following table lists the config properties for the script collector:
| Property | Required? | Value |
|---|---|---|
| metafile | Yes | The full path string of the metadata file that contains the JSON object. |
| executable | Yes | The full path string of a shell script that provides plug-in metrics output. |
| arguments | No | Arguments for the script. The value can be a single string with arguments that are separated by spaces, or it can be an array of strings. Provide the arguments as an array of strings if an individual argument contains a space. |
| validRC | No | A code string for a valid script return code. The default value is 0. The value can be an integer or a string that converts into an integer. |
| workdir | No | The full path of the working directory for the script. The default value is the java.io.tmpdir directory. |
| timeout | No | The amount of time to wait for the script to run, in seconds. The default value is 5. The value can be a number or a string that converts into a number. |
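A collector script writes metrics JSON to stdout and exits with the configured validRC (0 by default). The following Python sketch builds output in the metric format shown later in this topic; the category and metric values are invented for illustration.

```python
import json

def collect_metrics():
    """Gather plug-in metrics; the category and values here are illustrative."""
    return {
        "version": 2,
        "category": ["APP_Throughput"],
        "content": [
            {"APP_Throughput": {"request_count": 30, "service_time": 8924}}
        ],
    }

# The script collector parses stdout as JSON and compares the process
# exit code against validRC (0 by default).
output = json.dumps(collect_metrics())
print(output)
```

If the script cannot collect metrics, it should instead print an FFDC object, as described in the Error handling section of this topic.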
Path values in the configuration can use maestro variables, which keep the path and directory information of the plug-in installation.

The following table lists the config properties for the HTTP collector:

| Property | Required? | Value |
|---|---|---|
| metafile | Yes | The full path string of the metadata file that contains the JSON object. |
| url | Yes | The URL string of the requesting plug-in metrics. |
| query | No | The arguments string of the query in the HTTP request. |
| validRC | No | A code string for a valid HTTP return. The default value is 200. The value can be an integer or a string that converts into an integer. |
| timeout | No | The amount of time to wait for the HTTP response, in seconds. The default value is 5. The value can be a number or a string that converts into a number. |
| retry_delay | No | The time interval, in seconds, between a failed call and the next attempt. The value can be a number or a string that converts into a number. |
| retry_times | No | The total number of attempts before entering a delay period. The value can be an integer or a string that converts into an integer. |
| datahandler | No | A JSON object that includes the properties of the utility JAR package for transforming the HTTP response into metrics. |
For an example of each collector type, see Monitoring collector examples.
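The validRC and retry_times semantics for the HTTP collector can be modeled as follows. This is an illustrative sketch, assuming the collector accepts the first response whose status code matches validRC and gives up after retry_times attempts; the helper names are hypothetical, and the real collector also waits retry_delay seconds between rounds of attempts.

```python
import json

def fetch_with_retry(fetch, retry_times=3, valid_rc=200):
    """Call fetch() up to retry_times times, accepting the first response
    whose status code equals valid_rc. Returns parsed metrics, or None if
    every attempt failed (the real collector would raise an FFDC error)."""
    for _ in range(retry_times):
        status, body = fetch()
        if status == valid_rc and body:
            return json.loads(body)
    return None

attempts = []
def flaky_fetch():
    # Simulated endpoint: fails once with a 503, then succeeds.
    attempts.append(1)
    if len(attempts) < 2:
        return 503, ""
    return 200, '{"version": 2, "category": [], "content": []}'

metrics = fetch_with_retry(flaky_fetch)
```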
Metadata file
The metadata file is referenced during collector registration. It defines the following information:
- The metadata file version.
- The array of category names to register (1..n).
- The interval time, in seconds, to poll for updated data.
- Configuration parameters that are unique for each category (for example, mbeanQuery).
- The list of metric metadata objects:
- attributeName
- Specifies an attribute from the collector to associate with this metric.
- metricName
- Specifies a metric name to expose through the monitoring agent APIs.
- metricType
- Specifies the data type, such as range, counter, time, average, percent, and string. The metricType is not yet checked; any metricType in the list is accepted. Choose the metricType that best matches your data from the list.
- description
- (Optional) Specifies the string that describes the metric.
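The structure described above can be checked with a small validator. This is an illustrative sketch: the field names follow the availability example later in this topic (attribute_name, metric_name, metric_type), and the accepted type list is assembled from the two type lists in this topic, which is an assumption.

```python
# Union of the metricType values named in this topic (assumed set).
ALLOWED_METRIC_TYPES = {"RANGE", "COUNTER", "TIME", "AVERAGE",
                        "PERCENT", "STRING", "AVAILABILITY", "STATUS"}

def validate_metadata(meta):
    """Basic structural checks for a collector metadata object."""
    assert "Version" in meta and "update_interval" in meta
    categories = set(meta["Category"])
    assert categories, "at least one category is required"
    for entry in meta["Metadata"]:
        for category, body in entry.items():
            # Every metadata entry must belong to a registered category.
            assert category in categories, "unknown category: " + category
            for metric in body["metrics"]:
                assert "attribute_name" in metric and "metric_name" in metric
                assert metric["metric_type"] in ALLOWED_METRIC_TYPES
    return True

example = {
    "Version": 2,
    "update_interval": 15,
    "Category": ["WAS_JVMRuntime"],
    "Metadata": [
        {"WAS_JVMRuntime": {"metrics": [
            {"attribute_name": "jvm_heap_used",
             "metric_name": "jvm_heap_used",
             "metric_type": "PERCENT"}]}}
    ],
}
```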
{
    "Version" : <metadata file version>,
    "update_interval": <interval time in seconds to poll for updated data>,
    "Category": [
        <array of category names to register (1..n)>
    ],
    "Metadata": [
        {
            "<category name from Category[]>": {
                "metrics": [
                    {
                        "attribute_name": <attribute from collector to associate with this metric>,
                        "metric_name": <metric name to expose through monitoring agent APIs>,
                        "metric_type": <metric value data type, including "RANGE", "COUNTER",
                                        "PERCENT", "STRING", "AVAILABILITY", "STATUS">
                    },
                    ... ...
                ]
            }
        },
        ... ...
    ]
}

Metric format
{
"version": <version number>,
"category": [
<category name>,
... ...
],
"content": [
{
<category name> :{
<metric name>: <metric value>,
... ...
}
},
... ...
]
}

The following example shows metrics for WebSphere Application Server categories:

{
"version": 2,
"category": [
"WAS_JVMRuntime",
"WAS_TransactionManager",
"WAS_JDBCConnectionPools",
"WAS_WebApplications"
],
"content": [
{
"WAS_JVMRuntime": {
"jvm_heap_used": 86.28658,
"used_memory": 176576,
"heap_size": 204639
}
},
{
"WAS_TransactionManager": {
"rolledback_count": 0,
"active_count": 0,
"committed_count": 0
}
},
{
"WAS_JDBCConnectionPools": {
"max_percent_used": 0,
"min_percent_used": 0,
"percent_used": 0,
"wait_time": 0,
"min_wait_time": 0,
"max_wait_time": 0
}
},
{
"WAS_WebApplications": {
"max_service_time": 210662,
"min_service_time": 0,
"service_time": 8924,
"request_count": 30
}
}
]
}

Error handling

Errors can occur at two levels:
- At the script or data handler level, when errors occur in scripts and data handlers.
- At the collector level, when errors are raised when the collector starts a script or invokes a data handler.

Scripts and data handlers report errors by returning an FFDC object instead of metrics:

{
    "FFDC": <error message>
}

When a collector gets an FFDC object from scripts or data handlers, it logs the object in the log and trace files for troubleshooting. It also propagates the object to the monitoring agent, which then clears the corresponding records from the monitoring cache so that the monitoring API no longer returns old metrics. As a result, the user interface does not display these error messages for plug-ins from which FFDC objects are being collected.
For collector-level errors that are raised when a collector runs scripts, sends HTTP requests, or invokes data transformers, the collector wraps errors in FFDC objects and then logs and propagates them the same way as FFDC objects from scripts and data handlers.
Script collector errors include the following conditions:
- The collector has trouble calling the scripts registered by the plug-in for outputting metrics; for example, script files are missing.
- The collector gets an error return code (RC) when executing script files.
- The collector gets nothing or an empty string from executing script files.
- The collector fails to parse metrics from the script output because of an unexpected ending or incorrect JSON format.
- The collector gets an error message from the script output.
- The collector times out while executing script files.

HTTP collector errors include the following conditions:
- The collector times out while waiting for the HTTP response.
- The collector gets nothing from the HTTP response.
- The collector gets an error status code from the HTTP response, such as 4xx for client errors or 5xx for server errors.
- The collector gets an error from the user transformer instead of metrics.
- The collector fails to parse metrics from the user transformer because of incorrect JSON format.

The HTTP collector might catch errors before it invokes the data handler, and as a result, no data is available to the data handler for further processing. In this situation, the collector passes in a null object, which makes the data handler aware that there is no data input. The data handler can determine how to generate the final data for the collector. For example, data handlers can either create FFDC objects with a message such as "No data collected" for null object input, or create plug-in metrics with their own specific values for null, such as "UNKNOWN" for availability.

User interface presentation
Plug-in metrics are displayed on the Middleware Monitoring tab of the Instance Console. Plug-ins provide metadata to describe the metric and category for displaying the metrics, and define the format for displaying metrics.
The monitoring_ui.json file is located under the plugin directory of a plug-in project, for example, plugin.com.ibm.was/plugin/monitoring_ui.json. Other JSON files are also in this directory, including config.json and config.meta.json.
A role is not displayed when dashboard.visible is set to false for the role in the topology model. By default, the value is set to true:

"role" : {
    "name" : "$roleName",
    "type" : "$roleName",
    "dashboard.visible" : false,
    ... ...
}

The display metadata is defined in monitoring_ui.json. Two versions of this file are supported. The version 1 format is:

[
{
"version": 1,
"category": < category name from Category[] defined in
metric metadata>,
"label": <the content showed on chart for the category>,
"displays": [
{
"label": <string showed on chart element for the
metric>,
"monitorType": <time and type properties of metric to
Display>,
"chartType": <chart type for displaying the metric>,
"metrics": [
{
"attributeName": <metric name defined in
metric metadata>,
"label": <string showed on chart element
for the metric>,
}
]
}
]
},
... ...
]
A version 1 monitoring_ui.json file serves a single role that has the same name as the plug-in. It should be used with plug-ins that contain only a single role and no cross-referencing to other plug-ins. A version 2 file adds the attribute displayRoles, which can associate one metric category with one or more roles:

[
{
"version": 2,
"displayRoles": [<role name>, ...]
"category": < category name from Category[] defined in
metric metadata>,
"label": <the content showed on chart for the category>,
"displays": [
{
"label": <string showed on chart element for the
metric>,
"monitorType": <time and type properties of metric to
Display>,
"chartType": <chart type for displaying the metric>,
"metrics": [
{
"attributeName": <metric name defined in
metric metadata>,
"label": <string showed on chart element
for the metric>,
}
]
}
]
},
... ...
]
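The difference between the two versions can be modeled as a role lookup: version 1 implicitly serves the role named after the plug-in it lives in, while version 2 lists its roles explicitly in displayRoles. An illustrative sketch (the function name and example entries are hypothetical):

```python
def roles_for_entry(entry, plugin_name):
    """Return the role names that a monitoring_ui.json entry applies to."""
    if entry.get("version", 1) >= 2:
        # Version 2 names its roles explicitly.
        return list(entry.get("displayRoles", []))
    # Version 1 carries no role information; the UI assumes the entry
    # serves the role with the same name as the plug-in that contains it.
    return [plugin_name]

v1 = {"version": 1, "category": "WAS_JVMRuntime", "displays": []}
v2 = {"version": 2, "displayRoles": ["DB2", "OPMDB2"],
      "category": "DATABASE_DRILLDOWN_HEALTH", "displays": []}
```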
For both versions of the monitoring_ui.json file, the displays array defines attributes for the appearance of the metric in the user interface. All metrics in one category are displayed the same way and share one chart. The attributes monitorType and chartType should be used together to define how the metrics look. For example, if monitorType is set to HistoricalNumber and chartType is set to Lines for a category of metrics, the metrics are displayed as a line graph with time on the X axis and metric values on the Y axis.
| Monitor types (monitorType) | Description |
|---|---|
| HistoricalNumber | Metric data as a simple number on a historical timeline |
| HistoricalPercentage | Metric data as a percentage on a historical timeline |
| RealtimeNumber | Metric data as a simple number for the current moment |
| RealtimePercentage | Metric data as a percentage for the current moment |

| Chart types (chartType) | Presentation |
|---|---|
| Lines | Line chart |
| StackedAreas | Stacked line chart (area chart) |
| StackedColumns | Column chart |

The following example uses the chartWidgetName attribute with a custom widget to display database health indicators:
[
{
... ...
"category": "DATABASE_DRILLDOWN_HEALTH",
"label": "Database Health Indicator",
"displays": [
{
"label": " Database Health Indicator ",
"chartWidgetName": "paas.widgets.HealthStatusTrend",
"metrics": [
{
"attributeName": "data_server_status",
"label": " Data_Server_Status "
},
{
"attributeName": "io",
"label": "I/O"
},
{
"attributeName": "locking",
"label": " Locking "
},
{
"attributeName": "logging",
"label": "Logging "
},
{
"attributeName": "memory",
"label": "Memory"
},
{
"attributeName": "recovery",
"label": "Recovery"
},
{
"attributeName": "sorting",
"label": "Sorting"
},
{
"attributeName": "storage",
"label": "Storage"
},
{
"attributeName": "workload",
"label": "Workload"
}
]
}
]
}
]

Metrics with this configuration display as a two-column list, with metric labels in the first column and a colored square indicator icon that shows the status.

Availability

The monitoring service provides a special metric, availability, for showing the health status of a role. It has the following status values:
- NORMAL
- WARNING
- CRITICAL
- UNKNOWN

A plug-in can bind one of its metrics to availability so that the monitoring service can show the status and update indicator icons based on the current value of the metric. Set metric_type to AVAILABILITY in the plug-in metadata file to make this association:

... ...
"metadata":[
{
"DATABASE_AVAILABILITY":{
"metrics":[{
"attribute_name":"database_availability",
"metric_name":"database_availability",
"metric_type":"AVAILABILITY"
}
]
}
},
... ...

The metric associated with availability can belong to any category, but a plug-in can have only one metric for binding. If multiple bindings are defined, only the first one is effective and the rest are ignored. The metric for binding must be a String type and can accept only the supported values at run time: "NORMAL", "WARNING", "CRITICAL", and "UNKNOWN".
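The "first binding wins" rule can be sketched as follows. This is illustrative; the metadata field names follow the example above, and the second entry is invented to show that extra bindings are ignored.

```python
VALID_STATUSES = {"NORMAL", "WARNING", "CRITICAL", "UNKNOWN"}

def find_availability_binding(metadata_entries):
    """Return (category, metric_name) of the first metric whose metric_type
    is AVAILABILITY. Later AVAILABILITY bindings are ignored, matching the
    'only the first one is effective' rule."""
    for entry in metadata_entries:
        for category, body in entry.items():
            for metric in body["metrics"]:
                if metric.get("metric_type") == "AVAILABILITY":
                    return category, metric["metric_name"]
    return None

metadata = [
    {"DATABASE_AVAILABILITY": {"metrics": [
        {"attribute_name": "database_availability",
         "metric_name": "database_availability",
         "metric_type": "AVAILABILITY"}]}},
    {"OTHER": {"metrics": [  # hypothetical second binding: ignored
        {"attribute_name": "extra", "metric_name": "extra",
         "metric_type": "AVAILABILITY"}]}},
]
```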
For the metric bound to availability, the "UNKNOWN" status should be set when the plug-in cannot retrieve metrics for availability. The status can be set in collector scripts (if the plug-in uses the script collector) or in data handlers (if the plug-in uses the HTTP collector):
- Because scripts retrieve metrics by their own mechanism, they can create the metric predefined for availability when they fail to retrieve metrics.
- Because data handlers get data from the HTTP collector, they need the collector to tell them when nothing is retrieved from an HTTP response. By agreement, the HTTP collector passes a null object into data handlers when it fails to get data before it invokes the data handlers. Data handlers should create the metric predefined for availability when they receive a null object. For more information, see Troubleshooting monitoring collectors.

If a plug-in does not bind a metric to availability, the monitoring service applies the following algorithm to generate availability for the plug-in role:
- When a scaling policy is used for the plug-in, the state of scaling is used to determine the health status. When a threshold is reached, availability is set to "WARNING" until a new role instance is created. If the maximum number of role instances is reached or the created instance fails, availability is set to "CRITICAL".
- When a scaling policy is not used, operating system metrics are used for availability. If processor usage is greater than 70%, availability is set to "WARNING", and if usage is greater than 85%, availability is set to "CRITICAL".
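The fallback based on operating system metrics can be sketched directly from the 70% and 85% thresholds above; the function shape is illustrative.

```python
def default_availability(cpu_used_percent):
    """Availability derived from processor usage when no metric is bound
    and no scaling policy is attached (thresholds from this topic)."""
    if cpu_used_percent > 85:
        return "CRITICAL"
    if cpu_used_percent > 70:
        return "WARNING"
    return "NORMAL"
```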
Troubleshooting monitoring collectors
Problem: I registered the collector for roles in my plug-in, but the roles are not listed in Middleware Monitoring View.
- Wrong role name in "register" parameter.
When a plug-in registers a collector for itself, it usually obtains the role name directly from
maestro.role['name']in its scripts. But when a plug-in registers a collector for other plug-ins instead of itself, the wrong role name is returned. For example, when the plug-inOPMDB2registers a collector for the plug-inDB2,maestro.role['name']returns the wrong role name because it always returns role name of current plug-in. Ensure that you pass the correct role name into the "register" parameter and ensure that it is the one your collector is working for. - A version 1 monitoring_ui.json file is in
the wrong plug-in.
Because a version 1 monitoring_ui.json file lacks role information, the user interface assumes that monitoring_ui.json is serving the plug-in role that it is in. To avoid this issue, ensure that monitoring_ui.json is in the plug-in for which the metrics are collected or use a version 2 monitoring_ui.json file that explicitly defines roles.
- Missing role name in a version 2 monitoring_ui.json file.
A version 2 monitoring_ui.json file requires the attribute "displayRoles". If a version 2 monitoring_ui.json file does not contain role information, the user interface cannot find the right monitoring_ui.json for collected metrics. If you are updating a monitoring_ui.json from version 1 to version 2, ensure that it includes the "displayRoles" attribute and include the role that your collector is working with.
- Metrics are collected for an invisible role.
If a role is invisible, metrics do not display in the user interface, even if metrics are collected for it. If a role is not displaying, ensure that "dashboard.visible" is not set to false.

There are two kinds of roles for visibility: visible roles and invisible roles. A visible role is displayed on the Virtual Application Instances page and usually represents a core component of the deployment, such as the WebSphere Application Server and DB2® roles. An invisible role cannot be seen on the Virtual Application Instances page. These roles usually function as back-end and assistant components; examples include MONITORING, SSH, and OPMDB2. Plug-in developers can specify the visibility of their plug-in roles by setting a Boolean value for the attribute "dashboard.visible" in the topology. The following topology snippet shows an example:

{
    "vm-templates": [
        ... ...
        {
            "name": "database-db2",
            "roles": [
                {
                    "type": "DB2",
                    "name": "DB2"
                    ... ...
                },
                {
                    "global": true,
                    "plugin": "ssh/2.0.0.1",
                    "dashboard.visible": false,
                    "type": "SSH",
                    "name": "SSH"
                },
                {
                    "global": true,
                    "plugin": "opmdb2/1.0.0.0",
                    "depends": [
                        {
                            "role": "database-db2.DB2",
                            "type": "DB2"
                        }
                    ],
                    "dashboard.visible": false,
                    "type": "OPMDB2",
                    "name": "OPMDB2"
                },
                ... ...
            ],
            ... ...
        }
    ],
}

In this example, DB2 is a primary role and is visible. SSH and OPMDB2 are invisible roles because their "dashboard.visible" value is set to false.
- Metrics are not collected.
The monitoring service ignores roles without initial metrics even if there is a collector that is successfully registered for the role. Check the log and trace files to verify whether the collector is working properly.
Problem: I can see my roles that are listed in Monitoring View, but I cannot see their metrics. The message CWZMO0040W: No real-time metric data is found for deployment is displayed.
Resolution: The error message displays when the monitoring service cannot find metrics for a certain role anymore. Check the log and trace files to verify that the collector is working properly.
Auto scaling
The elastic scaling, or auto scaling, feature in a plug-in uses monitoring. Auto scaling automatically adds or removes virtual application and shared service instances based on workload.
You can optionally turn on the auto scaling feature by attaching the scaling policy to a target application or shared service. The policy is also used to deliver the scaling requirements to the back-end engine. Requirements include trigger event, trigger time, and instance number, which drive the scaling procedure.
Cloud Pak System Software for x86 supports two types of scaling: horizontal scaling and vertical scaling.
Horizontal Scaling
Horizontal scaling expands or shrinks a deployment by adding nodes to the deployment (scale-out) or removing nodes from the deployment (scale-in).
Vertical Scaling
Vertical scaling increases or reduces node size by adding processor cores, increasing memory size, or by attaching new disks to the nodes (scale-up), or by removing processor cores, decreasing memory size, or by removing disks from the nodes (scale-down).
Scaling policy overview
The auto scaling policy can be attached to two kinds of components in Cloud Pak System Software for x86: a virtual application and a shared service. For the virtual application, you can explicitly add the scaling policy to one or more components of the application in the Pattern Builder. For the shared service, the scaling policy must be described in the application model that is made by the plug-in developer if the service asks for the auto scaling capability.
Plug-ins, either for virtual applications or shared services, define the scaling policy, describe the policy in the application model, and provide transformers to explain and add scaling attributes into the topology document when the policy is deployed with plug-ins. The application build automatically generates the segment of the scaling policy in the application model only if you are using shared services. At run time, the back-end auto scaling engine first loads the scaling attributes and generates the rule set for scaling trigger. Then, the back-end engine computes on the rule set and decides whether the workload reaches a threshold for adding or removing application or shared service instances. The final step of the process is to complete the request.
To apply the auto scaling policy to a plug-in, ensure that the scaling policy is defined in the application model that the plug-in is associated with, which collects user-specific requirement for the scaling capability. Also, ensure that the policy is transformed into the topology document, which guides the back-end engine to inspect the trigger event and take scaling actions.
Scaling elements
- Trigger event
Scaling actions are triggered based on the changing value of certain metrics. The trigger event specifies the type of monitoring metrics and threshold range for which different scaling actions are triggered.
For each metric in the event definition, there are two thresholds: a scale-in threshold and a scale-out threshold. For example, suppose the processor use of the virtual machines that run WebSphere Application Server instances is the metric for the trigger event, and the thresholds for scale-in and scale-out are 20% and 80%. When the value of processor use rises above 80%, a new WebSphere Application Server instance is started. When the processor use falls below 20%, an existing WebSphere Application Server instance is selected for removal.
- Trigger time
To avoid reacting to a momentary spike, the system monitors a metric value over a time span, rather than acting on an instantaneous value, before it triggers a scaling action. The trigger time specifies how long a threshold condition must hold before the scaling action is taken. For example, at the moment that the processor use is observed to be higher than 80%, a timer is started. If the trigger time is set to 120 seconds, then when the timer reaches 120 seconds, the scale-out operation is started. If the processor use moves back inside the threshold during this period (in this example, if it drops below 80%), the timer stops. It restarts when the processor use crosses the threshold again.
- Resource limit
A resource limit for scaling behaviors is required to prevent a deployment from using all of the system resources. For example, a scaling policy should specify the total number of instances that a plug-in can have at one time. When the cluster size of a plug-in reaches the boundary of its range, no instance is added to or removed from the cluster, even though the trigger event is met.
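The trigger-time behavior described above can be modeled as a simple state machine: the threshold must hold continuously for the full trigger time before an action fires, and the timer resets whenever the value leaves the threshold. An illustrative sketch (the function and sample values are invented):

```python
def evaluate_trigger(samples, threshold, trigger_time, interval):
    """samples: metric values taken every `interval` seconds.
    Returns True once the value has stayed above `threshold` for at
    least `trigger_time` consecutive seconds."""
    held = 0
    for value in samples:
        if value > threshold:
            held += interval
            if held >= trigger_time:
                return True
        else:
            held = 0  # the timer resets when the value leaves the threshold
    return False

# Processor samples taken every 30 seconds; 80% scale-out threshold;
# 120-second trigger time.
steady_load = [85, 88, 90, 86, 87]   # sustained: fires
spike = [85, 90, 60, 85, 90]         # interrupted: timer resets, no action
```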
The scaling policy includes elements for both horizontal scaling and vertical scaling. There are three types of trigger events: horizontal scaling, vertical scaling of processor, and vertical scaling of memory. ScaleUpCPUThreshold specifies the threshold at which the processor count is increased on a node. ScaleUpMemoryThreshold specifies the threshold at which the memory size is increased on a node. The resource limit includes both processor and memory. The processor min and max values specify the maximum number of cores that can be added to a node by a scale-up action and the minimum number that a node can be reduced to by a scale-down action. The memory min and max values specify the maximum amount of memory that can be allocated to a node and the minimum amount of memory that a node can be reduced to. This version also includes increment and decrement values for a single scaling action. The processor increment and decrement values specify how many cores are added or removed in one scale-up or scale-down action. The memory increment and decrement values specify how much memory is added or removed in one scale-up or scale-down action. In previous versions, these values do not exist because only one node is added or removed in one scale-in or scale-out action by default.
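The interaction between increments and resource limits can be sketched as clamping: each action moves by the configured increment but never crosses the min or max. This is an illustrative sketch; the function names are hypothetical, and the parameter names mirror maxcpucount and cpucountUpIncrement from the topology model.

```python
def scale_up_cpu(current_cores, increment, max_cores):
    """Add `increment` cores, but never exceed the configured maximum."""
    return min(current_cores + increment, max_cores)

def scale_down_cpu(current_cores, decrement, min_cores):
    """Remove `decrement` cores, but never drop below the configured minimum."""
    return max(current_cores - decrement, min_cores)
```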
Application model
Auto scaling capability is embodied as a policy in the application model. The application model is used to describe the components, policies, and links in the virtual applications or shared services. For virtual applications, the model can be visually displayed and edited with the Pattern Builder.
Virtual application designers can customize components and policies, including the auto scaling policy, in the Pattern Builder. There is no tool to visualize shared services in the application model. Auto scaling can be customized only in the Instance Console when the service is deployed. The scaling policy that is described in the application model, for either a virtual application or shared service, follows the application model specification. The policy is defined in the node with a group of attributes.
"model": {
"nodes": [
{
... ...
},
{
"id": <policy id>,
"type": <policy type>,
"attributes": {
<No.1 metric id for trigger event>: [
< threshold for scale-in >,
< threshold for scale-out >
],
<No.2 metric for trigger event>: [
< threshold for scale-in >,
< threshold for scale-out >
],
<... :[... ,... ]>
<No.n metric for trigger event>: [
< threshold for scale-in >,
< threshold for scale-out >
],
<trigger time id>: <trigger time value>,
<instance range number id>: [
    <min number>,
    <max number>
]
}
},
{
... ...
}
]
}

The attributes describe the scaling policy in an application model. As the example JSON segment shows, the Trigger Event can include multiple metrics and thresholds for one scaling policy, which means that scaling operations on a plug-in can be triggered by different condition entries with different metrics. The relationship among these entries is explicitly defined by the plug-in transformer and marked in the topology document. It is not required to mark the relationship in the application model, except that labels can be used to define the relationship in the user interface. Cloud Pak System Software for x86 requires that metadata be provided in a plug-in to explain components in the application model for user interface presentation. For the scaling policy, the plug-in can apply the correct widget types and data types to the attributes for Trigger Event, Trigger Time, and Instance Number Scope.
Topology model
The scaling policy is expressed in the topology document as a scaling object inside a vm-template. In its simplest form, the policy specifies only the instance range:

"vm-templates": [
    {
        ...
        "scaling": {
            "min": <number>,
            "max": <number>
        }
    },
    {
        ...
    }
]

A full scaling object can also specify trigger events, resource limits, and increments:
"vm-templates": [
    {
        ...
        "scaling": {
            "role": <role type for the template>,
            "triggerEvents": [
                {
                    "metric": <metric category and item joined by ".">,
                    "scaleOutThreshold": {
                        "value": <metric value with its data type>,
                        "type": "CONSTANT",
                        "relation": <comparison symbol, including "<", ">", "<=", ">=">
                    },
                    "conjunction": <conjunction type with other trigger events, including "OR", "AND">,
                    "scaleInThreshold": {
                        "value": <number>,
                        "type": "CONSTANT",
                        "relation": <comparison symbol>
                    },
                    "triggerTime": <number>
                },
                {
                    "metric": <metric category and item>,
                    "scaleOutThreshold": {
                        "value": <number>,
                        "type": "CONSTANT",
                        "relation": <comparison symbol>
                    },
                    "conjunction": <conjunction type with other trigger events>,
                    "scaleInThreshold": {
                        "value": <number>,
                        "type": "CONSTANT",
                        "relation": <comparison symbol>,
                        "electMetricTimeliness": <"historical" | "instant">
                    },
                    "triggerTime": <number>
                },
                {
                    "metric": <metric category and item>,
                    "scaleUpCPUThreshold": {
                        "value": <number>,
                        "type": "CONSTANT",
                        "relation": <comparison symbol>
                    },
                    "conjunction": <conjunction type with other trigger events>,
                    "triggerTime": <number>
                },
                {
                    "metric": <metric category and item>,
                    "scaleUpMemoryThreshold": {
                        "value": <number>,
                        "type": "CONSTANT",
                        "relation": <comparison symbol>
                    },
                    "conjunction": <conjunction type with other trigger events>,
                    "triggerTime": <number>
                },
                {
                    ...
                }
            ],
            "min": <number>,
            "max": <number>,
            "maxcpucount": <number>,
            "minmemory": <number>,
            "cpucountUpIncrement": <number>,
            "memoryUpIncrement": <number>,
            "triggerTime": <number>
        }
        ...
    },
    {
        ...
    }
]

Cloud Pak System Software for x86 supports multiple trigger events for a scaling operation. The events are aggregated in two modes: OR and AND. The OR mode means that the scaling operation is triggered if any one event happens. The AND mode means that the scaling operation is triggered only if all events happen at the same time. Auto scaling depends on monitoring to collect the metrics to inspect. To ensure that the right metrics are collected, the value of the key metric in each trigger event must be consistent with the category and attributeName. These attributes are defined in the plug-in metadata for monitoring collectors. The values are joined by "." to form metric. For example, CPU.Used represents the metric with a category of CPU and an attributeName of Used. Monitoring also provides a group of OS-level metrics, which plug-in developers can select and use for auto scaling. For details, see Metrics collected by Monitoring Agent for IBM Cloud Pak System Software.
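The metric key convention and the OR/AND aggregation can be sketched as follows; the helper names are illustrative.

```python
def split_metric_key(metric_key):
    """Split a key like 'CPU.Used' into (category, attributeName),
    matching the metadata fields for monitoring collectors."""
    category, attribute_name = metric_key.split(".", 1)
    return category, attribute_name

def aggregate_events(event_results, conjunction):
    """OR fires when any trigger event is met; AND only when all are met."""
    return any(event_results) if conjunction == "OR" else all(event_results)
```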
| Type | Key | Description |
|---|---|---|
| Horizontal scaling | min | The minimum number of virtual machines that a role can have |
| Horizontal scaling | max | The maximum number of virtual machines that a role can have |
| Horizontal scaling | scaleInThreshold | Metric and its threshold for the scale-in action |
| Horizontal scaling | scaleOutThreshold | Metric and its threshold for the scale-out action |
| Vertical scaling | maxcpucount | The maximum number of cores that a virtual machine can have |
| Vertical scaling | scaleUpCPUThreshold | Metric and its threshold for the scale-up CPU action |
| Vertical scaling | cpucountUpIncrement | Core count to increase by in one scale-up CPU action |
| Vertical scaling | minmemory | The minimum memory size for a virtual machine |
| Vertical scaling | maxmemory | The maximum memory size for a virtual machine |
| Vertical scaling | scaleUpMemoryThreshold | Metric and its threshold for the scale-up memory action |
| Vertical scaling | memoryUpIncrement | Memory size to increase by in one scale-up memory action |
The triggerTime attribute is shared by several scaling types and trigger events. It can be placed either inside a triggerEvent object or outside of it, and its placement determines its scope. If triggerTime is placed inside a triggerEvent object, it applies only to that triggerEvent object. If triggerTime is placed outside the triggerEvent objects, it applies globally to all of them. If a triggerTime attribute appears both inside and outside a triggerEvent object, the triggerTime inside the object takes precedence.
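The precedence rule for triggerTime can be sketched in a few lines; the function name and dictionary shapes are illustrative assumptions.

```python
# Sketch (an assumption, not the product code) of the triggerTime
# precedence rule: a triggerTime inside a triggerEvent object overrides
# the global triggerTime defined at the policy level.
def effective_trigger_time(policy, event):
    """Return the triggerTime that applies to one triggerEvent object."""
    if "triggerTime" in event:
        return event["triggerTime"]       # inner value takes precedence
    return policy.get("triggerTime")      # fall back to the global value
```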
The transformer that is provided by the plug-in must define attributes of the scaling policy in the application model and map them to the named attributes in the topology document. The Trigger Event, Trigger Time, and Instance Number Scope autoscaling elements correspond to triggerEvents, triggerTime, and min and max.
- A terminated or failed virtual machine, except for the master.
- A virtual machine with a status other than RUNNING, such as LAUNCHING, INITIALIZING, or STARTING, except for the master.
- The virtual machine whose role has the least or greatest value of the specified metric in the cluster, except for the master.
- A random virtual machine, except for the master.
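The priority order above can be sketched as a series of filters over the cluster members. The VM record shape and the helper name are assumptions for illustration only.

```python
# Sketch (not the product code) of the scale-in candidate priority order:
# terminated/failed first, then non-RUNNING, then least (or greatest)
# metric value, then a random VM. The master is never a candidate.
import random

def pick_scale_in_candidate(vms, metric, rule="minimum"):
    """Each vm is a dict like {"name": ..., "status": ...,
    "is_master": bool, "metrics": {...}} (an assumed shape)."""
    candidates = [vm for vm in vms if not vm["is_master"]]
    if not candidates:
        return None
    # 1. Prefer a terminated or failed virtual machine.
    for vm in candidates:
        if vm["status"] in ("TERMINATED", "FAILED"):
            return vm
    # 2. Then any virtual machine that is not RUNNING (e.g. LAUNCHING).
    for vm in candidates:
        if vm["status"] != "RUNNING":
            return vm
    # 3. Then the VM with the least or greatest value of the metric.
    if all(metric in vm["metrics"] for vm in candidates):
        key = lambda vm: vm["metrics"][metric]
        return min(candidates, key=key) if rule == "minimum" \
            else max(candidates, key=key)
    # 4. Otherwise, a random non-master virtual machine.
    return random.choice(candidates)
```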
{
"role" : "WAS",
"triggerEvents": [
{
"metric": "CPU.Used",
"scaleOutThreshold": {
"value": 80,
"type": "CONSTANT",
"relation": ">="
},
"conjunction": "OR",
"scaleInThreshold": {
"value": 20,
"type": "CONSTANT",
"relation": "<",
"electMetricTimeliness" : "historical"
}
}
],
"min": 1,
"max": 10,
"triggerTime": 120
}
The values for "metric", "scaleInThreshold", "relation", and "electMetricTimeliness" guide how a WebSphere Application Server instance is selected for scale-in if the plug-in provides manual scaling operations. In this example, "metric" specifies processor utilization as the metric. The "<" value for "relation" specifies that the candidate instance for scale-in is the one with the lowest processor utilization in the cluster; a value of ">" would select the instance with the greatest processor utilization instead. For "electMetricTimeliness", the value can be "historical" or "instant". The "historical" value specifies that the scale-in instance is selected based on the average of the metric values over the preceding 5 minutes.
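The difference between "instant" and "historical" election can be sketched as follows. The sample format (value, age in seconds) and the 5-minute window are illustrative assumptions based on the description above.

```python
# Sketch of the "electMetricTimeliness" distinction: "instant" compares
# the latest sample, while "historical" averages the samples collected
# over the last 5 minutes. The (value, age_seconds) sample format is an
# illustrative assumption, not the product's internal representation.
def elect_metric_value(samples, timeliness="instant", window_seconds=300):
    """samples: list of (value, age_seconds) tuples, newest first."""
    if timeliness == "instant":
        return samples[0][0]
    recent = [v for v, age in samples if age <= window_seconds]
    return sum(recent) / len(recent)
```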
Manual scaling
Manual scaling provides virtual application administrators with a flexible and controllable way to add or remove instances of virtual applications or shared services. By using "autoscalingAgent.scale_out" and "autoscalingAgent.scale_in", manual scaling can run in an autoscaling-safe way. Customization of some manual scaling features, typically for scale-in, is supported through the manual scaling policy. When a plug-in exposes manual scaling operations, it transforms the policy into predefined attributes of the topology, which the scaling back end uses to achieve the customized behavior.
"scaling": {
"role": "RTEMS"
"triggerEvents": [{
"metric": "RTEMS.ConnectionNumber ",
"scaleOutThreshold": { ... },
"conjunction": "OR",
"scaleInThreshold": {
"value": 20,
"type": "CONSTANT",
"relation": "<",
"electMetricTimeliness" : "instant"
}
}
],
"triggerTime": 120,
"min": 1,
"max": 10,
"manual": {
"scaleInMetric":"RTEMS.ConnectionNumber",
"metricType" : "instant",
"rule": "minimum"
}
}
This example shows a deployment with both an auto scaling and a manual scaling policy on Remote Tivoli® Enterprise Monitoring Server (RTEMS). The scale-in is triggered automatically by using "triggerEvents" and "triggerTime" and can also be applied manually by users. For manual scale-in, the RTEMS instance with the lowest value of "ConnectionNumber" among all instances is always selected as the one to destroy.
- If an auto scaling policy is applied, nothing else is required in the Pattern Builder to support manual scaling. The plug-in must expose manual scale-in and manual scale-out operations. The scaling template that is provided by the plug-in must identify which metric to use for manual scale-in if multiple metrics are used; only the first metric is used by default. Typical plug-ins include WebSphere Application Server and shared services such as caching and monitoring.
- A plug-in should provide a combination of auto and manual scaling templates for the user to choose from, but only one can be applied to a vmTemplate.
- If you do not want or need to support auto scaling in your plug-in but want to enable manual scaling, you must still define the triggerEvent. The scaling template still provides min, max, role, and triggerEvent; however, the triggerEvent contains only the scale-in metric to use (with its < or > relation), along with the min and max attributes, so that any plug-in or shared service, such as the load balancer, can use it.
Scaling Interface
The autoscalingAgent utility
defines a generic API for Python scripts to interact with the auto
scaling agent on the virtual machine. The utility provides several
functions.
- The maximum number of instances was reached
- Scaling was paused or disabled
- The deployment is being updated
- There are insufficient resources in the cloud to fulfill the request
maestro.autoscalingAgent.scale_out('{
"vmTemplate": String,
"roleType": String
}')
For example:
maestro.autoscalingAgent.scale_out('{"vmTemplate":"Web_Application-was","roleType":"WAS"}')
- The minimum number of instances was reached
- Scaling was paused or disabled
- The deployment is being updated
maestro.autoscalingAgent.scale_in('{
"vmTemplate": String,
"roleType": String,
["node": String]
}')
- No scaling policy is provided in the application model
- Scaling was paused or is disabled
- The deployment is being updated
maestro.autoscalingAgent.pause_autoscaling()
- No scaling policy is provided in the application model
- Scaling is already resumed or is disabled
- The deployment is being updated
maestro.autoscalingAgent.resume_autoscaling()
- No scaling policy is provided in the application model
- Scaling is already enabled
- The deployment is being updated
maestro.autoscalingAgent.enable_autoscaling()
- No scaling policy is provided in the application model
- Scaling is already disabled
- The deployment is being updated
maestro.autoscalingAgent.disable_autoscaling()
The autoscalingAgent.scale_out and autoscalingAgent.scale_in functions are safe for auto scaling. Using them helps to avoid these kinds of issues and other possible conflicts between auto scaling and manual scaling.
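The no-op conditions listed for the agent functions can be summarized in a small state sketch. This is a self-contained illustration of the documented rules, not the real maestro agent; the class and attribute names are assumptions.

```python
# Self-contained sketch (not the real maestro agent) of the no-op rules
# described above: scale_out has no effect at the maximum instance count,
# scale_in has no effect at the minimum, and both have no effect while
# scaling is paused or disabled or the deployment is being updated.
class AutoscalingAgentSketch:
    def __init__(self, min_instances, max_instances, instances):
        self.min = min_instances
        self.max = max_instances
        self.instances = instances
        self.paused = False
        self.enabled = True
        self.updating = False

    def _blocked(self):
        return self.paused or not self.enabled or self.updating

    def scale_out(self):
        """Add one instance unless blocked or the maximum was reached."""
        if self._blocked() or self.instances >= self.max:
            return False  # no effect, as the documentation describes
        self.instances += 1
        return True

    def scale_in(self):
        """Remove one instance unless blocked or the minimum was reached."""
        if self._blocked() or self.instances <= self.min:
            return False
        self.instances -= 1
        return True
```

In a real deployment, these decisions are made by the auto scaling agent on the virtual machine; plug-in scripts only call maestro.autoscalingAgent.scale_out and maestro.autoscalingAgent.scale_in and let the agent enforce the limits.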