A common concern in monitoring WebSphere DataPower appliances is what to monitor, and how, in order to plan ahead for the continuously growing business demands placed on the appliances.
With ITCAM Agent for WebSphere DataPower Appliance v7.1, the following capacity-related metrics should be monitored for trend analysis.
ITCAM Agent for WebSphere DataPower Appliance monitors the latency messages of the DataPower appliance.
Technote: Latency messages in DataPower appliance
The latency arguments in a latency message are expressed in milliseconds (ms) and record when certain events occurred, measured from the start of the HTTP transaction (the values are not additive). By monitoring these messages, the agent can collect HTTP transaction latency data for trend analysis. If transaction latency grows continuously while system resource utilization is high, the appliance might be approaching its capacity limit.
For the ITCAM Agent for WebSphere DataPower Appliance to monitor the latency messages, the appliance must be configured to log messages to a remote syslog daemon and must subscribe to log messages of the "latency" category at "information" priority, as shown in the following screenshot.
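The trend check described above can be sketched in a few lines of Python. This is only an illustration and is not part of the agent: the function names, the minimum slope of 1 ms per sample, and the 80% utilization cut-off are all assumptions chosen for the example. The latency samples are assumed to have already been exported from the agent's historical data at a fixed interval.

```python
def latency_slope(samples_ms):
    """Least-squares slope of a latency series, in ms per sample interval."""
    n = len(samples_ms)
    if n < 2:
        raise ValueError("need at least two latency samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples_ms) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_ms))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def may_be_near_capacity(samples_ms, utilization_pct,
                         min_slope_ms=1.0, high_utilization_pct=80.0):
    """Flag a rising latency trend that coincides with high utilization.

    Both cut-off values are illustrative assumptions, not product defaults.
    """
    return (latency_slope(samples_ms) >= min_slope_ms
            and utilization_pct >= high_utilization_pct)
```

A rising series such as 100, 110, 120, 130 ms combined with 92% utilization would be flagged, while the same series at 40% utilization would not.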
The SystemUsage attribute group contains the “Load (%)” attribute, which measures the percentage of total load on the device during the measurement interval, and the “Work List” attribute, which measures the number of pending messages queued for processing by the appliance.
A “Load (%)” value above 90% indicates that the device is at or near load capacity. High load values do not necessarily indicate a problem, provided that transaction latencies are not affected.
The "KBN_DP_System_Load_High" pre-defined situation comes with a threshold of 80%. You can use it as a template to create a customized situation suitable to your business needs.
The CPUUsage attribute group contains attributes that measure the average CPU usage percentage of the DataPower appliance over the last 10 seconds, 1 minute, 10 minutes, 1 hour, and 1 day. High CPU usage does not necessarily indicate a problem, provided that transaction latencies are not affected.
The "KBN_DP_CPU_High" pre-defined situation comes with a threshold of 90%, measured over the last 10 minutes. You can use it as a template to create a customized situation suitable to your business needs.
The MemoryStatus attribute group contains "Memory Usage (%)" that measures the instantaneous memory usage of the appliance as a percentage of the total memory.
The ServiceMemoryStatus attribute group contains details of memory usage of all services. Specifically, the "Current (KB)" attribute measures the current memory being used by this service; it also contains attributes that measure peak memory usage by the service over its lifetime and several particular time frames.
The DomainMemoryStatus attribute group contains details of memory usage of all application domains. Specifically, the "Current (KB)" attribute measures the current memory being used by this application domain; it also contains attributes that measure peak memory usage by the application domain over its lifetime and several particular time frames.
The "KBN_DP_Memory_High" pre-defined situation comes with a threshold of 80%. You can use it as a template to create a customized situation suitable to your business needs.
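As a rough illustration of how the load, CPU, and memory thresholds quoted above relate to sampled values, the following Python sketch compares one sample against them. It is not the situation formula syntax used by the monitoring console. The metric keys are hypothetical labels rather than agent attribute names, and treating a value equal to the threshold as a trigger is an assumption.

```python
# Thresholds quoted in the text for three pre-defined situations.
# The metric keys are hypothetical labels, not agent attribute names.
SITUATION_THRESHOLDS = {
    "KBN_DP_System_Load_High": ("load_pct", 80.0),
    "KBN_DP_CPU_High": ("cpu_10min_pct", 90.0),
    "KBN_DP_Memory_High": ("memory_usage_pct", 80.0),
}


def triggered_situations(sample, thresholds=SITUATION_THRESHOLDS):
    """Return the sorted names of situations whose metric meets its threshold.

    Assumption: a value equal to the threshold counts as triggered.
    Metrics missing from the sample are treated as 0.
    """
    return sorted(
        name
        for name, (metric, limit) in thresholds.items()
        if sample.get(metric, 0.0) >= limit
    )
```

Passing a customized copy of the threshold table as the second argument mirrors the idea of using a pre-defined situation as a template for your own limits.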
The FilesystemStatus attribute group contains the "Encrypted Usage (%)", "Temporary Usage (%)", and "Internal Usage (%)" attributes, which measure the usage of the different types of storage space on the appliance.
The "KBN_DP_Encryptfile_Space_Low" and "KBN_DP_Tempfile_Space_Low" pre-defined situations come with a threshold of 80%. You can use them as templates to create customized situations suitable to your business needs.
The NetworkReceiveDataThroughput and NetworkTransmitDataThroughput attribute groups contain the average receive and transmit rates (KB per second) over the last 10 seconds, 1 minute, 10 minutes, 1 hour, and 1 day for each network interface of the appliance.
HTTP transaction rate and time
The HTTPTransactions2 attribute group contains the transactions per second over the last 10 seconds, 1 minute, 10 minutes, 1 hour, and 1 day for each service of the appliance.
The HTTPMeanTransactionTime2 attribute group contains the mean transaction time (ms) over the last 10 seconds, 1 minute, 10 minutes, 1 hour, and 1 day for each service of the appliance.
A steady degradation of transaction time, combined with high resource utilization and throughput, might indicate that the appliance is approaching its capacity limit.
The appliance reacts to low available memory, temporary file space, free ports, and XML Names by refusing to accept new connections. If the appliance does not free sufficient resources after a certain duration, the appliance will auto-restart.
The throttle settings of the appliance are configurable. The default throttle threshold for memory and for temporary file space is 20%: once free memory or free temporary file space falls below 20%, the appliance refuses to accept new connections. The system then waits for the timeout period before rebooting, to allow resource usage to recover. The default terminate threshold for memory and for temporary file space is 5%: once free memory or free temporary file space falls below 5%, the system reboots.
The LogNotification attribute group can be used to monitor the throttling and termination events that the DataPower appliance emits whenever a threshold is reached. The related "Event Code" values are:
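The two default thresholds can be summarized with a small Python sketch. It only models the percentage comparison described above, using the default values of 20% and 5%. The function name and the state labels are hypothetical, and the timeout-driven restart is deliberately left out.

```python
def throttle_state(free_pct, throttle_at=20.0, terminate_at=5.0):
    """Classify a free-resource percentage against the throttle settings.

    Defaults follow the thresholds quoted in the text for memory and
    temporary file space. The returned labels are illustrative only.
    """
    if free_pct < terminate_at:
        return "terminate"   # the appliance would restart
    if free_pct < throttle_at:
        return "throttle"    # new connections would be refused
    return "normal"
```

For example, 35% free memory is classified as normal, 12% as throttling, and 3% as termination. Passing other values for `throttle_at` and `terminate_at` reflects an appliance whose throttle settings have been changed from the defaults.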
0x01a40001 Throttling connections due to low memory
0x01a40005 Throttling connections due to low temporary file space
0x01a40008 Throttling connections due to low number of free ports
0x01a40018 Throttling due to low number of available file descriptors
0x01a30002 Restart due to low memory
0x01a30003 Restart due to resource shortage timeout
0x01a30006 Restart due to low temporary file space
0x01a30009 Restart due to port shortage
0x01a30017 Restart due to low file descriptor
0x01a3000b Restart due to XML Names prefix shortage
0x01a3000c Restart due to XML Names namespace shortage
0x01a3000d Restart due to XML Names local name shortage
0x02430001 Message throttled
0x0804000a0 Not permitting connection due to %s throttling.
0x0804000a3 Free memory very low (%f), throttling
0x0804000a4 Free temporary file space very low (%f), throttling
0x0804000a5 Free ports very low (%f), throttling
0x0804000a6 Resource shortage has not recovered in %d seconds, forcing system restart due to %s
0x0804000a7 Free memory too low (%f), forcing system restart
0x0804000a8 Free temporary file space too low (%f), forcing system restart
0x0804000a9 Free prefix XML Names too low (%f), forcing system restart
0x0804000aa Free namespace XML Names too low (%f), forcing system restart
0x0804000ab Free local name XML Names too low (%f), forcing system restart
0x0804000ac No available port, forcing system restart
The agent provides several pre-defined situations to monitor the throttling and termination events. To create a customized situation for these events, the "Event Code" filter must omit the "0x" prefix and any leading zeros. For instance, as in the following picture, to monitor the event with code '0x02430001', you need to enter '2430001' in the formula.
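When many event codes need to be converted for situation formulas, the rule above can be applied mechanically. The following Python helper is a minimal sketch of that conversion. The function name is hypothetical, and it assumes the codes are written in the "0x..." form used in the list above.

```python
def event_code_filter_value(code):
    """Convert an event code such as '0x02430001' to its filter form.

    Drops an optional '0x' prefix and any leading zeros, as described
    in the text. The remaining hex digits are returned unchanged.
    """
    text = code.strip()
    if text[:2].lower() == "0x":
        text = text[2:]
    stripped = text.lstrip("0")
    return stripped or "0"
```

With this helper, '0x02430001' becomes '2430001' and '0x01a40001' becomes '1a40001'.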