Displaying PMI metrics in Prometheus format with the metrics app
You can use the metrics.ear file to create a Prometheus endpoint for your WebSphere® Application Server runtimes to display PMI metrics in Prometheus format. The metrics app does the following:
- Retrieves the PMI data objects by using the JMX Perf MBean.
- Renders the data from the PMI data objects into Prometheus format output.
Before you begin
Similar to the PerfServlet, the metrics.ear provides a way to use HTTP requests to query the performance metrics for an entire WebSphere Application Server administrative domain. In contrast to the PerfServlet, which returns PMI data in XML format, the metrics.ear converts PMI data into Prometheus format. With the metrics.ear, Prometheus can scrape metrics directly from your application servers.
The metrics available on the Prometheus endpoint correspond to the set of metrics enabled in the PMI configuration. For the Prometheus output, some PMI metrics are suppressed or split into two metrics to better follow Prometheus best practices. See Prometheus metrics for a mapping of the original PMI metrics to their corresponding Prometheus metrics.
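As a rough illustration of what a client sees on the endpoint, the following sketch parses Prometheus text-format output into name, label, and value tuples. The metric names and labels in SAMPLE are hypothetical placeholders, not actual names produced by the metrics app, and the parser is deliberately minimal (it does not unescape label values or read timestamps).

```python
import re

# Matches: metric_name{label="value",...} 12.5   (the label block is optional)
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Hypothetical output; real metric names depend on the PMI configuration.
SAMPLE = """\
# HELP example_pool_size Hypothetical gauge
# TYPE example_pool_size gauge
example_pool_size{node="node1",server="server1"} 10
example_requests_total 42
"""


def parse_prometheus_text(text):
    """Return a list of (name, labels_dict, value) for each sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_prometheus_text(SAMPLE):
        print(name, labels, value)
```

A parser like this can help when you compare the endpoint output against the PMI metrics that you enabled.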
com.ibm.ws.pmi.prometheus.includeCellNodeServerLabels=false
To exclude node agents from the metrics output, set the com.ibm.ws.pmi.prometheus.includeNodeAgents system property to false, as shown in the following example:
com.ibm.ws.pmi.prometheus.includeNodeAgents=false
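Java system properties are commonly passed on the JVM command line in the -Dname=value form. The following sketch formats the two properties shown above in that form. The helper name jvm_system_property_args is hypothetical, and whether you supply the result as JVM arguments or define the properties another way in your server configuration is an assumption to verify for your environment.

```python
def jvm_system_property_args(properties):
    """Format a dict of property names and values as -Dname=value arguments."""
    args = []
    for name, value in sorted(properties.items()):
        # Java expects lowercase "true"/"false" for boolean-valued properties.
        if isinstance(value, bool):
            value = "true" if value else "false"
        args.append("-D{}={}".format(name, value))
    return " ".join(args)


if __name__ == "__main__":
    print(jvm_system_property_args({
        "com.ibm.ws.pmi.prometheus.includeCellNodeServerLabels": False,
        "com.ibm.ws.pmi.prometheus.includeNodeAgents": False,
    }))
```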
Tuning performance
With network deployment, the metrics.ear endpoint contacts all servers in the cell to gather metrics. If any servers in the cell are CPU-bound, or slow to respond for other reasons, they adversely affect the response time of the metrics endpoint. Monitor the response time of metrics endpoint requests to determine whether tuning is needed.
The metrics.ear endpoint response time scales linearly with the number of metrics available at the endpoint. If response time is too slow, reduce the number of metrics that are collected by adjusting the PMI configuration.
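One way to monitor the response time is to time a plain HTTP GET against the endpoint. The following is a minimal sketch under stated assumptions: the base URL (host and port) is a placeholder that you supply, the function names are hypothetical, and the optional node and server path segments follow the /metrics/<node_name>/<server_name> form that this topic describes under URL filtering.

```python
import time
import urllib.request
from urllib.parse import quote


def metrics_url(base_url, node=None, server=None):
    """Build /metrics, /metrics/<node_name>, or /metrics/<node_name>/<server_name>."""
    if server is not None and node is None:
        raise ValueError("a server name requires a node name")
    parts = [base_url.rstrip("/"), "metrics"]
    if node is not None:
        parts.append(quote(node, safe=""))
    if server is not None:
        parts.append(quote(server, safe=""))
    return "/".join(parts)


def measure_response_time(url, timeout=30.0, opener=urllib.request.urlopen):
    """Return (elapsed_seconds, body_size_in_bytes) for one GET request."""
    start = time.monotonic()
    with opener(url, timeout=timeout) as response:
        body = response.read()
    return time.monotonic() - start, len(body)
```

Timing the cell-wide URL and then a single-node or single-server URL gives a quick indication of which part of the cell contributes most to the response time.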
- PMI settings
- Enable only the PMI metrics that are relevant for your business needs. Review the PMI settings and use a custom setting to enable or disable individual metrics. If possible, avoid the All metrics setting. For servers that do not require metrics collection, disable PMI.
- URL filtering
- You can use the metrics endpoint to query for metrics from a single node or single server. The default endpoint /metrics shows the PMI metrics that are collected from all of the servers and node agents in the cell. To limit the output to a single node or a single server, use the URL /metrics/<node_name> or /metrics/<node_name>/<server_name>.
- Prometheus scrape_interval
- A typical Prometheus scrape_interval is 15 seconds. If the response time for your Prometheus endpoint is a few seconds, increase the Prometheus scrape_interval value. Alternatively, you can scale down the number of PMI metrics available at the endpoint by modifying the PMI settings or by using URL filtering.
- Result caching
- The metrics app stores the most recent Prometheus metrics result in a cache for 5 seconds by default. A request that is made within 5 seconds of the previous request is served with the cached result. You can change this cache interval with the com.ibm.ws.pmi.prometheus.resultCacheInterval system property.
- Servers list refresh
- The list of servers to be scraped for PMI data is refreshed when the metrics endpoint is accessed. To reduce the cost, the metrics app does not refresh this list more often than every 600 seconds by default. New servers that are added to the cell with PMI enabled are picked up by the metrics.ear at the next refresh. You can configure this refresh interval with the com.ibm.ws.pmi.prometheus.serverListUpdateInterval system property.
- Server metrics scrape response time
- For metrics scraping, when Prometheus calls the /metrics endpoint of the metrics.ear application, the application makes JMX calls to each server in the cell to collect metrics. If one of the servers is slow to respond, the /metrics endpoint response time might be long and Prometheus might time out with no response, according to the Prometheus scrape_timeout configuration setting. By default, the metrics.ear application waits up to 8 seconds when it communicates with the servers in the cell. After this timeout is reached, the application returns a response to Prometheus, even if some server data is omitted from the response. This behavior limits the Prometheus endpoint response time. You can configure the server scrape timeout value with the following system property: com.ibm.ws.pmi.prometheus.serverScrapeTimeout. If the value is set to 0 or a negative value, no timeout is applied to the server scrapes.
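The server scrape timeout behavior described in the last item can be pictured with a small model: query every server in parallel, wait up to a deadline, and return whatever has arrived. The following sketch is an illustration only, not the metrics.ear implementation. The function name collect_with_deadline and the fetcher callables are hypothetical stand-ins for the per-server JMX calls.

```python
import concurrent.futures


def collect_with_deadline(fetchers, timeout_seconds):
    """Run every fetcher in parallel; return (results, skipped_server_names).

    fetchers maps a server name to a zero-argument callable that returns
    that server's metrics. A timeout of 0 or less means no deadline.
    """
    deadline = timeout_seconds if timeout_seconds > 0 else None
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(1, len(fetchers)))
    futures = {pool.submit(fetch): name for name, fetch in fetchers.items()}
    done, not_done = concurrent.futures.wait(futures, timeout=deadline)
    # Do not block on slow servers; their data is omitted from this response.
    pool.shutdown(wait=False)
    results = {}
    skipped = [futures[future] for future in not_done]
    for future in done:
        name = futures[future]
        try:
            results[name] = future.result()
        except Exception:
            skipped.append(name)  # a failed fetch is treated like a missing one
    return results, sorted(skipped)
```

With a deadline of 8 seconds, a single unresponsive server costs at most about 8 seconds of response time, at the price of that server's metrics being absent from the result.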
Performance improvement recommendations
- In cells with many servers, use a longer scrape_interval value than typical because each scrape covers the entire cell.
- Response time is proportional to the number of metrics returned. You can turn off some of the more verbose metrics, such as URI and EJB metrics, to help improve performance.
- Long scrape times occur when CPU usage of any node in the cell is near 100%. When you use Prometheus with Grafana to consume metrics, gaps can occur in your Grafana graphs when scrape time exceeds the Prometheus scrape timeout.