IBM Support

nimon working with Prometheus

How To


Summary

This article covers sending nimon performance statistics to the Prometheus Time-Series database via the InfluxData tool called Telegraf. nimon already supports InfluxDB, and njmon, via its JSON output, can also work with elastic and Splunk.

Objective


Environment

This article assumes you are using nimon for AIX or Linux version 63 or higher.
Note: the nimon program is a version of njmon that outputs statistics in InfluxDB Line Protocol format.

Steps

njmon and nimon were developed to work with the InfluxDB Time-Series database and the Grafana web-based graphing and monitoring tool:

  • njmon sends JSON data via njmond.py, a central Python client, to InfluxDB - 2nd letter is J for JSON format
  • nimon sends InfluxDB native Line Protocol directly to InfluxDB - 2nd letter is I for InfluxDB format
The JSON format output from njmon can be sent to:
  • Splunk via the Splunk Python client module.
  • elastic (also called elasticsearch or ELK) via the elastic file snooping tool called beats.
  • Python directly, because JSON data loads quickly into a native Python "dictionary" data type.
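As a minimal sketch of that last point, Python's standard json module turns a JSON document into nested dictionaries in one call. The sample text below is a made-up fragment for illustration only. The key names (cpu_total, user, sys, idle) are assumptions and not necessarily the real njmon schema.

```python
import json

# Hypothetical njmon-style snapshot; real njmon output has many more sections
# and the key names here are illustrative assumptions.
sample = (
    '{"timestamp": {"datetime": "2020-06-08T12:00:00"},'
    ' "cpu_total": {"user": 12.5, "sys": 3.1, "idle": 84.4}}'
)


def load_snapshot(text):
    """Parse one JSON snapshot into a native Python dictionary."""
    return json.loads(text)


snapshot = load_snapshot(sample)

# Once loaded, statistics are reached with ordinary dictionary lookups.
busy = snapshot["cpu_total"]["user"] + snapshot["cpu_total"]["sys"]
print(f"CPU busy: {busy:.1f}%")
```

No schema or parser needs to be written first, which is why JSON is convenient for quick ad hoc analysis.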
This article is about sending the performance statistics from nimon (not njmon) to another popular Time-Series database called Prometheus. Prometheus has been around for a few years but has been getting extra attention in the last year because it is used in Red Hat OpenShift and Kubernetes.
The Prometheus Time-Series database is different to the other Time-Series tools in that the central Prometheus service polls the endpoints on your computer servers or virtual machines and then pulls the available data back to the database.  The other Time-Series tools have endpoints gathering the performance statistics and when ready, they then push the data to the central database.
It might seem a small point, but it requires Prometheus endpoints to run a sort of website, which is not trivial to implement in the C language that is used to gather the performance stats.  It is tricky on AIX, which has the excellent Perfstat Library in C. Prometheus endpoints tend to be written in Go, and access to large C structures of data from Go is complicated.
So nimon wants to push data and Prometheus wants to pull it. What we need is a tool in the middle that accepts push requests from nimon and then holds the data until Prometheus polls and pulls it.  In this article, we use a tool called Telegraf from the company InfluxData (the InfluxDB team).  This tool can handle dozens of input formats and dozens of output formats.  It is a real "multi-function adapter", like a Swiss Army knife for transforming data.
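The "tool in the middle" idea can be sketched in a few lines of Python. This is only an illustration of the push-then-pull pattern, not how Telegraf is implemented. The class name PushPullBuffer, its method names, and the 120-second expiry are all hypothetical choices for this sketch.

```python
import time


class PushPullBuffer:
    """Holds the most recent pushed value per metric until it expires."""

    def __init__(self, expiration_seconds=120.0):
        self.expiration_seconds = expiration_seconds
        self._latest = {}  # metric name -> (value, time received)

    def push(self, name, value, now=None):
        """Called when an agent (for example nimon) pushes a statistic."""
        received = time.time() if now is None else now
        self._latest[name] = (value, received)

    def pull(self, now=None):
        """Called when the poller (for example Prometheus) scrapes.

        Only values younger than the expiration interval are returned.
        """
        current = time.time() if now is None else now
        return {
            name: value
            for name, (value, received) in self._latest.items()
            if current - received <= self.expiration_seconds
        }


# Explicit "now" values make the behaviour easy to follow.
buf = PushPullBuffer(expiration_seconds=120)
buf.push("cpu_user", 12.5, now=0)
buf.push("cpu_sys", 3.1, now=100)
fresh = buf.pull(now=110)  # both values are still within 120 seconds
later = buf.pull(now=130)  # cpu_user is now older than 120 seconds
```

The pusher and the puller never talk to each other directly. Each only deals with the buffer, which is what lets a push-style agent feed a pull-style database.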
So here is my worked example of the nimon to Telegraf to Prometheus to Grafana data chain.
Notes:
  • I chose the port numbers (8888, 8099) largely at random and the IP address (9.137.62.10) is on my network
  • I am running Prometheus and Telegraf on the same server.
  • My test server is running Ubuntu 18.04 but that makes little difference to the setup except for the downloaded file and install command
This diagram shows the architecture and connection details
image 4050
1) Prometheus - current at the time is version 2.18.1
  • Download Prometheus from https://prometheus.io/download/
  • Uncompress and extract the files
  • cd to the newly created "prometheus" directory
  • Create a prometheus.yml file based on this example:
  global:
    scrape_interval: 15s

  scrape_configs:
    - job_name: 'node'
      static_configs:
        - targets: ['localhost:9100']
    - job_name: 'prometheus'
      static_configs:
        - targets: ['localhost:9090']
    - job_name: 'telegraf'
      static_configs:
        - targets: ['localhost:8099']
Comments:
  • Ignore the job_name: 'node' stanza. It was for testing Prometheus with the node_exporter endpoint, to collect stats and prove I had a working Prometheus service.
  • The job_name: 'prometheus' stanza has Prometheus collect its own statistics on port 9090, its default listening port. Grafana connects to Prometheus on this same port as a data source later.
  • The job_name: 'telegraf' stanza sets the port number 8099 on which Prometheus pulls the data from Telegraf.
To run Prometheus:
  ./prometheus
By default it looks in the current directory for the prometheus.yml file.
This method is obviously a "quick and dirty" way to get Prometheus running, and it outputs logging information on the screen.  Read the Prometheus documentation on how to run it as a service that is restarted every time the server or virtual machine is rebooted.
2) Telegraf - current at the time is version 1.14.3
  • Download and install the Telegraf program from https://portal.influxdata.com/downloads/
  wget https://dl.influxdata.com/telegraf/releases/telegraf_1.14.3-1_amd64.deb
  sudo dpkg -i telegraf_1.14.3-1_amd64.deb
  • Create a telegraf.config file based on this example:
  [[outputs.prometheus_client]]
    listen = ":8099"
    metric_version = 2
    path = "/metrics"
    expiration_interval = "120s"
    string_as_label = false

  [[inputs.socket_listener]]
    service_address = "tcp://:8888"
    data_format = "influx"
    read_buffer_size = "256KiB"
    read_timeout = "2s"
Comments:
  • The outputs.prometheus_client stanza tells Telegraf to prepare for a Prometheus server to connect and pull data from Telegraf on port number 8099.  The expiration_interval of 120 seconds is how long Telegraf buffers the last set of statistics. We expect Prometheus to collect them much faster than that rate.
  • The inputs.socket_listener stanza with port 8888 is of type influx, so it opens this network socket and expects InfluxDB Line Protocol formatted statistics. As nimon sends the statistics in one short burst, the read_timeout of 2 seconds lets Telegraf know not to expect any further stats.  Note: use nimon with the -w command line option to remove three lines it would send to InfluxDB concerning InfluxDB and authentication. These lines are not needed with Telegraf.  Telegraf ignores them but adds needless warnings to the log file.
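To make the format conversion more concrete, here is a rough Python sketch of turning one InfluxDB Line Protocol line into Prometheus text format lines. It is not Telegraf's code. It assumes metric names are formed as measurement_field with tags becoming labels, which is consistent with names such as cpu_physical_total_user. The function name line_to_prometheus and the sample line (including the host tag) are hypothetical, and the parser ignores Line Protocol escaping rules.

```python
def line_to_prometheus(line):
    """Convert one simple Line Protocol line into Prometheus text lines.

    Simplified sketch: assumes no escaped spaces or commas, and no quoted
    strings containing spaces or commas. Non-numeric fields are skipped.
    """
    parts = line.strip().split(" ")
    head, field_part = parts[0], parts[1]  # an optional timestamp is ignored

    # "measurement,tag1=a,tag2=b" -> measurement name plus label pairs
    measurement, *tag_pairs = head.split(",")
    labels = ",".join('{}="{}"'.format(*pair.split("=", 1)) for pair in tag_pairs)

    output = []
    for field in field_part.split(","):
        key, raw = field.split("=", 1)
        if raw.endswith("i"):  # Line Protocol marks integers with a trailing "i"
            raw = raw[:-1]
        try:
            value = float(raw)
        except ValueError:
            continue  # string or boolean field: skipped in this sketch
        name = f"{measurement}_{key}"
        output.append(f"{name}{{{labels}}} {value}" if labels else f"{name} {value}")
    return output


# Hypothetical sample line; real nimon output carries more tags and fields.
example = "cpu_physical_total,host=aix72vm user=12.5,sys=3.1 1591600000000000000"
for metric_line in line_to_prometheus(example):
    print(metric_line)
```

One Line Protocol line with several fields fans out into several Prometheus metrics, one per numeric field.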
To run Telegraf:
  telegraf --config telegraf.config --debug
Note:
  • The --debug command line option is not needed in production, but it provides useful feedback that the statistics are arriving and being collected by Prometheus.
This method is obviously a "quick and dirty" way to get Telegraf running, and it outputs logging information on the screen.  Read the Telegraf documentation on how to run it as a service that is restarted every time the server or virtual machine is rebooted.
3) nimon - current at the time is version 63
On each of your servers or virtual machines, run:
  nimon -s15 -c5760 -w -i 9.137.62.10 -p 8888
The IP address of 9.137.62.10 is the Telegraf server - you can use its hostname.
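If you want to check that the Telegraf socket is reachable before starting nimon, a hand-made test line can be sent with a few lines of Python. This is a minimal sketch: the function names and the test_measurement line are invented for illustration, and it assumes the listener accepts newline-terminated Line Protocol over plain TCP, as configured in the inputs.socket_listener stanza.

```python
import socket


def build_payload(line):
    """Return the bytes to send: one Line Protocol line ending in a newline."""
    return line.rstrip("\n").encode("utf-8") + b"\n"


def send_line(host, port, line, timeout=5.0):
    """Open a TCP connection, send one line, and close the connection."""
    payload = build_payload(line)
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
    return len(payload)


# Example call (needs a running Telegraf listener, so it is left commented out):
# send_line("9.137.62.10", 8888, "test_measurement,host=mytest value=42")
```

With Telegraf running in --debug mode, a successful send should show up in its log output, and a connection error points to a network or port problem rather than a nimon problem.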
The nimon statistics are now pouring into Prometheus and my job is done - phew!
But we have to check that we can find data and graph the statistics. First with Prometheus tools and then with Grafana.
4) Graph data on the Prometheus graphical user interface
I am no expert with this tool but after some experiments I get the following in a browser connecting to port 9090. To find the nimon data, start typing in the box above the Execute button and it brings up the statistic names in the database that match.  I had previously tried the Prometheus endpoint data collector called node_exporter, so there were both node_exporter statistics and nimon statistics. Initially, having the two data sources was a little confusing, so I deliberately searched for nimon statistic names like "cpu_physical_total_user".
image 3983
Here are three virtual machines on my POWER9 server with three different AIX releases.
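The same check can be made without a browser by using the Prometheus HTTP API instant-query endpoint, /api/v1/query. The following is a sketch using only the Python standard library. The function names are hypothetical, and the host label in the sample response is an assumption about how the nimon tags appear in Prometheus.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, metric):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": metric})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def extract_values(response):
    """Return (labels, value) pairs from a decoded instant-query response."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query failed")
    return [
        (series["metric"], float(series["value"][1]))
        for series in response["data"]["result"]
    ]


def query(base_url, metric):
    """Fetch and decode one instant query (needs a reachable Prometheus)."""
    url = build_query_url(base_url, metric)
    with urllib.request.urlopen(url, timeout=10) as reply:
        return extract_values(json.load(reply))


# Example call (needs a running Prometheus, so it is left commented out):
# for labels, value in query("http://localhost:9090", "cpu_physical_total_user"):
#     print(labels.get("host"), value)
```

Each returned pair is one time series, so with three virtual machines sending nimon data you would expect three entries for a metric such as cpu_physical_total_user.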
5) Grafana - current at the time is version 7.0.1
I have used Grafana with InfluxDB data for a few years, but it was a little shocking that there are differences when using Prometheus as the data source.  Not better or worse, just different, and it takes time to work through.
I added the new Prometheus data source to Grafana, using the Prometheus IP address and port number 9090. It was straightforward and the same process as connecting Grafana to InfluxDB. Then I created three graphs, but the way this is done with Prometheus data is different than with InfluxDB. This is because the data schema is different.
image 4044
Here are the settings behind a simple graph:
image 4051
I hope this gets you off to a flying start, and I am happy that my njmon and nimon tools can feed Prometheus and keep Prometheus enthusiasts happy too.
One key feature of nimon is that for AIX on a virtual machine with four disks, one network, and two CPUs with SMT=8, there are 1575 performance statistics available.  On larger virtual machines, or if monitoring processes, that number grows rapidly.
- - - The End - - -

Additional Information

I was inspired to use this method, and took the details of the Telegraf settings, from the following blog article by my good friend "wiard": https://sysrant.com/aix-metrics-in-prometheus-with-njmon/
Thanks wiard.


If you find errors or have questions, email me:

  • Subject: nimon working with Prometheus
  • E-mail: n a g @ u k . i b m . c o m

Document Location

Worldwide


Document Information

Modified date:
08 June 2020

UID

ibm11116327