IBM Support

njmon and nimon Internal Data Collector

How To


Summary

This article covers adding your own statistics to njmon or nimon. Your new statistics then end up in your Time-Series database (like InfluxDB) so you can graph them over time and check the server configuration.

Objective


Steps

To add statistics, you write a little C code supplying two functions, but nothing complicated.
  • If you can gather the statistics from C function calls, then the C program can be just three or four lines long.
  • If you have to run a program or shell script to get the statistics, then C code is needed to:
    1. Run the program.
    2. Grab the data from the output.
    3. Format the data for njmon and nimon.
    This takes a few more lines of code; the example requires 18 lines.

Note:
  • IBM official rPerf values are set for Power Systems servers running AIX. The performance ratings can be found in the "IBM Power Systems Performance Report". Note: rPerf is short for "relative performance" compared to an early Power Server.  There is another AIXpert Blog on "rPerf" and the "rperf" script used here.
  • Strictly speaking, the rperf script gives an "Estimated rPerf" for the LPAR (Virtual Machine) that your AIX is running in, based on the number of Virtual Processors. These are not official numbers, but they are useful in planning server migration and server consolidation to newer Power Servers.

You can follow the worked example, which runs a Korn shell script to gather the data and extract statistics (numbers) or information (strings) from the script output into C data variables like:
  • long - large integer variable
  • double - large floating point variable
  • string - a series of character variables
The program uses simple functions from njmon and nimon to generate the JSON data structure, which is included into the njmon output or the line protocol formatted data for nimon output. The functions available are:
  • psection("section-name")
    Used once only at the start. It sets a suitable name for your data like the-computer-resource or application-name.  If you can't think of something better, I recommend the name "extra" as in this example.
  • plong("name", longnumber)
    You can have many "plong()" functions, one per long statistic.
  • pdouble("name", doublenumber)
    You can have many "pdouble()" functions, one per double statistic.
  • pstring("name", yourstring)
    You can have many "pstring()" functions, one per string statistic.
  • psectionend()
    This function ends the new data. Used once only.
Your small C program includes two C functions called:
  1. extra_init()
    This function is called only once, at the start, for you to initialize any data structures. For example, you can use malloc() to allocate memory and save the first set of data into it, ready for the later calls to extra_data().
  2. extra_data(double elapsed)
    This function is called every time njmon or nimon captures data and can use the njmon or nimon functions psection(), plong(), pdouble(), pstring(), and psectionend() to save your data.
The C code file "extra.c" is used in two ways.
Diagram on stand-alone to inside njmon
In this example program, there is simple code to check that the two functions (extra_init() and extra_data()) work correctly in a "stand-alone" test environment.  This test code is optionally compiled in, for testing only, by adding the compile option "-D EXTRA_TEST".  When the code is compiled into njmon or nimon, the testing code is excluded by not having "-D EXTRA_TEST" as a compiler option.
In njmon and nimon, the new code is compiled in using an extra compiler option "-D EXTRA".
Example code
Here is the example code in a file called "extra.c" - using this file name is mandatory.
  /* njmon / nimon -- internal data collector */

  #ifdef EXTRA_TEST
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <string.h>

  void psection(char *s) { printf("\"%s\": {\n", s); }
  void psectionend() { printf("}\n"); }
  void pstring(char *name, char *s) { printf("\t\"%s\": \"%s\",\n", name, s); }
  void plong(char *name, long value) { printf("\t\"%s\": %ld,\n", name, value); }
  void pdouble(char *name, double value) { printf("\t\"%s\": %.3lf,\n", name, value); }

  extern void extra_init();
  extern void extra_data(double elapsed);

  int main()
  {
          extra_init();
          extra_data(2.0);
          return 0;
  }
  #endif /* EXTRA_TEST */

  void extra_init()
  {
          /* If necessary, use this function to
             initialise any data structures */
  }

  void extra_data(double elapsed)
  {
      FILE   *pop;
      char    string[4096];
      double  rperf = 0.0;

      if ( (pop = popen("/home/nag/rperf/rperf 2>/dev/null", "r") ) != NULL ) {
          if ( fgets(string, 4095, pop) != NULL) {
                  /* Sample result ->54.85 rPerf estimated based on 2.00 Virtual CPU cores<- */
                  sscanf(string, "%lf rPerf", &rperf );
                  string[strlen(string)-1] = 0; /* remove newline at the end */
          }
          pclose(pop);
      }
      if(rperf > 0.0) { /* If the command failed, don't send the data */
          psection("extra");
          pdouble("rperf", rperf);
          pstring("rperf_string", string);
          plong("meaning_of_life", 42);
          psectionend();
      }
  }
Comments on the extra.c code:
  1. Everything between "#ifdef EXTRA_TEST" and "#endif /* EXTRA_TEST */" is the code used for stand-alone testing of your two new functions.
  2. These lines are compiled in by adding the -D EXTRA_TEST option to the compiler.
  3. In this example case, we do not need the extra_init() function, so it is empty, but it must be present.
  4. The extra_data() function has one parameter: the double floating-point number called "elapsed". Elapsed is the number of seconds since this function was last called (or since extra_init() on the first call).  In this example, it is not used.  If your data is an incrementing counter, then the elapsed time is used to convert the statistic into a per-second rate.  An example would be a data rate: if the previous value was 100 KB and the current value is 120 KB, then the data rate would be "(120 - 100)/elapsed KB per second".
  5. This example uses the popen() function to run the Korn shell script "rperf" from the directory "/home/nag/rperf/".  There is a large assumption here that every target server has the rperf command in that directory.  If I were rolling out this new feature across servers, I would probably put the rperf script in /usr/lbin (AIX) or /usr/local/bin (Linux).
  6. The comment in the code shows the output of the script: 54.85 rPerf estimated based on 2.00 Virtual CPU cores
  7. The sscanf() function captures the 54.85 into the double variable rperf.
  8. Next, we strip off the newline character - you cannot have any control characters in strings in JSON nor in InfluxDB Line Protocol. The newline would mess up the test mode output. The full njmon or nimon program removes control characters for us.
  9. The pclose() function cleans up the open pipe. Not cleaning up would create a resource leak.
  10. We now have a double variable called rperf and a string variable called rperf_string to save to the Time-Series database.  In a real working case, the string variable is rather pointless, as you can only graph numbers, not strings.
  11. Finally, check the data is good.  If the extra functions fail to read the data, do not attempt to save the data. Missing data is handled well in Time-Series databases.
  12. In the example, we use a section name of "extra" and save a double, string and "fake" integer just as an illustration.
    You can use any section name that is not already in use.  Select a name that makes the data content obvious. Perhaps this example would be better with "estimated_rperf". Other examples: "rdbms", your RDBMS vendor's name, "payroll_statistics", or "app_transaction_rates".
  13. The psectionend() function informs njmon or nimon of the end of the data called "extra" (or whatever you want to call it).
Compile for Testing
Run the command:
  $ cc extra.c -o extra -D EXTRA_TEST    
Run the new program called "extra":
    $ ./extra
    "extra": {
            "rperf": 54.850,
            "rperf_string": "54.85 rPerf estimated based on 2.00 Virtual CPU cores",
            "meaning_of_life": 42,
    }
    $
Notes:
  • This example output is badly formed JSON: the final comma (",") after the 42 is not allowed.  Do not worry about this extra comma, as it is due to the simplistic test code. The real njmon or nimon code strips out the comma from the output buffer - that is the prime purpose of the psectionend() function.
  • If you remove that comma, you can prove it is valid JSON data by reading it into a Python program and converting it to a Python dictionary.  There are other ways to test for correct JSON format.
Compile in to njmon or nimon
The Makefile for the current AIX versions uses a command like this to compile:
  cc njmon_aix_v63.c -o nimon_aix722_v63 -D NIMON -g -O3 -lperfstat -lm -qstrict
  cc njmon_aix_v63.c -o njmon_aix722_v63 -D NJMON -g -O3 -lperfstat -lm -qstrict
Change to:
  cc njmon_aix_v63.c -o nimon_aix722_v63 -D NIMON -g -O3 -lperfstat -lm -qstrict -D EXTRA
  cc njmon_aix_v63.c -o njmon_aix722_v63 -D NJMON -g -O3 -lperfstat -lm -qstrict -D EXTRA
The extra.c file must be in the same directory.
Make a similar change to your Makefile, or run the commands by hand, to compile your new functions into njmon and nimon.
How does your code get added to njmon and nimon?
You don't need to understand this part, but it might help you work out what is going on.
The njmon and nimon code uses "-D EXTRA" to include your new extra.c code file and calls the new functions as follows.
To load the extra.c function in to the njmon or nimon code:
  #ifdef EXTRA
  #include "extra.c"
  #endif /* EXTRA */
To call the extra_init() function before the main loop:
  #ifdef EXTRA
      extra_init();
  #endif /* EXTRA */
To call the extra_data(elapsed) function toward the end of the main loop; it is the last set of statistics to be added:
  #ifdef EXTRA
          extra_data(elapsed);
  #endif /* EXTRA */
Testing the new code worked
Because njmon outputs each JSON data record on one long line, the files are hard to edit (unless you use the line2pretty.py Python code to convert the format).  It is simpler to use nimon for testing, as its output file is easy to edit.  Warning: the -f option creates the output file in your current working directory, with a filename ending in ".influxlp":
  ./nimon_aix722_v63 -s1 -c1 -f
Wait five seconds.
Then, check the end of the new ".influxlp" file (in my case, the filename is "blue_20200511_2236.influxlp") as follows:
  $ tail -1 blue_20200511_2236.influxlp
  extra,host=blue,os=AIX,architecture=POWER8_COMPAT_mode,serial_no=7804930,mtm=IBM-9009-42A rperf=54.850,rperf_string="54.85 rPerf estimated based on 2.00 Virtual CPU cores",meaning_of_life=42i
  $
Note:
  1. The last line starts with "extra", or whatever name you used in psection().
  2. Next, are the tags like "host=blue" - ignore the tags.  Tags make your statistics easier to find in the Time-Series database.
  3. After the space character is the actual data: 
    rperf=54.850,rperf_string="54.85 rPerf estimated based on 2.00 Virtual CPU cores",meaning_of_life=42i  
  4. The ending "i" is due to it being an integer.
If you have similar results, then your code is working. Well done.
The njmon and nimon programs are built from the same source code and differ only in whether they output JSON or InfluxDB Line Protocol.
If nimon works correctly, then so does njmon.
Now run the new njmon or nimon so the data gets to InfluxDB, for 10 quick snapshots (-c 10 -s 10).
Checking your new statistic arrives in the database
After two or three snapshot periods, on the InfluxDB server run the command-line "influx" program and type the following commands.
This assumes your InfluxDB database is called "njmon" and the psection() name was "extra".
  # influx
  Connected to http://localhost:8086 version 1.7.7
  InfluxDB shell version: 1.7.7
  > use njmon
  Using database njmon
  > select * from extra
  name: extra
  time                architecture       host meaning_of_life mtm          os  rperf rperf_string                                          serial_no
  ----                ------------       ---- --------------- ---          --  ----- ------------                                          ---------
  1589225676653374913 POWER8_COMPAT_mode blue 42              IBM-9009-42A AIX 54.85 54.85 rPerf estimated based on 2.00 Virtual CPU cores 7804930
  1589225687042766744 POWER8_COMPAT_mode blue 42              IBM-9009-42A AIX 54.85 54.85 rPerf estimated based on 2.00 Virtual CPU cores 7804930
Alternatively, directly select your new statistic columns:
  > select rperf,meaning_of_life,rperf_string from extra
  name: extra
  time                rperf meaning_of_life rperf_string
  ----                ----- --------------- ------------
  1589225676653374913 54.85 42              54.85 rPerf estimated based on 2.00 Virtual CPU cores
  1589225687042766744 54.85 42              54.85 rPerf estimated based on 2.00 Virtual CPU cores
  1589225697429213338 54.85 42              54.85 rPerf estimated based on 2.00 Virtual CPU cores
  1589225707969159256 54.85 42              54.85 rPerf estimated based on 2.00 Virtual CPU cores
  1589225717355288209 54.85 42              54.85 rPerf estimated based on 2.00 Virtual CPU cores
If your new statistics are in InfluxDB, well done.
Now implement your new njmon or nimon into production
If you rely on a program or script to get the statistics, then make sure every server has that program or script in the same directory.
Now run your new njmon or nimon for real. Wait an hour (so that a graph can be drawn) and use Grafana to graph your new statistics.
In this worked example, the rperf number does not change unless the LPAR (VM) size is dynamically changed.
Here is an example of my server's Estimated rPerf changing due to dynamic changes to the Virtual Processor count for the LPAR (VM) via the HMC.
  • The upper left shows the current value in a "Single Stat" panel.
  • The middle graph shows the rPerf values.
  • The lower graph shows the Virtual and CPU consumed with real work.
Graph of rperf value changing
The last thing to do is email me to let me know you have it working and which statistics you added. I might add any statistics that are generally useful for all njmon and nimon users to the next release.  Your hard work gets acknowledged in the release notes.
More points:
  • Avoid integers and plong() unless you are sure the statistic will never change to a floating-point number - you cannot change your mind later.  Once saved as an integer, you have to supply an integer whenever you use that statistic name.
  • For InfluxDB, there is no benefit in reduced disk space or compute time in using integers.
  • If you get this working, then you need to recompile for every new njmon or nimon release.
  • If you run njmon or nimon as the root user, make sure any program or script you are now running is secure. That is, no write permission for "group" or "other", and it is owned by root.
  • If you are unlikely to have your script or program on every server, then you could add code to check whether the script or program file exists.  If the file does not exist, then the function can return immediately to reduce the CPU time taken.  If you later want the statistics collected, then placing the script or program in the right directory effectively switches them on.
  • Make sure your program code runs fast, as it is called hundreds of times a day. Use the time command and run it a few times to allow caching:
      $ time ./extra
      "extra": {
              "rperf": 54.850,
              "rperf_string": "54.85 rPerf estimated based on 2.00 Virtual CPU cores",
              "meaning_of_life": 42,
      }

      real    0m0.11s
      user    0m0.01s
      sys     0m0.01s
      $ time ./extra
      "extra": {
              "rperf": 54.850,
              "rperf_string": "54.85 rPerf estimated based on 2.00 Virtual CPU cores",
              "meaning_of_life": 42,
      }

      real    0m0.04s
      user    0m0.01s
      sys     0m0.00s
      $ time ./extra
      "extra": {
              "rperf": 54.850,
              "rperf_string": "54.85 rPerf estimated based on 2.00 Virtual CPU cores",
              "meaning_of_life": 42,
      }

      real    0m0.04s
      user    0m0.01s
      sys     0m0.01s
    A four-hundredth of a second is a long time! If a function call were available to get the statistics, it would be much faster.

Additional Information


Find more content from Nigel Griffiths IBM (retired) here:

Document Location

Worldwide


Document Information

Modified date:
13 June 2023

UID

ibm11165420