Topic
  • 6 replies
  • Latest Post - ‏2013-12-06T04:57:51Z by fbenfredj
fbenfredj
fbenfredj
40 Posts

Pinned topic NMON CPU not correct on P795

‏2013-05-17T10:11:51Z |

Hi all,

I have a problem with the nmon CPU stats on a P795.

The PhysicalCPU value in the LPAR tab is wrong compared to sar and lpar2rrd.

For example, on one LPAR with VP=4, sar and lpar2rrd show physc ~ 3.8, whereas nmon and lparstat show physc ~ 2.9.

These metrics are correct: %usr, %sys, %wio, %idle.

This issue occurs on all AIX versions (5.3, 6.1 and 7.1).

However, on P770 servers there is no problem.

Any idea?

Thank you for your help

 

 

  • puvichakravarthy
    puvichakravarthy
    55 Posts

    Re: NMON CPU not correct on P795

    ‏2013-05-17T18:23:09Z  

    Could you kindly validate with lparstat output?

  • nagger
    nagger
    1639 Posts

    Re: NMON CPU not correct on P795

    ‏2013-05-20T08:46:22Z  

    My first reaction is that this is very unlikely, and that you should check your service pack levels.

    Use oslevel -s   NOTE THE -s

    This will give you the date of the service pack like:

    $ oslevel -s
    6100-07-03-1207
    

    In the last field (1207), the 12 means 2012 and the 07 means week 7, i.e. February. If your service pack level is more than 18 months out of date, you need to update, as you might be looking at a bug that was fixed a long time ago.
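As an illustration of the decoding described above, here is a minimal Python sketch that splits an `oslevel -s` string into its fields and estimates its age. The function names are made up for this example, and mapping the week number to a calendar date via ISO weeks is an assumption, not something stated in the thread.

```python
from datetime import date


def parse_oslevel(level: str) -> dict:
    """Split an `oslevel -s` string such as '6100-07-03-1207' into its fields."""
    base, tl, sp, yyww = level.strip().split("-")
    return {
        "base": base,                   # e.g. '6100' for AIX 6.1
        "tl": int(tl),                  # technology level
        "sp": int(sp),                  # service pack
        "year": 2000 + int(yyww[:2]),   # '12' -> 2012
        "week": int(yyww[2:]),          # '07' -> week 7
    }


def approx_build_date(level: str) -> date:
    """Monday of the given week. ASSUMPTION: the week number is an ISO week."""
    info = parse_oslevel(level)
    return date.fromisocalendar(info["year"], info["week"], 1)


def months_old(level: str, today: date) -> float:
    """Rough age in months, using an average month length of 30.44 days."""
    return (today - approx_build_date(level)).days / 30.44


if __name__ == "__main__":
    lvl = "6100-07-03-1207"
    print(parse_oslevel(lvl))
    print(approx_build_date(lvl))
    print(round(months_old(lvl, date(2013, 5, 20)), 1), "months old on 2013-05-20")
```

Anything past roughly 18 months by this estimate would fall under the "update first" advice above.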

    Second reaction: are you running the nmon supplied with AIX and not nmon classic? I am totally gobsmacked by people running nmon versions from 5 years ago on new hardware. The last nmon classic version was from POWER5 days. Expecting that to work on POWER6 and POWER7 is insane.

    Third point: raise a PMR if you have the two above right. Why are you asking a forum?

    Fourth, how did you conclude nmon was wrong? sar is an ancient UNIX tool and rarely used; as far as I know it is not SMT aware, and LPAR2RRD only gets its data from the HMC. You also don't say whether this is a shared-CPU LPAR or whether it is uncapped, which would have been useful.

     

    cheers, Nigel Griffiths

  • fbenfredj
    fbenfredj
    40 Posts

    Re: NMON CPU not correct on P795

    ‏2013-05-21T13:44:49Z  

    Hi Nigel,

    I am sorry for the delay. In France we had a three-day weekend; Monday was a day off.

    Here are the answers to your points.

    1/ I have the issue on many LPARs of my P795, with different AIX versions and different TLs and SPs.

    For example, I have an LPAR with the latest TL and SP of AIX 7.1: 7100-02-02-1316.

    2/ Of course, it is the nmon supplied with AIX.

    3/ My customer's AIX support contract is being renewed with IBM. I will raise a PMR as soon as it is renewed.

    In the meantime, I wanted to save time by finding out whether this is a known bug.

    4/ I noticed that the nmon values differ from the stats of all the other tools (vmstat, sar, lpar2rrd, ...) and of our official perf tool, Omnivision.

    That's why I concluded that nmon is wrong.

     

    Here are some examples of stats :

    SAR :

    05/21/2013 15:16:57    %usr    %sys    %wio   %idle   physc   %entc
    05/21/2013 15:17:27      60       3       0      37    1.08   537.6
    05/21/2013 15:17:57      61       2       0      37    1.04   521.6
    05/21/2013 15:18:27      59       4       0      37    1.08   540.8
    05/21/2013 15:18:57      61       2       0      37    1.04   520.6
    05/21/2013 15:19:27      61       3       0      37    1.07   537.0
    05/21/2013 15:19:57      61       1       0      37    1.04   519.7
    05/21/2013 15:20:27      60       3       0      37    1.09   545.7
    05/21/2013 15:20:57      61       1       0      37    1.04   519.3
    05/21/2013 15:21:27      61       2       0      37    1.05   525.4
    05/21/2013 15:21:57      59       4       0      37    1.08   540.7

    LPARSTAT :

    %user  %sys  %wait  %idle physc %entc  lbusy   app  vcsw phint  %nsp  Time
    ----- ----- ------ ------ ----- ----- ------   --- ----- ----- ----- --------
    60.5   2.5    0.0   37.0  1.07 534.3   13.2 17.14   591    93    77 15:17:15
    61.2   1.6    0.0   37.2  1.05 523.1   12.9 16.75   507   104    77 15:17:45
    60.1   2.8    0.0   37.1  1.07 532.8   13.2 16.22   542   149    77 15:18:15
    60.4   2.5    0.0   37.1  1.06 529.9   13.2 16.85   548   137    77 15:18:45
    60.7   2.6    0.0   36.7  1.07 534.4   13.3 18.02   528   109    77 15:19:15
    61.3   1.6    0.0   37.1  1.05 523.7   13.0 17.83   499    96    77 15:19:45
    60.2   2.8    0.0   37.0  1.08 542.3   13.6 17.84   684   112    77 15:20:15
    61.3   1.5    0.0   37.2  1.05 522.9   12.9 16.93   551    97    77 15:20:45
    61.5   1.9    0.0   36.6  1.04 522.5   13.1 17.32   551   108    77 15:21:15
    61.0   1.9    0.0   37.1  1.06 527.7   13.0 17.15   537   111    77 15:21:45

    VMSTAT :

    kthr    memory              page              faults              cpu
    ----- ----------- ------------------------ ------------ -----------------------
     r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa    pc    ec
     2  0 1409651  4902   0   0   0   0    0   0   6 2711 976 60  3 38  0  1.09 547.5
     2  0 1409651  4902   0   0   0   0    0   0  15 1580 907 62  2 37  0  1.02 512.5
     2  0 1409651  4902   0   0   0   0    0   0  15  771 858 61  2 38  0  1.05 525.7
     2  0 1409651  4902   0   0   0   0    0   0   5 2004 829 61  2 37  0  1.05 522.8

    NMON :

    LPAR    Logical Partition XXXXXXX      PhysicalCPU

    LPAR    T1864   0.865
    LPAR    T1865   0.790
    LPAR    T1866   0.815
    LPAR    T1867   0.801
    LPAR    T1868   0.842
    LPAR    T1869   0.794
    LPAR    T1870   0.825
    LPAR    T1871   0.792
    LPAR    T1872   0.817
    LPAR    T1873   0.813
    LPAR    T1874   0.823
    LPAR    T1875   0.793
    LPAR    T1876   0.802
    LPAR    T1877   0.797
    LPAR    T1878   0.845
    LPAR    T1879   0.793
    LPAR    T1880   0.814
    LPAR    T1881   0.794
    LPAR    T1882   0.804
     

    Updated on 2013-05-21T13:53:15Z by fbenfredj
  • nagger
    nagger
    1639 Posts

    Re: NMON CPU not correct on P795

    ‏2013-05-21T15:14:32Z  

    OK, thanks for the info, you clearly have covered the basics well - this needs a PMR to get it fixed.

    2% differences might be explainable due to timing of the stats output but 20% consistently under is gibberish.
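To put a number on that gap, here is a small Python sketch using the sar and nmon figures posted earlier in this thread. It assumes the nmon snapshots cover roughly the same period as the sar samples, which the thread does not confirm; `shortfall_pct` is just an illustrative helper name.

```python
from statistics import mean

# physc column from the sar output posted in this thread
SAR_PHYSC = [1.08, 1.04, 1.08, 1.04, 1.07, 1.04, 1.09, 1.04, 1.05, 1.08]

# PhysicalCPU column from the nmon LPAR lines (snapshots T1864..T1882)
NMON_PHYSC = [
    0.865, 0.790, 0.815, 0.801, 0.842, 0.794, 0.825, 0.792, 0.817, 0.813,
    0.823, 0.793, 0.802, 0.797, 0.845, 0.793, 0.814, 0.794, 0.804,
]


def shortfall_pct(reference: float, observed: float) -> float:
    """How far `observed` sits below `reference`, as a percentage of `reference`."""
    return 100.0 * (reference - observed) / reference


if __name__ == "__main__":
    sar_mean = mean(SAR_PHYSC)
    nmon_mean = mean(NMON_PHYSC)
    print(f"sar mean physc : {sar_mean:.3f}")
    print(f"nmon mean physc: {nmon_mean:.3f}")
    print(f"nmon is {shortfall_pct(sar_mean, nmon_mean):.1f}% below sar")
```

The result is well outside anything that sampling-interval jitter would explain.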

    Hope you can let us know what happens!

    Cheers, Nigel Griffiths

  • Steve_ATS
    Steve_ATS
    40 Posts

    Re: NMON CPU not correct on P795

    ‏2013-05-22T22:23:09Z  

    Hmm, %nsp is at 77, so the CPUs are throttled. physc × %nsp looks pretty close to the nmon value...

    I've seen some weird output from lparstat -E at later levels, but I have not seen it impact physc; then again, I wasn't trying to correlate it with nmon at the time.
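The physc × %nsp observation can be checked against the numbers posted in this thread. The Python sketch below is only that arithmetic, not a statement of how nmon computes PhysicalCPU internally; it also assumes the nmon snapshots and the lparstat samples cover about the same period.

```python
from statistics import mean

# (physc, %nsp) pairs from the lparstat output posted in this thread
LPARSTAT_ROWS = [
    (1.07, 77), (1.05, 77), (1.07, 77), (1.06, 77), (1.07, 77),
    (1.05, 77), (1.08, 77), (1.05, 77), (1.04, 77), (1.06, 77),
]

# PhysicalCPU column from the nmon LPAR lines (snapshots T1864..T1882)
NMON_PHYSC = [
    0.865, 0.790, 0.815, 0.801, 0.842, 0.794, 0.825, 0.792, 0.817, 0.813,
    0.823, 0.793, 0.802, 0.797, 0.845, 0.793, 0.814, 0.794, 0.804,
]


def scaled_physc(rows):
    """Scale each physc sample by its %nsp (nominal speed percentage)."""
    return [physc * nsp / 100.0 for physc, nsp in rows]


if __name__ == "__main__":
    scaled_mean = mean(scaled_physc(LPARSTAT_ROWS))
    nmon_mean = mean(NMON_PHYSC)
    print(f"lparstat physc x %nsp (mean): {scaled_mean:.3f}")
    print(f"nmon PhysicalCPU (mean)     : {nmon_mean:.3f}")
    print(f"difference                  : {abs(scaled_mean - nmon_mean):.3f}")
```

On these samples the two means agree to within about 0.01 of a processor, which is consistent with the throttling explanation.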

  • fbenfredj
    fbenfredj
    40 Posts

    Re: NMON CPU not correct on P795

    ‏2013-12-06T04:57:51Z  

    I opened a PMR, but my contract with my customer ended at the end of June, before I got a solution from IBM.

    However, I got news from my colleague, who continued to follow the PMR.

    The issue was due to the power saving mechanism.

    When my colleague deactivated power saving on the P795s, the CPU usage became correct.

    Hope this helps!

    Cheers,

    Faouzi BEN FREDJ

    P.S.: Sorry, I changed my display name from aixadmin_fbf to fbenfredj.