Comments (7)

1 RFS commented Permalink

A couple of questions:

1) Does the run queue count include the currently executing threads, or just those waiting for CPU resources?

2) In this example, should I only be concerned if the run queue is > 300 for a sustained period?

2 nagger commented Permalink

On AIX, as reported by nmon, vmstat, topas and others, the run queue is the number of runnable processes that could use CPU cycles right now. Some of these will be on the CPUs and some waiting to get CPU time (assuming there are enough of them). For part 2): for maximum throughput you want the number to be greater than 300 (or the total number of SMT threads (CPU cores times 4 for SMT=4), also called the logical CPUs), so that if some of the processes on the CPUs hit something they have to wait for, like locks, disk I/O, network packet responses or memory to be freed, there are other process threads that can use the CPU cycles. More of a worry for me (as a performance speed junky) is a smaller run queue, like less than half the SMT threads = a badly designed application, not enough user activity, or an LPAR that is over-configured in CPU resources that could be used better elsewhere.

If you use the "ps -el" command and look at the WCHAN column, the processes with an item in this column are the ones that are not runnable and so are not on the run queue, as they are waiting for an event like terminal, disk or network I/O, or locks, or inter-process communication. The WCHAN is called the Wait or Sleep Channel and it is actually the AIX kernel address of the data structure on which the process waits for the resource-available event. When the kernel puts data in the resource, it puts the waiting processes back on the run queue so they can take action on the newly arrived data.

Hope that helps, Nigel Griffiths
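As a concrete way to get the "logical CPUs" figure to compare the run queue against, here is a tiny sketch (not from the original comment) using the libperfstat API that comes up later in this thread; the run-queue value itself still comes from vmstat, nmon or topas.

    /* Hedged sketch: print the number of active logical CPUs (SMT threads),
     * which is roughly what the run queue should reach for full throughput.
     * Compile on AIX with: cc runq_target.c -lperfstat                     */
    #include <stdio.h>
    #include <libperfstat.h>

    int main(void)
    {
        perfstat_cpu_total_t c;

        /* NULL name + count of 1 returns the system-wide totals */
        if (perfstat_cpu_total(NULL, &c, sizeof(c), 1) < 1) {
            perror("perfstat_cpu_total");
            return 1;
        }

        printf("Active logical CPUs (cores x SMT threads): %d\n", c.ncpus);
        printf("Run queue at or above this = good throughput;\n"
               "well under half of it = CPU cycles going spare.\n");
        return 0;
    }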

3 WollyJeeWilikers commented Permalink

We are currently running topas_nmon on some AIX 6.1 P5, P6, and P7 servers. Would you be able to tell us which API call is being made behind the scenes of topas_nmon that gets rolled up and creates the CPU_ALL view in the NMON Analyzer?

You mention looking at the Physical CPU cores consumed. Our system currently calls the perfstat_cpu_total subroutine, but when we graph this data over time (User, System, Idle), the graph is about half of the total % that the CPU_ALL tab provides. Should we be using the perfstat_cpu subroutine instead? We would ideally like to mimic the CPU_ALL tab stacked CPU graph.

Could you also elaborate a bit more on the difference between the CPU_ALL and PCPU_ALL tabs, and why the PCPU tabs don't always get generated when the NMON Analyzer is used on some AIX nodes (currently using 4.2)?

4 nagger commented Permalink

Dear WollyJeeWilikers, perfstat_cpu_total() and the perfstat_cpu_total_t data structure, using the puser, psys, pwait and pidle members. Perhaps you forgot to divide the results by the elapsed time. You also don't state if this is a shared or dedicated CPU LPAR - the maths is different for shared CPU, where you have to factor in the time the LPAR was not actually running on the CPU. In that case the user and system times are OK but you have to boost the wait and idle times to allow for the missing PURR counter increments. The boost is done in the ratio of the wait and idle, of course. The PCPU and SCPU stats were (in my humble opinion) a confusing mistake and are only useful if you have the CPUs in power-saving mode, i.e. it is changing the CPU GHz to save electrical power. I hope to have them removed or made an optional feature in the next AIX release. The developer that added them did not realise the volume of data they cause on larger machines. The maximum would be the new E880 with 192 CPUs, which would generate 3000+ lines of pointless stats every snapshot.
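To illustrate the delta-and-divide approach described above, here is a minimal sketch (not code from this thread) that takes two perfstat_cpu_total() snapshots on a dedicated-CPU LPAR and turns the puser/psys/pwait/pidle deltas into percentages of the interval; the shared-CPU adjustment to the wait and idle times mentioned above is not shown, and the 10-second interval is just an example.

    /* Minimal sketch, assuming a dedicated-CPU LPAR: percentages from the
     * PURR-based members of perfstat_cpu_total_t over one interval.
     * Compile on AIX with: cc cpu_pct.c -lperfstat                        */
    #include <stdio.h>
    #include <unistd.h>
    #include <libperfstat.h>

    int main(void)
    {
        perfstat_cpu_total_t t1, t2;

        /* NULL name + count of 1 returns the system-wide totals */
        if (perfstat_cpu_total(NULL, &t1, sizeof(t1), 1) < 1) return 1;
        sleep(10);                                 /* measurement interval */
        if (perfstat_cpu_total(NULL, &t2, sizeof(t2), 1) < 1) return 1;

        /* Deltas of the physical (PURR) tick counters over the interval */
        double du = (double)(t2.puser - t1.puser);
        double ds = (double)(t2.psys  - t1.psys);
        double dw = (double)(t2.pwait - t1.pwait);
        double di = (double)(t2.pidle - t1.pidle);
        double total = du + ds + dw + di;

        if (total > 0)
            printf("User%%=%.1f Sys%%=%.1f Wait%%=%.1f Idle%%=%.1f\n",
                   100.0 * du / total, 100.0 * ds / total,
                   100.0 * dw / total, 100.0 * di / total);
        return 0;
    }

Dividing each delta by the sum of all four is the same as dividing by the elapsed time expressed in PURR ticks, which is the "divide by the elapsed time" step that is easy to miss.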

5 WollyJeeWilikers commented Permalink

Thanks for the response. This is a dedicated LPAR in capped mode running SMT4 with 176 logical CPUs, based on the output from lparstat.

In the raw nmon file, I have 176 entries for CPU in the T0001. My question is how the User%, Sys%, and Wait% are computed on the CPU_ALL tab when this data is run through the nmon analyzer. I was using version 4.2 to process the raw file. I would post some of the raw data, but the formatting isn't very friendly. If I can get an understanding of how each of those T lines is computed for the CPU_ALL tab based on the raw values from the nmon output file, I think that will be the key to finding where our math is wrong when using the CPU ticks off the server via perfstat_cpu_total.

Here are some other stats from the raw nmon file (AAA):
progrname,topas_nmon
version,TOPAS_NMON
AIX,6.1.7.15
TL,07
interval,300
snapshots,288,
hardware,Architecture PowerPC Implementation POWER7_in_P7 mode 64 bit
cpus,192,176
kernel,HW-type=CHRP=Common H/W Reference Platform Bus=PCI LPAR=Dynamic Multi-Processor 64 bit

6 nagger commented Permalink

Hi Wolly, sorry, I don't understand your question. The nmon file is not "raw" but a text file that you can edit, and as far as I know the nmon Analyser just displays the CPU_ALL data lines collected by nmon. They are not calculated.

7 WollyJeeWilikers commented Permalink

My apologies, I thought the analyzer somehow used the time-slice (T0001-T0288) lines for each CPU (CPUXXX) to calculate the CPU_ALL entries for T0001-T0288, but I see what you are referencing now. I didn't realize that there was a specific T0001-T0288 entry for CPU_ALL in the nmon file, so it makes sense that the analyzer just organizes those and graphs them.

Do you know what math is being done behind the scenes to convert the CPU ticks into the percentages that show for each CPU_ALL line in the nmon file?

The data we are getting from perfstat_cpu_total seems to be cumulative CPU ticks at each collection, so this is the current way we calculate it.

Example:
Snapshot 1 - 04/08/15 00:04 AM - CPUUTIL - 714,130,120
Snapshot 2 - 04/08/15 00:09 AM - CPUUTIL - 714,134,768

This amounts to a 5-minute snapshot (300 seconds) and a delta of 4,648 CPU ticks. Since there are 176 logical CPUs, we multiply 176 * 300 seconds to get the maximum number of CPU seconds available during that time period: 52,800. We then divide the 4,648 CPU ticks by the 52,800 possible CPU ticks to get roughly 8.8% total CPU used. If I use the nmon analyzer for this same server and time period (00:05 AM - 00:10 AM), the CPU_ALL total CPU % value is 14.3%. This ratio continues to hold throughout the entire time range for the AIX nodes, where our calculations are about half of the total CPU % that the nmon analyzer gives us for the CPU_ALL value. I'm just trying to get an understanding of where our math is off and what we are not accounting for.

Thanks again for helping out.
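For what it's worth, here is the arithmetic from the comment above written out as a small sketch (mine, not from the thread). The one assumption it makes explicit, in the comments, is that one unit of the cumulative CPUUTIL counter equals one logical-CPU-second of busy time; if the counter actually accumulates in some other unit, that conversion is one place a constant factor like the observed 2x could creep in.

    /* Sketch reproducing the calculation described above.
     * Assumption (not confirmed in this thread): one CPUUTIL unit equals
     * one logical-CPU-second of busy time; any other unit needs a matching
     * conversion applied to the capacity term below.                       */
    #include <stdio.h>

    int main(void)
    {
        double snap1 = 714130120.0;        /* CPUUTIL at 00:04             */
        double snap2 = 714134768.0;        /* CPUUTIL at 00:09             */
        double interval_sec = 300.0;       /* 5-minute collection interval */
        double logical_cpus = 176.0;       /* from lparstat / nmon AAA     */

        double busy     = snap2 - snap1;                 /* 4,648          */
        double capacity = logical_cpus * interval_sec;   /* 52,800         */

        printf("Total CPU = %.1f%%\n", 100.0 * busy / capacity); /* ~8.8%  */
        return 0;
    }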