Topic
  • 7 replies
  • Latest Post - ‏2011-05-05T12:02:36Z by SystemAdmin
SystemAdmin
SystemAdmin
706 Posts

Pinned topic POWER7 PMCs: Are the events specific to a hardware thread?

‏2011-04-29T17:43:16Z |
Received in an email and paraphrased here...

We're looking at the POWER7 performance monitor counters and were wondering whether some events are specific to a particular hardware thread (recognizing there can be up to 4 hardware threads per core) while other events are specific to a full POWER7 core, or whether they are all hardware-thread based.
Updated on 2011-05-05T12:02:36Z by SystemAdmin
  • SystemAdmin
    SystemAdmin
    706 Posts

    Re: POWER7 PMCs: Are the events specific to a hardware thread?

    ‏2011-04-29T20:24:28Z  
    It appears that a majority of the events in the PMU are thread level, but there are also core level events (some L2, all L3 events, and some other core level events). There are also some chip level events.

    Examples are being worked.
  • SystemAdmin
    SystemAdmin
    706 Posts

    Re: POWER7 PMCs: Are the events specific to a hardware thread?

    ‏2011-05-03T21:24:36Z  
    One suggestion received: try using perf to catch the L3 event on a single core when in SMT=4 mode.

    Try starting all of these at the same time (load_L3 being some new program that's likely to generate a lot of L3 cache misses):

    perf stat -C 0 -e PM_L3_MISS /bin/taskset -c 0 load_L3
    perf stat -C 1 -e PM_L3_MISS sleep 10
    perf stat -C 2 -e PM_L3_MISS sleep 10
    perf stat -C 3 -e PM_L3_MISS sleep 10

    Then see if you get the same L3 counts on all of these.  You'd need to know how long the load_L3 program ran and calibrate the sleep time to be about the same (10 seconds was used in this example).

    Caveat: you'd need to build perf from 2.6.36 or newer source to be able to use the "-C" option. A perf built that way will work with older kernels
    (2.6.32ish) as well as newer ones.
  • SystemAdmin
    SystemAdmin
    706 Posts

    Re: POWER7 PMCs: Are the events specific to a hardware thread?

    ‏2011-05-03T21:26:54Z  
    Since PM_L3_MISS isn't a defined perf event, you will need to use the raw event code, which means installing the libpfm4 library and using its showevtinfo tool to look up the code.

    Example to follow.
  • SystemAdmin
    SystemAdmin
    706 Posts

    Re: POWER7 PMCs: Are the events specific to a hardware thread?

    ‏2011-05-04T18:23:38Z  
    From a peer...

    Using the libpfm-3.10 showevtinfo I got this:

    Name : PM_L3_MISS
    Desc : L3 Misses
    Code : 0x1f082
    Counters : 0
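
As a small illustration of how the code reported by showevtinfo maps onto the raw event syntax that perf accepts (`-e r<hex>`), here is a minimal sketch. The helper name `raw_event` is made up for this example and is not part of perf or libpfm.

```bash
#!/bin/bash
# raw_event CODE: turn a showevtinfo code such as 0x1f082 into a perf
# raw event spec (the letter r followed by the hex digits).
# Hypothetical helper, for illustration only.
raw_event() {
    printf 'r%x\n' "$1"
}

raw_event 0x1f082    # prints r1f082
```

The result is what gets passed to perf, as in `perf stat -e r1f082 ...`.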

    Then I created a simple program that allocates N * 4MB of
    memory and iterates through it sequentially. I added two
    arguments to this program: the memory size to allocate and the number
    of times (ntimes) to run over the allocated array.

    I assume that, using an array of the core L3's size, it will trigger
    N times the PM_L3_MISS and the total shouldn't change if I increase
    the ntimes the program runs on it.

    Indeed it was what I saw:

    # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 1 -n 1

    Allocating 4194304 bytes and iteration 1 over it

    Performance counter stats for '/bin/taskset -c 0 ./l3load -c 1 -n 1':

    86,832 raw 0x1f082

    0.085471983 seconds time elapsed
    # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 1 -n 2

    Allocating 4194304 bytes and iteration 2 over it

    Performance counter stats for '/bin/taskset -c 0 ./l3load -c 1 -n 2':

    82,220 raw 0x1f082

    0.169126166 seconds time elapsed

    # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 1 -n 3

    Allocating 4194304 bytes and iteration 3 over it

    Performance counter stats for '/bin/taskset -c 0 ./l3load -c 1 -n 3':

    84,940 raw 0x1f082

    0.252955443 seconds time elapsed

    Another test I tried, to check whether the events are reported correctly, was to
    allocate an array larger than the total L3 cache on the chip (in this case, 32MB)
    and iterate over it:

    # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 10 -n 5

    Allocating 41943040 bytes and iteration 5 over it

    Performance counter stats for '/bin/taskset -c 0 ./l3load -c 10 -n 5':

    604,054 raw 0x1f082

    4.201362030 seconds time elapsed

    # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 10 -n 10

    Allocating 41943040 bytes and iteration 10 over it

    Performance counter stats for '/bin/taskset -c 0 ./l3load -c 10 -n 10':

    660,550 raw 0x1f082

    8.394621691 seconds time elapsed

    # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 10 -n 15

    Allocating 41943040 bytes and iteration 15 over it

    Performance counter stats for '/bin/taskset -c 0 ./l3load -c 10 -n 15':

    708,350 raw 0x1f082

    12.588042405 seconds time elapsed
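
The byte counts printed by l3load in these runs are consistent with the -c argument being a multiplier of 4MB. The l3load source is not shown in the thread, so that reading is an assumption, and the helper name `alloc_bytes` is made up here. A quick check of the arithmetic:

```bash
#!/bin/bash
# alloc_bytes N: bytes allocated for "-c N", assuming -c counts 4MB units.
# Hypothetical helper; l3load itself is not shown in the thread.
alloc_bytes() {
    echo $(( $1 * 4 * 1024 * 1024 ))
}

for n in 1 8 10; do
    echo "-c $n -> $(alloc_bytes "$n") bytes"
done
# -c 1 -> 4194304 bytes
# -c 8 -> 33554432 bytes
# -c 10 -> 41943040 bytes
```

Under that assumption, -c 8 comes out at exactly 32MB (32 * 1024 * 1024 = 33554432), the figure quoted for the total on-chip L3, and -c 10 exceeds it.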

    OK, now that I have some confirmation that both my program and the events
    look good, I created this script:

    #!/bin/bash

    PROGRAM="./l3load"
    EVENT="1f082"
    SLEEP="10"

    perf stat -C 0 -e r$EVENT /bin/taskset -c 0 $PROGRAM -c 8 &
    perf stat -C 1 -e r$EVENT sleep $SLEEP &
    perf stat -C 2 -e r$EVENT sleep $SLEEP &
    perf stat -C 3 -e r$EVENT sleep $SLEEP &

    And then ran it:

    # ./run.sh
    # Allocating 33554432 bytes and iteration 16 over it

    Performance counter stats for 'sleep 10':

    8,830 raw 0x1f082

    10.000688096 seconds time elapsed
    Performance counter stats for 'sleep 10':

    9,284 raw 0x1f082

    10.000630696 seconds time elapsed
    Performance counter stats for 'sleep 10':

    10,894 raw 0x1f082

    10.000696278 seconds time elapsed
    Performance counter stats for '/bin/taskset -c 0 ./l3load -c 8':

    575,785 raw 0x1f082

    10.740074627 seconds time elapsed

    Looks like the PM_L3_MISS counter is per thread: with the machine in
    SMT4 mode, the three idle threads report far lower counts than the loaded one.
  • SystemAdmin
    SystemAdmin
    706 Posts

    Re: POWER7 PMCs: Are the events specific to a hardware thread?

    ‏2011-05-04T18:31:26Z  
    Ugh, that didn't format well. Let's try that again.

    Using the libpfm-3.10 showevtinfo I got this:

    
    Name     : PM_L3_MISS
    Desc     : L3 Misses
    Code     : 0x1f082
    Counters : [ 0 ]
    


    Then I created a simple program that allocates N * 4MB of memory and iterates through it sequentially. I added two arguments to this program: the memory size to allocate and the number of times (ntimes) to run over the allocated array. I assume that, using an array of the core L3's size, it will trigger N times the PM_L3_MISS and the total shouldn't change if I increase the ntimes the program runs on it.

    Indeed it was what I saw:

    
    # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 1 -n 1
    Allocating 4194304 bytes and iteration 1 over it

    Performance counter stats for '/bin/taskset -c 0 ./l3load -c 1 -n 1':

    86,832 raw 0x1f082

    0.085471983 seconds time elapsed

    # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 1 -n 2
    Allocating 4194304 bytes and iteration 2 over it

    Performance counter stats for '/bin/taskset -c 0 ./l3load -c 1 -n 2':

    82,220 raw 0x1f082

    0.169126166 seconds time elapsed

    # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 1 -n 3
    Allocating 4194304 bytes and iteration 3 over it

    Performance counter stats for '/bin/taskset -c 0 ./l3load -c 1 -n 3':

    84,940 raw 0x1f082

    0.252955443 seconds time elapsed
    


    Another test I tried, to check whether the events are reported correctly, was to
    allocate an array larger than the total L3 cache on the chip (in this case, 32MB)
    and iterate over it:

    
    # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 10 -n 5
    Allocating 41943040 bytes and iteration 5 over it

    Performance counter stats for '/bin/taskset -c 0 ./l3load -c 10 -n 5':

    604,054 raw 0x1f082

    4.201362030 seconds time elapsed

    # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 10 -n 10
    Allocating 41943040 bytes and iteration 10 over it

    Performance counter stats for '/bin/taskset -c 0 ./l3load -c 10 -n 10':

    660,550 raw 0x1f082

    8.394621691 seconds time elapsed

    # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 10 -n 15
    Allocating 41943040 bytes and iteration 15 over it

    Performance counter stats for '/bin/taskset -c 0 ./l3load -c 10 -n 15':

    708,350 raw 0x1f082

    12.588042405 seconds time elapsed
    


    Now that I have some confirmation that both my program and the events look good, I created this script:

    
    #!/bin/bash

    PROGRAM="./l3load"
    EVENT="1f082"
    SLEEP="10"

    perf stat -C 0 -e r$EVENT /bin/taskset -c 0 $PROGRAM -c 8 &
    perf stat -C 1 -e r$EVENT sleep $SLEEP &
    perf stat -C 2 -e r$EVENT sleep $SLEEP &
    perf stat -C 3 -e r$EVENT sleep $SLEEP &
    


    And then ran it:

    
    # ./run.sh
    # Allocating 33554432 bytes and iteration 16 over it

    Performance counter stats for 'sleep 10':

    8,830 raw 0x1f082

    10.000688096 seconds time elapsed

    Performance counter stats for 'sleep 10':

    9,284 raw 0x1f082

    10.000630696 seconds time elapsed

    Performance counter stats for 'sleep 10':

    10,894 raw 0x1f082

    10.000696278 seconds time elapsed

    Performance counter stats for '/bin/taskset -c 0 ./l3load -c 8':

    575,785 raw 0x1f082

    10.740074627 seconds time elapsed
    


    Conclusion from this data: it looks like the PM_L3_MISS counter is per thread, since the machine is in SMT4 mode.
  • SystemAdmin
    SystemAdmin
    706 Posts

    Re: POWER7 PMCs: Are the events specific to a hardware thread?

    ‏2011-05-05T11:58:58Z  
    A response to the experiment above:

    I'm not convinced this is proof of the L3 events being thread-specific.

    That you are getting different, much lower counts on the idle CPUs may not mean anything. It could be that their counters are halted while idle.

    Another experiment would be to place an infinite-loop load on three of the four threads, and the L3 cache load on the fourth, to make sure that the CPUs are occupied and counting all of the time.

    
    perf stat -C 0 -e r$EVENT /bin/taskset -c 0 $PROGRAM -c 8 &
    perf stat -C 1 -e r$EVENT /bin/taskset -c 1 $SPIN_LOAD &
    perf stat -C 2 -e r$EVENT /bin/taskset -c 2 $SPIN_LOAD &
    perf stat -C 3 -e r$EVENT /bin/taskset -c 3 $SPIN_LOAD &
    


    Here $SPIN_LOAD is a program that loops doing nothing for about as long as it takes to run $PROGRAM -c 8.
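
The thread does not show the spin program itself. As a rough stand-in, a busy loop along the following lines would keep a hardware thread occupied for a given number of seconds. This is only a sketch: the function name `spin` and the inner iteration count are arbitrary choices for illustration, and a compiled loop would do the same job with fewer side effects.

```bash
#!/bin/bash
# spin SECS: keep the CPU busy for roughly SECS seconds without sleeping.
# Hypothetical stand-in for the spin load; not the program used in the thread.
spin() {
    local end i
    end=$(( $(date +%s) + $1 ))
    while [ "$(date +%s)" -lt "$end" ]; do
        # Burn cycles in the shell between clock checks.
        i=0
        while [ "$i" -lt 10000 ]; do
            i=$(( i + 1 ))
        done
    done
}
```

Saved as a script that ends with a call such as `spin "$1"`, it could be pinned with /bin/taskset in the same way as the commands above.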
  • SystemAdmin
    SystemAdmin
    706 Posts

    Re: POWER7 PMCs: Are the events specific to a hardware thread?

    ‏2011-05-05T12:02:36Z  
    And the experiment, re-run:


    
    # sh -x run.sh
    + PROGRAM=./l3load
    + SPIN=./spin
    + EVENT=1f082
    + test2
    + PROGPID=6820
    + perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 8
    + PSLEEP1=6821
    + perf stat -C 1 -e r1f082 /bin/taskset -c 1 ./spin -t 12
    + PSLEEP2=6822
    + perf stat -C 2 -e r1f082 /bin/taskset -c 2 ./spin -t 12
    + PSLEEP3=6823
    + perf stat -C 3 -e r1f082 /bin/taskset -c 3 ./spin -t 12
    + wait 6820
    + wait 6821
    + wait 6822
    + wait 6823
    + cat ./l3load.out
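
The revised run.sh is only visible through its trace, so the following is a guess at its shape rather than the actual script: each perf run goes to the background with its report saved to a file, and the script waits for all of them before printing. The output file names and the `-t 12` argument come from the trace; the `measure` and `main` functions and the PERF and OUTDIR variables are hypothetical additions (PERF lets the script be dry-run with PERF=echo).

```bash
#!/bin/bash
# Sketch only: one way a run script could produce a trace like the one shown.
PROGRAM=${PROGRAM:-./l3load}
SPIN=${SPIN:-./spin}
EVENT=${EVENT:-1f082}
PERF=${PERF:-perf}        # hypothetical knob: set PERF=echo for a dry run
OUTDIR=${OUTDIR:-.}       # hypothetical knob: where the reports are written

# measure CPU OUTFILE COMMAND...: count the raw event on one logical CPU
# while COMMAND runs pinned to that CPU, in the background.
# perf stat writes its report to stderr, hence the 2>&1.
measure() {
    local cpu=$1 out=$2
    shift 2
    "$PERF" stat -C "$cpu" -e "r$EVENT" /bin/taskset -c "$cpu" "$@" \
        > "$OUTDIR/$out" 2>&1 &
}

main() {
    measure 0 l3load.out "$PROGRAM" -c 8
    measure 1 spin-1.out "$SPIN" -t 12
    measure 2 spin-2.out "$SPIN" -t 12
    measure 3 spin-3.out "$SPIN" -t 12
    wait    # let all four runs finish so their reports do not interleave
    cat "$OUTDIR/l3load.out" "$OUTDIR"/spin-[123].out
}
```

Calling `main` at the end of the script starts the four measurements together, which is what the experiment needs.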
    


    And the results show very similar counter values:

    Allocating 33554432 bytes and iteration 16 over it

    Performance counter stats for '/bin/taskset -c 0 ./l3load -c 8':

    623940 raw 0x1f082

    12.260903033 seconds time elapsed

    + cat ./spin-1.out

    Performance counter stats for '/bin/taskset -c 1 ./spin -t 12':

    597612 raw 0x1f082

    12.001629596 seconds time elapsed

    + cat ./spin-2.out

    Performance counter stats for '/bin/taskset -c 2 ./spin -t 12':

    588996 raw 0x1f082

    12.001452376 seconds time elapsed

    + cat ./spin-3.out

    Performance counter stats for '/bin/taskset -c 3 ./spin -t 12':

    575742 raw 0x1f082

    12.001382095 seconds time elapsed
    

    It looks like event counting is halted while a hardware thread is idle, which reinforces the idea that the L3 counters are core-specific.
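
To put the two runs side by side, the sibling threads' counts can be expressed as a percentage of the loaded thread's count. The numbers below are copied from the outputs in this thread; the helper name `pct` is made up for this sketch, and it truncates to whole percent.

```bash
#!/bin/bash
# pct COUNT BASE: COUNT as a whole-number percentage of BASE (truncated).
# Hypothetical helper for comparing the counts reported in this thread.
pct() {
    echo $(( $1 * 100 / $2 ))
}

echo "siblings idle (sleep), relative to the l3load thread's 575785:"
for c in 8830 9284 10894; do
    echo "  $c -> $(pct "$c" 575785)%"
done

echo "siblings spinning, relative to the l3load thread's 623940:"
for c in 597612 588996 575742; do
    echo "  $c -> $(pct "$c" 623940)%"
done
```

The idle siblings each come out at about 1% of the loaded thread's count, while the spinning siblings come out at 95%, 94% and 92%.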