Topic
7 replies Latest Post - ‏2011-05-05T12:02:36Z by SystemAdmin
SystemAdmin
SystemAdmin
706 Posts
ACCEPTED ANSWER

Pinned topic POWER7 PMCs: Are the events specific to a hardware thread?

‏2011-04-29T17:43:16Z |
Received in an email and paraphrased here...

We're looking at the POWER7 performance monitor counters and wondering whether some events are specific to a particular hardware thread (there can be up to 4 hardware threads per core) while others are specific to a full POWER7 core, or whether all events are hardware-thread based.
Updated on 2011-05-05T12:02:36Z by SystemAdmin
  • SystemAdmin

    Re: POWER7 PMCs: Are the events specific to a hardware thread?

    ‏2011-04-29T20:24:28Z  in response to SystemAdmin
    It appears that a majority of the events in the PMU are thread level, but there are also core level events (some L2, all L3 events, and some other core level events). There are also some chip level events.

    Examples are being worked.
    • SystemAdmin

      Re: POWER7 PMCs: Are the events specific to a hardware thread?

      ‏2011-05-03T21:24:36Z  in response to SystemAdmin
      One suggestion received: try using perf to catch the L3 event on a single core when in SMT=4 mode.

      Start all of these at the same time (load_L3 being some new program that's likely to generate a lot of L3 cache misses):

      perf stat -C 0 -e PM_L3_MISS /bin/taskset -c 0 load_L3
      perf stat -C 1 -e PM_L3_MISS sleep 10
      perf stat -C 2 -e PM_L3_MISS sleep 10
      perf stat -C 3 -e PM_L3_MISS sleep 10

      Then see if you get the same L3 counts on all of these.  You'd need to know how long the load_L3 program ran and calibrate the sleep time to be about the same (10 seconds was used in this example).

      Caveat: You'd need to build perf from 2.6.36 or newer source to be able to use the "-C" option. With that newer perf binary, the -C option will work on older kernels (2.6.32ish) as well as newer ones.
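
One way to calibrate the sleep time is to time a trial run of the load first. The sketch below is only an illustration: `elapsed_secs` is a hypothetical helper name, and it measures wall-clock time in whole seconds with `date`, which is coarse but enough for picking a sleep value.

```shell
#!/bin/sh
# Hypothetical helper: run a command and print how many whole
# seconds of wall-clock time it took (output of the command is discarded).
elapsed_secs() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Example (not run here; load_L3 is the placeholder program named above):
#   SLEEP=$(elapsed_secs /bin/taskset -c 0 ./load_L3)
```

The resulting value could then be used in place of the fixed 10 seconds.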
      • SystemAdmin

        Re: POWER7 PMCs: Are the events specific to a hardware thread?

        ‏2011-05-03T21:26:54Z  in response to SystemAdmin
        Since PM_L3_MISS isn't a defined perf event, you will need to use the raw event code, which means installing the libpfm4 library and using its showevtinfo utility to look up the code.

        Example to follow.
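
As a small illustration of the cross-referencing step: perf takes a raw hardware event as `r` followed by the hex code, so the code printed by showevtinfo only needs its `0x` prefix dropped. The helper name `raw_event` is hypothetical, and 0x1f082 is the PM_L3_MISS code reported elsewhere in this thread.

```shell
#!/bin/sh
# Hypothetical helper: turn a hex event code (as printed by showevtinfo)
# into the rNNN form that "perf stat -e" accepts for raw events.
raw_event() {
    code=${1#0x}        # strip a leading 0x, if present
    code=${code#0X}     # or a leading 0X
    printf 'r%s\n' "$code"
}

# Example (not run here):
#   perf stat -C 0 -e "$(raw_event 0x1f082)" sleep 10
```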
        • SystemAdmin

          Re: POWER7 PMCs: Are the events specific to a hardware thread?

          ‏2011-05-04T18:23:38Z  in response to SystemAdmin
          From a peer...

          Using the libpfm-3.10 showevtinfo I got this:

          Name : PM_L3_MISS
          Desc : L3 Misses
          Code : 0x1f082
          Counters : 0

          Then I created a simple program that allocates N * 4MB of
          memory and iterates through it sequentially. I added two
          arguments to this program: the memory size to allocate and the
          number of times (ntimes) to iterate over the allocated array.

          I assume that, with an array the size of the core's L3, the number
          of PM_L3_MISS events will scale with N, and the total shouldn't
          change if I increase ntimes.

          Indeed it was what I saw:

          # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 1 -n 1

          Allocating 4194304 bytes and iteration 1 over it

          Performance counter stats for '/bin/taskset -c 0 ./l3load -c 1 -n 1':

          86,832 raw 0x1f082

          0.085471983 seconds time elapsed

          # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 1 -n 2

          Allocating 4194304 bytes and iteration 2 over it

          Performance counter stats for '/bin/taskset -c 0 ./l3load -c 1 -n 2':

          82,220 raw 0x1f082

          0.169126166 seconds time elapsed

          # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 1 -n 3

          Allocating 4194304 bytes and iteration 3 over it

          Performance counter stats for '/bin/taskset -c 0 ./l3load -c 1 -n 3':

          84,940 raw 0x1f082

          0.252955443 seconds time elapsed

          Another test I tried, to check whether the event is reported correctly, was to
          allocate an array larger than the total L3 cache on the chip (in this case, 32MB)
          and iterate over it:

          # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 10 -n 5

          Allocating 41943040 bytes and iteration 5 over it

          Performance counter stats for '/bin/taskset -c 0 ./l3load -c 10 -n 5':

          604,054 raw 0x1f082

          4.201362030 seconds time elapsed

          # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 10 -n 10

          Allocating 41943040 bytes and iteration 10 over it

          Performance counter stats for '/bin/taskset -c 0 ./l3load -c 10 -n 10':

          660,550 raw 0x1f082

          8.394621691 seconds time elapsed

          # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 10 -n 15

          Allocating 41943040 bytes and iteration 15 over it

          Performance counter stats for '/bin/taskset -c 0 ./l3load -c 10 -n 15':

          708,350 raw 0x1f082

          12.588042405 seconds time elapsed

          OK, now that I have some confirmation that both my program and the events
          look good, I created this script:

          #!/bin/bash

          PROGRAM="./l3load"
          EVENT="1f082"
          SLEEP="10"

          perf stat -C 0 -e r$EVENT /bin/taskset -c 0 $PROGRAM -c 8 &
          perf stat -C 1 -e r$EVENT sleep $SLEEP &
          perf stat -C 2 -e r$EVENT sleep $SLEEP &
          perf stat -C 3 -e r$EVENT sleep $SLEEP &

          And then ran it:

          # ./run.sh
          # Allocating 33554432 bytes and iteration 16 over it

          Performance counter stats for 'sleep 10':

          8,830 raw 0x1f082

          10.000688096 seconds time elapsed
          Performance counter stats for 'sleep 10':

          9,284 raw 0x1f082

          10.000630696 seconds time elapsed
          Performance counter stats for 'sleep 10':

          10,894 raw 0x1f082

          10.000696278 seconds time elapsed
          Performance counter stats for '/bin/taskset -c 0 ./l3load -c 8':

          575,785 raw 0x1f082

          10.740074627 seconds time elapsed

          It looks like the PM_L3_MISS counter is per thread, given that the machine is
          in SMT4 mode.
          • SystemAdmin

            Re: POWER7 PMCs: Are the events specific to a hardware thread?

            ‏2011-05-04T18:31:26Z  in response to SystemAdmin
            Ugh, that didn't format well. Let's try that again.

            using the libpfm-3.10 showevtinfo I got this:

            
            Name     : PM_L3_MISS
            Desc     : L3 Misses
            Code     : 0x1f082
            Counters : [ 0 ]
            


            Then I created a simple program that allocates N * 4MB of memory and iterates through it sequentially. I added two arguments to this program: the memory size to allocate and the number of times (ntimes) to iterate over the allocated array. I assume that, with an array the size of the core's L3, the number of PM_L3_MISS events will scale with N, and the total shouldn't change if I increase ntimes.

            Indeed it was what I saw:

            
            # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 1 -n 1
            Allocating 4194304 bytes and iteration 1 over it

            Performance counter stats for '/bin/taskset -c 0 ./l3load -c 1 -n 1':

            86,832 raw 0x1f082

            0.085471983 seconds time elapsed

            # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 1 -n 2
            Allocating 4194304 bytes and iteration 2 over it

            Performance counter stats for '/bin/taskset -c 0 ./l3load -c 1 -n 2':

            82,220 raw 0x1f082

            0.169126166 seconds time elapsed

            # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 1 -n 3
            Allocating 4194304 bytes and iteration 3 over it

            Performance counter stats for '/bin/taskset -c 0 ./l3load -c 1 -n 3':

            84,940 raw 0x1f082

            0.252955443 seconds time elapsed
            


            Another test I tried, to check whether the event is reported correctly, was to
            allocate an array larger than the total L3 cache on the chip (in this case, 32MB)
            and iterate over it:

            
            # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 10 -n 5
            Allocating 41943040 bytes and iteration 5 over it

            Performance counter stats for '/bin/taskset -c 0 ./l3load -c 10 -n 5':

            604,054 raw 0x1f082

            4.201362030 seconds time elapsed

            # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 10 -n 10
            Allocating 41943040 bytes and iteration 10 over it

            Performance counter stats for '/bin/taskset -c 0 ./l3load -c 10 -n 10':

            660,550 raw 0x1f082

            8.394621691 seconds time elapsed

            # perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 10 -n 15
            Allocating 41943040 bytes and iteration 15 over it

            Performance counter stats for '/bin/taskset -c 0 ./l3load -c 10 -n 15':

            708,350 raw 0x1f082

            12.588042405 seconds time elapsed
            


            Now that I have some confirmation that both my program and the events look good, I created this script:

            
            #!/bin/bash

            PROGRAM="./l3load"
            EVENT="1f082"
            SLEEP="10"

            perf stat -C 0 -e r$EVENT /bin/taskset -c 0 $PROGRAM -c 8 &
            perf stat -C 1 -e r$EVENT sleep $SLEEP &
            perf stat -C 2 -e r$EVENT sleep $SLEEP &
            perf stat -C 3 -e r$EVENT sleep $SLEEP &
            


            And then ran it:

            
            # ./run.sh
            # Allocating 33554432 bytes and iteration 16 over it

            Performance counter stats for 'sleep 10':

            8,830 raw 0x1f082

            10.000688096 seconds time elapsed

            Performance counter stats for 'sleep 10':

            9,284 raw 0x1f082

            10.000630696 seconds time elapsed

            Performance counter stats for 'sleep 10':

            10,894 raw 0x1f082

            10.000696278 seconds time elapsed

            Performance counter stats for '/bin/taskset -c 0 ./l3load -c 8':

            575,785 raw 0x1f082

            10.740074627 seconds time elapsed
            


            Conclusion from this data: it looks like the PM_L3_MISS counter is per thread, given that the machine is in SMT4 mode.
          • SystemAdmin

            Re: POWER7 PMCs: Are the events specific to a hardware thread?

            ‏2011-05-05T11:58:58Z  in response to SystemAdmin
            A response to the experiment above:


            I'm not convinced this is proof of the L3 events being thread-specific.

            That you are getting different, much lower counts on the CPUs that are idle may not mean anything. It could be that their counters are halted while idle.

            Another experiment would be to place an infinite-loop load on three of the four threads, and the L3 cache load on the fourth, to make sure that the CPUs are occupied and counting all of the time.

            
            perf stat -C 0 -e r$EVENT /bin/taskset -c 0 $PROGRAM -c 8 &
            perf stat -C 1 -e r$EVENT /bin/taskset -c 1 $SPIN_LOAD &
            perf stat -C 2 -e r$EVENT /bin/taskset -c 2 $SPIN_LOAD &
            perf stat -C 3 -e r$EVENT /bin/taskset -c 3 $SPIN_LOAD &
            


            Here $SPIN_LOAD is a program that loops doing nothing for about as long as it takes to run $PROGRAM -c 8.
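
$SPIN_LOAD itself is not shown in the thread. A minimal stand-in, assuming a plain shell busy loop is an acceptable "do nothing" load, might look like the sketch below. The function name `spin_for` and the batch size of 100000 are hypothetical choices; the clock is checked only once per batch so the loop spends nearly all of its time in shell arithmetic rather than forking `date`.

```shell
#!/bin/sh
# Hypothetical spin load: keep the CPU busy for roughly N seconds.
# Timing has one-second granularity because it relies on "date +%s".
spin_for() {
    duration=$1
    end=$(( $(date +%s) + duration ))
    while [ "$(date +%s)" -lt "$end" ]; do
        i=0
        # Busy work: a batch of pure shell arithmetic, no I/O and no sleeping.
        while [ "$i" -lt 100000 ]; do
            i=$((i + 1))
        done
    done
}

# Example (not run here): pin the spinner to hardware thread 1 for ~12 seconds,
# assuming this function were saved in a script called ./spin.sh that calls it:
#   /bin/taskset -c 1 ./spin.sh 12
```

A compiled loop would be a closer match to a program that truly does nothing, since the shell interpreter has its own memory footprint and could contribute some cache traffic.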
            • SystemAdmin

              Re: POWER7 PMCs: Are the events specific to a hardware thread?

              ‏2011-05-05T12:02:36Z  in response to SystemAdmin
              And the experiment, re-run:


              
              # sh -x run.sh
              + PROGRAM=./l3load
              + SPIN=./spin
              + EVENT=1f082
              + test2
              + PROGPID=6820
              + perf stat -C 0 -e r1f082 /bin/taskset -c 0 ./l3load -c 8
              + PSLEEP1=6821
              + perf stat -C 1 -e r1f082 /bin/taskset -c 1 ./spin -t 12
              + PSLEEP2=6822
              + perf stat -C 2 -e r1f082 /bin/taskset -c 2 ./spin -t 12
              + PSLEEP3=6823
              + perf stat -C 3 -e r1f082 /bin/taskset -c 3 ./spin -t 12
              + wait 6820
              + wait 6821
              + wait 6822
              + wait 6823
              + cat ./l3load.out
              


              And the results show very similar counter values:

              Allocating 33554432 bytes and iteration 16 over it

              Performance counter stats for '/bin/taskset -c 0 ./l3load -c 8':

              623940 raw 0x1f082

              12.260903033 seconds time elapsed

              + cat ./spin-1.out

              Performance counter stats for '/bin/taskset -c 1 ./spin -t 12':

              597612 raw 0x1f082

              12.001629596 seconds time elapsed

              + cat ./spin-2.out

              Performance counter stats for '/bin/taskset -c 2 ./spin -t 12':

              588996 raw 0x1f082

              12.001452376 seconds time elapsed

              + cat ./spin-3.out

              Performance counter stats for '/bin/taskset -c 3 ./spin -t 12':

              575742 raw 0x1f082

              12.001382095 seconds time elapsed
              

              It looks like event counting is halted while a hardware thread is idle, which reinforces the idea that the L3 counters are core specific.
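
To put a number on "very similar", one option is to compare the gap between the lowest and highest count in each run. The sketch below is an illustration only: `spread_pct` is a hypothetical helper, it uses integer arithmetic (so the result is truncated), and the counts in the comments are copied from the two runs in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print (max - min) as a truncated integer
# percentage of max, for a list of non-negative integer counts.
spread_pct() {
    min=$1
    max=$1
    for v in "$@"; do
        if [ "$v" -lt "$min" ]; then min=$v; fi
        if [ "$v" -gt "$max" ]; then max=$v; fi
    done
    if [ "$max" -eq 0 ]; then
        echo 0          # avoid dividing by zero when every count is 0
    else
        echo $(( (max - min) * 100 / max ))
    fi
}

# Busy threads (spin load on threads 1-3):
#   spread_pct 623940 597612 588996 575742    prints 7
# Idle threads (sleep on threads 1-3):
#   spread_pct 575785 8830 9284 10894         prints 98
```

A spread of about 7% with all threads busy, against about 98% with three threads idle, is consistent with the reading that the idle threads' counters were simply not running.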