Topic
  • 3 replies
  • Latest Post - 2011-05-25T14:02:17Z by SystemAdmin

SystemAdmin

Pinned topic POWER7 PMCs: Are there limits on how many counters I can monitor at once?

2011-05-05T13:31:42Z
Looking at OProfile's listing of ppc64-power7-events, there are obviously a lot of performance counter events available with POWER7.

Are there system limits on how many counters can be monitored at once?

Which tools are recommended for monitoring performance counters?
Updated on 2011-05-25T14:02:17Z by SystemAdmin
  • SystemAdmin

    Re: POWER7 PMCs: Are there limits on how many counters I can monitor at once?

2011-05-10T17:41:11Z
    From a hardware perspective, the system can monitor up to four counters at a time per hardware thread. One approach that has been used in the past (especially for HPC-type workloads) is to assume each thread is doing essentially the same work and monitor a different set of four counters on each thread. All counters being monitored at once have to be within the same group.

    Looking for examples of this now.
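    The rotation idea described above (a different set of four counters on each hardware thread) can be sketched roughly like this; the event names and the helper function are purely illustrative, not real POWER7 events or a real tool API:

```python
# Hypothetical sketch: distribute a list of PMU event names across
# hardware threads, four at a time, so N threads together cover up to
# 4*N events. Event names below are placeholders, not real events.

def partition_events(events, counters_per_thread=4):
    """Split `events` into per-thread sets of at most `counters_per_thread`."""
    return [events[i:i + counters_per_thread]
            for i in range(0, len(events), counters_per_thread)]

events = ["EV%02d" % n for n in range(10)]   # 10 placeholder event names
sets = partition_events(events)
# Thread 0 would monitor sets[0], thread 1 sets[1], and so on; if the
# threads really do the same work, the union approximates one profile.
```

    The assumption that all threads behave identically is what makes the combined result meaningful; for heterogeneous workloads the per-thread sets are not comparable.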
  • jhopper

    Re: POWER7 PMCs: Are there limits on how many counters I can monitor at once?

2011-05-10T19:01:32Z
    You can profile for multiple events at once with OProfile; however, all events must be in the same group. Here is an example of the syntax:

    opcontrol -e PM_INST_FROM_DMEM_GRP26:1000 -e PM_INST_FROM_RMEM_GRP26:1000 -e PM_INST_FROM_LMEM_GRP26:1000

    You can also profile for multiple events using the Linux perf tool using a similar syntax. Perf creates its own counter groups based on the underlying hardware, so there are no pre-defined groups you need to worry about. Here is an example:

    perf stat -e cycles -e cache-misses sleep 10
  • SystemAdmin

    Re: POWER7 PMCs: Are there limits on how many counters I can monitor at once?

2011-05-25T14:02:17Z
    • jhopper
    • 2011-05-10T19:01:32Z
    Syntactically speaking, you are allowed to use a large number of event counters with the perf tool. However, depending on how perf is used (perf stat vs. perf record) and on your workload, you may see a detrimental system impact, such as your process slowing to a crawl or even your session or system hanging. For example, using a large number of event counters with "perf record" is known to cause problems, likely because "perf record" profiles at the function/symbol level rather than at the process level, so far more profiling activity is involved.

    To illustrate that a large number of event counters can be used, here is an example of using "perf stat" with 39 counters.

    perf stat -e L1-dcache-loads -e L1-dcache-load-misses -e L1-dcache-store-misses -e L1-dcache-prefetches -e L1-icache-load-misses -e L1-icache-prefetches -e LLC-loads -e LLC-load-misses -e LLC-stores -e LLC-store-misses -e dTLB-load-misses -e iTLB-load-misses -e branch-loads -e branch-load-misses -e r200f4 -e r4000a -e r20014 -e r40014 -e r40012 -e r20018 -e r2001c -e r4004a -e r20012 -e r40016 -e r40018 -e r20016 -e r2004a -e r1001c -e r4004c -e r4004e -e r100f8 -e r2001a -e r4001a -e r4001c -e r30004 -e r100f2 -e r200f4 -e r10083 -e r40084 ./STREAM

    STREAM version $Revision: 5.9 $

    This system uses 8 bytes per DOUBLE PRECISION word.

    Array size = 44739240, Offset = 0
    Total memory required = 1024.0 MB.
    Each test is run 10 times, but only
    the best time for each is used.

    Printing one line per active thread....

    Your clock granularity/precision appears to be 1 microseconds.
    Each test below will take on the order of 78913 microseconds.
    (= 78913 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.

    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.

    Function Rate (MB/s) Avg time Min time Max time
    Copy: 8744.0976 0.0820 0.0819 0.0822
    Scale: 8821.1432 0.0814 0.0811 0.0817
    Add: 11296.8428 0.0966 0.0950 0.0971
    Triad: 11055.7115 0.0979 0.0971 0.0982

    Solution Validates


    Performance counter stats for './STREAM':

    3,467,927,030 L1-dcache-loads (scaled from 10.42%)
    58,921,788 L1-dcache-load-misses (scaled from 5.54%)
    1,918,360,024 L1-dcache-store-misses (scaled from 8.33%)
    163,893,187 L1-dcache-prefetches (scaled from 11.10%)
    2,446,453 L1-icache-load-misses (scaled from 11.15%)
    1,723,908 L1-icache-prefetches (scaled from 11.12%)
    348,593,313 LLC-loads (scaled from 11.09%)
    358,284,845 LLC-load-misses (scaled from 8.30%)
    305,152,751 LLC-stores (scaled from 5.52%)
    226,041,768 LLC-store-misses (scaled from 5.50%)
    586,295 dTLB-load-misses (scaled from 8.24%)
    1,196 iTLB-load-misses (scaled from 10.95%)
    1,816,060,884 branch-loads (scaled from 10.93%)
    661,682 branch-load-misses (scaled from 5.45%)
    9,378,029,635 raw 0x200f4 (scaled from 8.15%)
    5,662,529,645 raw 0x4000a (scaled from 5.42%)
    29,802,770 raw 0x20014 (scaled from 8.11%)
    18,329,498 raw 0x40014 (scaled from 5.40%)
    3,630,877,990 raw 0x40012 (scaled from 2.69%)
    7,150 raw 0x20018 (scaled from 5.37%)
    151,441 raw 0x2001c (scaled from 2.68%)
    8,219 raw 0x4004a (scaled from 5.32%)
    1,890,752,076 raw 0x20012 (scaled from 5.07%)
    145,435,490 raw 0x40016 (scaled from 4.87%)
    93,005,581 raw 0x40018 (scaled from 2.66%)
    599,087,999 raw 0x20016 (scaled from 5.31%)
    494,864,738 raw 0x2004a (scaled from 2.65%)
    6,021 raw 0x1001c (scaled from 5.28%)
    224,474,374 raw 0x4004c (scaled from 7.90%)
    194,830,087 raw 0x4004e (scaled from 2.63%)
    69,674,184 raw 0x100f8 (scaled from 5.24%)
    47,473,682 raw 0x2001a (scaled from 7.84%)
    996,619 raw 0x4001a (scaled from 7.81%)
    1,078,279 raw 0x4001c (scaled from 2.60%)
    3,770,579,266 raw 0x30004 (scaled from 5.19%)
    3,774,508,284 raw 0x100f2 (scaled from 7.76%)
    8,982,693,653 raw 0x200f4 (scaled from 10.32%)
    35,812,764 raw 0x10083 (scaled from 5.15%)
    4,485,230,867 raw 0x40084 (scaled from 7.70%)

    4.115952176 seconds time elapsed
    When the same command is run with only a single counter (-e L1-dcache-loads), the STREAM results improve slightly (<0.5%), indicating nominal overhead for the "perf stat" command itself, but the measured "L1-dcache-loads" count increases by almost 7% (3,462,458,692 to 3,728,380,011).

    perf stat -e L1-dcache-loads ./STREAM

    STREAM version $Revision: 5.9 $

    This system uses 8 bytes per DOUBLE PRECISION word.

    Array size = 44739240, Offset = 0
    Total memory required = 1024.0 MB.
    Each test is run 10 times, but only
    the best time for each is used.

    Printing one line per active thread....

    Your clock granularity/precision appears to be 1 microseconds.
    Each test below will take on the order of 78228 microseconds.
    (= 78228 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.

    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.

    Function Rate (MB/s) Avg time Min time Max time
    Copy: 8761.7812 0.0819 0.0817 0.0821
    Scale: 8837.3710 0.0812 0.0810 0.0814
    Add: 11144.8749 0.0965 0.0963 0.0967
    Triad: 10988.8378 0.0979 0.0977 0.0980

    Solution Validates


    Performance counter stats for './STREAM':

    3,728,380,011 L1-dcache-loads

    4.108253438 seconds time elapsed
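    For what it's worth, the "(scaled from N%)" annotations in the 39-counter output come from counter multiplexing: with far more events than hardware counters, the kernel time-slices the PMU, and perf extrapolates each count from the fraction of the run during which that event was actually scheduled. A sketch of the arithmetic as I understand it (illustrative, not perf's actual code):

```python
# My understanding of perf's multiplexing extrapolation: when an event
# was only live on the PMU for part of the run, the reported count is
#     estimated = raw_count * time_enabled / time_running
# and the "(scaled from N%)" figure is time_running / time_enabled.

def scale_count(raw_count, time_enabled, time_running):
    """Extrapolate a partially-collected count to the full run."""
    if time_running == 0:
        return 0
    return int(raw_count * time_enabled / time_running)

# An event scheduled for half the run with 50 observed occurrences is
# reported as roughly 100:
est = scale_count(50, 100.0, 50.0)
```

    This also explains why the single-counter run above reports a different L1-dcache-loads figure: with no multiplexing there is no extrapolation, so the count is measured directly rather than estimated.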