Topic
  • 3 replies
  • Latest Post - ‏2012-08-31T19:33:35Z by StevenMunroe
PowerLinuxFAQ
18 Posts

Pinned topic How do I tune a workload like sysbench prime?

‏2012-08-29T19:38:54Z |
I'm running sysbench on a Power7 server with the latest RHEL 6.3 release (with the CPU utilization fix installed).
 
When I first ran the test
     sysbench  --test=cpu --cpu-max-prime=200000 run
it took quite a while to finish (790 seconds), but then I realized it was running only a single thread.
 
So I specified 16 threads (I'm running on a 16-core Power7 server), and it went much faster: 50 seconds.
    sysbench --num-threads=16 --test=cpu --cpu-max-prime=200000 run 
 
I would like to see if I can optimize and tune the code to run even faster on Power7. Is there a standard approach for understanding what this code is doing? Is the code even optimized for Power systems? Any ideas?
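
As a rough check on the two timings above, the scaling from 1 to 16 threads can be worked out with a couple of helper functions. This is only a sketch: the function names are hypothetical, and the inputs are the 790 and 50 second figures quoted in this post.

```c
/* Speedup of the multi-threaded run relative to the single-thread run. */
double speedup(double t_single, double t_parallel)
{
    return t_single / t_parallel;
}

/* Fraction of ideal linear scaling achieved with the given thread count. */
double parallel_efficiency(double t_single, double t_parallel, int threads)
{
    return speedup(t_single, t_parallel) / threads;
}
```

Calling speedup(790.0, 50.0) gives 15.8, and parallel_efficiency(790.0, 50.0, 16) gives roughly 0.99, so the 16-thread run is already close to linear scaling and any further gain would have to come from making each thread faster.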
 
  • Bill_Buros
    190 Posts

    Re: How do I tune a workload like sysbench prime?

    ‏2012-08-29T20:34:42Z  
    The three things I usually start with are:
    1. Make sure I understand what frequency the system is running at: ppc64_cpu --frequency (if the system is running slower than I expect, that'll be an easy fix).
    2. Watch "top" (press 1) to see how many threads are running and on which CPUs.
    3. Make sure the LPAR I'm running in has the right NUMA and memory configuration and isn't sharing my CPU cycles. Sharing isn't bad, but if I'm measuring performance, it won't necessarily give an accurate assessment of the server.
    The next step in my realm would be to run lpcpu.sh 

    There are people on our teams here who would also advocate the IBM SDK. I'll let them weigh in on that.
  • JayFurmanek
    115 Posts

    Re: How do I tune a workload like sysbench prime?

    ‏2012-08-31T07:52:48Z  
     One interesting thing about Sysbench is that it is open source, so we can take a look at what it's actually doing and make some informed decisions.
     
    A quick look at the code shows the entirety of the sysbench Prime CPU benchmark is contained in a single 6-line nested for loop (which is replicated per thread assigned at runtime).
    The main calculation is a call to sqrt().
     
    You can deduce a couple of things right away by looking at the code:
        1. A small set of execution resources on the CPU will be in heavy use. Such a narrow use case will likely negate the benefit of SMT. SMT routes more work threads to the same physical core in an attempt to use otherwise idle execution resources. Since this workload has such a narrow scope, adding more contention for the same small set of execution resources likely won't help. Workloads like this can be thought of as 'core bound'. Try ppc64_cpu --smt=off to turn off SMT.
        2. The 'hot spot' in the code is rather obvious. Compiler optimizations might be worth looking into.
     
    Bill, what sort of build-time optimizations are recommended for workloads like this? 
     
  • StevenMunroe
    7 Posts

    Re: How do I tune a workload like sysbench prime?

    ‏2012-08-31T19:33:35Z  
     The first thing I would do is profile the benchmark and find out where it is spending its time.
     
    You can use oprofile, perf, or even load the sysbench source into the IBM SDK for PowerLinux and let it run oprofile with full source.

    The profile will tell you whether the application is busy in the kernel/libpthread due to thread/lock contention or actually in the application/libm computing sqrt().

    Heavy kernel execution tells you one thing and heavy libm execution another. Reducing SMT/thread contention should reduce the kernel overhead.

    Improving the compiler/math library would help both the libpthread and libm execution time. RHEL6 does not ship POWER7-tuned libraries, so libpthread/libm will not be optimal out of the box. The IBM Advance Toolchain (AT) 5.0 and 6.0 do include POWER7 CPU runtime libraries. These will be selected automatically if you compile/link the application with AT.

    Finally, the POSIX-mandated sqrt() sets errno for out-of-range (negative) arguments. I assume that sysbench is not checking errno, so compiling with -ffast-math should bypass the errno support and directly inline the hardware fsqrt instruction.