nasica88

Low latency memcpy performance

‏2013-06-26T06:51:27Z |

I am testing low-latency memcpy performance on PowerLinux. I do not own the source code, but in this test one process writes to a shared memory segment and another reads from it, synchronized with semaphore signals.

I found that adding the following kernel parameters to the 'append' section in /boot/etc/yaboot.conf improved performance considerably.

" nohz=off intel_idle.max_cstate=0 processor.max_cstate=0 cgroup_disable=memory nmi_watchdog=0 divider=4 nosoftlockup mce=ignore_ce"

As you can see, many of these are available only on Intel; I actually got this set of parameters from x86 people. I suspect I could get even better performance with kernel parameters specific to PowerPC.

Any experience with these, or any recommendations?

  • sjmunroe

    Re: Low latency memcpy performance

    ‏2013-06-27T18:52:33Z  

    nohz=off is valid for POWER, but the rest don't apply. Beyond that, I need to know more about what POWER hardware you have (the number of sockets and frames affects NUMA) and what Linux distribution you are running. A back-level distro may not have an optimized memcpy for your specific POWER (P6/P7/P7+) chip.

    Also, did you know about the Advance Toolchain and SDK for PowerLinux?

    http://www-304.ibm.com/webapp/set2/sas/f/lopdiags/sdklop.html

  • nasica88

    Re: Low latency memcpy performance

    ‏2013-06-27T23:57:56Z  

    I got a much better response time with "nohz=off highres=off cgroup_disable=memory nmi_watchdog=0 divider=4". Adding highres=off gave a clear improvement.

    I have a 7R2 with 16 cores at 4.2 GHz, no LPAR (one whole LPAR, I mean), RHEL 6.4 with AT6.0-4.

    Updated on 2013-07-02T01:36:33Z by nasica88
  • sjmunroe

    Re: Low latency memcpy performance

    ‏2013-07-01T19:11:16Z  

    I need specific and clear details to be of any help.

    Are these measurements on Intel or on the 7R2? It is still not clear what you are measuring and how you are measuring the results.

    From context, you have some shared memory and semaphores. You have not described how the code and data are distributed across the 2 nodes (6-8 cores per node) of the 7R2.

    Which kind of semaphore: POSIX or IPC? How much time are you spending in the kernel? If POSIX semaphores, are you using trylock (to stay out of the kernel)?

    As far as I know, only the nohz boot option of the list you gave is applicable on POWER.

    Also, did you recompile and link your test case with the Advance Toolchain? Just installing the AT will not change the behavior of existing applications.

    export PATH=/opt/at6.0/bin:$PATH

    then rebuild your application

  • nasica88

    Re: Low latency memcpy performance

    ‏2013-07-01T22:25:35Z  

    These measurements were on 7R2. 

    I used AT with the proper path, as you suggested.

    We are not binding the processes to any core or socket with the taskset command on the 7R2, nor on x86.

    However, I cannot answer the rest of your questions, because I do not have access to the source code of the customer's test case.