dbrandes

Are there any tools for performance measure?

2011-01-20T22:22:56Z
Hi,

Are there any tools for measuring performance on Cell with OpenCL? For example, tools that show the utilization of the memory bus, or the cost of individual OpenCL commands?

I only know of spu-ps, spu-top, and the CL_QUEUE_PROFILING_ENABLE statistics,
but these are not very useful for finding bottlenecks in OpenCL kernels.

For the few kernels I tested, spu-top reports only about 1% processor load per SPU, and top likewise reports about 1% processor load for the PPU.

best regards,

Daniel Brandes
  • SystemAdmin

    Re: Are there any tools for performance measure?

    ‏2011-01-20T23:42:30Z  in response to dbrandes
    Daniel,

    The CPC (Cell Performance Counter) tool from the SDK for Cell should provide the kind of information that you seek. CPC was available for SDK 3.1 and was documented to run on RHEL 5.2 and later (see http://www.bsc.es/plantillaH.php?cat_id=575). I personally have not tried to use the tool with OpenCL, so I don't know how well it works.

    If you think you are having memory bandwidth problems on Cell, it is likely that your kernel accesses global memory directly. These accesses go through a software cache, which greatly limits performance. It is recommended that async_work_group_copy be used to block-move data between global and local memory. If you are using async_work_group_copy to move data, it is important that the transfers are correctly aligned and optimally sized for the SPE's MFC (see the Cell Programmer's Guide for details). In OpenCL, the async_work_group_copy built-ins must be called with a pointer type that is guaranteed to be at least quad-word aligned according to the OpenCL data type alignment rules (see section 6.1.5 of the OpenCL specification). Lastly, it is easy to inadvertently fail to wait on all events. If that happens, the OpenCL runtime will run out of DMA tag IDs and you may lose the memory-latency hiding that double buffering provides. Printing the event_t values (which correspond to MFC tag IDs) will tell you whether that is happening.
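As a rough illustration of the alignment and size advice above, here is a minimal sketch of a plain C helper that classifies one transfer. It encodes the general MFC DMA rules from the Cell documentation: sizes of 1, 2, 4, or 8 bytes, or a multiple of 16 bytes up to 16 KB, with the best case being 128-byte aligned addresses and a size that is a multiple of 128 bytes. The names classify_dma and dma_quality are hypothetical, and how the OpenCL runtime maps an async_work_group_copy onto MFC transfers is an assumption, so treat this as a way to sanity-check buffer offsets and chunk sizes, not as a description of the runtime.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification of a single DMA transfer. */
typedef enum {
    DMA_INVALID = 0,  /* violates the MFC size or alignment rules  */
    DMA_VALID   = 1,  /* legal, but not the best-case shape        */
    DMA_OPTIMAL = 2   /* 128-byte aligned, multiple of 128 bytes   */
} dma_quality;

/* Hypothetical helper, not part of any SDK: checks one transfer
 * between a local-store address and a global (effective) address. */
static dma_quality classify_dma(uintptr_t local_addr,
                                uintptr_t global_addr,
                                size_t size)
{
    /* A single MFC transfer moves at most 16 KB. */
    if (size == 0 || size > 16384)
        return DMA_INVALID;

    if (size < 16) {
        /* Small transfers: only 1, 2, 4, or 8 bytes are allowed. */
        if (size != 1 && size != 2 && size != 4 && size != 8)
            return DMA_INVALID;
        /* Both addresses must be naturally aligned to the size ... */
        if (local_addr % size != 0 || global_addr % size != 0)
            return DMA_INVALID;
        /* ... and share the same offset within a quadword. */
        if ((local_addr & 15) != (global_addr & 15))
            return DMA_INVALID;
        return DMA_VALID;
    }

    /* Larger transfers: size and both addresses quadword aligned. */
    if (size % 16 != 0 || local_addr % 16 != 0 || global_addr % 16 != 0)
        return DMA_INVALID;

    /* Best case: whole 128-byte lines on both sides. */
    if (size % 128 == 0 && local_addr % 128 == 0 && global_addr % 128 == 0)
        return DMA_OPTIMAL;

    return DMA_VALID;
}
```

Running the chunk size and starting offsets you pass to async_work_group_copy through a check like this can show whether a kernel is issuing legal but poorly shaped transfers.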

    Dan B.
  • SystemAdmin

    Re: Are there any tools for performance measure?

    ‏2011-01-20T23:54:26Z  in response to dbrandes
    I failed to mention that OpenCL has a profiling feature that provides timing information for enqueued commands. The sample program "fluid" demonstrates this feature by profiling the execution time of each of the 8 kernels used by the demo.
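For reference, a minimal sketch of how the profiling timestamps can be turned into numbers. In the standard OpenCL API, a queue created with CL_QUEUE_PROFILING_ENABLE lets you call clGetEventProfilingInfo on a completed event with CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START, and CL_PROFILING_COMMAND_END, each returning a cl_ulong counter in nanoseconds. The helper below is hypothetical and takes plain 64-bit values so that it does not depend on the OpenCL headers; it is not taken from the "fluid" sample.

```c
#include <stdint.h>

/* Hypothetical helper: elapsed time in milliseconds between two
 * profiling timestamps given in nanoseconds. The values are assumed
 * to come from clGetEventProfilingInfo, for example:
 *
 *   clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
 *                           sizeof(cl_ulong), &start_ns, NULL);
 *   clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
 *                           sizeof(cl_ulong), &end_ns, NULL);
 */
static double elapsed_ms(uint64_t from_ns, uint64_t to_ns)
{
    /* Guard against swapped or unset timestamps. */
    if (to_ns < from_ns)
        return 0.0;
    return (double)(to_ns - from_ns) / 1.0e6;
}
```

Calling elapsed_ms(start_ns, end_ns) gives the kernel execution time, while elapsed_ms(queued_ns, start_ns) gives the time a command spent waiting before it ran. Comparing the two can help tell whether time is going into the kernel itself or into queueing and launch overhead.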

    Dan B.