Topic
  • 13 replies
  • Latest Post - ‏2011-06-21T13:28:47Z by SystemAdmin
SystemAdmin
SystemAdmin
131 Posts

Pinned topic Bug (?) with local_size[0] > 1 on POWER7, works with PPC970

‏2011-05-18T13:07:07Z |
Hello,

I think I found a bug. I have a code that has two versions, the 'real' one (using double-precision) and a 'test' one (using single-precision, to test machines w/o double precision support). This code is 'known to work' on many SDKs, so I think it's OK.

The single-precision (no DP support) works fine on a PowerStation with dual 970MP (4 cores), no matter what local_size[0] I use. It works fine on a PS702 (16 to 64 POWER7 cores, depending on SMT), as long as I use local_size[*] == 1 (and so does the DP version). The code uses only 1 dimension for global_size and local_size.

I only have trouble on the POWER7 system when I use a local_size[0] > 1. For both version, the symptom is that one value every local_size[0] is exact, and everything else is 0.0. Other tests in that code (printf are cool :-) show that only the first work_item of each work_group is computed, the others are never computed.

The only difference I see is that the PowerStation OpenCL is hacked to use 'ppu-xlcl' instead of 'xlcl' (see Generating for PowerPC 2.01 and not 2.03).

Is that a known problem with the POWER7 CL compiler?
Updated on 2011-06-21T13:28:47Z at 2011-06-21T13:28:47Z by SystemAdmin
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Bug (?) with local_size[0] > 1 on POWER7, works with PPC970

    ‏2011-05-18T13:54:13Z  
    I think that we're going to have to see some code to try to recreate this. We have tests that do local work group sizes greater than 1, so it must be something specific that it's your code that's causing the problem.

    thx.bri.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Bug (?) with local_size[0] > 1 on POWER7, works with PPC970

    ‏2011-05-18T14:04:43Z  
    I think that we're going to have to see some code to try to recreate this. We have tests that do local work group sizes greater than 1, so it must be something specific that it's your code that's causing the problem.

    thx.bri.
    > {quote:title=bri wrote:}{quote}
    > I think that we're going to have to see some code to try to recreate this. We have tests that do local work group sizes greater than 1,
    > so it must be something specific that it's your code that's causing the problem.

    I sort of hoped for a "under circumstances so-and-so there is a known problem", because unfortunately it's a client's code, so posting it to a public forum is a big no-can-do.

    It'll have to wait until I can run it on a machine with a support contract (at the moment, it runs courtesy of the nice folks at IBM france who let me use a machine to test the code; thanks a lot guys :-) so I can submit it in private.

    Cordially,
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Bug (?) with local_size[0] > 1 on POWER7, works with PPC970

    ‏2011-05-18T14:13:36Z  
    > {quote:title=bri wrote:}{quote}
    > I think that we're going to have to see some code to try to recreate this. We have tests that do local work group sizes greater than 1,
    > so it must be something specific that it's your code that's causing the problem.

    I sort of hoped for a "under circumstances so-and-so there is a known problem", because unfortunately it's a client's code, so posting it to a public forum is a big no-can-do.

    It'll have to wait until I can run it on a machine with a support contract (at the moment, it runs courtesy of the nice folks at IBM france who let me use a machine to test the code; thanks a lot guys :-) so I can submit it in private.

    Cordially,
    I don't think that we had any issues left over from testing. If you can cut the kernel down to a minimum 'clean' size that shows the problem, we might be able to tell just from the kernel and not need the host-side code, if that helps. Does the kernel do barriers in it? Loops? Does it fail if you add the '__attribute__ ((reqd_work_group_size(#,1,1)))' to the kernel?

    .bri.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Bug (?) with local_size[0] > 1 on POWER7, works with PPC970

    ‏2011-05-18T14:53:00Z  
    I don't think that we had any issues left over from testing. If you can cut the kernel down to a minimum 'clean' size that shows the problem, we might be able to tell just from the kernel and not need the host-side code, if that helps. Does the kernel do barriers in it? Loops? Does it fail if you add the '__attribute__ ((reqd_work_group_size(#,1,1)))' to the kernel?

    .bri.
    > {quote:title=bri wrote:}{quote}
    > I don't think that we had any issues left over from testing. If you can cut the kernel down to a minimum 'clean' size that shows the problem, we might be able to tell just from the kernel and not need the host-side code, if that helps. Does the kernel do barriers in it? Loops? Does it fail if you add the '__attribute__ ((reqd_work_group_size(#,1,1)))' to the kernel?

    The __attribute__ is there by default (removing it doesn't solve the problem).

    I've tried to narrow it down, and I think I might have found the problem.

    I was disabling code by #if ... #endif and found a line with only computations that could change between 'everything run' and 'one-per-group run', depending on whether I disabled it or not. This line includes a call to double precision intrinsic expm1(). This seem to be the culprit. This naughty function has already caused problems to another SDK before ;-)

    If at the beginning of the code I do (please don't tell mathematicians :-) :

    #define expm1(a) (a)

    ... then suddenly every work-item is executed.

    In fact, if I do the slightly more accurate:

    #define expm1(a) (exp(a)-1.)

    It still "works" (as in, every work-item is computed ; obviously accuracy is shot, there's a reason why expm1() exists :-).

    So maybe expm1() is buggy on the VSX-enabled processors, or something like that?

    Cordially,
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Bug (?) with local_size[0] > 1 on POWER7, works with PPC970

    ‏2011-05-18T15:08:22Z  
    > {quote:title=bri wrote:}{quote}
    > I don't think that we had any issues left over from testing. If you can cut the kernel down to a minimum 'clean' size that shows the problem, we might be able to tell just from the kernel and not need the host-side code, if that helps. Does the kernel do barriers in it? Loops? Does it fail if you add the '__attribute__ ((reqd_work_group_size(#,1,1)))' to the kernel?

    The __attribute__ is there by default (removing it doesn't solve the problem).

    I've tried to narrow it down, and I think I might have found the problem.

    I was disabling code by #if ... #endif and found a line with only computations that could change between 'everything run' and 'one-per-group run', depending on whether I disabled it or not. This line includes a call to double precision intrinsic expm1(). This seem to be the culprit. This naughty function has already caused problems to another SDK before ;-)

    If at the beginning of the code I do (please don't tell mathematicians :-) :

    #define expm1(a) (a)

    ... then suddenly every work-item is executed.

    In fact, if I do the slightly more accurate:

    #define expm1(a) (exp(a)-1.)

    It still "works" (as in, every work-item is computed ; obviously accuracy is shot, there's a reason why expm1() exists :-).

    So maybe expm1() is buggy on the VSX-enabled processors, or something like that?

    Cordially,
    Ok, we'll debug and see if we can identify anything. Are you doing double? or vectors (double2, double4, ..)?

    thx.bri.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Bug (?) with local_size[0] > 1 on POWER7, works with PPC970

    ‏2011-05-18T16:17:12Z  
    Ok, we'll debug and see if we can identify anything. Are you doing double? or vectors (double2, double4, ..)?

    thx.bri.
    > {quote:title=bri wrote:}{quote}
    > Ok, we'll debug and see if we can identify anything. Are you doing double? or vectors (double2, double4, ..)?

    Regular double, and I get the same result with regular float (using (exp(a) - 1.f) in the define).

    It should be easy to reproduce, as in my code doing:

    a[n] = expm1(b[n]);

    and nothing else (where n is computed as (get_global_id(1) * get_global_size(0) + get_global_id(0))) exhibits the problem. The only possible tricky part is that there is a bunch of parameters, I haven't tried w/o them.

    Cordially,
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Bug (?) with local_size[0] > 1 on POWER7, works with PPC970

    ‏2011-05-18T18:28:31Z  
    > {quote:title=bri wrote:}{quote}
    > Ok, we'll debug and see if we can identify anything. Are you doing double? or vectors (double2, double4, ..)?

    Regular double, and I get the same result with regular float (using (exp(a) - 1.f) in the define).

    It should be easy to reproduce, as in my code doing:

    a[n] = expm1(b[n]);

    and nothing else (where n is computed as (get_global_id(1) * get_global_size(0) + get_global_id(0))) exhibits the problem. The only possible tricky part is that there is a bunch of parameters, I haven't tried w/o them.

    Cordially,
    Make sure you check for any errors, to make sure that you aren't getting an error on the clEnqueueNDRange() call. Looks like we have an issue where something is causing the kernel to only work on a local work group size of 1,1,1. BUT - if that were the case, you should get an error if you try the enqueue with a larger value.

    .bri.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Bug with expm1() & local_size[0] > 1 on POWER7, works with PPC970

    ‏2011-05-19T07:33:57Z  
    Make sure you check for any errors, to make sure that you aren't getting an error on the clEnqueueNDRange() call. Looks like we have an issue where something is causing the kernel to only work on a local work group size of 1,1,1. BUT - if that were the case, you should get an error if you try the enqueue with a larger value.

    .bri.
    > {quote:title=bri wrote:}{quote}
    > Make sure you check for any errors, to make sure that you aren't getting an error on the clEnqueueNDRange() call. Looks like we have an issue where something is causing the kernel to only work on a local work group size of 1,1,1. BUT - if that were the case, you should get an error if you try the enqueue with a larger value.

    clCreateProgramWithSource, clBuildProgram, clCreateKernel, clEnqueueNDRangeKernel are all properly checked and do not produce errors. Also, the build log (from clGetProgramBuildInfo) doesn't have any error or warning. I think it's really a bug and not just a misfeature.

    I fix the subject line to be more accurate.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Bug with expm1() & local_size[0] > 1 on POWER7, works with PPC970

    ‏2011-05-19T10:35:01Z  
    > {quote:title=bri wrote:}{quote}
    > Make sure you check for any errors, to make sure that you aren't getting an error on the clEnqueueNDRange() call. Looks like we have an issue where something is causing the kernel to only work on a local work group size of 1,1,1. BUT - if that were the case, you should get an error if you try the enqueue with a larger value.

    clCreateProgramWithSource, clBuildProgram, clCreateKernel, clEnqueueNDRangeKernel are all properly checked and do not produce errors. Also, the build log (from clGetProgramBuildInfo) doesn't have any error or warning. I think it's really a bug and not just a misfeature.

    I fix the subject line to be more accurate.
    OK, we'll look into it more. We've at least identified 1 problem here, just not totally sure it's the same as you're seeing.

    Can you add this get

    size_t kernel_work_group_size;
    rc = clGetKernelWorkGroupInfo (kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE, sizeof (size_t), &kernel_work_group_size, NULL);

    after the clCreateKernel() and see what kernel_work_group_size reports?

    thx.bri.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Bug with expm1() & local_size[0] > 1 on POWER7, works with PPC970

    ‏2011-05-19T16:15:09Z  
    OK, we'll look into it more. We've at least identified 1 problem here, just not totally sure it's the same as you're seeing.

    Can you add this get

    size_t kernel_work_group_size;
    rc = clGetKernelWorkGroupInfo (kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE, sizeof (size_t), &kernel_work_group_size, NULL);

    after the clCreateKernel() and see what kernel_work_group_size reports?

    thx.bri.
    > {quote:title=bri wrote:}{quote}
    > after the clCreateKernel() and see what kernel_work_group_size reports?

    It reports the correct value, i.e. whatever is in the attribute 'reqd_work_group_size' of the kernel. If that matches the local_size from the NDRange call, then it runs with the result previously described, and if it doesn't match (i.e. I patched the kernel file) then it doesn't run at all and reports as expected "CL_INVALID_WORK_GROUP_SIZE".

    Cordially,
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Bug with expm1() & local_size[0] > 1 on POWER7, works with PPC970

    ‏2011-05-19T16:24:39Z  
    > {quote:title=bri wrote:}{quote}
    > after the clCreateKernel() and see what kernel_work_group_size reports?

    It reports the correct value, i.e. whatever is in the attribute 'reqd_work_group_size' of the kernel. If that matches the local_size from the NDRange call, then it runs with the result previously described, and if it doesn't match (i.e. I patched the kernel file) then it doesn't run at all and reports as expected "CL_INVALID_WORK_GROUP_SIZE".

    Cordially,
    > {quote:title=RomainDolbeau wrote:}{quote}
    > It reports the correct value, i.e. whatever is in the attribute 'reqd_work_group_size' of the kernel. If that matches the local_size from the NDRange call, then it runs with the result previously described, and if it doesn't match (i.e. I patched the kernel file) then it doesn't run at all and reports as expected "CL_INVALID_WORK_GROUP_SIZE".

    Extra data point: if I remove the attribute, I have weird results.

    If the NDRange local_size is 64, and I don't have an attribute at all, I have two different possibilities:

    1) I do NOT use expm1()
    -> clGetKernelWorkGroupInfo() returns 1024, and I get my approximate results, it works

    2) I DO use expm1()
    -> clGetKernelWorkGroupInfo() returns 1, and the kernel fails to launch with CL_OUT_OF_RESOURCES

    If the NDRange local_size is 1, then I have the same values from clGetKernelWorkGroupInfo() (1 or 1024), but the code works in both cases.

    Cordially,
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Bug with expm1() & local_size[0] > 1 on POWER7, works with PPC970

    ‏2011-06-02T15:00:54Z  
    > {quote:title=RomainDolbeau wrote:}{quote}
    > It reports the correct value, i.e. whatever is in the attribute 'reqd_work_group_size' of the kernel. If that matches the local_size from the NDRange call, then it runs with the result previously described, and if it doesn't match (i.e. I patched the kernel file) then it doesn't run at all and reports as expected "CL_INVALID_WORK_GROUP_SIZE".

    Extra data point: if I remove the attribute, I have weird results.

    If the NDRange local_size is 64, and I don't have an attribute at all, I have two different possibilities:

    1) I do NOT use expm1()
    -> clGetKernelWorkGroupInfo() returns 1024, and I get my approximate results, it works

    2) I DO use expm1()
    -> clGetKernelWorkGroupInfo() returns 1, and the kernel fails to launch with CL_OUT_OF_RESOURCES

    If the NDRange local_size is 1, then I have the same values from clGetKernelWorkGroupInfo() (1 or 1024), but the code works in both cases.

    Cordially,
    Romain,

    We now understand the cause of this problem and I believe we have a simple v0.3 temporary workaround that should work (at least it did for me) until the next release of the SDK.

    In the file /usr/include/CL/device/cl_kernel_builtin_CPU.h add the following two lines:

    static inline double __expm1_double(double in) __attribute__ ((pure, simd));
    static inline double2 __expm1_double2(double2 in) __attribute__ ((pure, simd));

    For Cell users, you will need to add the lines to the file /usr/spu/include/CL/device/cl_kernel_builtin_AccelCellSPU.h

    Dan B.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Bug with expm1() & local_size[0] > 1 on POWER7, works with PPC970

    ‏2011-06-21T13:28:47Z  
    Romain,

    We now understand the cause of this problem and I believe we have a simple v0.3 temporary workaround that should work (at least it did for me) until the next release of the SDK.

    In the file /usr/include/CL/device/cl_kernel_builtin_CPU.h add the following two lines:

    static inline double __expm1_double(double in) __attribute__ ((pure, simd));
    static inline double2 __expm1_double2(double2 in) __attribute__ ((pure, simd));

    For Cell users, you will need to add the lines to the file /usr/spu/include/CL/device/cl_kernel_builtin_AccelCellSPU.h

    Dan B.
    > We now understand the cause of this problem and I believe we have a simple v0.3 temporary workaround that should work (at least it did for me) until the next release of the SDK.

    The customer has deployed OpenCL 0.3 on the test machine and included your patch. The code works fine now.

    Thanks for the help & the fix.