Topic
  • 11 replies
  • Latest Post - ‏2011-04-27T19:42:35Z by SystemAdmin
Diko
Diko
5 Posts

Pinned topic Copying a structure

‏2011-04-23T16:49:49Z |
Hello,

i'm trying to run some OpenCL code on a PS3. This is my kernel:

#define BUFFER_SIZE 1   struct Test 
{ 

float buffer[BUFFER_SIZE]; 
};   __kernel 

void kernel1(__global struct Test* pb1) 
{ __local struct Test t[4]; printf(
"%p %p\n", &t[get_local_id(0)], &pb1[get_global_id(0)]); t[get_local_id(0)] = pb1[get_global_id(0)]; 
}


When BUFFER_SIZE is 1 or 2, then everything is fine.
If BUFFER_SIZE is greater than 2, then i get a segfault, when the printf statement
is present. Without printf it works.

Global and local size is 4.

This is the backtrace:
#0 0x00001afc in __cache_fetch_dirty () from kernel.ocl@0xf6ee0000 <5>
#1 0x00001020 in memcpy_ea () from kernel.ocl@0xf6ee0000 <5>
#2 0x00000a18 in kernel1 (b1=0x3a9f0) at IBM_OpenCL_kernel.cl:12
#3 0x000001e0 in __kernel1_1D_exec (currentWorkGroup=240192, workGroupCount=240192, args=0x3aa40) at IBM_OpenCL_kernel.cl:95
#4 0x0003c804 in executeKernelPayload () from CLRuntimeAccelCellSPU@0xf96a500 <5>
#5 0x0003ced8 in main () from CLRuntimeAccelCellSPU@0xf96a500 <5>
#6 0x0003b08c in _start () from CLRuntimeAccelCellSPU@0xf96a500 <5>
#7 <cross-architecture call>
#8 0x0fbeee78 in syscall () from /lib/libc.so.6
#9 0x0f92918c in _base_spe_context_run () from /home/i2cluster/sidikotr/lib/libspe2.so.2
#10 0x0f91e204 in spe_context_run () from /home/i2cluster/sidikotr/lib/libspe2.so.2
#11 0x0f95b6f0 in ibm::openclDeviceCellSPU::CellSPUUnit::threadFunc(void*) () from /usr/lib/CL/debug/CL/device/libCLDevAccelCellSPU.so
#12 0x0fad0528 in start_thread () from /lib/libpthread.so.0
#13 0x0fbf37c0 in clone () from /lib/libc.so.6

Am i missing something? Some restrictions?
Updated on 2011-04-27T19:42:35Z at 2011-04-27T19:42:35Z by SystemAdmin
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Copying a structure

    ‏2011-04-25T12:55:16Z  
    Diko,

    I don't see anything obviously wrong with your code nor am I aware of a restriction that would result in segfaults you are encountering. I suspect the problem has to do with the allocation of the global data buffer pointed to by pb1 and/or the enqueue command used to invoke kernel1. More of the (i.e., complete) source code is needed to pin point the problem.

    Dan B.
  • Diko
    Diko
    5 Posts

    Re: Copying a structure

    ‏2011-04-25T15:13:47Z  
    Diko,

    I don't see anything obviously wrong with your code nor am I aware of a restriction that would result in segfaults you are encountering. I suspect the problem has to do with the allocation of the global data buffer pointed to by pb1 and/or the enqueue command used to invoke kernel1. More of the (i.e., complete) source code is needed to pin point the problem.

    Dan B.
    Hi,
    my host code is attached. You'll need the C++ OpenCL bindings to compile this.
    On the host side i just create a buffer with 1024 floats. And run the kernel
    with global size of 4 and local size of 4. This is just my testcode to
    recreate the problem.

    Kernel code is this:
    
    #define BUFFER_SIZE 13   struct Test 
    { 
    
    float buffer[BUFFER_SIZE]; 
    };   __kernel 
    
    void kernel1(__global struct Test* pb1) 
    { __local struct Test t[4]; printf(
    "%p %p\n", &t[get_local_id(0)], &pb1[get_global_id(0)]); t[get_local_id(0)] = pb1[get_global_id(0)]; 
    }
    


    It crashes always in __cache_fetch_dirty().

    Attachments

  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Copying a structure

    ‏2011-04-26T09:15:59Z  
    {quote:title=Diko wrote:}{quote}
    > When BUFFER_SIZE is 1 or 2, then everything is fine.
    > If BUFFER_SIZE is greater than 2, then i get a segfault, when the printf statement
    > is present. Without printf it works.

    From my limited testing, the problem is using the address of 't' (inside printf) on an ACCELERATOR device (AFAICT, everything is fine on a CPU device). It seems to work for BUFFER_SIZE < 3, but not otherwise. You can't print it, and you can't use it, unless the '*' operator is applied directly.

    So this:

    *(&tget_local_id(0)) = pb1get_global_id(0);

    works (should always work per 6.3.n!) but this:

    *((&t[0])+get_local_id(0)) = pb1get_global_id(0);

    doesn't, because the address is really used.

    As far as I understand the specifications, it should work...
  • Diko
    Diko
    5 Posts

    Re: Copying a structure

    ‏2011-04-26T09:58:05Z  
    {quote:title=Diko wrote:}{quote}
    > When BUFFER_SIZE is 1 or 2, then everything is fine.
    > If BUFFER_SIZE is greater than 2, then i get a segfault, when the printf statement
    > is present. Without printf it works.

    From my limited testing, the problem is using the address of 't' (inside printf) on an ACCELERATOR device (AFAICT, everything is fine on a CPU device). It seems to work for BUFFER_SIZE < 3, but not otherwise. You can't print it, and you can't use it, unless the '*' operator is applied directly.

    So this:

    *(&tget_local_id(0)) = pb1get_global_id(0);

    works (should always work per 6.3.n!) but this:

    *((&t[0])+get_local_id(0)) = pb1get_global_id(0);

    doesn't, because the address is really used.

    As far as I understand the specifications, it should work...
    Ah yes, thank you.

    I tested this code on AMD CPU, ATI GPU and the Cell PPU and it worked fine.

    As i see it, the OpenCL implementation is not correct on the SPU. Or there
    are some restrictions i dont know about.
  • masterzorag
    masterzorag
    25 Posts

    Re: Copying a structure

    ‏2011-04-26T18:03:12Z  
    • Diko
    • ‏2011-04-26T09:58:05Z
    Ah yes, thank you.

    I tested this code on AMD CPU, ATI GPU and the Cell PPU and it worked fine.

    As i see it, the OpenCL implementation is not correct on the SPU. Or there
    are some restrictions i dont know about.
    just to know,
    how many spes did you have access to, 6 or more? or
    what CL_MAX_COMPUTE_UNITS, or spu-top are saying about?
    maybe, in case of 6+, something goes wrong...
  • Diko
    Diko
    5 Posts

    Re: Copying a structure

    ‏2011-04-26T18:16:47Z  
    just to know,
    how many spes did you have access to, 6 or more? or
    what CL_MAX_COMPUTE_UNITS, or spu-top are saying about?
    maybe, in case of 6+, something goes wrong...
    Only 6.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Copying a structure

    ‏2011-04-26T18:38:50Z  
    • Diko
    • ‏2011-04-26T09:58:05Z
    Ah yes, thank you.

    I tested this code on AMD CPU, ATI GPU and the Cell PPU and it worked fine.

    As i see it, the OpenCL implementation is not correct on the SPU. Or there
    are some restrictions i dont know about.
    As best as I can tell, this problem is a result of the compiler calling memcpy_ea in response to a block memory copy from __global to __local memory. The memcpy_ea function takes two __ea void pointers. This function assumes that the caller has correctly converted the local storage pointer to a __ea void pointer by adding the ea base address. Without this addition, the memcpy_ea function will attempt to cache the __local data thinking it is a global memory pointer. Then when the cache line must be evicted, then it attempts to DMA the cache line to the local storage address which segfaults because there is no system (global) memory at that address.

    I have opened a defect for the compiler developers to look at this further in that they may be using a different set of ea memory functions than what is currently OpenCL sourced.

    In the interim, if you can avoid block moves, you shouldn't experience this problem. For example, I modified your code sample to the copy one float at a time. This works.

    
    #define BUFFER_SIZE 4   struct Test 
    { 
    
    float buffer[BUFFER_SIZE]; 
    };   #
    
    if 0 __kernel 
    
    void kernel1(__global struct Test* pb1) 
    { __local struct Test t[4]; printf(
    "%p %p\n", &t[get_local_id(0)], &pb1[get_global_id(0)]); t[get_local_id(0)] = pb1[get_global_id(0)]; 
    } #
    
    else __kernel 
    
    void kernel1(__global struct Test* pb1) 
    { __local struct Test t[4]; 
    
    int i; printf(
    "%p %p\n", &t[get_local_id(0)], &pb1[get_global_id(0)]); 
    // Copy structure one float at a time. 
    
    for (i=0; i<BUFFER_SIZE; i++) 
    { t[get_local_id(0)].buffer[i] = pb1[get_global_id(0)].buffer[i]; 
    } 
    } #endif
    
  • Diko
    Diko
    5 Posts

    Re: Copying a structure

    ‏2011-04-26T18:52:25Z  
    As best as I can tell, this problem is a result of the compiler calling memcpy_ea in response to a block memory copy from __global to __local memory. The memcpy_ea function takes two __ea void pointers. This function assumes that the caller has correctly converted the local storage pointer to a __ea void pointer by adding the ea base address. Without this addition, the memcpy_ea function will attempt to cache the __local data thinking it is a global memory pointer. Then when the cache line must be evicted, then it attempts to DMA the cache line to the local storage address which segfaults because there is no system (global) memory at that address.

    I have opened a defect for the compiler developers to look at this further in that they may be using a different set of ea memory functions than what is currently OpenCL sourced.

    In the interim, if you can avoid block moves, you shouldn't experience this problem. For example, I modified your code sample to the copy one float at a time. This works.

    <pre class="jive-pre"> #define BUFFER_SIZE 4 struct Test { float buffer[BUFFER_SIZE]; }; # if 0 __kernel void kernel1(__global struct Test* pb1) { __local struct Test t[4]; printf( "%p %p\n", &t[get_local_id(0)], &pb1[get_global_id(0)]); t[get_local_id(0)] = pb1[get_global_id(0)]; } # else __kernel void kernel1(__global struct Test* pb1) { __local struct Test t[4]; int i; printf( "%p %p\n", &t[get_local_id(0)], &pb1[get_global_id(0)]); // Copy structure one float at a time. for (i=0; i<BUFFER_SIZE; i++) { t[get_local_id(0)].buffer[i] = pb1[get_global_id(0)].buffer[i]; } } #endif </pre>
    Alright, thank you.

    Odd, that nobody encountered this before.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Copying a structure

    ‏2011-04-27T14:56:46Z  
    • Diko
    • ‏2011-04-26T18:52:25Z
    Alright, thank you.

    Odd, that nobody encountered this before.
    Another way to work around this problem is to use async_work_group_copy. For block data transfers, this is the performance preferred method on the SPE accelerator devices. For example, the following kernel will also work:

    
    #define BUFFER_SIZE 4   struct Test 
    { 
    
    float buffer[BUFFER_SIZE]; 
    };   __kernel 
    
    void kernel1(__global struct Test* pb1) 
    { __local struct Test t[4]; event_t event; printf(
    "%p %p\n", &t[get_local_id(0)], &pb1[get_global_id(0)]); event = async_work_group_copy((__local 
    
    float *)(&t[get_local_id(0)]), (__global 
    
    float *)(&pb1[get_global_id(0)]), BUFFER_SIZE, 0); wait_group_events(1, &event); 
    }
    
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Copying a structure

    ‏2011-04-27T15:19:01Z  
    Another way to work around this problem is to use async_work_group_copy. For block data transfers, this is the performance preferred method on the SPE accelerator devices. For example, the following kernel will also work:

    <pre class="jive-pre"> #define BUFFER_SIZE 4 struct Test { float buffer[BUFFER_SIZE]; }; __kernel void kernel1(__global struct Test* pb1) { __local struct Test t[4]; event_t event; printf( "%p %p\n", &t[get_local_id(0)], &pb1[get_global_id(0)]); event = async_work_group_copy((__local float *)(&t[get_local_id(0)]), (__global float *)(&pb1[get_global_id(0)]), BUFFER_SIZE, 0); wait_group_events(1, &event); } </pre>
    {quote:title=brokensh wrote:}{quote}
    > Another way to work around this problem is to use async_work_group_copy. For block data transfers, this is the performance preferred method on the SPE accelerator devices. For example, the following kernel will also work:
    >
    > event = async_work_group_copy((__local float *)(&t[get_local_id(0)]), (__global float *)(&pb1[get_global_id(0)]), BUFFER_SIZE, 0);

    I have a doubt. Per OpenCL 1.1.36, 6.11.10:

    "(...)this built-in function must therefore be encountered by all work-items in a work-
    group executing the kernel with the same argument values"

    ... and both pointer will differ between work-items (different local_id and global_id) in your example. Shouldn't it be:

    event = async_work_group_copy((__local float *)(&t[0]), (__global float *)(&pb1[get_global_id(0) & ~(get_local_size(0)-1)]), BUFFER_SIZE * 4, 0); // this assume get_local_size(0) is a power of 2 ; the '4' that mutiplies BUFFER_SIZE is the '4' from the local definition of t.

    At least that's my understanding of the specifications ; I don't use async_work_group_copy() much, as the inability to use arbitrary element size (only 'gentype' are allowed) makes it pretty much useless on GPU in my experience.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Copying a structure

    ‏2011-04-27T19:42:35Z  
    {quote:title=brokensh wrote:}{quote}
    > Another way to work around this problem is to use async_work_group_copy. For block data transfers, this is the performance preferred method on the SPE accelerator devices. For example, the following kernel will also work:
    >
    > event = async_work_group_copy((__local float *)(&t[get_local_id(0)]), (__global float *)(&pb1[get_global_id(0)]), BUFFER_SIZE, 0);

    I have a doubt. Per OpenCL 1.1.36, 6.11.10:

    "(...)this built-in function must therefore be encountered by all work-items in a work-
    group executing the kernel with the same argument values"

    ... and both pointer will differ between work-items (different local_id and global_id) in your example. Shouldn't it be:

    event = async_work_group_copy((__local float *)(&t[0]), (__global float *)(&pb1[get_global_id(0) & ~(get_local_size(0)-1)]), BUFFER_SIZE * 4, 0); // this assume get_local_size(0) is a power of 2 ; the '4' that mutiplies BUFFER_SIZE is the '4' from the local definition of t.

    At least that's my understanding of the specifications ; I don't use async_work_group_copy() much, as the inability to use arbitrary element size (only 'gentype' are allowed) makes it pretty much useless on GPU in my experience.
    Romain,
    You are correct that the code sample I provided may not work on all implementations in that I violated the requirement you stated.