Topic
  • 12 replies
  • Latest Post - ‏2011-04-28T16:01:55Z by SystemAdmin
SystemAdmin
SystemAdmin
131 Posts

Pinned topic kernel works on PPE but not on SPE

‏2011-04-26T10:59:01Z |
Hi, I wrote this kernel code:

 #pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

typedef unsigned int uint_t;
typedef unsigned char u08b_t;
typedef unsigned long u64b_t;

void Skein_Put64_LSB_First(u08b_t *dst, const u64b_t *src, unsigned long bCnt) {
unsigned long n;
for (n = 0; n < bCnt; n++)
dst[n] = (u08b_t)(src n >> 3 >> (8 * (n&7)));
}

#define SKEIN_MODIFIER_WORDS ( 2)
#define SKEIN_256_STATE_WORDS ( 4)
#define SKEIN_256_BLOCK_BYTES ( 8*SKEIN_256_STATE_WORDS)

typedef struct {
unsigned long hashBitLen;
unsigned long bCnt;
u64b_t T SKEIN_MODIFIER_WORDS;
} Skein_Ctxt_Hdr_t;

typedef struct {
Skein_Ctxt_Hdr_t h;
u64b_t X SKEIN_256_STATE_WORDS;
u08b_t b SKEIN_256_BLOCK_BYTES;
} Skein_256_Ctxt_t;

typedef struct {
uint_t statebits;
union {
Skein_Ctxt_Hdr_t h;
Skein_256_Ctxt_t ctx_256;
} u;
} hashState;

__kernel void update(__global unsigned char *in, __global unsigned char *out, __global hashState *gHs, __local hashState *lhs) {
int glId = get_global_id(0);
int lId = get_local_id(0);
int grId = get_group_id(0);

printf("global id: %x\n", glId);
printf("local id: %x\n", lId);
printf("group id: %x\n", grId);

gHs += grId;
lhs += lId;

Skein_256_Ctxt_t ctx;
ctx = gHs->u.ctx_256;
ctx.h.hashBitLen = 142;
gHs->u.ctx_256 = ctx;
}

When I compyle and run this code using CL_DEVICE_TYPE_CPU device, all things going fine. But when I switch to CL_DEVICE_TYPE_ACCELERATOR device (I have now only one thread) then it's gave me "Segmentation fault" error.
So I did debugging and the "Segmentation fault" error is making be this line "ctx = gHs->u.ctx_256;", here is the debugger output:
 (gdb) n
50 ctx = gHs->u.ctx_256;
(gdb) n

Program received signal SIGSEGV, Segmentation fault.
0x00001f90 in __cache_miss (ea=0x3a650, n_bytes_dirty=48) at ../../toolchain/gcc/gcc/config/spu/cachemgr.c:255
255 ../../toolchain/gcc/gcc/config/spu/cachemgr.c: No such file or directory.
in ../../toolchain/gcc/gcc/config/spu/cachemgr.c
(gdb)

and here is the backtrace:

 (gdb) bt
#0 0x00001f90 in __cache_miss (ea=0x3a650, n_bytes_dirty=48) at ../../toolchain/gcc/gcc/config/spu/cachemgr.c:255
#1 __cache_fetch_dirty (ea=0x3a650, n_bytes_dirty=48) at ../../toolchain/gcc/gcc/config/spu/cachemgr.c:417
#2 0x00001240 in memmove_ea (dest=0x5700, src=<value optimized out>, n=22272) at ../../../../../../src/newlib/libc/machine/spu/memmove_ea.c:66
#3 0x00000030 in __zerofunction__exec () at IBM_OpenCL_kernel.cl:189
#4 0x000000b4 in Skein_Put64_LSB_First (dst=0x1004dc80 "", src=0x0, bCnt=549755814016) at IBM_OpenCL_kernel.cl:189
#5 0x000000b4 in Skein_Put64_LSB_First (dst=0xf3aaf100 "", src=0x3a800, bCnt=0) at IBM_OpenCL_kernel.cl:189
#6 <cross-architecture call>
#7 0xf7f5d144 in syscall () from /lib/libc.so.6
#8 0x0f38818c in _base_spe_context_run () from /usr/lib/libspe2.so.2
#9 0x0f37d204 in spe_context_run () from /usr/lib/libspe2.so.2
#10 0x0f3ba6f0 in ibm::openclDeviceCellSPU::CellSPUUnit::threadFunc(void*) () from /usr/lib/CL/debug/CL/device/libCLDevAccelCellSPU.so
#11 0x0fcf66d4 in start_thread () from /lib/libpthread.so.0
#12 0xf7f613a4 in clone () from /lib/libc.so.6

I'm wondering when I look at #3, #4, #5 frame, there is line number 189 and my kernel has only 53 lines, next think is that I didn't call Skein_Put64_LSB_First function from my kernel. Have any ideas?
Updated on 2011-04-28T16:01:55Z at 2011-04-28T16:01:55Z by SystemAdmin
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: kernel works on PPE but not on SPE

    ‏2011-04-26T12:54:28Z  
    tautou,

    The problem you are experiencing seems to be the same as diko.
    See http://www.ibm.com/developerworks/forums/thread.jspa?messageID=14610195.
    I'm still debugging the problem, but it appears that it occurs as a result of a block copy (memcpy or memmove) between global memory and local memory. I'll post an update when I got a full handle on the problem.

    Dan B.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: kernel works on PPE but not on SPE

    ‏2011-04-26T13:11:42Z  
    tautou,

    The problem you are experiencing seems to be the same as diko.
    See http://www.ibm.com/developerworks/forums/thread.jspa?messageID=14610195.
    I'm still debugging the problem, but it appears that it occurs as a result of a block copy (memcpy or memmove) between global memory and local memory. I'll post an update when I got a full handle on the problem.

    Dan B.
    If you do, it would be very helpful for me. Thanks a lot.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: kernel works on PPE but not on SPE

    ‏2011-04-26T18:41:29Z  
    If you do, it would be very helpful for me. Thanks a lot.
    As best as I can tell, this problem is a result of the compiler calling memcpy_ea or memmove_ea in response to a block memory copy from __global to __local memory. The memmove_ea function takes two __ea void pointers. This function assumes that the caller has correctly converted the local storage pointer to a __ea void pointer by adding the ea base address. Without this addition, the memmove_ea function will attempt to cache the __local data thinking it is a global memory pointer. Then when the cache line must be evicted, then it attempts to DMA the cache line to the local storage address which segfaults because there is no system (global) memory at that address.

    I have opened a defect for the compiler developers to look at this further in that they may be using a different set of ea memory functions than what is currently OpenCL sourced.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: kernel works on PPE but not on SPE

    ‏2011-04-26T21:47:22Z  
    As best as I can tell, this problem is a result of the compiler calling memcpy_ea or memmove_ea in response to a block memory copy from __global to __local memory. The memmove_ea function takes two __ea void pointers. This function assumes that the caller has correctly converted the local storage pointer to a __ea void pointer by adding the ea base address. Without this addition, the memmove_ea function will attempt to cache the __local data thinking it is a global memory pointer. Then when the cache line must be evicted, then it attempts to DMA the cache line to the local storage address which segfaults because there is no system (global) memory at that address.

    I have opened a defect for the compiler developers to look at this further in that they may be using a different set of ea memory functions than what is currently OpenCL sourced.
    So as I'm understanding this in right way, the solution is that I must to wait for developers to make patch and build new release? Is there another way? Thanks a lot for your time.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: kernel works on PPE but not on SPE

    ‏2011-04-26T22:14:21Z  
    So as I'm understanding this in right way, the solution is that I must to wait for developers to make patch and build new release? Is there another way? Thanks a lot for your time.
    I believe that you can work around the problem by writing a utility to copy structures between local and global memory (and vice versa) by copying the data each base structure element at a time. See the code sample I posted in the other thread that exhibits what I believe to be the same problem.
    http://www.ibm.com/developerworks/forums/thread.jspa?messageID=14610195
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: kernel works on PPE but not on SPE

    ‏2011-04-27T10:06:48Z  
    I believe that you can work around the problem by writing a utility to copy structures between local and global memory (and vice versa) by copying the data each base structure element at a time. See the code sample I posted in the other thread that exhibits what I believe to be the same problem.
    http://www.ibm.com/developerworks/forums/thread.jspa?messageID=14610195
    I should have also suggested that using the async_work_group_copy built-in is another workaround to block move memory between global and local memory. For large transfers that can achieve better performance on the SPU too.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: kernel works on PPE but not on SPE

    ‏2011-04-27T14:23:15Z  
    I should have also suggested that using the async_work_group_copy built-in is another workaround to block move memory between global and local memory. For large transfers that can achieve better performance on the SPU too.
    Can you give me some example or refer me to example source code, because I'm new in this.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: kernel works on PPE but not on SPE

    ‏2011-04-27T14:58:08Z  
    Can you give me some example or refer me to example source code, because I'm new in this.
    Many of the code samples shipped with the SDK which are packaged as a separately downloadably zip file include use cases for async_work_group_copy. These include the sparse matrix-vector multiply sample (spmv), the Black-Scholes option pricing sample, and the fluid simulation sample.

    In addition, I verified that async_work_group_copy works with diko's code sample. Posted here:
    https://www.ibm.com/developerworks/forums/thread.jspa?messageID=14610837&#14610837
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: kernel works on PPE but not on SPE

    ‏2011-04-27T15:24:09Z  
    Many of the code samples shipped with the SDK which are packaged as a separately downloadably zip file include use cases for async_work_group_copy. These include the sparse matrix-vector multiply sample (spmv), the Black-Scholes option pricing sample, and the fluid simulation sample.

    In addition, I verified that async_work_group_copy works with diko's code sample. Posted here:
    https://www.ibm.com/developerworks/forums/thread.jspa?messageID=14610837&#14610837
    OK, thanks.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: kernel works on PPE but not on SPE

    ‏2011-04-27T16:35:38Z  
    Many of the code samples shipped with the SDK which are packaged as a separately downloadably zip file include use cases for async_work_group_copy. These include the sparse matrix-vector multiply sample (spmv), the Black-Scholes option pricing sample, and the fluid simulation sample.

    In addition, I verified that async_work_group_copy works with diko's code sample. Posted here:
    https://www.ibm.com/developerworks/forums/thread.jspa?messageID=14610837&#14610837
    One more thing. You post this bug to developers, can you give me the bugzilla link of this bug where you posted it? Thanks.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: kernel works on PPE but not on SPE

    ‏2011-04-27T18:21:36Z  
    One more thing. You post this bug to developers, can you give me the bugzilla link of this bug where you posted it? Thanks.
    The bug being tracked in an internal bug tracking system (which isn't bugzilla), so I can't give you a link. Sorry.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: kernel works on PPE but not on SPE

    ‏2011-04-28T16:01:55Z  
    The bug being tracked in an internal bug tracking system (which isn't bugzilla), so I can't give you a link. Sorry.
    Hi, I trying to use async_work_group_copy() function in my code as you recommended to me.
    But this doesn't work form me because as you can see, I copy from __global to __private memory and vice versa.
    Not from __global to __local memory and vice versa. So this solution is not work form me.