Topic
5 replies Latest Post - ‏2012-01-21T00:36:47Z by masterzorag
masterzorag
masterzorag
25 Posts
ACCEPTED ANSWER

Pinned topic async_work_group_copy and workgroups

‏2012-01-05T11:48:51Z |
I'm sharing this piece of code (can't attach here), grab here: http://www.megaupload.com/?d=F9NUYN1G
it runs a kernel in one dimension, doing async_work_group_copy, apparently works.
in host code I set global as 1/4 of total inbuffer set, every workitem operate on data set of 5 arrays of 4 cl_uint4 buffers;
I can set how many workgroup, from 1 to 700, it run correctly. I do this setting in host:
int n= 4 *256 *700; // 4pmk at once * localsize * number of workgroup
global= n/4;
local= 256;
I change the number of workgroup (700) I want to run (see snapshot of start/end run)

kernel uses _local to store variables

l8[lid]= (uint8) 
{ gid, get_local_id(0), get_local_size(0), get_global_size(0),              
//get_global_id(0) / get_local_size(0),    // this == num_groups get_num_groups(0), get_group_id(0),             
//get_local_id(0) %2 offset, step5 
};

and two call to async_work_group_copy copies output on_run stats on _global:

event= async_work_group_copy((__global uint4 *) &He1[offset *5], (__local uint4 *) lT, local_size *5, event);

this copies 5 cl_uint4 * local_size (as there are many local_size cl_uint4 *5 allocated for) to E1 buffer,

event= async_work_group_copy((__global uint8 *) &out[offset *1], (__local uint8 *) l8, local_size *1, event);

this copies 1 cl_uint8 * local_size (as there are many local_size cl_uint8 *1 allocated for) to out buffer,
every workitem has its stats saved, at the end E10 to 4 should be equal to: 0 to 4 as

lT[lid5]    = 0; lT[lid5 +1] = 1; lT[lid5 +2] = 2; lT[lid5 +3] = 3; lT[lid5 +4] = 4;

Stats reports num of group, gid, lid... correctly

Question is: is a false positive?
Allocated _local is computed as:
num_groups= global / local,
allocated_local= (sizeof(cl_uint4) * local *5 + sizeof(cl_uint8) * local);
_local is allocated for only one workgroup! because only one workgroup can be executed once?
But on 6 spe? (see snapshot1.png)
Max allocable _local is computed as:
Local Memory (b): 247808 - (CL_KERNEL_LOCAL_MEM_SIZE): 12368b = 235440b ?
Reported CL_DEVICE_LOCAL_MEM_SIZE is for 1 spe, so we can use *6 compute units?
I allocate only 28672b ! Allocated _local is for compute unit ?
Updated on 2012-01-21T00:36:47Z at 2012-01-21T00:36:47Z by masterzorag
  • SystemAdmin
    SystemAdmin
    131 Posts
    ACCEPTED ANSWER

    Re: async_work_group_copy and workgroups

    ‏2012-01-05T17:36:37Z  in response to masterzorag
    Not sure what problem you are seeing? I can't get to the code and I don't see a screen snapshot (.png file) so not sure what's not working? Is it working with 700 workgroups, but failing higher? Does the code run OK on the CPU device?

    The memory returned values are per compute unit, so don't *6 due to 6 SPEs. Each compute unit has it's own local memory (and size) and needs to be thought of as separate.

    .bri.
    • masterzorag
      masterzorag
      25 Posts
      ACCEPTED ANSWER

      async_work_group_copy and workgroups, different hw + same kernel, but...

      ‏2012-01-13T17:35:14Z  in response to SystemAdmin
      There's something weird, I run the same kernel on my developing nvidia, here output of my program:
      
      Connecting to OpenCL device:     NVIDIA Corporation GeForce 8400M G Version:             OpenCL 1.0 CUDA Profile:                FULL_PROFILE Driver:            285.05.09 Extensions:           285.05.09 Max Work-Item Sizes:  (512,512,64) Max Work Group Size:       512 Max Compute Units:  1 Max Frequency (Mhz):  800 Cache Line (bytes): 0 Global Memory (MB):   127 Local Memory (b):   16384 Local Memory Type:        CL_LOCAL Max Mem Alloc (MB):    128 Max Param Size (MB):        4352 Base Mem Align (bits):     2048 Address Space (bits):      32 Image Support:               1 Building from OpenCL source:  blackhole_oclkernel.cl Compiling/quering CL_program:    test CL_KERNEL_WORK_GROUP_SIZE  128 CL_KERNEL_LOCAL_MEM_SIZE    48b global:112, local:16, should be 7 groups sets:112, total inbuffer   35840b remaining allocable_local:       16336b (1021b, 63.81 cl_uint4 per lid) allocated _local:                3072b lost/not_used _local:             13264b (829b per lid) testing alloc _local:     3072b Test vectors matrix: ipad, opad, e1, e2 x 3031010141 2097023113 3687644185  193597754 3539340543 x 3381037128 3133677280    6352479 2353724843 2715349157 x 1047897839 3669443064 1923090211 1545323242 1555370701 x  451126258 2465162893 2966289863 1553395727 1274994140 k execution:14425.00 ms Reading back, MAP: 8960b gid,         lid0,   lsize0, gid0/lsize0,    gsize0, n_gr0,  lid5,   offset gid:    0        lid:0   Ls:16   Gs:112  gr0s:7, gr0:0, offset:0, step5:0 x 3798810640 3127673557 3902155923  740317528 4146618039 gid:   16     lid:0   Ls:16   Gs:112  gr0s:7, gr0:1, offset:16, step5:80 x 2416824577 2984929202 3420684371 2475865352  775399312 gid:   17   lid:1   Ls:16   Gs:112  gr0s:7, gr0:1, offset:16, step5:85 x 2416824577 2984929202 3420684371 2475865352  775399312 gid:   32   lid:0   Ls:16   Gs:112  gr0s:7, gr0:2, offset:32, step5:160 x 2416824577 2984929202 3420684371 2475865352  775399312 gid:   34  lid:2   Ls:16   Gs:112  gr0s:7, gr0:2, offset:32, step5:170 x 2416824577 2984929202 3420684371 2475865352  775399312 gid:   48  lid:0   Ls:16   Gs:112  gr0s:7, gr0:3, offset:48, step5:240 x 2416824577 2984929202 3420684371 2475865352  775399312 gid:   51  lid:3   Ls:16   Gs:112  gr0s:7, gr0:3, offset:48, step5:255 x 2416824577 2984929202 3420684371 2475865352  775399312 gid:   64  lid:0   Ls:16   Gs:112  gr0s:7, gr0:4, offset:64, step5:320 x 2416824577 2984929202 3420684371 2475865352  775399312 gid:   68  lid:4   Ls:16   Gs:112  gr0s:7, gr0:4, offset:64, step5:340 x 2416824577 2984929202 3420684371 2475865352  775399312 gid:   80  lid:0   Ls:16   Gs:112  gr0s:7, gr0:5, offset:80, step5:400 x 2416824577 2984929202 3420684371 2475865352  775399312 gid:   85  lid:5   Ls:16   Gs:112  gr0s:7, gr0:5, offset:80, step5:425 x 2416824577 2984929202 3420684371 2475865352  775399312 gid:   96  lid:0   Ls:16   Gs:112  gr0s:7, gr0:6, offset:96, step5:480 x 2416824577 2984929202 3420684371 2475865352  775399312 gid:  102  lid:6   Ls:16   Gs:112  gr0s:7, gr0:6, offset:96, step5:510 x 2416824577 2984929202 3420684371 2475865352  775399312 *Result vectors are good!* Done!
      

      Then I run on a ps3, here output:
      
      Connecting to OpenCL device:      IBM ACCELERATOR CellBE processor Version:               OpenCL 1.1 BUILD 201103161830 Profile:          EMBEDDED_PROFILE Driver:                0.3 Extensions:         0.3 Max Work-Item Sizes:        (256,256,256) Max Work Group Size:      256 Max Compute Units:  6 Max Frequency (Mhz):  3200 Cache Line (bytes):        0 Global Memory (MB):   44 Local Memory (b):    247808 Local Memory Type:       CL_LOCAL Max Mem Alloc (MB):    128 Max Param Size (MB):        1024 Base Mem Align (bits):     1024 Address Space (bits):      32 Image Support:               0 Building from OpenCL source:  blackhole_oclkernel.cl Compiling/quering CL_program:    test CL_KERNEL_WORK_GROUP_SIZE  64 CL_KERNEL_LOCAL_MEM_SIZE     51296b global:112, local:16, should be 7 groups sets:112, total inbuffer        35840b remaining allocable_local:       196512b (12282b, 767.62 cl_uint4 per lid) allocated _local:             3072b lost/not_used _local:             193440b (12090b per lid) testing alloc _local:  3072b Test vectors matrix: ipad, opad, e1, e2 x 3031010141 2097023113 3687644185  193597754 3539340543 x 3381037128 3133677280    6352479 2353724843 2715349157 x 1047897839 3669443064 1923090211 1545323242 1555370701 x  451126258 2465162893 2966289863 1553395727 1274994140 k execution:311.00 ms Reading back, MAP: 8960b gid,           lid0,   lsize0, gid0/lsize0,    gsize0, n_gr0,  lid5,   offset gid:    0        lid:0   Ls:16   Gs:112  gr0s:7, gr0:0, offset:0, step5:0 x          0          0          0          0          0 gid:   16     lid:0   Ls:16   Gs:112  gr0s:7, gr0:1, offset:16, step5:80 x          0          0          0          0          0 gid:   17   lid:1   Ls:16   Gs:112  gr0s:7, gr0:1, offset:16, step5:85 x          0          0          0          0          0 gid:   32   lid:0   Ls:16   Gs:112  gr0s:7, gr0:2, offset:32, step5:160 x          0          0          0          0          0 gid:   34  lid:2   Ls:16   Gs:112  gr0s:7, gr0:2, offset:32, step5:170 x          0          0          0          0          0 gid:   48  lid:0   Ls:16   Gs:112  gr0s:7, gr0:3, offset:48, step5:240 x          0          0          0          0          0 gid:   51  lid:3   Ls:16   Gs:112  gr0s:7, gr0:3, offset:48, step5:255 x          0          0          0          0          0 gid:   64  lid:0   Ls:16   Gs:112  gr0s:7, gr0:4, offset:64, step5:320 x          0          0          0          0          0 gid:   68  lid:4   Ls:16   Gs:112  gr0s:7, gr0:4, offset:64, step5:340 x          0          0          0          0          0 gid:   80  lid:0   Ls:16   Gs:112  gr0s:7, gr0:5, offset:80, step5:400 x          0          0          0          0          0 gid:   85  lid:5   Ls:16   Gs:112  gr0s:7, gr0:5, offset:80, step5:425 x          0          0          0          0          0 gid:   96  lid:0   Ls:16   Gs:112 gr0s:7, gr0:6, offset:96, step5:480 x          0          0          0          0          0 gid:  102   lid:6   Ls:16   Gs:112  gr0s:7, gr0:6, offset:96, step5:510 x          0          0          0          0          0 x          0          0          0          0          0 x          0          0          0          0          0 Done!
      

      Compiler tells me that I can also set larger workgroups, but on a ps3 I got all zero in my output buffers, where's the weirdness?
      kernel follows:
      
      __kernel 
      //__attribute__ ((reqd_work_group_size (WORK_GROUP_SIZE, 1, 1)))   This is not more a restriction in current release, true? 
      
      void test ( __global 
      
      const uint4 *ipad, __global 
      
      const uint4 *opad, __global uint4 * restrict He1, __global uint4 * restrict He2, __local  uint4 * restrict lA, __local  uint4 * restrict lT, __local  uint8 * restrict l8,             
      //to save debug output __global uint8 *out )                     
      //to save debug output 
      { 
      
      const 
      
      int gid =           get_global_id(0); 
      
      const 
      
      int gid5 =            gid *5; 
      
      const 
      
      int lid =               get_local_id(0); 
      
      const 
      
      int lid5 =             lid *5; 
      
      const 
      
      int local_size =        get_local_size(0); 
      
      const 
      
      int step5 =          lid5 + (get_group_id(0) * local_size *5); 
      
      const 
      
      int offset =          local_size * get_group_id(0); event_t event; 
      // event = async_work_group_copy ((__local uint4 *) lA, (__global const uint4 *) &He1[offset *5], local_size *5, 0); event = async_work_group_copy ((__local uint4 *) lT, (__global 
      
      const uint4 *) &He1[offset *5], local_size *5, 0); wait_group_events (1, &event);   
      // start doing something useless, (but accessing _global) uint4 i_ctx[5]; i_ctx[0   ] = ipad[gid5   ]; i_ctx[0 +1] = ipad[gid5 +1]; i_ctx[0 +2] = ipad[gid5 +2]; i_ctx[0 +3] = ipad[gid5 +3]; i_ctx[0 +4] = ipad[gid5 +4];   uint4 o_ctx[5]; o_ctx[0   ] = opad[gid5   ]; o_ctx[0 +1] = opad[gid5 +1]; o_ctx[0 +2] = opad[gid5 +2]; o_ctx[0 +3] = opad[gid5 +3]; o_ctx[0 +4] = opad[gid5 +4];   uint4 T[5]; 
      // fill from async read! T[0] = lT[lid5]; T[1] = lT[lid5+1]; T[2] = lT[lid5+2]; T[3] = lT[lid5+3]; T[4] = lT[lid5+4]; 
      // use _private and _local 
      
      for(
      
      int i=0; i<4096-1; i++ ) 
      { sha1_process3(&i_ctx[0], &T[0], &lA[lid5]); sha1_process3(&o_ctx[0], &T[0], &lA[lid5]); lT[lid5]      ^= T[0]; lT[lid5+1] ^= T[1]; lT[lid5+2] ^= T[2]; lT[lid5+3] ^= T[3]; lT[lid5+4] ^= T[4]; 
      } event= async_work_group_copy((__global uint4 *) &He1[offset *5], (__local uint4 *) lT, local_size *5, event);                
      //five elements per lid   
      // refill from _global now! T[0] = He2[gid5]; T[1] = He2[gid5+1]; T[2] = He2[gid5+2]; T[3] = He2[gid5+3]; T[4] = He2[gid5+4]; 
      // use _private and _global 
      
      for(
      
      int i=0; i<4096-1; i++ ) 
      { sha1_process3(&i_ctx[0], &T[0], &lA[lid5]); sha1_process3(&o_ctx[0], &T[0], &lA[lid5]); He2[gid5] ^= T[0]; He2[gid5+1] ^= T[1]; He2[gid5+2] ^= T[2]; He2[gid5+3] ^= T[3]; He2[gid5+4] ^= T[4]; 
      }   
      //Save debug output l8[lid]= (uint8) 
      { gid,                                        get_local_id(0), get_local_size(0),             get_global_size(0),             
      //get_global_id(0) / get_local_size(0),    // this == num_groups get_num_groups(0),              get_group_id(0),                
      //get_local_id(0) %2 offset,                             step5 
      }; event= async_work_group_copy((__global uint8 *) &out[offset *1], (__local uint8 *) l8, local_size *1, event);          
      //one element per lid, should reflect clSetKernelArg call argoument wait_group_events (1, &event);   
      }
      

      Strange is that readed back _global uint8 *out buffer reflects correctly ndrange stats as lid, gid, num_groups, workgroup_offset
      • SystemAdmin
        SystemAdmin
        131 Posts
        ACCEPTED ANSWER

        Re: async_work_group_copy and workgroups, different hw + same kernel, but...

        ‏2012-01-14T12:26:59Z  in response to masterzorag
        not sure if it affects your results or not, but i think that each async_work_group_copy should have '0' for the event getting passed. and the 2nd one needs a wait after it.

        i'll look thru the code some more to see if i can see anything else amiss..

        .bri.
        • masterzorag
          masterzorag
          25 Posts
          ACCEPTED ANSWER

          Re: async_work_group_copy and workgroups, different hw + same kernel, but...

          ‏2012-01-21T00:36:47Z  in response to SystemAdmin
          first of all, thank you for your support

          I've noticed that output on CPU is the same as GPU zeroed's vectors output, so:
          1. on CPU kernel's results are good, one issue was faulting when feeding inbuffer data: on GPU I'm using clEnqueueWriteBuffer, on PS3 I'm using clEnqueueMapBuffer in a wrong order, cutted from host:
          
          cl_uint4 *ipad, *opad, *pmk1, *pmk2 __attribute__ ((aligned(16))); ipad= calloc(sets, sizeof(cl_uint4) *5); 
          // allocate other buffers too 
          // fill ipad, opad, pmk1, pmk2 arrays   cl_mem    IP, OP, E1, E2, ST; IP= clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, sets * sizeof(cl_uint4) *5, NULL, &res); 
          // create other buffers too   ipad= clEnqueueMapBuffer(cmd_queue, IP, CL_TRUE, CL_MAP_WRITE, 0, sets * sizeof(cl_uint4) *5, 0, NULL, NULL, &res); 
          // now I should fill ipad, opad, pmk1, pmk2 arrays!   
          // execute the ndrange   
          // read back result pmk1= clEnqueueMapBuffer(cmd_queue, E1, CL_TRUE, CL_MAP_READ, 0, sets * sizeof(cl_uint4) *5, 0, NULL, NULL, &res);
          

          About alignment I'm doing right?
          So, when unmap?

          2. on the SPUs I need a barrier(CLK_LOCAL_MEM_FENCE) at the end of the inner loop to ensure that at the loop restarts with ready _local!
          4096 calls to barrier
          3. you're right about the cl_event argoument:
          every single async_work_group_copy call should have 0 if executed alone, event is needed to be shared with another async_work_group_copy;
          after the first loop I have now:
          
          event= async_work_group_copy((__global uint4 *) &He1[offset *5], (__local uint4 *) lT, local_size *5, 0);               
          //five elements per lid wait_group_events (1, &event);
          

          nice, very nice
    • masterzorag
      masterzorag
      25 Posts
      ACCEPTED ANSWER

      On the CPU...

      ‏2012-01-13T17:44:39Z  in response to SystemAdmin
      I've run also on the CPU, I must set a localsize of 1 to run here:
      
      Connecting to OpenCL device:     IBM CPU Cell Broadband Engine, altivec supported 32bit Version:         OpenCL 1.1 BUILD 201103161830 Profile:          FULL_PROFILE Driver:            0.3 Extensions:         0.3 Max Work-Item Sizes:        (1024,1024,1024) Max Work Group Size:   1024 Max Compute Units: 2 Max Frequency (Mhz):  3192 Cache Line (bytes):        128 Global Memory (MB): 41 Local Memory (b):    524288 Local Memory Type:       CL_GLOBAL Max Mem Alloc (MB):   128 Max Param Size (MB):        1024 Base Mem Align (bits):     1024 Address Space (bits):      32 Image Support:               0 Building from OpenCL source:  blackhole_oclkernel.cl Compiling/quering CL_program:    test CL_KERNEL_WORK_GROUP_SIZE  1 CL_KERNEL_LOCAL_MEM_SIZE      0b global:112, local:1, should be 112 groups sets:112, total inbuffer   35840b remaining allocable_local:       524288b (524288b, 32768.00 cl_uint4 per lid) allocated _local:          192b lost/not_used _local:              524096b (524096b per lid) testing alloc _local: 192b Test vectors matrix: ipad, opad, e1, e2 x 3031010141 2097023113 3687644185  193597754 3539340543 x 3381037128 3133677280    6352479 2353724843 2715349157 x 1047897839 3669443064 1923090211 1545323242 1555370701 x  451126258 2465162893 2966289863 1553395727 1274994140 k execution:2513.00 ms Reading back, MAP: 8960b gid,           lid0,   lsize0, gid0/lsize0,    gsize0, n_gr0,  lid5,   offset gid:    0        lid:0   Ls:1    Gs:112  gr0s:112, gr0:0, offset:0, step5:0 x 2416824577 2984929202 3420684371 2475865352  775399312 gid:    1   lid:0   Ls:1    Gs:112  gr0s:112, gr0:1, offset:1, step5:5 x 2416824577 2984929202 3420684371 2475865352  775399312 gid:    2   lid:0   Ls:1    Gs:112  gr0s:112, gr0:2, offset:2, step5:10 x 2416824577 2984929202 3420684371 2475865352  775399312 gid:    3  lid:0   Ls:1    Gs:112  gr0s:112, gr0:3, offset:3, step5:15 x 2416824577 2984929202 3420684371 2475865352  775399312 ...CUT... gid:  109        lid:0   Ls:1    Gs:112  gr0s:112, gr0:109, offset:109, step5:545 x 2416824577 2984929202 3420684371 2475865352  775399312 gid:  110     lid:0   Ls:1    Gs:112  gr0s:112, gr0:110, offset:110, step5:550 x 2416824577 2984929202 3420684371 2475865352  775399312 gid:  111     lid:0   Ls:1    Gs:112  gr0s:112, gr0:111, offset:111, step5:555 x 2416824577 2984929202 3420684371 2475865352  775399312 x 2416824577 2984929202 3420684371 2475865352  775399312 x 2416824577 2984929202 3420684371 2475865352  775399312 Done!
      

      Got some results, but wrong!