Topic
  • 6 replies
  • Latest Post - ‏2011-03-30T08:32:54Z by SystemAdmin
masterzorag
25 Posts

Pinned topic matrix_add_vector_async_copy + vectors sums, wrong results in __local

‏2010-06-16T18:04:48Z |
I've adapted one of the example kernels; it runs matrix_add_vector_async_copy over a couple of n-element scalar inputs.
It computes C_output[i] = A_input[i] + B_input[i]; for error checking, every C_output[i] should equal n.

__kernel __attribute__ ((reqd_work_group_size (1, 1, 1)))
void matrix_add_vector_async_copy (
__global unsigned int *A,
__global unsigned int *B,
__global unsigned int *C,
__local unsigned int *lA,
__local unsigned int *lB,
__local unsigned int *lC,
unsigned int size
)
{
unsigned int i;
unsigned int cnt = size/4;
unsigned int offset = cnt * get_global_id(0);
event_t event = (event_t)0;

printf("cnt:%d, offset:%d\n", cnt, offset);

//get the matrices into local memory
event = async_work_group_copy (lA, (__global const unsigned int*)&(A[offset]), cnt, event);
event = async_work_group_copy (lB, (__global const unsigned int*)&(B[offset]), cnt, event);
wait_group_events (1, &event);

for (i = 0; i < cnt/4; i++)
{
uint4 tA[i], tB[i], tC[i];
int n = i*4;
//printf("%d %d %d %d\n", n+0, n+1, n+2, n+3);
tA[i].x = A[n]  ; tB[i].x = B[n]  ;
tA[i].y = A[n+1]; tB[i].y = B[n+1];
tA[i].z = A[n+2]; tB[i].z = B[n+2];
tA[i].w = A[n+3]; tB[i].w = B[n+3];
//printf("tA[%d]:%d %d %d %d, %d - ", i, tA[i].x, tA[i].y, tA[i].z, tA[i].w, sizeof(tA[i]));
//printf("tB[%d]:%d %d %d %d, %d\n", i, tB[i].x, tB[i].y, tB[i].z, tB[i].w, sizeof(tB[i]));
tC[i] = tA[i] + tB[i];
printf("tC[%d]:%d %d %d %d, %d\n", i, tC[i].x, tC[i].y, tC[i].z, tC[i].w, sizeof(tC[i]));
}

for (i = 0; i < cnt; i++)
{
lC[i] = lA[i] + lB[i];
printf("lC[%d]:%u, %u+%u\n", i, lC[i], lA[i], lB[i]);
}

//write the result back to global memory
event = async_work_group_copy (&(C[offset]), (__local const unsigned int*)lC, cnt, event);
wait_group_events (1, &event);
}

I'm trying to learn how to unpack input data into vectors and what I can do with them, so I also set up the tC[i] vectors
and fill them with the same sums, giving tC[x].xyzw = n.

$ ./main
0+12
1+11
2+10
3+9
4+8
5+7
6+6
7+5
8+4
9+3
10+2
11+1
size of outbuffer:48, array count:12
Building kernel...
Connecting to IBM ACCELERATOR CellBE processor...
CL_KERNEL_WORK_GROUP_SIZE 1
cnt:12, offset:0
tC[0]:12 12 12 12, 16
tC[1]:12 12 12 12, 16
tC[2]:12 12 12 12, 16
lC[0]:12, 0+12
lC[1]:12, 1+11
lC[2]:12, 2+10
lC[3]:12, 3+9
lC[4]:12, 4+8
lC[5]:12, 5+7
lC[6]:12, 6+6
lC[7]:12, 7+5
lC[8]:12, 8+4
lC[9]:12, 9+3
lC[10]:12, 10+2
lC[11]:12, 11+1
output: 12 12 12 12 12 12 12 12 12 12 12 12

The host code is attached. It runs correctly with n < 32; with larger multiples of 4 it computes wrong results. More precisely:
with n = 36 the vector sums are correct, but not the sums in __local memory, and those wrong values are what gets read back!

$ ./main
-cut-
size of outbuffer:144, array count:36
Building kernel...
Connecting to IBM ACCELERATOR CellBE processor...
CL_KERNEL_WORK_GROUP_SIZE 1
cnt:36, offset:0
tC[0]:36 36 36 36, 16
tC[1]:36 36 36 36, 16
tC[2]:36 36 36 36, 16
tC[3]:36 36 36 36, 16
tC[4]:36 36 36 36, 16
tC[5]:36 36 36 36, 16
tC[6]:36 36 36 36, 16
tC[7]:36 36 36 36, 16
tC[8]:36 36 36 36, 16
lC[0]:40, 4+36
lC[1]:38, 3+35
lC[2]:36, 2+34
lC[3]:34, 1+33
lC[4]:36, 4+32
lC[5]:36, 5+31
lC[6]:36, 6+30
lC[7]:36, 7+29
lC[8]:36, 8+28
lC[9]:36, 9+27
lC[10]:36, 10+26
lC[11]:36, 11+25
lC[12]:36, 12+24
lC[13]:36, 13+23
lC[14]:36, 14+22
lC[15]:36, 15+21
lC[16]:36, 16+20
lC[17]:36, 17+19
lC[18]:36, 18+18
lC[19]:36, 19+17
lC[20]:36, 20+16
lC[21]:36, 21+15
lC[22]:36, 22+14
lC[23]:36, 23+13
lC[24]:36, 24+12
lC[25]:36, 25+11
lC[26]:36, 26+10
lC[27]:36, 27+9
lC[28]:36, 28+8
lC[29]:36, 29+7
lC[30]:36, 30+6
lC[31]:36, 31+5
lC[32]:36, 32+4
lC[33]:36, 33+3
lC[34]:36, 34+2
lC[35]:36, 35+1
output: 40 38 36 34 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36

I've tried other values of n, and whenever I set n > 32 I get wrong results. Maybe a printf formatting issue? Maybe not.
I'm compiling with gcc -o main try1.c -lm -lstdc++ -lCL -std=c99


Updated on 2011-03-30T08:32:54Z by SystemAdmin
  • SystemAdmin
    131 Posts

    Re: matrix_add_vector_async_copy + vectors sums, wrong results in __local

    ‏2010-06-17T11:50:03Z  
    I'll try it here.

    One thing I noticed: your clSetKernelArg for the 3 __local parameters is wrong. The size needs to be the size of the local area, so 'n * sizeof(cl_uint)' and not sizeof(cl_mem).
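    For reference, the corrected argument setup might look like this. It's only a sketch: the `set_local_args` helper, the argument indices 3 to 5, and the element count `n` are assumptions based on the kernel signature above, not code from the attached host file.

```c
#include <CL/cl.h>

/* Sketch: for a __local kernel argument, pass the byte size of the
   local area and a NULL value, never a cl_mem handle. */
static cl_int set_local_args(cl_kernel kernel, size_t n)
{
    cl_int err;
    err  = clSetKernelArg(kernel, 3, n * sizeof(cl_uint), NULL); /* lA */
    err |= clSetKernelArg(kernel, 4, n * sizeof(cl_uint), NULL); /* lB */
    err |= clSetKernelArg(kernel, 5, n * sizeof(cl_uint), NULL); /* lC */
    return err;
}
```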

    .bri.
  • masterzorag
    25 Posts

    Re: matrix_add_vector_async_copy + vectors sums, wrong results in __local

    ‏2010-06-17T16:10:13Z  
    {quote:title=SystemAdmin wrote:}{quote}
    > I'll try it here.
    >
    > One thing I noticed: your clSetKernelArg for the 3 __local parameters is wrong. The size needs to be the size of the local area, so 'n * sizeof(cl_uint)' and not sizeof(cl_mem).
    Thanks, nice, it works exactly!

    When I set up those temporaries
    uint4 tA[i], tB[i], tC[i];
    in kernel code, are they allocated in __private by default?
    If I declare them as
    __local uint4 tA[i], tB[i], tC[i];
    and then assign values to .xyzw, is a barrier needed?

    I'm running the host code with a single global workgroup, but async_work_group_copy is used to copy the two matrices to local memory using cnt, because I'm not associating each input element with a work-item; it's all in one work-item in one workgroup. Is that correct?

    In other words, I have ND=1, LW=1,1,1 where a single async_work_group_copy iterates serially over the input arrays by element count;
    as opposed to, for example, ND=4, LW=2,2,1 where every work-item does an async_work_group_copy in parallel by its global id.

    Another one:
    can I have on the host
    cl_uint A_input[4];
    and grab it from the kernel as one uint4?
    How must the host input be set up to be pointed at by the kernel as __global uint4 *A?
  • SystemAdmin
    131 Posts

    Re: matrix_add_vector_async_copy + vectors sums, wrong results in __local

    ‏2010-06-17T18:46:00Z  
    {quote:title=masterzorag wrote:}{quote}
    > Thanks, nice, it works exactly!
    >
    > When I set up those temporaries
    > uint4 tA[i], tB[i], tC[i];
    > in kernel code, are they allocated in __private by default?
    > If I declare them as
    > __local uint4 tA[i], tB[i], tC[i];
    > and then assign values to .xyzw, is a barrier needed?
    >
    > I'm running the host code with a single global workgroup, but async_work_group_copy is used to copy the two matrices to local memory using cnt, because I'm not associating each input element with a work-item; it's all in one work-item in one workgroup. Is that correct?
    >
    > In other words, I have ND=1, LW=1,1,1 where a single async_work_group_copy iterates serially over the input arrays by element count;
    > as opposed to, for example, ND=4, LW=2,2,1 where every work-item does an async_work_group_copy in parallel by its global id.
    >
    > Another one:
    > can I have on the host
    > cl_uint A_input[4];
    > and grab it from the kernel as one uint4?
    > How must the host input be set up to be pointed at by the kernel as __global uint4 *A?
    tA address space: yes, tA, tB and tC will be in __private memory, since that's the default.

    Right, your code is doing global=1 and local=1,1,1. You could do it as global=4 and local=4,1,1, and the async_work_group_copy could cover the entire local workgroup. Both do the same work, though with the global=4 case you could also run it with local=1,1,1 and it would get distributed across 4 compute units. Or fold the cnt value into that: global could be your element count, with local 1,1,1 or 4,1,1, letting OpenCL handle looping across the data.
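    That last variant might be enqueued like this (a sketch; `queue`, `kernel`, and `count` are assumptions, and it presumes the reqd_work_group_size(1,1,1) attribute is dropped so the runtime may pick the workgroup size):

```c
/* One work-item per element; a NULL local size lets the runtime
   choose how to split the global range into workgroups. */
size_t global_work_size = count;
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                    &global_work_size, NULL,
                                    0, NULL, NULL);
```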

    cl_uint on the host: you don't want cl_uint on the host accessed as uint4 values, because of alignment. Declare the data as cl_uint4 on the host; that way it can go back and forth between the host and the kernel.
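    A minimal host-side sketch of that (hypothetical context handle `ctx`; the `.s[...]` component access comes from the cl_uint4 union in cl_platform.h):

```c
/* Declare host data as cl_uint4 so its 16-byte alignment matches
   the kernel's __global uint4 * view of the buffer. */
cl_uint4 A_input[4];
for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
        A_input[i].s[j] = (cl_uint)(4 * i + j);

cl_int err;
cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             sizeof(A_input), A_input, &err);
```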

    .bri.
  • masterzorag
    25 Posts

    Re: matrix_add_vector_async_copy + vectors sums, wrong results in __local

    ‏2011-03-30T00:33:00Z  
    {quote:title=SystemAdmin wrote:}{quote}
    > tA address space: yes, tA, tB and tC will be in __private memory, since that's the default.
    >
    > Right, your code is doing global=1 and local=1,1,1; you could do it as global=4 and local=4,1,1 and the async_work_group_copy could cover the entire local workgroup. Both do the same work, though with the global=4 case you could also run it with local=1,1,1 and it would get distributed across 4 compute units. Or fold the cnt value into that: global could be your element count, with local 1,1,1 or 4,1,1, letting OpenCL handle looping across the data.
    >
    > cl_uint on the host: you don't want cl_uint on the host accessed as uint4 values, because of alignment. Declare the data as cl_uint4 on the host; that way it can go back and forth between the host and the kernel.
    Another question, after some progress:
    I'm running this kernel; I've trimmed out the irrelevant parts to get to the point.

    #define WORK_GROUP_SIZE 128
    __kernel __attribute__ ((reqd_work_group_size (WORK_GROUP_SIZE, 1, 1)))
    void sha1k2 (__global uint4 *He1, __local uint4 *lT ) {
    const int gid = get_global_id(0);
    const int gid5 = gid * 5;
    lT[gid5] = He1[gid5];
    lT[gid5+1] = He1[gid5+1];
    lT[gid5+2] = He1[gid5+2];
    lT[gid5+3] = He1[gid5+3];
    lT[gid5+4] = He1[gid5+4];
    ... processing ...
    }

    I'm copying data to __local with this gid5 indexing, running in one linear dimension with Global 128, Local 128, so there is only a single workgroup, and everything runs fine. I'm trying to replace that data transfer with:
    event_t event = (event_t)0;
    event = async_work_group_copy (lT, He1, get_global_size(0) * 5, event);
    wait_group_events (1, &event);
    to copy all values from He1 to lT; everything is good as long as I run a single workgroup!

    If I launch 2 workgroups, keeping Global 128 but setting Local 64 and WORK_GROUP_SIZE 64, only the first one gives me correct results!
    I'm missing something about the &He1[offset] for the second workgroup; in the single-workgroup case num_elem = get_global_size(0) * 5 is the entire data buffer.
    With 2 workgroups, num_elem should be get_global_size(0) / get_num_groups(0) * 5 per group, but I can't figure out the correct async_work_group_copy call for my case...
  • SystemAdmin
    131 Posts

    Re: matrix_add_vector_async_copy + vectors sums, wrong results in __local

    ‏2011-03-30T08:13:42Z  
    {quote:title=masterzorag wrote:}{quote}
    > Another question, after some progress:
    > I'm running this kernel; I've trimmed out the irrelevant parts to get to the point.
    >
    > #define WORK_GROUP_SIZE 128
    > __kernel __attribute__ ((reqd_work_group_size (WORK_GROUP_SIZE, 1, 1)))
    > void sha1k2 (__global uint4 *He1, __local uint4 *lT ) {
    > const int gid = get_global_id(0);
    > const int gid5 = gid * 5;
    > lT[gid5] = He1[gid5];
    > lT[gid5+1] = He1[gid5+1];
    > lT[gid5+2] = He1[gid5+2];
    > lT[gid5+3] = He1[gid5+3];
    > lT[gid5+4] = He1[gid5+4];

    Are you sure "lT" (__local memory) uses the same offset as He1 (__global memory)?
    Usually, __local is much smaller and is used as a local cache. I'd expect:

    const int lid5 = get_local_id(0) * 5;
    lT[lid5] = He1[gid5];
    (...)
    to cache 5 elements in local memory.

    > I'm copying data to __local with this gid5 indexing, running in one linear dimension with Global 128, Local 128, so there is only a single workgroup, and everything runs fine. I'm trying to replace that data transfer with:
    > event_t event = (event_t)0;
    > event = async_work_group_copy (lT, He1, get_global_size(0) * 5, event);

    Both lT & He1 have no offset applied here. All workgroups will read & write the same addresses, and copy a lot of stuff, not just 5 elements as in the code above.
    I'd expect, to match my code above:

    event = async_work_group_copy (lT, (He1 + base_gid5), 5, event);

    where base_gid5 is the index of the first element of the workgroup (something like "base_gid5 = gid5 & ~(WORK_GROUP_SIZE-1);").

    It depends on what you want to do in the kernel.
  • SystemAdmin
    131 Posts

    Re: matrix_add_vector_async_copy + vectors sums, wrong results in __local

    ‏2011-03-30T08:32:54Z  
    {quote:title=RomainDolbeau wrote:}{quote}
    > const int lid5 = get_local_id(0) * 5;
    > lT[lid5] = He1[gid5]
    > (...)
    > to cache 5 elements in local memory.

    Except you want to cache 5 * WORK_GROUP_SIZE, not just 5, silly me.

    So you probably want something like:

    event = async_work_group_copy(lT, (He1 + base_gid5), 5 * get_local_size(0), event);

    assuming...

    lT has a size of (5 * WORK_GROUP_SIZE * sizeof(uint4))
    He1 has a size of (5 * get_global_size(0) * sizeof(uint4))

    and you want to copy (5 * WORK_GROUP_SIZE * sizeof(uint4)) bytes from He1 to lT, with each workgroup copying different, consecutive data from He1. (WORK_GROUP_SIZE == get_local_size(0) per the kernel attribute.)

    Your code copies (5 * WORK_GROUP_SIZE * sizeof(uint4)) bytes in the variant without async_work_group_copy, but (5 * get_global_size(0) * sizeof(uint4)) bytes in the variant with it, so if you have more than one workgroup they aren't equivalent.
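    Putting that together, the per-workgroup copy inside the kernel might look like this. A sketch only: it assumes lT holds 5 * WORK_GROUP_SIZE uint4 elements, and uses the group id times the local size for the base instead of the mask trick above.

```c
/* Each workgroup copies its own consecutive 5 * local_size slice of He1. */
const int base = get_group_id(0) * get_local_size(0) * 5;
event_t event = (event_t)0;
event = async_work_group_copy(lT, He1 + base, 5 * get_local_size(0), event);
wait_group_events(1, &event);
```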