Topic
  • 24 replies
  • Latest Post - ‏2012-01-25T11:39:52Z by masterzorag
joker5bb
joker5bb
3 Posts

Pinned topic Pyrit with opencl on the PS3

‏2010-01-17T17:46:56Z |
I wanted to know: how well is OpenCL optimized for Cell when doing brute-forcing on the PS3?

I want to use this project:
http://code.google.com/p/pyrit/

The source is available and it has OpenCL support.

Is it compatible?
Updated on 2012-01-25T11:39:52Z at 2012-01-25T11:39:52Z by masterzorag
  • JoaquinM
    JoaquinM
    5 Posts

    Re: Pyrit with opencl on the PS3

    ‏2010-01-29T17:38:34Z  
    It looks like it should run fine. They define the WORKGROUP_SIZE as a #define in the OpenCL code. You'll need to change the OpenCL kernel to include the attribute "__attribute__((reqd_work_group_size(X, Y, Z)))" in the kernel for maximum performance on Cell.

    Give it a try and let us know the results!

    Joaquin
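To illustrate the suggestion above, the attribute goes on the kernel definition itself. This is only a sketch, not pyrit's actual code: the kernel name and parameters here are hypothetical, and (16, 1, 1) is just an example size.

```c
/* OpenCL C kernel fragment (hypothetical names; (16,1,1) is an example).
   The same size must later be passed as local_work_size when the kernel
   is enqueued with clEnqueueNDRangeKernel. */
__kernel __attribute__((reqd_work_group_size(16, 1, 1)))
void example_kernel(__global const gpu_inbuffer *in,
                    __global gpu_outbuffer *out)
{
    const int i = get_global_id(0);
    /* ... per-work-item PMK computation ... */
}
```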
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Pyrit with opencl on the PS3

    ‏2010-01-29T17:40:12Z  
    I am not aware of anyone trying the OpenCL pyrit on Cell. If it was well written and didn't utilize any OpenCL extensions or features that have yet to be implemented (see the OpenCL Dev Kit Install and User's Guide for details on the limitations), then I see no reason it shouldn't work. I would like to hear your results.
  • joker5bb
    joker5bb
    3 Posts

    Re: Pyrit with opencl on the PS3

    ‏2010-02-01T21:48:28Z  
Well, I could not get the Cell OpenCL SDK to install on Ubuntu.

How can I compile it for 64-bit PPC?
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Pyrit with opencl on the PS3

    ‏2010-02-01T22:52:44Z  
    • joker5bb
    • ‏2010-02-01T21:48:28Z
Well, I could not get the Cell OpenCL SDK to install on Ubuntu.

How can I compile it for 64-bit PPC?
you can't -- currently we are only shipping 32-bit OpenCL libraries.

    .bri.
  • masterzorag
    masterzorag
    25 Posts

    Re: Pyrit with opencl on the PS3

    ‏2010-03-22T00:38:36Z  
    • JoaquinM
    • ‏2010-01-29T17:38:34Z
    It looks like it should run fine. They define the WORKGROUP_SIZE as a #define in the OpenCL code. You'll need to change the OpenCL kernel to include the attribute "__attribute__((reqd_work_group_size(X, Y, Z)))" in the kernel for maximum performance on Cell.

    Give it a try and let us know the results!

    Joaquin
    I've OpenCL working on a PS3 as core in pyrit, see here:
    http://masterzorag.blogspot.com/2010/03/pyrit-opencl-ps3.html

But I've noticed that when adding the include dir /opt/ibmcmp/xlc/opencl/0.1/include, the cpyrit_opencl/setup.py build fails with something about a missing '__attribute__' before '__vastart', see log line 12.

Could it be the "__attribute__((reqd_work_group_size(X, Y, Z)))" you suggest?
So does my video show a version of the module built without the IBM include? It was compiled without it, because without that dir it compiled fine and runs!

    Any help is appreciated!

    Attachments

  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Pyrit with opencl on the PS3

    ‏2010-03-22T14:29:22Z  
    I've OpenCL working on a PS3 as core in pyrit, see here:
    http://masterzorag.blogspot.com/2010/03/pyrit-opencl-ps3.html

But I've noticed that when adding the include dir /opt/ibmcmp/xlc/opencl/0.1/include, the cpyrit_opencl/setup.py build fails with something about a missing '__attribute__' before '__vastart', see log line 12.

Could it be the "__attribute__((reqd_work_group_size(X, Y, Z)))" you suggest?
So does my video show a version of the module built without the IBM include? It was compiled without it, because without that dir it compiled fine and runs!

    Any help is appreciated!
    why are you including that directory? you don't need to do that (and from the error, shouldn't do that). it's an internal compiler include directory, and it already gets included at the correct point of the compile.

as your working code shows, you do not need to include the directory.

    .bri.
  • masterzorag
    masterzorag
    25 Posts

    Re: Pyrit with opencl on the PS3

    ‏2010-03-22T16:42:14Z  
    why are you including that directory? you don't need to do that (and from the error, shouldn't do that). it's an internal compiler include directory, and it already gets included at the correct point of the compile.

as your working code shows, you do not need to include the directory.

    .bri.
Thank you for this, now I understand.
    But, what about this:
    You'll need to change the OpenCL kernel to include the attribute "__attribute__((reqd_work_group_size(X, Y, Z)))" in the kernel for maximum performance on Cell.
    Where can I include this attribute? I want to see the maximum performance on Cell!
    Thanks
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Pyrit with opencl on the PS3

    ‏2010-03-23T11:15:52Z  
Thank you for this, now I understand.
    But, what about this:
    You'll need to change the OpenCL kernel to include the attribute "__attribute__((reqd_work_group_size(X, Y, Z)))" in the kernel for maximum performance on Cell.
    Where can I include this attribute? I want to see the maximum performance on Cell!
    Thanks
read up about the clEnqueueNDRangeKernel call, and the parts about the local work group size. specifying and making your kernel work with a larger (larger than 1,1,1) local work group size will usually give better performance, because each invocation of the kernel will do more work, and the compiler may be able to schedule things better.

on IBM's current release, we have a restriction that in order to use a local work group size greater than (1,1,1), you must specify the reqd_work_group_size attribute on the __kernel definition. that attribute is described in section 6.7.2, page 156 of the 1.0 spec.

    .bri.
  • ebfe
    ebfe
    1 Post

    Re: Pyrit with opencl on the PS3

    ‏2010-03-24T08:14:02Z  
read up about the clEnqueueNDRangeKernel call, and the parts about the local work group size. specifying and making your kernel work with a larger (larger than 1,1,1) local work group size will usually give better performance, because each invocation of the kernel will do more work, and the compiler may be able to schedule things better.

on IBM's current release, we have a restriction that in order to use a local work group size greater than (1,1,1), you must specify the reqd_work_group_size attribute on the __kernel definition. that attribute is described in section 6.7.2, page 156 of the 1.0 spec.

    .bri.
Hi, I'm the author of Pyrit. Masterzorag pointed me to this thread. As far as I can see, the need to specify a value for reqd_work_group_size is an incredibly bad shortcoming of IBM's OpenCL implementation:

Pyrit has no need for work-groups (in OpenCL terms) as every task is completely independent of the others. Therefore we leave local_work_size as NULL when calling clEnqueueNDRangeKernel(). The specification says that the platform has to figure out "appropriate" values for this parameter itself if its meaning is unneeded (as in our case).

The need to specify some arbitrary required minimum work-group size undermines this approach completely: As you said, we get non-optimal performance if the value is too low (e.g. (1,1,1) or undefined). However, we might get even worse results if we specify some other, arbitrary value (e.g. (64,1,1)), as register allocation and shared-memory usage might force platforms like GPUs to run with lower occupancy figures. In the end, we force many "threads" to be bound into a work-group without any need for it.
Can you estimate the performance gain from specifying reqd_work_group_size? I can't test this myself, as I don't have access to a Cell platform...
  • masterzorag
    masterzorag
    25 Posts

    Re: Pyrit with opencl on the PS3

    ‏2010-03-25T15:30:46Z  
    • ebfe
    • ‏2010-03-24T08:14:02Z
Hi, I'm the author of Pyrit. Masterzorag pointed me to this thread. As far as I can see, the need to specify a value for reqd_work_group_size is an incredibly bad shortcoming of IBM's OpenCL implementation:

Pyrit has no need for work-groups (in OpenCL terms) as every task is completely independent of the others. Therefore we leave local_work_size as NULL when calling clEnqueueNDRangeKernel(). The specification says that the platform has to figure out "appropriate" values for this parameter itself if its meaning is unneeded (as in our case).

The need to specify some arbitrary required minimum work-group size undermines this approach completely: As you said, we get non-optimal performance if the value is too low (e.g. (1,1,1) or undefined). However, we might get even worse results if we specify some other, arbitrary value (e.g. (64,1,1)), as register allocation and shared-memory usage might force platforms like GPUs to run with lower occupancy figures. In the end, we force many "threads" to be bound into a work-group without any need for it.
Can you estimate the performance gain from specifying reqd_work_group_size? I can't test this myself, as I don't have access to a Cell platform...
I have to leave my PS3; I'll be back from holiday next week.
If you are interested, from April I can give you ssh access to mine; mail me.
  • masterzorag
    masterzorag
    25 Posts

    Re: Pyrit with opencl on the PS3

    ‏2010-04-13T00:19:53Z  
    read up about the clEnqueueNDRange call, and the parts about the local work group size. specifying and making your kernel work with a larger (larger than 1,1,1) local work group size will usually give better performance, because each invocation of the kernel will do more work, and the compiler may be able to schedule things better.

    on IBM's current release, we have a restriction that in order to use a local work group size greater than (1,1,1), you must specify the reqd_work_group_wize attribute on the __kernel definition. that attribute is describe in section 6.7.2, page 156 of the 1.0 spec.

    .bri.
Thanks, I've read up and added __attribute__((reqd_work_group_size(X, Y, Z))), but got about the same results!
Computed 963.93 PMKs/s total.

With (1, 1, 1) or other numbers I get:
SystemError: Failed to execute kernel (CL_INVALID_WORK_GROUP_SIZE)

    The other relevant line is:
    clEnqueueNDRangeKernel(self->dev_queue, self->dev_kernel, 1, NULL, gWorksize, NULL, 1, &clEvents[0], &clEvents[1])

    Can we do better?
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Pyrit with opencl on the PS3

    ‏2010-04-13T16:54:38Z  
Thanks, I've read up and added __attribute__((reqd_work_group_size(X, Y, Z))), but got about the same results!
Computed 963.93 PMKs/s total.

With (1, 1, 1) or other numbers I get:
SystemError: Failed to execute kernel (CL_INVALID_WORK_GROUP_SIZE)

    The other relevant line is:
    clEnqueueNDRangeKernel(self->dev_queue, self->dev_kernel, 1, NULL, gWorksize, NULL, 1, &clEvents[0], &clEvents[1])

    Can we do better?
    spec says that if you set a reqd_work_group_size attribute, then you need to pass that size into the clEnqueueNDRangeKernel call. something like:
    size_t lWorksize[3] = { 16,1,1 };

    clEnqueueNDRangeKernel(self->dev_queue, self->dev_kernel, 1, NULL, gWorksize, lWorksize, 1, &clEvents[0], &clEvents[1])
    if you have 16,1,1 in the reqd_work_group_size.

    .bri.
  • masterzorag
    masterzorag
    25 Posts

    Re: Pyrit with opencl on the PS3

    ‏2010-04-13T23:37:45Z  
    spec says that if you set a reqd_work_group_size attribute, then you need to pass that size into the clEnqueueNDRangeKernel call. something like:
    size_t lWorksize[3] = { 16,1,1 };

    clEnqueueNDRangeKernel(self->dev_queue, self->dev_kernel, 1, NULL, gWorksize, lWorksize, 1, &clEvents[0], &clEvents[1])
    if you have 16,1,1 in the reqd_work_group_size.

    .bri.
Thanks for the hints, this really changes performance. Some examples from pyrit benchmark:

    16,1,1 Computed 869.67 PMKs/s total.
    16,16,1 Computed 99.02 PMKs/s total.

    64,1,1 Computed 858.74 PMKs/s total.
64,16,1 Computed 118.80 PMKs/s total.

    4,1,1 Computed 849.12 PMKs/s total
    4,4,4 Computed 97.94 PMKs/s total

But it seems that optimal performance is reached without __attribute__, or with some particular X,Y,Z;
I don't know the optimal numbers!
  • masterzorag
    masterzorag
    25 Posts

    Re: Pyrit with opencl on the PS3

    ‏2010-04-19T12:09:08Z  
    spec says that if you set a reqd_work_group_size attribute, then you need to pass that size into the clEnqueueNDRangeKernel call. something like:
    size_t lWorksize[3] = { 16,1,1 };

    clEnqueueNDRangeKernel(self->dev_queue, self->dev_kernel, 1, NULL, gWorksize, lWorksize, 1, &clEvents[0], &clEvents[1])
    if you have 16,1,1 in the reqd_work_group_size.

    .bri.
Testing a 500000-word batch:
with 1,1,1 I got 1300;
with 256,1,1 I got about 1100.
This backs up the statement that "Pyrit has no need for work-groups (in OpenCL terms) as every task is completely independent of the others".

But I've seen demo programs like perlin, where specifying different workgroup-size values I got 28x the computing performance,
and blackscholes, which uses five kernel cores to do different tasks in parallel!

I think a real improvement can be made by adapting the entire _cpyrit_oclkernel.cl code to do some work in parallel, like the sha1_process in it.
Maybe I'm wrong, but the _cpyrit_opencl implementation was derived from the CUDA instruction flow (as a quick add-on), so
it's fairly generic in OpenCL terms; I think an optimized __cpyrit_cellcl could impress!

In fact, with this change http://code.google.com/p/pyrit/source/detail?r=248 (one optimization)
Cell goes from about 1000 to about 1200 PMKs/s; then,
adding the __attribute__ with 1,1,1, Cell gets 1300 PMKs/s.

    Here is an example with several types of compute kernel that work well on Cell
    http://www.ibm.com/developerworks/forums/thread.jspa?messageID=14385403

We should have an optimized interface for pyrit to get real Cell benefits,
something like what _cpyrit_calpp is doing now for ATI cards in pyrit land.
  • masterzorag
    masterzorag
    25 Posts

    Re: Pyrit with opencl on the PS3

    ‏2010-05-07T12:52:29Z  
Testing a 500000-word batch:
with 1,1,1 I got 1300;
with 256,1,1 I got about 1100.
This backs up the statement that "Pyrit has no need for work-groups (in OpenCL terms) as every task is completely independent of the others".

But I've seen demo programs like perlin, where specifying different workgroup-size values I got 28x the computing performance,
and blackscholes, which uses five kernel cores to do different tasks in parallel!

I think a real improvement can be made by adapting the entire _cpyrit_oclkernel.cl code to do some work in parallel, like the sha1_process in it.
Maybe I'm wrong, but the _cpyrit_opencl implementation was derived from the CUDA instruction flow (as a quick add-on), so
it's fairly generic in OpenCL terms; I think an optimized __cpyrit_cellcl could impress!

In fact, with this change http://code.google.com/p/pyrit/source/detail?r=248 (one optimization)
Cell goes from about 1000 to about 1200 PMKs/s; then,
adding the __attribute__ with 1,1,1, Cell gets 1300 PMKs/s.

    Here is an example with several types of compute kernel that work well on Cell
    http://www.ibm.com/developerworks/forums/thread.jspa?messageID=14385403

We should have an optimized interface for pyrit to get real Cell benefits,
something like what _cpyrit_calpp is doing now for ATI cards in pyrit land.
How do I pass -cl-fast-relaxed-math to the kernel?

    Pyrit uses only int operations, as defined in _cpyrit_opencl.h
    #define uint32_t unsigned int
    The OpenCL application should use __global memory buffers whose type's size is a multiple of a quad-word (16 bytes) ... due to alignment checks
A uint should be 4 bytes; should we use uint4?

But pyrit uses these structs:

    typedef struct {
    uint32_t h0,h1,h2,h3,h4;
    } SHA_DEV_CTX;

    typedef struct {
    SHA_DEV_CTX ctx_ipad;
    SHA_DEV_CTX ctx_opad;
    SHA_DEV_CTX e1;
    SHA_DEV_CTX e2;
    } gpu_inbuffer;

About the alignment check, I found this in cl_platform.h:
typedef uint32_t cl_uint4[4] __attribute__((aligned(16)));
How can I pad the data structure to 16 bytes?
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Pyrit with opencl on the PS3

    ‏2010-05-07T14:30:47Z  
How do I pass -cl-fast-relaxed-math to the kernel?

    Pyrit uses only int operations, as defined in _cpyrit_opencl.h
    #define uint32_t unsigned int
    The OpenCL application should use __global memory buffers whose type's size is a multiple of a quad-word (16 bytes) ... due to alignment checks
A uint should be 4 bytes; should we use uint4?

But pyrit uses these structs:

    typedef struct {
    uint32_t h0,h1,h2,h3,h4;
    } SHA_DEV_CTX;

    typedef struct {
    SHA_DEV_CTX ctx_ipad;
    SHA_DEV_CTX ctx_opad;
    SHA_DEV_CTX e1;
    SHA_DEV_CTX e2;
    } gpu_inbuffer;

About the alignment check, I found this in cl_platform.h:
typedef uint32_t cl_uint4[4] __attribute__((aligned(16)));
How can I pad the data structure to 16 bytes?
    -cl-fast-relaxed-math is a kernel compile option that is set by including the string in the options string parameter of clBuildProgram.

The aligned attribute (e.g. "__attribute__((aligned(16)))") is not an alignment check; instead it informs the compiler/linker that the variable must be placed in memory on a 16-byte alignment. This facility can be used to ensure h0 is aligned on a 16-byte boundary. For example:

    
typedef struct {
    uint32_t h0 __attribute__((aligned(16)));
    uint32_t h1, h2, h3, h4;
} SHA_DEV_CTX;


    Adjacent instances of SHA_DEV_CTX will then be padded with 3 32-bit words. Alternatively, you can just add pad variables to your structure, but that won't guarantee the alignment of the structure beyond word (32-bit) aligned.
  • masterzorag
    masterzorag
    25 Posts

    Re: Pyrit with opencl on the PS3

    ‏2010-05-11T16:30:17Z  
    -cl-fast-relaxed-math is a kernel compile option that is set by including the string in the options string parameter of clBuildProgram.

The aligned attribute (e.g. "__attribute__((aligned(16)))") is not an alignment check; instead it informs the compiler/linker that the variable must be placed in memory on a 16-byte alignment. This facility can be used to ensure h0 is aligned on a 16-byte boundary. For example:

typedef struct {
    uint32_t h0 __attribute__((aligned(16)));
    uint32_t h1, h2, h3, h4;
} SHA_DEV_CTX;

    Adjacent instances of SHA_DEV_CTX will then be padded with 3 32-bit words. Alternatively, you can just add pad variables to your structure, but that won't guarantee the alignment of the structure beyond word (32-bit) aligned.
Thanks for answering. I've added:
    const char *options = "-cl-fast-relaxed-math";
    clBuildProgram(self->dev_prog, 0, NULL, options, NULL, NULL);
1) About syntax, how can I add this inline (without declaring a variable)?

    I've read in guide:
    "Memory buffer objects used in conjunction with an SPU accelerator device should be aligned at 128 bytes for best performance."
    We have on host:
    cl_mem g_inbuffer, g_outbuffer;
    g_inbuffer = clCreateBuffer(self->dev_ctx, CL_MEM_READ_ONLY, gWorksize[0]*sizeof(gpu_inbuffer), NULL, &errcode);
    clSetKernelArg(self->dev_kernel, 0, sizeof(cl_mem), &g_inbuffer);

    According to:
    #ifndef uint32_t
    #define uint32_t unsigned int
    #endif

    typedef struct {
    SHA_DEV_CTX ctx_ipad;
    SHA_DEV_CTX ctx_opad;
    SHA_DEV_CTX e1;
    SHA_DEV_CTX e2;
    } gpu_inbuffer;

    typedef struct {
    uint32_t h0,h1,h2,h3,h4;
    } SHA_DEV_CTX;

    We have in kernel:
__global gpu_inbuffer *inbuffer; working over inbuffer[idx], this should be made of 20 uints = 80 bytes, isn't that true?
2) Should we align every g_inbuffer to 128KB for best performance?

I'm thinking of aggregating 4 uints into uint4 vectors this way:
uint4 h0 = { inbuffer[idx].ctx_ipad.h0, inbuffer[idx].ctx_opad.h0, inbuffer[idx].e1.h0, inbuffer[idx].e2.h0 }
uint4 h1 = { inbuffer[idx].ctx_ipad.h1, inbuffer[idx].ctx_opad.h1, inbuffer[idx].e1.h1, inbuffer[idx].e2.h1 }
... and likewise for h2, h3, h4, to operate over 5 uint4 vectors = 80 bytes.
Then perform sha1_process over the uint4 vectors.

    "The maximum number of compute units on an SPU accelerator device is 16. The SPU accelerator device has a maximum local memory size of 256KB."
3) So on the PS3 we have 256 x 6 = 1536KB of local memory, shared between __global, __local, __private variables, temp data, and the kernel?
4) Is there a built-in function that gives us the amount of memory used/free when running a kernel? How do we compute free memory?
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Pyrit with opencl on the PS3

    ‏2010-05-12T17:03:52Z  
Thanks for answering. I've added:
    const char *options = "-cl-fast-relaxed-math";
    clBuildProgram(self->dev_prog, 0, NULL, options, NULL, NULL);
1) About syntax, how can I add this inline (without declaring a variable)?

    I've read in guide:
    "Memory buffer objects used in conjunction with an SPU accelerator device should be aligned at 128 bytes for best performance."
    We have on host:
    cl_mem g_inbuffer, g_outbuffer;
    g_inbuffer = clCreateBuffer(self->dev_ctx, CL_MEM_READ_ONLY, gWorksize[0]*sizeof(gpu_inbuffer), NULL, &errcode);
    clSetKernelArg(self->dev_kernel, 0, sizeof(cl_mem), &g_inbuffer);

    According to:
    #ifndef uint32_t
    #define uint32_t unsigned int
    #endif

    typedef struct {
    SHA_DEV_CTX ctx_ipad;
    SHA_DEV_CTX ctx_opad;
    SHA_DEV_CTX e1;
    SHA_DEV_CTX e2;
    } gpu_inbuffer;

    typedef struct {
    uint32_t h0,h1,h2,h3,h4;
    } SHA_DEV_CTX;

    We have in kernel:
__global gpu_inbuffer *inbuffer; working over inbuffer[idx], this should be made of 20 uints = 80 bytes, isn't that true?
2) Should we align every g_inbuffer to 128KB for best performance?

I'm thinking of aggregating 4 uints into uint4 vectors this way:
uint4 h0 = { inbuffer[idx].ctx_ipad.h0, inbuffer[idx].ctx_opad.h0, inbuffer[idx].e1.h0, inbuffer[idx].e2.h0 }
uint4 h1 = { inbuffer[idx].ctx_ipad.h1, inbuffer[idx].ctx_opad.h1, inbuffer[idx].e1.h1, inbuffer[idx].e2.h1 }
... and likewise for h2, h3, h4, to operate over 5 uint4 vectors = 80 bytes.
Then perform sha1_process over the uint4 vectors.

    "The maximum number of compute units on an SPU accelerator device is 16. The SPU accelerator device has a maximum local memory size of 256KB."
3) So on the PS3 we have 256 x 6 = 1536KB of local memory, shared between __global, __local, __private variables, temp data, and the kernel?
4) Is there a built-in function that gives us the amount of memory used/free when running a kernel? How do we compute free memory?
    If I understand your questions correctly

    1) About syntax, how can I add this inline (without declare a variable)?

    clBuildProgram(self->dev_prog, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL);

    2) We should align to 128KB for best performance every g_inbuffer?

For the SPU accelerator it is best that transfers are aligned to 128 bytes (not 128KB) and that transfers are multiples of 128 bytes. Allocated OpenCL memory objects are aligned to at least 128 bytes.
    3) So, in PS3 we have 256 x 6 = 1536KB of local memory shared between __global, __local, __private, variables, temp data, kernel?

    The SPE's 256 KB local storage is shared between the OpenCL runtime, the program kernels, __local and __private variables, stack, and the SW cache used for global memory accesses.

    4) There is a cl_builtin function that give us values of mem used/free when running a kernel? How compute free mem?

I'm not aware of such a built-in.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: Pyrit with opencl on the PS3

    ‏2010-05-13T13:22:38Z  
    If I understand your questions correctly

    1) About syntax, how can I add this inline (without declare a variable)?

    clBuildProgram(self->dev_prog, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL);

    2) We should align to 128KB for best performance every g_inbuffer?

For the SPU accelerator it is best that transfers are aligned to 128 bytes (not 128KB) and that transfers are multiples of 128 bytes. Allocated OpenCL memory objects are aligned to at least 128 bytes.
    3) So, in PS3 we have 256 x 6 = 1536KB of local memory shared between __global, __local, __private, variables, temp data, kernel?

    The SPE's 256 KB local storage is shared between the OpenCL runtime, the program kernels, __local and __private variables, stack, and the SW cache used for global memory accesses.

    4) There is a cl_builtin function that give us values of mem used/free when running a kernel? How compute free mem?

I'm not aware of such a built-in.
    To elaborate on some of the responses from brokensh...

    2) OpenCL allows the memory object alignment size to be queried on a per-device basis through the CL_DEVICE_MEM_BASE_ADDR_ALIGN attribute. This attribute represents the alignment in BITS of memory objects supported by the given device. As brokensh mentioned, on the SPU device the alignment is 128 bytes or 1024 bits.

    4) As brokensh mentioned, there currently is no builtin to query available/free mem. As well, no allocator builtin is currently supported within a kernel, so I'm not sure what good the query would be anyhow. Kernel memory allocation is controlled through the arguments passed to the kernel as well as statically declared constant storage.

    Presumably you are interested in the amount of free LOCAL memory, in which case this value can be approximated by comparing the kernel CL_KERNEL_LOCAL_MEM_SIZE against the device CL_DEVICE_LOCAL_MEM_SIZE. The difference of these two should give you the amount of free device local memory. This calculation takes into account the kernel text and data usage as well as kernel argument allocations. This does not account for kernel stack usage.
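The two queries described above can be combined on the host side roughly like this. A sketch only, not runnable without an OpenCL platform; `kernel` and `device` are assumed to be valid handles obtained elsewhere:

```c
/* Approximate free SPE local store for a given kernel/device pair. */
cl_ulong dev_local = 0, kern_local = 0;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(dev_local), &dev_local, NULL);
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                         sizeof(kern_local), &kern_local, NULL);
/* The difference approximates free local memory; as noted above,
   kernel stack usage is not accounted for. */
cl_ulong approx_free = dev_local - kern_local;
```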
  • masterzorag
    masterzorag
    25 Posts

    Re: Pyrit with opencl on the PS3

    ‏2010-05-19T01:30:03Z  
    To elaborate on some of the responses from brokensh...

    2) OpenCL allows the memory object alignment size to be queried on a per-device basis through the CL_DEVICE_MEM_BASE_ADDR_ALIGN attribute. This attribute represents the alignment in BITS of memory objects supported by the given device. As brokensh mentioned, on the SPU device the alignment is 128 bytes or 1024 bits.

    4) As brokensh mentioned, there currently is no builtin to query available/free mem. As well, no allocator builtin is currently supported within a kernel, so I'm not sure what good the query would be anyhow. Kernel memory allocation is controlled through the arguments passed to the kernel as well as statically declared constant storage.

    Presumably you are interested in the amount of free LOCAL memory, in which case this value can be approximated by comparing the kernel CL_KERNEL_LOCAL_MEM_SIZE against the device CL_DEVICE_LOCAL_MEM_SIZE. The difference of these two should give you the amount of free device local memory. This calculation takes into account the kernel text and data usage as well as kernel argument allocations. This does not account for kernel stack usage.
    Connecting to IBM ACCELERATOR CellBE processor...
    CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS 3
    CL_DEVICE_MAX_WORK_ITEM_SIZES 256
    CL_DEVICE_MAX_WORK_GROUP_SIZE 256
    CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE 65536

    CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT 4
    That's why we should use uint4;

    CL_DEVICE_LOCAL_MEM_SIZE 248832
Is that the total for the entire Cell/B.E., or is it per SPE (about 256KB, so it should be counted x6 on the PS3)?

    CL_DEVICE_MEM_BASE_ADDR_ALIGN 1024
That's why I'm tuning pyrit to pass a fixed set of 1024 (number of PMKs in a single array to be processed by the kernel) x 80 (bytes for a single PMK data set).

gWorksize[0] is computed from the size of the array and rounded to the nearest 256, so I get 1024 99% of the time;
g_inbuffer = clCreateBuffer ... gWorksize[0]*sizeof(gpu_inbuffer) ... so I get 1024 x 80 = 81920 bytes of memRO
clEnqueueWriteBuffer ... the same ... so I get 1024 x 80 = 81920 bytes
g_outbuffer = clCreateBuffer ... gWorksize[0]*sizeof(gpu_outbuffer) ... so I get 1024 x 40 = 40960 bytes of memWO
g_inbuffer and g_outbuffer are __global pointers, so the question is:
am I respecting CL_DEVICE_MEM_BASE_ADDR_ALIGN?

One more question:
How can I pack 4 input fields into a uint4 vector? Native uint4 vectors are available while processing the kernel!
I want to pack 4 structs of 4 x 5 uints = 4 x 80 bytes = 320 bytes into 20 uint4 = 20 x 16 bytes = 320 bytes.

    typedef struct {
    uint32_t h0,h1,h2,h3,h4;
    } SHA_DEV_CTX;

    typedef struct {
    SHA_DEV_CTX ctx_ipad;
    SHA_DEV_CTX ctx_opad;
    SHA_DEV_CTX e1;
    SHA_DEV_CTX e2;
    } gpu_inbuffer;

    typedef struct {
    cl_uint4 ctx_ipad_h0, ctx_ipad_h1, ctx_ipad_h2, ctx_ipad_h3, ctx_ipad_h4;
    cl_uint4 ctx_opad_h0, ctx_opad_h1, ctx_opad_h2, ctx_opad_h3, ctx_opad_h4;
    cl_uint4 e1_h0, e1_h1, e1_h2, e1_h3, e1_h4;
    cl_uint4 e2_h0, e2_h1, e2_h2, e2_h3, e2_h4;
    } vector;
  • masterzorag
    masterzorag
    25 Posts

    On the isolated pmk kernel...

    ‏2010-07-12T23:44:34Z  
I've isolated the OpenCL core used by pyrit so it can be run under controlled conditions, to see if improvements can be made;
host, header and kernel are included. I'm using the newest 0.2; compile with:
gcc -o try2 try2.c -lm -lstdc++ -lOpenCL -std=c99
By default it sets up an array of 20480 empty structs and performs the in/out compute part to obtain the PMKs:
./try2

    Starting...
    calloc for size: 1638400, array count: 20480
    Building kernel...
    Connecting to IBM ACCELERATOR CellBE processor...
    memR: 1638400, EWB: 1638400
    Executing with CL_KERNEL_WORK_GROUP_SIZE 1...
    Timing...
    kernel execution time: 16.40
    number of computes/sec: 1248.54
    Reading back, ERB: 819200
    Done!

Mind you, this is only pyrit's OpenCL part; it doesn't include any of the input data loading functions, though the memory accesses should be as pyrit does them.
This host code runs the untouched pyrit OpenCL code; as we can see, the svn code computes about 1200 per second!

By the way, great 0.2 release; with clu and extensions like NUMA and hugepages there are many things to test...
  • masterzorag
    masterzorag
    25 Posts

    pmk, dissected

    ‏2011-02-20T13:20:56Z  
I've isolated the OpenCL core used by pyrit so it can be run under controlled conditions, to see if improvements can be made;
host, header and kernel are included. I'm using the newest 0.2; compile with:
gcc -o try2 try2.c -lm -lstdc++ -lOpenCL -std=c99
By default it sets up an array of 20480 empty structs and performs the in/out compute part to obtain the PMKs:
./try2

    Starting...
    calloc for size: 1638400, array count: 20480
    Building kernel...
    Connecting to IBM ACCELERATOR CellBE processor...
    memR: 1638400, EWB: 1638400
    Executing with CL_KERNEL_WORK_GROUP_SIZE 1...
    Timing...
    kernel execution time: 16.40
    number of computes/sec: 1248.54
    Reading back, ERB: 819200
    Done!

Mind you, this is only pyrit's OpenCL part; it doesn't include any of the input data loading functions, though the memory accesses should be as pyrit does them.
This host code runs the untouched pyrit OpenCL code; as we can see, the svn code computes about 1200 per second!

By the way, great 0.2 release; with clu and extensions like NUMA and hugepages there are many things to test...
Some time ago I was talking here about how to improve the pyrit kernel, but I was missing some knowledge...
How can you improve CL code without knowing about C, bits and bytes, memory alignment, math, and so on?
Now I'm back; forget some of the foolish things I wrote before, we can start improving for real.

    I have host code that does the prepare part in C, linked with OpenSSL, as pyrit does.
    After this prepare function, pyrit stores the entire data for 1 PMK (80 bytes) into a struct of scalar uints:
    ipad.h0 to ipad.h4, opad.h0 to opad.h4, e1.h0 to e1.h4 and e2.h0 to e2.h4 = 4x5x4 bytes = 80 bytes.
    pyrit passes arrays of these input structs to the kernel; after processing, the output is half that size (40 bytes):
    pmk1.h0 to pmk1.h4, pmk2.h0 to pmk2.h4 = 2x5x4 bytes = 40 bytes.
    After reading back, the host code composes the final PMK from the first 32 bytes of those 40 and stores it as the result.
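    The layout described above can be sketched in C like this (struct and field names follow the description in this post, not pyrit's actual source; the sizes fall out of the five 32-bit words per SHA-1 state):

    ```c
    #include <stdint.h>

    /* One SHA-1 state: five 32-bit words, h0..h4 = 5 x 4 = 20 bytes */
    typedef struct { uint32_t h0, h1, h2, h3, h4; } sha1_state;

    /* Input for one PMK: four SHA-1 states = 4 x 5 x 4 = 80 bytes */
    typedef struct { sha1_state ipad, opad, e1, e2; } pmk_in;

    /* Output for one PMK: two SHA-1 states = 2 x 5 x 4 = 40 bytes */
    typedef struct { sha1_state pmk1, pmk2; } pmk_out;
    ```

    Since every field is a uint32_t, there is no padding and the struct sizes match the 80-byte input and 40-byte output exactly.
    
    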

    I'm doing this instead:
    I set up 4 arrays of uint4, *ipad, *opad, *e1, *e2, to make use of vectors in the kernel computation.
    After prepare, I fold the data for 4 PMKs into the x, y, z and w components of those arrays, like this:
    data for the first pmk, pmk[0]:
    ipad[0].h0 to ipad[0].h4 into ipad[0..4].x
    opad[0].h0 to opad[0].h4 into opad[0..4].x
    e1[0].h0 to e1[0].h4 into e1[0..4].x
    e2[0].h0 to e2[0].h4 into e2[0..4].x
    ... data for the fourth pmk, pmk[3]:
    ipad[3].h0 to ipad[3].h4 into ipad[0..4].w
    opad[3].h0 to opad[3].h4 into opad[0..4].w
    e1[3].h0 to e1[3].h4 into e1[0..4].w
    e2[3].h0 to e2[3].h4 into e2[0..4].w

    As a result I have 4 arrays (*ipad, *opad, *e1, *e2) of 5 uint4 each (indices 0 to 4), with data for pmk[0] in .x, pmk[1] in .y, pmk[2] in .z and pmk[3] in .w = 4x5x16 = 320 bytes.
    That covers 4 PMKs, since 320/4 = 80 bytes per PMK.

    This is my solution for aligning memory to 16 bytes (a uint4 is 16 bytes), making use of native uint4 in the kernel to do 4 PMKs at a time, and coalescing data accesses.
    What do you think about this?
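    The host-side refold described above could look like this (a sketch under my reading of the layout; the uint4_t stand-in and the fold4 name are mine, and the SHA-1 state uses an h[5] array instead of named h0..h4 fields for easy looping):

    ```c
    #include <stdint.h>

    /* One SHA-1 state, stored as an array for easy looping (h[0]..h[4]) */
    typedef struct { uint32_t h[5]; } sha1_state;

    /* AOS input for one PMK, as pyrit prepares it: 80 bytes */
    typedef struct { sha1_state ipad, opad, e1, e2; } pmk_in;

    /* Host-side stand-in for the OpenCL uint4 type: 16 bytes */
    typedef struct { uint32_t x, y, z, w; } uint4_t;

    /* Fold 4 AOS input structs into SOA uint4 arrays of 5 elements each:
       element i holds word h[i], with pmk[0..3] in components x..w. */
    static void fold4(const pmk_in in[4],
                      uint4_t ipad[5], uint4_t opad[5],
                      uint4_t e1[5],   uint4_t e2[5])
    {
        for (int i = 0; i < 5; i++) {
            ipad[i] = (uint4_t){ in[0].ipad.h[i], in[1].ipad.h[i],
                                 in[2].ipad.h[i], in[3].ipad.h[i] };
            opad[i] = (uint4_t){ in[0].opad.h[i], in[1].opad.h[i],
                                 in[2].opad.h[i], in[3].opad.h[i] };
            e1[i]   = (uint4_t){ in[0].e1.h[i],   in[1].e1.h[i],
                                 in[2].e1.h[i],   in[3].e1.h[i] };
            e2[i]   = (uint4_t){ in[0].e2.h[i],   in[1].e2.h[i],
                                 in[2].e2.h[i],   in[3].e2.h[i] };
        }
    }
    ```

    The four output arrays total 4 x 5 x 16 = 320 bytes, i.e. 80 bytes per PMK, matching the AOS size.
    
    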

    Putting it simply: in this 4-PMK case I use global=1, so the kernel runs 1 work-item that computes all 4.
    My kernel does gid5 = get_global_id(0) * 5 to grab 5 uint4s and process them through SHA-1; the output covers 4 PMKs, and it works!
    After reading back, I've included a verify function that tells me whether the output data for the 4 PMKs is valid; this verify method compares the results against valid ones computed by the original pyrit kernel.
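    Such a verify step could be sketched like this (hypothetical names, not the poster's actual code; the reference words would come from the original scalar pyrit kernel):

    ```c
    #include <stdint.h>

    /* Host-side stand-in for the OpenCL uint4 type */
    typedef struct { uint32_t x, y, z, w; } uint4_t;

    /* Compare the vectorized SOA output (word i of pmk j in component j of
       out[i]) against reference words ref[j][i] from the scalar kernel.
       Returns 1 if all 4 PMKs match, 0 on any mismatch. */
    static int verify4(const uint4_t out[5], uint32_t ref[4][5])
    {
        for (int i = 0; i < 5; i++) {
            const uint32_t c[4] = { out[i].x, out[i].y, out[i].z, out[i].w };
            for (int j = 0; j < 4; j++)
                if (c[j] != ref[j][i])
                    return 0;   /* pmk j, word i differs */
        }
        return 1;
    }
    ```
    
    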

    Now I have full control over the PMK algorithm (with verify working, I know whether my computations are right); next I will start targeting the different address spaces and the use of __local.

    Stay curious.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: pmk, dissected

    ‏2011-02-21T17:56:35Z  
    masterzorag,

    Your strategy of changing your data structures from an array of structures (AOS) to a structure of arrays (SOA), in which each array is sized to match the device's preferred vector width, should yield improved performance. This is especially true on CPU devices and SPE accelerator devices that are optimized for SIMD execution.

    Dan B.
  • masterzorag
    masterzorag
    25 Posts

    pmk, I've got improvements!

    ‏2012-01-25T11:39:52Z  
    Check this one: http://masterzorag.blogspot.com/2012/01/opencl-ps3-improvements.html
    I have to optimize even more now!
    The second kernel uses __local and async_work_group_copy to move data, but a different SHA-1 function that lets me set localsize = 64!
    My fastest kernel operates on __private and doesn't use __local at all, but it needs localsize = 1 to run!
    It's a more vectorized version operating on uint8: we need only 8192 SHA-1 rounds, computing outbuffer1 and outbuffer2 at the same time!