Topic
  • 9 replies
  • Latest Post - 2011-07-01T12:49:50Z by SystemAdmin
erman_ibm
erman_ibm
8 Posts

Pinned topic How do CBEA processors actually process work-items? And many other questions

2011-06-16T13:08:02Z |
Hi,

As the title of the thread suggests, I want to ask some questions and get hints on how to optimize my OpenCL application on the Cell processor (I use a PS3, Fedora 9, OpenCL SDK v0.3). The OpenCL on Power installation and user's guide already gives me some information. However, I hope others here can help clarify more, or correct me if I have misunderstood the implementation of OpenCL on Cell.

1. On the PS3, the PPE is a CPU device type (CL_DEVICE_TYPE_CPU) with 2 compute units (each hardware thread of the PPE is a compute unit). How is the OpenCL memory model mapped onto the Cell system (PS3) if I use the CPU device type (the PPE)? I know that the PPE has a 32 KB L1 instruction cache, a 32 KB L1 data cache, and a 512 KB L2 cache.

global memory --> main memory
private memory --> ??
constant memory --> ??
local memory --> ??
global memory cache, if any --> ??

Also, why is the reported value (CL_DEVICE_GLOBAL_MEM_SIZE, which I print using cluPrintInfo()) not constant? It changes on each run; sometimes I get 8 MB, other times 11 MB, etc.
2. The six SPEs on the PS3 form an accelerator device type with 6 compute units. The same questions as in no. 1 apply.

How is the memory model mapped?

global memory --> ??
private memory --> ??
constant memory --> ??
local memory --> ??

The global memory size and local memory size are not constant either. Any explanation for this?
I get a local memory size (CL_DEVICE_LOCAL_MEM_SIZE) of 242 KB, and sometimes 244 KB. The LS size is 256 KB, right?
3. How do CBEA processors actually process work-items?

For comparison, here is the AMD implementation:

Radeon HD 5870
- 20 cores/SIMD engines --> 20 compute units (in OpenCL terms)
- each core/SIMD engine consists of 16 stream processors (no OpenCL term here)
- each stream processor is a 5-way VLIW unit with 5 ALUs --> so 20 * 16 * 5 = 1600 ALUs in total (or processing elements, in OpenCL terms)

In the AMD model:
- 64 work-items form a wavefront, which is executed on one SIMD engine.
- each SIMD engine can have multiple wavefronts in flight to hide memory access latency.
- 4 work-items from one wavefront are pipelined in one stream processor.

How about Cell? How do the 6 SPEs process work-items?
How many ALUs or processing elements are there in the SPU accelerator device?

I really need to know this because I want to compare the performance of my kernel on an AMD GPU, a CPU, and the Cell processor.

My preliminary result: Cell is the slowest, even compared to the OpenCL CPU implementation. What is going on here on Cell? It runs at 3.2 GHz while the GPU runs at only 850 MHz. Is it because of the PS3? 1600 ALUs compared to ??? ALUs on 6 SPEs? Is it the compiler? Or some other reason?

I really expected better results from OpenCL on Cell compared to OpenCL on the CPU; at the least, I hoped it could compete with NVIDIA or AMD GPUs.
4. Is there any information about the IBM OpenCL compiler implementation, at least the big picture (if it's not confidential to disclose here)? Is there a way to look into the .ocl_bin file to find the assembly instructions?
Those are my questions for now. I really appreciate any help and answers from anyone here.
Updated on 2011-07-01T12:49:50Z by SystemAdmin
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: How do CBEA processors actually process work-items? And many other questions

    2011-06-16T15:35:00Z
    erman,

    In an attempt to answer all or most of your questions:

    1 & 2. Let me preface this answer with the statement that this is not my area of expertise.

    The reported memory sizes are an area that I want to work on when I get some spare time. Instead of OpenCL stating that global memory is all of virtual memory, we currently report back 66% of available physical memory. This result can vary significantly, especially when memory is consumed by the file cache. So flushing the file cache (e.g. echo 3 > /proc/sys/vm/drop_caches) will maximize the reported global memory. It should be noted that many of these heuristics for computing memory sizes were influenced by the OpenCL conformance test.

    On the CPU device, all memory types occupy a common address domain, system memory, so the memory resources are shared. The reported local memory is constrained to 512K so that application portability is preserved for devices that have private local memory. For example, on the SPE accelerator device, private and local memory occupy the SPE's 256 KB local storage. CL_DEVICE_LOCAL_MEM_SIZE equals the space left over after subtracting the SW cache and the on-device OpenCL runtime. Of course, your program, its stack, its private variables, etc., will also occupy this memory, so you won't really have all the reported memory available for local buffers. The CLU library (provided in the OpenCL samples) provides a utility for computing the available local memory after subtracting the kernel's local memory usage.
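
    If you want to see what is actually being reported, here is a minimal, untested sketch using the standard clGetDeviceInfo API (dev is assumed to be an already-obtained cl_device_id; the function name is just illustrative):

        /* Untested sketch: print the reported memory sizes.
           "dev" is assumed to be an already-obtained cl_device_id. */
        #include <stdio.h>
        #include <CL/cl.h>

        void print_mem_sizes(cl_device_id dev)
        {
            cl_ulong gmem = 0, lmem = 0;

            /* Global size is a heuristic here (~66% of available physical
               memory), so it can change from run to run. */
            clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_SIZE,
                            sizeof(gmem), &gmem, NULL);
            /* Local size is what is left of the 256 KB local store after
               the SW cache and on-device runtime are subtracted. */
            clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_SIZE,
                            sizeof(lmem), &lmem, NULL);

            printf("global: %llu MB, local: %llu KB\n",
                   (unsigned long long)(gmem >> 20),
                   (unsigned long long)(lmem >> 10));
        }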
    3. For both the CPU and SPU accelerator devices, work-groups are equally parceled out to the compute units. On a PS3, there are 2 CPU compute units and 6 SPU accelerator compute units. A work-group consists of 1 to N work-items, where N is a power of 2, at most 1024 for a CPU device and 256 for an SPU accelerator device.

    I suspect that you are not getting good performance on the SPU for one (or both) of the following reasons:

    (A) Your SPU compute kernel directly accesses global (or constant) memory. Such accesses end up going through a SW cache provided by the kernel compiler. The recommended performance solution is to exploit the async_work_group_copy built-ins to move data from global memory into local memory (the SPU's local storage), operate on the local memory, then copy the data back from local memory to global memory, again using async_work_group_copy. Take a look at some of the provided samples for examples (see the first sketch below).
    (B) Your application is scalar. The SPU is a vector machine and can only process 16-byte vectors. Therefore, there are some inherent inefficiencies when computing scalar results. The current OpenCL v0.3 kernel compiler doesn't support implicit vectorization, so you would have to manually vectorize your code (see the second sketch below).
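
    To make (A) concrete, here is a minimal, untested kernel sketch of the copy-in / compute / copy-out pattern. The kernel name, arguments, and the scaling operation are only illustrative, not taken from the samples:

        /* Pattern (A): stage a work-group's slice of global memory
           through local memory (the SPU's local store). All work-items
           execute the copy calls with identical arguments. */
        __kernel void scale(__global const float4 *in,
                            __global float4 *out,
                            __local float4 *tile,
                            float factor)
        {
            size_t lsz  = get_local_size(0);
            size_t base = get_group_id(0) * lsz;

            /* Copy this work-group's slice into local memory. */
            event_t ev = async_work_group_copy(tile, in + base, lsz, 0);
            wait_group_events(1, &ev);

            /* Operate on local memory only. */
            tile[get_local_id(0)] *= factor;
            barrier(CLK_LOCAL_MEM_FENCE);

            /* Copy the results back to global memory. */
            ev = async_work_group_copy(out + base, tile, lsz, 0);
            wait_group_events(1, &ev);
        }

    And for (B), an untested sketch of manual vectorization: the same element-wise add written per-float and per-float4. The float4 version fills the SPU's 16-byte vector width, and the NDRange shrinks by 4x:

        /* Scalar version: one float per work-item. On the SPU each add
           uses only a fraction of the 16-byte vector unit. */
        __kernel void add_scalar(__global const float *a,
                                 __global const float *b,
                                 __global float *c)
        {
            size_t i = get_global_id(0);
            c[i] = a[i] + b[i];
        }

        /* Manually vectorized: one float4 per work-item, so four adds
           are done per vector instruction. */
        __kernel void add_vec4(__global const float4 *a,
                               __global const float4 *b,
                               __global float4 *c)
        {
            size_t i = get_global_id(0);
            c[i] = a[i] + b[i];
        }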
    4. The binary files (e.g. .ocl_bin) are standard ELF objects and can be dumped using standard ELF tools like objdump and readelf. For Cell, there are device-specific versions of these tools, prefixed with "ppu-" for the CPU device and "spu-" for the SPU accelerator device.
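
    For example (the file name is just illustrative):

        spu-objdump -d mykernel.ocl_bin    # disassemble an SPU kernel binary
        spu-readelf -h mykernel.ocl_bin    # show its ELF headers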
  • erman_ibm
    erman_ibm
    8 Posts

    Re: How do CBEA processors actually process work-items? And many other questions

    2011-06-17T15:00:45Z
    Thank you. I will try to optimize my kernels and will come back if I have more questions.

    Erman
  • erman_ibm
    erman_ibm
    8 Posts

    Re: How do CBEA processors actually process work-items? And many other questions

    2011-06-30T09:50:36Z
    Hi,

    I have some more questions.

    You said that the work-groups are equally parceled out to the compute units. Let's take an example: I have 20 work-groups (each WG has 256 work-items, so 5120 work-items in total).

    Questions:

    1. (*Work-group distribution*) In my understanding, these 20 WGs are distributed equally to the 6 SPEs in the PS3. Each SPE gets 3 WGs (= 768 work-items per SPE). The remaining 2 WGs must wait to be processed until an SPE is free. Is that correct?

    2. (*Work-group processing in one SPE*) Following case no. 1, each SPE gets 768 work-items (3 work-groups). But as far as I know, only one work-group can be executed at a time, i.e. 256 work-items, while the remaining 512 work-items have to wait until the first work-group finishes executing. Is that correct?

    3. Now I focus on the processing of one work-group (i.e., the 256 work-items) in one SPE. Are these 256 work-items really executed in parallel all at once, or do they take turns? For example, 4 work-items first, then another four, and so on until all 256 work-items are processed.

    Thank you
  • erman_ibm
    erman_ibm
    8 Posts

    Re: How do CBEA processors actually process work-items? And many other questions

    2011-06-30T09:54:47Z
    I have new questions.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: How do CBEA processors actually process work-items? And many other questions

    2011-06-30T15:25:02Z
    You said that the work-groups are equally parceled out to the compute units. Let's take an example: I have 20 work-groups (each WG has 256 work-items, so 5120 work-items in total).

    Questions:

    1. (*Work-group distribution*) In my understanding, these 20 WGs are distributed equally to the 6 SPEs in the PS3. Each SPE gets 3 WGs (= 768 work-items per SPE). The remaining 2 WGs must wait to be processed until an SPE is free. Is that correct?

    Roughly, yes. The distribution could be different if the time to process the work were unequal -- i.e., if the work-group on one SPE took a really long time, then the other SPEs would fetch and work on all of the remaining groups. But yes, if the work-groups are equal, that is how it would probably get distributed.
    2. (*Work-group processing in one SPE*) Following case no. 1, each SPE gets 768 work-items (3 work-groups). But as far as I know, only one work-group can be executed at a time, i.e. 256 work-items, while the remaining 512 work-items have to wait until the first work-group finishes executing. Is that correct?

    Yes, the SPE works on one work-group at a time, since the SPE is single-threaded. Once it's done with one work-group, it goes on to the next work-group.
    3. Now I focus on the processing of one work-group (i.e., the 256 work-items) in one SPE. Are these 256 work-items really executed in parallel all at once, or do they take turns? For example, 4 work-items first, then another four, and so on until all 256 work-items are processed.

    The SPE is single-threaded, so only one thing happens at a time, meaning that at the lowest level, one work-item is being processed at a time. That said, the compiler knows that it will need to do more than one, so it can use loop unrolling and whatever other techniques it has in order to process them faster.

    .bri.
  • erman_ibm
    erman_ibm
    8 Posts

    Re: How do CBEA processors actually process work-items? And many other questions

    2011-06-30T20:21:16Z
    Thank you, Brian.

    I thought that an SPE processes 4 work-items simultaneously because it has 4 ALUs. Thank you for clarifying this. I was a little confused between the SPE and the VLIW architecture (used in AMD GPUs, where a hardware thread can process 64 work-items simultaneously) because of the number of ALUs.

    Actually, I work on OpenCL on the GPU and Cell for my school project and thesis. It would be a shame if I wrote wrong information about this.

    I hope IBM will include more information about this in future versions of the guide.

    One more question. Brian, do you work for IBM? Are you Brian Watt (one of the members of the OpenCL working group from IBM)? I apologize for guessing.
    And the person who answered my question before: is he Dan Brokenshire (also a member of the OpenCL working group from IBM)?

    I want to acknowledge both of you, because I use the information/answers from both of you in my writing.
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: How do CBEA processors actually process work-items? And many other questions

    2011-06-30T20:36:13Z
    Yes, the SPE can do 4 operations at the same time via its vector instructions (i.e., a float4 times a float4 will do the 4 floating-point multiplies together). And if you code your kernel to use vectors (int4, float4, etc.), the SPE will process things at the same time. However, today our OpenCL compiler does not automatically convert non-vector code into vector code.

    We're always interested in what people are using OpenCL for and their thoughts - what is your school project? Can you share any of the code? What is your impression of OpenCL and of our implementation compared to others?

    Yes, I work for IBM. I'm Brian Horton; and yes, the other person was Dan Brokenshire.

    .bri.
  • erman_ibm
    erman_ibm
    8 Posts

    Re: How do CBEA processors actually process work-items? And many other questions

    2011-07-01T12:09:04Z
    My school project topic is parallel computation (statistical methods) on data streams. I plan to use OpenCL on a GPU or the Cell processor for this.

    Right now, I'm learning how OpenCL works on the GPU and Cell, what the differences between the IBM and AMD implementations are, etc.

    As for the code: so far I have tried some simple vector operations (add, subtract, multiply) on the AMD 5870 GPU and on Cell. The performance on the AMD GPU (kernel execution time) is better than on Cell (PS3). However, this is not because IBM's OpenCL is bad. It is only because what I tried is scalar code (it does not use vector data types such as float4, int4, etc.) and does not use async_work_group_copy() to local memory as Dan suggested. I also realize that the Cell system has host-unified memory, so it is not optimal to use clEnqueueRead/WriteBuffer().

    Here are my impressions after about 2 months of learning OpenCL on the AMD and IBM platforms:

    • Some work has to be done to port OpenCL code from the GPU implementation to Cell. AMD uses C++ (the OpenCL C++ wrapper) while IBM provides its own CLU library in C. This makes host-code portability difficult. But this is not actually a problem, because it's up to the programmer to choose which one to use. CLU is good for helping a beginner (like me :)) start writing OpenCL code on Cell.

    • The documentation is clear enough, except for the questions I asked in this thread. That is understandable right now, since IBM has several systems (POWER7, CBEA, etc.) and it is not possible to include all the information in one user guide. I believe the documentation will be improved in the next release. I do not know whether other people also use OpenCL on the SPE; maybe they prefer to use it on newer processors such as POWER6 or POWER7.

    Talking about OpenCL on the CPU, I have more questions.
    I think there is a lack of documentation explaining how it works on the CPU. It seems the vendors forget that not everyone out there has an expensive GPU, while many have multicore CPUs that can run OpenCL.

    Can you explain how it works on Cell's PPE?

    1. How are work-groups and work-items mapped for execution? Are they mapped to pthreads or something similar?
    I only know that the PPE has 2 threads --> that is 2 compute units --> one work-group is mapped to one compute unit.

    2. On an AMD CPU (x86), the kernel code is compiled to x86 assembly that uses SSE instructions (XMM registers, etc.). I know IBM has the AltiVec instruction set; does the compiler use it?
  • SystemAdmin
    SystemAdmin
    131 Posts

    Re: How do CBEA processors actually process work-items? And many other questions

    2011-07-01T12:49:50Z
    erman,

    Both the C++ wrapper and CLU sit on top of OpenCL. You are free to use either facility (or neither) to code to either of the implementations, AMD or Cell.

    You are correct. The CPU (PPE) implementation uses a pthread for each compute unit: the OpenCL implementation creates one pthread for each of the two SMT threads per core.
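
    If you want to verify that mapping, here is an untested one-line sketch (dev is assumed to be the CPU device's cl_device_id):

        /* Untested sketch: on the PS3's PPE this should report 2 compute
           units -- one per SMT thread, each backed by one pthread. */
        cl_uint units = 0;
        clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(units), &units, NULL);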

    The IBM kernel compiler targets the device's instruction set: either the SPU ISA (for the Cell SPE accelerator device) or the Power ISA (version 2.02, for the Cell CPU device).

    Dan B.