|Forum watch: Overcoming DMA list issues|
|From August 1 to September 1, 2008: What you need to know about DMA special handling to move "small" messages. How do I make my BLAS program access the SPEs? Why do I get nonresponse timeouts when I try to debug with the Cell sim? Can I move PPE cache data to ls/SPE? How do I get a clean compile to build the CellSs framework? Why are some fields not displayed correctly with |
This blog-based column looks at some of the more interesting problems and challenges posed recently in the Cell Broadband Engine Architecture forum.
Koobas wants to know why he can't [successfully] use DMA lists with one-byte messages: I am trying to use DMA lists with one-byte messages and I am having problems. Simple DMAs work fine, but my attempts to use DMA lists result in deadlocks. (I am still investigating the issue.) There seems to be "asymmetry" in how DMA lists work compared to simple DMAs.
If you're doing a one-byte DMA, the sub-word offset has to match at the source and at the destination. However, DMA list elements do not seem to have that limitation. Looks like the only requirement is that at the source address the byte is the first byte in the quadword, but it can be any byte at the destination. Also, please note that the effective address can designate another local store, so it seems legal for transfers between local stores, which is what I am trying to do.
[IBM SDK Service Administrator]: See if this section from CBE_Handbook_v1.1_24APR2007_pub.pdf (page 530, section 19.4.4) explains what you're seeing:
"The DMA list command specifies one LS starting address for the entire DMA list, and the data in the LS is accessed sequentially with a minimum step of one quadword. Each list element specifies a new starting address in the effective-address space, but the LS address for a list element starts where the last list element left off. The one exception to this rule is that when a list element in a DMA list contains a short transfer size of 1, 2, 4, or 8 bytes, the LS address increments as needed to start the next list element on a quadword boundary; thus, for a short transfer size, some data in LS will be skipped between list elements. For transfer sizes that are a multiple of a quadword, no data is skipped."
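The stepping rule in the Handbook passage can be modeled in a few lines of C. This is only a sketch of the rule as quoted, not SDK code: the function name `next_ls_addr` and the `QUADWORD` constant are made up for illustration, and the model assumes the only legal sizes are 1, 2, 4, 8, or a multiple of 16 bytes.

```c
#include <assert.h>
#include <stdint.h>

#define QUADWORD 16u

/* Simplified model (hypothetical helper, not an SDK function): given the LS
 * address and transfer size of one list element, return the LS address at
 * which the NEXT list element starts. */
uint32_t next_ls_addr(uint32_t ls_addr, uint32_t size)
{
    /* Sizes assumed valid in this model: 1, 2, 4, 8, or a multiple of 16. */
    assert(size == 1 || size == 2 || size == 4 || size == 8 ||
           (size != 0 && size % QUADWORD == 0));

    uint32_t end = ls_addr + size;

    /* Round up to the next quadword boundary. For quadword-multiple sizes
     * starting on a boundary this changes nothing, so no data is skipped. */
    return (end + QUADWORD - 1u) & ~(QUADWORD - 1u);
}
```

In this model a one-byte element starting at LS address 0x1000 leaves the next element at 0x1010, so 15 bytes of local store are skipped, whereas a 32-byte element simply advances to 0x1020.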
[Koobas]: Consider the following scheme. I use a DMA list to move messages of size 1 byte:
- The source is a physical address in local store of an SPE.
- The destination is an effective address mapping to another location in the same local store of the same SPE.
- At the source the addresses are quadword aligned, at the destination they are not.
Is there any reason why this scheme would fail?
[mkistler]: Yes ... this scheme is guaranteed to fail. The CBEA, Section 7.4, says:
"Special handling is performed for list elements with a transfer size less than 16 bytes. In this case, the local storage address for the transfer is adjusted to have the same quadword (16-bytes) alignment as the effective address for the transfer."
So ... you may think that the local store address is quadword aligned ... but it's not. It takes on the alignment of the effective address.
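The special handling mkistler quotes can be sketched the same way. The helper below is hypothetical (the name `adjusted_ls_addr` is not from the SDK or the CBEA) and models only the one sentence quoted from Section 7.4: for a list element smaller than 16 bytes, the LS address keeps its quadword but takes its offset within that quadword from the effective address.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model (hypothetical helper): the LS address actually used for
 * one list element, given the nominal LS address, the element's effective
 * address, and its transfer size in bytes. */
uint32_t adjusted_ls_addr(uint32_t ls_addr, uint64_t ea, uint32_t size)
{
    assert(size != 0);

    /* Transfers of a quadword or more are not adjusted in this model. */
    if (size >= 16u)
        return ls_addr;

    /* Keep the quadword of the LS address, take the low four bits
     * (the offset inside the quadword) from the effective address. */
    return (ls_addr & ~0xFu) | (uint32_t)(ea & 0xFu);
}
```

With a nominal LS address of 0x2000 and an effective address ending in 0x7, the one-byte transfer touches LS byte 0x2007, not 0x2000. That is why a source byte placed at the start of a quadword is not the byte that gets sent when the destination is unaligned.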
[Koobas]: Thank you so much for the explanation. Finally I have the whole picture -- it all makes sense now. I am using this scheme to set completion flags at the destination so I can make the alignment at the source whatever it needs to be. For a dual-cell algorithm I have to send messages to 16 destinations and it fills out the message queues stalling double-buffered data transfers, so I had to move from simple DMAs to DMA lists.
I tried this scheme and it does what I want it to. This forum is awesome. It always solves my problems.
Editor's note: Even though I have nothing to do with the success of this forum, I have to enthusiastically second the previous comment. This forum -- with both its IBM and non-IBM participants -- is one of the best tools I've ever seen for technology troubleshooting.
|Can you answer this one? Why can't I build the CellSs framework? Answer: think compiler pointers.|
ultimA wants to know why the BLAS program he wrote won't take advantage of the SPEs: I am using the BLAS library supplied with SDK 3.0 and, if I understand the docs correctly, it should automatically offload the calculations to one or more of the SPEs in some accelerated routines, like SGEMV. The program I wrote, however, does not seem to take advantage of the SPEs at all. The BLAS_NUMSPES environment variable does not have the least impact on performance, and the overall performance of SGEMV on a QS21 is 1/10 of a 2GHz Core2Duo no matter what matrix size I choose (100-10K square).
Are there any special steps necessary to make use of the SPEs? I am using libblas.so.1 (which also requires -lnuma) with header cblas.h.
[IBM SDK Service Administrator]: Are you saying the SPEs are not being used at all or just not efficiently? To confirm SPEs are being used, you can ls /spu while the program is running. There is of course overhead in starting up each SPE, but presumably increasing the size as you have should have overcome that.
Are you able to attach your code so I can try it out?
[ultimA]: With ls /spu, no SPU shows up during the calculations. Not included in the attachment, but the program is linked against libblas.so.1 from the Fedora 7 Cell SDK 3.0 in the blas-3.0-6.ppc.rpm package. The calculations take about 12-14 seconds with the command-line switches -m10000 -d0.001, which is far too long. power_cell (ZIP, 7.3KB)
[bug2app3r]: Can you please give me the complete set of inputs you have given to the SGEMV routine -- this will help me get a better idea about your problem.
[ultimA to bug2app3r]: Currently there are no explicit input sets, so I cannot provide these, but you shouldn't be needing one anyway because the program automatically generates its input at startup.
The application calculates an eigenvector of a matrix with the power-iteration algorithm. The matrix and the starting vector are generated randomly at runtime; this is why there are no input sets. The only trick the application does is to use 32-bit floats at the beginning and then change to doubles when the precision of float starts to get limiting.
There are two command line options:
- -m sets the matrix size and can be anything between 4 and what the machine can handle, e.g. -m1000 for a 1Kx1K matrix.
- -d is a floating point number and influences when to switch from singles to doubles. -d0.001 seems to be the fastest most times, but not always.
There are only two BLAS calls (cblas_sgemv) in the whole program; these are to be found in main.cpp on lines 119 and 131.
The arguments to the BLAS functions are pretty standard, e.g.:
cblas_sgemv(CblasRowMajor, CblasNoTrans, m, n, 1, sp_mat, n, sp_vec, 1, 0, sp_vec2, 1);
- m, n - matrix dimensions (n=m always)
- sp_mat, sp_vec - single prec. matrix and vector to multiply
- sp_vec2 - target (return value)
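For readers unfamiliar with power iteration, here is a minimal C sketch of the algorithm ultimA describes. It is not taken from his program: the function name, the scratch-buffer argument, and the choice to normalize by the largest component are all assumptions made for illustration, and the float-to-double switch controlled by -d is left out.

```c
#include <assert.h>
#include <stddef.h>

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Hypothetical sketch of power iteration on an n x n row-major matrix a.
 * v holds the start vector on entry and an approximation of the dominant
 * eigenvector on exit, scaled so its largest component has magnitude 1.
 * tmp is caller-provided scratch space of length n. */
void power_iteration(const double *a, double *v, double *tmp, int n, int iters)
{
    assert(n > 0 && iters >= 0);

    for (int k = 0; k < iters; ++k) {
        double max_abs = 0.0;

        /* tmp = a * v; this matrix-vector product is the step a BLAS
         * gemv routine would perform. */
        for (int i = 0; i < n; ++i) {
            double sum = 0.0;
            for (int j = 0; j < n; ++j)
                sum += a[(size_t)i * n + j] * v[j];
            tmp[i] = sum;
            if (abs_d(sum) > max_abs)
                max_abs = abs_d(sum);
        }

        if (max_abs == 0.0)
            return; /* a * v is zero; nothing left to normalize */

        /* Rescale so the iterate neither overflows nor underflows. */
        for (int i = 0; i < n; ++i)
            v[i] = tmp[i] / max_abs;
    }
}
```

Nearly all of the run time is spent in the matrix-vector product, which is why the speed of the SGEMV routine dominates the performance of a program like this.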
[bug2app3r]: When you are calling cblas_sgemv(), is the matrix being passed in row-major format in your application? This routine internally calls the FORTRAN interface sgemv_(). If the matrix input given to the cblas_sgemv() routine is in row-major format, it calls sgemv_() with the Transpose option (since the FORTRAN interface accepts matrices in column-major order).
In SDK 3.0
dgemv_(), neither routine supported the "Trans" option; whenever this option was requested, scalar code was called. This is the reason you were not getting good performance numbers and no effect with the
To get the optimized version of the sgemv routine (in SDK 3.0), call cblas_sgemv() with the matrix order as column major; otherwise use a newer release of the SDK.
[ultimA]: I tried your suggestion and it really did the trick! It works now.
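Calling the routine with column-major order only gives the same result if the matrix is actually stored that way. The sketch below shows one way to convert a row-major matrix, with a plain reference multiply to check the layout. Both function names are made up for illustration, and nothing here calls the SDK's BLAS library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy an m x n matrix from row-major storage
 * (element (i,j) at i*n + j) to column-major storage (at j*m + i). */
void row_to_col_major(const float *row_major, float *col_major, int m, int n)
{
    assert(m > 0 && n > 0);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            col_major[(size_t)j * m + i] = row_major[(size_t)i * n + j];
}

/* Plain reference y = A * x for a column-major A with leading dimension m.
 * It exists only to check the conversion; it is not an optimized routine. */
void gemv_col_major_ref(const float *a, const float *x, float *y, int m, int n)
{
    for (int i = 0; i < m; ++i)
        y[i] = 0.0f;
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            y[i] += a[(size_t)j * m + i] * x[j];
}
```

After the conversion, the equivalent call through the standard CBLAS interface would be cblas_sgemv(CblasColMajor, CblasNoTrans, m, n, 1.0f, col_major, m, x, 1, 0.0f, y, 1), with the leading dimension now m instead of n. For a square matrix that is generated at startup, as in ultimA's program, it is presumably cheaper to generate it directly in column-major order than to convert it.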
|Can you answer this one? Why are some fields not displayed correctly with |
Tom1979 wants to know why he gets a "nonresponding timeout" when trying to debug with the Cell simulator: Debugging using the local cell simulator with Eclipse IDE intermittently gets:
Thread 2 cell (Suspended: Breakpoint hit.) <Stack is not available: Target is not responding (timed out).>
The first breakpoint is always successful. After a successful resume from the first breakpoint the timeout intermittently occurs.
[IBM SDK Service Administrator]: I don't think the problem is within IDE, but with the remote debug session it's calling. The error is caused by either the application being terminated or the debugger session being dropped.
This can be caused by various factors:
- a problem on the user application;
- a problem while running gdbserver on the simulator; or
- even a problem with the version of the libraries used on the host system used to run Eclipse.
This can be caused by using the SDK on platforms that it does not officially support, because the versions of the libraries used by gdb on the host system may not match those of the libraries on the remote system where gdbserver is running. This can eventually cause gdbserver to stop running.
Since under the covers the IDE is doing a remote connection with ppu-gdbserver and ppu-gdb, you might have to do that manually to see why this disconnect is happening.
[Tom1979]: I'm using SDK 3.0 with x86 Fedora 7. Where can I find instructions for manually doing the remote connection with ppu-gdbserver and ppu-gdb?
[IBM SDK Service Administrator]: See Chapter 3. Debugging Cell BE applications in CBE_Programmers_Guide_v3.0.pdf.
|Threads worth pursuing|
Need help getting Eclipse to play nice with the SDK?
Chia-Heng would like to know if L2 cache data inside the PPE can be moved into the local store of an SPE: We would like to know whether data can be moved from the L2 cache inside the PPE into the local store of an SPE. We have seen this concept in the slides from the one-day IBM Cell Programming Workshop 2007 at Georgia Tech, "Cell Broadband Engine – An Introduction."
[iamrohitbanga]: I believe that virtual addressing does the trick. My guess: the EIB makes a request to the PPE with an address. If the data happens to be in the cache, the request is serviced from there. Note that if the processor is heavily loaded, the requested page may not be present in main storage either, so the request should be made to the PPU; if the page is not in DRAM, the kernel can then bring the data into memory.
Forum statistics for August 2008 Threads: 56 | Participants: 3,829 | Replies: 153 | % threads answered: 46%