Part 1 in this series described the internals of the Cell Broadband Engine (Cell BE) architecture and the main components that provide the on-chip DMA capabilities between the PPE and the SPEs. While the previous article covered DMA initiated by the PPE's Memory Flow Controller (MFC), this article delves deeper into the other half of the on-chip DMA transactions, covering the Cell BE processor's SPE DMA architecture, channels, and DMAs from the SPE's perspective.
SPE channels are the primary interface between the MFC and the SPEs. A channel is akin to a one-way communication pipe: it is configured as either read-only or write-only. Each channel has an associated channel count, which indicates the number of outstanding operations that can be issued on the channel.
The channel count should be initialized by software when a new context is imposed on that particular SPE. Through the channel count, a channel behaves as blocking or non-blocking. Reading a blocking read channel whose channel count is 0 stalls the SPE until data becomes available on the channel or an event makes the channel count non-zero. Similarly, writing to a blocking write channel whose channel count is 0 stalls the SPE.
The SPEs stop executing when they are stalled. This design helps to save power while the processor has to wait for certain external events to occur. Traditional architectures would expect the software to implement a polling loop which keeps the processor unit executing instructions without making any forward progress. Good software design should minimize SPE stall cycles and try to keep the SPE as busy as possible. However, it is better to stall rather than execute redundant instruction sequences.
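The blocking rule described above can be modelled on an ordinary host machine. The following is a minimal sketch in plain C, not SPU code: the `channel_t` type and the `chan_*` helpers are hypothetical names invented for illustration, and a stall is represented by a `false` return value instead of actually halting execution.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side model of one SPE read channel (not real SPU code). */
typedef struct {
    uint32_t count;   /* channel count: reads that can be issued without stalling */
    uint32_t data;    /* value currently held by the channel */
} channel_t;

/* Models rchcnt: reports the channel count and never stalls. */
static uint32_t chan_read_count(const channel_t *ch)
{
    return ch->count;
}

/* Models an external event that makes data available on the channel. */
static void chan_post(channel_t *ch, uint32_t value)
{
    ch->data = value;
    ch->count = 1;
}

/* Models rdch: returns false where a real SPE would stall (count == 0);
 * otherwise stores the data in *out and consumes one count. */
static bool chan_read(channel_t *ch, uint32_t *out)
{
    assert(out != NULL);
    if (ch->count == 0)
        return false;          /* a real SPE would stop executing here */
    *out = ch->data;
    ch->count--;
    return true;
}
```

In this model, a caller that checks `chan_read_count()` before `chan_read()` is the software polling loop, while a caller that goes straight to `chan_read()` corresponds to accepting a stall.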
Each SPE has 128 channels, although not all of them are implemented and used. Please see the Cell Broadband Engine Architecture Specification (listed in Resources) for full details on channel implementation.
The SPE instruction set contains special instructions to use the channels. They are as follows:
- Read channel (rdch): This instruction reads the selected channel into a general-purpose SPE register. It can be used only on a channel configured as a read channel; using it on a write channel raises an invalid channel instruction interrupt. For example, rdch rt, ch moves data from the channel denoted by ch to the register denoted by rt.
- Write channel (wrch): This instruction transfers the contents of a general-purpose SPE register to the selected channel. It can be used only on a channel configured as a write channel; using it on a read channel raises an invalid channel instruction interrupt. For example, wrch ch, ra moves the data from register ra to channel ch.
- Read channel count (rchcnt): This instruction reads the channel count of the selected channel into an SPE register. It is useful for determining whether the channel is ready for a read or write with the rdch or wrch instruction without stalling the SPE. Read channel count can be combined with the read and write channel instructions to implement software polling loops and avoid stalling the SPE. For example, rchcnt rt, ch returns the channel count of channel ch in the register denoted by rt.
The C/C++ intrinsics for the Cell BE architecture include the following functions to read a channel, write a channel, and read a channel count:
- spu_readch() reads data from a channel. Example: x = spu_readch(1) reads data from channel 1 into x.
- spu_writech() writes data into a channel. Example: spu_writech(1, data) writes data to channel 1.
- spu_readchcnt() reads the channel count. Example: x = spu_readchcnt(1) reads the count of channel 1 into x.
Of the 128 channels on each SPE, those that are implemented are grouped according to the functionality they provide. The important functions include event management, signal notification, mailbox management, DMA enqueue, and DMA status check. This section looks at each of these groups in detail.
The MFC provides mailbox queues for the SPEs to interact with a PPE or any other external device. These mailboxes are used for communication between the SPE and other external devices: sending status, sending return codes, waiting for status, and so on. The SPE uses mailbox channels to access one end of a queue, and a PPE or other external device uses MMIO registers to access the other end. The three mailbox queues are the SPE outbound mailbox queue, the SPE outbound mailbox interrupt queue, and the SPE inbound mailbox queue. The SPE uses wrch on the outbound queues to send data and uses rdch to retrieve data from the inbound queue. The PPE or other external devices use the corresponding MMIO registers to read data from the outbound queues and write data to the inbound queue. The channels used for the mailbox facility are as follows:
- SPE write outbound mailbox channel: The SPE writes data to the mailbox queue by using wrch on this channel, which decrements the channel count by 1. This channel is write blocking: a write to the channel with a channel count of 0 (outbound mailbox full) causes the SPE to stall. To avoid stalling the SPE, use the rchcnt instruction to determine whether this channel can take further data.
- SPE write outbound mailbox interrupt channel: This is similar to the normal outbound mailbox channel, except that an interrupt is generated to the PPE any time this queue is not empty (that is, on a decrease in channel count). This is a class 2 interrupt from the corresponding SPE, which can be routed to the PPE.
- SPE read inbound mailbox channel: The SPE uses rdch on this channel to read data from the PPE or an external device. The PPE or external device uses the inbound mailbox MMIO register to place the data.
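The relationship between outbound mailbox writes and the channel count can be sketched as a small host-side model. This is an illustration only: `out_mbox_t`, the `mbox_*` helpers, and the queue depth `MBOX_DEPTH` are assumptions made for the example and do not reflect the real hardware depth or any SDK API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MBOX_DEPTH 4u   /* assumed depth, chosen only for illustration */

/* Hypothetical model of an SPE outbound mailbox queue. */
typedef struct {
    uint32_t slots[MBOX_DEPTH];
    uint32_t head;    /* index of the oldest unread entry */
    uint32_t used;    /* entries waiting to be read by the PPE */
} out_mbox_t;

/* Channel count of the write channel: number of free slots. */
static uint32_t mbox_write_count(const out_mbox_t *m)
{
    return MBOX_DEPTH - m->used;
}

/* SPE side: the rchcnt-then-wrch pattern. Returns false instead of
 * stalling when the mailbox is full (channel count of 0). */
static bool mbox_try_send(out_mbox_t *m, uint32_t value)
{
    if (mbox_write_count(m) == 0)
        return false;
    m->slots[(m->head + m->used) % MBOX_DEPTH] = value;
    m->used++;                     /* channel count drops by 1 */
    return true;
}

/* PPE side: models the MMIO read that drains one entry and frees a slot. */
static bool mbox_ppe_read(out_mbox_t *m, uint32_t *out)
{
    assert(out != NULL);
    if (m->used == 0)
        return false;
    *out = m->slots[m->head];
    m->head = (m->head + 1) % MBOX_DEPTH;
    m->used--;                     /* channel count rises by 1 */
    return true;
}
```

Checking the count before writing lets the SPE program do other work when the mailbox is full, at the cost of executing the check.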
SPE signal notification channels
External devices or processors use the SPE signal notification facility to send signals to the SPEs. They deliver a signal by performing MMIO writes to the signal notification registers. The SPE, for its part, reads the signal notification channels to identify the signal. Reading a channel with no signal pending (channel count of 0) stalls the SPE, and the SPE continues execution once a signal is presented. Unlike mailboxes, which can be used to exchange data, signals are bit masks corresponding to software events. Special MFC sndsig commands can be used to update signal notification channels among SPEs.
- SPE signal notification channel 1 and 2: A rdch instruction on these channels yields the 32-bit signal word. The read resets any bit that was set. You can configure SPEs to OR the bit fields of successive signal updates.
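The difference between OR-accumulating and overwriting signal updates, and the reset-on-read behaviour, can be illustrated with a host-side model. This is a sketch under stated assumptions: `signal_chan_t`, `signal_write()`, and `signal_read()` are hypothetical names, and the model ignores timing and hardware configuration registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one SPE signal notification channel. */
typedef struct {
    uint32_t word;     /* 32-bit signal word */
    uint32_t count;    /* 1 when a signal is pending, else 0 */
    bool or_mode;      /* true: OR successive updates; false: overwrite */
} signal_chan_t;

/* Models an external MMIO write (or an sndsig command) to the channel. */
static void signal_write(signal_chan_t *s, uint32_t bits)
{
    s->word = s->or_mode ? (s->word | bits) : bits;
    s->count = 1;
}

/* Models rdch: returns false where a real SPE would stall. A successful
 * read returns the signal word and resets every bit that was set. */
static bool signal_read(signal_chan_t *s, uint32_t *out)
{
    assert(out != NULL);
    if (s->count == 0)
        return false;
    *out = s->word;
    s->word = 0;
    s->count = 0;
    return true;
}
```

In OR mode, several senders can each own one bit of the word without losing updates between reads; in overwrite mode, only the most recent update survives.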
The SPE provides event management facilities through event channels that keep track of various hardware events enabled in the SPE write event mask channel. The SPE programs use these channels to find out the status of various events. The SPE program uses the SPE write event mask channel and enables the bit fields corresponding to the expected events. Once the events are enabled in the SPE write event mask, a read using the SPE event status channel indicates the status of each of the events. If none of the events have occurred, then reading the SPE event status channel will cause the SPE to stall. These are SPE hardware events, unlike mailbox and signal events which software generates.
After an event occurs, the software is responsible for acknowledging all reported events in one write operation to the SPE event acknowledgement channel. The software can then proceed to handle the events. If an event is not acknowledged, phantom events might result, because a read from the SPE event status channel could still show the event as pending. Unlike signal notification, reading the event status channel alone does not acknowledge the event. It is possible to avoid polling for events by configuring the SPE to generate an interrupt when an event or group of events occurs. These interrupts are presented to the SPE and are not routed to the PPE.
The following events can be monitored using the event status channel:
- SPE decrementer event: Triggered when the Most Significant Bit (MSB) of the decrementer count transitions from 0 to 1 (that is, when the value becomes negative).
- SPE inbound mailbox available event: Triggered when the SPE read inbound mailbox channel count changes from 0 to non-zero (that is, when data arrives in the inbound mailbox).
- SPE outbound mailbox available event: Triggered when the SPE write outbound mailbox channel count becomes greater than the pre-set program value.
- SPE signal notification available events: Triggered when an external device or processor writes into the signal notification channel(s).
- MFC tag group update event: Triggered when the MFC tag group status channel is updated, based on the tag status updates written into the tag status update request channel.
- Privilege attention event: Triggered by setting the privilege attention bit in the SPE privilege control register. Privileged software could use this feature to implement SPE debuggers.
The following channels are part of the SPE event management facility:
- SPE read event status channel: A read of this channel indicates the status of all the events that have been enabled in the SPE write event mask channel. A read from this channel with a channel count of 1 returns the status of all enabled events and sets the channel count to 0. This provides a wait-on-event facility: when the desired event occurs, the channel count becomes 1 and the SPE "unstalls."
- SPE write event mask channel: This channel contains bit fields for all the events that should affect the SPE event mechanism. If the bit corresponding to an event is enabled in this channel, subsequent reads of the SPE read event status channel reveal the actual status of that particular event.
- SPE read event mask channel: This is a means to read the current SPE event mask value.
- SPE write event acknowledgement channel: A write to this channel with a specific bit set clears the corresponding bit in the event status channel. This indicates that the event has been serviced by the software.
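The interaction of the four event channels can be sketched as a host-side model. This is a simplified illustration, not hardware-accurate code: `spe_events_t`, the `event_*` helpers, and the bit assignments `EV_MAILBOX` and `EV_TAG_GROUP` are hypothetical and do not match the architected bit positions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event bits for illustration only. */
#define EV_MAILBOX   (1u << 0)
#define EV_TAG_GROUP (1u << 1)

typedef struct {
    uint32_t pending;  /* events raised by hardware, not yet acknowledged */
    uint32_t mask;     /* events enabled through the write event mask channel */
    uint32_t count;    /* channel count of the read event status channel */
} spe_events_t;

/* Models a write to the SPE write event mask channel. */
static void event_write_mask(spe_events_t *e, uint32_t mask)
{
    e->mask = mask;
    e->count = (e->pending & e->mask) ? 1u : 0u;
}

/* Models hardware raising one or more events. */
static void event_raise(spe_events_t *e, uint32_t bits)
{
    e->pending |= bits;
    if (e->pending & e->mask)
        e->count = 1;              /* an enabled event "unstalls" the SPE */
}

/* Models rdch on the read event status channel: returns false where a
 * real SPE would stall; otherwise reports enabled events, count -> 0. */
static bool event_read_status(spe_events_t *e, uint32_t *status)
{
    assert(status != NULL);
    if (e->count == 0)
        return false;
    *status = e->pending & e->mask;
    e->count = 0;
    return true;
}

/* Models a write to the event acknowledgement channel. */
static void event_ack(spe_events_t *e, uint32_t bits)
{
    e->pending &= ~bits;
}
```

Note that `event_read_status()` leaves `pending` untouched: only `event_ack()` clears an event, which mirrors the rule that reading the status channel does not acknowledge anything.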
Refer to Table B-1 in the Cell BE architecture document for a complete list of implemented channels (see Resources).
The SPE DMAs can be used for transactions between main memory and SPE local store, or between any external device memory (for example, IO device) and SPE local store, or between any two SPE local stores. The SPEs initiate DMA transactions by using special DMA channels. The DMA channels are the primary interface from the SPE side to the MFC DMA engine. The SPE DMA enqueue logic is similar to the PPE DMA enqueue logic, and each of the DMA enqueue parameters has separate channels like source, destination, size, tag, and so on. The SPE can initiate up to 16 DMAs in parallel as the depth of the SPE DMA queue is 16.
Figure 1. SPE Channel interface DMA diagram
SPE DMAs are similar to PPE DMAs in that the DMAs are done from the perspective of the SPE. In other words, a DMA GET will transfer data from the external device to the SPE, and a DMA PUT will transfer data from the SPE to the external device. The SPE DMAs are classified into three types:
- Single element DMA: This is similar to the PPE-initiated DMA, where the SPE initiates a single element data transfer between its local store and an external entity. The external entity could be main memory, another SPE, or any other IO device memory.
- List DMAs: These are used to transfer a list of elements from the local store of the SPE to main memory, or from main memory to the local store. The DMA list command uses a list of effective addresses for transferring data to and from the local store. The effective address regions need not be contiguous in the physical address space, whereas the SPE local store region is contiguous. A single DMA list transfer can have up to 2048 elements. Only the SPE can initiate a List DMA transaction.
- Atomic DMAs: Atomic DMAs provide atomic update functionality from the SPE side. They mimic the behavior of the atomic instructions lwarx, stwcx, ldarx, and stdcx used on the PPE side. Atomic DMAs can be performed only on coherent and cacheable pages. Atomic DMAs, like List DMAs, can be initiated only from the SPE side.
The following is a list of various DMA commands from the SPE side:
- GET moves data from external memory to local store.
- PUT moves data from local store to external memory.
- GETL moves lists, rather than a single data item, from external memory to local store.
- PUTL moves lists, rather than a single data item, from local store to external memory.
- GETLLAR gets a lock line and creates a reservation. This is similar to the lwarx and ldarx operations on the PPE. The size of the transfer is one cache line. The command executes immediately and is not queued in the SPE DMA command queue.
- PUTLLC puts a lock line based on a reservation obtained using GETLLAR. This is similar to the stwcx and stdcx operations on the PPE. The size of the transfer is one cache line. The command executes immediately and is not queued in the SPE DMA command queue.
- PUTLLUC puts a lock line unconditionally, with or without a reservation for the lock line. The PUTLLUC operation is not dependent on a previous GETLLAR. The command executes immediately and is not queued in the SPE DMA command queue.
- PUTQLLUC puts a lock line unconditionally, but this command is placed in the SPE DMA command queue with other DMA commands.
The PUTLLUC or PUTQLLUC operations can clear any previous reservations made by GETLLAR.
The SPE has special DMA channels which are used for enqueuing DMAs. The
SPE application has to use wrch and rdch commands to enqueue DMAs. The
DMA channels are as follows:
- Command and class ID channel: This channel contains the DMA command and the class ID of the DMA to be enqueued. If the SPE DMA command queue is full, any write to the channel stalls the SPE. To avoid this, use rchcnt on this channel to determine the number of free slots in the SPE DMA queue before enqueuing a new DMA.
- Command tag ID channel: This channel contains the identifier, or tag, for the DMA command. Any number of DMA commands can be tagged with the same tag; together they are referred to as a tag group. The tag group can be used to query for the completion of the DMA.
- Transfer size or list size channel: This channel contains the size of the DMA transfer. The maximum transfer size is 16KB for a normal DMA; for List DMAs, the value is the size of the list.
- Effective address low or list address channel: This channel contains the lower 32 bits of the effective address for a normal DMA, or the pointer to the list element in the local store for a List DMA. If translation is enabled, the effective address is translated to the real address using the MFC segment table and page table.
- Effective address high channel: This channel contains the upper 32 bits of the DMA effective address. It is concatenated with the lower part of the effective address to form a 64-bit effective address. This channel can be set to zero, in which case the effective address is only 32 bits.
- Local address channel: This channel contains the local storage address of the DMA, which can be either the source or the target.
The DMA enqueue logic is similar to the PPE DMA enqueue logic as given
below. All the channels should be written from the SPE side using
successive wrch instructions:
- Write to the local storage address channel.
- Write the effective address high channel.
- Write the effective address low or list address channel.
- Write the transfer/list size channel.
- Write the command tag ID channel.
- Write the command and class ID channel.
Writing to the command and class ID channel causes the DMA to be enqueued in the command queue, so it must be the last write; the writes that precede it can be done in any order. The command and class ID channel has a maximum count equal to the number of slots in the DMA queue. Software has to initialize the channel count to the number of empty DMA slots before operating on this channel. A wrch to this channel with a count of 0 (that is, no free DMA queue slots) causes the SPE to stall.
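Because a single DMA command is limited to 16KB, a larger transfer has to be broken into several commands before it is enqueued. The following host-side sketch shows one way to prepare the per-command channel values; `dma_cmd_t` and `dma_split()` are hypothetical names, and the code only fills in descriptors, it does not touch any channel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_MAX_SIZE 0x4000u   /* 16KB per command, as stated in the text */

/* One value per DMA channel, in the order they would be written. */
typedef struct {
    uint32_t lsa;    /* local storage address */
    uint32_t eah;    /* effective address, upper 32 bits */
    uint32_t eal;    /* effective address, lower 32 bits */
    uint32_t size;   /* transfer size in bytes */
    uint32_t tag;    /* command tag ID */
    uint32_t cmd;    /* command (class ID assumed to be 0) */
} dma_cmd_t;

/* Splits one transfer into commands of at most DMA_MAX_SIZE bytes.
 * Returns the number of descriptors written to out (at most max_cmds). */
static size_t dma_split(uint32_t lsa, uint64_t ea, uint32_t total,
                        uint32_t tag, uint32_t cmd,
                        dma_cmd_t *out, size_t max_cmds)
{
    size_t n = 0;

    assert(out != NULL);
    while (total > 0 && n < max_cmds) {
        uint32_t chunk = total > DMA_MAX_SIZE ? DMA_MAX_SIZE : total;

        out[n].lsa  = lsa;
        out[n].eah  = (uint32_t)(ea >> 32);
        out[n].eal  = (uint32_t)(ea & 0xFFFFFFFFu);
        out[n].size = chunk;
        out[n].tag  = tag;     /* same tag: all chunks form one tag group */
        out[n].cmd  = cmd;

        lsa   += chunk;
        ea    += chunk;
        total -= chunk;
        n++;
    }
    return n;
}
```

Giving every chunk the same tag means completion of the whole transfer can later be checked with a single tag group query. Passing 16 as `max_cmds` matches the queue depth mentioned earlier, so the resulting commands fit in an empty queue.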
The DMA completion status is based on the command tag, which is a 5-bit identifier programmed in the command tag ID channel as part of the DMA enqueue process. Dedicated DMA status channels hold the state of the enqueued DMAs based on the tag. The SPE program has to operate on these status channels and query for the command tag or tag group to verify whether the DMA or group of DMAs has completed. The DMA status channels are given below:
- MFC write tag group query mask channel: This channel contains the bit mask of the tag groups that should be included in the DMA query. It has 32 bit fields representing 32 different tags or tag groups. This is a non-blocking channel.
- MFC read tag group status channel: This channel contains the status bits of all the tags or tag groups that have been enabled in the tag group query mask channel. A bit set to 1 in this channel indicates DMA completion for that particular tag or tag group. A bit value of 0 indicates that either the DMA is still in progress or the tag is not part of the status query. This is a read blocking channel, and the software should initialize the channel count to 1.
- MFC write tag status update request channel: This channel controls when the status in the MFC read tag group status channel is updated. The status can be:
  - Updated immediately
  - Updated when any tag or tag group DMA has been completed
  - Updated only when the DMAs corresponding to all the tags or tag groups enabled in the write tag group query mask channel have been completed
  A write operation to this channel has to be completed before any attempt is made to read from the tag group status channel; otherwise a deadlock results. This is a write blocking channel with a maximum count of 1.
Assuming a DMA with a tag value of 10 has been enqueued, the algorithm for DMA status checking could be as follows:
- Write the value 1 << 10 (0x400) to the MFC write tag group query mask channel. This sets bit position 10 and indicates that tag 10 is part of the DMA status query.
- Write the tag status update request channel with a suitable value that controls when the status update happens in the MFC read tag group status channel.
- Use the read channel instruction on the tag group status channel to obtain the DMA completion status for the tag. If bit 10 of the returned value is set (that is, value & 0x400 is non-zero), the DMA is complete.
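Because the query mask and status channels carry one bit per tag, tag 10 corresponds to the value 1 << 10 (0x400), not to the number 10. The helpers below sketch that arithmetic in plain C; `tag_mask()`, `tag_done()`, and `tags_all_done()` are hypothetical names used for illustration and are not part of the SDK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns the query-mask bit for a 5-bit tag ID (0..31). */
static uint32_t tag_mask(unsigned tag)
{
    assert(tag < 32);
    return 1u << tag;
}

/* True if the status word reports completion for the given tag. */
static bool tag_done(uint32_t status, unsigned tag)
{
    return (status & tag_mask(tag)) != 0;
}

/* True if every tag group selected in mask is reported complete. */
static bool tags_all_done(uint32_t status, uint32_t mask)
{
    return (status & mask) == mask;
}
```

For example, to wait on tags 5 and 10 together, the mask would be `tag_mask(5) | tag_mask(10)`, and `tags_all_done()` would be applied to the value read from the tag group status channel.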
Below is an example code snippet of SPE-initiated DMA in real mode using C/C++ intrinsics of Cell Broadband Engine SDK. The DMA command used is PUT, which moves the data from SPU local store to external memory (main memory or IO device memory). The local store memory or source memory is set at 0x0, and the destination memory is set at 0x2000.
Listing 1. Storing data to main memory
spe_dma.c
    #include <spu_intrinsics.h>
    #include <spu_internals.h>

    #define SPE_ADDR 0x0             // local store (source) address
    #define EA_ADDR  0x2000          // effective (destination) address
    #define DMA_TAG  5               // tag ID used for this transfer
    #define TAG_MASK (1 << DMA_TAG)  // one bit per tag in mask and status

    int spu_dma()
    {
        int status = 0;

        spu_writech(MFC_LSA, SPE_ADDR);   // Program the LSA channel
        spu_writech(MFC_EAH, 0x0);        // Program EAH channel; high address is 0
        spu_writech(MFC_EAL, EA_ADDR);    // Program the EAL. 0x2000.
        spu_writech(MFC_Size, 0x10);      // DMA of 16 bytes
        spu_writech(MFC_TagID, DMA_TAG);  // DMA tag of 5
        spu_writech(MFC_Cmd, 0x20);       // PUT. Move data from LSA to EA;
                                          // this write enqueues the command

        // Clear any pending tag status update request
        spu_writech(MFC_WrTagUpdate, 0);         // request an immediate update
        while (!spu_readchcnt(MFC_WrTagUpdate)); // read the tag update channel
                                                 // count until 1 is returned
        spu_readch(MFC_RdTagStat);               // read the status channel and
                                                 // discard the value

        // Now program the query mask and wait for the DMA status
        spu_writech(MFC_WrTagMask, TAG_MASK);    // select tag group 5 (bit 5)
        spu_writech(MFC_WrTagUpdate, 0x2);       // update status only when all
                                                 // selected tag groups complete
        while (!spu_readchcnt(MFC_WrTagUpdate)); // wait until the request
                                                 // has been accepted
        status = spu_readch(MFC_RdTagStat);      // blocks until status is updated

        if (status & TAG_MASK)
            return 0;    // DMA SUCCESS
        else
            return -1;   // DMA FAILURE
    }
As is the case with PPE DMA, SPE-initiated DMA can happen in translation mode also, and is controlled by the MFC translation bit in the MFC_SR1 register. When DMA happens with translation enabled, the effective address gets translated using the MFC SLB and page table. The local store address is not translated, and the mechanism of translation and handling of translation-related exceptions are identical to PPE-initiated DMAs.
Atomic DMAs are similar to the atomic operations in standard PowerPC®. SPE can reference memory outside of local store using DMAs, and these atomic DMA operations help SPEs to synchronize with other processing elements. The sync word or the lock word is generally a main memory location that is accessible to PPE using load/store instruction. The Cell Broadband Engine architecture allows lock implementation across PPE and SPE software.
The PPE executes a sequence of lwarx/stwcx instructions to atomically update a lock word in main memory, while the SPEs use two DMA commands, getllar and putllc, to load and update the lock word. If the PPE successfully executes stwcx between an SPE's getllar and putllc, then the putllc DMA command fails, and the SPE has to retry the getllar/putllc sequence. The getllar operation is very similar to the PPE lwarx instruction, except that getllar is a complete DMA GET command with an implied size equal to one cache line.
The lock word is thus transferred to local store with a reservation using getllar, loaded into an SPE register using an SPE load instruction, modified, and stored back to local store. Then the putllc DMA command tries to update the lock word in memory if this SPE still holds the reservation. This DMA operation can succeed or fail depending on the reservation held for that main memory address. SPE software reads the Atomic Command Status Channel (0x1B) to get the completion status of getllar and the success or failure of putllc. If the DMA failed, the SPE has to repeat the loop starting from the getllar DMA command. DMA tag group completion channels are not used to get status for atomic DMA commands such as getllar and putllc.
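The load-modify-conditionally-store retry loop has the same shape as a compare-and-swap loop on a conventional processor. The sketch below uses C11 atomics on a host machine as a stand-in: `atomic_compare_exchange_weak` is not getllar/putllc, and there is no reservation or cache-line transfer here, but the control flow is analogous. `lock_try_acquire()` and `lock_release()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Tries to take a lock word (0 = free, 1 = held). Returns true on success. */
static bool lock_try_acquire(_Atomic uint32_t *lock_word)
{
    assert(lock_word != NULL);

    /* Plays the role of getllar: fetch the current lock word. */
    uint32_t seen = atomic_load(lock_word);

    while (seen == 0) {
        /* Plays the role of putllc: the store succeeds only if the word
         * has not changed since it was fetched. On failure, seen is
         * refreshed with the current value and the loop decides again. */
        if (atomic_compare_exchange_weak(lock_word, &seen, 1u))
            return true;
    }
    return false;   /* another processing element holds the lock */
}

/* Releases the lock with an unconditional store, loosely comparable
 * to an unconditional put of the lock line. */
static void lock_release(_Atomic uint32_t *lock_word)
{
    atomic_store(lock_word, 0u);
}
```

A failed conditional store is not an error in this pattern; it only means another participant updated the word first, so the sequence is restarted from the load.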
Figure 2. Atomic DMA diagram
The SPE features special DMA commands that take a list of main memory effective addresses, much like the scatter-gather list used with storage (SCSI/IDE disk) controllers. The list of main memory effective addresses and sizes is maintained in local store.
Any DMA command will move data between local store and main memory. The main memory address is actually an effective address that gets translated by the MFC's MMU to a particular real address in the system. In List DMAs, a list of such effective addresses (EA) and size is generated in the local store and given as input to the DMA command.
The EA of the DMA command follows the list, while the local store address is contiguous. One contiguous block of local address can be transferred to a scattered list of EAs using a single PUT LIST DMA command. The same is true for GET LIST command where data is picked up from different effective addresses and transferred to local store as one continuous block. The list of EAs saved in local store memory can be reused by providing them to other DMA commands. In fact, the same list can be used in a GET LIST and a subsequent PUT LIST DMA command. The list is an array of transfer size and EA pairs, not a linked list kind of structure. Section 7.4 in the Cell BE architecture document explains the list structure in detail (see Resources).
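The array-of-pairs structure can be sketched in plain C. This is a simplified illustration, not the architected layout: `list_elem_t`, `list_build()`, and `list_size_bytes()` are hypothetical names, and real list elements carry additional control bits that are omitted here (see the architecture document for the exact format).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LIST_MAX_ELEMS 2048u   /* per-list limit quoted in the text */

/* Simplified list element: one (transfer size, EA low word) pair. */
typedef struct {
    uint32_t size;   /* bytes to transfer for this element */
    uint32_t eal;    /* lower 32 bits of the effective address */
} list_elem_t;

/* Fills list[0..n-1] from parallel arrays of addresses and sizes.
 * Returns the total number of bytes, which is also the size of the
 * contiguous local store region the transfer reads or writes. */
static uint32_t list_build(list_elem_t *list, size_t n,
                           const uint32_t *eals, const uint32_t *sizes)
{
    uint32_t total = 0;

    assert(n <= LIST_MAX_ELEMS);
    for (size_t i = 0; i < n; i++) {
        list[i].size = sizes[i];
        list[i].eal  = eals[i];
        total += sizes[i];
    }
    return total;
}

/* Size of the list itself in bytes: the value that would go into the
 * transfer size or list size channel for a list command. */
static uint32_t list_size_bytes(size_t n)
{
    return (uint32_t)(n * sizeof(list_elem_t));
}
```

Because the list is a plain array in local store, the same array can be handed to a GET LIST command and later to a PUT LIST command without being rebuilt.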
You can use list-based DMA techniques to collect data from different SPE local stores and place them in one block for further processing. List DMAs make this type of data aggregation faster by reducing the number of DMA enqueue commands and the amount of control code needed to check on the status of operations. More sophisticated features in List DMA allow you to make the DMA operation wait for certain events before proceeding to the next list element. Basically, the List DMA can wait for completion of operation on each SPU before collecting data. List DMA with stall and notify reduces the need for software polling loop and complex control logic needed to coordinate between SPEs.
Figure 3. List DMA diagram
SPE context consists of the following elements:
- Contents of local store
- Contents of all 128 registers
- State of the channels
- State of DMA commands in progress
The Cell Broadband Engine processor allows you to save and restore the context of an SPE at any point of time. Even the DMA operations in progress can be saved and restored or moved to another SPE. These features enable an OS to multitask SPE tasks on a given set of SPEs.
Although context switching an SPE is very time consuming and inefficient, the hardware provides the infrastructure, and the OS should use the feature as needed. Preemptive context switching of an SPE is complex, especially with DMA state management, and is beyond the scope of this article.
The OS running on the PPE has access to local store and channel states using MMIO registers. The register context of the SPE cannot be loaded directly from PPE. You would need a context-load code to be copied to local store along with the saved register context or initial register context. The context-load code should load each register with the value from the local store and then jump to SPE program start location.
Following are the essential steps to create a new SPE task:
- Set MFC SR1.
- Copy code and data into local store.
- Copy register context in predefined location in local store.
- Copy register context-load code into local store.
- Write NPC to point to register context-load code.
- Initialize channel count values.
- Initialize mailbox channels and any other channel data.
- Initialize the SPE's MMU, specifically MFC_SDR1 and SLB entries.
- Write correct MFC_SR1, to enable or disable MFC MMU and other configurations.
- Write RunControl register to start the SPE. The SPE will execute the register context-load code and then jump to the actual SPE program start address.
All data copying operations to local store can be done using PPE-side DMA queue for that SPE. This saves a lot of cycles for the PPE core. The OS running on the PPE should be ready to handle any external interrupts coming from the SPE. SPE interrupts can also be masked during context load, and only the required interrupts can be enabled during SPE execution. Generally the stop-and-signal interrupt would be enabled to get the attention of PPE once this SPE program finished execution.
To retrieve the context of the SPE, you can follow a subset of the above steps. The register context of the SPE program might be needed only for debugging. You can save the complete local store to main memory either through MMIO or DMA. A context-save program can be loaded into the local store and executed by setting the NPC and the run control register. This program moves the register values to local store, from which they can subsequently be moved out of the SPE.
Figure 4. SPE Context save/restore diagram
The Cell Broadband Engine processor is a unique processor, described as a heterogeneous system-on-a-chip. All eight SPEs work in sync with the PPE to carry out various tasks, and the on-chip DMA engines provide the means of data movement between the SPE and PPE tasks. This series of articles explored the various facets of the on-chip DMA and the power it brings for efficiently moving data in and out of SPEs, thereby forming the backbone of Cell Broadband Engine functionality.
The Cell BE architecture specification and the Cell Broadband Engine SDK C/C++ intrinsics provide comprehensive coverage of Cell Broadband Engine functionality and a suitable framework for developing applications using the IBM Full-System Simulator for the Cell Broadband Engine Architecture.
Resources
Learn
- Read Part 1 of this series.
- The Cell BE Programming Handbook provides information for developing applications, libraries, middleware, drivers, compilers, or operating systems for the Cell BE processor.
- See what Max Aguilar and Mark Nutter have to say on programming to the Cell BE processor (developerWorks, April 2006).
- The MFC is discussed in Section 6 of Cell Broadband Engine Architecture V1.0 (PDF format), while the first 27 DMA channels are listed in the SPU C/C++ Language Extensions V2.1 (PDF format), both of which you can find in the IBM Semiconductor Solutions Technical Library's Cell Broadband Engine documentation section.
- The Cell Broadband Engine project page at IBM Research offers a wealth of links, diagrams, information, and articles.
- Introduction to the Cell multiprocessor (IBM Journal of Research and Development, 2005) has a good discussion of the history of the Cell BE project.
- Find related articles, downloads, discussion forums, and more at the IBM developerWorks Cell Broadband Engine resource center: your definitive resource for all things Cell BE.
- Keep abreast of all the Cell BE news: subscribe to the Power Architecture Community Newsletter.
Get products and technologies
- Get Cell BE: Contact IBM E&TS.
- Get the alphaWorks Cell Broadband Engine downloads.
- See all Power Architecture-related downloads on one page.
Vaidyanathan Srinivasan has a Masters degree in Electronics and Communication engineering from Bharathidasan University, India. He has been working in IBM Global Services (Software Labs), India since February 2000. He has developed device drivers, low-level stress tools and diagnostics software for various PowerPC processors and PowerPC-based systems. His areas of interest are processor architecture and system design. You can contact him at svaidyan@in.ibm.com.
Anand K. Santhanam has a Masters Degree in Software Systems from BITS Pilani, India. He has been in IBM Global Services (Software Labs), India, since July 1999. He has worked with ARM-Linux developing device drivers and power management in embedded systems, PCI device drivers, and developing stress tools for various PowerPC processors. His areas of interest include operating systems and processor architecture. You can reach him at asanthan@in.ibm.com.
Madhavan Srinivasan has a B.Eng. in Electrical and Electronics from Madras University, India. He has been in IBM Global Services (Software Labs), India, since November 2003. He has worked in developing Linux/AIX diagnostics and verification tools for floating point and system coherency units of various PowerPC server processors. His areas of interest include PowerPC architecture and operating systems. You can reach him at masriniv@in.ibm.com.



