Cell Broadband Engine processor DMA engines, Part 1: The little engines that move data

A single Cell Broadband Engine™ (Cell BE) processor consists of one PowerPC® core and eight SPEs, each with its own DMA engine. The DMA engines are a key component of the overall Cell Broadband Engine Architecture (CBEA) because they move data between the SPEs and the PowerPC core. Any operating system or application that wants to use the SPEs depends on the DMA engines to manage work flow on behalf of the SPEs.


Vaidyanathan Srinivasan (svaidyan@in.ibm.com), PowerPC Test Tool Designer, IBM India Pvt Ltd.

Vaidyanathan Srinivasan has a Masters degree in Electronics and Communication engineering from Bharathidasan University, India. He has been working in IBM Global Services (Software Labs), India since February 2000. He has developed device drivers, low-level stress tools and diagnostics software for various PowerPC processors and PowerPC-based systems. His areas of interest are processor architecture and system design. You can contact him at svaidyan@in.ibm.com.

Anand K. Santhanam, PowerPC Test Tool Designer, IBM India Pvt Ltd.

Anand K. Santhanam has a Masters Degree in Software Systems from BITS Pilani, India. He has been in IBM Global Services (Software Labs), India, since July 1999. He has worked with ARM-Linux developing device drivers and power management in embedded systems, PCI device drivers, and developing stress tools for various PowerPC processors. His areas of interest include operating systems and processor architecture. You can reach him at asanthan@in.ibm.com.

Madhavan Srinivasan (masriniv@in.ibm.com), PowerPC Test Tool Designer, IBM India Pvt Ltd.

Madhavan Srinivasan has a B.Eng. in Electrical and Electronics from Madras University, India. He has been in IBM Global Services (Software Labs), India, since November 2003. He has worked in developing Linux/AIX diagnostics and verification tools for floating point and system coherency units of various PowerPC server processors. His areas of interest include PowerPC architecture and operating systems. You can reach him at masriniv@in.ibm.com.

06 December 2005

The Cell Broadband Engine (Cell BE) processor is a revolutionary new architecture designed and developed jointly by Sony, IBM, and Toshiba. The Cell Broadband Engine Architecture (CBEA) is unique in the sense that it consists of a powerful SMT PowerPC core with special auxiliary SIMD processing units. Each of these units is called a synergistic processing element (SPE) and is capable of running compute-intensive applications. The SPEs of the Cell BE make it well-suited for graphics applications like gaming, image processing, and high-definition TVs.

The Cell BE processor can be considered a system-on-a-chip (SoC) with heterogeneous multiprocessors (the PPE, or PowerPC Processor Element, and the SPEs). The Cell BE processor adopts a more power-efficient design, and the crux of the architecture is to utilize various on-chip DMA engines to move data through the SPEs. The bandwidth of data that can be managed by the chip is extremely high by virtue of the direct memory access (DMA) engines and internal bus architecture. Applications and operating systems that run on the Cell BE processor have to effectively utilize the DMA engines to manage work flow to the SPEs. Each of the SPEs has its own DMA engine that can take multiple commands from the PowerPC core and the SPE.

Cell Broadband Engine Architecture overview

The Cell BE processor consists of three main functional components, namely the PowerPC core, SPEs, and Memory Flow Controllers (MFCs) with DMA engines for each of the SPEs. A single Cell BE processor consists of one PowerPC core and eight SPEs each having its own DMA engine.

The PowerPC core present in the system is a general-purpose 64-bit PowerPC processor that handles the Cell BE's general-purpose workload (or, the operating system) and manages special-purpose workloads for the SPEs. The PowerPC core is an SMT core that has two threads which execute instructions every alternate cycle. The core has VMX, floating point, integer, and load/store units along with a branch unit as part of its execution hardware. Being a standard 64-bit PowerPC ISA implementation, the PowerPC core can run existing 64-bit PowerPC binaries.

The SPEs are SIMD units capable of operating on 128-bit vectors consisting of four 32-bit operand types at a time. Each SPE has a large register file of 128x128-bit registers for operating on 128-bit vector data types and has an instruction set heavily biased towards vector computation. The SPEs have a fairly simple implementation to save power and silicon area. No register renaming hardware and no complex micro-coded instructions like lmw/stmw that operate on multiple registers in the PowerPC are required for the SPEs. The SPE compilers are expected to take advantage of the large register file that can help in code optimization to a great extent. Also, hint-for-branch instructions enable branch optimization in hardware. The SPE compilers are expected to use these hint-for-branch instructions to further improve branch performance. Part 2 of this series covers the SPE DMA and SPE context management.

The SPE instruction set consists of common integer operations, shuffles, rotates, single/double precision loads/stores, 128-bit loads/stores, and so on. Each SPE has access to a 256KB on-chip local store for its operation, which is locally addressable only by that particular SPE through load/store instructions. All of the instructions and data required by the code executing on the SPE must reside within its local store memory region. The SPEs cannot directly access other memory regions in the system except their own dedicated on-chip local store. If the SPE needs to access main memory or other memory regions within the system, then it needs to initiate DMA data transfer operations. DMAs can be used to effectively transfer data between the main memory and an SPE's local store, or between two different SPE local stores. Each SPE has its own DMA engine that can move data from its own local store to any other part of the system, including local stores belonging to other SPEs and IO devices.

Figure 1. Main system diagram

Components of the DMA engine


The Cell BE processor's DMA engine is part of the MFC unit and is primarily responsible for data movement within the Cell BE processor and external IO devices.

The dedicated DMA engine of each SPE can move streaming data in and out of the local store in parallel with program execution. Each MFC contains the DMA control unit and the Memory Management Unit (MMU) for that SPE. The DMA control unit processes queues of DMA commands. It consists of two DMA queues, one each for PPE-initiated DMA and SPE-initiated DMA. The MFC also contains an atomic unit (ATO), which performs atomic DMA updates for synchronization between software running on various SPEs and the PPE. Atomic DMA commands are similar to PowerPC locking primitives (lwarx/stwcx.).

The MFC DMA engine has an MMU for address translation and protection using the standard PowerPC segment table and page table model. A DMA transaction involves a data transfer between local store address and an effective address which can be translated to any system-wide real address using the MFC page table. The standard PowerPC page-table protection mechanism and exceptions logic apply here. Any protection violations and page faults are presented as external interrupts to the PPE.

The MFC's MMU unit consists of the following:

  • Segment Lookaside Buffer (SLB) (managed through memory mapped input output (MMIO) registers)
  • Translation Lookaside Buffers (TLBs) to cache the DMA page table entries (option for hardware reload from page table or software loading through MMIO registers)
  • Storage Descriptor Register (SDR) which contains the DMA page table pointer (standard hashed PowerPC page table format)

The architecture allows the PPE and all of the MFCs to share a common page table, which enables applications to use their effective addresses directly in DMA operations without any need to locate the real address pages.

The good, the BAT, and the TLB

Older 32-bit PowerPC and 64-bit POWER3™ and RS64 CPUs use segment registers and Block Address Translations (BATs) or page table for address translation mechanisms. Each segment register maps 256MB of the effective address space to the corresponding virtual address, and the virtual addresses are mapped to real addresses using BATs or page table or TLBs. The virtual address to real address lookup happens in parallel using BATs or page table. In other words, if there is a match in BATs, the address is translated using BATs; otherwise page table is used.

The Cell BE processor, on the other hand, is a 64-bit processor similar to the PowerPC 970 and uses neither BATs nor segment registers. It uses Segment Lookaside Buffers (SLBs) and page tables for its address translation mechanism. There are 64 entries in the Cell BE SLB, and they form part of the process context. SLB entries map effective addresses in the process address space to virtual addresses. Each SLB entry maps a 256MB effective address region, so 64 x 256MB = 16GB of a process's address space can be mapped at once. If a process address space is greater than 16GB, and if a particular effective address range of the currently running process is not mapped in the SLB, then DSFI/ISFI exceptions are generated. The OS should resolve this by replacing a suitable SLB entry with the correct one. In this way, an address space larger than 16GB can be effectively mapped using SLBs.
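The 16GB figure is simply 64 entries times 256MB per entry. The following minimal sketch works through that arithmetic and shows how an effective address falls into a 256MB segment. The names SLB_ENTRIES, SEGMENT_SHIFT, slb_reach_bytes, ea_segment, and segments_spanned are illustrative only and do not come from the CBEA specification.

```c
#include <assert.h>
#include <stdint.h>

#define SLB_ENTRIES   64u  /* entries in the Cell BE SLB (from the text) */
#define SEGMENT_SHIFT 28u  /* 2^28 bytes = 256MB per SLB entry */

/* Total effective address range the SLB can map at once, in bytes. */
uint64_t slb_reach_bytes(void)
{
    return (uint64_t)SLB_ENTRIES << SEGMENT_SHIFT;
}

/* Number of the 256MB segment that contains effective address ea. */
uint64_t ea_segment(uint64_t ea)
{
    return ea >> SEGMENT_SHIFT;
}

/* How many 256MB segments a region of `bytes` bytes starting at ea touches. */
uint64_t segments_spanned(uint64_t ea, uint64_t bytes)
{
    assert(bytes > 0);
    return ea_segment(ea + bytes - 1) - ea_segment(ea) + 1;
}
```

A working set that spans more than SLB_ENTRIES segments cannot be mapped all at once, which is the case where the OS has to replace SLB entries on a segment fault.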

Cell BE MMIO interface

The SPE's control registers, including the local store memory and DMA enqueue registers of each of the SPEs, are memory mapped in the system address space. The PPE core manages the SPEs and initiates DMA data transfers by programming these control registers. The PPE core has to use MMIO (cache inhibited) loads/stores to these DMA registers to initiate DMAs. The entire local store memory is also accessible through the system memory map. The PPE can load SPE program and data through this MMIO interface as well. However there will be a performance overhead in copying data using the PPE MMIO instead of using the DMA engines.

SPE channel interface

The primary interface between the MFC DMA engine and the SPEs is the set of SPE channels. SPEs have special instructions that can read or write to these channels. Channels are configured as read-only or write-only (in other words, a channel is analogous to a read or write pipe). Each channel has a count associated with it that determines the number of outstanding requests on it. In addition, the channels are configured to be blocking or non-blocking. A blocking channel makes the SPE stall when the SPE tries to read from a channel whose count is 0 (no data is available) or to write to a channel whose count is 0 (the receiving device or channel is full). As an aside, a non-blocking channel with a count of 0 will read the most recent contents again, as in the case of an MMIO register: the channel count is not updated, maintained, or used for a non-blocking channel. The SPE channels are of various types providing distinct functionalities like signal notification among SPEs, synchronization between the PowerPC core and SPEs using mailbox channels, status notification channels, DMA channels, and so on.
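The read-side behavior of blocking and non-blocking channels can be captured in a small software model. This is a sketch for illustration, not a hardware interface: struct channel, channel_deposit, and channel_read are hypothetical names, and the model assumes a single-entry blocking channel. Where real hardware would stall the SPE, the model returns false instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one read channel. */
struct channel {
    bool     blocking;  /* true: the reader stalls when count is 0 */
    uint32_t count;     /* values available to read (blocking channels only) */
    uint32_t value;     /* most recent contents */
};

/* The MFC side places a new value in the channel. */
void channel_deposit(struct channel *ch, uint32_t v)
{
    assert(ch != NULL);
    ch->value = v;
    if (ch->blocking)
        ch->count = 1;  /* single-entry channel in this model */
}

/*
 * The SPE side reads the channel. Returns true and stores the value in *out
 * when the read completes; returns false where real hardware would stall.
 */
bool channel_read(struct channel *ch, uint32_t *out)
{
    assert(ch != NULL && out != NULL);
    if (ch->blocking) {
        if (ch->count == 0)
            return false;  /* blocking channel with count 0: stall */
        ch->count--;
    }
    *out = ch->value;      /* non-blocking: latest contents, count unused */
    return true;
}
```

Reading the non-blocking channel twice in a row yields the same value both times, which mirrors the MMIO-register-like behavior described above.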

The SPE uses the DMA channels to initiate DMA transactions. There are individual channels for each of the DMA parameters like source and target addresses, size, and the direction of data transfer (command code). The SPE can initiate up to 16 DMA transfers in parallel using the DMA queue. The DMAs are processed out-of-order by the DMA engine, and the completion status of individual DMAs can be queried by tag. Cell BE applications are expected to utilize these DMA queues and pipeline the data flow, which enables extremely high throughput.

Figure 2. PPC/ SPE/ MFC Interface

MFC DMA commands

Channel numbers

The channel numbers from the SPE side are defined in the Cell Broadband Engine Architecture V1.0 specification. The MMIO register as defined in the system memory map provides an interface to initialize the channel count and value as part of the SPE's context management. The memory map is defined in the Cell Broadband Engine Architecture as an offset to BP_Base: memory offsets are hard-wired, but BP_Base is configurable at Cell BE processor pre-bootup / configuration time using the service processor.

Part 2 of this series covers complete SPU DMAs and MFC synchronization primitives -- and DMA channels. Until then, you can find a listing of the first 27 channels in section 2.12 of the Cell SPU C/C++ Language Extensions; Section 7 of the Cell Broadband Engine Architecture V1.0 specification discusses MFC commands (offsets are also discussed in CBEA V1.0, in Appendix A). See Resources for links to both of these.

MFC DMA commands in Cell BE are a subset of MFC commands and always operate from the perspective of the SPEs. A data transfer from the external memory to the SPE local store is called a DMA GET command, and a data transfer from the SPE local store to the external memory is called a DMA PUT command. The Cell BE processor supports many DMA commands, and all of them are variants of GET or PUT. Note that MFC synchronization commands like mfcsync and mfceieio are different from GET/PUT commands. They can be used between multiple GET and PUT DMA commands to enforce ordering of DMA transactions relative to each other, but not with respect to SPE loads/stores: the SPE sync and dsync instructions enforce ordering of loads/stores to local store, and DMA operations are independent of SPE loads/stores.
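Because the direction is always named from the SPE's point of view, it is easy to get GET and PUT backwards. The following sketch models the two directions as plain memory copies between a local store buffer and an external buffer. It is a software illustration only: model_dma_get and model_dma_put are hypothetical names and involve no MFC hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LS_SIZE (256u * 1024u)  /* 256KB local store per SPE (from the text) */

/* GET: external memory -> SPE local store at offset lsa. */
void model_dma_get(uint8_t *ls, uint32_t lsa, const uint8_t *ext, size_t size)
{
    assert(lsa <= LS_SIZE && size <= LS_SIZE - lsa);  /* stay inside local store */
    memcpy(ls + lsa, ext, size);
}

/* PUT: SPE local store at offset lsa -> external memory. */
void model_dma_put(const uint8_t *ls, uint32_t lsa, uint8_t *ext, size_t size)
{
    assert(lsa <= LS_SIZE && size <= LS_SIZE - lsa);  /* stay inside local store */
    memcpy(ext, ls + lsa, size);
}
```

In both calls the local store address is an offset within the SPE's 256KB region, while the external side stands in for the effective address.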

Some of the most commonly used DMA commands are listed below:

  • GET moves data from external memory to the SPE local store.
  • GETL moves data from external memory to the SPE local store using scatter-gather list.
  • GETS moves data from external memory to SPE local store and starts the SPE once DMA completes. This can be done only from the PPE core side.
  • PUT moves data from SPE local store to external memory.
  • PUTL moves data from SPE local store to external memory using scatter-gather list.
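  • PUTS moves data from SPE local store to external memory and starts the SPE once DMA completes. Like GETS, this can be done only from the PPE core side.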

DMA parameters

To initiate the DMA transfer either from the PPE side or from the SPE side, the application has to follow a well-defined enqueue logic, and program the MMIO DMA setup registers or SPE channels accordingly. The enqueue logic involves setting up the DMA parameters associated with the DMA command and programming them in a particular sequence. The parameters for DMA enqueue are the same for both the PPE and the SPE side, but the enqueue sequence is different. The different DMA parameters are as follows:

  • Command opcode that determines the direction of data flow.
  • Class ID determines the resource ID associated with the SPE.
  • Tag identifies the DMA or a group of DMAs. Any number of DMAs can be tagged with the same group. The tag is required for querying the completion status of the group.
  • Size of the DMA. This size can be 1, 2, 4, 8, or 16 bytes, or a multiple of 16 bytes up to 16KB.
  • LSA, or Local Store Address of the SPE. This can be the source or target depending on the DMA command (direction of data flow).
  • EAL, or Effective Address (Lower bits) of the external memory. This can be source or target. The effective address is translated to a real address by the MFC's MMU unit. If the DMA is done in real mode, the effective address is not translated and is floated on the bus as is.
  • EAH, or Effective Address (Higher bits) of the external memory.
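The size rule in the list above is easy to check in software before a DMA is enqueued. The following minimal sketch encodes it. The helper names dma_size_is_valid and dma_chunks_needed are illustrative, not part of any Cell BE API, and the chunking helper assumes the total is a multiple of 16 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DMA_MAX_SIZE (16u * 1024u)  /* 16KB upper limit per DMA (from the text) */

/* True if size is 1, 2, 4, 8, or a non-zero multiple of 16 up to 16KB. */
bool dma_size_is_valid(uint32_t size)
{
    switch (size) {
    case 1: case 2: case 4: case 8:
        return true;
    default:
        return size != 0 && size <= DMA_MAX_SIZE && (size % 16u) == 0;
    }
}

/* Number of DMA commands needed to move `total` bytes in 16KB pieces. */
uint32_t dma_chunks_needed(uint32_t total)
{
    assert(total % 16u == 0);  /* assumption: total is a multiple of 16 */
    return total / DMA_MAX_SIZE + (total % DMA_MAX_SIZE != 0 ? 1u : 0u);
}
```

A transfer larger than 16KB has to be split into several commands, which can share one tag so that their completion is queried as a group.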

System memory map

The Cell BE processor's system memory map consists of six distinct regions, namely SPE local store area, SPE privileged area, PPE core area, IIC (Interrupt controller area), PMD (Power Management and Debug) area, and Implementation Dependent Expansion Area (IDEA).

Furthermore, for each SPE, the memory mapped regions are divided into different groups such as problem state, privilege 2 area, and privilege 1 area, for appropriate mapping by the OS. In general, "problem state" refers to user mode or application environment, whereas "privilege state" refers to supervisory/OS (kernel) mode. The OS will normally map the problem state registers (which includes DMA registers) in the application space. The privilege 2 and privilege 1 areas are owned by the OS and are used to resolve DMA page faults. Thus, applications running on the PPE can directly program the DMA registers for the SPEs they own without any need for OS intervention or device driver. The OS owns the DMA page table and hence ensures protection from wrong DMA commands enqueued by the application.

The system memory map in Figure 3 shows an interleaving of privilege 2 areas with local store and problem state, while privilege 1 is outside of these interleaved areas. All the problem and privilege areas are physically located at different page boundaries, which enables the OS to maintain different page level translations for user and kernel. No address masking or address manipulation is required to prevent privilege violations.

The problem state registers include DMA, SPE run-control/status, and mailbox registers. This completely enables application-level control of the SPE tasks. The local store memory can also be mapped to application space by the OS so that SPE context can be completely managed at the application (user) level. The application can transfer code and data to the SPE using the local store memory mapping or through DMA operation. The run-control register starts and stops the SPEs. The SPE stop event can be polled from the application using SPE status register, or it can be configured to generate an external interrupt to the OS.

Figure 3. System memory map
System memory map

MFC registers in SPE privilege area 1

  • MFC_SR1 State register controls the translation mechanism enable/disable mode for DMA.
  • MFC_SDR register contains the page table address and size information for the MMU.
  • MFC Interrupt Mask registers control the interrupts generated by SPEs, for example: DMA completion interrupt, MFC translation related exceptions, SPE stop and signal interrupts.
  • MFC Interrupt Status registers provide the status of interrupts generated by SPEs.
  • MFC DAR and DSISR registers, along with other MMU control registers, for TLB reloading, invalidation, and management of other MMU exceptions.

MFC registers in SPE privilege area 2

  • MFC MMU control registers for SLB management.
  • SPE channel control registers for SPE channel initialization and configuration.
  • MFC control register for DMA context management, purging DMA queues.
  • SPE debug enable registers for single-stepping of SPE.

MFC registers in SPE problem area

  • MFC DMA setup and status registers. The application running in the PowerPC core can program these registers in non-privileged mode to initiate a DMA transaction.
  • PPC and SPE mailbox registers for synchronization.
  • SPE run control and status register is used to start and stop the SPEs.
  • SPE signal notification registers for synchronization among SPEs.
  • SPE next program counter register which gives the current instruction address of the SPE.

PPE side DMA interface

The previous sections showed the importance of MFC with respect to Cell BE DMAs and gave an insight into the overall system memory map of Cell Broadband Engine Architecture. The following sections now turn to a discussion of how an application running in the PPE uses the memory map and the MFC interface to enqueue the DMA and complete a DMA transaction.

PPE DMA enqueue logic

The PPE-side enqueue logic follows a sequence of MMIO writes to memory mapped MFC DMA setup registers and querying for the DMA completion status register. The architecture requires the user to follow the well-defined sequence to successfully enqueue a DMA.

The SPE problem state MMIO area for SPE 0 will be BP_Base + 0x0004_0000 (refer to the memory map). The OS is expected to create a cache-inhibited translation from the application effective address space to this real address. Using the memory mapped effective address, the application can enqueue DMAs as outlined in the sequence below:

  1. Write the SPE local store address (source/target of DMA) to the MFC local store address offset. This is offset 0x3004 in the SPE problem state area.
  2. Write the effective address (source/target of DMA) to the MFC effective address offset. This starts at offset 0x3008 in the SPE problem state area (Listing 1 writes the high part at 0x3008 and the low part at 0x300C).
  3. Write the DMA size and tag to the MFC DMA tag/size offset. This is offset 0x3010 in the SPE problem state area.
  4. Write the DMA class ID and command to the MFC DMA class/command offset. This is offset 0x3014 in the SPE problem state area.
  5. Read the DMA command status offset to complete the enqueue. This is offset 0x3014 in the SPE problem state area, the same offset as the class ID/command.
  6. DMA initiation starts as soon as Step 5 is complete.

Figure 4. DMA data flow
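Steps 3 and 4 each write two parameters in a single 32-bit word. The sketch below shows one way to build those words. The bit layout is an assumption inferred from Listing 1, which writes 0x40000 for a 4-byte DMA with tag 0 and 0x42 for class ID 0 with the GET command; check the CBEA specification before relying on it. The helper names dma_pack_size_tag and dma_pack_class_cmd are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: DMA size in the upper 16 bits, tag in the lower 16 bits. */
uint32_t dma_pack_size_tag(uint32_t size, uint32_t tag)
{
    assert(size <= 0xFFFFu && tag <= 0xFFFFu);
    return (size << 16) | tag;
}

/* Assumed layout: class ID in the upper 16 bits, command in the lower 16 bits. */
uint32_t dma_pack_class_cmd(uint32_t class_id, uint32_t cmd)
{
    assert(class_id <= 0xFFFFu && cmd <= 0xFFFFu);
    return (class_id << 16) | cmd;
}
```

With these helpers, dma_pack_size_tag(4, 0) produces the 0x40000 value that Listing 1 writes to the size/tag register.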

DMA completion

The application can use either polling or interrupts to determine DMA completion. For a DMA completion to generate an interrupt, the DMA group completion interrupt bit needs to be set by the OS in the MFC interrupt mask register.

Interrupts from the SPE are grouped into three classes, namely error, translation, and application events. Class 0 corresponds to SPE errors, DMA errors, and DMA alignment errors. Class 1 corresponds to DMA translation exceptions like MFC page faults and segment faults. Class 2 corresponds to application-level SPE control events like SPE stop and signal, DMA completion interrupt, and mailbox interrupts. Each of these interrupt classes has its associated mask and status registers.
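The grouping above can be written down as a simple lookup, which is handy when deciding which mask and status register pair an event belongs to. This is only a sketch of the classification described in the text. The enum spe_event and the function spe_interrupt_class are hypothetical names, not hardware bit definitions.

```c
#include <assert.h>

/* Events named in the text; the enumerator values are arbitrary. */
enum spe_event {
    SPE_ERROR,
    DMA_ERROR,
    DMA_ALIGNMENT_ERROR,
    MFC_PAGE_FAULT,
    MFC_SEGMENT_FAULT,
    SPE_STOP_AND_SIGNAL,
    DMA_COMPLETION,
    MAILBOX,
    SPE_EVENT_COUNT
};

/* Returns the interrupt class (0, 1, or 2) for an event, or -1 if unknown. */
int spe_interrupt_class(enum spe_event ev)
{
    assert((int)ev >= 0 && (int)ev < SPE_EVENT_COUNT);
    switch (ev) {
    case SPE_ERROR:
    case DMA_ERROR:
    case DMA_ALIGNMENT_ERROR:
        return 0;  /* class 0: errors */
    case MFC_PAGE_FAULT:
    case MFC_SEGMENT_FAULT:
        return 1;  /* class 1: translation exceptions */
    case SPE_STOP_AND_SIGNAL:
    case DMA_COMPLETION:
    case MAILBOX:
        return 2;  /* class 2: application-level events */
    default:
        return -1;
    }
}
```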

After the DMA is complete, the MFC DMA engine generates a Class 2 DMA completion interrupt as an external interrupt to the PPE core. The OS is expected to handle the interrupt by reading the Class 2 MFC interrupt status register and acknowledging the interrupt. The OS can notify the application through signals or other communication means.

Another mechanism for detecting DMA completion is for the application to poll for completion of DMAs based on tags. To enable polling, the application programs the DMA tag group mask register, which is part of the SPE problem state area, with the DMA tags of interest. The application then enqueues the DMA with the specific tag and polls the DMA tag group status register until that tag completes. The bit for a tag in the DMA tag group status register stays zero while DMAs with that tag are still in progress and is set once they have completed. More than one DMA tag can be polled for completion at the same time.

Given below is example code for enqueueing a PPE-side DMA between the PPE core and SPE 0. The code uses BP_BASE as the start of the memory map region.

Listing 1. ppe_dma.c
// BP_BASE is the start of the memory map region and must be defined
// for the target platform.

#define SPU_PRIV1(n)      (BP_BASE + 0x400000 + ((n) * 0x20000))
// SPE privilege 1 area (n = 0 for SPE 0)

#define SPU_LS_AREA(n)    (BP_BASE + 0x40000 * ((n) + 1))
// SPE local store (n = 0 for SPE 0)

#define SPU_PROB_AREA(n)  SPU_LS_AREA(n)
// SPE problem area (n = 0 for SPE 0)

#define DMA_LSA           0x3004   // DMA LSA offset
#define DMA_EA_HI         0x3008   // DMA effective address high offset
#define DMA_EA_LO         0x300C   // DMA effective address low offset
#define DMA_Size_Tag      0x3010   // DMA size / tag
#define DMA_Class_CMD     0x3014   // DMA class / command
#define DMA_TagQ          0x321C   // DMA query mask
#define DMA_Status        0x322C   // DMA query status

// Function to write 64 bits to a memory location.
// Uses a single std instruction so the store is 64 bits wide.
void write_64(unsigned long addr, unsigned int value)
{
    asm volatile("std %1,0(%0)" : : "b"(addr), "r"(value) : "memory");
}

// Function to write 32 bits to a memory location
void write_32(unsigned long addr, unsigned int value)
{
    *(volatile unsigned int *)addr = value;
}

// Function to read 32 bits from a memory location
unsigned int read_32(unsigned long addr)
{
    unsigned int value;

    value = *(volatile unsigned int *)addr;
    return value;
}

// This routine checks for DMA completion by polling
// for tag bits to be set.
// Tag used in this example is 0. So check for bit 0.
void dma_chk(unsigned long addr)
{
    unsigned int value = 0;

    do {
        value = *(volatile unsigned int *)addr;
    } while (!(value & 0x1));
}

// This routine enqueues a PPE-side DMA to SPU 0 using the DMA MMIO
// registers. The DMA command used is GET, and it performs a 4-byte DMA
// from memory location 0x100 to local store 0x0 of SPU 0.
int main(void)
{
    // Enable access to SPU 0 local store
    write_64(SPU_PRIV1(0), 0x1);
    // Write the local store address
    write_32(SPU_PROB_AREA(0) + DMA_LSA, 0x0);
    // Write the effective address high part
    write_32(SPU_PROB_AREA(0) + DMA_EA_HI, 0x0);
    // Write the effective address low part
    write_32(SPU_PROB_AREA(0) + DMA_EA_LO, 0x100);
    // DMA size is 4. Tag is 0
    write_32(SPU_PROB_AREA(0) + DMA_Size_Tag, 0x40000);
    // Command is GET. 0x42
    write_32(SPU_PROB_AREA(0) + DMA_Class_CMD, 0x42);
    // Read command register for DMA enqueue completion
    read_32(SPU_PROB_AREA(0) + DMA_Class_CMD);
    // Write DMA query mask register: mask value 0x1 selects tag 0 (bit 0)
    write_32(SPU_PROB_AREA(0) + DMA_TagQ, 0x1);
    // Check for DMA completion
    dma_chk(SPU_PROB_AREA(0) + DMA_Status);
    return 0;
}

See also the Cell Broadband Engine Architecture V1.0 specification (see Resources).
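Listing 1 polls a single tag. When several tags are outstanding, the same idea extends to a mask with one bit per tag. The sketch below follows the convention used by dma_chk in Listing 1, where a set status bit marks a completed tag. It assumes that tag t corresponds to bit t of a 32-bit register, which matches tag 0 and bit 0 in Listing 1 but should be checked against the CBEA specification. The helper names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Build a tag group mask; assumption: tag t maps to bit t. */
uint32_t dma_tag_mask(const unsigned *tags, size_t n)
{
    uint32_t mask = 0;
    for (size_t i = 0; i < n; i++) {
        assert(tags[i] < 32u);
        mask |= (uint32_t)1 << tags[i];
    }
    return mask;
}

/* True when every tag selected by mask reports completion in status. */
bool dma_tags_all_done(uint32_t status, uint32_t mask)
{
    return (status & mask) == mask;
}

/* True when at least one tag selected by mask reports completion. */
bool dma_tags_any_done(uint32_t status, uint32_t mask)
{
    return (status & mask) != 0;
}
```

The mask would be written to the DMA query mask register, and the value read back from the DMA query status register would be passed as status to either predicate inside the polling loop.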

DMA modes

As mentioned earlier, DMAs can happen in translation-enabled mode as well as in real mode. The mode of DMA is controlled by the MFC translation bit present in the MFC SR1 register. This register is part of the SPE privilege 1 memory map. DMA translation is always with respect to the effective address enqueued as part of the DMA. The local store address is not translated, as it is an absolute address that falls within the local store address region (256KB) of that particular SPE. If the MFC translation bit is zero, then DMA happens in real mode (in other words, the effective address enqueued as part of DMA is treated as a real address).

If the MFC translation bit is turned on, then DMA happens in virtual mode, and the effective address needs to be translated to a corresponding valid real address. The application needs to choose a valid effective address owned by it. The OS should create the correct translation in the MFC's SLB and page table.

The MFC provides the following memory mapped registers as part of its MMU unit to manage translation and exceptions. These registers are in the privilege 1 area, except for the SLB registers, which the register lists above place in the privilege 2 area:

  • MFC SDR is part of SPE privilege memory map and is the MFC page table pointer.
  • MFC DAR contains the faulting address in case of DSI exception.
  • MFC DSISR contains the reason code in case of an MFC translation exception, for example: DSI (page fault), DSFI (segment fault).
  • MFC TLB registers are used for TLB reloading and invalidation.
  • MFC SLB registers are used for SLB reloading and invalidation.

The MFC MMU unit can generate page fault and segment fault exceptions that are presented as external interrupts to the PPE. The exceptions for page faults and segment faults are enabled by setting the appropriate bits in the MFC class 1 interrupt mask register. Upon receiving the interrupt, the OS should handle it by creating the appropriate entry for translation in the SLB, the page table, or the TLB. The DMA operation is resumed after the fault is resolved. The OS also needs to reset the interrupt state, thereby re-enabling further exceptions of the same type. This is done by writing (resetting) the status bits for these exceptions in the MFC class 1 interrupt status register.

Summary and conclusion

The Cell BE processor has a very flexible, versatile, and powerful DMA engine to manage data flow for streaming multimedia applications. In a typical Cell BE-based application, streaming audio or video data is moved from one SPE local store to another, with each of the SPEs performing a distinct operation such as decryption or decompression. Performance numbers close to the theoretical limits can be achieved by making the SPEs and DMA engines work in parallel, feeding data to the SIMD execution units from the external IO devices and system memory. The role of the OS running on the PPE can be reduced to handling external interrupts caused by DMA page faults and DMA completion interrupts, while the intelligence of the data flow and the programming of the DMA engine lie with the applications running on the PPE and SPEs. The architecture does not require any abstraction layer or device driver in the OS to manage the DMA operations.


Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc.


