This article is adapted from the paper The Cell processor programming model presented at LinuxTag 2005; see the Resources section for more details.
The Cell processor from Sony, Toshiba, and IBM® is this year's most awaited newcomer on the CPU market. It promises unprecedented performance in the consumer and workstation market by employing a radically new architecture. Built around a 64-bit PowerPC® core, multiple independent vector processors called Synergistic Processing Units (SPUs) are combined on a single microprocessor.
Unlike existing SMP systems or multicore implementations of other processors, on the Cell, only the general purpose PowerPC core is able to run a generic operating system, while the SPUs are specialized to run computational tasks. Porting Linux™ to run on Cell's PowerPC core is a relatively easy task because of the similarities to existing platforms like IBM pSeries® or Apple Power Macintosh, but this does not give access to the enormous computing power of the SPUs.
Only the kernel can directly communicate with an SPU and therefore needs to abstract the hardware interface into system calls or device drivers. The most important functions of the user interface include loading a program binary into an SPU, transferring memory between an SPU program and a Linux userspace application, and synchronizing the execution. Another challenge is integrating SPU program execution into existing tools like GDB or OProfile.
A joint team of Sony, IBM, and Toshiba employees based in Austin, Texas, did the groundwork for the Linux kernel port. The current set of kernel patches is based on the latest 2.6.xx snapshot kernel and is maintained by the IBM LTC (Linux Technology Center) team in Böblingen, Germany. The team hopes to integrate most of this into the 2.6.13 kernel release so it will become part of upcoming distribution releases.
The Cell processor has a PowerPC Processing Element (PPE) that follows the 64-bit PowerPC AS architecture, which is also used by the PowerPC 970 CPU (also known as the G5) and all recent IBM POWER™ processors. Like the 970, it can use the VMX (AltiVec) vector instructions to parallelize arithmetic operations.
Moreover, the Cell processor can use simultaneous multithreading (SMT) like the IBM POWER5™ processor or Intel®'s Pentium 4 processors with Hyperthreading.
The IBM LTC has a standard Linux distribution running on the PPE and needs only a small number of kernel patches to add support for some of the hardware features that differ from existing target platforms. In particular, the Cell processor includes an interrupt controller and an IOMMU implementation, both of which are incompatible with those supported by older kernel versions.
The hardware we are running on at the LTC is a prototype of the Cell processor-based Blade, with two Cell processors running as a symmetric multiprocessing (SMP) system and, currently, 512MB of memory. It is designed to be used in an IBM BladeCenter™ chassis.
The integration of support for the PPE in one of the next kernel releases will enable the use of a single kernel binary for all current 64-bit PowerPC machines including Cell, Apple Power Mac, and IBM pSeries.
While no plans are in place to support 32-bit Linux kernels on Cell, it is possible to run both 32- and 64-bit distributions on it using the PowerPC 64 kernel with support for the ELF32 binary format. Note that all 32-bit PowerPC applications are expected to work without modifications.
Synergistic Processing Elements
The Synergistic Processing Elements (SPEs) are the most interesting feature of the Cell processor, as they are the source of its overwhelming processing power. A single chip contains eight SPEs, each with an SPU, a Memory Flow Controller (MFC), and 256KB of SRAM that is used as local store memory.
An SPU uses vector operations itself and can thereby execute up to eight floating point operations per clock cycle.
The Cell processor has three high-speed bus interfaces, one for memory and two for I/O or SMP connections. The memory interface connects XDRAM chips, currently the fastest available memory technology and substantially faster than current DDR or DDR2 interfaces.
Like the memory interface, the other two interfaces are also based on Rambus technology. One of them is used exclusively to connect I/O devices, typically a south bridge or north bridge chip for the FlexIO protocol. The other one can also be used for I/O, or alternatively as a coherent interface that connects multiple Cell processors into an SMP system.
An SPU resembles a cross between a simple CPU design and a digital signal processor. It can use the same instructions to do either 32-bit scalar or 128-bit vector processing. It has an 18-bit address space that accesses 256KB of local store, which is part of the chip itself. Neither a memory management unit nor an instruction or data cache is used. Instead, the SPU can access any 128-bit word in the local store at L1 cache speed.
The MFC is the main communication vehicle between the local store memory and the system memory. As mentioned before, there is one MFC in each SPE. It has an integrated memory management unit that is normally used to provide access to the address space of one process by using the same page table lookup as the PPE.
A DMA request always involves moving data between the SPE local store and a virtual address space on the PPE side. The types of DMA requests include aligned read and write operations as well as single word atomic updates that can be used -- for example -- to implement spin-locks that are shared between SPEs and user processes.
Both the SPE and the PPE can initiate DMA transfers. The PPE does this through memory-mapped register access from kernel mode, while the SPE writes to its DMA channels from code running on the SPU.
An MFC can have multiple concurrent DMA requests to one address space outstanding from both the PPE and the SPU. Each MFC can access a separate address space.
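As a rough illustration of the spin-lock use case mentioned above, the following sketch shows only the userspace half of such a lock, built from standard C11 atomics on a single lock word. It assumes the lock word sits in memory that the SPE can also reach through its MFC; the SPE half would use the MFC's atomic update requests and is not shown. The names lock_word_t, lock_try_acquire, lock_acquire, and lock_release are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock word shared with SPE code: 0 = free, 1 = held. */
typedef atomic_uint lock_word_t;

/* Try once to change the lock word from 0 to 1; true means we own the lock. */
bool lock_try_acquire(lock_word_t *lock)
{
    unsigned int expected = 0;

    assert(lock != NULL);
    return atomic_compare_exchange_strong_explicit(
        lock, &expected, 1u, memory_order_acquire, memory_order_relaxed);
}

/* Spin until the lock has been taken by this caller. */
void lock_acquire(lock_word_t *lock)
{
    while (!lock_try_acquire(lock)) {
        /* busy-wait; a real program might yield or back off here */
    }
}

/* Mark the lock word as free again so the other side can take it. */
void lock_release(lock_word_t *lock)
{
    assert(lock != NULL);
    atomic_store_explicit(lock, 0u, memory_order_release);
}
```

The acquire and release memory orderings are one reasonable choice for protecting data guarded by the lock; the article does not say which ordering guarantees the hardware's atomic updates provide.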
Programs running inside the SPU need to be rather simplistic and self-contained, so you don't need complicated access protection or different privilege modes in the SPU itself. As a consequence, the instruction set contains mostly arithmetic and branch operations but none that resemble kernel mode instructions of the PPE.
Also, exceptions resulting from executed code aren't reported to the SPU itself. If a serious error occurs, for example, an invalid opcode, the SPU is stopped and an interrupt is delivered to the PPE. Some of the common sources of exceptions are not even possible on the SPU. For example, there are no addressing exceptions since all pointers get aligned and truncated to the local store size when attempting a memory access.
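The pointer handling described above can be modeled with two bit masks. This is a small sketch, not a description of the hardware implementation: it assumes that "aligned" means rounding down to the 128-bit word size mentioned earlier, and the name ls_effective_address is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LS_SIZE      (256u * 1024u)  /* 256KB local store, 18-bit address space */
#define LS_ALIGNMENT 16u             /* one 128-bit word; assumed granularity */

static_assert((LS_SIZE & (LS_SIZE - 1u)) == 0u,
              "local store size must be a power of two for the mask to work");

/* Model of the local store address an SPU access would actually touch. */
uint32_t ls_effective_address(uint32_t pointer)
{
    uint32_t truncated = pointer & (LS_SIZE - 1u);  /* keep the low 18 bits */

    return truncated & ~(LS_ALIGNMENT - 1u);        /* round down to 16 bytes */
}
```

Under these assumptions every possible pointer value maps to some valid local store address, which is why no addressing exception can occur.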
The arithmetic vector operations are similar to the VMX operations of the PPE, and you can use them for highly optimized video, image processing, or scientific applications, among others.
The main communication method of the SPU with other parts of the Cell processor is defined by a number of "channels." Each channel has a predefined function and is either a read channel or a write channel.
For example, a mailbox mechanism is a basic communication method between the SPE and the PPE. The SPU has a read channel for receiving a single data word from the mailbox and two write channels for sending data words (more on this below). One of those write channels is defined to generate external interrupts on the CPU when data is available, and the other does not have a notification mechanism.
When an SPU tries to read from an empty mailbox, it will stop execution until some value is written to its memory-mapped register.
When the PPE wants to access the mailbox, it needs to have access to the memory-mapped register space, which is normally only available to kernel space. It has three mailbox registers for each SPU, and each of those accesses one of the three SPU mailbox channels.
The memory-mapped registers are used by the PPE to control certain aspects of an SPE, but are not accessible by the SPU code itself. For example, one PPE-side mailbox register appears as a write-only physical memory location. When the PPE writes a data word to that address, the SPU can read it from its corresponding mailbox read channel.
Other channels are used to access virtual memory associated with a user context on the PPE. By writing to DMA channels, the SPE can initiate a memory transfer, which is executed in parallel to both the SPU code execution and the PPE control flow. Only when a page fault is hit, for example, because the accessed page has been swapped out to disk, does the PPE receive an interrupt.
Some kernel code is needed to use the SPUs from a Linux application, since the controlling registers are only accessible from the PPE in privileged mode. The simplest way to give userspace applications access to hardware resources is through a character device driver controlled through read, write, and ioctl system calls.
This is suitable for many simple devices and at some point was used for testing the capabilities of the processor, but the approach has a number of problems. Most importantly, if each SPU is represented by a single character device, it becomes hard for an application to find an SPU that is not yet used by another. Moreover, that interface does not allow virtualization of the SPUs on a multiuser system in a sane way.
A different approach to using SPUs is to define a set of system calls. This makes it possible to use processes running on an SPU, rather than physical SPUs, as the underlying unit of the abstraction. SPU processes can be scheduled by the kernel, and all users can create them without interfering with each other. On the downside, this also means duplicating some infrastructure of the kernel as well as adding a potentially large number of new system calls to provide all necessary functionality.
For example, if a new thread ID space is managed next to the existing Linux process IDs, either all system calls dealing with PIDs (kill, getpriority, ptrace, and so on) need substantial changes, or new versions of those system calls need to be provided. Neither alternative is desirable from a cross-platform point of view.
The solution the LTC team finally chose is to create a virtual file system to externalize the SPUs. A number of similar file systems are already present, for example, procfs, sysfs, or mqueue. Unlike device-backed file systems, these do not need a partition to store data, but instead keep all their resources in RAM while using regular system calls like open, read, or getdents to communicate between userspace and the kernel functionality.
We chose to name the file system "spufs," and have it mounted on /spu by convention, although other mount points are possible.
Every directory in spufs refers to a logical SPU context. This SPU context is treated like a physical SPU, and the current implementation enforces a direct mapping between them. In the future, we are planning to change this so that more logical than physical SPU contexts can be present and have the kernel switch between them.
When the file system gets mounted, it is initially empty, and the only valid operation in its root directory is to create new directories with the mkdir system call.
Each context directory contains a fixed set of files that get created automatically when the context is established. The most important of these are:
- mem represents the local store memory of the SPE. Processes can open this file and do regular I/O system calls like read and write on it. In particular, it is possible to map the file into the process address space itself. By mapping multiple SPUs into a single process, it becomes possible to use DMA to transfer data between the local store of two SPUs directly.
- run starts execution of SPU code. When a process performs the SPU_RUN ioctl on this file, the process itself will be suspended and the SPU starts executing at the instruction pointed to by the ioctl argument. When the SPU code finishes or hits a critical error condition, it stops executing and the host process returns from the ioctl call.
- mbox, ibox, and wbox are abstractions for the userspace side of the mailbox. mbox and ibox are used to read data written to the SPU mailbox write channels and have slightly different semantics. wbox is used to write into the SPU mailbox read channel.
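Because the mailbox files are accessed with ordinary system calls, a userspace helper for them can be very small. The sketch below assumes that one mailbox data word is 32 bits wide and that a single read or write call transfers exactly one word; neither detail is spelled out in this article. The helper names spu_open_file, spu_wbox_write, and spu_mbox_read are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define SPU_PATH_MAX 4096  /* hypothetical path buffer size for this sketch */

/* Open <context_dir>/<file_name>; returns a file descriptor or -1. */
static int spu_open_file(const char *context_dir, const char *file_name, int flags)
{
    char path[SPU_PATH_MAX];
    int len = snprintf(path, sizeof(path), "%s/%s", context_dir, file_name);

    if (len < 0 || (size_t)len >= sizeof(path))
        return -1;
    return open(path, flags);
}

/* Send one word toward the SPU through the wbox file; 0 on success, -1 on error. */
int spu_wbox_write(const char *context_dir, uint32_t word)
{
    assert(context_dir != NULL);

    int fd = spu_open_file(context_dir, "wbox", O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t written = write(fd, &word, sizeof(word));
    close(fd);
    return written == (ssize_t)sizeof(word) ? 0 : -1;
}

/* Fetch one word written by the SPU from the mbox file; 0 on success, -1 on error. */
int spu_mbox_read(const char *context_dir, uint32_t *word)
{
    assert(context_dir != NULL && word != NULL);

    int fd = spu_open_file(context_dir, "mbox", O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t got = read(fd, word, sizeof(*word));
    close(fd);
    return got == (ssize_t)sizeof(*word) ? 0 : -1;
}
```

A long-running program would more likely keep the file descriptors open instead of reopening the files for every word; the helpers reopen them only to keep the sketch short.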
To use an SPU from a process, the user needs to have write permission to the mount point of the spufs and choose an unused name for the new SPU context. The mkdir system call creates the context, and the user process can subsequently open the associated files inside that directory.
Listing 1. Example of the contents of an SPU context
    $ mkdir /spu/my-spu-12345
    $ ls -lR /spu/
    /spu/:
    total 0
    drwxr-xr-x 2 arnd arnd   4096 Jun 17 21:00 my-spu-12345

    /spu/my-spu-12345:
    total 0
    -r--r----- 1 arnd arnd      0 Jun 17 21:01 ibox
    -r--r----- 1 arnd arnd      0 Jun 17 21:01 mbox
    -rw-rw---- 1 arnd arnd 262144 Jun 17 21:01 mem
    -rw-rw---- 1 arnd arnd   2048 Jun 17 21:01 regs
    -rw-rw---- 1 arnd arnd      0 Jun 17 21:01 run
    --w--w---- 1 arnd arnd      0 Jun 17 21:01 wbox
The program text and data segments now need to get written into the mem file, either using the write system call, or by mapping the file into the process address space. Normally, no relocations need to be applied, since SPU programs are statically linked.
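The write variant of that loading step might look like the sketch below. It assumes the caller has already extracted the loadable bytes of the SPU program into one flat buffer that belongs at local store address 0; parsing the ELF file is left out. The name spu_load_image is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* Copy `size` bytes of a flat program image to the start of
 * <context_dir>/mem. Returns 0 on success, -1 on error. */
int spu_load_image(const char *context_dir, const void *image, size_t size)
{
    char path[4096];
    const char *cursor = image;
    size_t remaining = size;

    assert(context_dir != NULL && (image != NULL || size == 0));

    int len = snprintf(path, sizeof(path), "%s/mem", context_dir);
    if (len < 0 || (size_t)len >= sizeof(path))
        return -1;

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    while (remaining > 0) {
        ssize_t written = write(fd, cursor, remaining);

        if (written < 0 && errno == EINTR)
            continue;                 /* interrupted before any data moved */
        if (written <= 0) {           /* real error, or no progress at all */
            close(fd);
            return -1;
        }
        cursor += written;            /* short write: continue with the rest */
        remaining -= (size_t)written;
    }
    return close(fd);
}
```

Mapping the file with mmap and copying with memcpy would be the alternative named above; it avoids the loop but needs the mapping length to fit the size of the mem file.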
Since every thread that executes the SPU_RUN ioctl blocks for the duration of the ioctl system call, it cannot interact with any other system resources at the same time, including other SPU contexts or files belonging to the context it is executing in. A single process can work with multiple SPU contexts, but to run on more than one SPU at a given time, the process needs to contain at least one thread for each running SPU context.
Likewise, if the program communicates with the SPU code using mailbox access, it needs to create a new thread, for example, by calling fork or pthread_create. One of the threads then calls the SPU_RUN ioctl on the run file, while the other runs in an event loop on the mailbox files and potentially other file descriptors.
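The event loop half of that arrangement could be built on poll, as in the sketch below. It is written against plain file descriptors and assumes that the mailbox file supports poll and delivers 32-bit words, neither of which is stated in this article. The names mailbox_event_loop, mailbox_handler, and sum_words are hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Callback invoked for every word received from the mailbox. */
typedef void (*mailbox_handler)(uint32_t word, void *user_data);

/* Read words from mailbox_fd until none arrives within timeout_ms.
 * Returns the number of words handled, or -1 on error. */
int mailbox_event_loop(int mailbox_fd, int timeout_ms,
                       mailbox_handler handler, void *user_data)
{
    int handled = 0;

    assert(handler != NULL);
    for (;;) {
        struct pollfd pfd = { .fd = mailbox_fd, .events = POLLIN };
        int ready = poll(&pfd, 1, timeout_ms);

        if (ready < 0)
            return -1;                /* poll failed, possibly due to a signal */
        if (ready == 0 || !(pfd.revents & POLLIN))
            return handled;           /* timeout, or the other end went away */

        uint32_t word;
        ssize_t got = read(mailbox_fd, &word, sizeof(word));
        if (got == 0)
            return handled;           /* end of file */
        if (got != (ssize_t)sizeof(word))
            return -1;
        handler(word, user_data);
        handled++;
    }
}

/* Example handler: accumulate all received words into a 64-bit sum. */
void sum_words(uint32_t word, void *user_data)
{
    *(uint64_t *)user_data += word;
}
```

In a real program, this loop would run in the thread that is not blocked in the SPU_RUN ioctl, and more pollfd entries could be added for the other file descriptors the program watches.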
A program that does not need to communicate with the SPU code while that is running can simply have a single thread of execution that either progresses on the userspace side or on the SPU.
When a program is done using the SPU context, it must close all file descriptors that are open to files inside that context directory, and then it can remove the directory itself using the rmdir system call. mkdir creates a fully populated directory, while rmdir removes it with all contained files.
A signal might need to be delivered to a thread that is executing SPU code. Often the signal is caused by the SPU code itself. When this happens, the SPU is stopped and the ioctl call is interrupted.
If this causes a signal handler to be called in userspace, a new stack frame is created in that thread and the ioctl arguments are updated to reflect the current instruction pointer before the signal handler executed. Normally, the signal handler will return and cause the ioctl to be entered at exactly the point where it was left so the SPU program can continue.
A portable library interface has been established on top of that low-level programming model. The library interface does not rely on implementation details of the file system, but can also be used on other operating systems that might have a different kernel interface.
Instead of providing an abstraction of logical SPUs, this interface is thread-oriented and behaves in a similar way to the pthread library. When an SPU thread gets created, the library will create a new thread that manages the SPU context asynchronously to the main thread.
When the SPU needs to make any standard library calls like printf or exit, it has to call back to the main thread. It does so by executing the special stop-and-signal assembly instruction with a standardized argument value. That value is returned from the ioctl call and the user thread must react to that. This usually means copying the arguments over from the SPE local store, running the respective library function in the user thread, and continuing by calling the ioctl again.
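The callback cycle described above is essentially a dispatch loop around the blocking run call. The sketch below shows that loop in isolation, with heavy simplification: the stop codes STOP_EXIT and STOP_PUTS, the argument layout (a string of at most 64 bytes at local store offset 0), and all function names are invented for illustration, since the standardized values are not listed in this article. The run call is passed in as a function pointer so that the example works without Cell hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

enum { STOP_EXIT = 1, STOP_PUTS = 2 };  /* hypothetical stop-and-signal values */
enum { ARG_OFFSET = 0, ARG_MAX = 64 };  /* hypothetical argument layout */

/* Stand-in for the blocking run call: returns the value of the next
 * stop-and-signal instruction that the SPU program executes. */
typedef int (*spu_run_fn)(void *context);

/* Serve SPU requests until the program exits. Text "printed" by the SPU is
 * appended to out. Returns the number of requests served, or -1 on error. */
int spu_service_loop(spu_run_fn run, void *context,
                     const unsigned char *local_store,
                     char *out, size_t out_size)
{
    size_t used = 0;
    int served = 0;

    assert(run != NULL && local_store != NULL && out != NULL);
    if (out_size == 0)
        return -1;
    out[0] = '\0';

    for (;;) {
        int stop_code = run(context);   /* blocks while the SPU is running */

        if (stop_code == STOP_EXIT)
            return served;
        if (stop_code != STOP_PUTS)
            return -1;                  /* unknown request */

        /* Copy the argument out of local store before using it. */
        char text[ARG_MAX + 1];
        memcpy(text, local_store + ARG_OFFSET, ARG_MAX);
        text[ARG_MAX] = '\0';

        /* Run the library function on behalf of the SPU program. */
        int n = snprintf(out + used, out_size - used, "%s", text);
        if (n < 0 || (size_t)n >= out_size - used)
            return -1;                  /* output buffer too small */
        used += (size_t)n;
        served++;                       /* then loop to re-enter the SPU */
    }
}

/* Scripted stand-in for an SPU: replays a fixed list of stop codes. */
struct scripted_spu {
    const int *stop_codes;
    int count;
    int next;
};

int scripted_run(void *context)
{
    struct scripted_spu *spu = context;

    return spu->next < spu->count ? spu->stop_codes[spu->next++] : STOP_EXIT;
}
```

With spufs, the run callback would wrap the SPU_RUN ioctl on the run file, and local_store would typically be a mapping of the mem file.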
Direct system calls from an SPU
We are thinking about adding a direct system call mechanism to the spufs, where the stop-and-signal instruction does not trap into userspace but instead causes the kernel to read the system call arguments from local store and enter the system call directly.
Because this happens inside the ioctl system call of a user process, any pointer arguments to the SPU system call are assumed to point into the address space of that process, and the SPU program needs to use DMA to access them.
Because the Cell's PPE uses the same instruction set as the PowerPC 970 CPU, it does not require changes to the compiler or the binutils. However, compiled code runs more efficiently on it when using a compiler that schedules instructions specifically for the CPU's pipeline structure. You can get a patch to GCC to add a pipeline definition for the PPE so you can create optimized code.
Since the SPU instruction set is not directly related to an existing CPU architecture, a new back end was written for both GCC and binutils. SPU code is compiled separately from the PPC code and gets loaded at run time.
Both the changes to GCC for PPE optimizations and the SPU back end are expected to be released as part of a Linux distribution.
The compiler introduces new intrinsics to do DMA transfers or other mailbox accesses, because these are not part of the C language standard. Also, new intrinsics and a new data type use the vector instructions for parallel calculation. This is similar to what is done for VMX/AltiVec or SSE vector instructions.
Future GCC versions should also be able to use autovectorization and create vector code automatically, but we do not have that yet. It is also likely that explicit vector instructions are normally more efficient than relying on the compiler to generate them.
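To give an impression of what an explicit vector data type looks like in C, the sketch below uses the generic vector extension that GCC already offers on other architectures. It does not use the SPU intrinsics or the SPU vector type, whose names are not given in this article; vec4f and saxpy_vec4 are made-up names, and the only parallel to the SPU is the 128-bit width holding four 32-bit floats.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Four 32-bit floats packed into one 128-bit value (GCC vector extension). */
typedef float vec4f __attribute__((vector_size(16)));

/* Compute result[i] = a * x[i] + y[i], four elements per step.
 * count must be a multiple of 4. */
void saxpy_vec4(float a, const float *x, const float *y,
                float *result, size_t count)
{
    const vec4f va = { a, a, a, a };

    assert(count % 4 == 0);
    for (size_t i = 0; i < count; i += 4) {
        vec4f vx, vy, vr;

        memcpy(&vx, x + i, sizeof(vx));      /* load without alignment demands */
        memcpy(&vy, y + i, sizeof(vy));
        vr = va * vx + vy;                   /* one expression, four lanes */
        memcpy(result + i, &vr, sizeof(vr));
    }
}
```

The arithmetic is written once and applies to all four lanes, which is the style of code that explicit vector types encourage; a scalar loop would have to rely on the compiler to discover the same parallelism.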
Debugging SPU programs creates a new set of problems. While it would be relatively easy to create a new GDB target for the SPU itself, most users need to debug interaction between PPE and SPE. The approach we are taking here is to enable GDB to support multiple targets in one binary, and change the PowerPC target to know about spufs. When the PowerPC debugger finds a program executing the SPU_RUN ioctl call, it switches to the SPU back-end code and works on the SPU context instead of the main program context.
While there currently is no working profiler for spufs programs, we have plans to extend OProfile to cover mixed PPE/SPE programs. This requires changes to the OProfile kernel code to sample the SPU instruction pointers regularly.
On the userspace side of OProfile, an extra level of indirection is needed to get from the spufs file to the actual ELF file that was loaded into it.
With the current state of Linux on Cell, you can write special-purpose applications running on the prototype board while using the full performance of the chip. While most applications do not immediately run better on Cell, there is a lot of potential to port performance-critical applications to use library code running on an SPU for better performance.
The base platform support is currently making its way into the mainline Linux kernel, and the SPU file system interface is on its way to being stabilized enough to be included in upcoming releases of the kernel and of major distributions.
- This article is adapted from The Cell Processor Programming Model, a paper presented at LinuxTag 2005 by Arnd Bergmann of IBM.
- A joint team of Sony, IBM, and Toshiba employees based in Austin, Texas, did the groundwork for the Linux kernel port. The partnership, known also as STI, has been working on Cell since 2001.
- The LTC team is currently working at IBM Deutschland Entwicklung GmbH, also known as the IBM Research Böblingen Lab.
- Preliminary support for Cell is being contributed to GNU Compiler Collection (GCC) and GNU Project Debugger (GDB).
- The kernel patches can be found in the ozlabs archive (see the Linux PowerPC64 Development Patches) and are tracked with the help of the Patchwork patch tracking system (learn more about Patchwork).
- As well, you can use OProfile with the base Linux code for Cell.
- Sony recently published a paper on Programming Cell, an outstanding resource that presents some alternative models for programming Cell. In particular, slide 24, Programming models, offers a good overview that is explored in subsequent slides.
- The IBM Microelectronics Technical Library has a new section devoted to Cell. See also the excellent content up at IBM Research's Cell pages.
- Have experience you'd be willing to share with Power Architecture zone readers? Article submissions on all aspects of Power Architecture technology from authors inside and outside IBM are welcomed. Check out the Power Architecture author FAQ to learn more.
- Have a question or comment on this story, or on Power Architecture technology in general? Post it in the Power Architecture technical forum or send in a letter to the editors.
- The Power Architecture Community Newsletter includes full-length articles as well as recent news about members of the Power Architecture community and upcoming events of interest. Learn more about the Power Architecture Community Newsletter and how to contribute to it. Subscription is free.
- All things Power-related are chronicled in the developerWorks Power Architecture editors' blog, which is just one of many developerWorks blogs.
- Find more articles and resources on Power Architecture technology and all things related in the developerWorks Power Architecture technology zone.
- Download an IBM PowerPC 405 Evaluation Kit to demo a SoC in a simulated environment, or just to explore the fully licensed version of Power Architecture technology. This and other fine Power Architecture-related downloads are listed in the developerWorks Power Architecture technology zone's downloads section.
Arnd Bergmann has been hacking the Linux kernel and a number of other open source packages since 1998. In his current project for the IBM Linux Technology Center, he is responsible for the kernel in the first Cell processor-based workstation computer, which premieres at LinuxTag 2005. You can reach him at arndb@de.ibm.com.