Level: Intermediate
Arnd Bergmann (arnd@arndb.de), Kernel Hacker, Linux on Cell kernel maintainer, IBM Deutschland Entwicklung GmbH
25 Jun 2005
Base platform support for Linux on the Cell has been established and is currently on its way into the mainstream Linux kernel tree. Read about the Cell's unique architecture and the SPU file system interface that allows Linux to run on it.
This article is adapted from the paper The Cell processor programming
model presented at LinuxTag 2005; see the Resources section for more details.
The Cell processor from Sony, Toshiba, and IBM® is this year's most awaited
newcomer on the CPU market. It promises unprecedented performance in the
consumer and workstation market by employing a radically new architecture.
It combines a 64-bit PowerPC® core with multiple independent vector processors
called Synergistic Processing Units (SPUs) on a single
microprocessor.
Unlike existing SMP systems or multicore implementations of other processors, on the Cell, only the general purpose PowerPC core is able to run a generic operating system, while the SPUs
are specialized to run computational tasks. Porting Linux™ to run
on Cell's PowerPC core is a relatively easy task because of the
similarities to existing platforms like IBM pSeries® or Apple Power
Macintosh, but this does not give access to the enormous computing power
of the SPUs.
Only the kernel can directly communicate with an SPU and therefore
needs to abstract the hardware interface into system calls or device
drivers. The most important functions of the user interface include
loading a program binary into an SPU, transferring memory between an SPU
program and a Linux userspace application, and synchronizing the execution.
Another challenge is integrating SPU program execution into
existing tools like GDB or OProfile.
A joint team of
Sony, IBM, and Toshiba employees based in Austin, Texas, did the groundwork for the Linux kernel port. The current set of
kernel patches is based on the latest 2.6.xx snapshot kernel and is
maintained by the IBM LTC (Linux Technology Center) team in Böblingen,
Germany. The team hopes to integrate most of this into the 2.6.13 kernel release
so it will become part of upcoming distribution releases.
The Cell processor
PowerPC Processing Element
The Cell processor has a PowerPC Processing Element (PPE) that follows
the 64-bit PowerPC AS architecture, which is also used by the PowerPC 970 CPU
(also known as the G5) and all recent IBM POWER™ processors. Like the 970, it can use
the VMX (AltiVec) vector instructions to parallelize arithmetic
operations.
Moreover, the Cell processor can use simultaneous multithreading
(SMT) like the IBM POWER5™ processor or Intel®'s Pentium 4 processors with
Hyperthreading.
The IBM LTC has a standard Linux distribution
running on the PPE and needs only a small number of kernel patches to add
support for some of the hardware features that differ from existing target
platforms. In particular, the Cell processor includes an interrupt
controller and an IOMMU implementation, both of which are incompatible
with those supported by older kernel versions.
The hardware we are running on at the LTC is a prototype of the Cell processor-based Blade,
with two Cell processors running as a symmetric multiprocessing (SMP)
system and, currently, 512MB of memory. It is designed to be used in an
IBM BladeCenter™ chassis.
The integration of support for the PPE
in one of the next kernel releases will enable the use of a single
kernel binary for all current 64-bit PowerPC machines including Cell,
Apple Power Mac, and IBM pSeries.
While no plans are in place to support 32-bit Linux kernels on Cell, it is
possible to run both 32- and 64-bit distributions on it using the PowerPC 64
kernel with support for the ELF32 binary format. Note that all 32-bit
PowerPC applications are expected to work without modifications.
Synergistic Processing Elements
The Synergistic Processing Elements (SPEs) are the most interesting
feature of the Cell processor, as they are the source of its overwhelming
processing power. A single chip contains eight SPEs, each with an SPU, a Memory Flow Controller (MFC), and
256KB of SRAM that is used as local store memory.
An SPU itself uses vector operations and can thereby execute up to
eight floating point operations per clock cycle.
Bus interfaces
The Cell processor has three high-speed bus interfaces: one for memory
and two for I/O or SMP connections. The memory interface connects to XDRAM chips, currently the fastest available memory
technology and substantially faster than current DDR or DDR2 interfaces.
Like the memory interface, the other two interfaces are also based on
Rambus technology. One of them is used exclusively to connect I/O devices,
typically a south bridge or north bridge chip for the FlexIO protocol. The
other one can also be used for I/O, or alternatively as a coherent
interface to connect multiple Cell processors into an SMP system.
Basic SPU design
An SPU resembles a cross between a simple CPU design and a digital signal
processor. It can use the same instructions to do either 32-bit scalar or
128-bit vector processing. It has an 18-bit address space that accesses
256KB of local store that is part of the chip itself. Neither a memory
management unit nor an instruction or data cache is used. Instead, the
SPU can access any 128-bit word in the local store at L1 cache
speed.
Memory Flow Controller
The MFC is the main communication vehicle between the
local store memory and the system memory. As mentioned before, there is
one MFC in each SPE. It has an integrated memory management unit that is
normally used to provide access to the address space of one process by
using the same page table lookup as the PPE.
A DMA request always involves moving data between the SPE local store and
a virtual address space on the PPE side. The types of DMA requests include
aligned read and write operations as well as single word atomic updates
that can be used, for example, to implement spin-locks that are shared between SPEs
and user processes.
Both the SPE and the PPE can initiate DMA transfers. The PPE does this
through memory-mapped register access from kernel mode, while the SPE
writes to its DMA channels from code running on the SPU.
An MFC can have multiple concurrent DMA requests to one address space
outstanding from both the PPE and the SPU. Each MFC can access a separate
address space.
Instruction set
Programs running inside the SPU are expected to be rather simple and
self-contained, so the SPU itself needs neither complicated access protection nor
different privilege modes. As a consequence, the
instruction set contains mostly arithmetic and branch operations but none
that resemble kernel mode instructions of the PPE.
Also, exceptions resulting from executed code aren't reported to the SPU
itself. If a serious error occurs, for example, an invalid opcode, the SPU is
stopped and an interrupt is delivered to the PPE. Some of the common
sources of exceptions are not even possible on the SPU. For example, there are no
addressing exceptions since all pointers get aligned and truncated to the
local store size when attempting a memory access.
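The effect of aligning and truncating pointers can be illustrated with a little arithmetic. The following is a Python model, not SPU code. The function name is made up for this illustration, and the 16-byte alignment is an assumption derived from the 128-bit local store words mentioned earlier.

```python
LOCAL_STORE_SIZE = 256 * 1024   # 256KB of local store, that is 2**18 bytes
QUADWORD_SIZE = 16              # one 128-bit word (assumed access granularity)


def effective_local_store_address(pointer: int) -> int:
    """Model how a pointer is truncated and aligned before a memory access."""
    truncated = pointer & (LOCAL_STORE_SIZE - 1)   # keep only the low 18 bits
    return truncated & ~(QUADWORD_SIZE - 1)        # round down to a 16-byte boundary


if __name__ == "__main__":
    for pointer in (0x0001F, 0x3FFFF, 0x40000, 0x12345678):
        print(hex(pointer), "->", hex(effective_local_store_address(pointer)))
```

In this model every possible pointer value maps to some valid location in the local store, which is why an addressing exception can never occur.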
The arithmetic vector operations are similar to the VMX operations of the
PPE, and you can use them for highly optimized video, image processing, or scientific applications, among others.
The main communication method of the SPU with other parts of the Cell
processor is defined by a number of "channels." Each channel has a
predefined function and is either a read channel or a write channel.
For example, a mailbox mechanism is a basic communication
method between the SPE and the PPE. The SPU has a read channel for
receiving a single data word from the mailbox and two write channels for
sending data words (more on this below). One of those write
channels is defined to generate an external interrupt on the PPE when data
is available, and the other does not have a notification mechanism.
When an SPU tries to read from an empty mailbox, it will stop execution
until some value is written to its memory-mapped register.
When the PPE wants to access the mailbox, it needs access to the
memory-mapped register space, which is normally only available to kernel
space. The PPE has three mailbox registers for each SPU, each of which accesses
one of the three SPU mailbox channels.
The memory-mapped registers are used by the PPE to control certain
aspects of an SPE, but are not accessible by the SPU code itself. For
example, one PPE-side mailbox register appears as a write-only physical
memory location. When the PPE writes a data word to that address, the SPU
can read it from the corresponding mailbox read channel.
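The blocking behavior of the mailbox read channel can be pictured with a toy model. This is a sketch in Python using a bounded queue, not a description of the hardware: the class and method names are hypothetical, the 32-bit word width and the depth of one entry are assumptions of the model, and what happens when the PPE writes to a full mailbox is not modelled.

```python
import queue
import threading


class MailboxModel:
    """Toy model of the mailbox that the PPE writes and the SPU reads."""

    def __init__(self, depth: int = 1):
        # depth is an assumption of this model, not a hardware figure
        self._entries = queue.Queue(maxsize=depth)

    def ppe_write(self, word: int) -> None:
        """PPE side: store one data word through the memory-mapped register."""
        self._entries.put(word & 0xFFFFFFFF)   # assumed 32-bit data word

    def spu_read(self) -> int:
        """SPU side: read channel; blocks while the mailbox is empty."""
        return self._entries.get()


def demo() -> int:
    mailbox = MailboxModel()
    received = []
    spu = threading.Thread(target=lambda: received.append(mailbox.spu_read()))
    spu.start()                  # the modelled SPU stalls on the empty mailbox
    mailbox.ppe_write(0x1234)    # the modelled PPE write lets it continue
    spu.join()
    return received[0]


if __name__ == "__main__":
    print(hex(demo()))
```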
Other channels are used to access virtual memory associated with a user
context on the PPE. By writing to DMA channels, the SPE can initiate a
memory transfer, which is executed in parallel to both the SPU code
execution and the PPE control flow. Only when a page fault is hit, for example,
because the accessed page has been swapped out to disk, does the PPE
receive an interrupt.
Possible programming models
Character devices
Some kernel code is
needed to use the SPUs from a Linux application, since the controlling registers are only accessible from the PPE
in privileged mode. The simplest way to give userspace applications access
to hardware resources is through a character device driver controlled
through read, write, and ioctl system calls.
This is suitable for many simple devices and at some point was used for
testing the capabilities of the processor, but the approach has a number
of problems. Most importantly, if each SPU is represented by a single
character device, it becomes hard for an application to find an SPU that
is not yet used by another. Moreover, that interface does not allow
virtualization of the SPUs on a multiuser system in a sane way.
System calls
A different approach to using SPUs is to define a set of system calls.
This makes it possible to use processes running on an SPU, rather than physical SPUs, as the underlying unit of
the abstraction. SPU processes can be
scheduled by the kernel, and all users can create them without interfering
with each other. On the downside, this also means duplicating some
infrastructure of the kernel as well as adding a potentially large number
of new system calls to provide all necessary functionality.
For example, if a new thread ID space is managed alongside the existing Linux
process IDs, all system calls dealing
with PIDs (kill, getpriority, ptrace, and so on) need either substantial changes
or new versions.
Neither alternative is desirable from a cross-platform point of view.
The SPU file system
Virtual file systems
The solution the LTC team finally chose is to create a virtual file system to
externalize the SPUs. A number of similar file systems are already present, for example, procfs, sysfs, or mqueue. Unlike device-backed file systems,
these do not need a partition to store data, but instead keep all their
resources in RAM while using regular system calls like open, read, or getdents to communicate between userspace and the
kernel functionality.
We chose to name the file system "spufs," and have it mounted on /spu by
convention, although other mount points are possible.
Mapping of hardware resources
Every directory in spufs refers to a logical SPU context. This SPU
context is treated like a physical SPU, and the current implementation
enforces a direct mapping between them. In the future, we are planning to
change this so that more logical SPU contexts than physical SPUs can be present,
with the kernel switching between them.
When the file system gets mounted, it is initially empty, and the only
valid operation in its root directory is to create new directories with
the mkdir system call.
Each context directory contains a fixed set of files that get created
automatically when the context is established. The most important of these are:
- mem represents the local store memory of the SPE. Processes can open this
file and do regular I/O system calls like read and write on it. In
particular, it is possible to map the file into the process address space
itself. By mapping the local stores of multiple SPUs into a single process, it becomes
possible to use DMA to transfer data between the local stores of two SPUs
directly.
- run starts execution of SPU code. When a process performs the
SPU_RUN ioctl on this file, the process itself is suspended and the
SPU starts executing at the instruction pointed to by the ioctl argument.
When the SPU code finishes or hits a critical error condition, it stops
executing and the host process returns from the ioctl call.
- mbox, ibox, and wbox are abstractions for the userspace side of the mailbox. mbox and ibox are
used to read data written to the SPU mailbox write channels and have
slightly different semantics. wbox is used to write into the SPU mailbox
read channel.
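From userspace, the mailbox files can be accessed with ordinary read and write calls. The following Python sketch shows the idea under stated assumptions: the helper names are hypothetical, one mailbox entry is assumed to be a 32-bit word in native byte order, and error handling for an empty or full mailbox is left out.

```python
import os
import struct
from typing import Optional

# Assumption: one mailbox entry is a 32-bit word in native byte order.
WORD = struct.Struct("=I")


def read_mailbox_word(path: str) -> Optional[int]:
    """Read one word from an mbox or ibox file; None if no full word came back."""
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.read(fd, WORD.size)
    finally:
        os.close(fd)
    if len(data) < WORD.size:
        return None
    return WORD.unpack(data)[0]


def write_mailbox_word(path: str, word: int) -> None:
    """Write one word to a wbox file."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, WORD.pack(word))
    finally:
        os.close(fd)
```

A program would call these with paths such as the wbox and mbox files inside its own context directory.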
Using SPU contexts
To use an SPU from a process, the user needs write
permission on the spufs mount point and must choose an unused name for
the new SPU context. The mkdir system call creates the context, and the user process can subsequently open the
associated files inside that directory.
Listing 1. Example of the contents of an SPU context
$ mkdir /spu/my-spu-12345
$ ls -lR /spu/
/spu/:
total 0
drwxr-xr-x 2 arnd arnd 4096 Jun 17 21:00 my-spu-12345
/spu/my-spu-12345:
total 0
-r--r----- 1 arnd arnd 0 Jun 17 21:01 ibox
-r--r----- 1 arnd arnd 0 Jun 17 21:01 mbox
-rw-rw---- 1 arnd arnd 262144 Jun 17 21:01 mem
-rw-rw---- 1 arnd arnd 2048 Jun 17 21:01 regs
-rw-rw---- 1 arnd arnd 0 Jun 17 21:01 run
--w--w---- 1 arnd arnd 0 Jun 17 21:01 wbox
$
The program text and data segments now need to be written into the mem
file, either using the write system call, or by
mapping the file into the process address space. Normally, no relocations
need to be applied, since SPU programs are statically linked.
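Loading a program image by mapping the mem file might look roughly like the following Python sketch. The function name is hypothetical, and the image is assumed to be a flat, already linked blob of bytes; extracting the segments from an ELF file is not shown.

```python
import mmap
import os

LOCAL_STORE_SIZE = 256 * 1024   # size of the mem file as shown in Listing 1


def load_image(mem_path: str, image: bytes, offset: int = 0) -> None:
    """Copy a program image into a local store 'mem' file through mmap."""
    if offset < 0 or offset + len(image) > LOCAL_STORE_SIZE:
        raise ValueError("image does not fit into the local store")
    fd = os.open(mem_path, os.O_RDWR)
    try:
        # Map the whole local store into the process address space.
        with mmap.mmap(fd, LOCAL_STORE_SIZE) as local_store:
            local_store[offset:offset + len(image)] = image
    finally:
        os.close(fd)
```

The same copy could be done with a plain write call on the file; the mapping becomes more attractive when the process keeps exchanging data with the SPU program afterwards.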
Since every thread that executes the SPU_RUN ioctl blocks for the
duration of the ioctl system call, it cannot interact with any other
system resources at the same time, including other SPU contexts or files
belonging to the context it is executing in. A single process can work
with multiple SPU contexts, but to run on more than one SPU at a
given time, the process needs to contain at least one thread for each
running SPU context.
Likewise, if the program communicates with the SPU code using mailbox
access, it needs a second thread of execution, for example, one created by calling fork or pthread_create.
One of the threads then calls the SPU_RUN ioctl on the run file, while the
other runs in an event loop on the mailbox files and potentially other
file descriptors.
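The one-thread-per-running-context rule can be sketched as follows. This is a Python illustration only: spu_run is a hypothetical stand-in for whatever performs the blocking SPU_RUN ioctl on a context's run file, and it is passed in as a parameter so that the sketch does not assume any particular ioctl number or return convention.

```python
import threading
from typing import Callable, Dict, Iterable


def run_contexts(spu_run: Callable[[str], int],
                 contexts: Iterable[str]) -> Dict[str, int]:
    """Run every SPU context in its own thread and collect the stop status."""
    results: Dict[str, int] = {}
    lock = threading.Lock()

    def worker(context: str) -> None:
        status = spu_run(context)   # blocks for as long as the SPU executes
        with lock:
            results[context] = status

    threads = [threading.Thread(target=worker, args=(c,)) for c in contexts]
    for thread in threads:
        thread.start()
    # The calling thread stays free here, for example to run an event
    # loop on the mailbox files of the running contexts.
    for thread in threads:
        thread.join()
    return results
```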
A program that does not need to communicate with the SPU code while it
is running can simply have a single thread of execution that
progresses either on the userspace side or on the SPU.
When a program is done using the SPU context, it must close
all file descriptors that are open to files inside that context directory,
and then it can remove the directory itself using the rmdir system call.
mkdir creates a fully populated directory,
while rmdir removes it with all contained
files.
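The life cycle of a context, from mkdir to rmdir, maps naturally onto a scoped helper. The following Python sketch uses a hypothetical helper name; on a regular file system the created directory simply stays empty, whereas on spufs it would contain the files described above.

```python
import contextlib
import os


@contextlib.contextmanager
def spu_context(mount_point: str, name: str):
    """Create an SPU context directory, yield its path, and remove it again."""
    path = os.path.join(mount_point, name)
    os.mkdir(path)        # on spufs, this populates the directory automatically
    try:
        yield path
    finally:
        # The caller must have closed every file opened inside the context
        # before the block ends, otherwise the context cannot be removed.
        os.rmdir(path)
```

A caller would wrap all work on one context in a single with block, for example with a mount point of /spu and a name that includes its process ID to keep it unique.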
Signal handling
A signal might need to be delivered to a thread that is executing SPU
code. Often the signal is caused by the SPU code itself. When this happens, the
SPU is stopped and the ioctl call is interrupted.
If this causes a signal handler to be called in userspace, a new stack
frame is created in that thread and the ioctl arguments are updated to
reflect the current instruction pointer before the signal handler is
executed. Normally, the signal handler returns and causes the ioctl to
be re-entered at exactly the point where it was left so the SPU program can
continue.
SPU library abstraction
Library interfaces
A portable library interface has been established on top of that low-level programming model. The library interface does not rely on
implementation details of the file system, but can also be used on other
operating systems that might have a different kernel interface.
Instead of providing an abstraction of logical SPUs, this interface is
thread-oriented and behaves in a similar way to the pthread library. When
an SPU thread gets created, the library will create a new thread that
manages the SPU context asynchronously to the main thread.
Library calls from an SPU
When the SPU needs to do any standard library calls like printf or exit, it has to
call back to the main thread. It does so by executing the special
stop-and-signal assembly instruction with a standardized argument value.
That value is returned from the ioctl call and the user thread must react
to that. This usually means copying the arguments over from the SPE local
store, running the respective library function in the user thread, and
continuing by calling the ioctl again.
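The callback scheme amounts to a service loop in the user thread. The Python sketch below shows its shape under loudly stated assumptions: spu_run is a hypothetical stand-in for the SPU_RUN ioctl that returns the stop-and-signal value together with the instruction pointer to resume at, and the numeric stop codes are invented for the example. Copying the call arguments out of the local store is not shown.

```python
from typing import Callable, Dict, Tuple

STOP_EXIT = 0x2000   # hypothetical stop-and-signal value for "program finished"


def service_spu(spu_run: Callable[[int], Tuple[int, int]],
                handlers: Dict[int, Callable[[], None]],
                entry: int) -> int:
    """Re-enter the SPU until it signals exit, serving callbacks in between.

    Returns the number of library callbacks that were serviced.
    """
    pc = entry
    serviced = 0
    while True:
        stop_code, pc = spu_run(pc)       # blocks until the SPU stops
        if stop_code == STOP_EXIT:
            return serviced
        handler = handlers.get(stop_code)
        if handler is None:
            raise RuntimeError(f"unexpected stop code {stop_code:#x}")
        handler()                         # run the library function in the user thread
        serviced += 1                     # then loop to continue the SPU program
```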
Direct system calls from an SPU
We are thinking about adding a direct system call mechanism to the spufs,
where the stop-and-signal instruction does not trap into userspace but
instead causes the kernel to read the system call arguments from local
store and enter the system call directly.
Because this happens inside the ioctl system call of a user process, any
pointer arguments to the SPU system call are assumed to point into the
address space of that process, and the SPU program needs to use DMA to
access them.
Toolchain support
Compiler, binutils
Because the Cell's PPE uses the same instruction set as the PowerPC 970
CPU, it does not require changes to the compiler or the binutils. However,
compiled code runs more efficiently on it when using a compiler that
schedules instructions specifically for the CPU's pipeline structure.
You can get a patch to GCC to add a pipeline definition for the PPE so you
can create optimized code.
Since the SPU instruction set is not directly related to an existing CPU
architecture, a new back end was written for both GCC and binutils. SPU
code is compiled separately from the PPC code and gets loaded at run time.
Both the changes to GCC for PPE optimizations and the SPU back end are
expected to be released as part of a Linux distribution.
The compiler introduces new intrinsics for DMA transfers and mailbox
accesses, because these are not part of the C language standard.
Also, new intrinsics and a new data type give access to the
vector instructions for parallel calculation. This is similar to what is
done for VMX/AltiVec or SSE vector instructions.
Future GCC versions should also be able to use autovectorization
and create vector code automatically, but we do not have that yet. It is
also likely that explicit vector instructions are normally more efficient
than relying on the compiler to generate them.
Debugger
Debugging SPU programs creates a new set of problems. While it would be
relatively easy to create a new GDB target for the SPU itself, most users
need to debug interaction between PPE and SPE. The approach we are taking
here is to enable GDB to support multiple targets in one binary, and change
the PowerPC target to know about spufs. When the PowerPC debugger finds a
program executing the SPU_RUN ioctl call, it switches to the SPU back-end code
and works on the SPU context instead of the main program context.
Profiler
While there currently is no working profiler for spufs programs, we have
plans to extend OProfile to cover mixed PPE/SPE programs. This requires
changes to the OProfile kernel code to sample the SPU instruction pointers
regularly.
On the userspace side of OProfile, an extra level of indirection is needed to
get from the spufs file to the actual ELF file that was loaded into it.
Conclusion
With the current state of Linux on Cell, you can write special-purpose applications running on the prototype board while using the full
performance of the chip. While most applications do not immediately run
better on Cell, there is a lot of potential to port performance-critical
applications to use library code running on an SPU for better performance.
The base platform support is currently making its way into the mainline
Linux kernel, and the SPU file system interface is on its way to being stabilized
enough to be included in upcoming releases of the kernel and of major
distributions.
Resources
- This article is adapted from The Cell Processor Programming Model, a paper presented at LinuxTag 2005 by Arnd Bergmann of IBM.
- A joint team of Sony, IBM, and Toshiba employees based in Austin, Texas, did the groundwork for the Linux kernel port. The partnership, also known as STI, has been working on Cell since 2001.
- The LTC team is currently working at IBM Deutschland Entwicklung GmbH, also known as the IBM Research Böblingen Lab.
- Preliminary support for Cell is being contributed to GNU Compiler Collection (GCC) and GNU Project Debugger (GDB).
- The kernel patches can be found in the ozlabs archive (see the Linux PowerPC64 Development Patches) and are tracked with the help of the Patchwork patch tracking system (learn more about Patchwork).
- You can also use OProfile with the base Linux code for Cell.
- Sony recently published a paper on Programming Cell, an outstanding resource that presents some alternative models for programming Cell. In particular, slide 24, Programming models, offers a good overview that is explored in subsequent slides.
- The IBM Microelectronics Technical Library has a new section devoted to Cell. See also the excellent content up at IBM Research's Cell pages.
About the author
Arnd Bergmann has been hacking the Linux kernel and a number of other
open source packages since 1998. In his current project for
the IBM Linux Technology Center, he is responsible
for the kernel in the first Cell processor-based workstation computer,
which premieres at LinuxTag 2005. You can reach him at arndb@de.ibm.com.