Virtualization is a revolutionary technology. Software virtualization is one of the most discussed IT topics in 2007. In this short series, you will learn about container virtualization (also known as operating system virtualization), an approach that uses hardware resources efficiently, and how it applies to the Cell Broadband Engine (Cell/B.E.) processor. You will also learn about the OpenVZ project, an open source software project that brings container virtualization to Linux®. The commercial counterpart of OpenVZ is SWsoft Inc.'s Virtuozzo.
This article series uncovers what's needed to virtualize the Cell/B.E. processor with software methods. You will read an introduction to the interfaces of an OpenVZ Linux kernel that need to be modified to use the Cell/B.E. platform inside a container environment. You will also see a measurement report that demonstrates the efficiency of container virtualization in the Cell/B.E. environment.
Understanding OpenVZ and the Cell/B.E. processor
OpenVZ is an open source software implementation that brings container virtualization support to Linux. OpenVZ consists of two components:
- OpenVZ Linux kernel
- OpenVZ user space tools
The main goal of the OpenVZ project is to produce isolated containers that run a virtual Linux operating system instance. All the containers have access to one single, virtualized kernel. The host system that provides this kernel handles the virtualized, so-called core four devices: CPU, memory, disk space, and network. You can also pass one device directly to the container so that it becomes inaccessible to the host system and to all other containers except the one that owns the device (in other words, dedicated device allocation).
Figure 1 shows the isolation of three container instances (labeled VS-1, VS-2, and VS-3). All three containers share the same OpenVZ Linux kernel. See Resources.
Figure 1. Isolation of three container instances
By now, you are probably quite familiar with the Cell/B.E. processor, which you can find in third-party products that run devices ranging from hybrid supercomputers to specialized rugged CABs to the Sony PlayStation 3 game console. The processor combines a general-purpose Power Architecture™ core with streamlined co-processing elements (SPUs) that greatly accelerate multimedia and vector applications. With a single-precision floating-point performance of 25.6 GFLOPS per SPU, the Cell/B.E. processor is often called a supercomputer on a chip.
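The 25.6 GFLOPS figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes a 3.2 GHz clock, a 4-wide single-precision SIMD unit, and one fused multiply-add (counted as two floating-point operations) per lane per cycle. None of these parameters is stated in this article, so treat them as assumptions.

```python
# Back-of-the-envelope peak single-precision throughput of one SPU.
# All three parameters are assumptions, not figures from this article.
CLOCK_GHZ = 3.2      # assumed clock frequency
SIMD_LANES = 4       # assumed 32-bit lanes in one 128-bit SIMD register
FLOPS_PER_FMA = 2    # a fused multiply-add counts as two operations


def peak_gflops_per_spu(clock_ghz=CLOCK_GHZ, lanes=SIMD_LANES, flops=FLOPS_PER_FMA):
    """Peak GFLOPS if every lane retires one multiply-add per cycle."""
    return clock_ghz * lanes * flops


def peak_gflops_per_chip(spus=8):
    """Aggregate peak over the SPUs of one processor (PPE not counted)."""
    return spus * peak_gflops_per_spu()


if __name__ == "__main__":
    print(f"{peak_gflops_per_spu():.1f} GFLOPS per SPU")
    print(f"{peak_gflops_per_chip():.1f} GFLOPS for eight SPUs")
```

These are theoretical peaks; real applications reach them only when DMA transfers and computation overlap well.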
Figure 2 outlines the basic architecture of a Cell/B.E. processor.
Figure 2. Basic Cell/B.E. processor architecture
The basic architecture consists of a Power Processing Element (PPE) that contains a PowerPC core with a traditional memory subsystem and eight Synergistic Processing Elements (SPEs) that are connected using a high-bandwidth internal connector called the Element Interconnect Bus (EIB). Each SPE consists of a Synergistic Processing Unit (SPU), a 256 KB local store (LS) that holds the instructions and the data of an SPE, and a memory flow controller (MFC) that handles the DMA transfers between the system's main memory and the LS of the SPEs. The PPE serves only the operating system. At this time, Linux is the only operating system that makes use of the Cell/B.E. features.
Understanding container virtualization and the Cell/B.E. processor
Figure 3 depicts the partitioning of the Cell/B.E. processor.
Figure 3. Partitioning the Cell/B.E. processor
The containers are granted access to dedicated physical SPUs that are available in the system. The SPUs that are accessible in one container are not accessible in any other container or in the host system for the duration of the dedicated allocation. Each container uses only the SPUs that it owns. To implement this, there are four things for you to do:
- Virtualize the SPU filesystem (spufs). The spufs must become accessible
inside the containers, but each container should only see the SPE threads that
it created itself.
- Adjust the virtual filesystem provided by the 2.6
Linux kernel (sysfs). The sysfs inside the container must contain the correct directory entries in
/sys/devices/system/spu for the SPUs that are allocated to it. The libspe2 uses
these directory entries to count the number of available SPUs inside the
container.
- Modify the SPU scheduler. The SPU scheduling must be modified so
that SPE threads that are created inside a container run only on the
SPUs that are dedicated to the same container.
- Modify the OpenVZ tools. The vzctl tool of the OpenVZ tools must support allocating SPUs to a container and, as the counterpart, freeing the SPUs of a container. Both allocation and freeing should work while the container is running.
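The four steps above can be pictured with a small user-space model. This is only a sketch of the bookkeeping, not kernel code: the class `SpuPartitioner` and its method names are hypothetical, and a real implementation would live in the spufs, sysfs, and SPU scheduler code of the OpenVZ kernel.

```python
class SpuPartitioner:
    """Toy model of dedicated SPU allocation (all names are hypothetical)."""

    def __init__(self, spu_count):
        # owner[n] is the container ID that owns SPU n; None means the host.
        self.owner = {n: None for n in range(spu_count)}

    def allocate(self, ctid, count):
        """Dedicate `count` free SPUs to container `ctid`, possibly at runtime."""
        free = [n for n, o in sorted(self.owner.items()) if o is None]
        if len(free) < count:
            raise RuntimeError("not enough free SPUs")
        for n in free[:count]:
            self.owner[n] = ctid
        return free[:count]

    def release(self, ctid):
        """Counterpart of allocate: hand all SPUs of `ctid` back to the host."""
        for n, o in self.owner.items():
            if o == ctid:
                self.owner[n] = None

    def sysfs_view(self, ctid=None):
        """Entries /sys/devices/system/spu would show to `ctid` (None = host)."""
        return [f"spu{n}" for n, o in sorted(self.owner.items()) if o == ctid]

    def may_run(self, ctid, spu):
        """Scheduler check: an SPE thread runs only on SPUs its container owns."""
        return self.owner[spu] == ctid


if __name__ == "__main__":
    part = SpuPartitioner(16)
    part.allocate(101, 2)
    print(part.sysfs_view(101))    # SPUs visible inside container 101
    print(len(part.sysfs_view()))  # SPUs left to the host
```

The single ownership table is a design choice of the sketch: the sysfs view and the scheduler check are both derived from it, so they cannot disagree about which SPUs a container owns.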
The concept of shared virtualized devices means that the device is accessible by all the containers, as shown in Figure 4.
Figure 4. Shared virtualization of the SPUs
You need a mechanism that controls how much of the device or for how long the whole device is accessible by the container. Because a single SPU is not shareable by amount, the containers share the SPU by the duration that they have access to it.
The implementation should follow these four steps:
- Virtualize the spufs. The spufs must become
accessible inside the containers, but each container should only see the SPE
threads that it created itself.
- Adjust the sysfs. The sysfs inside the container must contain the
correct directory entries in /sys/devices/system/spu. Here, the better solution is to give the containers access to all the SPUs that are installed in the system
so that the /sys/devices/system/spu directory has the same entries on the host
system and inside all containers.
- Modify the SPU scheduler. The SPU scheduler must be modified so
that SPE threads that are created inside a container get access to the SPUs
that are available in the system only for a certain amount of time. A scheduling
algorithm similar to the two-level Fair CPU scheduler that the OpenVZ
team implemented for ordinary processes can be designed for SPE threads.
Alternatively, a completely new scheduling algorithm can be deployed.
- Modify the OpenVZ tools. The vzctl tool of the OpenVZ tools must support the setting of per-container SPU execution time. This setting should be changeable during the runtime of the container.
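To make the idea of sharing SPUs by time more concrete, here is a minimal two-level scheduling sketch. It is not the OpenVZ Fair CPU scheduler and not the spufs scheduler. It swaps in a simple weighted virtual-time rule for the first level and plain round-robin for the second, and every name in it (`TwoLevelSpuScheduler`, `weight`, and so on) is hypothetical.

```python
from collections import deque
from fractions import Fraction


class TwoLevelSpuScheduler:
    """Toy two-level time-slice scheduler for SPE threads (hypothetical)."""

    def __init__(self):
        self.weight = {}   # container ID -> configured share of SPU time
        self.vtime = {}    # container ID -> virtual time consumed so far
        self.threads = {}  # container ID -> queue of runnable SPE threads

    def add_container(self, ctid, weight):
        self.weight[ctid] = weight
        self.vtime[ctid] = Fraction(0)
        self.threads[ctid] = deque()

    def add_thread(self, ctid, name):
        self.threads[ctid].append(name)

    def next_slice(self):
        """Return (container, thread) for the next time slice, or None."""
        runnable = [c for c in self.threads if self.threads[c]]
        if not runnable:
            return None
        # Level 1: pick the container that has used the least virtual time.
        ctid = min(runnable, key=lambda c: (self.vtime[c], c))
        # A larger weight makes virtual time advance more slowly.
        self.vtime[ctid] += Fraction(1, self.weight[ctid])
        # Level 2: round-robin among the SPE threads of that container.
        queue = self.threads[ctid]
        thread = queue[0]
        queue.rotate(-1)
        return ctid, thread


if __name__ == "__main__":
    sched = TwoLevelSpuScheduler()
    sched.add_container(101, weight=3)
    sched.add_container(102, weight=1)
    sched.add_thread(101, "fft-a")
    sched.add_thread(101, "fft-b")
    sched.add_thread(102, "matmul")
    for _ in range(8):
        print(sched.next_slice())
```

With weights 3 and 1, container 101 receives three slices for every slice of container 102. A per-container weight of this kind is the sort of value the modified vzctl would have to set at runtime.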
Now, a bit about spufs.
Spufs is comparable to another well-known virtual filesystem (VFS): procfs. Procfs was the first abstract filesystem in Linux, and it was meant to represent processes in a simple way. While procfs represents processes running on the CPU, spufs represents threads running on the SPUs of a Cell/B.E. processor.
The spufs is a special-purpose filesystem. It is a VFS developed for
controlling the SPUs on the Cell/B.E. processor. Each directory in the spufs
refers to a logical SPE context. Such an SPE context is treated like a physical
SPE. The context properties are represented as files inside the directory.
Accesses to SPE contexts either manipulate a real SPE or the saved state of it in
memory. To start an SPU program, copy the SPU ELF executable into the SPE's LS and
then execute the spu_run system call.
If you want to run Cell/B.E.-specific code on Linux, such as programs that run on the SPU of a Cell/B.E., you have to use the spufs; there is no other way to start processes that use the SPUs of a Cell/B.E. on Linux.
As you can see in Listing 1, before executing the ls
command, the fft sample program of the Cell SDK was run with two SPE threads.
/spu is the mount point of the spufs.
Listing 1. Executing Cell SDK Fast-Fourier Transform sample with two SPE threads
[root@c02b12-0 ~]# ls /spu
spethread-20914-25296904 spethread-20914-25297464
[root@c02b12-0 ~]# ls /spu/*
/spu/spethread-20914-25296904:
cntl decr_status event_mask fpcr ibox_info lslr mbox_info mem mss
object-id proxydma_info regs signal1_type signal2_type wbox wbox_stat
decr dma_info event_status ibox ibox_stat mbox mbox_stat mfc npc physid
psmap signal1 signal2 srr0 wbox_info
/spu/spethread-20914-25297464:
cntl decr_status event_mask fpcr ibox_info lslr mbox_info mem mss
object-id proxydma_info regs signal1_type signal2_type wbox wbox_stat
decr dma_info event_status ibox ibox_stat mbox mbox_stat mfc npc physid
psmap signal1 signal2 srr0 wbox_info
The output of the first command shows that there is one directory (kobject) for each SPE thread with the process ID (PID) and the thread ID of the SPE thread in the name of the directory. The second output shows the files inside both directories.
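Because the directory names carry the owning process, a tool can group SPE contexts by process, which is the information a virtualized spufs needs to hide foreign threads from a container. The sketch below assumes the naming pattern seen in Listing 1 (`spethread-<PID>-<thread ID>`). The helper names are hypothetical and are not part of spufs or libspe.

```python
import re

# Pattern taken from the directory names shown in Listing 1.
_SPETHREAD = re.compile(r"^spethread-(\d+)-(\d+)$")


def parse_spethread(name):
    """Split a spufs directory name into (pid, thread_id), or return None."""
    match = _SPETHREAD.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def threads_by_pid(names):
    """Group SPE thread IDs by the PID that created them."""
    grouped = {}
    for name in names:
        parsed = parse_spethread(name)
        if parsed is not None:
            pid, thread_id = parsed
            grouped.setdefault(pid, []).append(thread_id)
    return grouped


if __name__ == "__main__":
    listing = ["spethread-20914-25296904", "spethread-20914-25297464"]
    print(threads_by_pid(listing))
```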
Figure 5 shows how the two different types of ELF executables are handled.
Figure 5. How two different types of ELF executables are handled
An SPE executable in the form of an SPE object file is bound to an SPE thread that is executed on a physical SPE. A special SPE scheduler decides when and which SPE thread is executed on a physical SPE. PPE object files are treated like they are on standard PowerPC® architecture systems such as JS21 Blade Servers. The PPE object files are bound to standard Linux threads that are scheduled by the Linux scheduler that is implemented in the kernel. So there is really nothing unique about applications that run on the PPE.
Understanding the Cell/B.E. SDK and the libspe
The software development kit (SDK) for the Cell/B.E. is now available in version 3.0 (get the latest copy at Cell Resource Center Downloads). The Cell/B.E. SDK includes all the tools you need to develop applications that make use of the Cell/B.E. processor. Among other things, the SDK contains different compiler and linker suites, libraries that simplify development (like libspe), and sample programs that demonstrate how to use Cell/B.E. architecture-specific features such as the SPU, MFC, and LS.
At this time, the libspe (in versions libspe1 and libspe2) is the
only user of the spufs. It is a framework that makes it easy for you to
develop code for the Cell/B.E. architecture that runs on the SPEs. It copies the
ELF executable into the LS of an SPE, and it calls the
spu_run system call so you don't have to
worry about all the internals of launching an executable on an SPE.
Figure 6 shows the hierarchical extensions that are implemented in Linux to use the Cell/B.E. processor.
Figure 6. Hierarchical extensions implemented in Linux to use Cell/B.E. processors
The graphic shows the SPE Management Runtime Library (libspe) in user space making use of the spufs filesystem implemented in the kernel space.
Sysfs was originally introduced into the Linux kernel as driverfs, with the intention of giving an overview of all the devices and drivers the kernel knows about. It was designed to be a much cleaner way to access devices and drivers than procfs. The sysfs shows a hierarchy of kobject data structures (each of them as a directory) and a set of attributes (of the kobject structure) that are files typically containing one single value encoded in a text string.
For example, Listing 2 is a listing of the /sys/devices/system/spu directory on a QS21 Cell/B.E. blade.
Listing 2. One directory for each SPU
[root@c02b12-0 ~]# ls /sys/devices/system/spu/
spu0 spu1 spu10 spu11 spu12 spu13 spu14 spu15 spu2 spu3 spu4 spu5 spu6
spu7 spu8 spu9
[root@c02b12-0 ~]#
A QS21 Cell/B.E. blade holds two Cell/B.E. processors with eight SPUs each, so you have a total number of 16 SPUs residing on the blade. As you can see, the /sys/devices/system/spu directory contains one directory for each SPU. The libspe2 makes use of these directory entries to count the number of available physical SPUs in the system.
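The counting idea is simple enough to sketch. The function below is not the libspe2 implementation. It only mimics the approach described above by counting the `spuN` entries in a directory, and the name `count_spus` is hypothetical.

```python
import os
import re


def count_spus(sysfs_dir="/sys/devices/system/spu"):
    """Count entries named spu<N> in sysfs_dir; 0 if the directory is missing."""
    try:
        entries = os.listdir(sysfs_dir)
    except FileNotFoundError:
        # No such directory, for example on a machine without a Cell/B.E.
        return 0
    return sum(1 for entry in entries if re.fullmatch(r"spu\d+", entry))


if __name__ == "__main__":
    print(count_spus())
```

This is why the sysfs view matters for virtualization: whatever the container sees in this directory is what the library reports as available SPUs.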
Understanding the OpenVZ kernel
The OpenVZ kernel is a modified Linux kernel that introduces the capability of having isolated operating system container environments. In addition, it offers resource management and checkpointing to the containers. Keep in mind that each container and even the host system use the same shared and virtualized kernel.
Each container has its own set of resources that are provided by the kernel, such as:
- Files: system libraries and applications.
- Virtualized filesystems: procfs or sysfs.
- Users and groups: each container has its own root user, as well as other users and groups.
- Process tree: a container can only see its own set of processes with virtualized PIDs (init PID is 1).
- Network: virtual network devices with own IP addresses, routing, and filter rules.
- Devices: some devices are virtualized. If there is a need, any container can be granted access to a real (non-virtualized) device.
- IPC objects: semaphores, messages, or shared memory.
The resource management is done on different types of resources:
- Disk quota: OpenVZ introduces a two-level disk quota. The first level limits the disk space available to the container, and the container can in turn set quotas inside its own environment.
- CPU scheduler: the Fair CPU Scheduler also uses a two-level mechanism. The first level is a per-container, configurable scheduler with which you define how much CPU time a certain container gets. The second level takes care of the process scheduling inside the container environment.
- User beancounters: this is a set of counters, limits, and guarantees for container resources. There are about 20 parameters that take care of memory and various in-kernel objects, such as IPC shared memory segments and network buffers.
Checkpointing is another main function. Checkpointing and restoring are necessary for live migration. Checkpointing is the process of freezing a container and then saving its complete state to a disk file. Restoring is the counterpart. Live migration of a container is the process of checkpointing a container on one host system and restoring it on another host system.
Figure 7 shows the differences between a live migration and a checkpointing and restoring process.
Figure 7. Differences between a live migration and a checkpointing and restoring process
Figure 7 demonstrates that the live migration is one single action whereas checkpointing and restoring are two different actions that need an extra storage unit that is accessible from both hardware nodes.
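The relationship between the three operations can be shown with a toy model. This is not how OpenVZ serializes a container. A dictionary stands in for the container state, a JSON file stands in for the dump file, and all function names are hypothetical.

```python
import json
import os
import tempfile


def checkpoint(container, dump_path):
    """Freeze the container, then save its complete state to a file."""
    container["frozen"] = True
    with open(dump_path, "w") as dump:
        json.dump(container["state"], dump)


def restore(dump_path):
    """Counterpart of checkpoint: rebuild a running container from a file."""
    with open(dump_path) as dump:
        return {"state": json.load(dump), "frozen": False}


def live_migrate(container, shared_dir):
    """One single action: checkpoint on the source, restore on the target."""
    dump_path = os.path.join(shared_dir, "container.dump")
    checkpoint(container, dump_path)
    try:
        return restore(dump_path)
    finally:
        os.remove(dump_path)  # the intermediate dump is not kept


if __name__ == "__main__":
    source = {"state": {"hostname": "vs-1", "pids": [1, 42]}, "frozen": False}
    with tempfile.TemporaryDirectory() as shared:
        target = live_migrate(source, shared)
    print(source["frozen"], target)
```

In the model, `shared_dir` plays the role of the storage that both hardware nodes must be able to reach when checkpointing and restoring are run as two separate actions.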
Using vzctl and other OpenVZ tools
The main OpenVZ tool is vzctl, which is the high-level command-line interface to manage container environments. Vzctl can be used to create, start, stop, and destroy a virtual operating system environment. This is called a container lifecycle.
Vzctl can also be used to change various container resources such as an IP address, memory, or CPU time that a container environment can use. Most of these parameters can be set and changed during runtime of the container. This is usually impossible with other virtualization technologies, such as platform virtualization.
You can only launch the vzctl tool from the host system and not from inside the container.
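As a rough illustration of the lifecycle and of runtime changes, the sketch below only builds vzctl command lines and prints them; nothing is executed. The container ID 101, the template name `my-template`, and the address 192.0.2.10 are placeholders, and the Python helper names are hypothetical. No SPU-related option is shown, because stock vzctl has none.

```python
import shlex


def vzctl(*args):
    """Build a vzctl argument list; nothing is executed here."""
    return ["vzctl", *map(str, args)]


def lifecycle(ctid, ostemplate):
    """The container lifecycle: create, start, stop, destroy."""
    return [
        vzctl("create", ctid, "--ostemplate", ostemplate),
        vzctl("start", ctid),
        vzctl("stop", ctid),
        vzctl("destroy", ctid),
    ]


def set_ip(ctid, address):
    """Add an IP address; --save also writes it to the container config."""
    return vzctl("set", ctid, "--ipadd", address, "--save")


def set_cpu_share(ctid, units):
    """Change the CPU weight of a container, possibly while it is running."""
    return vzctl("set", ctid, "--cpuunits", units, "--save")


if __name__ == "__main__":
    commands = lifecycle(101, "my-template")  # placeholder template name
    commands.append(set_ip(101, "192.0.2.10"))
    commands.append(set_cpu_share(101, 1000))
    for argv in commands:
        print(shlex.join(argv))
```

To actually run one of these on the host system, pass the list to `subprocess.run(argv, check=True)` as root.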
Besides vzctl, there are many more tools to manage OpenVZ containers. These tools are not needed for virtualizing the Cell/B.E. with OpenVZ, so they are not covered here; you can find more details about the general management of container environments in the OpenVZ User's Guide (see Resources). Coincidentally, the authors of this series wrote the OpenVZ on POWER™ handbook.
Part 2 describes only the implementation for the concept of dedicated virtualization (partitioning) shown in Figure 3.
Many thanks to the authors of "Mehrarbeit fur CPUs" (Linux Magazin, April 2006) for the use of the image in Figure 1.
Learn
- Use an RSS feed to request notification for the upcoming articles in this series. (Find out more about RSS feeds of developerWorks content.)
- For OpenVZ and virtualization, try
"Virtualization in Linux"
(September 2006), which ties the three main virtualization approaches (emulation,
paravirtualization, and OS-level virtualization) to OpenVZ.
- Check out Arnd Bergmann's
"Spufs: The Cell Synergistic Processing Unit as a virtual file system"
(developerWorks, June 2005) for details about the SPU filesystem interface that allows Linux
to run on the Cell/B.E. platform. Bergmann's "How to not invent kernel interfaces"
(paper to LinuxConf Europe, July 2007) explains how to choose the form of user
space interface that your kernel code should get. Bergmann also has a
virtual class on Linux on Cell/B.E. platforms
that covers the threading model, Linux runtime strategy, PPU/SPU runtime
requirements, spufs, signal handling, and more.
- Find Daniel Hackenberg's
"Performance Measurements on Cell SMP Systems"
presentation (for the Center for Information Services and High Performance
Computing's Cell/B.E. cluster meeting, May 2007) for such performance analysis
measures as matrix multiplication, XDR DMA bandwidth, and SPE-to-SPE DMA
bandwidth.
- Read Duc Vianney's
"Cell Software Solutions Programming Model"
presentation (March 2006) for such Cell/B.E. programming model issues as PPE-
versus SPE-centric, function offload, overlapping DMA and computation, and
heterogeneous multi-thread.
- Try
"Virtualization in a nutshell"
(developerWorks, June 2006) as an introduction to the topic of basic
virtualization concepts by means of common patterns.
"Virtual Linux"
(December 2006) explains the various forms of virtualization (and current
virtualization projects) from a Linux perspective.
- See
"Virtualization with coLinux"
(developerWorks, March 2007) and
"System emulation with QEMU"
(September 2007) for more on paravirtualization.
- Explore a new quick-read jumpstart series in
the blog that covers SDK 3.0 topics:
- Introducing the Accelerated Library Framework (ALF)
- Illustrating the 10 most important concepts of ALF
- Introducing the Data Communication and Synchronization (DaCS) library services
- Learn about
"Changes in libspe: How libspe2 affects Cell Broadband Engine programming"
(developerWorks, July 2007) for the libspe2 concepts and how to do basic
SPE process management and communication with libspe2.
- Refer to
"Introduction to the Cell Multiprocessor"
(IBM Journal of Research and Development, 2005) for an introductory
overview of the Cell/B.E. multiprocessor's history, the program objectives and
challenges, the design concept, the architecture and programming models, and the
implementation.
- Explore other developerWorks resources you may be
interested in because of their connection to virtualization, including the
Linux zone and the
Open source zone.
- To learn more on Cell/B.E. programming, try the
developerWorks series:
- "Programming high-performance applications on the Cell/B.E. processor"
- "PS3 fab-to-lab"
- "The little broadband engine that could"
- Refer to the Cell
Broadband Engine documentation section of the IBM Semiconductor Solutions Technical Library for a wealth of downloadable manuals,
specifications, and more.
- Sign up for the developerWorks newsletter
and get the latest developer news and Cell/B.E. happenings delivered to your inbox each week.
Check Power Architecture when you sign up to receive Cell/B.E. news in your newsletter.
Get products and technologies
- Get the
"OpenVZ User's Guide"
Version 2.7.0-8 from SWsoft.
- Try out the
LMbench suite of tools for
performance analysis, including portable benchmarks that compare different UNIX systems
performance.
- Use the Power.org code
sample that performs a 4-way SIMD single-precision complex FFT
within a Cell/B.E. environment.
- See the centerpiece
of Cell/B.E. development, including the
latest
Cell/B.E. SDK release: the SDK for Multicore Acceleration 3.0.
There's even a
documentation library
to support it.
- Find all Cell/B.E.-related articles, discussion forums, downloads,
and more at the IBM developerWorks Cell
Broadband Engine resource center: your definitive resource for all
things Cell/B.E.
- Contact IBM about custom
Cell/B.E.-based or custom-processor based solutions.
Discuss
- Participate in the discussion forum.
- Post questions on the
OpenVZ forum.
- Check out the Cell Broadband
Engine Architecture forum to get your technical questions about the processor answered.
Juicy problems and answers from the forums are rounded up periodically and highlighted
in the "Forum watch" blog series.
- Go to the Power Architecture blog for news, downloads,
instructional resources, and event notifications for Cell/B.E. and other Power Architecture-related technologies. You can find
the popular "Forum watch" blog series (Q&A roundup), the "FixIt" technology updates, and the Infobomb quick-read
technology introductions.
Christian Kaiser studies Computer Engineering at RWTH University in Aachen, Germany. In 2007, he was an intern at IBM Germany Research Lab in Boeblingen, Germany. While doing his internship at IBM, he researched virtualization methods for the Cell Broadband Engine processor. After the internship, Christian Kaiser went to work on his thesis at the Chair for Operating Systems of the RWTH Aachen. His thesis is about "Analysis of asynchronous collective communication in memory-coupled high-speed networks."
Christian Rund is in the IBM Development Laboratory in Boeblingen, Germany. He studied Computer Science at the University of Stuttgart and at Uppsala Universitet, Sweden, graduating in 1997. During his studies, he worked as a student trainee at IBM in Herrenberg and Stuttgart, Germany. In 1998, he joined the system development department of the Landeszentralbank in Stuttgart (Deutsche Bundesbank). In 2001, Christian joined IBM as a research and development engineer on the development team for the zSeries FCP channel. Since mid-2006, he has been a research and development engineer for host firmware for Cell/B.E.-based servers.