Level: Intermediate
Christian Kaiser (christian.kaiser1@rwth-aachen.de), Research Intern, IBM
Christian Rund (Christian.Rund@de.ibm.com), Research and Development Engineer, IBM
11 Dec 2007
This three-part series illustrates a
hardware-resource-focused form of software virtualization known as container
virtualization (or operating system virtualization), demonstrated through the open
source project OpenVZ. The series provides a comprehensive overview of all the
components and techniques needed to virtualize the Cell/B.E. processor with software
methods. This first article of the series discusses the basic concepts
involved, illustrates the salient points of the OpenVZ and Cell/B.E. architectures
and how they work together, and describes some of the OpenVZ tools.
Introduction
Virtualization is a revolutionary technology, and software virtualization is one
of the most discussed IT topics of 2007. In this short series, you will learn
about container virtualization (also known as operating system virtualization),
an approach that uses hardware resources efficiently, and how to apply it to
the Cell Broadband Engine (Cell/B.E.) processor. You
will also learn about the OpenVZ project, an open source
software project that brings container virtualization to
Linux®. The commercial counterpart of OpenVZ is SWsoft Inc.'s
Virtuozzo.
This article series uncovers what's needed to
virtualize the Cell/B.E. processor with software methods. You will read an
introduction to the interfaces of an OpenVZ Linux kernel that need to be modified
to use the Cell/B.E. platform inside a container environment. You will also see a
measurement report to demonstrate the efficiency of container virtualization in
the Cell/B.E. environment.
Understanding OpenVZ and the Cell/B.E. processor
OpenVZ is an open source software implementation that brings container
virtualization support to Linux. OpenVZ consists of two components:
- OpenVZ Linux kernel
- OpenVZ user space tools
The main goal of the OpenVZ project is to produce isolated containers that each run a
virtual Linux operating system instance. All the containers share one
single, virtualized kernel. The host system that provides this kernel handles
the four virtualized core devices: CPU, memory, disk space, and
network. You can also pass a device directly to a container so that
it becomes inaccessible to the host system and to all other containers except the
one that owns the device (in other words, dedicated device allocation).
Figure 1 shows the isolation of three container instances (labeled VS-1, VS-2,
and VS-3). All three containers share the same OpenVZ Linux kernel. See Resources.
Figure 1. Isolation of three container instances
By now, you are probably quite familiar with the Cell/B.E. processor, which you
can find in
third-party products ranging from hybrid supercomputers to specialized
rugged CABs to the Sony PlayStation 3 game console. The processor combines a general-purpose
Power Architecture™ core with streamlined co-processing elements (SPUs)
that greatly accelerate multimedia and vector applications. With a
single-precision floating-point performance of 25.6 GFLOPS per SPU, the Cell/B.E.
processor is often called a supercomputer on a chip.
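The per-SPU figure can be checked with a short calculation. The sketch below is only a back-of-the-envelope check: the 3.2 GHz clock, the four single-precision lanes per instruction, and the one fused multiply-add per cycle are assumptions that this article does not state.

```python
# Back-of-the-envelope check of the per-SPU peak (all parameters assumed).
CLOCK_HZ = 3.2e9       # assumption: 3.2 GHz clock frequency
SIMD_LANES = 4         # assumption: four 32-bit floats per 128-bit register
FLOPS_PER_FMA = 2      # a fused multiply-add counts as two operations


def peak_gflops(spus: int = 1) -> float:
    """Return the theoretical single-precision peak for `spus` SPUs."""
    return spus * CLOCK_HZ * SIMD_LANES * FLOPS_PER_FMA / 1e9


if __name__ == "__main__":
    print(f"{peak_gflops(1):.1f} GFLOPS per SPU")
    print(f"{peak_gflops(8):.1f} GFLOPS for eight SPUs")
```

Under these assumptions the result matches the 25.6 GFLOPS quoted above; real applications reach this peak only when every cycle issues a multiply-add.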
Figure 2 outlines the basic architecture of a Cell/B.E. processor.
Figure 2. Basic Cell/B.E. processor architecture
The basic architecture consists of a Power Processing Element (PPE) that contains a PowerPC core
with a traditional memory subsystem and eight Synergistic Processing Elements
(SPEs). All elements are connected by a high-bandwidth internal interconnect called the Element
Interconnect Bus (EIB). Each SPE consists of a Synergistic Processing Unit (SPU),
a 256KB local store (LS) that holds the instructions and the data of an SPE, and a
memory flow controller (MFC) that handles the DMA transfers between the system's
main memory and the LS of the SPEs. The PPE runs the operating system. At
this time, Linux is the only operating system that makes use
of the Cell/B.E. features.
Understanding container virtualization and the Cell/B.E. processor
Figure 3 depicts the partitioning of the Cell/B.E. processor.
Figure 3. Partitioning the Cell/B.E. processor
The containers are granted access to dedicated physical SPUs that are
available in the system. SPUs that are allocated to one container are not
accessible from any other container or from the host system for the duration of the dedicated
allocation. Each container uses only the SPUs that it owns. To implement
this, you need to do four things:
- Virtualize the SPU filesystem (spufs). The spufs must become accessible
inside the containers, but each container should only see the SPE threads that
it created itself.
- Adjust the virtual filesystem provided by the 2.6
Linux kernel (sysfs). The sysfs inside the container must contain the correct directory entries in
/sys/devices/system/spu for the SPUs that are allocated to it. The libspe2 uses
these directory entries to count the number of available SPUs inside the
container.
- Modify the SPU scheduler. The SPU scheduling must be modified so
that SPE threads that are created inside a container run only on the
SPUs that are dedicated to the same container.
- Modify the OpenVZ tools. The vzctl tool of the OpenVZ tools must
support allocating SPUs to a container and freeing them again. Both allocation and freeing should work while
the container is running.
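The bookkeeping behind dedicated allocation can be sketched in a few lines. This is a minimal user-space model, not the kernel implementation: the class `SpuPartitioner`, its method names, and the use of ID 0 for the host system are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional


class SpuPartitioner:
    """Toy model: tracks which container exclusively owns each physical SPU."""

    HOST = 0  # assumption: ID 0 stands for the host system

    def __init__(self, spu_count: int) -> None:
        # Initially every SPU belongs to the host system.
        self.owner: Dict[int, int] = {spu: self.HOST for spu in range(spu_count)}

    def allocate(self, ctid: int, count: int) -> List[int]:
        """Dedicate `count` host-owned SPUs to container `ctid`."""
        free = [spu for spu in sorted(self.owner) if self.owner[spu] == self.HOST]
        if count > len(free):
            raise ValueError("not enough unallocated SPUs")
        chosen = free[:count]
        for spu in chosen:
            self.owner[spu] = ctid
        return chosen

    def release(self, ctid: int, count: Optional[int] = None) -> List[int]:
        """Return `count` SPUs (default: all) of container `ctid` to the host."""
        owned = [spu for spu in sorted(self.owner) if self.owner[spu] == ctid]
        freed = owned if count is None else owned[:count]
        for spu in freed:
            self.owner[spu] = self.HOST
        return freed

    def visible_spus(self, ctid: int) -> List[str]:
        """Directory entries a sysfs view for `ctid` would have to show."""
        return [f"spu{spu}" for spu in sorted(self.owner) if self.owner[spu] == ctid]

    def may_run(self, ctid: int, spu: int) -> bool:
        """Scheduler check: may an SPE thread of `ctid` run on `spu`?"""
        return self.owner.get(spu) == ctid
```

In this model, `visible_spus` corresponds to the sysfs adjustment, `may_run` to the scheduler modification, and `allocate` and `release` to the two operations the vzctl tool would have to expose.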
The concept of shared virtualized devices means that the device is accessible by
all the containers, as shown in Figure 4.
Figure 4. Shared virtualization of the SPUs
You need a mechanism that controls how much of the device or for how long
the whole device is accessible by the container. Because a single SPU is not
shareable by amount, the containers share the SPU by the duration that they have access
to it.
The implementation should follow these four steps:
- Virtualize the spufs. The spufs must become
accessible inside the containers, but each container should only see the SPE
threads that it created itself.
- Adjust the sysfs. The sysfs inside the container must contain the
correct directory entries in /sys/devices/system/spu. A better solution is to give the containers access to all the SPUs that are installed in the system
so that the /sys/devices/system/spu directory has the same entries on the host
system and inside all containers.
- Modify the SPU scheduler. The SPU scheduler must be modified so
that SPE threads that are created inside a container get access to the SPUs
that are available in the system only for a certain amount of time. A scheduling
algorithm similar to the two-level Fair CPU scheduler that the OpenVZ
team implemented for ordinary processes can be designed for SPE threads. Alternatively, a
completely new scheduling algorithm can be deployed.
- Modify the OpenVZ tools. The vzctl tool of the OpenVZ tools must
support the setting of per-container SPU execution time. This setting should be
changeable during the runtime of the container.
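One way to picture time-based sharing is a two-level scheduler for a single SPU. The sketch below is an illustration only and does not reproduce the OpenVZ Fair CPU scheduler: the class `TwoLevelSpuScheduler`, the per-container weights, and the virtual-time bookkeeping are hypothetical. The first level picks the container that has consumed the least weighted time; the second level rotates round-robin through that container's SPE threads.

```python
from collections import deque
from fractions import Fraction
from typing import Deque, Dict, Tuple


class TwoLevelSpuScheduler:
    """Toy two-level scheduler that hands out time slices of one SPU."""

    def __init__(self) -> None:
        self.weights: Dict[int, int] = {}      # per-container share (assumed knob)
        self.vtime: Dict[int, Fraction] = {}   # weighted SPU time consumed so far
        self.queues: Dict[int, Deque[str]] = {}

    def add_container(self, ctid: int, weight: int) -> None:
        if weight <= 0:
            raise ValueError("weight must be positive")
        self.weights[ctid] = weight
        self.vtime[ctid] = Fraction(0)
        self.queues[ctid] = deque()

    def add_thread(self, ctid: int, name: str) -> None:
        self.queues[ctid].append(name)

    def next_slice(self, slice_ms: int = 10) -> Tuple[int, str]:
        """Pick the (container, SPE thread) pair that runs the next slice."""
        runnable = [ctid for ctid, queue in self.queues.items() if queue]
        if not runnable:
            raise RuntimeError("no runnable SPE threads")
        # Level 1: the container with the least weighted time goes first.
        ctid = min(runnable, key=lambda c: (self.vtime[c], c))
        # Level 2: round-robin among the SPE threads of that container.
        thread = self.queues[ctid].popleft()
        self.queues[ctid].append(thread)
        # A higher weight makes virtual time advance more slowly.
        self.vtime[ctid] += Fraction(slice_ms, self.weights[ctid])
        return ctid, thread
```

With weights of 3 and 1, the first container receives three of every four slices over time. Changing a weight while the scheduler runs models the per-container execution time setting that should remain adjustable at runtime.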
Now, a bit about spufs.
Using spufs
Spufs is comparable to another well-known virtual filesystem (VFS): procfs.
Procfs was the first abstract filesystem in Linux, and it was meant to represent
processes in a simple way. While procfs represents processes running on the CPU,
spufs represents threads running on the SPUs of a Cell/B.E. processor.
The spufs is a special-purpose filesystem. It is a VFS developed for
controlling the SPUs on the Cell/B.E. processor. Each directory in the spufs
refers to a logical SPE context. Such an SPE context is treated like a physical
SPE. The context properties are represented as files inside the directory.
Accesses to SPE contexts either manipulate a real SPE or the saved state of it in
memory. To start an SPU program, copy the SPU ELF executable into the SPE's LS and
then execute the spu_run system call.
If you want to run Cell/B.E.-specific code on Linux, such as programs that run on
the SPU of a Cell/B.E., you have to use the spufs; there is no other way to
start processes that use the SPUs of a Cell/B.E. on Linux.
As you can see in Listing 1, before executing the ls
command, the fft sample program of the Cell SDK was run with two SPE threads.
/spu is the mount point of the spufs.
Listing 1. Executing Cell SDK Fast-Fourier Transform sample with two SPE threads
[root@c02b12-0 ~]# ls /spu
spethread-20914-25296904 spethread-20914-25297464
[root@c02b12-0 ~]# ls /spu/*
/spu/spethread-20914-25296904:
cntl decr_status event_mask fpcr ibox_info lslr mbox_info mem mss
object-id proxydma_info regs signal1_type signal2_type wbox wbox_stat
decr dma_info event_status ibox ibox_stat mbox mbox_stat mfc npc physid
psmap signal1 signal2 srr0 wbox_info
/spu/spethread-20914-25297464:
cntl decr_status event_mask fpcr ibox_info lslr mbox_info mem mss
object-id proxydma_info regs signal1_type signal2_type wbox wbox_stat
decr dma_info event_status ibox ibox_stat mbox mbox_stat mfc npc physid
psmap signal1 signal2 srr0 wbox_info
The output of the first command shows that there is one directory (kobject) for
each SPE thread with the process ID (PID) and the thread ID of the SPE thread in
the name of the directory. The second output shows the files inside both
directories.
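The naming scheme seen in Listing 1 can be taken apart programmatically, for example to group SPE threads by the process that created them. The sketch below assumes only the `spethread-<PID>-<thread ID>` pattern visible in the listing; the helper names are hypothetical, and other library versions may name their contexts differently.

```python
import re
from typing import NamedTuple, Optional


class SpeThreadDir(NamedTuple):
    pid: int         # process ID of the creating process
    thread_id: int   # thread ID of the SPE thread


# Pattern inferred from Listing 1, not taken from a specification.
_SPE_DIR = re.compile(r"^spethread-(\d+)-(\d+)$")


def parse_spe_thread_dir(name: str) -> Optional[SpeThreadDir]:
    """Split a spufs directory name into PID and thread ID, or return None."""
    match = _SPE_DIR.match(name)
    if match is None:
        return None
    return SpeThreadDir(int(match.group(1)), int(match.group(2)))


if __name__ == "__main__":
    for entry in ("spethread-20914-25296904", "spethread-20914-25297464"):
        print(entry, "->", parse_spe_thread_dir(entry))
```

A filter of this kind is also one conceivable starting point for a virtualized spufs view that shows a container only the threads it created itself.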
Figure 5 shows how the two different types of ELF executables are handled.
Figure 5. How two different types of ELF executables are handled
An SPE executable in the form of an SPE object file is bound to an SPE thread that
is executed on a physical SPE. A special SPE scheduler decides when and which SPE
thread is executed on a physical SPE. PPE object files are treated like they are
on standard PowerPC® architecture systems such as JS21 Blade Servers. The
PPE object files are bound to standard Linux threads that are scheduled by the
Linux scheduler that is implemented in the kernel. So there is really nothing
unique about applications that run on the PPE.
Understanding the Cell/B.E. SDK and the libspe
The software development kit (SDK) for the Cell/B.E. is now available in
version 3.0 (get the latest copy at Cell Resource Center Downloads).
The Cell/B.E. SDK includes all the tools you need to develop applications that make use of
the Cell/B.E. processor. Among other things, the SDK contains different compiler
and linker suites, libraries that simplify development (like libspe), and
sample programs that demonstrate how to use Cell/B.E.
architecture-specific features such as the SPU, MFC, and LS.
At this time, the libspe (in versions libspe1 and libspe2) is the
only user of the spufs. It is a framework that makes it easy for you to
develop code for the Cell/B.E. architecture that runs on the SPEs. It copies the
ELF executable into the LS of an SPE, and it calls the
spu_run system call so you don't have to
worry about all the internals of launching an executable on an SPE.
Figure 6 shows the hierarchical extensions that are implemented in Linux to use
the Cell/B.E. processor.
Figure 6. Hierarchical extensions implemented in Linux to use Cell/B.E. processors
The graphic shows the SPE Management Runtime Library (libspe) in user space
making use of the spufs filesystem implemented in the kernel space.
Understanding sysfs
Sysfs was originally introduced as driverfs into the Linux kernel
with the intention of having an overview of all the devices and drivers the kernel
knows about. It was designed to be a much cleaner way to access devices and
drivers than in procfs. The sysfs shows a hierarchy of kobject data structures
(each of them as a directory) and a set of attributes of each kobject structure,
exposed as files that typically contain one single value encoded as a text string.
For example, Listing 2 is a listing of the /sys/devices/system/spu directory on
a QS21 Cell/B.E. blade.
Listing 2. One directory for each SPU
[root@c02b12-0 ~]# ls /sys/devices/system/spu/
spu0 spu1 spu10 spu11 spu12 spu13 spu14 spu15 spu2 spu3 spu4 spu5 spu6
spu7 spu8 spu9
[root@c02b12-0 ~]#
A QS21 Cell/B.E. blade holds two Cell/B.E. processors with eight SPUs each, so
you have a total number of 16 SPUs residing on the blade. As you can see, the
/sys/devices/system/spu directory contains one directory for each SPU. The
libspe2 makes use of these directory entries to count the number of available
physical SPUs in the system.
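Counting SPUs this way amounts to listing the directory and keeping the entries that look like `spuN`. The sketch below imitates that idea in Python; it is not the libspe2 code, and the function name `count_spus` is hypothetical.

```python
import os
import re

# Entries such as spu0 ... spu15, as seen in Listing 2.
_SPU_ENTRY = re.compile(r"spu\d+")


def count_spus(sysfs_dir: str = "/sys/devices/system/spu") -> int:
    """Count the spuN entries below `sysfs_dir`; return 0 if it does not exist."""
    try:
        entries = os.listdir(sysfs_dir)
    except FileNotFoundError:
        return 0
    return sum(1 for entry in entries if _SPU_ENTRY.fullmatch(entry))


if __name__ == "__main__":
    print("available SPUs:", count_spus())
```

Because the count depends only on the directory entries, a container whose sysfs shows just its own `spuN` entries would report only the SPUs that are allocated to it.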
Understanding the OpenVZ kernel
The OpenVZ kernel is a modified Linux kernel that introduces the capability of
having isolated operating system container environments. In addition, it
offers resource management and checkpointing
to the containers. Keep in mind that each container and even the host
system use the same shared and virtualized kernel.
Each container has its own set of resources that are provided by the
kernel, such as:
- Files: system libraries and applications.
- Virtualized filesystems: procfs or sysfs.
- Users and groups: each container has its own root user, as well as other users
and groups.
- Process tree: a container can only see its own set of processes with
virtualized PIDs (init PID is 1).
- Network: virtual network devices with own IP addresses, routing, and filter
rules.
- Devices: some devices are virtualized. If there is a need, any container can
be granted access to a real (non-virtualized) device.
- IPC objects: semaphores, messages, or shared memory.
Resource management
Resource management covers different types of resources:
- Disk quota: OpenVZ introduces a two-level disk quota. The first level limits
the disk space available to the container; the second level lets the container set up its own quotas inside
its environment.
- CPU scheduler: the Fair CPU Scheduler also uses a two-level mechanism.
The first level is a configurable, per-container scheduler with which you can
define how much of the CPU time a certain container may use. The second
level of the scheduler takes care of the process scheduling inside the container
environment.
- User beancounters: this is a set of counters, limits, and guarantees for
container resources. There are about 20 parameters that take care of memory and
various in-kernel objects, such as IPC shared memory segments and network
buffers.
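The two-level idea behind the disk quota can be illustrated with a small model. The class `TwoLevelQuota` and its methods are hypothetical and greatly simplified: a write is accepted only if it fits both the limit of the whole container (first level) and, where one is set, the quota of the user inside the container (second level).

```python
from typing import Dict


class TwoLevelQuota:
    """Toy model of a container-wide limit plus per-user quotas inside it."""

    def __init__(self, container_limit_kb: int) -> None:
        self.container_limit_kb = container_limit_kb   # level 1, set from the host
        self.user_limits_kb: Dict[str, int] = {}       # level 2, set in the container
        self.usage_kb: Dict[str, int] = {}

    def set_user_limit(self, user: str, limit_kb: int) -> None:
        self.user_limits_kb[user] = limit_kb

    def charge(self, user: str, size_kb: int) -> bool:
        """Account `size_kb` to `user`; return False if either level refuses."""
        total = sum(self.usage_kb.values())
        if total + size_kb > self.container_limit_kb:
            return False   # the container as a whole is out of space
        used = self.usage_kb.get(user, 0)
        limit = self.user_limits_kb.get(user)
        if limit is not None and used + size_kb > limit:
            return False   # this user exceeded the quota inside the container
        self.usage_kb[user] = used + size_kb
        return True
```

The same pattern of an outer per-container budget and an inner policy reappears in the Fair CPU Scheduler described above.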
Checkpointing
Checkpointing is another main function of the OpenVZ kernel. Checkpointing and restoring
are necessary for live migration.
Checkpointing is the process of freezing a container and saving its complete
state to a disk file afterwards. Restoring is the counterpart. Live migration of a
container is the process of checkpointing a container on one host system and
restoring it on another host system.
Figure 7 shows the differences between a live migration and a checkpointing and
restoring process.
Figure 7. Differences between a live migration and a checkpointing and restoring process
Figure 7 demonstrates that the live migration is one single action whereas
checkpointing and restoring are two different actions that need an extra storage
unit that is accessible from both hardware nodes.
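The relationship between the three operations can be made concrete with a toy model in which a container is just a dictionary and the dump file is JSON. None of this reflects the actual OpenVZ dump format or kernel interfaces; the function names and the `status` field are hypothetical.

```python
import json
import os
from typing import Any, Dict

Container = Dict[str, Any]


def checkpoint(container: Container, dump_path: str) -> None:
    """Freeze the container, then save its complete state to a disk file."""
    container["status"] = "frozen"
    with open(dump_path, "w", encoding="utf-8") as dump:
        json.dump(container, dump)


def restore(dump_path: str) -> Container:
    """Counterpart of checkpoint: rebuild the container from the dump file."""
    with open(dump_path, encoding="utf-8") as dump:
        container = json.load(dump)
    container["status"] = "running"
    return container


def live_migrate(container: Container, shared_dir: str) -> Container:
    """One action from the caller's view: checkpoint here, restore there."""
    dump_path = os.path.join(shared_dir, "ct%d.dump" % container["ctid"])
    checkpoint(container, dump_path)
    migrated = restore(dump_path)      # would run on the destination host
    os.remove(dump_path)               # the temporary dump is no longer needed
    container["status"] = "stopped"    # the source copy does not resume
    return migrated
```

Calling `checkpoint` and `restore` separately leaves the dump file in place between the two steps, which is why that variant needs storage that both hardware nodes can reach.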
Using vzctl and other OpenVZ tools
The main OpenVZ tool is vzctl, which is the high-level command-line interface to manage
container environments. Vzctl can be used to create, start, stop, and destroy a
virtual operating system environment. This is called the container
lifecycle.
Vzctl can also be used to change various container resources such as an IP address,
memory, or CPU time that a container environment can use. Most of these
parameters can be set and changed during runtime of the container. This is usually
impossible with other virtualization technologies, such as platform virtualization.
You can only launch the vzctl tool from the host system and not from inside the
container.
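The container lifecycle can be viewed as a small state machine. In the sketch below, the action names follow the four vzctl operations named above, while the state names (`absent`, `stopped`, `running`) and the rule that only a stopped container can be destroyed are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

# (current state, action) -> next state. State names are hypothetical.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("absent", "create"): "stopped",
    ("stopped", "start"): "running",
    ("running", "stop"): "stopped",
    ("stopped", "destroy"): "absent",
}


def apply_action(state: str, action: str) -> str:
    """Return the state that follows `action`, or raise if it is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} a container that is {state}") from None


def run_lifecycle(actions: Iterable[str], state: str = "absent") -> str:
    """Apply a sequence of actions and return the final state."""
    for action in actions:
        state = apply_action(state, action)
    return state


if __name__ == "__main__":
    print(run_lifecycle(["create", "start", "stop", "destroy"]))
```

Resource changes made at runtime, such as adjusting the CPU time of a container, do not appear in this table because they leave the lifecycle state unchanged.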
Besides vzctl, there are many more tools to manage OpenVZ
containers. These tools are not needed for the virtualization of the
Cell/B.E., so they are not covered here; you can find more details about the general management of container
environments in the OpenVZ User's Guide (see Resources).
The authors of this series also wrote the OpenVZ on POWER™
handbook.
Getting ready for Part 2
Part 2 describes only the implementation for the concept of
dedicated virtualization (partitioning) shown in
Figure 3.
Acknowledgments
Many thanks to the authors of
"Mehrarbeit fur CPUs"
(Linux Magazin, April 2006) for the use of the image in
Figure 1.
Resources
Learn
- Use an
RSS
feed to request notification for the upcoming articles in this series. (Find out more about RSS feeds of developerWorks content.)
- For OpenVZ and virtualization, try
"Virtualization in Linux"
(September 2006), which ties the three main virtualization approaches (emulation,
paravirtualization, and OS-level virtualization) to OpenVZ.
- Check out Arnd Bergmann's
"Spufs: The Cell Synergistic Processing Unit as a virtual file system"
(developerWorks, June 2005) for details about the SPU filesystem interface that allows Linux
to run on the Cell/B.E. platform. Bergmann's "How to not invent kernel interfaces"
(paper to LinuxConf Europe, July 2007) explains how to choose the form of user
space interface that your kernel code should get. Bergmann also has a
virtual
class on Linux on Cell/B.E. platforms
that covers the threading model, Linux runtime strategy, PPU/SPU runtime
requirements, spufs, signal handling, and more.
- Find Daniel Hackenberg's
"Performance Measurements on Cell SMP Systems"
presentation (for the Center for Information Services and High Performance
Computing's Cell/B.E. cluster meeting, May 2007) for such performance analysis
measures as matrix multiplication, XDR DMA bandwidth, and SPE-to-SPE DMA
bandwidth.
- Read Duc Vianney's
"Cell Software Solutions Programming Model"
presentation (March 2006) for such Cell/B.E. programming model issues as PPE-
versus SPE-centric, function offload, overlapping DMA and computation, and
heterogeneous multi-thread.
- Try
"Virtualization in a nutshell"
(developerWorks, June 2006) as an introduction to the topic of basic
virtualization concepts by means of common patterns.
"Virtual Linux"
(December 2006) explains the various forms of virtualization (and current
virtualization projects) from a Linux perspective.
- See
"Virtualization with coLinux"
(developerWorks, March 2007) and
"System emulation with QEMU"
(September 2007) for more on paravirtualization.
- Explore a new quick-read jumpstart series in
the blog that covers SDK 3.0 topics:
- Learn about
"Changes in libspe: How libspe2 affects Cell Broadband Engine programming"
(developerWorks, July 2007) for the libspe2 concepts and how to do basic
SPE process management and communication with libspe2.
- Refer to
"Introduction to the Cell Multiprocessor"
(IBM Journal of Research and Development, 2005) for an introductory
overview of the Cell/B.E. multiprocessor's history, the program objectives and
challenges, the design concept, the architecture and programming models, and the
implementation.
- Explore other developerWorks resources you may be
interested in because of their connection to virtualization, including the
Linux zone and the
Open source zone.
- To learn more on Cell/B.E. programming, try the
developerWorks series:
- Refer to the Cell
Broadband Engine documentation section of the IBM Semiconductor Solutions Technical Library for a wealth of downloadable manuals,
specifications, and more.
- Sign up for the developerWorks newsletter
and get the latest developer news and Cell/B.E. happenings delivered to your inbox each week.
Check Power Architecture when you sign up to receive Cell/B.E. news in your newsletter.
About the authors
Christian Kaiser studies Computer Engineering at RWTH University in Aachen, Germany. In 2007, he was an intern at IBM Germany Research Lab in Boeblingen, Germany. While doing his internship at IBM, he researched virtualization methods for the Cell Broadband Engine processor. After the internship, Christian Kaiser went to work on his thesis at the Chair for Operating Systems of the RWTH Aachen. His thesis is about "Analysis of asynchronous collective communication in memory-coupled high-speed networks."
Christian Rund is in the IBM Development Laboratory in Boeblingen, Germany. He studied Computer Science at the University of Stuttgart and at Uppsala Universitet, Sweden, graduating in 1997. During his studies, he worked as a student trainee at IBM in Herrenberg and Stuttgart, Germany. In 1998, he joined the system development department of the Landeszentralbank in Stuttgart (Deutsche Bundesbank). In 2001, Christian joined IBM as a research and development engineer on the development team for the zSeries FCP channel. Since mid-2006, he has been a research and development engineer for host firmware for Cell/B.E.-based servers.