
Cell/B.E. container virtualization, Part 1: Concepts, architectures, and tools

Learn the concepts of software-based container virtualization on the Cell/B.E. platform using the open source software project OpenVZ

Christian Kaiser studies Computer Engineering at RWTH University in Aachen, Germany. In 2007, he was an intern at IBM Germany Research Lab in Boeblingen, Germany. While doing his internship at IBM, he researched virtualization methods for the Cell Broadband Engine processor. After the internship, Christian Kaiser went to work on his thesis at the Chair for Operating Systems of the RWTH Aachen. His thesis is about "Analysis of asynchronous collective communication in memory-coupled high-speed networks."
Christian Rund (Christian.Rund@de.ibm.com), Research and Development Engineer, IBM
Christian Rund is in the IBM Development Laboratory in Boeblingen, Germany. He studied Computer Science at the University of Stuttgart and at Uppsala Universitet, Sweden, graduating in 1997. During his studies, he worked as a student trainee at IBM in Herrenberg and Stuttgart, Germany. In 1998, he joined the system development department of the Landeszentralbank in Stuttgart (Deutsche Bundesbank). In 2001, Christian joined IBM as a research and development engineer on the development team for the zSeries FCP channel. Since mid-2006, he has been a research and development engineer for host firmware for Cell/B.E.-based servers.

Summary:  This three-part series illustrates a hardware-resource-focused form of software virtualization known as container virtualization (or operating system virtualization), demonstrated through the open source project OpenVZ. The series provides a comprehensive overview of all the components and techniques needed to virtualize the Cell/B.E. processor with software methods. This first article of the series discusses the basic concepts involved, illustrates the salient points of the OpenVZ and Cell/B.E. architectures and how they work together, and describes some of the OpenVZ tools.


Date:  11 Dec 2007
Level:  Intermediate

Introduction

Software virtualization is one of the most discussed IT topics of 2007. In this short series, you will learn about container virtualization (also known as operating system virtualization), an approach to virtualizing the Cell Broadband Engine (Cell/B.E.) processor that is efficient in its use of hardware resources. You will also learn about the OpenVZ project, an open source software project that brings container virtualization to Linux®. The commercial counterpart of OpenVZ is SWsoft Inc.'s Virtuozzo.

This article series covers what is needed to virtualize the Cell/B.E. processor with software methods. It introduces the interfaces of an OpenVZ Linux kernel that must be modified to use the Cell/B.E. platform inside a container environment, and it presents a measurement report that demonstrates the efficiency of container virtualization in the Cell/B.E. environment.


Understanding OpenVZ and the Cell/B.E. processor

OpenVZ is an open source software implementation that brings container virtualization support to Linux. OpenVZ consists of two components:

  • OpenVZ Linux kernel
  • OpenVZ user space tools

The main goal of the OpenVZ project is to produce isolated containers that each run a virtual Linux operating system instance. All the containers share one single, virtualized kernel. The host system that provides this kernel handles the four core virtualized devices: CPU, memory, disk space, and network. You can also pass a device directly to a container so that it becomes inaccessible to the host system and to all other containers (in other words, dedicated device allocation).

Figure 1 shows the isolation of three container instances (labeled VS-1, VS-2, and VS-3). All three containers share the same OpenVZ Linux kernel. See Resources.


Figure 1. Isolation of three container instances

By now, you are probably quite familiar with the Cell/B.E. processor, which you can find in third-party products ranging from hybrid supercomputers to specialized rugged CABs to the Sony PlayStation 3 game console. The processor combines a general purpose Power Architecture™ core with streamlined co-processing elements (SPUs) that greatly accelerate multimedia and vector applications. With a single-precision floating-point performance of 25.6 GFLOPS per SPU, the Cell/B.E. processor is often called a supercomputer on a chip.

Figure 2 outlines the basic architecture of a Cell/B.E. processor.


Figure 2. Basic Cell/B.E. processor architecture

The basic architecture consists of a Power Processing Element (PPE), which contains a PowerPC core with a traditional memory subsystem, and eight Synergistic Processing Elements (SPEs). These elements are connected by a high-bandwidth internal bus called the Element Interconnect Bus (EIB). Each SPE consists of a Synergistic Processing Unit (SPU), a 256KB local store (LS) that holds the instructions and the data of the SPE, and a memory flow controller (MFC) that handles the DMA transfers between the system's main memory and the LS of the SPE. The PPE serves only the operating system. At this time, Linux is the only operating system that makes use of the Cell/B.E. features.


Understanding container virtualization and the Cell/B.E. processor

Figure 3 depicts the partitioning of the Cell/B.E. processor.


Figure 3. Partitioning the Cell/B.E. processor

The containers are granted access to dedicated physical SPUs that are available in the system. SPUs that are accessible in one container are not accessible in any other container or in the host system for the duration of the dedicated allocation. Each container uses only the SPUs that it owns. To implement this, you need to do four things:

  1. Virtualize the SPU filesystem (spufs). The spufs must become accessible inside the containers, but each container should only see the SPE threads that it created itself.

  2. Adjust the virtual filesystem provided by the 2.6 Linux kernel (sysfs). The sysfs inside the container must contain the correct directory entries in /sys/devices/system/spu for the SPUs that are allocated to it. The libspe2 uses these directory entries to count the number of available SPUs inside the container.

  3. Modify the SPU scheduler. The SPU scheduling must be modified so that SPE threads that are created inside a container run only on the SPUs that are dedicated to the same container.

  4. Modify the OpenVZ tools. The vzctl tool must support allocating SPUs to a container and, as the counterpart, freeing the SPUs of a container. Both allocation and freeing should work while the container is running.
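
The ownership rule behind these four steps can be pictured with a small user-space model. The following Python sketch is not kernel code and not part of OpenVZ; the class and method names (SpuPartition, allocate, free, visible_spus, may_run) and the use of ID 0 for the host system are assumptions made for illustration. It shows only that an SPU belongs to at most one owner at a time and that an SPE thread may run only on SPUs owned by its container.

```python
class SpuPartition:
    """Toy model of dedicated SPU allocation (hypothetical names, not an OpenVZ API)."""

    HOST = 0  # assumption: ID 0 stands for the host system

    def __init__(self, num_spus):
        # Initially every physical SPU belongs to the host system.
        self.owner = {spu: self.HOST for spu in range(num_spus)}

    def allocate(self, ctid, count):
        """Move `count` host-owned SPUs to container `ctid`; return their numbers."""
        unallocated = [spu for spu in sorted(self.owner) if self.owner[spu] == self.HOST]
        if count > len(unallocated):
            raise ValueError("not enough unallocated SPUs")
        chosen = unallocated[:count]
        for spu in chosen:
            self.owner[spu] = ctid
        return chosen

    def free(self, ctid):
        """Return all SPUs of container `ctid` to the host system."""
        for spu in list(self.owner):
            if self.owner[spu] == ctid:
                self.owner[spu] = self.HOST

    def visible_spus(self, ctid):
        """SPUs that the sysfs view of container `ctid` would list."""
        return [spu for spu in sorted(self.owner) if self.owner[spu] == ctid]

    def may_run(self, ctid, spu):
        """Scheduler rule: a thread of `ctid` may use `spu` only if `ctid` owns it."""
        return self.owner.get(spu) == ctid
```

In this model, allocate and free correspond to what an extended vzctl would request at runtime, visible_spus to the adjusted sysfs view, and may_run to the check that a modified SPU scheduler would apply.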

The concept of shared virtualized devices means that the device is accessible by all the containers, as shown in Figure 4.


Figure 4. Shared virtualization of the SPUs

You need a mechanism that controls how much of the device or for how long the whole device is accessible by the container. Because a single SPU is not shareable by amount, the containers share the SPU by the duration that they have access to it.

The implementation should follow these four steps:

  1. Virtualize the spufs. The spufs must become accessible inside the containers, but each container should only see the SPE threads that it created itself.

  2. Adjust the sysfs. The sysfs inside the container must contain the correct directory entries in /sys/devices/system/spu. The preferable solution gives the containers access to all the SPUs that are installed in the system, so that the /sys/devices/system/spu directory has the same entries on the host system and inside all containers.

  3. Modify the SPU scheduler. The SPU scheduler must be modified so that SPE threads that are created inside a container get access to the SPUs that are available in the system only for a certain amount of time. A scheduling algorithm similar to the two-level Fair CPU scheduler that the OpenVZ team implemented for ordinary processes can be designed for SPE threads. A completely new scheduling algorithm could also be deployed.

  4. Modify the OpenVZ tools. The vzctl tool of the OpenVZ tools must support the setting of per-container SPU execution time. This setting should be changeable during the runtime of the container.
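
To make the idea of sharing SPUs by duration more concrete, here is a minimal two-level scheduling sketch in Python. It is not the OpenVZ Fair CPU scheduler and not an existing SPU scheduler; the class name SharedSpuScheduler, the integer weights, and the weighted-time bookkeeping are all assumptions chosen for illustration. The first level picks the container that has consumed the least weighted SPU time, and the second level rotates round-robin among the SPE threads of that container.

```python
from collections import deque
from fractions import Fraction


class SharedSpuScheduler:
    """Toy two-level time-share scheduler for SPE threads (hypothetical model)."""

    def __init__(self):
        self.weight = {}  # container ID -> positive integer share
        self.vtime = {}   # container ID -> weighted SPU time consumed so far
        self.queues = {}  # container ID -> runnable SPE threads

    def add_container(self, ctid, weight):
        if weight <= 0:
            raise ValueError("weight must be a positive integer")
        self.weight[ctid] = weight
        self.vtime[ctid] = Fraction(0)
        self.queues[ctid] = deque()

    def add_thread(self, ctid, thread):
        self.queues[ctid].append(thread)

    def pick(self, timeslice=1):
        """Choose the SPE thread that gets the next time slice on a free SPU."""
        ready = [ctid for ctid, queue in self.queues.items() if queue]
        if not ready:
            return None
        # Level 1: the container with the least weighted time wins (ties: lowest ID).
        ctid = min(ready, key=lambda c: (self.vtime[c], c))
        # Level 2: round robin among the threads of that container.
        queue = self.queues[ctid]
        thread = queue.popleft()
        queue.append(thread)
        # A larger weight makes the consumed time count less, so the container
        # is picked more often.
        self.vtime[ctid] += Fraction(timeslice, self.weight[ctid])
        return ctid, thread
```

The per-container weight plays the role of the per-container SPU execution time setting from step 4: changing it while the model runs simply changes how quickly that container's weighted time grows.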

Now, a bit about spufs.

Using spufs

Spufs is comparable to another well-known virtual filesystem (VFS): procfs. Procfs was the first abstract filesystem in Linux, and it was meant to represent processes in a simple way. While procfs represents processes running on the CPU, spufs represents threads running on the SPUs of a Cell/B.E. processor.

The spufs is a special-purpose filesystem. It is a VFS developed for controlling the SPUs on the Cell/B.E. processor. Each directory in the spufs refers to a logical SPE context. Such an SPE context is treated like a physical SPE. The context properties are represented as files inside the directory. Accesses to an SPE context manipulate either a real SPE or its saved state in memory. To start an SPU program, copy the SPU ELF executable into the SPE's LS and then execute the spu_run system call.

If you want to run Cell/B.E.-specific code on Linux, such as programs that run on the SPU of a Cell/B.E., you have to use the spufs; there is no other way to start processes that use the SPUs of a Cell/B.E. on Linux.

For Listing 1, the fft sample program of the Cell SDK was started with two SPE threads before the ls commands were executed. /spu is the mount point of the spufs.


Listing 1. Executing Cell SDK Fast-Fourier Transform sample with two SPE threads
                
[root@c02b12-0 ~]# ls /spu
spethread-20914-25296904 spethread-20914-25297464

[root@c02b12-0 ~]# ls /spu/*
/spu/spethread-20914-25296904:
cntl decr_status event_mask fpcr ibox_info lslr mbox_info mem mss
object-id proxydma_info regs signal1_type signal2_type wbox wbox_stat
decr dma_info event_status ibox ibox_stat mbox mbox_stat mfc npc physid
psmap signal1 signal2 srr0 wbox_info

/spu/spethread-20914-25297464:
cntl decr_status event_mask fpcr ibox_info lslr mbox_info mem mss
object-id proxydma_info regs signal1_type signal2_type wbox wbox_stat
decr dma_info event_status ibox ibox_stat mbox mbox_stat mfc npc physid
psmap signal1 signal2 srr0 wbox_info

The output of the first command shows that there is one directory (kobject) for each SPE thread with the process ID (PID) and the thread ID of the SPE thread in the name of the directory. The second output shows the files inside both directories.
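
Because the directory names carry the PID of the creating process, they suggest how a per-container view of the spufs could be derived. The Python sketch below is an illustration only and works on plain name lists rather than on the kernel's data structures; the function names parse_spethread and visible_contexts, and the idea of passing in the set of PIDs that belong to a container, are assumptions.

```python
import re

# Directory names as they appear in Listing 1: spethread-<PID>-<thread ID>
_SPETHREAD = re.compile(r"^spethread-(\d+)-(\d+)$")


def parse_spethread(name):
    """Return (pid, thread_id) for a spufs directory name, or None if it does not match."""
    match = _SPETHREAD.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def visible_contexts(entries, container_pids):
    """Keep only the SPE thread directories created by processes of one container."""
    visible = []
    for name in entries:
        parsed = parse_spethread(name)
        if parsed is not None and parsed[0] in container_pids:
            visible.append(name)
    return visible
```

For example, calling visible_contexts with the two names from Listing 1 and the PID set {20914} keeps both entries, while a container that does not own PID 20914 would see none of them.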

Figure 5 shows how the two different types of ELF executables are handled.


Figure 5. How two different types of ELF executables are handled

An SPE executable in the form of an SPE object file is bound to an SPE thread that is executed on a physical SPE. A special SPE scheduler decides which SPE thread is executed on a physical SPE, and when. PPE object files are treated as they are on standard PowerPC® architecture systems such as JS21 Blade Servers. The PPE object files are bound to standard Linux threads that are scheduled by the Linux scheduler that is implemented in the kernel. So there is really nothing unique about applications that run on the PPE.


Understanding the Cell/B.E. SDK and the libspe

The software development kit (SDK) for the Cell/B.E. is now available in version 3.0 (get the latest copy at Cell Resource Center Downloads). The Cell/B.E. SDK includes all the tools you need to develop applications that make use of the Cell/B.E. processor. Among other things, the SDK contains different compiler and linker suites, libraries that simplify development (like libspe), and sample programs that demonstrate how to use Cell/B.E. architecture-specific features such as the SPU, MFC, and LS.

At this time, the libspe (in versions libspe1 and libspe2) is the only user of the spufs. It is a framework that makes it easy for you to develop code for the Cell/B.E. architecture that runs on the SPEs. It copies the ELF executable into the LS of an SPE, and it calls the spu_run system call so you don't have to worry about all the internals of launching an executable on an SPE.

Figure 6 shows the hierarchical extensions that are implemented in Linux to use the Cell/B.E. processor.


Figure 6. Hierarchical extensions implemented in Linux to use Cell/B.E. processors

The graphic shows the SPE Management Runtime Library (libspe) in user space making use of the spufs filesystem implemented in the kernel space.


Understanding sysfs

Sysfs was originally introduced into the Linux kernel as driverfs, with the intention of providing an overview of all the devices and drivers the kernel knows about. It was designed to be a much cleaner way to access devices and drivers than procfs. The sysfs shows a hierarchy of kobject data structures (each shown as a directory) and a set of attributes of each kobject structure, which are files that typically contain one single value encoded in a text string.

For example, Listing 2 is a listing of the /sys/devices/system/spu directory on a QS21 Cell/B.E. blade.


Listing 2. One directory for each SPU
                
[root@c02b12-0 ~]# ls /sys/devices/system/spu/
spu0  spu1  spu10  spu11  spu12  spu13  spu14  spu15  spu2  spu3  spu4  spu5  spu6
spu7  spu8  spu9
[root@c02b12-0 ~]#

A QS21 Cell/B.E. blade holds two Cell/B.E. processors with eight SPUs each, so you have a total number of 16 SPUs residing on the blade. As you can see, the /sys/devices/system/spu directory contains one directory for each SPU. The libspe2 makes use of these directory entries to count the number of available physical SPUs in the system.
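
The counting step is easy to reproduce from user space. The following Python sketch is not the libspe2 implementation; it only mirrors the idea of counting the spuN entries, and the function name count_spus is an assumption. The directory is a parameter so that the function can also be tried on a machine without Cell/B.E. hardware.

```python
import os
import re


def count_spus(sysfs_dir="/sys/devices/system/spu"):
    """Count the spu<N> entries in a sysfs-style directory (0 if it does not exist)."""
    pattern = re.compile(r"^spu\d+$")
    try:
        entries = os.listdir(sysfs_dir)
    except FileNotFoundError:
        return 0
    return sum(1 for entry in entries if pattern.match(entry))
```

Inside a container that uses dedicated allocation, the same count would reflect only the entries that the adjusted sysfs exposes to that container.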


Understanding the OpenVZ kernel

The OpenVZ kernel is a modified Linux kernel that introduces the capability of having isolated operating system container environments. In addition, it offers resource management and checkpointing to the containers. Keep in mind that each container and even the host system use the same shared and virtualized kernel.

Each container has its own set of resources that are provided by the kernel, such as:

  • Files: system libraries and applications.
  • Virtualized filesystems: procfs or sysfs.
  • Users and groups: each container has its own root user, as well as other users and groups.
  • Process tree: a container can only see its own set of processes with virtualized PIDs (init PID is 1).
  • Network: virtual network devices with their own IP addresses, routing, and filter rules.
  • Devices: some devices are virtualized. If there is a need, any container can be granted access to a real (non-virtualized) device.
  • IPC objects: semaphores, messages, or shared memory.

Resource management

The resource management is done on different types of resources:

  • Disk quota: OpenVZ introduces a two-level disk quota. The first level limits the disk space that is available to a container; the second level lets the container set quotas of its own inside its environment.
  • CPU scheduler: the Fair CPU Scheduler also uses a two-level mechanism. The first level is a per-container, configurable scheduler with which you define how much of the CPU time a certain container can use. The second level takes care of the process scheduling inside the container environment.
  • User beancounters: this is a set of counters, limits, and guarantees for container resources. There are about 20 parameters that take care of memory and various in-kernel objects, such as IPC shared memory segments and network buffers.
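
As an illustration of the counter, limit, and guarantee idea behind user beancounters, the sketch below models a single parameter in Python. The field names held, maxheld, barrier, limit, and failcnt follow the columns of /proc/user_beancounters on an OpenVZ host; the charging logic is a deliberate simplification and an assumption, because the exact meaning of barrier and limit differs from parameter to parameter.

```python
from dataclasses import dataclass


@dataclass
class Beancounter:
    """Simplified model of one user beancounter parameter (not kernel code)."""

    name: str
    barrier: int      # soft threshold
    limit: int        # hard threshold
    held: int = 0     # current usage
    maxheld: int = 0  # peak usage seen so far
    failcnt: int = 0  # number of refused requests

    def charge(self, amount):
        """Account for `amount` units; refuse and count a failure above the limit."""
        if self.held + amount > self.limit:
            self.failcnt += 1
            return False
        self.held += amount
        self.maxheld = max(self.maxheld, self.held)
        return True

    def uncharge(self, amount):
        """Release `amount` units, never dropping below zero."""
        self.held = max(0, self.held - amount)

    @property
    def over_barrier(self):
        """True while usage is above the soft threshold."""
        return self.held > self.barrier
```

A rising failcnt is the usual sign that a container has run into one of its limits.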

Checkpointing

Checkpointing is another main function. Checkpointing and restoring are necessary for live migration. Checkpointing is the process of freezing a container and then saving its complete state to a disk file. Restoring is the counterpart. Live migration of a container is the process of checkpointing a container on one host system and restoring it on another host system.

Figure 7 shows the differences between a live migration and a checkpointing and restoring process.


Figure 7. Differences between a live migration and a checkpointing and restoring process

Figure 7 demonstrates that the live migration is one single action whereas checkpointing and restoring are two different actions that need an extra storage unit that is accessible from both hardware nodes.
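
The difference also shows up on the command line. The Python sketch below only builds the argument lists and executes nothing; it assumes the vzctl chkpnt and vzctl restore subcommands with their --dumpfile option and the vzmigrate --online command of the standard OpenVZ tools. The function names, the container ID, the host name, and the dump file path are placeholders.

```python
def checkpoint_restore_commands(ctid, dumpfile):
    """Two separate actions; `dumpfile` must be on storage both nodes can reach."""
    ct = str(ctid)
    return [
        ["vzctl", "chkpnt", ct, "--dumpfile", dumpfile],   # run on the source node
        ["vzctl", "restore", ct, "--dumpfile", dumpfile],  # run on the destination node
    ]


def live_migration_command(destination, ctid):
    """One single action, started on the source node."""
    return ["vzmigrate", "--online", destination, str(ctid)]


if __name__ == "__main__":
    for cmd in checkpoint_restore_commands(101, "/shared/Dump.101"):
        print(" ".join(cmd))
    print(" ".join(live_migration_command("node2.example.com", 101)))
```

Check the option names against the manual pages of the OpenVZ version you have installed before running any of these commands.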


Using vzctl and other OpenVZ tools

The main OpenVZ tool is vzctl, the high-level command-line interface for managing container environments. You can use vzctl to create, start, stop, and destroy a virtual operating system environment. This sequence is called the container lifecycle.

Vzctl can also be used to change various container resources such as an IP address, memory, or CPU time that a container environment can use. Most of these parameters can be set and changed during runtime of the container. This is usually impossible with other virtualization technologies, such as platform virtualization.

You can only launch the vzctl tool from the host system and not from inside the container.
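
The container lifecycle and a runtime change can be sketched as a sequence of vzctl invocations. The Python helper below only assembles the argument lists and does not run them; the function name lifecycle_commands, the container ID, the template name, and the IP address are placeholders. The create, set, start, stop, and destroy subcommands and the --ostemplate, --ipadd, and --save options belong to the standard vzctl tool.

```python
def lifecycle_commands(ctid, ostemplate, ipaddr):
    """Return the vzctl calls for one container lifecycle, to be run on the host system."""
    ct = str(ctid)
    return [
        ["vzctl", "create", ct, "--ostemplate", ostemplate],
        ["vzctl", "set", ct, "--ipadd", ipaddr, "--save"],  # can also be done at runtime
        ["vzctl", "start", ct],
        ["vzctl", "stop", ct],
        ["vzctl", "destroy", ct],
    ]


if __name__ == "__main__":
    # Placeholder values; print the commands instead of executing them.
    for cmd in lifecycle_commands(101, "my-template", "10.0.0.101"):
        print(" ".join(cmd))
```

Keeping the commands as lists makes it straightforward to hand them to subprocess.run later without any shell quoting.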

Besides vzctl, there are many more tools for managing OpenVZ containers. Because these tools are not needed for virtualizing the Cell/B.E. with OpenVZ, they are not covered here; you can find more details about the general management of container environments in the OpenVZ User's Guide (see Resources). Coincidentally, the authors of this series wrote the OpenVZ on POWER™ handbook.


Getting ready for Part 2

Part 2 describes only the implementation for the concept of dedicated virtualization (partitioning) shown in Figure 3.


Acknowledgments

Many thanks to the authors of "Mehrarbeit für CPUs" (Linux Magazin, April 2006) for the use of the image in Figure 1.


Resources


