
Cell/B.E. container virtualization, Part 2: Implementation issues

Learn to implement software-based container virtualization on the Cell/B.E. platform via the open source software project OpenVZ

Christian Kaiser studies Computer Engineering at RWTH University in Aachen, Germany. In 2007, he was an intern at IBM Germany Research Lab in Boeblingen, Germany. While doing his internship at IBM, he researched virtualization methods for the Cell Broadband Engine processor. After the internship, Christian Kaiser went to work on his thesis at the Chair for Operating Systems of the RWTH Aachen. His thesis is about "Analysis of asynchronous collective communication in memory-coupled high-speed networks."
Christian Rund (Christian.Rund@de.ibm.com), Research and Development Engineer, IBM
Christian Rund is in the IBM Development Laboratory in Boeblingen, Germany. He studied Computer Science at the University of Stuttgart and at Uppsala Universitet, Sweden, graduating in 1997. During his studies, he worked as a student trainee at IBM in Herrenberg and Stuttgart, Germany. In 1998, he joined the system development department of the Landeszentralbank in Stuttgart (Deutsche Bundesbank). In 2001, Christian joined IBM as a research and development engineer on the development team for the zSeries FCP channel. Since mid-2006, he has been a research and development engineer for host firmware for Cell/B.E.-based servers.

Summary:  This three-part series illustrates a hardware-resource-focused form of software virtualization known as container virtualization (or operating system virtualization), demonstrated through the open source project OpenVZ. The series provides a comprehensive overview of all the components and techniques needed to virtualize the Cell/B.E. processor with software methods. This second article of the series details the implementation of dedicated virtualization and partitioning that was described in Part 1 of the series.


Date:  08 Jan 2008
Level:  Intermediate

This article describes the implementation of the concept of dedicated virtualization (partitioning) introduced in the first article (Part 1, Figure 3). It does not address the development of the concept of shared devices (as shown in Part 1, Figure 4).

To demonstrate implementation of the system, this article covers the following:

  • Virtualization of the spufs: stop it from using the same root-inode for each mount point
  • Adjustments to the sysfs: entries must be adapted during runtime
  • Modification of the SPU scheduler: add a container ownership check when an SPU is allocated
  • Modification of the OpenVZ tools: make use of the new feature

Virtualizing the spufs

By default, the spufs uses the same root-inode each time you mount it. For example, you might have two chrooted environments (A and B), both with the spufs mounted at /spu. If you create SPE threads in environment A and then list /spu, you see the SPE threads that you just created. But in environment B, you see the same listing of /spu, because you are accessing the same root-inode as in environment A.

To change this behavior so that each environment sees only the SPE threads that it creates, the spufs must stop using the same root-inode for every mount point. When the spufs is mounted, it internally calls a kernel function that returns a super block. By default it uses the function get_sb_single(), which always returns the same super block. The function get_sb_nodev(), in contrast, returns a different super block for each mount, which leads to the desired behavior. (Figure 1 illustrates spufs before and after virtualization has taken place.)
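The difference between the two functions can be illustrated with a small user-space model. This is only a sketch of the idea, not kernel code: the names model_sb, model_get_sb_single(), and model_get_sb_nodev() are hypothetical stand-ins, and a simple counter takes the place of real inode allocation.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a super block; only the root inode matters here. */
struct model_sb {
    int root_inode_id;   /* identifies the root inode of this mount */
};

static int model_next_inode_id = 1;

/* Allocates a super block with a root inode that no other mount has. */
struct model_sb *model_new_sb(void)
{
    struct model_sb *sb = malloc(sizeof(*sb));

    assert(sb != NULL);
    sb->root_inode_id = model_next_inode_id++;
    return sb;
}

/* Models get_sb_single(): every mount shares one super block. */
struct model_sb *model_get_sb_single(void)
{
    static struct model_sb *shared;

    if (shared == NULL)
        shared = model_new_sb();
    return shared;
}

/* Models get_sb_nodev(): every mount gets a fresh super block. */
struct model_sb *model_get_sb_nodev(void)
{
    return model_new_sb();
}
```

In this model, two mounts through model_get_sb_single() return the same pointer and therefore share one root inode, whereas two mounts through model_get_sb_nodev() return distinct super blocks with distinct root inodes, which corresponds to the separation of environments A and B.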


Figure 1. Spufs before and after virtualization
Spufs before and after virtualization

OpenVZ allows only a few file systems to be mounted inside containers. Each file system provides a struct file_system_type instance that defines file-system-specific aspects. OpenVZ extends this structure (among others) with a member that defines whether the file system is mountable inside a container. For the spufs, the .fs_flags member must therefore be set to FS_VIRTUALIZED.
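The effect of the flag can be sketched as a simple permission check. The following user-space model is an assumption about the shape of such a check, not the OpenVZ source: the structure model_fs_type, the function model_may_mount(), the flag value 0x40, and the convention that container ID 0 stands for the host system are all chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_FS_VIRTUALIZED 0x40   /* hypothetical flag value for this sketch */

/* Toy counterpart of struct file_system_type. */
struct model_fs_type {
    const char *name;
    int fs_flags;
};

/* A container may mount only file systems that are flagged as virtualized.
 * In this sketch, container ID 0 stands for the host, which may mount anything. */
bool model_may_mount(const struct model_fs_type *fs, unsigned int container_id)
{
    assert(fs != NULL);
    if (container_id == 0)
        return true;
    return (fs->fs_flags & MODEL_FS_VIRTUALIZED) != 0;
}
```

With the flag cleared, a mount request from container 101 is refused in this model; with the flag set, the same request succeeds.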

A process that is created inside a container has two PIDs. (Remember that Part 1 described that each container has its own set of resources that the kernel provides. The process tree was one set of resources: a container can see only its own set of processes with virtualized PIDs.) Inside a container the process has a virtualized PID, and on the host system the same process appears with a different PID, called a global PID. As mentioned, the spufs directories are named in the form spethread-<PID>-<thread-ID>, in which <PID> is the process ID of the process that holds the SPE thread and <thread-ID> is the corresponding thread ID. Because PIDs are virtualized in the container environment, the listing of the spufs inside the container must show the virtualized PID and not the global PID. Fortunately, OpenVZ already implements this behavior.
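The naming rule can be made concrete with a short user-space sketch. The structure model_task and both functions are hypothetical, and printing the thread ID as an unsigned decimal number is an assumption made for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* A process as seen from the host (global PID) and from its container. */
struct model_task {
    int global_pid;
    int virtual_pid;
};

/* Returns the PID that a viewer inside or outside the container sees. */
int model_visible_pid(const struct model_task *task, int inside_container)
{
    assert(task != NULL);
    return inside_container ? task->virtual_pid : task->global_pid;
}

/* Formats "spethread-<PID>-<thread-ID>" into buf; returns the snprintf() result. */
int model_spethread_name(char *buf, size_t len, int pid, unsigned long thread_id)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "spethread-%d-%lu", pid, thread_id);
}
```

For a process with the global PID 14523 and the virtualized PID 42, a listing inside the container would use model_visible_pid(&task, 1) and therefore produce a name that starts with spethread-42-, while the host view would start with spethread-14523-.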

Make these changes in the file arch/powerpc/platforms/cell/spufs/inode.c in the Linux® kernel.


Adjusting the sysfs

The sysfs is already virtualized so that it can be used inside containers. It is not a copy of the sysfs that is visible in the host system but only a part of it. The goal is to allocate and free SPUs to and from a container during runtime. That means that the sysfs entries must be adapted during runtime, too. Before doing this, create the directory where the SPUs are listed.

This is the /sys/devices/system/spu directory. By default the sysfs in a container environment has no /sys/devices/system/spu, no /sys/devices/system, and no /sys/devices directory. These three entries have to be created during the initialization (start up) of the container. The subdirectories of /sys/devices/system/spu can be spu0 to spu<N>, where <N> is the number of available SPUs in the system minus 1. The directories must be created when the SPUs are assigned to the container, and they must be deleted when they are freed from the container.

Each directory in the sysfs listing has a corresponding kobject instance in the kernel space. A kobject is a structure that defines the name of the directory and its parent kobject (in other words, its corresponding parent directory). For example, the kobject that later becomes visible as the /sys/devices/system/spu/spu3 directory has two settings to adjust:

  • The name spu3.
  • The parent (a pointer to the kobject that represents /sys/devices/system/spu).

After a kobject is registered in the sysfs with a subsystem_register() call, it is visible in the user space. The counterpart, deleting a directory from the sysfs, is done by calling the subsystem_unregister() function with the kobject that should be deleted. Figure 2 shows an example.


Figure 2. Sysfs and kobjects
Sysfs and kobjects

The kobject with the name spu3 has a pointer to its parent kobject with the name spu.
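The relationship between the name, the parent pointer, and the resulting directory can be sketched in user space. The structure model_kobject and the function model_kobject_path() are hypothetical; a real kobject carries more state, and the sysfs builds its paths internally. In this sketch the root kobject simply carries the name "/sys".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy kobject: a directory name plus a pointer to the parent directory. */
struct model_kobject {
    const char *name;
    const struct model_kobject *parent;   /* NULL for the sysfs root */
};

/* Writes the full path of kobj into buf by walking the parent chain.
 * Returns the path length, or -1 if buf is too small. */
int model_kobject_path(const struct model_kobject *kobj, char *buf, size_t len)
{
    int used;
    int n;

    assert(buf != NULL && len > 0);
    if (kobj == NULL) {
        buf[0] = '\0';
        return 0;
    }
    used = model_kobject_path(kobj->parent, buf, len);
    if (used < 0)
        return -1;
    /* The root kobject already carries its leading slash. */
    n = snprintf(buf + used, len - (size_t)used, "%s%s",
                 kobj->parent ? "/" : "", kobj->name);
    if (n < 0 || (size_t)n >= len - (size_t)used)
        return -1;
    return used + n;
}
```

Chaining five such objects (the root, devices, system, spu, and spu3) and calling model_kobject_path() on the last one yields /sys/devices/system/spu/spu3. Removing spu3 from the chain corresponds to unregistering its kobject when the SPU is freed from the container.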

The main changes must be made in the kernel file kernel/ve/vecalls.c. This OpenVZ-specific file implements most of the functions that are called during the initialization of containers and the setting of container parameters.


Modifying the SPU scheduler

Each physical SPU in the system is represented by an instance of an spu structure in the kernel space. This structure holds several things, including:

  • The ID (number) of the SPU
  • Which Cell/B.E. node it belongs to
  • A pointer to the local store (LS) of the SPU

A new member variable saves the owner of the SPU in the form of the container ID. The SPU scheduler implements a spu_alloc() function that searches for a free SPU on which to execute an SPE thread. To do so, it searches a list of the available SPUs in the system (SPUs on which no SPE thread is currently executing).

By default, the function takes the first SPU in the list and executes the SPE thread on it. To achieve virtualization, the function must also check whether the free SPU has the same container ID as the SPE thread that is to be executed on it. Figure 3 shows how spu_alloc() works before and after the modification.


Figure 3. Spu_alloc() before and after modification
Spu_alloc() before and after modification

If this additional check fails, the function checks the next element in the list of free SPUs. If no free SPU is available for the container that launches the SPE thread, the SPU scheduler behaves as if the list were empty and waits until an SPU becomes free.
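The modified search can be sketched as a user-space model of the free list. This is not the code from spu_base.c: the structure model_spu, its member owner_ctid, and the function model_spu_alloc() are hypothetical names, and returning NULL takes the place of the scheduler waiting for an SPU to become free.

```c
#include <assert.h>
#include <stddef.h>

struct model_spu {
    int number;                 /* ID (number) of the SPU */
    unsigned int owner_ctid;    /* container that owns the SPU; 0 = unassigned */
    struct model_spu *next;     /* next entry in the free list */
};

/* Removes and returns the first free SPU owned by ctid, or NULL if the
 * container has no free SPU (the real scheduler would then wait). */
struct model_spu *model_spu_alloc(struct model_spu **free_list, unsigned int ctid)
{
    struct model_spu **link;

    assert(free_list != NULL);
    for (link = free_list; *link != NULL; link = &(*link)->next) {
        struct model_spu *spu = *link;

        if (spu->owner_ctid != ctid)
            continue;           /* owned by another container: try the next one */
        *link = spu->next;      /* unlink the SPU from the free list */
        spu->next = NULL;
        return spu;
    }
    return NULL;
}
```

Without the owner_ctid comparison the loop would return the head of the list on its first iteration, which matches the unmodified behavior. With the comparison, a container whose SPUs are all busy gets NULL even though SPUs of other containers remain in the list.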

The spu_alloc() function is implemented in the arch/powerpc/platforms/cell/spu_base.c Linux kernel source file.


Modifying the OpenVZ tools

Most of the required functionality already exists, but in order to use the new feature, you must modify the OpenVZ tools. The vzctl tool manages the SPU allocation during runtime. It is the main tool for setting container parameters in OpenVZ. The new parameter for setting the number of SPUs assigned to a container is --spus <nr_spus>.

The <nr_spus> value represents the number of SPUs assigned to the container. It is an absolute value: if eight SPUs are assigned to the container and you set the value to 6, then 2 SPUs are freed from the container (8 - 6 = 2) instead of 6 more SPUs being added.

For example, here is the command-line output where the container with the ID of 101 gets eight SPUs:

[root@c02b12-0 ~]# vzctl set 101 --spus 8
Setting SPUs: 8
Configure meminfo: 1024000
WARNING: Settings were not saved and will be reset to original
 values on next start (use --save flag)
[root@c02b12-0 ~]#

To implement this behavior, the vzctl tool must cross the user space barrier and do some management in the kernel space. It must find SPUs that are not yet used by other containers. To do so, it searches through the list of available SPUs and checks the newly implemented container ID value in the spu structures (described in the section on modifying the SPU scheduler). If the value is 0, the SPU can be assigned to the requesting container. The value 0 can signify that an SPU is not assigned to any container because a valid container ID is always greater than 0. If the function cannot find enough free SPUs to complete the request, the procedure ends without assigning any SPU to the container. If the number of SPUs already assigned to the container is higher than the requested number, the surplus SPUs are freed.
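The assignment rules described above (absolute value, 0 as the unassigned marker, all-or-nothing assignment, and freeing of surplus SPUs) can be sketched as follows. The sketch is a user-space model under simplifying assumptions: a plain array stands in for the container ID members of the spu structures, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the SPUs whose owner entry equals ctid (0 counts unassigned SPUs). */
size_t model_count_spus(const unsigned int *owner, size_t nr_total,
                        unsigned int ctid)
{
    size_t i, count = 0;

    for (i = 0; i < nr_total; i++)
        if (owner[i] == ctid)
            count++;
    return count;
}

/* Sets the number of SPUs owned by container ctid to exactly 'requested'.
 * owner[i] holds the container ID that owns SPU i; 0 means unassigned.
 * Returns 0 on success, or -1 (changing nothing) if too few SPUs are free. */
int model_set_spus(unsigned int *owner, size_t nr_total,
                   unsigned int ctid, size_t requested)
{
    size_t i;
    size_t owned, unowned;

    assert(owner != NULL && ctid > 0);
    owned = model_count_spus(owner, nr_total, ctid);
    unowned = model_count_spus(owner, nr_total, 0);

    /* All or nothing: refuse if the shortfall cannot be covered. */
    if (requested > owned && requested - owned > unowned)
        return -1;

    for (i = 0; i < nr_total && owned != requested; i++) {
        if (owned < requested && owner[i] == 0) {
            owner[i] = ctid;    /* assign a free SPU */
            owned++;
        } else if (owned > requested && owner[i] == ctid) {
            owner[i] = 0;       /* free a surplus SPU */
            owned--;
        }
    }
    return 0;
}
```

Starting from eight unassigned SPUs, model_set_spus(owner, 8, 101, 8) assigns all of them to container 101, and a following call with the value 6 frees two of them. A request from a second container for three SPUs then fails and leaves the array untouched, because only two SPUs are unassigned.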

To cross the barrier between user space and kernel space, you can use different implementation models. (Refer to Arnd Bergmann's "How to not invent kernel interfaces" in Resources for more information about implementation models.) The simplest way is to implement a new system call that takes <containerID> and <nr_spus> as its parameters.

The functions that handle the setting of the SPU parameter of the containers must be implemented in a part of the kernel that can be built as a kernel module. That presents a big problem. If the kernel module is not loaded, the system call handler function in the kernel space should do nothing. But if the module is loaded, it calls the functions that are implemented inside the module. That is not a trivial task, because the system call table (where the function pointers to the system call handler function reside) is part of the static kernel build.

The module is not part of the static kernel build, and that is why the statically built-in system call handler function cannot directly call the functions that are part of the module. The solution is to implement a function wrapper that copies a pointer to the functions in the module into a variable of the statically built-in system call handler function so that the handler can call the functions in the module. The function wrapper is called during the module initialization and cleanup. The black arrows in Figure 4 show the function wrapper method.


Figure 4. System call and module functions
System call and module functions

You can see how the function pointer of the function that is implemented inside a module is copied into the static, built-in kernel space. The dashed arrows show that the user space application calls the function inside the module by passing through the static, built-in system call handler function.
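The function wrapper method can be sketched as a user-space model built around a function pointer. All names here are hypothetical, returning -ENOSYS while no module is loaded is an assumption made for the sketch, and a real kernel implementation would additionally have to protect the pointer against a module that is unloaded while a call is in progress.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef long (*model_set_spus_fn)(unsigned int ctid, unsigned int nr_spus);

/* Lives in the statically built kernel: NULL while the module is not loaded. */
static model_set_spus_fn model_set_spus_hook;

/* Function wrapper: called by the module at initialization (fn != NULL)
 * and at cleanup (fn == NULL). */
void model_register_set_spus(model_set_spus_fn fn)
{
    model_set_spus_hook = fn;
}

/* Statically built system call handler: forwards to the module if loaded. */
long model_sys_set_spus(unsigned int ctid, unsigned int nr_spus)
{
    if (model_set_spus_hook == NULL)
        return -ENOSYS;     /* module not loaded: nothing to do */
    return model_set_spus_hook(ctid, nr_spus);
}

/* Function that would be implemented inside the loadable module. */
long model_module_set_spus(unsigned int ctid, unsigned int nr_spus)
{
    assert(ctid > 0);
    return (long)nr_spus;   /* placeholder: report the requested count */
}
```

Before model_register_set_spus(model_module_set_spus) is called, model_sys_set_spus() returns the error code without doing anything. Afterward the same call reaches the module function, and registering NULL again at cleanup restores the initial state.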

Kernel source code files that require changes:

  • include/asm-powerpc/systbl.h
  • include/asm-powerpc/unistd.h
  • include/linux/syscalls.h
  • kernel/sys.c
  • kernel/sys_ni.c
  • kernel/ve/vecalls.c

Also, update files in the OpenVZ vzctl sources:

  • include/res.h
  • include/vzctl_param.h
  • include/vzsyscalls.h
  • src/lib/config.c
  • src/lib/res.c
  • src/vzctl.c

Introduce and include two more files in the OpenVZ build system:

  • include/spu.h
  • src/lib/spu.c

Getting ready for Part 3

Part 3 describes using and testing the system and analyzes the performance of container virtualization against other software virtualization methods, such as paravirtualization or full virtualization.


Resources

Learn

Get products and technologies

Discuss



