Level: Intermediate
Christian Kaiser (christian.kaiser1@rwth-aachen.de), Research Intern, IBM
Christian Rund (Christian.Rund@de.ibm.com), Research and Development Engineer, IBM
08 Jan 2008
This three-part series illustrates a
hardware-resource-focused form of software virtualization known as container
virtualization (or operating system virtualization), demonstrated through the open
source project OpenVZ. The series provides a comprehensive overview of all the
components and techniques needed to virtualize the Cell/B.E. processor with software
methods. This second article of the series details the implementation of
dedicated virtualization and partitioning that was described in Part 1 of the
series.
This article describes the implementation for the concept of
dedicated virtualization (partitioning) demonstrated in the first article
(Part 1, Figure 3).
This article does not address the development of the concept of shared devices (as shown in
Part 1, Figure 4).
To demonstrate implementation of the system, this article covers the following:
- Virtualization of the spufs: stop it from using the same root-inode for
each mount point
- Adjustments to the sysfs: entries must be adapted during runtime
- Modification of the SPU scheduler: add a step for virtualization
realization
- Modification of the OpenVZ tools: make use of the new feature
Virtualizing the spufs
By default, the spufs uses the same root-inode each
time you mount it. For example, you might have two
chrooted environments (A and B), both with
a mounted spufs to /spu. If you create SPE threads in environment A and then do
a listing of /spu, you see the SPE threads that you just created. But in
environment B, you would see the same listing of /spu, because you are accessing the
same root-inode as environment A.
To change this behavior so that both environments see only the SPE threads that
they create, you have to change the fact that the spufs always uses the same
root-inode for each mount point. When the spufs is
mounted, it internally calls a kernel function that returns a super block.
Normally it uses the function get_sb_single(), which
always returns the same super block. The function
get_sb_nodev() always returns a different super block,
and it leads to the desirable behavior. (Figure 1 illustrates spufs before and after
virtualization has taken place.)
Figure 1. Spufs before and after virtualization
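The difference between the two super block strategies can be sketched in user space. The following is only a simplified model, not kernel code: struct super_block is reduced to an ID, and the names model_get_sb_single() and model_get_sb_nodev() are hypothetical stand-ins for the behavior of get_sb_single() and get_sb_nodev() as described above.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for the kernel's struct super_block. */
struct super_block {
    int id;                              /* distinguishes one instance from another */
};

static struct super_block *single_sb;    /* shared instance, created lazily */
static int next_id = 1;

static struct super_block *new_sb(void)
{
    struct super_block *sb = malloc(sizeof(*sb));

    if (sb)
        sb->id = next_id++;
    return sb;
}

/* Models get_sb_single(): every mount receives the same super block. */
struct super_block *model_get_sb_single(void)
{
    if (!single_sb)
        single_sb = new_sb();
    return single_sb;
}

/* Models get_sb_nodev(): every mount receives a fresh super block. */
struct super_block *model_get_sb_nodev(void)
{
    return new_sb();
}
```

In this model, two mounts through model_get_sb_single() share one instance (and therefore one root-inode), whereas two mounts through model_get_sb_nodev() are independent, which is the behavior that the containers need.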
OpenVZ allows only a few file systems to be mounted inside containers. Each
file system implements a struct file_system_type
instance that defines file-system-specific aspects. OpenVZ extends this structure (among
others) so that it defines whether the file system is mountable inside a container. To make
the spufs mountable there, its .fs_flags member must be set to
FS_VIRTUALIZED.
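As a rough user-space illustration of this flag check, consider the sketch below. It is a reduced model: the struct carries only two fields, the numeric value of FS_VIRTUALIZED is a placeholder (the real value comes from the OpenVZ kernel headers), and mountable_in_container() is a hypothetical helper, not an OpenVZ function.

```c
#include <stdbool.h>

/* Placeholder value; the real one is defined in the OpenVZ kernel headers. */
#define FS_VIRTUALIZED 0x40

/* Reduced model of struct file_system_type: only the fields used here. */
struct file_system_type {
    const char *name;
    int fs_flags;
};

/* Hypothetical check: only flagged file systems may be mounted in a container. */
bool mountable_in_container(const struct file_system_type *fs)
{
    return (fs->fs_flags & FS_VIRTUALIZED) != 0;
}

/* The spufs with the flag set, next to an unflagged file system. */
const struct file_system_type spufs_type = {
    .name = "spufs",
    .fs_flags = FS_VIRTUALIZED,
};

const struct file_system_type other_type = {
    .name = "otherfs",
    .fs_flags = 0,
};
```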
A process that is created inside a container has two PIDs. (Remember from Part 1
that each container has its own set of resources that the kernel
provides. The process tree is one such resource: a container can
see only its own processes, with virtualized PIDs.) Inside the container, the process has a
virtualized PID; on the host system, the same process appears with a different
PID, called the global PID. As mentioned, the spufs directories
are named in the form spethread-<PID>-<thread-ID>,
in which <PID> is the process ID of the
process that holds the SPE thread and
<thread-ID> is the corresponding thread
ID. Because PIDs are virtualized in the container environment, a listing of
the spufs inside the container must show the virtualized PID and not the global
PID. Fortunately, OpenVZ already implements this behavior.
The changes described in this section are made in the file arch/powerpc/platforms/cell/spufs/inode.c in the Linux® kernel.
Adjusting the sysfs
The sysfs is already virtualized so that it can be used inside containers. It is
not a copy of the sysfs that is visible in the host system but only a part of it.
The goal is to allocate and free SPUs to and from a container during runtime.
That means that the sysfs entries must be adapted during runtime, too. Before
doing this, create the directory where the SPUs are listed.
This is the /sys/devices/system/spu directory. By default the sysfs in a
container environment has no /sys/devices/system/spu, no /sys/devices/system, and
no /sys/devices directory. These three entries have to be created during the
initialization (start up) of the container. The subdirectories of
/sys/devices/system/spu can be spu0 to spu<N>, where
<N> is the number of available SPUs in the system minus
1. The directories must be created when the SPUs are assigned to the container,
and they must be deleted when they are freed from the container.
Each directory in the sysfs listing has a corresponding kobject instance in the
kernel space. A kobject is a structure that defines the name of
the directory and its parent kobject (in other words, its corresponding parent
directory). For example, the kobject that later becomes visible as the
/sys/devices/system/spu/spu3 directory has two settings to adjust:
- The name spu3.
- The parent (a pointer to the kobject that represents
/sys/devices/system/spu).
After a kobject is registered in the sysfs with a
subsystem_register() call, it is visible in user
space. Conversely, to delete a directory from the sysfs, call the subsystem_unregister() function
with the kobject that corresponds to that directory. Figure 2 shows an example.
Figure 2. Sysfs and kobjects
The kobject with the name spu3 has a pointer to its parent kobject with the
name spu.
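The relationship between a kobject's name, its parent pointer, and the resulting sysfs path can be sketched in user space. This is a reduced model only: struct kobj and kobj_path() are hypothetical names, and the real struct kobject contains many more members.

```c
#include <stdio.h>
#include <string.h>

/* Reduced model of a kobject: a directory name plus a parent pointer. */
struct kobj {
    const char *name;
    const struct kobj *parent;   /* NULL for the topmost directory */
};

/*
 * Writes the path of k into buf by walking the parent chain first.
 * Returns the path length, or -1 if buf is too small.
 */
int kobj_path(const struct kobj *k, char *buf, size_t size)
{
    int len = 0;
    int n;

    if (k->parent) {
        len = kobj_path(k->parent, buf, size);
        if (len < 0)
            return -1;
    }
    n = snprintf(buf + len, size - len, "/%s", k->name);
    if (n < 0 || (size_t)n >= size - len)
        return -1;
    return len + n;
}
```

With a chain of objects named sys, devices, system, spu, and spu3, each pointing to the previous one as its parent, kobj_path() on the last object produces /sys/devices/system/spu/spu3.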
The main changes must be made in the kernel file kernel/ve/vecalls.c. This
OpenVZ-specific file implements most of the functions that are called when
containers are initialized and when their parameters are set.
Modifying the SPU scheduler
Each physical SPU in the system is represented by an instance of an
spu structure in the kernel space. This structure holds
several things, including:
- The ID (number) of the SPU
- Which Cell/B.E. node it belongs to
- A pointer to the local store (LS) of the SPU
A new member variable saves the owner of the SPU in the form of the container ID.
The SPU scheduler implements a spu_alloc() function
that searches for a free SPU on which to execute an SPE thread. To do so, it searches
a list of the available SPUs in the system (those on which no SPE thread is currently executing).
The unmodified function takes the first SPU in the list and executes the SPE
thread on it. To achieve the virtualization behavior, the function must additionally check
whether the free SPU has the same container ID as the SPE thread that
should be executed on it. Figure 3 shows how
spu_alloc() works before and after the
modification.
Figure 3. Spu_alloc() before and after modification
If this additional check fails, the function moves on to the next element in the
list of free SPUs. If there is no free SPU available for the
container that launches the SPE thread, the SPU scheduler behaves as if the list
were empty, and it waits until an SPU becomes free.
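The modified search can be sketched as a user-space model. It is not the kernel's spu_alloc(): the struct is reduced to three fields, the member name owner_ctid and the function name model_spu_alloc() are hypothetical, and the model returns NULL at the point where the real scheduler would wait for an SPU to become free.

```c
#include <stddef.h>

/* Reduced model of the kernel's struct spu. */
struct spu {
    int number;          /* ID of the SPU */
    int owner_ctid;      /* hypothetical new member: owning container ID */
    struct spu *next;    /* next entry in the list of free SPUs */
};

/*
 * Removes and returns the first free SPU that belongs to container ctid.
 * Returns NULL if the container has no free SPU in the list.
 */
struct spu *model_spu_alloc(struct spu **free_list, int ctid)
{
    struct spu **link;

    for (link = free_list; *link; link = &(*link)->next) {
        struct spu *spu = *link;

        if (spu->owner_ctid == ctid) {   /* the additional check */
            *link = spu->next;           /* unlink from the free list */
            spu->next = NULL;
            return spu;
        }
    }
    return NULL;
}
```

Dropping the owner_ctid comparison from the loop gives the original behavior, in which the first element of the list is always taken.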
The spu_alloc() function is implemented in the
arch/powerpc/platforms/cell/spu_base.c Linux kernel source file.
Modifying the OpenVZ tools
Most of the required functionality already exists, but to use
the new feature, you must modify the OpenVZ tools. The vzctl tool, the main tool for
setting container parameters in OpenVZ, manages the SPU allocation during
runtime. The new parameter for setting the number of SPUs assigned to
a container is --spus <nr_spus>.
The <nr_spus> value represents the
number of SPUs assigned to the container. It is an absolute value: if eight
SPUs are assigned to the container and you set the value to 6,
then 2 SPUs are freed from the container (8 - 6 = 2) instead of 6 more
SPUs being added.
For example, here is the command-line output where the container with the ID of 101
gets eight SPUs:
[root@c02b12-0 ~]# vzctl set 101 --spus 8
Setting SPUs: 8
Configure meminfo: 1024000
WARNING: Settings were not saved and will be reset to original
values on next start (use --save flag)
[root@c02b12-0 ~]#
To complete this behavior, the vzctl tool must cross the user space barrier and do
some management in the kernel space. The tool must find SPUs that are not yet used by
other containers. The vzctl tool searches through a list of available SPUs and checks
the newly implemented container ID value in the spu structures (described in the
section on modifying the SPU scheduler). If the value is 0, the SPU can be
assigned to the requesting container. The value 0 signifies that the SPU is not
assigned to any container, because a valid container ID must be greater than 0.
If the search cannot find enough free SPUs to
complete the request, the procedure ends without assigning any SPU to the
container. If the number of SPUs that are already assigned to the container is
higher than the requested number of SPUs, the surplus SPUs are freed.
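The assignment rules just described (absolute value, owner 0 meaning unassigned, all-or-nothing growth, and release of surplus SPUs) can be sketched as follows. This is a user-space model under stated assumptions: the owner array stands in for the container ID member of the spu structures, NR_SPUS is an arbitrary example size, and the function names are hypothetical.

```c
#include <stddef.h>

#define NR_SPUS 8        /* arbitrary system size for the example */
#define CT_NONE 0        /* owner value 0: not assigned to any container */

/* owner[i] holds the ID of the container that owns SPU i, or CT_NONE. */
static int owner[NR_SPUS];

/* Counts the SPUs currently owned by ctid (CT_NONE counts the free ones). */
int spus_owned_by(int ctid)
{
    int i, n = 0;

    for (i = 0; i < NR_SPUS; i++)
        if (owner[i] == ctid)
            n++;
    return n;
}

/*
 * Sets the absolute number of SPUs owned by container ctid.
 * Returns 0 on success, or -1 if the request cannot be satisfied,
 * in which case no assignment is changed.
 */
int set_container_spus(int ctid, int nr_spus)
{
    int have, i;

    if (ctid <= CT_NONE || nr_spus < 0)
        return -1;

    have = spus_owned_by(ctid);
    if (nr_spus > have + spus_owned_by(CT_NONE))
        return -1;                       /* not enough free SPUs: assign none */

    for (i = 0; i < NR_SPUS && have < nr_spus; i++)
        if (owner[i] == CT_NONE) {       /* grow: claim unassigned SPUs */
            owner[i] = ctid;
            have++;
        }
    for (i = NR_SPUS - 1; i >= 0 && have > nr_spus; i--)
        if (owner[i] == ctid) {          /* shrink: release the surplus */
            owner[i] = CT_NONE;
            have--;
        }
    return 0;
}
```

Calling set_container_spus(101, 8) and then set_container_spus(101, 6) leaves container 101 with six SPUs and two unassigned, matching the 8 - 6 = 2 example above.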
To cross the barrier of user space and kernel space, you can use different
implementation models. (Refer to Arnd Bergmann's "How to not invent kernel
interfaces" in Resources for more information about
implementation models.) The simplest way is to implement a new system call that
maps the parameters <containerID> and
<nr_spus> on the parameters of the system
call.
The functions that handle the setting of the SPU parameter of the containers
must be implemented in a part of the kernel that can be built as a kernel
module. That presents a big problem. If the kernel module is not loaded, the system
call handler function in the kernel space should do nothing. But if the module is
loaded, it calls the functions that are implemented inside the module. That
is not a trivial task, because the system call table (where the function pointers
to the system call handler function reside) is part of the static kernel build.
The module is not part of the static kernel build, and that is why the statically
built-in system call handler function cannot directly call the functions that are part of
the module. The solution is to implement a function wrapper that copies a pointer
to the functions in the module into a variable of the static built-in system call
handler function so that the statically built-in system call handler can call the
functions in the module. The function wrapper is called during the module
initialization and cleanup. The black arrows in Figure 4 show the function wrapper
method.
Figure 4. System call and module functions
You can see how the pointer to the function that is implemented inside
a module is copied into the static, built-in kernel space. The dashed arrows show
that the user space application calls the function inside the module
by passing through the static, built-in system call handler function.
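The function wrapper method can be sketched in user space as a function-pointer hook. All names here (set_spus_fn, register_set_spus(), sys_set_spus(), module_set_spus()) are hypothetical, and returning -ENOSYS when the module is absent is an assumption of this sketch; the article only requires that the handler do nothing in that case. A real kernel implementation would also need to guard against the module being unloaded while a call is in progress.

```c
#include <errno.h>
#include <stddef.h>

/* Signature of the module function that does the real work. */
typedef long (*set_spus_fn)(int container_id, int nr_spus);

/* Part of the static kernel image; NULL while the module is not loaded. */
static set_spus_fn set_spus_hook;

/* Wrapper called at module initialization (fn != NULL) and cleanup (fn == NULL). */
void register_set_spus(set_spus_fn fn)
{
    set_spus_hook = fn;
}

/* Statically built-in system call handler: forwards only if a hook is set. */
long sys_set_spus(int container_id, int nr_spus)
{
    if (!set_spus_hook)
        return -ENOSYS;      /* module not loaded: do nothing */
    return set_spus_hook(container_id, nr_spus);
}

/* Stand-in for the function that is implemented inside the module. */
long module_set_spus(int container_id, int nr_spus)
{
    (void)container_id;
    return nr_spus;          /* placeholder result for the demonstration */
}
```

The module would call register_set_spus(module_set_spus) when it is loaded and register_set_spus(NULL) when it is removed, so the handler in the static kernel never references module code directly.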
Kernel source code files that require changes:
- include/asm-powerpc/systbl.h
- include/asm-powerpc/unistd.h
- include/linux/syscalls.h
- kernel/sys.c
- kernel/sys_ni.c
- kernel/ve/vecalls.c
Also, update files in the OpenVZ vzctl sources:
- include/res.h
- include/vzctl_param.h
- include/vzsyscalls.h
- src/lib/config.c
- src/lib/res.c
- src/vzctl.c
Introduce and include two more files in the OpenVZ build system:
- include/spu.h
- src/lib/spu.c
Getting ready for Part 3
Part 3 describes using and testing the system and analyzes
the performance of container virtualization against other software virtualization
methods, such as paravirtualization or full virtualization.
Resources
Learn
- Use an
RSS
feed to request notification for the upcoming articles in this series. (Find out more about RSS feeds of developerWorks content.)
- Try "Virtualization in Linux"
(September 2006) for OpenVZ and virtualization. It ties the three main
virtualization approaches—emulation,
paravirtualization, and OS-level virtualization—to OpenVZ.
- Refer to Arnd Bergmann's
"Spufs: The Cell Synergistic Processing Unit as a virtual file system"
(developerWorks, June 2005) for details about the SPU file system interface that allows
Linux to run on the Cell/B.E. platform. Bergmann's
"How to not invent kernel interfaces"
(paper to LinuxConf Europe, July 2007) explains how to choose the form of user
space interface your kernel code should get. Bergmann also has a "virtual"
class about Linux on Cell/B.E. platforms
that covers the threading model, Linux runtime strategy, PPU/SPU runtime
requirements, spufs, signal handling, and more.
- Read Daniel Hackenberg's
"Performance Measurements on Cell SMP Systems"
presentation (from the Center for Information Services and High Performance
Computing's Cell/B.E. cluster meeting, May 2007) for such performance analysis
measures as matrix multiplication, XDR DMA bandwidth, and SPE-to-SPE DMA
bandwidth.
- Check out Duc Vianney's
"Cell Software Solutions Programming Model"
presentation (March 2006) for such Cell/B.E. programming model issues as PPE-
vs. SPE-centric, function offload, overlapping DMA and computation, and
heterogeneous multi-thread.
- Read
"Virtualization in a nutshell"
(developerWorks, June 2006) as an introduction to the topic of basic
virtualization concepts by means of common patterns.
"Virtual Linux"
(developerWorks, December 2006) explains the various forms of virtualization (and current
virtualization projects) from a Linux perspective.
- See
"Virtualization with coLinux"
(developerWorks, March 2007) and
"System emulation with QEMU"
(September 2007) for more on paravirtualization.
- Check out a new quick-read jumpstart series in
the blog that covers SDK 3.0 topics—
Infobombs. The first three
introduce the Accelerated Library Framework (ALF),
illustrate the 10 most important
concepts of ALF, and introduce the Data Communication and Synchronization (DaCS) library services.
- Find
"Changes in libspe: How libspe2 affects Cell Broadband Engine programming"
(developerWorks, July 2007) for details about the libspe2 concepts and to see how to do basic
SPE process management and communication with libspe2.
- Refer to
"Introduction to the Cell Multiprocessor"
(IBM Journal of Research and Development, 2005) for an introductory
overview of the Cell/B.E. multiprocessor's history, the program objectives and
challenges, the design concept, the architecture and programming models, and the
implementation.
- To learn more on Cell/B.E. programming, try the
developerWorks series:
- Refer to the Cell
Broadband Engine documentation section of the IBM Semiconductor Solutions Technical Library for a wealth of downloadable manuals,
specifications, and more.
- Sign up for the developerWorks newsletter
and get the latest developer news and Cell/B.E. happenings delivered to your inbox each week.
Check Power Architecture when you sign up to receive Cell/B.E. news in your newsletter.
- Visit these other developerWorks resources you might be
interested in because of their connection to virtualization: the
Linux zone and the
Open source zone.
About the authors
Christian Kaiser studies Computer Engineering at RWTH University in Aachen, Germany. In 2007, he was an intern at IBM Germany Research Lab in Boeblingen, Germany. While doing his internship at IBM, he researched virtualization methods for the Cell Broadband Engine processor. After the internship, Christian Kaiser went to work on his thesis at the Chair for Operating Systems of the RWTH Aachen. His thesis is about "Analysis of asynchronous collective communication in memory-coupled high-speed networks."
Christian Rund is in the IBM Development Laboratory in Boeblingen, Germany. He studied Computer Science at the University of Stuttgart and at Uppsala Universitet, Sweden, graduating in 1997. During his studies, he worked as a student trainee at IBM in Herrenberg and Stuttgart, Germany. In 1998, he joined the system development department of the Landeszentralbank in Stuttgart (Deutsche Bundesbank). In 2001, Christian joined IBM as a research and development engineer on the development team for the zSeries FCP channel. Since mid-2006, he has been a research and development engineer for host firmware for Cell/B.E.-based servers.