This article describes how to implement dedicated virtualization (partitioning), the concept introduced in the first article (Part 1, Figure 3). It does not cover the implementation of shared devices (Part 1, Figure 4).
To demonstrate the implementation, this article covers the following:
- Virtualization of the spufs: stop it from using the same root-inode for each mount point
- Adjustments to the sysfs: entries must be adapted during runtime
- Modification of the SPU scheduler: add a check that enforces the virtualization
- Modification of the OpenVZ tools: make use of the new feature
By default, the spufs uses the same root-inode each
time you mount it. For example, you might have two
chrooted environments (A and B), both with
spufs mounted at /spu. If you create SPE threads in environment A and then
list /spu, you see the SPE threads that you just created. But a listing of /spu in
environment B shows the same entries, because environment B accesses the
same root-inode as environment A.
To make each environment see only the SPE threads that it creates, the spufs must
stop using the same root-inode for every mount point. When the spufs is
mounted, it internally calls a kernel function that returns a super block.
Normally it uses the function get_sb_single(), which
always returns the same super block. The function
get_sb_nodev(), in contrast, returns a new super block for each mount,
which gives the desired behavior. (Figure 1 illustrates spufs before and after
virtualization.)
Figure 1. Spufs before and after virtualization
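The difference between the two behaviors can be modeled in user space. The following is only a sketch and not kernel code: the model_sb structure and all model_* functions are hypothetical stand-ins, and they merely imitate how a shared super block and a per-mount super block affect what each mount point lists.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ENTRIES 8

/* Hypothetical stand-in for a super block together with its root directory. */
struct model_sb {
    const char *entries[MAX_ENTRIES]; /* names listed under the root */
    int nr_entries;
};

/* Imitates get_sb_single(): every mount shares one super block. */
static struct model_sb *model_get_sb_single(void)
{
    static struct model_sb shared; /* zero-initialized, shared by all mounts */
    return &shared;
}

/* Imitates get_sb_nodev(): every mount gets a fresh super block. */
static struct model_sb *model_get_sb_nodev(void)
{
    return calloc(1, sizeof(struct model_sb)); /* caller frees */
}

/* Adds a directory (for example, one SPE thread) to the root of one mount. */
static void model_create_context(struct model_sb *sb, const char *name)
{
    assert(sb != NULL && sb->nr_entries < MAX_ENTRIES);
    sb->entries[sb->nr_entries++] = name;
}

/* Returns 1 if the name is listed in the root of the given mount, else 0. */
static int model_is_visible(const struct model_sb *sb, const char *name)
{
    int i;

    for (i = 0; i < sb->nr_entries; i++)
        if (strcmp(sb->entries[i], name) == 0)
            return 1;
    return 0;
}
```

With the shared variant, an entry created through one mount is visible through every other mount. With the per-mount variant, each mount lists only its own entries, which is the behavior that containers need.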
OpenVZ allows only a few file systems to be mounted inside containers. Each
file system implements a struct file_system_type
instance that defines file-system-specific aspects. OpenVZ extends this structure (among
others) so that it records whether the file system is mountable inside a container. For the spufs, the
.fs_flags member must therefore be set to
FS_VIRTUALIZED.
A process that is created inside a container has two PIDs. (Remember from Part 1
that each container has its own set of resources that the kernel
provides. The process tree is one of these resources: a container can
see only its own processes, with virtualized PIDs.) Inside the container the process has a
virtualized PID, and on the host system the same process appears with a different
PID, called the global PID. As mentioned, the spufs directories
are named in the form spethread-<PID>-<thread-ID>,
in which <PID> is the process ID of the
process that holds the SPE thread and
<thread-ID> is the corresponding thread
ID. Because PIDs are virtualized in the container environment, a listing of
the spufs inside the container must show the virtualized PIDs and not the global
PIDs. Fortunately, OpenVZ already behaves this way.
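As a small illustration of the naming scheme, the following user-space sketch formats a directory name from either of the two PIDs. The model_pid structure and model_spufs_name() are hypothetical, and printing the thread ID as an unsigned decimal number is an assumption; the real names are generated inside the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical pair of IDs that belong to one process in a container. */
struct model_pid {
    int global_pid;  /* PID as seen on the host system */
    int virtual_pid; /* PID as seen inside the container */
};

/*
 * Formats a spufs directory name of the form spethread-<PID>-<thread-ID>.
 * inside_container selects which of the two PIDs is shown.
 * Returns the length of the name, or -1 if buf is too small.
 */
static int model_spufs_name(char *buf, size_t len, const struct model_pid *pid,
                            unsigned long thread_id, int inside_container)
{
    int shown;
    int n;

    assert(buf != NULL && pid != NULL);
    shown = inside_container ? pid->virtual_pid : pid->global_pid;
    /* Assumption: the thread ID is printed as an unsigned decimal number. */
    n = snprintf(buf, len, "spethread-%d-%lu", shown, thread_id);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

For a process with the global PID 31337 and the virtualized PID 42, a listing inside the container would use the name built from 42, while the name built from 31337 would never be shown there.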
Make these changes in the file arch/powerpc/platforms/cell/spufs/inode.c in the Linux® kernel.
The sysfs is already virtualized so that it can be used inside containers. A container does not see a copy of the host system's sysfs but only a part of it. The goal is to assign SPUs to a container and free them again during runtime, so the sysfs entries must be adapted during runtime, too. First, create the directory where the SPUs are listed: /sys/devices/system/spu.
By default, the sysfs in a container environment has no /sys/devices/system/spu, no /sys/devices/system, and no /sys/devices directory. These three directories must be created during the initialization (startup) of the container. The subdirectories of /sys/devices/system/spu are named spu0 to spu<N>, where <N> is the number of available SPUs in the system minus 1. A subdirectory must be created when its SPU is assigned to the container and deleted when the SPU is freed from the container.
Each directory in the sysfs listing has a corresponding kobject instance in the kernel space. A kobject is a structure that defines the name of the directory and its parent kobject (in other words, its parent directory). For example, the kobject that later becomes visible as the /sys/devices/system/spu/spu3 directory needs two settings:
- The name spu3.
- The parent (a pointer to the kobject that represents /sys/devices/system/spu).
After a kobject is registered in the sysfs with a
subsystem_register() call, it is visible in the user
space. Conversely, to delete a directory from the sysfs, call the subsystem_unregister() function
with the kobject that represents the directory. Figure 2 shows an example.
Figure 2. Sysfs and kobjects
The kobject with the name spu3 has a pointer to its parent kobject with the name spu.
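The relationship between kobjects and directories can be sketched in user space as follows. The model_kobject structure is a hypothetical, heavily reduced stand-in for the kernel's kobject (which carries far more state), and treating /sys itself as the root object is a simplification made for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a kernel kobject. */
struct model_kobject {
    const char *name;             /* directory name, for example "spu3" */
    struct model_kobject *parent; /* parent directory, NULL at the root */
};

/*
 * Writes the path of kobj into buf by following the parent pointers
 * up to the root and then appending the names on the way back down.
 * Returns the length of the path, or -1 if buf is too small.
 */
static int model_kobject_path(const struct model_kobject *kobj,
                              char *buf, size_t len)
{
    int used = 0;
    int n;

    assert(kobj != NULL && buf != NULL);
    if (kobj->parent != NULL) {
        used = model_kobject_path(kobj->parent, buf, len);
        if (used < 0)
            return -1;
    }
    n = snprintf(buf + used, len - used, "/%s", kobj->name);
    if (n < 0 || (size_t)n >= len - used)
        return -1;
    return used + n;
}
```

A chain of five such objects named sys, devices, system, spu, and spu3, each pointing to the previous one as its parent, yields the path /sys/devices/system/spu/spu3. Setting only the name and the parent is therefore enough to place a new directory in the tree.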
The main changes are made in the kernel file kernel/ve/vecalls.c. This OpenVZ-specific file implements most of the functions that are called when a container is initialized and when its parameters are set.
Each physical SPU in the system is represented by an instance of an
spu structure in the kernel space. This structure holds
several things, including:
- The ID (number) of the SPU
- Which Cell/B.E. node it belongs to
- A pointer to the local store (LS) of the SPU
A new member variable stores the owner of the SPU as a container ID.
The SPU scheduler implements a spu_alloc() function
that searches for a free SPU on which to execute an SPE thread. To do so, it searches
a list of the available SPUs in the system (those on which no SPE thread is currently executing).
Normally, it takes the first SPU in the list and executes the SPE
thread on it. To achieve virtualization, the function must also check
whether the free SPU has the same container ID as the SPE thread that
should be executed on it. Figure 3 shows how
spu_alloc() works before and after the
modification.
Figure 3. Spu_alloc() before and after modification
If this additional check fails, the function checks the next element in the list of free SPUs. If no free SPU is available for the container that launches the SPE thread, the SPU scheduler behaves as if the list were empty and waits until an SPU becomes free.
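The modified search can be sketched in user space as shown here. This is not the kernel's spu_alloc(): the model_spu structure, its owner_ctid member, and model_spu_alloc() are hypothetical names, and returning NULL stands in for the point at which the real scheduler would wait.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for the kernel's spu structure. */
struct model_spu {
    int number;             /* ID (number) of the SPU */
    int owner_ctid;         /* container ID of the owner (the new member) */
    struct model_spu *next; /* next element in the list of free SPUs */
};

/*
 * Walks the list of free SPUs and takes the first one whose owner matches
 * the container of the SPE thread. Returns NULL if no free SPU belongs to
 * that container, which the caller treats like an empty list.
 */
static struct model_spu *model_spu_alloc(struct model_spu **free_list, int ctid)
{
    struct model_spu **link;

    assert(free_list != NULL);
    for (link = free_list; *link != NULL; link = &(*link)->next) {
        struct model_spu *spu = *link;

        if (spu->owner_ctid != ctid)
            continue;      /* additional check failed: try the next SPU */
        *link = spu->next; /* unlink the SPU from the list of free SPUs */
        spu->next = NULL;
        return spu;
    }
    return NULL;
}
```

Without the owner comparison, the loop would always return the head of the list, which corresponds to the original behavior.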
The spu_alloc() function is implemented in the
arch/powerpc/platforms/cell/spu_base.c Linux kernel source file.
Most of the required functionality already exists, but to use
the new feature, you must modify the OpenVZ tools. The vzctl tool, which is the main tool for setting container
parameters in OpenVZ, manages the SPU allocation during
runtime. The new parameter for setting the number of SPUs assigned to
a container is --spus <nr_spus>.
The <nr_spus> value represents the
number of SPUs assigned to the container. It is an absolute value, so if eight
SPUs are assigned to the container and you set the value to 6,
then 2 SPUs are freed from the container (8 - 6 = 2) instead of 6 more
SPUs being added.
For example, here is the command-line output where the container with the ID of 101 gets eight SPUs:
[root@c02b12-0 ~]# vzctl set 101 --spus 8
Setting SPUs: 8
Configure meminfo: 1024000
WARNING: Settings were not saved and will be reset to original values on next start (use --save flag)
[root@c02b12-0 ~]#
To complete this behavior, the vzctl tool must cross the user space barrier and do some management in the kernel space. The tool must find SPUs that are not yet used by other containers. It searches through the list of available SPUs and checks the newly implemented container ID value in the spu structures (described in the section on modifying the SPU scheduler). If the value is 0, the SPU can be assigned to the requesting container. The value 0 signifies that the SPU is not assigned to any container, because a valid container ID is always greater than 0. If the function cannot find enough free SPUs to complete the request, the procedure ends without assigning any SPU to the container. If the number of SPUs that are already assigned to the container is higher than the requested number, the surplus SPUs are freed.
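The assignment rules described here (an absolute target value, 0 as the marker for an unassigned SPU, and no partial assignment) can be sketched in user space as follows. All names are hypothetical, the total of 16 SPUs is an assumption made for the example, and the order in which SPUs are picked or freed is arbitrary.

```c
#include <assert.h>

#define MODEL_NR_SPUS 16 /* assumed number of SPUs in the system */
#define MODEL_UNOWNED 0  /* container IDs are always greater than 0 */

/* model_owner[i] is the container ID that owns SPU i, or MODEL_UNOWNED. */
static int model_owner[MODEL_NR_SPUS];

/* Counts the SPUs whose owner field equals ctid. */
static int model_count_spus(int ctid)
{
    int i, count = 0;

    for (i = 0; i < MODEL_NR_SPUS; i++)
        if (model_owner[i] == ctid)
            count++;
    return count;
}

/*
 * Sets the absolute number of SPUs owned by container ctid.
 * Returns 0 on success, or -1 without changing anything if there
 * are not enough unowned SPUs to satisfy the request.
 */
static int model_set_spus(int ctid, int nr_spus)
{
    int i;
    int have = model_count_spus(ctid);

    assert(ctid > 0 && nr_spus >= 0);
    if (nr_spus > have && nr_spus - have > model_count_spus(MODEL_UNOWNED))
        return -1; /* all or nothing: assign no SPU at all */

    for (i = 0; i < MODEL_NR_SPUS && have < nr_spus; i++) {
        if (model_owner[i] == MODEL_UNOWNED) {
            model_owner[i] = ctid; /* assign an unowned SPU */
            have++;
        }
    }
    for (i = MODEL_NR_SPUS - 1; i >= 0 && have > nr_spus; i--) {
        if (model_owner[i] == ctid) {
            model_owner[i] = MODEL_UNOWNED; /* free a surplus SPU */
            have--;
        }
    }
    return 0;
}
```

Calling model_set_spus(101, 8) and then model_set_spus(101, 6) leaves container 101 with six SPUs, matching the absolute semantics of the --spus parameter.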
To cross the barrier of user space and kernel space, you can use different
implementation models. (Refer to Arnd Bergmann's "How to not invent kernel
interfaces" in Resources for more information about
implementation models.) The simplest way is to implement a new system call that
takes <containerID> and
<nr_spus> as its
parameters.
The functions that handle the setting of the SPU parameter of the containers must be implemented in a part of the kernel that can be built as a kernel module. That presents a problem. If the kernel module is not loaded, the system call handler function in the kernel space should do nothing. But if the module is loaded, the handler must call the functions that are implemented inside the module. That is not a trivial task, because the system call table (where the function pointers to the system call handler functions reside) is part of the static kernel build.
The module is not part of the static kernel build, which is why the statically built-in system call handler function cannot directly call the functions that are part of the module. The solution is to implement a function wrapper that copies a pointer to the function in the module into a variable that belongs to the statically built-in system call handler, so that the handler can call the function in the module through this pointer. The function wrapper is called during the module initialization and cleanup. The black arrows in Figure 4 show the function wrapper method.
Figure 4. System call and module functions
The figure shows how the pointer to the function that is implemented inside a module is copied into the static, built-in kernel space. The dashed arrows show that the user space application calls the function inside the module by passing through the static, built-in system call handler function.
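The function wrapper method can be illustrated with a plain function pointer. The following user-space sketch uses hypothetical names throughout and returns -1 where the handler does nothing. A real kernel implementation would additionally have to make sure that the module cannot be unloaded while a call through the pointer is in progress.

```c
#include <assert.h>
#include <stddef.h>

/* Signature of the function that the module provides (hypothetical). */
typedef int (*model_spu_handler_t)(int container_id, int nr_spus);

/* Belongs to the static kernel build; NULL while the module is not loaded. */
static model_spu_handler_t model_handler;

/* Function wrapper: called during module initialization and cleanup. */
static void model_register_handler(model_spu_handler_t fn)
{
    model_handler = fn; /* the cleanup path passes NULL to unregister */
}

/* Statically built-in system call handler. */
static int model_sys_set_spus(int container_id, int nr_spus)
{
    if (model_handler == NULL)
        return -1; /* module not loaded: do nothing */
    return model_handler(container_id, nr_spus);
}

/* Implemented inside the module (placeholder body for this sketch). */
static int model_module_set_spus(int container_id, int nr_spus)
{
    assert(container_id > 0);
    return nr_spus; /* pretend that nr_spus SPUs were assigned */
}
```

The system call table only ever refers to model_sys_set_spus(), so it can stay part of the static build, while the module function becomes reachable as soon as the module registers it.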
Kernel source code files that require changes:
- include/asm-powerpc/systbl.h
- include/asm-powerpc/unistd.h
- include/linux/syscalls.h
- kernel/sys.c
- kernel/sys_ni.c
- kernel/ve/vecalls.c
Also, update files in the OpenVZ vzctl sources:
- include/res.h
- include/vzctl_param.h
- include/vzsyscalls.h
- src/lib/config.c
- src/lib/res.c
- src/vzctl.c
Introduce and include two more files in the OpenVZ build system:
- include/spu.h
- src/lib/spu.c
Part 3 describes using and testing the system and analyzes the performance of container virtualization against other software virtualization methods, such as paravirtualization or full virtualization.
Learn
- Try "Virtualization in Linux"
(September 2006) for OpenVZ and virtualization. It ties the three main
virtualization approaches—emulation,
paravirtualization, and OS-level virtualization—to OpenVZ.
- Refer to Arnd Bergmann's
"Spufs: The Cell Synergistic Processing Unit as a virtual file system"
(developerWorks, June 2005) for details about the SPU file system interface that allows
Linux to run on the Cell/B.E. platform. Bergmann's
"How to not invent kernel interfaces"
(paper to LinuxConf Europe, July 2007) explains how to choose the form of user
space interface your kernel code should get. Bergmann also has a "virtual"
class about Linux on Cell/B.E. platforms
that covers the threading model, Linux runtime strategy, PPU/SPU runtime
requirements, spufs, signal handling, and more.
- Read Daniel Hackenberg's
"Performance Measurements on Cell SMP Systems"
presentation (from the Center for Information Services and High Performance
Computing's Cell/B.E. cluster meeting, May 2007) for such performance analysis
measures as matrix multiplication, XDR DMA bandwidth, and SPE-to-SPE DMA
bandwidth.
- Check out Duc Vianney's
"Cell Software Solutions Programming Model"
presentation (March 2006) for such Cell/B.E. programming model issues as PPE-
vs. SPE-centric, function offload, overlapping DMA and computation, and
heterogeneous multi-thread.
- Read
"Virtualization in a nutshell"
(developerWorks, June 2006) as an introduction to the topic of basic
virtualization concepts by means of common patterns.
"Virtual Linux"
(developerWorks, December 2006) explains the various forms of virtualization (and current
virtualization projects) from a Linux perspective.
- See
"Virtualization with coLinux"
(developerWorks, March 2007) and
"System emulation with QEMU"
(September 2007) for more on paravirtualization.
- Check out a new quick-read jumpstart series in
the blog that covers SDK 3.0 topics—
Infobombs. The first three
introduce the Accelerated Library Framework (ALF),
illustrate the 10 most important
concepts of ALF, and introduce the Data Communication and Synchronization (DaCS) library services.
- Find
"Changes in libspe: How libspe2 affects Cell Broadband Engine programming"
(developerWorks, July 2007) for details about the libspe2 concepts and to see how to do basic
SPE process management and communication with libspe2.
- Refer to
"Introduction to the Cell Multiprocessor"
(IBM Journal of Research and Development, 2005) for an introductory
overview of the Cell/B.E. multiprocessor's history, the program objectives and
challenges, the design concept, the architecture and programming models, and the
implementation.
- To learn more on Cell/B.E. programming, try the
developerWorks series:
- "Programming high-performance applications on the Cell/B.E. processor"
- "PS3 fab-to-lab"
- "The little broadband engine that could"
- Refer to the Cell
Broadband Engine documentation section of the IBM Semiconductor Solutions Technical Library for a wealth of downloadable manuals,
specifications, and more.
- Visit these other developerWorks resources you might be
interested in because of their connection to virtualization: the
Linux zone and the
Open source zone.
Get products and technologies
- Get the
"OpenVZ User's Guide"
Version 2.7.0-8 from SWsoft.
- Try out the
LMbench suite of tools for
performance analysis. They offer portable benchmarks that compare different UNIX® systems
performance.
- Get a code sample from Power.org
that performs a 4-way SIMD single-precision complex FFT
within a Cell/B.E. environment.
- Find all Cell/B.E.-related articles, discussion forums, downloads,
and more at the IBM developerWorks Cell
Broadband Engine resource center: your definitive resource for all
things Cell/B.E.
- Contact IBM about custom
Cell/B.E.-based or custom-processor based solutions.
Discuss
- Participate in the discussion forum.
- Check out the Cell Broadband
Engine Architecture forum to get your technical questions about the processor answered.
Juicy problems and answers from the forums are rounded up periodically and highlighted
in the "Forum watch" blog series.
- Go to the Power Architecture blog for news, downloads,
instructional resources, and event notifications for Cell/B.E. and other Power Architecture-related technologies. You can find
the popular "Forum watch" blog series (Q&A roundup), the "FixIt" technology updates, and the Infobomb quick-read
technology introductions.
Christian Kaiser studies Computer Engineering at RWTH University in Aachen, Germany. In 2007, he was an intern at IBM Germany Research Lab in Boeblingen, Germany. While doing his internship at IBM, he researched virtualization methods for the Cell Broadband Engine processor. After the internship, Christian Kaiser went to work on his thesis at the Chair for Operating Systems of the RWTH Aachen. His thesis is about "Analysis of asynchronous collective communication in memory-coupled high-speed networks."
Christian Rund is in the IBM Development Laboratory in Boeblingen, Germany. He studied Computer Science at the University of Stuttgart and at Uppsala Universitet, Sweden, graduating in 1997. During his studies, he worked as a student trainee at IBM in Herrenberg and Stuttgart, Germany. In 1998, he joined the system development department of the Landeszentralbank in Stuttgart (Deutsche Bundesbank). In 2001, Christian joined IBM as a research and development engineer on the development team for the zSeries FCP channel. Since mid-2006, he has been a research and development engineer for host firmware for Cell/B.E.-based servers.