Cluster computing has made significant progress since its inception in the early 90s. The trend to use PC clusters to achieve supercomputing speed with an affordable budget has accelerated with the rising use of GNU/Linux. The easy availability of inexpensive hardware (processors, memory, hard disks, Ethernet cards) combined with open source software quickly drove people to adopt Linux-based Beowulf Clusters. ("Beowulf" refers to the name Dr. Donald Becker and Dr. Thomas Sterling, the first researchers in the area of PC-based clusters, gave their initial cluster; many people use this name to refer to a similar technology, the PC-based cluster.)
This blurring of terminology raises the question: "Do people who want to build a cluster have to use Linux?" (Perhaps the heart of that question, and the more practical point, is "We have a bunch of idle PCs running Windows. Do we need to migrate them all to Linux just to borrow their CPU cycles?")
These questions led us to experiment with how to build a hybrid cluster, a cluster consisting of heterogeneous operating systems. For this article, both Windows and Linux are used; other operating systems are ignored for the sake of simplicity.
Nowadays, local area or campus-type networks consist of PCs using different operating systems. Some use Windows, others Linux, still others BSD, and so on. With the average number of PCs in a LAN being between 100 and 300, this presents enormous potential computing power, especially if you're planning a high-performance computing (HPC) cluster where the main focus is on gaining the highest speed.
In this type of cluster, the challenge is that in most cases, we can't make the individual PCs 100 percent dedicated to the cluster. Compared with a traditional cluster, one that is fully dedicated to running number-crunching programs in a 24/7 operation, the hybrid cluster is mostly useful for extending the nodes of an existing Linux cluster. There are two strategies to implement this:
- Install a cluster agent on each Windows PC. You can think of this agent as a small application sitting in the system tray or as a Windows service. It works in unattended mode, so its operation is fully controlled by the master node (which runs Linux).
- Use a dual-boot installation or a Linux LiveCD. LTSP (the Linux Terminal Server Project) can also be classified as part of this strategy. The basic idea is to temporarily convert the nodes into a Linux system, from which they join the master node as members of the cluster.
We did not explore a full or partial migration from Windows to Linux as a solution because we wanted to focus on a third option for this article. Instead of a full or partial migration to Linux, we wanted to examine the mechanisms and benefits of joining both Windows and Linux in a cluster without forcing a full migration. However, this doesn't mean that a full migration is the wrong solution.
We chose the cluster agent approach as a solution because of these advantages:
- The Windows user is able to continue working in a familiar environment, using the office suite, drawing diagrams, or performing other tasks. The user can instruct the cluster agent to run at low priority or to run only whenever the screen saver is activated. The SETI@home project uses this type of strategy quite effectively.
- By avoiding dual booting, you eliminate the need to repartition the disk (to install Linux). You can install Linux atop the Windows filesystem (like ZipSlack), but this also forces you to quit Windows.
With the cluster agent approach, you just need to spare some space inside the FAT32 or NTFS filesystem to host Linux agent binaries.
The roadmap for building a cluster requires choosing between three types of clusters: the high availability (HA) cluster, the high-performance computing (HPC) cluster, and the high throughput computing (HTC) cluster. We chose the HPC, the most commonly used cluster, which has these consequences:
- The failures that might happen on nodes are ignored. These failures include power failures, damage to network cables, and other kinds of hardware-related problems such as hard disk damage, CPU lockup due to overheating, and bad memory.
- There is a slight chance that a node will need to be rebooted while working on a job submitted by the master node. In this case, the applications themselves must checkpoint at certain intervals, or the administrator must restart them.
Let's step back a moment and ask this question: "Why do we need to implant a cluster agent? Wouldn't porting the Linux-based application to Windows yield better performance?" Certainly, with the availability of such cross-platform toolchains as MinGW or Cygwin, this could be easy.
To answer this question, we prefer the cluster agent for these reasons:
- Porting software to another platform doesn't happen as smoothly as predicted. Tackling issues such as differences in system calls, timing, direct hardware access, etc. can delay the implementation.
- Hybrid clusters tend to be used as a testbed for various new applications or as an extension of an existing cluster. Putting a lot of effort into porting won't bring much benefit to an environment that's designed to be constantly changing.
- In the real world, many people use commercial or proprietary software. This software is sold or published without its source code. The consequence is that porting becomes impossible.
Our proposed solution won't yield as fast an execution speed as porting the application to native Windows. But keep in mind that with this experiment, we're trying to achieve several key goals:
- Flexibility. The Windows user can work as usual while the agent runs at low priority or uses just the idle cycles of the CPU.
- Performance. When running the cluster agent, the execution should be nearly as fast as a native port.
- Efficiency. We just need to install the binaries of the cluster agent like a normal Windows program and run them unattended.
- Manageability. Combined with mass-deployment software (such as the LANDesk suite or Microsoft Systems Management Server), we can install and remove the agent package in minutes. A remote administration tool such as VNC can also be added to make remote administration easier if required.
To start, we need to choose two pieces of software:
- An emulator or virtual machine
- The cluster middleware
This software will act as a bridge so the Linux kernel can run on top of Windows; that is its primary function.
Why an emulator or virtual machine? Because it is hard to port Single System Image (SSI) middleware into the Windows kernel architecture. SSI solutions modify almost every area of the kernel: process management, filesystem, memory management, scheduler, and so on. An emulator simplifies the deployment by letting the kernel and its processes run unchanged.
For the middleware, we pick solutions that work as extensions to the Linux kernel so we can build an SSI cluster. There are other choices (such as a batch scheduler or a pure remote execution wrapper), but we chose the SSI model because of the following advantages:
- SSI middleware hides the complexity of how we access resources (files, memory, CPU) scattered among the nodes. We simply run the application and the middleware will take care of resource management transparently.
- SSI middleware provides load leveling and transparent process migration. By utilizing these features, administrators won't need to manually tune the load unless the situation heavily demands it or the algorithm in the middleware is unable to achieve fairness. Process migration also eases execution because we don't need explicit remote execution (see the short example after this list).
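To make this transparency concrete, here is a small sketch (crunch is a hypothetical stand-in for any CPU-bound program): on an SSI cluster such as openMosix, you simply start ordinary local processes and let the middleware place them.

#!/bin/sh
# Start several ordinary CPU-bound jobs on the local node.
# No rsh/ssh calls and no host list are needed: the SSI layer
# migrates the processes to less-loaded nodes automatically.
for i in 1 2 3 4; do
    ./crunch input-$i.dat &
done
wait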
This will be relatively easy. What are our choices of SSI middleware? We have three open source, GPL-licensed solutions to choose from (in no particular order).
Detailed information about each solution is beyond the scope of this article but can be found in Resources.
We're going to choose openMosix for two reasons:
- openMosix possesses the simplest configuration when compared with the other solutions. This will reduce the complexity of the cluster deployment.
- openMosix provides us with load leveling and transparent process migration in its simplest form. By "simplest," we mean that the size of the kernel patch is relatively small and doesn't carry many features. This is suitable for our experiment, although if desired, you could extend the experiment using the other SSI middlewares.
The choice for this item is a bit more complicated. We have two possibilities: using a commercial product such as VMware or using an open source solution such as coLinux (guess which one we'll choose!). Again, we choose the open source solution. But let's discuss the pros and cons of each first so you can see the thought processes involved in analyzing our choices.
For the commercial product, let's take VMware as the example. You're probably familiar with this product; VMware has been a long-time player in the VM arena. Here is our preliminary analysis of VMware.
On the pro side:
- VMware is a full system emulator (in this article, we freely interchange the term "emulator" with "virtualizer" to refer to software that emulates a system). Therefore, it can emulate a complete system (memory, storage, CPU, Ethernet card, sound card, and so on). The system running inside the emulator (called the guest system) won't need any modification (or at most, a few patches to guarantee that the OS works smoothly).
- The emulator runs in a memory address space separate from the other host applications and with the same privilege level as a user space application, so any crash on the guest system won't affect the host.
On the con side:
- Since VMware is a commercial product, it's not free. Licenses for all the Windows computers in a network can easily exceed the cost of a new dedicated cluster, which would deliver more performance for less money.
- The emulation drains much of the host system's resources. The emulator has to intercept most of the instructions used for guest-kernel-to-application communication and has to model a complete set of virtual hardware devices. These extra layers impose a natural lower bound on network latency that cannot be undercut.
- VMware itself (guest system not included) requires a large amount of memory, so the total amount of memory allocated is somewhat bloated.
Now let's analyze coLinux. coLinux is a new open source solution that lets you run a Linux kernel on top of a Windows kernel.
- coLinux allows the Linux kernel to run natively (as a Windows kernel driver). Because there is no "bridge" between the host kernel and the guest kernel, the Linux guest can run at relatively near-native speed. The emulation layer is inserted directly as code in many different Linux subsystems, so guest operations turn into direct requests to the Windows system. Internally, a context switch is performed between these two kernels, but this can be done quite quickly.
- The host system only needs to allocate memory purely for the need of guest system (and its forked processes).
- coLinux itself is still in development, so you may encounter some glitches when running it. While the developers' efforts are focused on making coLinux better, they might fail to catch some bugs. Note: The supported Windows versions are Windows 2000 and Windows XP.
- coLinux needs some manual configuration to get it to run. Currently there is no automatic tool to help create a complete coLinux system, so you have to be able to build or tweak the coLinux configuration.
So how does coLinux's performance compare with VMware's? We ran a rough benchmark. As a test, we chose the POV-Ray 3.6.1 precompiled binary from the POV-Ray Web site. (POV-Ray is one of the grandfathers of realistic 3D graphics software, full of massive number-crunching tasks and perfect for our test.) The binary was executed with the default options found in benchmark.ini (included in the POV-Ray package):
# povray benchmark.ini
POV-Ray was run on coLinux with Linux kernel 2.4.26; the root filesystem uses the Gentoo distribution. We tested the same POV-Ray binary on VMware version 4.5.2 using the same distribution. The following table shows how long our testing machine needed to execute the predefined scene with the necessary options (as noted in the benchmark.ini file).
Table 1. POV-Ray runtime result
| Platform (where POV-Ray is executed) | Time (in minutes:seconds) |
Reading the results, you see that coLinux closely matches the native OS speed. VMware, as predicted, is slower than coLinux by roughly one minute. VMware can achieve nearly native speed by translating the virtual machine's (VM) instruction stream on the fly for the host machine, but since VMware itself runs in user space, this can cause problems, such as when the VM executes code in kernel mode: VMware then has to carefully translate the memory mapping and permissions in order to simulate the VM's virtual CPU correctly.
Now, let's look at how coLinux, our choice for emulator, and openMosix, our choice for middleware, work together.
Figure 1 shows us how openMosix works.
Figure 1. How openMosix works
openMosix works inside the Linux kernel. It has several parameters that can be controlled via the /proc interface. To make the actual operation easier, several user space applications have been created. openMosix works in a decentralized manner, so there is no master or slave in this topology. This decentralized strategy is applied in the load information exchange and at the process-migration stage.
openMosix transparently migrates a process to another node if the load-leveling algorithm decides the destination has a smaller workload and it is feasible to migrate the process. The reasons why a process cannot or should not migrate vary (it uses pthreads, does a significant amount of disk I/O, or has a very short runtime). From the user space perspective, there is no need to modify the application code; all the work is done transparently in kernel space.
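To give a feel for day-to-day control, here are a few commands based on the standard openmosix-tools package and the /proc/hpc interface; exact command names and /proc paths can differ between openMosix releases, so treat this as a sketch and check your version's documentation.

# watch the load of every node in the cluster
mosmon

# keep processes started from this shell on the local node, then allow
# migration again
mosctl stay
mosctl nostay

# ask openMosix to move the process with PID 1234 to node 2
migrate 1234 2

# the same knobs are also exposed through /proc
cat /proc/hpc/admin/stay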
Now, let's examine how coLinux works (the following diagram is from coLinux.org).
Figure 2. Description of coLinux internal (by permission of Dan Aloni, the coLinux lead developer and doc writer)
Dan Aloni describes coLinux as a cooperative and paravirtualized Linux virtual machine. Cooperative means that it gives control back to the host OS at will. Paravirtualized means that the coLinux kernel has no notion of the real hardware except that of the CPU and memory. This is unlike VMware which intercepts I/O to hardware devices and emulates it. coLinux feels like a separate Linux box -- the guest kernel's internals are separated from the host kernel's internals.
Notice that coLinux consists of two parts:
- The coLinux kernel driver that operates in the host kernel space.
- Several user space daemons.
The coLinux kernel driver's main jobs are to:
- Load the Linux guest kernel image on start. You can imagine this functionality being similar to a bootloader (like LILO or GRUB).
- Handle ioctl() requests as instructed by the colinux-daemon process. This ioctl() call is responsible for performing context switches between the guest OS and the host OS.
- Act as a forwarder of interrupts and requests from several virtual drivers: cobd (block device), conet (network), and cocon (console).
In user space, the most important part is the colinux-daemon process. Besides being responsible for triggering context switches, it works as the "manager" of several other daemons such as colinux-console-nt and colinux-net-daemon. For example, via colinux-console, users can see the current display of the active console of the Linux guest. When the user types or issues a command inside the shell of this colinux-console, colinux-daemon will "wrap" it and forward it to the coLinux kernel driver. For a complete understanding of coLinux internals, go to coLinux.org.
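To illustrate how these pieces are wired together, here is a minimal configuration and launch sketch. The element names follow the sample default.colinux.xml shipped with coLinux 0.6.1 as we recall it, and the paths and sizes are placeholders; consult the coLinux documentation for the authoritative syntax.

<!-- default.colinux.xml: kernel image, memory, root block device, network -->
<colinux>
    <image path="vmlinux" />
    <memory size="64" />
    <block_device index="0" path="\DosDevices\c:\coLinux\root_fs" enabled="true" />
    <bootparams>root=/dev/cobd0</bootparams>
    <network index="0" type="tap" />
</colinux>

The virtual machine is then started from the Windows side with something like:

colinux-daemon.exe -c default.colinux.xml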
When openMosix and coLinux are merged, there is no significant change. Because openMosix lives entirely in kernel space, coLinux simply loads the guest kernel image as usual and works as described above. When openMosix has to communicate with other nodes, the guest kernel invokes several system calls. coLinux intercepts these and passes the data to the coLinux-net-daemon which ultimately sends it out via the Windows API.
The following delineates the coLinux network layers and the data flow of network traffic:
- Source Application
- Linux Kernel
- coLinux Kernel Driver
- coLinux Network Driver
- Windows NIC Driver
Now let's see how to join coLinux and openMosix.
For this article, we combine the coLinux and openMosix patch for the 2.4.26 Linux kernel. We pick 2.4.26 because this was the latest stable kernel version when the experiment was conducted and is the highest 2.4.x kernel series that is supported for both openMosix and coLinux. (The description mentioned here is only an overview. For a complete explanation, refer to the step-by-step guide in Resources.)
Following are the steps to make your first coLinux/openMosix kernel:
- You will need the vanilla 2.4.26 Linux kernel, the matching openMosix patch, and coLinux release 0.6.1. Download the needed archives and untar them to a suitable working directory.
- Apply the coLinux kernel patch to the kernel sources and copy the config file (conf/linux-config, which comes with coLinux) to the kernel source tree. You should name it ".config" of course.
- Now apply the openMosix patch. One hunk will fail, but this is trivial since it's just the Makefile that the patch is complaining about. (See the command sketch below.)
- Finally you can build the kernel by executing these commands:
# make oldconfig
# make dep
# make vmlinux
This will deliver the file vmlinux, the new kernel. We suggest you use gcc 3.3.x for all the coLinux builds (the kernel image and the user space tools) since it has proven to deliver the most stable binaries. Do not mix gcc versions between the kernel image and the user space tools because that can result in a system lockup. To use the finished kernel as a new virtual machine, you also need the user space tools, the image of a base filesystem, and the TAP-Win32 network driver. To shorten the entire test cycle, you can download a ready-to-use package containing both the user space tools and the kernel image. (All these downloads can be found in the Resources section.)
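As a rough command-line sketch of steps 1 through 3 above (the archive and patch file names are assumptions; substitute the names of the files you actually downloaded):

# unpack the vanilla kernel
tar xjf linux-2.4.26.tar.bz2
cd linux-2.4.26

# apply the coLinux kernel patch and copy the supplied kernel config
patch -p1 < ../colinux-0.6.1/patch/linux
cp ../colinux-0.6.1/conf/linux-config .config

# apply the openMosix patch (expect one rejected Makefile hunk)
zcat ../openMosix-2.4.26-1.gz | patch -p1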
The only thing left is to create your own filesystem image or download one from coLinux.org. You need to put the openMosix user space tools inside the filesystem image so the openMosix functionality is ready to go. Follow the directions on the openmosix.org Web site on how to compile the user space tools.
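One way to get the compiled tools into the image is to loop-mount it on an existing Linux box and copy the binaries in. This is just a sketch; the mount point and the set of binaries shown are placeholders.

# attach the coLinux root filesystem image
mkdir -p /mnt/colinux
mount -o loop root_fs /mnt/colinux

# copy the openmosix-tools binaries into the guest filesystem
cp mosctl mosmon mosrun migrate /mnt/colinux/usr/local/bin/

umount /mnt/colinux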
The steps for putting everything together (Linux kernel image, user space tools, root filesystem) into a working system can be found in the step-by-step guide.
Here's a basic, preliminary benchmark to illustrate the ease of use and performance of the approach we suggest. Our testbed consists of just two machines: Amun is a native Linux box running the Debian flavor of Linux; Ipc256 runs Windows 2000 Professional.
The hardware specifications for each node are as follows:
Table 2. Ipc256 specifications
| Processor: | P4 1.70 GHz |
| Operating system: | Windows 2000 Professional |
| Hard disk: | 40 GB IDE |
| Network card: | Realtek Semiconductor Co. Ltd. RTL-8139 Fast Ethernet |
Table 3. Amun specifications
| Processor: | P4 2.40 GHz Hyper Threading |
| Operating system: | Debian Woody |
| Hard disk: | 2x approx. 40 GB IDE, several NFS mounts |
| Network card 1: | Syskonnect (Schneider & Koch) SK-98xx V2.0 Gigabit Ethernet |
| Network card 2: | Realtek Semiconductor Co., Ltd. RTL-8139 Fast Ethernet |
As a benchmark, we chose a multi-process application (we use the term "process" because it uses fork()) that does nothing but consecutively calculate Fibonacci numbers with different seed values. Its purpose is not to do a meaningful calculation, but to provide an easy-to-understand illustration of our solution. You can check out the source code by downloading the zip file referenced in the Download section.
The program is split into two parts. The first part spawns 2^5 = 32 child processes; for that purpose it uses the fork() system call. This enables openMosix to distribute the processes over all nodes participating in the cluster. In the second part, the actual calculation is performed.
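The real source is in the downloadable zip file; the following is only a hypothetical reconstruction of its structure, showing the two parts just described (the seed values and the fib() arguments are made up for illustration):

/* fibtest.c -- hypothetical sketch of the fork()-based benchmark.
 * Part one: spawn 32 independent child processes.
 * Part two: each child burns CPU computing a Fibonacci number.
 * Because the children are ordinary processes (not threads),
 * openMosix is free to migrate them to other nodes.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static unsigned long fib(unsigned long n)   /* deliberately slow recursion */
{
    return (n < 2) ? n : fib(n - 1) + fib(n - 2);
}

int main(void)
{
    const int children = 32;                /* 2^5 worker processes */
    int i;

    for (i = 0; i < children; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            exit(EXIT_FAILURE);
        }
        if (pid == 0) {                     /* child: the actual calculation */
            unsigned long result = fib(38 + (i % 4));   /* vary the seed */
            printf("child %d: fib = %lu\n", i, result);
            _exit(EXIT_SUCCESS);
        }
    }

    for (i = 0; i < children; i++)          /* parent: wait for all children */
        wait(NULL);
    return 0;
}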
We ran the program four times and took the average. The following table shows the result.
Table 4. Multi-process program runtime result
| Host | Execution time (in seconds) | Comment |
| Amun | ta = 436.25 | running locally |
| Ipc256 | ti = 778.75 | running locally |
| Both | tb = 285.25 | running on Amun and then migrated to Ipc256 |
Looking at these results, you can observe how long it takes for an individual PC to accomplish the job. Amun is a bit faster than Ipc256, but using both PCs results in by far the highest throughput. This could be improved even further if not just one but many Windows PCs joined the cluster.
The overhead is calculated using the following formula:
Overhead = tb/ta + tb/ti - 1
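Plugging in the measured values from Table 4 gives:

Overhead = 285.25/436.25 + 285.25/778.75 - 1 ≈ 0.654 + 0.366 - 1 ≈ 0.02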
In this case, our experiment yields about 2 percent overhead, which is close to the theoretical optimum.
From this benchmark, we can conclude that the best way to get the maximum benefit from this kind of cluster is to partition the job or application. A more common strategy is to use parameterized invocation: spawning multiple instances of the same program with different parameters.
An example of parameterized invocation is a render farm built around POV-Ray. Each instance of the renderer fetches a partial scene and builds a finished frame; this is repeated until all scenes have been processed.
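As a sketch of that pattern (render and its --frame switch are hypothetical stand-ins for whatever renderer and frame-selection option you use), the whole farm can be driven from one shell loop and left to openMosix to spread out:

#!/bin/sh
# one renderer instance per frame; openMosix distributes the instances
for frame in $(seq 1 32); do
    ./render --frame $frame scene.pov &
done
wait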
For users with a sufficient budget, running a commercial virtualizer such as VMware could be a safe way to achieve a hybrid Windows-Linux cluster using an open source Linux-based SSI, but for those who are looking for an affordable solution with an optimal MIPS-per-dollar value, coLinux is the ideal choice.
The benefit of this solution is clear. You don't need to do a massive Windows-to-Linux migration -- simply stick to your well-known Linux environment. This shortens the deployment time and still allows Windows-based users to work as usual. The idle power of Windows PCs can be exploited to help Linux machines run CPU-intensive programs.
There are many things that can be improved in the future, including:
- Writing kernel-to-kernel network-packet injection between the host kernel and the guest kernel. Currently these packets must travel from the host kernel driver, through the TAP-Win32 driver, the colinux-net-daemon, and the colinux-daemon, before ending up in the coLinux kernel driver. By bypassing these steps and replacing them with a direct guest-kernel-to-coLinux-driver link, we can reduce the communication latency and thus enhance cluster performance.
- Creating a more advanced cluster agent package. There are a few special coLinux distributions that allow instant deployment (CosMos, mentioned in the next section, is an example). Community help is needed to create such a package, preferably with an easy-to-use configuration manager that lets users or administrators do custom configuration in a few mouse clicks.
- Porting coLinux to Linux kernel version 2.6. While we were writing this article, there was a development release of the coLinux kernel patch for 2.6. Linux kernel 2.6 provides many features, such as the O(1) scheduler and improved I/O schedulers, which can improve cluster performance on the kernel side.
- Increasing the compatibility of the coLinux kernel driver with Windows 2000 and XP. Needless to say, there are still some incompatibility issues between coLinux and the Windows kernel. (Note: coLinux doesn't support Windows 95/98/ME, and that is an issue too.)
- Making coLinux use the host IP address. So far, the coLinux instance uses its own IP address, which can waste lots of IP addresses in a large cluster pool. IPSec tunneling can help by wrapping the whole guest network data into a single IPSec stream (again, as used in CosMos), but IPSec itself adds a considerable amount of latency. Another solution would be IP-over-IP tunneling, which could avoid the time penalty of encryption but requires more implementation effort.
One project determined to offer ready-to-use coLinux/openMosix binaries with a compact filesystem is CosMos, developed by Ian Latter. A nice feature of CosMos is that it uses an IPSec tunnel to create secure end-to-end communication between the cluster nodes. (You can contact Ian at Ian.Latter@mq.edu.au.)
Another project (mainly developed by Christian Kauhaus) doesn't aim at HPC but, like the well-known Condor Project, is designed for high throughput computing clusters. What it has in common with CosMos is the goal of creating a ready-to-use package, but it differs in other areas such as:
- Network connection; for performance reasons, it doesn't use an IPSec tunnel.
- The Configuration Server; its task is to handle the configuration, joining, and leaving of the non-dedicated nodes to the cluster.
This project is still in an early alpha stage -- it should be fascinating to watch it evolve.
- Download the source code for the benchmarks in this article.
- For more information on coLinux internals, see this presentation, available in OpenOffice Presentation format and Adobe Acrobat format.
- You can download the Linux kernel, coLinux 0.6.1 sources, and openMosix 2.4.26 patch to build your own custom kernel image.
- Detailed steps to build the coLinux/openMosix kernel image can be found in the KernelHowTo.
- Specific coLinux documentation on building both the kernel and the user space tools is in the coLinux Building document.
- ClusterKnoppix (a Knoppix-remastered LiveCD with an openMosix-enabled kernel) can be grabbed from the ClusterKnoppix homepage.
- CosMos (coLinux/openMosix kernel packaged with small root filesystem and an installer for Windows) can be downloaded at the CosMos homepage.
- On the official openMosix site you can find latest news, patch releases, documentation, community contributions, and mailing list archives.
- Get the WinPCap driver here.
- This step-by-step guide shows you how to build a running system based on an openMosix/coLinux kernel.
- The author offers a complete ready-to-use coLinux/openMosix development test package (user space tool and kernel image, but not the filesystem image).
- The Condor Project is a high-throughput computing cluster, an example of a batch scheduler.
- Visit the site of the coLinux/openMosix integration project.
- Advantages of openMosix on IBM xSeries (developerWorks, October 2002) is a three-part series that shows you how to design, set up, and run an openMosix mini-cluster.
- Linux clustering cornucopia (developerWorks, May 2000) can still answer the question: "Which cluster is for you?"
- IBM alphaWorks offers a Cluster Starter Kit for Linux, an application that enables creation and monitoring of a cluster of up to six nodes from a single point of control.
- The Redbook Linux Handbook: A Guide to IBM Linux Solutions and Resources offers a guide to understanding the interlocking Linux solutions from IBM.
- Craft a load-balancing cluster with ClusterKnoppix (developerWorks, December 2004) demonstrates planning for load balancing in a cluster and using LiveCDs.
- And in case you want to go ahead with a partial or full migration to Linux, try the Redbook Linux Client Migration Cookbook: A Practical Planning and Implementation Guide for Migrating to Desktop Linux.
Mulyadi Santosa lives in East Java, Indonesia, and works as a freelance writer and IT consultant. His interests are SSI clustering and networking technology. Open source virtualization projects such as QEMU and coLinux help Mulyadi experiment with clustering without using a real PC. In his spare time, he likes to watch movies and practice martial arts.