Attending college in the North Country is an excellent way to receive a good education, especially at Clarkson University, because of how isolated the region is from the rest of the world. It's nearly two hours to a major metropolitan area (depending on how fast you drive). This situation is conducive both to study and to Linux development. The conditions here are startlingly similar to those that Linus Torvalds described when he began writing Linux: the weather is invariably cold and dreary, because it is winter during most of a student's time here. Wind chills of 20 to 30 degrees below zero make locking yourself in a computer lab for an all-night coding session sound inviting. In fact, it's because of these all-night coding sessions that I've become very partial to Linux development. One specific coding session is what really got me started on Linux related projects.
I was working on a small personal project that used Linux to cluster some machines. I wanted to boot the machines from the network so that I could centralize the kernel images, but I wasn't having much luck. After many daytime attempts, I decided to come back in the evening after dinner, certain that I could figure out the problem in a couple hours. I was a poor judge of time; 3 in the morning came, and I was still unable to get the machines to boot. However, I learned that I do my best work in the wee hours of the morning -- I was about to have a breakthrough. After reading all of the documentation and double-checking my work, I finally realized why the average person doesn't read the directions: they are often wrong. Instead of giving up, I dove headfirst into the kernel source and found a couple of bugs. I can't remember where they were exactly, but it was a simple fix somewhere in the IP autoconfigure source.
After making the kernel conform to the documentation, my problem was solved. Once I had worked this out, I was able to finally concentrate on what I really wanted to do: set up a small cluster. This small bug was probably already fixed in the newest kernel source, but it started me down a long and often sleepless road, playing with the kernel that would begin with the Advanced Operating Systems class and progress to the IBM Linux Scholar Challenge. Oh, and in-between I helped build the Clarkson Open Source Institute (COSI), ate, went to my other classes, and caught a few hours of sleep if I was lucky.
Advanced Operating Systems: Papers, papers, and still more papers
The idea to develop projects for the IBM Linux Scholar Challenge came out of a presentation at the first Association for Computing Machinery/Linux Users' Group meeting on campus. A few influential attendees expressed interest in the contest. One of those was Dr. Jeanna Matthews, who was teaching my Advanced Operating Systems class. She asked if I thought the IBM Linux Scholar Challenge would be a better project for our class to work on than the proposed project, which involved implementing a fair-share scheduler for Linux. I immediately said that it would be better; after all, who would pass up doing something cool when the alternative is a fair-share scheduler? (I apologize to those who think that implementing a fair-share scheduler would be fun, but I'd rather devise my own project.) With no time to spare (the deadline was less than a week away), the class voted to work on Linux Challenge projects. However, that left us all with the same problem: what do we do for projects?
Lack of ideas didn't last long: how could they in an operating systems class that studied past and current OS research papers? Each day in class we discussed operating system topics ranging from the days of UNIX and Multics to the best paper for the Symposium on Operating System Principles 2001. The amount of information we were consuming was staggering, and from these papers grew our IBM Linux Scholar Challenge projects. My original idea was to implement partial file caching for the Andrew File System. Unfortunately, I could never figure out whether the newly open-sourced AFS, OpenAFS, implemented it. Fortunately for me, there was another project that Bryan Clark had come up with.
Bryan had been scanning the Linux kernel source code during an undergraduate operating systems class we were both in during the spring semester. He found that kernel developers had set up a framework for global user resource tracking, but hadn't quite found the time to start it. His project was plain and simple: implement global user resource tracking. The main goal of the project was to replace the getrlimit and setrlimit system calls with a system that was more flexible and manageable. I liked his idea so much that I abandoned the AFS partial file caching project to work with him on resource tracking.
I believe that programming is a state of mind. The programmer must carefully think through the problem at hand to code a solution that is both functional and manageable. Bryan and I decided that it would be best to begin programming at 8 p.m. or later. This is an unspoken rule among computer scientists: no coding during daylight hours. (In fact, the real coding doesn't begin until 2 a.m. -- which is still very early for productive work to be done.) Now, take this and multiply it by seven and you'll have what we experienced for a week.
With all this talk of how difficult coding can be, you'd think that we were a couple of freshmen writing our first "hello world" program. Well, you're wrong. We had it all planned out. Limiting a user or group to a specific amount of resources was trivial, all we had to do was find where that resource was consumed or released and wrap that in a conditional to account for the user's actions. Here's the process accounting:
Linux process accounting
int do_fork(unsigned long clone_flags, unsigned long stack_start,
struct pt_regs *regs, unsigned long stack_size)
{
int retval;
struct task_struct *p;
.
.
/* Cut to relavent portion of the code */
.
.
atomic_inc(&p->user->__count);
/* User Resource Tracking
*/
if(p->euid != 0) {
/* This is a little bit more complicated with resource groups
* I must first check to see if the group has enough resources
* to allow the user
* to continue. Then I must see if the user can continue
* because the user may
* have been capped or something else might be limiting the user.
* If the user can't allocate the resources, we back the resource
* out of the group
*/
if(atomic_read(&get_current_resource_group()->processes) >=
atomic_read(&get_current_resource_group()->max_processes)) {
goto bad_fork_free;
}
atomic_inc(&get_current_resource_group()->processes);
if(atomic_read(&get_current_user()->processes) >=
atomic_read(&get_current_user()->max_processes)) {
/* Back the newly allocated resource out of the group
*/
atomic_dec(&get_current_resource_group()->processes);
goto bad_fork_free;
}
}
.
.
.
}
from kernel/fork.c
|
In this example most of my work was done already because it is much more efficient to count the number of processes as the user creates them than walk the process tree each time the data is needed. This made my job easy: the only thing I had to add was the per-group conditional around Bryan's per-user conditional. When the process exited, get_current_user()->processes was decremented. One evening I thought of a possible race condition (I don't recall the situation right now) which made us think we had to lock these variables. Other than that, it was simple. Really it was. In fact, it was so simple the whole process should have taken a couple of days.
Resource tracking: group style
As with any project, dividing the workload is key, as is buying the correct caffeinated beverage and sugar source. For Bryan and me, this entailed a short drive to the local grocery store for some Coke and chewy chocolate chip cookies. I highly recommend both for long programming sessions. Since Bryan thought of the idea to do per-user resource tracking, I decided that I would work on per-group resource tracking. The theory behind my group tracking scheme is best summed up in the following hypothetical situation.
Suppose you run a business that has just invested a lot of money in a fairly powerful server. This machine is capable of serving your entire business's needs and you want to maximize its efficacy. To do so, you analyze the computing resources needed for each of your departments. The finance department only needs to store some records and occasionally needs access to do some database work: everyone in the department is put into a group and they collectively receive 15% of the machine's resources. The engineers are the ones that really need the computing power: they are grouped accordingly and given 75% of the machine. The remaining 10% is reserved for when the system administrator logs on to make sure everything is running smoothly.
This is a simplified and somewhat incorrect portrayal of a business situation, but it shows what I wanted to implement. I was also in charge of the user space utility, chres, used to load the user and group resource limits in to the kernel as well as monitor a specific user or group's resource consumption. Without it, we wouldn't have been able to test the kernel thoroughly. That part of the project was tedious,so I'll leave it your imagination (HINT: It looks and acts like chmod and chgrp and all those tools).
The first lesson I learned as programmer is that when someone else writes good code use it whenever possible. As the following shows, the kernel already had the data structure and functions necessary to store per-user resource information:
Per-user datatype
/*
* UID task count cache, to get fast user lookup in "alloc_uid"
* when changing user ID's (ie setuid() and friends).
*/
#define UIDHASH_BITS 8
#define UIDHASH_SZ (1 << UIDHASH_BITS)
#define UIDHASH_MASK (UIDHASH_SZ - 1)
#define __uidhashfn(uid) (((uid >> UIDHASH_BITS) ^ uid) & UIDHASH_MASK)
#define uidhashentry(uid) (uidhash_table + __uidhashfn(uid))
struct user_struct root_user = {
__count: ATOMIC_INIT(1),
processes: ATOMIC_INIT(1),
files: ATOMIC_INIT(0),
sockets: ATOMIC_INIT(0),
mem: ATOMIC_INIT(0),
max_files: ATOMIC_INIT(SYSTEM_GROUP_FILE_MAX),
max_sockets: ATOMIC_INIT(SYSTEM_GROUP_SOCKET_MAX),
max_processes: ATOMIC_INIT(SYSTEM_GROUP_PROCESS_MAX),
max_mem: ATOMIC_INIT(SYSTEM_GROUP_MEM_MAX),
};
static inline void uid_hash_insert(struct user_struct *up, struct user_struct **hashent);
static inline void uid_hash_remove(struct user_struct *up);
static inline struct user_struct *uid_hash_find(uid_t uid, struct user_struct **hashent);
void free_uid(struct user_struct *up);
from kernel/user.c
|
Naturally I wanted to find a way to reuse as much of the existing code for the data structure and functions as possible, so that I didn't have to rewrite a lot of bug-tested code. The best way to accomplish this is to think of a group as a normal user with an extremely large pool of resources and then assign each user a group identifier that is stored in their user struct, as seen here:
Modifications for resource groups
struct user_struct root_user = {
__count: ATOMIC_INIT(1),
processes: ATOMIC_INIT(1),
files: ATOMIC_INIT(0),
sockets: ATOMIC_INIT(0),
mem: ATOMIC_INIT(0),
max_files: ATOMIC_INIT(SYSTEM_GROUP_FILE_MAX),
max_sockets: ATOMIC_INIT(SYSTEM_GROUP_SOCKET_MAX),
max_processes: ATOMIC_INIT(SYSTEM_GROUP_PROCESS_MAX),
max_mem: ATOMIC_INIT(SYSTEM_GROUP_MEM_MAX),
group_uid: 0
};
/* System group; any non-assigned users go in to this group
* including root
*/
struct user_struct system_group = {
__count: ATOMIC_INIT(1),
processes: ATOMIC_INIT(1),
files: ATOMIC_INIT(0),
sockets: ATOMIC_INIT(0),
mem: ATOMIC_INIT(0),
max_files: ATOMIC_INIT(SYSTEM_GROUP_FILE_MAX),
max_sockets: ATOMIC_INIT(SYSTEM_GROUP_SOCKET_MAX),
max_processes: ATOMIC_INIT(SYSTEM_GROUP_PROCESS_MAX),
max_mem: ATOMIC_INIT(SYSTEM_GROUP_MEM_MAX)
};
/* Resource group hash table information
*/
static kmem_cache_t *resource_group_cachep;
static struct user_struct *grouphash_table[UIDHASH_SZ];
static spinlock_t grouphash_lock = SPIN_LOCK_UNLOCKED;
#define grouphashentry(uid) (grouphash_table + __uidhashfn(uid))
from kernel/user.c
|
Then, whenever the user logs into the system the user's information and group association are loaded into the kernel using the same functions with nearly zero modifications. The group_uid in the user_struct is important. Every user is a member of the default system_group until their group assignment is determined. This is an excellent example of a design where we preserved backwards compatibility, provided new functionality, and didn't introduce untested code. By forcing each user into the system_group by default, the system administrator can compile in support for resource groups and implement them at a later time making the system immediately usable. There is only one minor difference in the user and group access functions, as shown below:
The single function modified for resource groups
/* User Resource Tracking - Resource Group search
* Find the group specified by the uid passed, if
* the group cannot be found then return the system
* group.
*/
struct user_struct * find_group_uid(uid_t uid)
{
struct user_struct **hashent = grouphashentry(uid);
struct user_struct *up;
spin_lock(&grouphash_lock);
up = uid_hash_find(uid, hashent);
spin_unlock(&grouphash_lock);
if (!up) {
/* User Resource Tracking
* This should never happen because everyone is a member
* of the system group by default
*/
BUG();
}
return up;
}
|
This extra function was necessary due to the way the existing uid_hash_find worked. All of the user hash functions are reused without modification which means the only potential problems arise from the functions' use and not their implementation. One final problem remained: the initalization of the group hash table. The best place for this is where the user hash table is initialized, as shown here:
Initialization of the uid cache
static int __init uid_cache_init(void)
{
.
.
/* Cut to relevant portion of the code */
.
.
resource_group_cachep = kmem_cache_create("res_group_cache", sizeof(struct user_struct),
0,
SLAB_HWCACHE_ALIGN, NULL, NULL);
if(!resource_group_cachep)
panic("Cannot create resource group taskcount SLAB cache\n");
uid_hash_insert(&system_group, grouphashentry(0));
return 0;
}
from kernel/user.c
|
In reusing the same functions and data structures I was able to concentrate on tracking resource use. This kind of code reuse is something that programmers dream about -- I certainly do when working on a project. The Linux kernel is beginning to evolve into a system that has many pre-built, bug-free functions or macros at your disposal that can be used to add more features without duplicating any code. This advancement should have made my job easier because a lot of code could have been reused. If this were true, the bean counting would have been easy: find the function or macro that everything else calls for allocation and deallocation and add my conditionals there. Remember, I want to insert conditionals, like the process conditional, not rewrite entire blocks of kernel code so that I can instrument it.
Unfortunately the single allocation function didn't exist. In fact, it seemed like the exception to the rule. Most kernel programmers were inclined to rewrite many of the pre-existing, tested functions into the particular module or subsystem they were working on. The best example I can give is when an mm_struct needs to be allocated. This well-tested (but not quite generic enough) function works great:
A function with great potential
struct mm_struct * mm_alloc(void)
{
struct mm_struct * mm;
mm = allocate_mm();
if (mm) {
memset(mm, 0, sizeof(*mm));
return mm_init(mm);
}
return NULL;
}
from kernel/fork.c
|
But you'll see that kernel programmers don't like to use this function much (it's only used twice in 2.4.12 and 2.4.17) because it zeros out the mm_struct before returning it. This is bad because there are times when the newly zeroed memory is going to contain something immediately after the mm_alloc call returns. Kernel programmers can avoid that extra memset by inlining all the steps necessary in allocating an mm_struct with the exception of the line that zeros the memory. In case you are wondering: THIS IS BAD! Imagine tracking a resource that was allocated or deallocated in such a manner, and you'll understand where I am coming from. The best way to handle this is to have two functions or macros or some combination of function and macro, where one calls the other, that is used to allocate an mm_struct. Not only does this make my job easier, but it cuts down on the amount of duplicate code within the kernel. I mention this because I found a couple of instances where the same lines of code found in mm_alloc were interspersed within another function's code. We had been having this sort of trouble all night while trying to figure out where three file handles were being "magically" released.
Why you should never limit root's resources
Contrary to popular belief, you should never limit root's resources when the machine is coming up. Especially to only five processes. I don't know how many times I forgot to exclude user ID 0 from the quotas and had to recompile the kernel. This small conditional probably cost me a few hours each evening. The simple fact is: kernel programming requires you to actually think about what you've just written and decide if it really works, or if you are just forcing it to work in your mind. Since Bryan and I were working into the wee hours of the morning, we probably weren't able to think clearly about our code; perhaps we needed more Chinese food or caffeinated beverages. In any case, we ended up writing a couple of tools once the machine was booted so that we could test the limits in a simulated environment. The userspace tool chres was designed to load the initial quotas into the kernel and to be able to change them at run time. It worked remarkably well, considering it was more of an afterthought than a pillar of our grand scheme of limitations.
When Bryan and I finally submitted our entries to the contest, we had been without sleep for nearly a week. It was a fun experience actually; I got quite a bit of work done besides my entry, which I'll attribute to my need to not fall asleep. The next step was to wait for the winners to be announced; a group of people I wasn't among but Bryan was. It was great to have so many from Clarkson University win, even if one of them wasn't me. Not winning actually helped me become motivated to do more for the Linux movement -- I ended up writing a small decoder for my remote control to my soundcard. I was finally able to use all of my hardware in Linux, which means I haven't been back to the darkside since winter break. I think for my next project, I'll get force feedback working for my mouse, but that will have to wait until I have some free time. The best piece of news I have now is that I'll be in Cambridge, Massachusetts this summer with the Extreme Blue program, which is going to be a lot of fun! It's been a wild ride this year and I'm glad I was in the middle of everything.
- Read about the other Clarkson entries in our series.
- View more Linux kernel information.
- Learn about the Clarkson Open Source Institute.
- Get information on the IBM Linux Scholar Challenge.
- Get more Linux articles from the developerWorks
Linux zone.
- Are you a student or a teacher? Visit the IBM Student Portal or the IBM Faculty Portal to find information and offers tailored to your needs.
Stephen Evanchik is currently finishing a Bachelor's Degree in Computer Science at Clarkson University. He is the Technical Lead in the newly founded Clarkson Open Source Institute, a student-driven organization that explores open source software solutions and development. You can contact Stephen at evanchsa@clarkson.edu.
