All about pseudo, Part 2: Under the hood

Part 2 of this series details how pseudo's root emulation works by tracking the path of an intercepted call to the database and back. Get a detailed explanation of various mechanisms in pseudo: the basic IPC model, interactions with the database, and an analysis of what exactly happens when the client needs to talk to the server. If you want to replace open(2) with your own code, this is where you find out how.

Peter Seebach (dw-nospam@seebs.net), member of technical staff, Wind River Systems

Peter Seebach has been messing about with source code since before it was fashionable. He has worked on everything from language standardization to mouse drivers.



17 May 2011


Overview

This article, the second in a series about the pseudo utility, goes into more comprehensive technical detail.

You could just read the source code; you don't have to read this. However, the source is full of interesting special cases and quirks that might be tricky to understand, so what follows is more detail about how and why it works.

The first thing to know about how pseudo works

You know all those times people have told you not to do something, because it's "undefined behavior", or that you don't really need to know the internals? This is the exception. The stuff pseudo does is, in many cases, very intentional and premeditated abuse of the host system's C implementation. The portability issues are, by intent, reasonably specific. The code in its current form might have problems on Linux™ for some architectures, but it works decently on 32-bit and 64-bit x86 Linux. More recently, we've also had it working, at least in part, under Mac OS X.


More on dynamic linking

Pseudo redefines open(2). That is a pretty scary concept, and it doesn't get any less scary after you've done it and proven that it works. When you have a program linked dynamically against the C library, most of the things you think of as "system calls" are really function calls into wrappers in the C library. Those wrappers may be very trivial, or they may do some real setup work to handle special cases or set things up in a particular way before transferring control to the kernel.

This matters, because it's much more difficult to intercept actual raw system call operations. You can do it (see strace(1) for some information about what this would look like), but it's hard, and it requires a separate process that runs yours (much like a debugger). That's a fair bit of overhead, and the interception isn't cheap at all. So instead of the system calls, we intercept the function calls that actually get made.

When a library is in LD_PRELOAD, it is loaded after the main binary, but prior to the libraries that are explicitly linked into the target binary. If you look at the output of ldd on a given binary, that shows the libraries that would normally be loaded when running it; all of those are loaded after the libraries listed in LD_PRELOAD. These details are different on Mac OS X, which uses a different dynamic loader, but the rest of the behavior is, surprisingly enough, the same.

With dynamic linking, it's not a problem to have multiple libraries that define the same symbol. The "first" library that defines the symbol is the one that gets used. On some systems, that means that you always get the top-level one. On Linux and some other systems, there's an additional feature that lets you request the "next" definition in turn: you call dlsym(RTLD_NEXT, "open").

This call, should it succeed, yields the address of the first copy of the function found in a library after the one in which the call occurs. (If you want to find the version from a specific library, you have to open that library with dlopen() and use the resulting handle instead of RTLD_NEXT.) In pseudo, the address returned for dlsym(RTLD_NEXT, "open") is stored in a function pointer named real_open().

The RTLD_NEXT feature is an extension, and other systems may not have it or implement it differently. The Linux version is intended specifically for creating wrappers.
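
As a rough sketch of the lookup described above (not pseudo's actual code), the following resolves the underlying open() once and caches it. The helper names lookup_underlying() and get_real_open() are made up for illustration. The fallback branch uses the dlopen() route mentioned earlier and assumes the glibc soname "libc.so.6"; on older glibc versions you also need to link with -ldl.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc only exposes RTLD_NEXT if this is set before any #include */
#endif
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Pointer type matching the variadic prototype of open(2). */
typedef int (*open_fn)(const char *path, int flags, ...);

/* Hypothetical helper: find the definition of `name` that comes after
 * the object containing this code in the loader's search order. */
static void *lookup_underlying(const char *name) {
    assert(name != NULL);
#ifdef RTLD_NEXT
    return dlsym(RTLD_NEXT, name);      /* NULL if no later object defines it */
#else
    /* No RTLD_NEXT here: open a specific library and use its handle.
     * "libc.so.6" is an assumption that only holds on glibc systems. */
    void *libc = dlopen("libc.so.6", RTLD_LAZY);
    return libc != NULL ? dlsym(libc, name) : NULL;
#endif
}

/* Hypothetical helper: resolve the underlying open() once and cache it. */
static open_fn get_real_open(void) {
    static open_fn real_open;
    if (real_open == NULL) {
        /* ISO C does not define void * to function pointer conversion,
         * but POSIX requires it to work for values returned by dlsym(). */
        real_open = (open_fn)lookup_underlying("open");
    }
    return real_open;
}
```

When this code lives in a preloaded library, the pointer returned refers to the next open() in line, normally the C library's, so a wrapper can call it without calling itself.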

Clashes with the user program

When a symbol is looked up, even the LD_PRELOAD library is looked at after the main executable image. During development, there was a brief period when we had functions named getvar() and setvar(). This, however, produced a surprising behavior; those functions did not work as expected when the executable being run was the awk utility. Renaming the functions fixed it. This reinforced my belief that pseudo has to try hard to avoid clashing with names used by anything else.

Because of a similar clash, a particularly odious workaround remains elsewhere in the code; on one of the target systems pseudo is obliged to run on, /usr/bin/find provides its own local implementations of regcomp() and regexec(), which happen to be incompatible with the versions in the C library. I may yet replace the calls in question with hand-coded string operations, since they're only used with a few very specific regular expressions.

Direct and variant system calls

A lot of glibc internals that make system calls make them directly inline rather than calling through to the standard "wrappers". For example, if you call fopen(3), there may be no point at which any of the open(2) variants in the library get called. Because of this, pseudo has to intercept fopen(3) and use fileno(3) on the resulting FILE * in order to find out the file descriptor (if any) associated with a given name. This wasn't needed in fakeroot, which didn't track names, and thus didn't care about file opening.

Additionally, I ran into variants. Many syscalls have more than one variant. For example, to intercept calls to open(2), I have to intercept all of the following functions:

      open64
      __openat_2
      __openat64_2
      openat64
      openat
      open

Each of these can be generated by at least one program that uses it as the only way a file gets created. Failing to intercept these makes it very easy for people to end up with files that should have been noticed as being created and owned by virtual "root", but which are not actually recorded. Whoops.


What really happens

Now let's discuss what happens during a call through a wrapped function. Although there are a few special cases and exceptions I don't cover here, this is the basic design.

What happens when your program calls stat(2)? Your call ends up directed by glibc to a wrapper function with a name like __xstat(). However, because the pseudo library provides this function, you end up calling the pseudo library's function named __xstat(). That function, in turn, ensures that the pseudo wrappers are set up, and that access to the underlying system calls is available, and then it runs a wrapper function. The wrapper function does whatever it thinks it needs to in order to pretend to have made the right underlying system call. For __xstat(), this means that the wrapper actually uses real___xstat() to get a real struct stat buffer describing the file, which is used to send a query to the server.

The query to the server goes through client code that assembles a message in the internal format used by pseudo, including the fully canonicalized path name and information from the underlying real___xstat() call. The server receives the message, and searches the database for a matching entry. If it finds one, it populates a message with the recorded mode, owner, and group, and returns a message indicating successful operation. If it doesn't, it returns a message indicating a failed operation.

The client code hands this message up to the wrapper function. If the message indicates success, information from the returned message gets stuffed into the struct stat object, replacing the real data with the recorded data from the virtual file system. The return code is handed back to the wrapper, which hands it back to your code. With a slightly larger latency than you might have expected, you've got a return from your stat(2) call, and it has information that looks exactly like what you'd have received had you been running with real root privileges the entire time.

Structure of a wrapper

For each intercepted call, there is a corresponding wrapper with the same name and signature as the intercepted call. The basic wrapper, expressed as pseudocode [sic.], began roughly as shown in Listing 1.

Listing 1. A replacement system call
int
fchmod(int fd, mode_t mode) {
    int rc;
    if (!setup_wrappers()) {
        errno = ENOSYS;
        return -1;
    }
    if (already_in_pseudo) {
        rc = real_fchmod(fd, mode);
    } else {
        rc = wrap_fchmod(fd, mode);
    }
    return rc;
}

In this example, wrap_fchmod() is another function, defined internally in pseudo, which actually provides the implementation of the modified fchmod(). By contrast, real_fchmod() is a function pointer, which is populated by setup_wrappers(). The setup_wrappers() function only does this setup once; it then sets a static variable to indicate that it's already done, so future calls can return quickly.

The "already_in_pseudo" test addresses another concern; what happens if, during the implementation of a given call, pseudo needs to make other system calls? It doesn't want those calls intercepted, so it sets the already_in_pseudo flag (actually spelled "antimagic") when it's about to do something that might require such handling. (In fact, it's a counter, although as far as I know it's never exceeded 1.)

The wrap_fchmod() function comes in two parts; a generic preamble and epilogue, and a #include to grab the code for that specific function; the internals of each function are stored in a subdirectory called "guts", which holds the actual extra work for wrapping. The preamble and epilogue provide standard setup. Listing 2 shows the current implementation of wrap_fchmod():

Listing 2. A sample wrapper
static int
wrap_fchmod(int fd, mode_t mode) {
    int rc = -1;

#include "guts/fchmod.c"

    return rc;
}

This is one of the few times you will ever see an experienced C programmer write a #include directive naming a .c file. This design allows additional work to be done in these functions, and not duplicated across multiple function implementations. For example, for functions that take optional arguments, the wrap_*() function extracts the optional argument. This is one of the many things pseudo does that is technically undefined behavior, because it's done unconditionally, and it is an error to call va_arg() when no additional arguments were passed. Listing 3 shows the wrap_open() function:

Listing 3. A wrapper with an optional argument
static int
wrap_open(const char *path, int flags, ... /* mode_t mode */) {
    int rc = -1;
    va_list ap;
    mode_t mode;

    va_start(ap, flags);
    mode = va_arg(ap, mode_t);
    va_end(ap);

#include "guts/open.c"

    return rc;
}

This way, the internal function can refer to the "mode" argument unconditionally, without having to use the va_*() macros. (We can't just declare open() as taking three arguments, because that declaration clashes with the standard header.)
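
For comparison, here is a sketch of an extraction that stays within defined behavior by reading the mode only when the caller was required to pass one. This is not what pseudo does, and sketch_open_mode() is a made-up name. It checks only O_CREAT; on Linux, O_TMPFILE also carries a mode, which this sketch ignores.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdarg.h>
#include <sys/types.h>

/* Return the mode argument of an open()-style call, or 0 if the flags
 * say that no mode argument was passed. */
static mode_t sketch_open_mode(int flags, ...) {
    mode_t mode = 0;
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        /* Variadic arguments undergo default promotion, and mode_t is
         * narrower than int on some systems, so read the value as int. */
        mode = (mode_t)va_arg(ap, int);
        va_end(ap);
    }
    return mode;
}
```

The cost is that every guts file for such a function would need to know that the mode may be meaningless, which is the complication the unconditional version avoids.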

Paths, locks, and signals

Remember how I said the wrapper began a particular way? There have been three major changes since then. First, paths are now canonicalized in the wrapper. In the original implementation, pseudo didn't really canonicalize paths, it just fixed up .. and . path entries. This led to occasional issues involving symbolic links in paths, so now pseudo fully canonicalizes paths (mostly; when pseudo wraps functions that operate on links, the final component of the name is not dereferenced). Since this is done consistently, it is also usually done in the wrapper.
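
The older approach, fixing up . and .. entries without looking at the filesystem, can be sketched as follows. The function lexical_clean() is hypothetical and handles absolute paths only. It also shows why the approach breaks down: it collapses "/link/.." to "/" even when link is a symbolic link pointing somewhere else entirely.

```c
#include <assert.h>
#include <string.h>

/* Collapse "." and ".." components and repeated slashes, lexically only.
 * Symbolic links are never consulted. `out` must have room for at least
 * strlen(path) + 2 bytes. */
static void lexical_clean(const char *path, char *out) {
    size_t len = 0;              /* bytes written to out so far */
    const char *p = path;

    assert(path != NULL && out != NULL);
    while (*p != '\0') {
        const char *start;
        size_t n;

        while (*p == '/') {
            p++;                 /* skip one or more separators */
        }
        start = p;
        while (*p != '\0' && *p != '/') {
            p++;                 /* scan to the end of this component */
        }
        n = (size_t)(p - start);

        if (n == 0 || (n == 1 && start[0] == '.')) {
            continue;            /* empty component or ".": drop it */
        }
        if (n == 2 && start[0] == '.' && start[1] == '.') {
            /* "..": remove the last component written, if there is one */
            while (len > 0 && out[--len] != '/') {
            }
            continue;
        }
        out[len++] = '/';
        memcpy(out + len, start, n);
        len += n;
    }
    if (len == 0) {
        out[len++] = '/';        /* everything cancelled out: the root */
    }
    out[len] = '\0';
}
```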

In what seemed clever at the time (I now almost regret it), the canonicalization code is automatically invoked whenever an argument has a name ending in "path". This means that a few functions are defined in pseudo with argument names differing from the canonical names in the man pages. For many of the *at() calls, such as openat(2), there is an additional "flags" argument that tells the call whether to follow symlinks or not. The pathname canonicalizer also takes such an argument. By default, if the function has an argument named flags, this is simply passed on to the canonicalizer. This decision also seemed clever at the time, but caused some strange bugs (such as when used with the plain open(2) call, which has a "flags" argument with different semantics).

Second, we added locking for multi-threaded programs. In the original design, without locking, any syscalls made by one thread while another thread was talking to the pseudo server bypassed the pseudo design. There is now a locking mechanism to ensure that only one thread at a time goes through any of the wrapped calls.

This led to a third problem: Programs that take signals for which the signal handlers invoke wrapped calls. To handle this, pseudo blocks certain signals when starting a wrapper, and unblocks them at the end of the wrapper. This worked fairly well, except for one particularly pernicious bug. Because the signals are unblocked at the end of the wrapper, and pseudo wraps execve(2) and friends, a new process invoked by execve(2) would start with those signals blocked, and some programs never reset or fixed up their signals. This has been corrected.

The resulting wrapper function is too long to present inline as a code example, but it seems to work pretty well. You can read it in the pseudo source tree, online on GitHub (see Resources).


Getting into the guts

Implementations in the guts directory can be trivial. For example, Listing 4 has the implementation provided for the acct(2) syscall:

Listing 4. The simplest case for guts
/*
 * Copyright (c) 2010 Wind River Systems; see
 * guts/COPYRIGHT for information.
 *
 * static int
 * wrap_acct(const char *path) {
 *      int rc = -1;
 */

    rc = real_acct(path);

/*      return rc;
 * }
 */

Note the similarity between the commented code and the wrapper; that's to give you a sense of what code you're trying to write. However, some are a lot more complicated. The guts code has to figure out if anything needs to be asked of the server, and if so, what. In some cases, the underlying system call is completely omitted, or replaced by a completely different call. For example, the mknod(2) syscall is implemented by creating a plain file, then recording the requested file type in the pseudo database.

Something must be done about this

If a wrapper function needs to record or take note of an action, or needs to query the pseudo database, it uses the pseudo_client_op() function to perform an operation. The signature of this function is shown in Listing 5.

Listing 5. Signature of pseudo_client_op()
pseudo_msg_t *
pseudo_client_op(
    op_id_t op,
    int access,
    int fd,
    int dirfd,
    const char *path,
    const struct stat64 *buf,
    ...);

The "..." is a historical quirk. Before rename operations were supported, the function took only the fixed arguments. When I added rename operations, at the time it seemed like a good idea to make the last argument (which would otherwise have been const char *oldpath) optional. I'm still not sure whether that was the right call.

The op_id_t type is an enumeration containing the list of supported operations, such as OP_CHROOT, OP_OPEN, OP_CLOSE, or OP_CHMOD. The "access" parameter denotes the type of access (read, write, or execute). The "fd" parameter is the file descriptor being operated on. The "dirfd" parameter was originally used for canonicalizing paths being used with the *at() syscalls such as openat(2). Since then, path canonicalization has moved out of pseudo_client_op(), and now that field is used only for handling OP_DUP operations, such as calls to dup(2). The name is now clearly wrong. I will fix it in my copious free time. The path name is the name available to the call being wrapped, and the stat64 buffer is stat information about the file being operated on.

In many specific cases, some of these fields may be set to sentinel values because they are not relevant to a particular operation.

Some messages are handled entirely within the client. For example, the client maintains a table of the current paths of file descriptors, so a path can be sent to the server along with an operation such as fchmod(2). The OP_CLOSE operation updates that internal table, but is not sent to the server. Operations are sent to the server only when they need to interact with the database or be logged. (In a future revision, I hope to separate those two questions more clearly.)
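
A client-side table of that kind could be as simple as the following. The fixed size and the function names are assumptions made for the sketch; nothing here reflects how pseudo actually stores its table.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_FDS 64           /* arbitrary limit for the sketch */

static char *fd_paths[SKETCH_MAX_FDS];   /* path last opened on each descriptor */

/* Record (or, with a null path, forget) the path behind a descriptor.
 * This is purely local bookkeeping; nothing is sent anywhere. */
static void sketch_note_open(int fd, const char *path) {
    if (fd < 0 || fd >= SKETCH_MAX_FDS) {
        return;
    }
    free(fd_paths[fd]);             /* the descriptor may have been reused */
    fd_paths[fd] = NULL;
    if (path != NULL) {
        size_t n = strlen(path) + 1;
        fd_paths[fd] = malloc(n);
        if (fd_paths[fd] != NULL) {
            memcpy(fd_paths[fd], path, n);
        }
    }
}

/* The close case only updates the table. */
static void sketch_note_close(int fd) {
    sketch_note_open(fd, NULL);
}

/* Look up the path to send along with an fd-based operation. */
static const char *sketch_fd_path(int fd) {
    if (fd < 0 || fd >= SKETCH_MAX_FDS) {
        return NULL;
    }
    return fd_paths[fd];
}
```

With this in place, an operation that has only a file descriptor to work with, such as fchmod(2), can still tell the server which path it is talking about.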

Actually sending messages to the server

IPC between the client and server is done through a UNIX®-domain socket. If the socket does not exist, or cannot be opened, the client attempts to spawn a server. Messages consist of a standard header plus optional path information. Messages have a number of types, such as PSEUDO_MSG_PING, PSEUDO_MSG_OP, or PSEUDO_MSG_ACK. The PSEUDO_MSG_OP message type indicates that an operation, such as OP_CHMOD or OP_STAT, is being submitted to the server. The client writes a message to the socket; the server processes the message and writes its response back with type PSEUDO_MSG_ACK. Messages referring to a file generally contain the file's filesystem mode (including both permissions and file type), the device and inode numbers, and the path name. For operations that have two paths to send, the first path is followed by a null byte and then the second path, and the pathlen field of the message is set to the total length.
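
The two-path layout can be sketched like this. The struct sketch_msg header is hypothetical and much smaller than the real message header, and counting both terminating null bytes in pathlen is an assumption of the sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical message: a tiny header followed by the path bytes. */
struct sketch_msg {
    int type;          /* such as an "operation" or "ack" message type */
    int op;            /* which operation is being submitted */
    size_t pathlen;    /* total bytes in path[], null bytes included (assumption) */
    char path[];       /* first path, a null byte, then the optional second path */
};

/* Build a message carrying one or two paths. The caller frees the result. */
static struct sketch_msg *sketch_pack(int type, int op,
                                      const char *first, const char *second) {
    size_t len1, len2;
    struct sketch_msg *m;

    assert(first != NULL);
    len1 = strlen(first) + 1;
    len2 = second != NULL ? strlen(second) + 1 : 0;
    m = malloc(sizeof *m + len1 + len2);
    if (m == NULL) {
        return NULL;
    }
    m->type = type;
    m->op = op;
    m->pathlen = len1 + len2;
    memcpy(m->path, first, len1);
    if (second != NULL) {
        memcpy(m->path + len1, second, len2);
    }
    return m;
}

/* The receiver finds the second path, if any, just past the first null byte. */
static const char *sketch_second_path(const struct sketch_msg *m) {
    size_t len1 = strlen(m->path) + 1;
    return len1 < m->pathlen ? m->path + len1 : NULL;
}
```

The receiver needs no extra field to tell one path from two: if pathlen extends past the first null byte, there is a second path.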

The server performs multiple sanity checks on incoming requests. It first looks to see whether it has a database entry matching both the path and the device and inode numbers from the request. If not, it checks separately for matches by path and for matches by device and inode number. If one of these partial matches succeeds where the exact match failed, a diagnostic is often reported. (It is not reported in cases, such as a rename, where a name mismatch is expected.) If file types mismatch in an unexpected way, such as one being a directory when the other isn't, diagnostics are recorded and the existing database entry is destroyed. However, a file type mismatch where the filesystem contains a plain file and the database records a device node is harmless; this is how pseudo emulates creation of device nodes.
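
The order of those lookups can be modeled with an array standing in for the database. The row type, the enum names, and sketch_classify() are all invented for this sketch, and the real server queries a database rather than scanning a table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical row: just the keys used for matching. */
struct sketch_row {
    const char *path;
    dev_t dev;
    ino_t ino;
};

enum sketch_match {
    SKETCH_NONE,         /* nothing known about this file */
    SKETCH_EXACT,        /* one row matches path, device, and inode */
    SKETCH_PATH_ONLY,    /* path is known, but under a different inode */
    SKETCH_INODE_ONLY,   /* inode is known, but under a different path */
    SKETCH_BOTH_DIFFER   /* path and inode match two different rows */
};

/* Try the exact match first, then fall back to the partial matches. */
static enum sketch_match sketch_classify(const struct sketch_row *rows, size_t n,
                                         const struct sketch_row *req) {
    int by_path = 0, by_inode = 0;
    size_t i;

    assert(req != NULL && (rows != NULL || n == 0));
    for (i = 0; i < n; i++) {
        int p = strcmp(rows[i].path, req->path) == 0;
        int d = rows[i].dev == req->dev && rows[i].ino == req->ino;
        if (p && d) {
            return SKETCH_EXACT;
        }
        by_path |= p;
        by_inode |= d;
    }
    if (by_path && by_inode) {
        return SKETCH_BOTH_DIFFER;
    }
    if (by_path) {
        return SKETCH_PATH_ONLY;
    }
    return by_inode ? SKETCH_INODE_ONLY : SKETCH_NONE;
}
```

In this model, a rename shows up as SKETCH_INODE_ONLY, the expected case where no diagnostic is wanted. The other partial results are the ones worth reporting.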


Coming up next in Part 3

Usually, I find that I learn more from my mistakes than from what works the first time. So, in the next installment of this series, I will go through some of the interesting bugs and failures we've had since starting this project, and how we've handled and learned from them.

Resources

Learn

Get products and technologies

  • IBM trial software: Innovate your next open source development project using trial software, available for download or on DVD.

Discuss

  • developerWorks community: Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.
