This article, the second in a series about the pseudo utility, goes into more comprehensive technical detail.
You could just read the source code instead of this article. However, the source is full of interesting special cases and quirks that can be tricky to understand, so what follows explains in more detail how and why it works.
The first thing to know about how pseudo works
You know all those times people have told you not to do something because it's "undefined behavior", or that you don't really need to know the internals? This is the exception. Much of what pseudo does is very intentional and premeditated abuse of the host system's C implementation. The resulting portability issues are, by intent, reasonably specific. The code in its current form might have problems on Linux™ for some architectures, but it works decently on 32-bit and 64-bit x86 Linux. In addition, we have recently had some success getting it to work, at least in part, under Mac OS X.
Pseudo redefines open(2). That is a pretty scary
concept, and it doesn't get any less scary after you've done it and proven
that it works. When you have a program linked dynamically against the C
library, most of the things you think of as "system calls" are really
function calls into wrappers in the C library. Those wrappers may be very
trivial, or they may do some real setup work to handle special cases or
set things up in a particular way before transferring control to the
kernel.
This matters, because it's much more difficult to intercept actual raw
system call operations. You can do it (see
strace(1) for some information about what this
would look like), but it's hard, and it requires a separate process that
runs yours (much like a debugger). That's a fair bit of overhead, and the
interception isn't cheap at all. So instead of the system calls, we
intercept the function calls that actually get made.
When a library is in LD_PRELOAD, it is loaded
after the main binary, but prior to the libraries that are explicitly
linked into the target binary. If you look at the output of
ldd on a given binary, that shows the libraries
that would normally be loaded when running it; all of those are loaded
after the libraries listed in
LD_PRELOAD. These details are different on Mac
OS X, which uses a different dynamic loader, but the rest of the behavior
is, surprisingly enough, the same.
With dynamic linking, it's not a problem to have multiple libraries that
define the same symbol. The "first" library that defines the symbol is the
one that gets used. On some systems, that means you always get the
top-level one. On Linux and some other systems, however, an additional
feature lets you request the "next" symbol in turn: you call
dlsym(RTLD_NEXT, "open").
This call, should it succeed, yields the address of the first copy of the
function found in a library after the one in which the call occurs. (If
you want to find the version from a specific library, you have to open
that library with dlopen() and use the
resulting handle instead of RTLD_NEXT.) In
pseudo, the address returned for
dlsym(RTLD_NEXT, "open") is stored in a
function pointer named real_open().
The RTLD_NEXT feature is an extension, and other
systems may lack it or may implement it differently. The Linux version is
intended specifically for creating wrappers.
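As a minimal sketch of the lookup just described (not pseudo's actual code), the following resolves the next definition of open() and stores it in a function pointer. The names open_fn_t, resolve_real_open(), and open_via_next() are hypothetical; real_open mirrors the name used in the text. I'm assuming a glibc-style system where defining _GNU_SOURCE exposes RTLD_NEXT; older systems also need to link with -ldl.

```c
#define _GNU_SOURCE           /* exposes RTLD_NEXT in <dlfcn.h> on glibc */
#include <assert.h>
#include <dlfcn.h>
#include <fcntl.h>
#include <stddef.h>

/* Pointer type matching the variadic prototype of open(2). */
typedef int (*open_fn_t)(const char *path, int flags, ...);

/* Filled in by resolve_real_open(); NULL until then. */
static open_fn_t real_open = NULL;

/* Look up the next definition of open() after the object containing this
 * code. Returns 0 on success, -1 if no later definition exists. */
int resolve_real_open(void) {
    void *sym;

    if (real_open != NULL)
        return 0;                     /* already resolved */
    sym = dlsym(RTLD_NEXT, "open");
    if (sym == NULL)
        return -1;
    /* POSIX permits converting dlsym()'s void * result to a function
     * pointer, even though ISO C alone does not promise it. */
    real_open = (open_fn_t)sym;
    return 0;
}

/* Call through the resolved pointer, much as a wrapper would.
 * This sketch handles only calls that need no mode argument. */
int open_via_next(const char *path, int flags) {
    assert(path != NULL);
    if (resolve_real_open() != 0)
        return -1;
    return real_open(path, flags);
}
```

In a preloaded library, the function that calls resolve_real_open() would itself be named open(), so "next" would mean the C library's copy. Called from an ordinary executable, as here, the lookup simply finds the C library's open().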
When a symbol is looked up, even the LD_PRELOAD
library is looked at after the main executable image.
During development, there was a brief period when we had functions named
getvar() and
setvar(). This, however, produced a surprising
behavior; those functions did not work as expected when the executable
being run was the awk utility. Renaming the
functions fixed it. This reinforced my belief that pseudo has to try hard
to avoid clashing with names used by anything else.
Because of a similar clash, a particularly odious workaround remains
elsewhere in the code; on one of the target systems pseudo is obliged to
run on, /usr/bin/find provides its own local
implementations of regcomp() and
regexec(), which happen to be incompatible with
the versions in the C library. I may yet replace the calls in question
with hand-coded string operations, since they're only used with a few very
specific regular expressions.
Direct and variant system calls
A lot of glibc internals that make system calls make them directly inline
rather than calling through to the standard "wrappers". For example, if
you call fopen(3), there may be no point at
which any of the open(2) variants in the
library get called. Because of this, pseudo has to intercept
fopen(3) and use
fileno(3) on the resulting
FILE * in order to find out the file descriptor
(if any) associated with a given name. This wasn't needed in fakeroot,
which didn't track names, and thus didn't care about file opening.
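Here is a rough sketch of the fopen(3)-plus-fileno(3) step. It is an illustration only: fopen_noting_fd() is a hypothetical name, and it calls the ordinary fopen() where a real wrapper would call the "next" one.

```c
#define _POSIX_C_SOURCE 200809L   /* for fileno() under strict -std modes */
#include <assert.h>
#include <stdio.h>

/* Open a stream, then report which descriptor backs it.
 * On success returns the stream and stores the descriptor in *fd_out;
 * on failure returns NULL and stores -1. */
FILE *fopen_noting_fd(const char *path, const char *mode, int *fd_out) {
    FILE *fp;

    assert(path != NULL && mode != NULL && fd_out != NULL);
    *fd_out = -1;
    fp = fopen(path, mode);      /* a real wrapper would call the "next" fopen */
    if (fp != NULL)
        *fd_out = fileno(fp);    /* the descriptor to associate with path */
    return fp;
}
```

The caller gets the usual FILE * back, and the descriptor is available for whatever bookkeeping ties names to open files.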
Additionally, I ran into variants. Many syscalls have more than one
variant. For example, to intercept calls to
open(2), I have to intercept the following
syscalls:
open64
__openat_2
__openat64_2
openat64
openat
open
For each of these, there is at least one program that uses it as its only way of creating a file. Failing to intercept any of them makes it very easy to end up with files that should have been recorded as created and owned by the virtual "root", but were not. Whoops.
Now let's discuss what happens during a call through a wrapped function. Although there are a few special cases and exceptions I don't cover here, this is the basic design.
What happens when your program calls stat(2)?
Your call ends up directed by glibc to a wrapper function with a name like
__xstat(). However, because the pseudo library
provides this function, you end up calling the pseudo library's function
named __xstat(). That function, in turn,
ensures that the pseudo wrappers are set up, and that access to the
underlying system calls is available, and then it runs a wrapper function.
The wrapper function does whatever it thinks it needs to in order to
pretend to have made the right underlying system call. For
__xstat(), this means that the wrapper actually
uses real___xstat() to get a real
struct stat buffer describing the file, which
is used to send a query to the server.
The query to the server goes through client code that assembles a message
in the internal format used by pseudo, including the fully canonicalized
path name and information from the underlying
real___xstat() call. The server receives the
message, and searches the database for a matching entry. If it finds one,
it populates a message with the recorded mode, owner, and group, and
returns a message indicating successful operation. If it doesn't, it
returns a message indicating a failed operation.
The client code hands this message up to the wrapper function. If the
message indicates success, information from the returned message gets
stuffed into the struct stat object, replacing
the real data with the recorded data from the virtual file system. The
return code is handed back to the wrapper, which hands it back to your
code. With a slightly larger latency than you might have expected, you've
got a return from your stat(2) call, and it has
information that looks exactly like what you'd have received had you been
running with real root privileges the entire time.
For each intercepted call, there is a corresponding wrapper with the same name and signature as the intercepted call. The basic wrapper, expressed as pseudocode [sic.], began roughly as shown in Listing 1.
Listing 1. A replacement system call
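int
fchmod(int fd, mode_t mode) {
    int rc;

    /* Resolve the real_*() pointers on first use. */
    if (!setup_wrappers()) {
        errno = ENOSYS;
        return -1;
    }
    if (already_in_pseudo) {
        /* Called from inside pseudo itself: use the real call. */
        rc = real_fchmod(fd, mode);
    } else {
        rc = wrap_fchmod(fd, mode);
    }
    return rc;
}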
In this example, wrap_fchmod() is another
function, defined internally in pseudo, which actually provides the
implementation of the modified fchmod(). By
contrast, real_fchmod() is a function
pointer, which is populated by
setup_wrappers(). The
setup_wrappers() function only does this setup
once; it then sets a static variable to indicate that it's already done,
so future calls can return quickly.
The "already_in_pseudo" test addresses another concern: what happens if, during the implementation of a given call, pseudo needs to make other system calls? It doesn't want those calls intercepted, so it sets the already_in_pseudo flag (actually spelled "antimagic") when it's about to do something that might require such handling. (In fact, it's a counter, although as far as I know it has never exceeded 1.)
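To show why a counter is a comfortable choice, here is a small sketch of the idea. The variable name antimagic comes from the text; enter_antimagic(), leave_antimagic(), and choose_route() are hypothetical names, and the sketch ignores threads entirely.

```c
#include <assert.h>

/* Nonzero while pseudo's own code is running and its calls must not be
 * intercepted. A counter rather than a flag, so nesting is harmless. */
static int antimagic = 0;

enum route { ROUTE_WRAPPED, ROUTE_REAL };

/* Mark the start of a region whose calls should go to the real functions. */
void enter_antimagic(void) {
    ++antimagic;
}

/* Mark the end of such a region. */
void leave_antimagic(void) {
    assert(antimagic > 0);    /* every leave must pair with an enter */
    --antimagic;
}

/* The decision each public wrapper makes before doing any work. */
enum route choose_route(void) {
    return antimagic > 0 ? ROUTE_REAL : ROUTE_WRAPPED;
}
```

With a plain flag, the inner of two nested regions would switch interception back on too early; with a counter, interception resumes only when the outermost region ends.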
The wrap_fchmod() function comes in two parts: a
generic preamble and epilogue, and a #include
to grab the code for that specific function; the internals of each
function are stored in a subdirectory called "guts", which holds the
actual extra work for wrapping. The preamble and epilogue provide standard
setup. Listing 2 shows the current implementation
of wrap_fchmod():
Listing 2. A sample wrapper
static int
wrap_fchmod(int fd, mode_t mode) {
    int rc = -1;

    /* The function-specific body is pulled in textually; it sets rc. */
#include "guts/fchmod.c"

    return rc;
}
This is one of the few times you will ever see an experienced C programmer
write a #include directive naming a
.c file. This design allows additional work to
be done in these functions, and not duplicated across multiple function
implementations. For example, for functions that take optional arguments,
the wrap_*() function extracts the optional
argument. This is one of the many things pseudo does that is technically
undefined behavior, because it's done unconditionally, and it is an error
to call va_arg() when no additional arguments
were passed. Listing 3 shows the
wrap_open() function:
Listing 3. A wrapper with an optional argument
static int
wrap_open(const char *path, int flags, ... /* mode_t mode */) {
    int rc = -1;
    va_list ap;
    mode_t mode;

    /* Fetched unconditionally, even when the caller passed no mode. */
    va_start(ap, flags);
    mode = va_arg(ap, mode_t);
    va_end(ap);

#include "guts/open.c"

    return rc;
}
This way, the internal function can refer to the "mode" argument
unconditionally, without having to use the
va_*() macros. (We can't just declare
open() as taking three arguments, because that
declaration clashes with the standard header.)
Remember how I said the wrapper began a particular way? There have been
three major changes since then. First, paths are now canonicalized in the
wrapper. In the original implementation, pseudo didn't really canonicalize
paths, it just fixed up .. and
. path entries. This led to occasional issues
involving symbolic links in paths, so now pseudo fully canonicalizes paths
(mostly; when pseudo wraps functions that operate on links, the final
component of the name is not dereferenced). Since this is done
consistently, it is also usually done in the wrapper.
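To make the difference concrete, here is a sketch of the purely lexical fix-up, the kind of thing the original implementation did. The function normalize_dots() is a hypothetical name and handles absolute paths only. Because it never looks at the filesystem, it gets the wrong answer whenever a component before a ".." is a symbolic link: if /a/link points to /x/y, then /a/link/../f really names /x/f, but the lexical result is /a/f.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Lexically normalize an absolute path: collapse repeated slashes, drop
 * "." components, and let ".." remove the previous component. Symbolic
 * links are never consulted. Returns 0 on success, -1 if the input is not
 * absolute or the output buffer is too small. */
int normalize_dots(const char *in, char *out, size_t outsz) {
    size_t len = 0;                      /* length of the result so far */

    assert(in != NULL && out != NULL);
    if (in[0] != '/' || outsz < 2)
        return -1;

    while (*in != '\0') {
        const char *end;
        size_t n;

        while (*in == '/')               /* skip separators */
            in++;
        end = strchr(in, '/');
        n = end ? (size_t)(end - in) : strlen(in);
        if (n == 0)
            break;                       /* only trailing slashes remained */

        if (n == 1 && in[0] == '.') {
            /* ".": contributes nothing */
        } else if (n == 2 && in[0] == '.' && in[1] == '.') {
            /* "..": drop the last component; at the root, stay put */
            while (len > 0 && out[--len] != '/')
                ;
        } else {
            if (len + 1 + n + 1 > outsz)
                return -1;               /* no room for "/", name, and NUL */
            out[len++] = '/';
            memcpy(out + len, in, n);
            len += n;
        }
        in += n;
    }
    if (len == 0)
        out[len++] = '/';                /* everything cancelled out */
    out[len] = '\0';
    return 0;
}
```

For example, "/a/./b/../c" becomes "/a/c". Full canonicalization has to resolve each component against the filesystem as it goes, which is more expensive but gives the answer the kernel would.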
In what seemed clever at the time (I now almost regret it), the
canonicalization code is automatically invoked whenever an argument has a
name ending in "path". This means that a few functions are defined in
pseudo with argument names differing from the canonical names in the man
pages. For many of the *at() calls, such as
openat(2), there is an additional "flags"
argument that tells the call whether to follow symlinks or not. The
pathname canonicalizer also takes such an argument. By default, if the
function has an argument named flags, this is
simply passed on to the canonicalizer. This decision also seemed clever at
the time, but caused some strange bugs (such as when used with the plain
open(2) call, which has a "flags" argument with
different semantics).
Second, we added locking for multi-threaded programs. In the original design, without locking, any calls made by one thread while another thread was talking to the pseudo server bypassed pseudo entirely. There is now a locking mechanism to ensure that only one thread at a time goes through any of the wrapped calls.
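The simplest form of such a lock is a single mutex around the body of every wrapper. The sketch below shows only that idea; locked_call() and count_call() are hypothetical names, and pseudo's real mechanism is more involved. In particular, a plain mutex like this one deadlocks if the wrapper body calls another wrapped function on the same thread, so the sketch assumes that never happens.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* One lock shared by every wrapper in the process. */
static pthread_mutex_t wrapper_lock = PTHREAD_MUTEX_INITIALIZER;

/* Run a wrapper body with the lock held, so that only one thread at a
 * time is inside any wrapped call. Returns whatever the body returns. */
int locked_call(int (*body)(void *), void *arg) {
    int rc;

    assert(body != NULL);
    pthread_mutex_lock(&wrapper_lock);
    rc = body(arg);
    pthread_mutex_unlock(&wrapper_lock);
    return rc;
}

/* Example body: bumps a counter and returns its new value. */
int count_call(void *arg) {
    int *n = arg;

    assert(n != NULL);
    return ++*n;
}
```

Serializing every wrapped call costs some parallelism, but it keeps the conversation with the server simple.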
Locking, in turn, led to the third change. Programs may take signals whose
handlers invoke wrapped calls. To handle this, pseudo blocks
certain signals when starting a wrapper, and unblocks them at the end of
the wrapper. This worked fairly well, except for one particularly
pernicious bug. The signals are unblocked only at the end of the wrapper,
and pseudo wraps execve(2) and friends. A successful execve(2) never
returns to its caller, so the new program would start with those signals
blocked, and some programs never reset or fixed up their signals. This
has been corrected.
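A sketch of the block-then-restore pattern follows. The function names are hypothetical, and so is the choice of signals, since the text doesn't say which ones pseudo blocks. I use sigprocmask() to keep the example single-threaded; a threaded program would use pthread_sigmask() instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Run body() with some signals blocked, then restore the caller's mask.
 * Returns body()'s result, or -1 if the mask could not be changed. */
int call_with_signals_blocked(int (*body)(void)) {
    sigset_t block, saved;
    int rc;

    assert(body != NULL);
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);      /* hypothetical choice of signals */
    sigaddset(&block, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &block, &saved) != 0)
        return -1;
    rc = body();
    /* Restore the saved mask rather than unblocking blindly, so that
     * signals the caller had already blocked stay blocked. */
    sigprocmask(SIG_SETMASK, &saved, NULL);
    return rc;
}

/* Example body: reports whether SIGUSR1 is blocked while it runs. */
int usr1_is_blocked(void) {
    sigset_t cur;

    sigprocmask(SIG_BLOCK, NULL, &cur);   /* NULL set: query only */
    return sigismember(&cur, SIGUSR1);
}
```

For an exec-family wrapper, one way to avoid the bug described above is to restore the mask before making the real call, because a successful exec never comes back to run the restore.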
The resulting wrapper function is too long to present inline as a code example, but it seems to work pretty well. You can read it in the pseudo source tree, online on GitHub (see Resources).
Implementations in the guts directory can be
trivial. For example, Listing 4 has the
implementation provided for the acct(2)
syscall:
Listing 4. The simplest case for guts
/*
 * Copyright (c) 2010 Wind River Systems; see
 * guts/COPYRIGHT for information.
 *
 * static int
 * wrap_acct(const char *path) {
 *     int rc = -1;
 */

    rc = real_acct(path);

/*     return rc;
 * }
 */
Note the similarity between the commented code and the wrapper; that's to
give you a sense of what code you're trying to write. However, some are a
lot more complicated. The guts code has to figure out if anything needs to
be asked of the server, and if so, what. In some cases, the underlying
system call is completely omitted, or replaced by a completely different
call. For example, the mknod(2) syscall is
implemented by creating a plain file, then recording the requested file
type in the pseudo database.
Something must be done about this
If a wrapper function needs to record or take note of an action, or needs
to query the pseudo database, it uses the
pseudo_client_op() function to perform an
operation. The signature of this function is shown in Listing 5.
Listing 5. Signature of pseudo_client_op()
pseudo_msg_t *
pseudo_client_op(
    op_id_t op,
    int access,
    int fd,
    int dirfd,
    const char *path,
    const struct stat64 *buf,
    ...);
The "..." is a historical quirk. Before rename operations were supported,
the function took only the fixed arguments. When I added rename
operations, at the time it seemed like a good idea to make the last
argument (which would otherwise have been
const char *oldpath) optional. I'm still not
sure whether that was the right call.
The op_id_t type is an enumeration containing
the list of supported operations, such as
OP_CHROOT, OP_OPEN,
OP_CLOSE, or
OP_CHMOD. The "access" parameter denotes the
type of access (read, write, or execute). The "fd" parameter is the file
descriptor being operated on. The "dirfd" parameter was originally used
for canonicalizing paths being used with the
*at() syscalls such as
openat(2). Since then, path canonicalization
has moved out of pseudo_client_op(), and now
that field is used only for handling OP_DUP
operations, such as calls to dup(2). The name
is now clearly wrong. I will fix it in my copious free time. The path name
is the name available to the call being wrapped, and the stat64 buffer is
stat information about the file being operated on.
In many specific cases, some of these fields may be set to sentinel values because they are not relevant to a particular operation.
Some messages are handled entirely within the client. For example, the
client maintains a table of the current paths of file descriptors, so a
path can be sent to the server along with an operation such as
fchmod(2). The
OP_CLOSE operation updates that internal table,
but is not sent to the server. Operations are sent to the server only when
they need to interact with the database or be logged. (In a future
revision, I hope to separate those two questions more clearly.)
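A client-side table of that kind can be very simple. The sketch below is an assumption about the general shape, not pseudo's data structure: the fixed-size array, FD_TABLE_SIZE, and the fd_table_*() names are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define FD_TABLE_SIZE 1024    /* hypothetical fixed limit for the sketch */

/* Path last associated with each descriptor, or NULL. */
static char *fd_paths[FD_TABLE_SIZE];

/* Remember path for fd, as handling an open might. Returns 0 or -1. */
int fd_table_set(int fd, const char *path) {
    char *copy;
    size_t len;

    assert(path != NULL);
    if (fd < 0 || fd >= FD_TABLE_SIZE)
        return -1;
    len = strlen(path) + 1;
    copy = malloc(len);
    if (copy == NULL)
        return -1;
    memcpy(copy, path, len);
    free(fd_paths[fd]);           /* drop any stale entry for a reused fd */
    fd_paths[fd] = copy;
    return 0;
}

/* Path last recorded for fd, or NULL if none. */
const char *fd_table_get(int fd) {
    if (fd < 0 || fd >= FD_TABLE_SIZE)
        return NULL;
    return fd_paths[fd];
}

/* Forget fd, as handling OP_CLOSE might; nothing is sent anywhere. */
void fd_table_clear(int fd) {
    if (fd < 0 || fd >= FD_TABLE_SIZE)
        return;
    free(fd_paths[fd]);
    fd_paths[fd] = NULL;
}
```

A wrapper for a descriptor-based call can then look up the path with fd_table_get() and include it in the message it sends.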
Actually sending messages to the server
IPC between the client and server is done through a UNIX®-domain
socket. If the socket does not exist, or cannot be opened, the client
attempts to spawn a server. Messages consist of a standard header plus
optional path information. Messages have a number of types, such as
PSEUDO_MSG_PING,
PSEUDO_MSG_OP, or
PSEUDO_MSG_ACK. The
PSEUDO_MSG_OP message type indicates that an
operation, such as OP_CHMOD or
OP_STAT, is being submitted to the server. The
client writes a message to the socket, the server processes the message
and writes its response back with type
PSEUDO_MSG_ACK. Messages referring to a file
generally contain the file's mode (both permissions and file type), its
device and inode numbers, and its path name. When a message carries two
paths, a null byte follows the first path and the second path comes after
it, with the pathlen field of the message set to the total length.
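The two-path layout can be sketched like this. The function pack_paths() is a hypothetical name, and one detail is my assumption rather than something stated above: I count the terminating null bytes of both paths in the total length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Pack one or two paths into buf as "first\0" or "first\0second\0".
 * Returns the number of bytes written, or 0 if buf is too small.
 * ASSUMPTION: the total counts the terminator of each path. */
size_t pack_paths(char *buf, size_t bufsz,
                  const char *first, const char *second) {
    size_t n1, n2;

    assert(buf != NULL && first != NULL);
    n1 = strlen(first) + 1;
    n2 = second ? strlen(second) + 1 : 0;
    if (n1 > bufsz || n2 > bufsz - n1)
        return 0;
    memcpy(buf, first, n1);               /* includes the null separator */
    if (second)
        memcpy(buf + n1, second, n2);
    return n1 + n2;                       /* value for a pathlen-style field */
}
```

The receiver finds the second path by skipping past the first null byte, and knows there is a second path at all because the total is longer than the first path alone.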
The server performs multiple sanity checks on incoming requests. It first looks for a database entry matching the path, device number, and inode number from the request. If there is no such exact match, it checks separately for a match by path and for a match by device and inode number. If either of those succeeds where the exact match failed, a diagnostic is usually reported, except in cases such as a rename, where a name mismatch is expected. If the file types mismatch in an unexpected way, such as one being a directory when the other isn't, diagnostics are recorded and the existing database entry is destroyed. However, a mismatch in which the filesystem contains a plain file and the database records a device node is harmless; this is how pseudo emulates creation of device nodes.
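The matching step can be pictured as classifying each candidate database entry against the request. This is a sketch under my own assumptions, not the server's code: struct file_id, enum match_kind, and classify_match() are hypothetical, and the real server queries a database rather than comparing structs one at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical identity of a file as seen in a request or an entry. */
struct file_id {
    const char *path;
    dev_t       dev;
    ino_t       ino;
};

enum match_kind { MATCH_NONE, MATCH_PATH_ONLY, MATCH_INODE_ONLY, MATCH_EXACT };

/* Compare one database entry with a request. */
enum match_kind classify_match(const struct file_id *entry,
                               const struct file_id *req) {
    int same_path, same_inode;

    assert(entry != NULL && req != NULL);
    assert(entry->path != NULL && req->path != NULL);
    same_path  = strcmp(entry->path, req->path) == 0;
    same_inode = entry->dev == req->dev && entry->ino == req->ino;
    if (same_path && same_inode)
        return MATCH_EXACT;
    if (same_path)
        return MATCH_PATH_ONLY;    /* same name, different file */
    if (same_inode)
        return MATCH_INODE_ONLY;   /* same file, different name */
    return MATCH_NONE;
}
```

The two partial results are the interesting ones: they are where a diagnostic may be warranted, unless the operation, such as a rename, is one where a name mismatch is expected.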
Usually, I find that I learn more from my mistakes than from what works the first time. So, in the next installment of this series, I will go through some of the interesting bugs and failures we've had since starting this project, and how we've handled and learned from them.
Resources
Learn
- "Dissecting shared libraries" (developerWorks, Jan 2005): Read
this overview of Linux shared libraries.
- The pseudo project was
developed entirely to meet internal needs, but was released as open
source.