Applying mount namespaces
Uncover practical applications for advanced Linux mount features
For some time the filesystem in Linux® was a fairly simple tree. A process could chroot() itself so that a subdirectory of the system's filesystem root appeared to be the root of its own filesystem tree. At any node in the tree, a filesystem could be overlaid from a new device.
In 2000, Al Viro introduced bind mounts and filesystem namespaces for Linux:
- A bind mount allows any file or directory to be accessible from any other location.
- Filesystem namespaces are completely separate filesystem trees associated with different processes.
A process requests a copy of its current filesystem tree at clone(2) (see Related topics for more), after which the new process has an identical copy of the original process's filesystem tree. After the copy is made, any mount action in either copy of the tree is not reflected in the other copy.
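On kernels that expose /proc/<pid>/ns/mnt (Linux 3.8 and later), one way to check whether two processes still share a mount namespace is to compare the targets of those symbolic links. The following is a minimal sketch, not part of the original sample code; the function name same_mount_ns and its optional third argument are assumptions made for illustration.

```shell
#!/bin/sh
# same_mount_ns PID1 PID2 [PROC_ROOT]
# Succeeds (exit status 0) when both processes are in the same mount
# namespace, fails with 1 when they differ, and with 2 when a link
# cannot be read. Each /proc/<pid>/ns/mnt is a symbolic link whose
# target, for example "mnt:[4026531840]", names the namespace.
# PROC_ROOT defaults to /proc; the parameter is an assumption of this
# sketch so that it can be pointed at a test directory.
same_mount_ns() {
    proc_root=${3:-/proc}
    ns1=$(readlink "$proc_root/$1/ns/mnt") || return 2
    ns2=$(readlink "$proc_root/$2/ns/mnt") || return 2
    [ "$ns1" = "$ns2" ]
}
```

For example, same_mount_ns $$ 1 compares the current shell with init; reading the link of a process owned by another user typically requires privilege.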
While per-process filesystem namespaces were very useful in theory, in practice the complete isolation between them was too restrictive. Once a process clones its copy of the system's filesystem namespace, an already running system daemon cannot auto-mount a CD-ROM for the user because the mount, performed in the original filesystem namespace, cannot be reflected in the user's copy.
This situation was addressed in 2006 by the introduction of mount propagation, which defines a relationship between mount objects. This relationship is used to determine how mount events in any one mount object propagate to other mount objects in the system:
- A mount event in one mount object propagates to another mount object and vice versa if the two mount objects have a shared relationship with each other.
- A mount event in one mount object propagates to another mount object but not vice versa if the two share a slave relationship with each other where the slave is the event recipient.
A mount object that propagates events is called a shared mount; a mount object that receives a mount event is called a slave mount. A mount object that neither propagates nor receives a mount event is called a private mount. One other specialized mount object is called an unbindable mount, which is like a private mount but also disallows bind mounting itself. Unbindable mounts are particularly useful for containing explosive growth of mount objects (more on this concept later).
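The propagation type of an existing mount can be read back from /proc/<pid>/mountinfo, where shared mounts carry a shared:N tag, slave mounts a master:N tag, and unbindable mounts the word unbindable in the optional fields; a private mount carries none of these. The sketch below classifies a single mountinfo line under that assumption. The function name mount_propagation is hypothetical, and the label "shared+slave" is this sketch's own name for a mount that is both a slave and shared.

```shell
#!/bin/sh
# mount_propagation LINE
# Prints the propagation type recorded in one line of
# /proc/<pid>/mountinfo. The optional fields sit between the sixth
# field (mount options) and a lone "-" separator.
# The body runs in a subshell so "set" does not disturb the caller.
mount_propagation() (
    set -f                      # no filename expansion while splitting
    set -- $1                   # split the line into fields
    [ $# -ge 7 ] || exit 1      # too short to be a mountinfo line
    shift 6                     # skip the six fixed leading fields
    shared=no slave=no unbindable=no
    while [ $# -gt 0 ] && [ "$1" != "-" ]; do
        case $1 in
            shared:*)   shared=yes ;;
            master:*)   slave=yes ;;
            unbindable) unbindable=yes ;;
        esac
        shift
    done
    if [ "$shared" = yes ] && [ "$slave" = yes ]; then
        echo "shared+slave"
    elif [ "$shared" = yes ]; then
        echo shared
    elif [ "$slave" = yes ]; then
        echo slave
    elif [ "$unbindable" = yes ]; then
        echo unbindable
    else
        echo private
    fi
)
```

To survey a running system, feed the function every line of /proc/self/mountinfo, for instance with a while read loop.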
By default, all mounts are private. A mount object can be explicitly marked as shared with the following command:
mount --make-shared <mount-object>
For example, to make the mount at / shared, execute the following command:
mount --make-shared /
A mount object cloned from a shared mount is also a shared mount; they both propagate to each other.
A mount object can be marked a slave mount by explicitly converting a shared mount to a slave mount by executing the following command:
mount --make-slave <shared-mount-object>
A mount object cloned from a slave mount is also a slave mount and is the slave of the same master as that of the original slave mount.
A mount object can be marked private by executing the following command:
mount --make-private <mount-object>
And a mount object can be marked unbindable by executing the following command:
mount --make-unbindable <mount-object>
Finally, any of these can be applied recursively, meaning that they will apply to every mount under the target mount.
For example:
mount --make-rshared /
converts all the mounts under / to shared mounts.
Per-login namespaces
Listing 1 shows the relevant part of a pluggable authentication module (PAM module) that places every user except root into a private namespace. If a directory /tmp/priv/USER exists, then that directory will be bind mounted over /tmp in the user's private namespace.
Listing 1. PAM excerpt for per-login namespaces
#define _GNU_SOURCE             /* for unshare() and CLONE_NEWNS */
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <syslog.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <security/pam_modules.h>

/* mysyslog() is a logging helper defined outside this excerpt. */

#define DIRNAMSZ 200

int handle_login(const char *user)
{
    int ret = 0;
    struct stat statbuf;
    char dirnam[DIRNAMSZ];

    /* root stays in the initial namespace */
    if (strcmp(user, "root") == 0)
        return PAM_SUCCESS;

    /* give this login its own copy of the mount tree */
    ret = unshare(CLONE_NEWNS);
    if (ret) {
        mysyslog(LOG_ERR, "failed to unshare mounts for %s\n", user);
        return PAM_SESSION_ERR;
    }

    /* bind mount /tmp/priv/USER over /tmp if the directory exists */
    snprintf(dirnam, DIRNAMSZ, "/tmp/priv/%s", user);
    ret = stat(dirnam, &statbuf);
    if (ret == 0 && S_ISDIR(statbuf.st_mode)) {
        ret = mount(dirnam, "/tmp", "none", MS_BIND, NULL);
        if (ret) {
            mysyslog(LOG_ERR, "failed to mount tmp for %s\n", user);
            return PAM_SESSION_ERR;
        }
    } else
        mysyslog(LOG_INFO, "No private /tmp for user %s\n", user);

    return PAM_SUCCESS;
}

int pam_sm_open_session(pam_handle_t *pamh, int flags, int argc, const char **argv)
{
    const char *PAM_user = NULL;
    int ret;

    ret = pam_get_user(pamh, &PAM_user, NULL);
    if (ret != PAM_SUCCESS) {
        mysyslog(LOG_ERR, "PAM-NS: couldn't get user\n");
        return PAM_SESSION_ERR;
    }

    return handle_login(PAM_user);
}
To make use of this PAM module, download the full pam_ns.c file and its corresponding makefile from the Download section below. Compile it and copy the resulting pam_ns.so file into /lib/security/. Then add the following entry:
session required pam_ns.so
to /etc/pam.d/login and /etc/pam.d/sshd. Finally, for some user USER, create a private tmp directory:
mkdir /tmp/priv
chmod 000 /tmp/priv
mkdir /tmp/priv/USER
chown -R USER /tmp/priv/USER
Now log in as root on one terminal and as USER on another. As USER, try:
touch /tmp/ab
ls /tmp
Notice that USER's /tmp contains only the newly created file.
Next, get a listing of the /tmp contents in the root terminal; notice that there are likely other files but no /tmp/ab. The /tmp directories are in fact separate directories. To find USER's /tmp from the root terminal, type:
ls /tmp/priv/USER
You will see the file ab. Next, in the root terminal, mount something over /mnt:
mount --bind /dev /mnt
Notice that the contents of /dev appear under /mnt in the root terminal, but not in USER's terminal. The mount trees for these two terminals are completely isolated.
Using the mount(8) command, you can give directives regarding mount propagation. By default, all mounts are private. So before USER logs in, you could try:
mount --make-rshared /
After this, mount events will be propagated between subsequently unshared namespaces. However, after USER is logged in, the mount of /tmp/priv/USER onto /tmp should not be propagated to the parent namespace. To resolve this, pam_ns.so could mark its filesystem as slave, as shown in Listing 2.
Listing 2. PAM module to mark user's namespace as slave
#define _GNU_SOURCE             /* for unshare() and CLONE_NEWNS */
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <syslog.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <security/pam_modules.h>

#define DIRNAMSZ 200

/* fallbacks for older headers that lack the propagation flags */
#ifndef MS_SLAVE
#define MS_SLAVE (1<<19)
#endif
#ifndef MS_REC
#define MS_REC 0x4000
#endif

int handle_login(const char *user)
{
    int ret = 0;
    struct stat statbuf;
    char dirnam[DIRNAMSZ];

    if (strcmp(user, "root") == 0)
        return PAM_SUCCESS;

    ret = unshare(CLONE_NEWNS);
    if (ret) {
        mysyslog(LOG_ERR, "failed to unshare mounts for %s\n", user);
        return PAM_SESSION_ERR;
    }

    /* mark every mount in the new namespace as a slave, so that
     * mounts made here do not propagate back to the parent */
    ret = mount("", "/", "dontcare", MS_REC|MS_SLAVE, "");
    if (ret) {
        mysyslog(LOG_ERR, "failed to mark / rslave for %s\n", user);
        return PAM_SESSION_ERR;
    }

    snprintf(dirnam, DIRNAMSZ, "/tmp/priv/%s", user);
    ret = stat(dirnam, &statbuf);
    if (ret == 0 && S_ISDIR(statbuf.st_mode)) {
        ret = mount(dirnam, "/tmp", "none", MS_BIND, NULL);
        if (ret) {
            mysyslog(LOG_ERR, "failed to mount tmp for %s\n", user);
            return PAM_SESSION_ERR;
        }
    } else
        mysyslog(LOG_INFO, "No private /tmp for user %s\n", user);

    return PAM_SUCCESS;
}
Per-user root
In the section on per-login namespaces, you saw a simple use of mount namespaces to provide users with a private namespace. Using mount propagation, the solution is perfectly adequate for providing users with private /tmp directories and by adding a configuration file parsed by pam_ns.c, additional directories can be redirected per user. This is how LSPP systems provide poly-instantiated home directories, mounting over /home/USER a directory dependent on the clearance of the login process.
However, each login for the same user will receive a private, slave filesystem. So mounts performed by the user in one login session will not be reflected in another.
There are several ways that non-administrator users can mount filesystems. For instance, using FUSE, a user can mount an sshfs (secure shell) filesystem or a loopback filesystem (see Related topics). Leaving the issue of sharing such mounts with other users aside until later in the article, it would be completely unintuitive for such mounts to appear in only one of the user's login terminals but not the others. But with the approach in the section on per-login namespaces, that is exactly what will happen.
Listing 3 shows the relevant excerpts of the pam_chroot.so PAM module. Whereas pam_ns.so clones the mount namespace upon login, the pam_chroot.so module expects a per-user filesystem to be set up under /share/USER/root and simply uses chroot() to lock the user into his private filesystem.
Listing 3. PAM module using chroot()
#include <stdio.h>
#include <syslog.h>
#include <unistd.h>
#include <sys/stat.h>
#include <security/pam_modules.h>

int pam_sm_open_session(pam_handle_t *pamh, int flags, int argc, const char **argv)
{
    const char *PAM_user = NULL;
    char fnam[400];
    int ret;
    struct stat statbuf;

    ret = pam_get_user(pamh, &PAM_user, NULL);
    if (ret != PAM_SUCCESS) {
        mysyslog(LOG_ERR, "PAM-MOUNT: couldn't get user\n");
        return PAM_SESSION_ERR;
    }

    /* check whether /share/$pam_user/root exists. If so, chroot to it */
    snprintf(fnam, sizeof(fnam), "/share/%s/root", PAM_user);
    ret = stat(fnam, &statbuf);
    if (ret == 0 && S_ISDIR(statbuf.st_mode)) {
        ret = chroot(fnam);
        if (ret) {
            mysyslog(LOG_ERR, "PAM-MOUNT: unable to chroot to %s\n", fnam);
            return PAM_SESSION_ERR;
        }
    }

    return PAM_SUCCESS;
}
In this case, all mounting is done in advance at system startup. For instance, after boot:
mkdir -p /share/USER/root
mount --make-rshared /
mount --rbind / /share/USER/root
mount --make-rslave /share/USER/root
mount --bind /share/USER/root/tmp/priv/USER /share/USER/root/tmp
Private namespaces are not being used. Rather, each of USER's logins is chroot()-ed under the same directory, /share/USER/root. Therefore any mounting performed by any one of USER's logins will be seen in all of his logins. In contrast, OTHERUSER will be chroot()-ed under /share/OTHERUSER/root and therefore will not see USER's mount activity.
One shortcoming of this approach is that an ordinary chroot() can be escaped, although some privilege is needed. For instance, a program written to break out of chroot() (see Related topics for the source of one), when executed with certain capabilities including CAP_SYS_CHROOT, will escape into the real filesystem root. Depending on the actual motivation for and use of the per-user filesystem trees, this may be a problem.
We can address this problem by using pivot_root(2) in a private namespace instead of chroot(2) to change the login's root to /share/USER/root. Whereas chroot() simply points the process's filesystem root to a specified new directory, pivot_root() detaches the specified new_root directory (which must be a mount) from its mount point and attaches it to the process root directory. Since the mount tree has no parent for the new root, the system cannot be tricked into entering it like it can with chroot(). We will use the pivot_root() approach.
System setup for per-user root
You've seen the details of how per-user private mount trees can be implemented, including what must be done at login. In this section, you'll see more complete scripts for use at user account creation and system bootup.
Listing 4 shows the script to be run when any user is created.
Listing 4. Script for user creation
create_user_tree() {
    user=$1
    mkdir -p "/user/$user"
    mount --rbind / "/user/$user"
    mount --make-rslave "/user/$user"
    mount --make-rshared "/user/$user"

    # Create a private mount. This is to facilitate pivot_root
    # to temporarily place the old root here before detaching the
    # entire old root tree. NOTE: pivot_root will not allow old root
    # to be placed under a shared mount.
    pushd "/user/$user/"
    mkdir -p __my_private_mnt__
    mount --bind __my_private_mnt__ __my_private_mnt__
    mount --make-private __my_private_mnt__
    popd
}
This script assumes that the init_per_user_namespace script, which we'll discuss next, has already been run. A directory is created for the account under /user/. The root directory is then recursively bind-mounted under /user/$user/. This recursively copied tree of the root filesystem will become a persistent (across logins, not across reboots) store for mount activity by the user.
The copied tree is made a slave of the root tree so that mount actions in the root tree are propagated to the copy but not the other way around. The same tree is marked shared so that subsequent copies—namely made by namespace cloning—will be mount peers; mount actions in any one copy will be propagated to all other copies.
Finally, a private mount named __my_private_mnt__ is created. This is done to facilitate pivot_root() (coming in Listing 6), which temporarily stages the old root mount there before the old tree is detached. Don't get too involved in this step; the reasoning behind it will be clear once the semantics of pivot_root() are clear. For now, just remember that pivot_root() won't succeed if the staging mount is shared.
Listing 5 shows the script to be run at system boot.
Listing 5. Script for system initialization at boot
init_per_user_namespace() {
    # Start with a clean state by marking all mounts as private.
    mount --make-rprivate /

    # Create an unbindable mount called 'user' and have all the
    # users bind the entire system tree '/' under it.
    mkdir -p /user
    mount --bind /user /user
    mount --make-rshared /
    mount --make-unbindable /user

    # existing_users is a placeholder for the list of account names.
    for user in $existing_users; do
        create_user_tree "$user"
    done
}
The script creates the /user directory under which the per-user mount trees will be kept. It then bind mounts /user onto itself. Mount propagation directives such as --make-rshared can only be specified for mount points. This step ensures that such a mount point exists at /user.
Next, the root of the filesystem is marked with --make-rshared so that subsequent copies (created either by bind mounting or by cloning the mount namespace) will be peers with this mount, and mount actions in any one tree will be copied to all peers.
Next, the mount at /user is marked unbindable. The whole mount tree will be recursively copied once for each user, so ordinarily after the first user's copy is created under /user/$user_1, the copy created under /user/$user_2 would contain a recursive copy of /user/$user_1 under /user/$user_2/user/$user_1. As you can guess, this quickly consumes an unwieldy amount of memory. Marking /user unbindable prevents /user from being copied when / is being recursively bind mounted.
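The growth is easy to underestimate. As a rough back-of-the-envelope model (an assumption of this sketch, not a measurement), suppose the tree holds M mounts and each recursive bind copies every mount that exists at that moment, so the count doubles with each user. Propagation between shared peers is ignored, so real numbers can differ. With /user unbindable, each user adds only one copy of the M original mounts. Both function names below are hypothetical.

```shell
#!/bin/sh
# Rough model of how many mounts exist after K users have each run
# "mount --rbind / /user/$user", starting from M mounts.
# Assumption: every rbind copies all mounts present at that moment,
# and propagation between peers is ignored.

# mounts_without_unbindable M K
# Each copy includes all earlier copies, so the total doubles per
# user: M * 2^K.
mounts_without_unbindable() (
    total=$1
    i=0
    while [ "$i" -lt "$2" ]; do
        total=$((total * 2))
        i=$((i + 1))
    done
    echo "$total"
)

# mounts_with_unbindable M K
# Earlier copies under /user are skipped, so each user adds M mounts:
# M * (K + 1). The /user mount itself is left out of M here.
mounts_with_unbindable() (
    echo $(( $1 * ($2 + 1) ))
)
```

With 20 initial mounts and 10 users, the first function prints 20480 and the second prints 220.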
Finally, the script in Listing 4 is executed once per user, creating the /user/$user directory if it did not yet exist and setting up the proper mount propagation as previously discussed.
Listing 6 shows an excerpt of the PAM module to be executed whenever a user logs in.
Listing 6. PAM code excerpt for user login
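#define _GNU_SOURCE             /* for unshare() and CLONE_NEWNS */
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <syslog.h>
#include <unistd.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <security/pam_modules.h>

#ifndef MNT_DETACH
#define MNT_DETACH 0x0000002
#endif
#ifndef MS_REC
#define MS_REC 0x4000
#endif
#ifndef MS_PRIVATE
#define MS_PRIVATE (1<<18) /* Private */
#endif

#define DIRNAMSZ 200

/* glibc provides no pivot_root() wrapper, so invoke the system call
 * directly (drop this if the full source already supplies one). */
static int pivot_root(const char *new_root, const char *put_old)
{
    return syscall(SYS_pivot_root, new_root, put_old);
}

int handle_login(const char *user)
{
    int ret = 0;
    struct stat statbuf;
    char dirnam[DIRNAMSZ], oldroot[DIRNAMSZ];

    /* no per-user tree: let the login proceed unchanged */
    snprintf(dirnam, DIRNAMSZ, "/user/%s", user);
    ret = stat(dirnam, &statbuf);
    if (ret != 0 || !S_ISDIR(statbuf.st_mode))
        return PAM_SUCCESS;

    ret = unshare(CLONE_NEWNS);
    if (ret) {
        mysyslog(LOG_ERR, "failed to unshare mounts for %s, error %d\n", user, errno);
        return PAM_SESSION_ERR;
    }

    ret = chdir(dirnam);
    if (ret) {
        mysyslog(LOG_ERR, "failed to chdir to %s, error %d\n", dirnam, errno);
        return PAM_SESSION_ERR;
    }

    /* make /user/USER the root; the old root lands on the staging mount */
    snprintf(oldroot, DIRNAMSZ, "%s/__my_private_mnt__", dirnam);
    ret = pivot_root(dirnam, oldroot);
    if (ret) {
        mysyslog(LOG_ERR, "failed to pivot_root for %s, error %d\n", user, errno);
        mysyslog(LOG_ERR, "pivot_root was (%s,%s)\n", dirnam, oldroot);
        return PAM_SESSION_ERR;
    }

    /* keep the coming unmount from propagating to other trees */
    ret = mount("", "/__my_private_mnt__", "dontcare", MS_REC|MS_PRIVATE, "");
    if (ret) {
        mysyslog(LOG_ERR, "failed to mark /__my_private_mnt__ private for %s, error %d\n", user, errno);
        return PAM_SESSION_ERR;
    }

    /* detach the old root tree */
    ret = umount2("/__my_private_mnt__", MNT_DETACH);
    if (ret) {
        mysyslog(LOG_ERR, "failed to umount old_root %s, error %d\n", user, errno);
        return PAM_SESSION_ERR;
    }

    return PAM_SUCCESS;
}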
The module begins by checking whether a /user/USER tree exists for the user logging in. If not, the module simply allows the user to log in without any further actions.
If a /user/USER tree does exist, then the first step is to clone a private namespace for the tasks under this login process. Those processes thus have their own copy of the system's initial mount tree. The trees are not disjoint, however; each mount node in the copied tree is shared with the corresponding mount node in the initial tree.
Next, the login process uses pivot_root() to change the filesystem root for the login processes to /user/$user. The original root is left mounted under the new __my_private_mnt__.
The next step is to mark __my_private_mnt__ private so that the subsequent unmount will not propagate to other copies of the root mount tree, including the original.
Finally, the original root is unmounted from __my_private_mnt__.
In the user creation script (see Listing 4), we made the __my_private_mnt__ directory a private mount and said that it would facilitate pivot_root(). The reason for this is in fact a poorly documented limitation on pivot_root() regarding the mount propagation status of the old and new roots. For pivot_root() to succeed, the following mounts must not be shared objects:
- The target location for the old root
- The current parent of the new root (at the time of calling pivot_root())
- The targeted parent of the new root
The first condition above is satisfied by making __my_private_mnt__ private near the end of Listing 4. The second condition is already satisfied, since the current parent of the new root is /user, and the mount at /user is an unbindable mount. The third condition is also already satisfied, since the targeted parent of the new root is also the parent of the current root. That mount is the invisible rootfs mount, which is already private.
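Before relying on the login module, an administrator may want to confirm that the staging mount for a user really is a mount point and is not shared. The following is a minimal sketch of such a check, assuming the /user/USER layout from Listing 4 and the optional-field tags found in /proc/<pid>/mountinfo; the function name staging_mount_ok and its optional file argument are hypothetical.

```shell
#!/bin/sh
# staging_mount_ok USER [MOUNTINFO_FILE]
# Succeeds when /user/USER/__my_private_mnt__ is a mount point and the
# mount on it carries no "shared:N" tag, which is what pivot_root()
# needs for the location that receives the old root.
# MOUNTINFO_FILE defaults to /proc/self/mountinfo.
staging_mount_ok() {
    awk -v mp="/user/$1/__my_private_mnt__" '
        $5 == mp {
            found = 1
            shared = 0
            # optional fields run from field 7 up to the lone "-"
            for (i = 7; i <= NF && $i != "-"; i++)
                if ($i ~ /^shared:/)
                    shared = 1
        }
        END {
            if (found && !shared)
                exit 0
            exit 1
        }
    ' "${2:-/proc/self/mountinfo}"
}
```

Run from the initial namespace, staging_mount_ok USER returns a nonzero status both when the staging directory was never bind mounted onto itself and when it is still shared.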
In this section, we've discussed how to implement per-user mount trees where mount events are shared across all of a user's login sessions but hidden from other users. In the next section, you'll see how to allow users to share mount trees with each other.
Per-user mount trees with selective sharing
We've talked about providing per-user mount trees; now let's discuss how to provide user-directed partial sharing of mount trees across users. Here's a script for system boot.
Listing 7. Script for system boot
init_per_user_namespace() {
    mkdir -p /user/slave_tree
    mkdir -p /user/share_tree

    # Start with a clean state. Set all mounts to private.
    mount --make-rprivate /

    mount --bind /user /user
    mount --bind /user/share_tree /user/share_tree
    mount --bind /user/slave_tree /user/slave_tree
    mount --make-rshared /
    mount --make-unbindable /user

    for user in $(cat /etc/user_list); do
        sh /bin/create_user_tree "$user"
    done
}
We create a /user mount that holds each user's root directory. Under /user, we also create another directory called /user/share_tree, which holds the mounts that each user can share with other users. We also create a /user/slave_tree directory that holds the mounts that each user can share without taking any modification from other users. Of course, to contain unbounded mount creation, we mark the mount at /user as unbindable. And finally, create_user_tree is called to create a mount tree for each user.
Listing 8 describes the steps involved in creating the mount tree as well as allowing the mounts to be shared with other users.
Listing 8. Script for user creation
create_user_tree() {
    user=$1
    mkdir -p "/user/$user"

    # Copy over the entire mount tree under /user/$user.
    mount --rbind / "/user/$user"
    mount --make-rslave "/user/$user"
    mount --make-rshared "/user/$user"

    cd "/user/$user/home/$user"

    # Export my shared exports.
    mkdir -p my_shared_exports
    chown "$user" my_shared_exports
    mount --bind my_shared_exports my_shared_exports
    mount --make-private my_shared_exports
    mount --make-shared my_shared_exports
    mkdir -p "/user/share_tree/$user"
    mount --bind my_shared_exports "/user/share_tree/$user"

    # Export my slave exports.
    mkdir -p my_slave_exports
    chown "$user" my_slave_exports
    mount --bind my_slave_exports my_slave_exports
    mount --make-private my_slave_exports
    mount --make-shared my_slave_exports
    mkdir -p "/user/slave_tree/$user"
    mount --bind my_slave_exports "/user/slave_tree/$user"

    cd "/user/$user"

    # Import everybody's shared exports.
    mkdir -p others_shared_exports
    mount --rbind /user/share_tree others_shared_exports

    # Import everybody's slave exports.
    mkdir -p others_slave_exports
    mount --rbind /user/slave_tree others_slave_exports
    mount --make-rslave others_slave_exports

    # Set up a private mount in the user's tree. This is to facilitate
    # pivot_root, executed later during new user logins.
    mkdir -p __my_private_mnt__
    mount --bind __my_private_mnt__ __my_private_mnt__
    mount --make-private __my_private_mnt__
}
First, replicate the entire mount tree under /user/$user. In the user's tree, create a shared mount named my_shared_exports, which is exported to all users by replicating it under /user/share_tree/$user. Similarly, create my_slave_exports in the user's tree, which is also exported to all users by replicating it under /user/slave_tree/$user. The key idea here is that the user, by choosing to mount something under my_shared_exports, automatically shares the mount with all other users.
Next, import all other users' shared mounts by replicating the mount tree under /user/share_tree and mounting it under the logging-in user's others_shared_exports. Similarly, import all other users' slave mounts by replicating the mount tree under /user/slave_tree and mounting it under others_slave_exports. Of course, since these mounts are intended to be exported as slaves by the exporter, we convert the mounts into slaves.
Having done this initial setup with the correct settings for sharing, the user login algorithm is the same as in Listing 6. When logged in, the user will have the same mount tree as in all of that user's other logins; at the same time, the user will see all the exported shared and slave mounts from all other users under /others_shared_exports and /others_slave_exports, respectively.
Anytime a user wants to export something to others, the user has to just mount the content under my_shared_exports and "magically" the content will be visible to all the users.
Conclusion
Bind mounts facilitate splicing any file or directory on top of any other. Namespaces allow processes to be cloned with isolated copies of their parents' mount trees. Mount propagation allows otherwise isolated copies of filesystem trees to share mount events in one or both directions. These features give users their own quasi-private mount trees while allowing users to see systemwide mount events like a CD-ROM mount and to selectively share their own mount events with other users.
In other words, the techniques for mount propagation described in this article empower users to build a separate filesystem setup and to import and export parts of different filesystem trees into their private filesystem trees.
Downloadable resources
- Sample mount propagation code for this article (dw.mountscode.tgz | 3KB)
Related topics
- "System Administration Toolkit: Migrating and moving UNIX filesystems" (developerWorks, July 2006) explains how to transfer an entire filesystem on a live system, including how to create, copy, and re-enable.
- "Differentiating UNIX and Linux" (developerWorks, March 2006) gives you a quick lesson in the differences in filesystem support between Linux and UNIX.
- In the Linux man pages, learn more about clone(2), unshare(2), mount(8), pivot_root(2), and chroot(2).
- The Common Criteria Labeled Security Protection Profile (LSPP) specifies a set of security functional and assurance requirements (two classes of access control mechanisms) for IT products.
- FUSE (or Filesystem in Userspace) lets you implement a fully functional filesystem in a userspace program with the following features: simple API, no patching or recompiling the kernel, a secure implementation, proven stability, usable by non-privileged users, and it runs on 2.4.x and 2.6.x kernels.
- The sshfs is a filesystem client based on the SSH File Transfer Protocol; most SSH servers already support the protocol, so to set it up, you do nothing on the server side and log in to the server with ssh on the client side.
- Get a "get out of chroot() jail" card.
- Linux PAM is a flexible mechanism for authenticating users that lets developers craft programs that are independent of authentication scheme (so, "new device" doesn't have to equal recoding of all the authentication support programs).
- The Linux Documentation Project has a variety of useful documents, especially its HOWTOs.
- In the developerWorks Linux zone, find more resources for Linux developers, and scan our most popular articles and tutorials.
- See all Linux tips and Linux tutorials on developerWorks.