Advanced filesystem implementor's guide, Part 1
Journalling and ReiserFS
What's in store
The purpose of this series is to give you a solid, practical introduction to Linux's various new filesystems, including ReiserFS, XFS, JFS, GFS, ext3 and others. I want to equip you with the practical knowledge you need to start using these filesystems. My goal is to help you avoid as many potential pitfalls as possible; this means that we're going to take a careful look at filesystem stability, performance issues (both good and bad), any negative application interactions that you should be aware of, the best kernel/patch combinations, and more. Consider this series an "insider's guide" to these next-generation filesystems.
So, that's what's in store. But to begin this series, I'm going to diverge from this plan for just one article and prepare you for the journey ahead. I'll cover two topics very important to the Linux development community -- journalling, and the design vision behind ReiserFS. Journalling is very important because it's a technology that we've been anticipating for a long time, and it's finally here. It's used in ReiserFS, XFS, JFS, ext3 and GFS. It's important to understand exactly what journalling does and why Linux needs it. Even if you have a good grasp of journalling, I hope that my journalling intro will serve as a good model for explaining the technology to others, something that'll be common practice as departments and organizations worldwide begin transitioning to these new journalling filesystems. Often, this process begins with a "Linux guy/gal" such as yourself convincing others that it's the right thing to do.
In the second half of this article, we're going to take a look at the design vision behind ReiserFS. By doing so, we're going to get a good grasp on the fact that these new filesystems aren't just about doing the same old thing a bit faster. They also allow us to do things in ways that simply weren't possible before. Developers, keep this in mind as you read this series. The capabilities of these new filesystems will likely affect how you code your future Linux software development projects.
Understanding journalling: meta-data
As you well know, filesystems exist to allow you to store, retrieve and manipulate data. And, in order to do this, a filesystem needs to maintain an internal data structure that keeps all your data organized and readily accessible. This internal data structure (literally, "the data about the data") is called meta-data. It is the structure of this meta-data that gives a filesystem its particular identity and performance characteristics.
Normally, we don't interact with a filesystem's meta-data directly. Instead, a specific Linux filesystem driver takes care of that job for us. A Linux filesystem driver is specially written to manipulate this maze of meta-data. However, in order for the filesystem driver to work properly, it has one important requirement: it expects to find the meta-data in some kind of reasonable, consistent, non-corrupted state. Otherwise, the filesystem driver won't be able to understand or manipulate the meta-data, and you won't be able to access your files.
Understanding journalling: fsck
This is where fsck comes in. When a Linux system boots, fsck starts up and scans all local filesystems listed in the system's /etc/fstab file. fsck's job is to ensure that the to-be-mounted filesystems' meta-data is in a usable state. Most of the time, it is. When Linux shuts down, it carefully flushes all cached data to disk and ensures that the filesystem is cleanly unmounted, so that it's ready for use when the system starts up again. Typically, fsck scans the to-be-mounted filesystems and finds that they were cleanly unmounted, and makes the reasonable assumption that all meta-data is OK.
However, we all know that every now and then, something atypical happens, such as an unexpected power failure or system lock-up. When these unfortunate situations occur, Linux doesn't have the opportunity to cleanly unmount the filesystem. When the system is rebooted and fsck starts its scan, it detects that these filesystems were not cleanly unmounted and makes a reasonable assumption that the filesystems probably aren't ready to be seen by the Linux filesystem drivers. It's very likely that the meta-data is messed up in some way.
So, to fix this situation, fsck will begin an exhaustive scan and sanity check on the meta-data, correcting any errors that it finds along the way. Although some recently-modified data may have been lost due to the unexpected power failure or system lock-up, once fsck is complete the meta-data is consistent again, and the filesystem is ready to be mounted and put to use.
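To make the boot-time decision concrete, here is a minimal sketch of the logic described above. It is a toy model, not real fsck code: the `Filesystem` class, its `cleanly_unmounted` flag and the `metadata_blocks` count are hypothetical names invented for illustration. The point it shows is that a clean filesystem costs nothing to check, while an unclean one costs work proportional to all of its meta-data.

```python
from dataclasses import dataclass


@dataclass
class Filesystem:
    name: str
    cleanly_unmounted: bool  # hypothetical stand-in for an on-disk "clean" flag
    metadata_blocks: int     # amount of meta-data a full check must visit


def boot_time_check(fs: Filesystem) -> int:
    """Return how many meta-data blocks are examined before mounting."""
    if fs.cleanly_unmounted:
        # Typical case: assume the meta-data is OK and skip the scan.
        return 0
    # Unclean shutdown: every meta-data block has to be checked.
    return fs.metadata_blocks


clean = Filesystem("home", cleanly_unmounted=True, metadata_blocks=1_000_000)
small = Filesystem("root", cleanly_unmounted=False, metadata_blocks=10_000)
large = Filesystem("data", cleanly_unmounted=False, metadata_blocks=1_000_000)

for fs in (clean, small, large):
    print(fs.name, boot_time_check(fs))
```

In this model the unclean "data" filesystem needs one hundred times the work of the unclean "root" filesystem, simply because it holds one hundred times as much meta-data.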
The problem with fsck
So far, this may not sound like a bad approach to ensuring filesystem consistency, but the solution isn't optimal. Problems arise from the fact that fsck must scan a filesystem's entire meta-data in order to ensure filesystem consistency. Doing a complete consistency check on all meta-data is a time-consuming task in itself, normally taking at least several minutes to complete. Even worse, the bigger the filesystem, the longer this exhaustive scan takes. This is a big problem, because while fsck is doing its thing, your Linux system is effectively offline, and if you have a large amount of filesystem storage, your system could be fsck-ing for half an hour or more. Of course, standard fsck behavior can have devastating results in mission-critical datacenter environments where system uptime is extremely important. Fortunately, there's a better solution.
Journalling filesystems solve this fsck problem by adding a new data structure, called a journal, to the mix. This journal is an on-disk structure. Before the filesystem driver makes any changes to the meta-data, it writes an entry to the journal that describes what it's about to do. Then, it goes ahead and modifies the meta-data. By doing so, a journalling filesystem maintains a log of recent meta-data modifications, and this comes in handy when it comes time to check the consistency of a filesystem that wasn't cleanly unmounted.
Think of journalling filesystems this way -- in addition to storing data (your stuff) and meta-data (the data about the stuff), they also have a journal, which you could call meta-meta-data (the data about the data about the stuff).
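The "journal first, meta-data second" ordering can be sketched in a few lines. This is only a toy model under stated assumptions: the journal is a Python list, the meta-data is a dictionary, and the `JournalledStore` class and its `crash_before_apply` parameter are hypothetical names used to simulate a power failure. No real filesystem driver is implemented this way.

```python
class JournalledStore:
    """Toy model: an append-only journal plus a meta-data table."""

    def __init__(self):
        self.journal = []    # stands in for the on-disk journal
        self.metadata = {}   # stands in for the on-disk meta-data

    def update(self, key, value, crash_before_apply=False):
        # Step 1: record what we are about to do in the journal.
        self.journal.append((key, value))
        if crash_before_apply:
            # Simulated power failure: the meta-data was never touched,
            # but the journal still says what was intended.
            return
        # Step 2: only now modify the meta-data itself.
        self.metadata[key] = value


store = JournalledStore()
store.update("inode 12 size", 4096)
store.update("inode 13 size", 512, crash_before_apply=True)

print(store.metadata)  # only the first change was applied
print(store.journal)   # both intended changes were logged
```

After the simulated crash, the meta-data is missing the second change, but the journal holds a record of it, which is exactly the information needed to repair things later.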
Journalling in action
So, what does fsck do with a journalling filesystem? Normally, nothing. It simply ignores the filesystem and allows it to be mounted. The real magic behind quickly restoring the filesystem to a consistent state is found in the Linux filesystem driver. When the filesystem is mounted, the driver checks to see whether the filesystem is OK. If for some reason it isn't, then the meta-data needs to be fixed, but instead of performing an exhaustive meta-data scan (as fsck does), the driver takes a look at the journal. Since the journal contains a chronological log of all recent meta-data changes, the driver only needs to inspect those portions of the meta-data that have been recently modified. Thus, it is able to bring the filesystem back to a consistent state in a matter of seconds. And unlike the more traditional approach that fsck takes, this journal replaying process does not take longer on larger filesystems. Thanks to the journal, hundreds of gigabytes of filesystem meta-data can be brought to a consistent state almost instantaneously.
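Here is a rough sketch of why replaying a journal is so much cheaper than a full scan. Everything in it is an assumption made for illustration: the `replay` function, the dictionary standing in for meta-data, and the list of `(key, value)` journal entries are hypothetical, and real journal formats are considerably more involved. What it demonstrates is that the work done depends on the length of the journal, not on the size of the filesystem.

```python
def replay(journal, metadata):
    """Reapply every journalled change; return how many entries were examined."""
    for key, value in journal:
        # Reapplying a change that already reached the disk is harmless.
        metadata[key] = value
    return len(journal)


# A "large" filesystem: 100,000 meta-data entries, only three recent changes.
metadata = {f"inode {i}": "ok" for i in range(100_000)}
journal = [("inode 7", "ok"), ("inode 8", "ok"), ("inode 9", "ok")]

# Simulate the damage left behind by a crash in the middle of an update.
metadata["inode 8"] = "half-written"

examined = replay(journal, metadata)
print(examined, "journal entries examined out of", len(metadata), "meta-data entries")
```

Making the meta-data table ten times larger would not change the number of entries `replay` has to look at.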
ReiserFS
Now, we come to ReiserFS, the first of several journalling filesystems we're going to be investigating. ReiserFS 3.6.x (the version included as part of Linux 2.4) is designed and developed by Hans Reiser and his team of developers at Namesys. Hans and his team share the philosophy that the best filesystems are those that help create a single shared environment, or namespace, where applications can interact more directly, efficiently and powerfully. To do this, a filesystem should meet the performance and feature needs of its users. That way, users can continue using the filesystem directly rather than building special-purpose layers that run on top of the filesystem, such as databases and the like.
Small file performance
So, how does one go about making the filesystem more accommodating? Namesys has decided to focus on one aspect of the filesystem, at least initially -- small file performance. In general, filesystems like ext2 and ufs don't do very well in this area, often forcing developers to turn to databases or special organizational hacks to get the kind of performance they need. Over time, this kind of "I'll code around the problem" approach encourages code bloat and lots of incompatible special-purpose APIs, which isn't a good thing.
Here's an example of how ext2 can tend to encourage this kind of programming. ext2 is good at storing lots of files that are 20 KB or larger, but isn't an ideal technology for storing 2,000 50-byte files. Not only does performance drop significantly when ext2 has to deal with extremely small files, but storage efficiency drops as well, since ext2 allocates space in blocks of either 1 KB or 4 KB (configurable when the filesystem is created), so even a 50-byte file occupies a full block.
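A quick back-of-the-envelope calculation shows how much space the 2,000 50-byte files from the example above would consume under fixed-size block allocation. This sketch assumes that 1 KB means 1,024 bytes and ignores inode and directory overhead; the `allocated_bytes` helper is a name made up for this illustration.

```python
import math


def allocated_bytes(file_sizes, block_size):
    """Total space consumed when every file is rounded up to whole blocks."""
    return sum(math.ceil(size / block_size) * block_size for size in file_sizes)


files = [50] * 2000   # 2,000 files of 50 bytes each
data = sum(files)     # 100,000 bytes of actual data

for block_size in (1024, 4096):
    used = allocated_bytes(files, block_size)
    print(f"{block_size}-byte blocks: {used} bytes allocated, "
          f"{data / used:.1%} of it holding data")
```

With 1,024-byte blocks the files occupy 2,048,000 bytes, and with 4,096-byte blocks they occupy 8,192,000 bytes, all to hold 100,000 bytes of data.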
Now, conventional wisdom would say that you aren't supposed to store that many ridiculously small files on a filesystem. Instead, they should be stored in some kind of database that runs above the filesystem. In reply, Hans Reiser would point out that whenever you need to build a layer on top of the filesystem, it means that the filesystem isn't meeting your needs. If the filesystem met your needs, then you could avoid using a special-purpose solution in the first place. You would thus save development time and eliminate the code bloat that you would have created by hand-rolling your own proprietary storage or caching mechanism, interfacing with a database library, etc.
Well, that's the theory. But how good is ReiserFS' small file performance in practice? Amazingly good. In fact, ReiserFS is around eight to fifteen times faster than ext2 when handling files smaller than 1 KB! Even better, these performance improvements don't come at the expense of performance for other file types. In general, ReiserFS outperforms ext2 in nearly every area, but really shines when it comes to handling small files.
So how does ReiserFS go about offering such excellent small file performance? ReiserFS uses a specially optimized b* balanced tree (one per filesystem) to organize all filesystem data. This in itself offers a nice performance boost, as well as easing artificial restrictions on filesystem layouts. It's now possible to have a directory that contains 100,000 other directories, for example. Another benefit of using a b*tree is that ReiserFS, like most other next-generation filesystems, dynamically allocates inodes as needed rather than creating a fixed set of inodes at filesystem creation time. This helps the filesystem adapt to the various storage requirements that may be thrown at it, while at the same time allowing for some additional space-efficiency.
ReiserFS also has a host of features aimed specifically at improving small file performance. Unlike ext2, ReiserFS doesn't allocate storage space in fixed 1 KB or 4 KB blocks. Instead, it can allocate the exact size it needs. ReiserFS also includes some special optimizations centered around tails, a name for files and end portions of files that are smaller than a filesystem block. In order to increase performance, ReiserFS is able to store files inside the b*tree leaf nodes themselves, rather than storing the data somewhere else on the disk and pointing to it.
This does two things. First, it dramatically increases small file performance. Since the file data and the stat_data (inode) information are stored right next to each other, they can normally be read with a single disk IO operation. Second, ReiserFS is able to pack the tails together, saving a lot of space. In fact, a ReiserFS filesystem with tail packing enabled (the default) can store six percent more data than the equivalent ext2 filesystem, which is amazing in itself.
However, tail packing does cause a slight performance hit since it forces ReiserFS to repack data as files are modified. For this reason, ReiserFS tail packing can be turned off, allowing the administrator to choose between good speed and space efficiency, or opt for even more speed at the cost of some storage capacity.
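To get a feel for where the space saving from tail packing comes from, here is a deliberately idealized comparison. It is not the ReiserFS algorithm: it assumes a 4,096-byte block size, pretends that tails can be packed perfectly with no per-item overhead, and uses made-up file sizes and hypothetical helper names. It only illustrates the difference between giving every tail its own block and letting tails share space.

```python
BLOCK = 4096  # assumed block size in bytes


def blocks_without_packing(file_sizes):
    """Every file is rounded up to whole blocks; each tail wastes the rest of its block."""
    return sum(-(-size // BLOCK) for size in file_sizes)  # ceiling division


def blocks_with_packing(file_sizes):
    """Full blocks are stored as usual; all tails share space (idealized, perfect packing)."""
    full_blocks = sum(size // BLOCK for size in file_sizes)
    tail_bytes = sum(size % BLOCK for size in file_sizes)
    return full_blocks + -(-tail_bytes // BLOCK)


sizes = [50, 300, 5000, 9000, 4096]  # illustrative file sizes in bytes
print("without packing:", blocks_without_packing(sizes), "blocks")
print("with packing:   ", blocks_with_packing(sizes), "blocks")
```

The flip side, as noted above, is that packed tails may need to be moved when a file grows or shrinks, which is the repacking cost that turning tail packing off avoids.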
ReiserFS truly is an excellent filesystem. In my next article, I'll guide you through the process of setting up ReiserFS under Linux 2.4. We'll also take a close look at performance tuning, application interactions (and how to work around them), the best kernels to use, and more.
- The Namesys Web page is the place to learn more about ReiserFS.
- The ReiserFS mailing list is an excellent source for current, more in-depth ReiserFS information.
- You can find a very nice in-depth look at the meta-data differences between UFS, ext2, ReiserFS, and more in Juan I. Santos Florido's Linux Gazette Journalling Filesystems review.
- Linux Weekly News is a great resource for keeping up with the latest kernel developments.