Level: Introductory
Daniel Robbins (drobbins@gentoo.org), President/CEO
01 Jun 2001

With the 2.4 release of Linux come a host of new filesystem possibilities, including ReiserFS, XFS, GFS and others. Sure, these filesystems sound cool, but what exactly can they do, what are they good at and exactly how do you go about safely using them in a Linux production environment? In the advanced filesystem implementor's guide, Daniel Robbins answers these questions by showing you how to set up these new advanced filesystems under Linux 2.4. Along the way, he shares valuable practical implementation advice, performance information and important technical notes so that your new filesystem experience is as pleasant as possible. In this, the first article in the series, he explains the benefits of journalling and ReiserFS.
What's in store
The purpose of this series is to give you a solid, practical introduction to
Linux's various new filesystems, including ReiserFS, XFS, JFS, GFS, ext3 and
others. I want to equip you with the necessary practical knowledge you need to
actually start using these filesystems. My goal is to help you avoid as many
potential pitfalls as possible; this means that we're going to take a careful
look at filesystem stability, performance issues (both good and bad), any
negative application interactions that you should be aware of, the best
kernel/patch combinations, and more. Consider this series an "insider's guide"
to these next-generation filesystems.

So, that's what's in store. But to begin this series, I'm going to diverge
from this plan for just one article and prepare you for the
journey ahead. I'll cover two topics very important to the Linux development
community -- journalling, and the design vision behind ReiserFS. Journalling is
very important because it's a technology that we've been anticipating for a long
time, and it's finally here. It's used in ReiserFS, XFS, JFS, ext3 and GFS. It's
important to understand exactly what journalling does and why Linux needs it.
Even if you have a good grasp of journalling, I hope that my journalling intro
will serve as a good model for explaining the technology to others, something
that'll be common practice as departments and organizations worldwide begin
transitioning to these new journalling filesystems. Often, this process begins
with a "Linux guy/gal" such as yourself convincing others that it's the right
thing to do.

In the second half of this article, we're going to take a look at the design
vision behind ReiserFS. By doing so, we'll see that
these new filesystems aren't just about doing the same old thing a bit
faster. They also allow us to do things in ways that simply weren't possible
before. Developers, keep this in mind as you read this series. The capabilities
of these new filesystems will likely affect how you code your future
Linux software development projects.
Understanding journalling: meta-data
As you well know, filesystems exist to allow you to store, retrieve and
manipulate data. And, in order to do this, a filesystem needs to maintain an
internal data structure that keeps all your data organized and readily
accessible. This internal data structure (literally, "the data about the data")
is called meta-data. It is the structure of this meta-data that gives
a filesystem its particular identity and performance characteristics.

Normally, we don't interact with a filesystem's meta-data directly. Instead,
a specific Linux filesystem driver takes care of that job for us. A Linux
filesystem driver is specially written to manipulate this maze of meta-data.
However, in order for the filesystem driver to work properly, it has one
important requirement: it expects to find the meta-data in some kind of
reasonable, consistent, non-corrupted state. Otherwise, the filesystem driver
won't be able to understand or manipulate the meta-data, and you won't be able
to access your files.
Understanding journalling: fsck
This is where fsck comes in. When a Linux system boots, fsck starts up
and scans all local filesystems listed in the system's /etc/fstab file. fsck's job is to ensure that the to-be-mounted filesystems' meta-data is in a usable
state. Most of the time, it is. When Linux shuts down, it carefully flushes all
cached data to disk and ensures that the filesystem is cleanly unmounted, so
that it's ready for use when the system starts up again. Typically, fsck scans
the to-be-mounted filesystems and finds that they were cleanly unmounted, and
makes the reasonable assumption that all meta-data is OK.

However, we all know that every now and then, something atypical
happens, such as an unexpected power failure or system lock-up. When these
unfortunate situations occur, Linux doesn't have the opportunity to cleanly
unmount the filesystem. When the system is rebooted and fsck starts its scan,
it detects that these filesystems were not cleanly unmounted and makes a
reasonable assumption that the filesystems probably aren't ready to be seen by
the Linux filesystem drivers. It's very likely that the meta-data is messed up
in some way.

So, to fix this situation, fsck will begin an exhaustive scan and sanity
check on the meta-data, correcting any errors that it finds along the way. Once
fsck is complete, the meta-data is consistent again. Although some
recently-modified data may have been lost due to the unexpected power failure or
system lock-up, the filesystem is ready to be mounted and put to use.
The problem with fsck
So far, this may not sound like a bad approach to ensuring filesystem
consistency, but the solution isn't optimal. Problems arise from the fact that
fsck must scan a filesystem's entire meta-data in order to ensure
filesystem consistency. Doing a complete
consistency check on all meta-data is a time-consuming task in itself, normally
taking at least several minutes to complete. Even worse, the bigger the
filesystem, the longer this exhaustive scan takes. This is a big problem,
because while fsck is doing its thing, your Linux system is effectively
offline, and if you have a large amount of filesystem storage, your system could
be fsck-ing for half an hour or more. Of course, standard fsck behavior can
have devastating results in mission-critical datacenter environments where
system uptime is extremely important. Fortunately, there's a better solution.
The journal
Journalling filesystems solve this fsck problem by adding a new data
structure, called a journal, to the mix. This journal is an on-disk structure.
Before the filesystem driver makes any changes to the meta-data, it writes an
entry to the journal that describes what it's about to do. Then, it goes ahead
and modifies the meta-data. By doing so, a journalling filesystem maintains a
log of recent meta-data modifications, and this comes in handy when it comes
time to check the consistency of a filesystem that wasn't cleanly unmounted.
Think of journalling filesystems this way -- in addition to storing data (your
stuff) and meta-data (the data about the stuff), they also have a journal,
which you could call meta-meta-data (the data about the data about the
stuff).
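
The write-ahead ordering described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class `ToyJournalledFS`, its `set_size` method and the `crash_before_apply` flag are hypothetical names invented for illustration, and an in-memory list and dict stand in for the on-disk journal and meta-data. It is not how any real filesystem driver is written.

```python
class ToyJournalledFS:
    """Toy model only: in-memory stand-ins for on-disk structures."""

    def __init__(self):
        self.journal = []    # entries describing intended meta-data changes
        self.metadata = {}   # hypothetical meta-data: file name -> size

    def set_size(self, name, size, crash_before_apply=False):
        # Step 1: describe the change in the journal *before* touching meta-data.
        self.journal.append(("set_size", name, size))
        if crash_before_apply:
            # Simulate a power failure after the journal write
            # but before the meta-data write.
            return
        # Step 2: only now modify the meta-data itself.
        self.metadata[name] = size


fs = ToyJournalledFS()
fs.set_size("a.txt", 100)
fs.set_size("b.txt", 200, crash_before_apply=True)

print("journal: ", fs.journal)
print("metadata:", fs.metadata)
```

After the simulated crash, the journal still records the intended change to `b.txt` even though the meta-data never received it. That leftover record is what makes a fast recovery possible.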
Journalling in action
So, what does fsck do with a journalling filesystem? Normally, it
does nothing. It simply ignores the filesystem and allows it to be mounted. The
real magic behind quickly restoring the filesystem to a consistent state is
found in the Linux filesystem driver. When the filesystem is mounted, the Linux
filesystem driver checks to see whether the filesystem is OK. If for some reason
it isn't, then the meta-data needs to be fixed, but rather than performing an
exhaustive meta-data scan (like fsck), the driver takes a look at the journal.
Since the journal contains a chronological log of all recent meta-data changes,
it simply inspects those portions of the meta-data that have been recently
modified. Thus, it is able to bring the filesystem back to a consistent
state in a matter of seconds. And unlike the more traditional approach that
fsck takes, this journal replaying process does not take longer on larger
filesystems. Thanks to the journal, hundreds of gigabytes of filesystem
meta-data can be brought to a consistent state almost instantaneously.
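
To see why recovery time depends on the journal rather than on the size of the filesystem, here is a rough Python sketch. The functions `replay` and `full_scan` and the entry format are assumptions made up for this illustration. Real drivers differ in what they log and how they replay it, so treat this as a model of the cost difference only.

```python
def replay(journal, metadata):
    """Re-apply journalled changes; work is proportional to the journal length."""
    touched = 0
    for op, name, size in journal:
        if op == "set_size":
            metadata[name] = size
            touched += 1
    return touched


def full_scan(metadata):
    """Stand-in for an fsck-style pass that visits every meta-data entry."""
    return sum(1 for _ in metadata)


# A "large" filesystem with 100,000 meta-data entries, and a journal
# that records only the two most recent changes.
metadata = {f"file{i}": 0 for i in range(100_000)}
journal = [("set_size", "file7", 4096), ("set_size", "file42", 512)]

replayed = replay(journal, metadata)
scanned = full_scan(metadata)
print(f"journal replay touched {replayed} entries; a full scan visits {scanned}")
```

Growing the `metadata` dict makes `full_scan` slower but leaves `replay` unchanged, which mirrors the point made above about larger filesystems.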
ReiserFS
Now, we come to ReiserFS, the first of several journalling filesystems we're
going to be investigating. ReiserFS 3.6.x (the version included as part of Linux
2.4) is designed and developed by Hans Reiser and his team of developers at Namesys. Hans and his team share the
philosophy that the best filesystems are those that help create a single shared
environment, or namespace, where applications can interact more directly,
efficiently and powerfully. To do this, a filesystem should meet the performance
and feature needs of its users. That way, users can continue using the
filesystem directly rather than building special-purpose layers that run on
top of the filesystem, such as databases and the like.
Small file performance
So, how does one go about making the filesystem more accommodating? Namesys
has decided to focus on one aspect of the filesystem, at least initially --
small file performance. In general, filesystems like ext2 and ufs don't do very
well in this area, often forcing developers to turn to databases or special
organizational hacks to get the kind of performance they need. Over time, this
kind of "I'll code around the problem" approach encourages code bloat and lots
of incompatible special-purpose APIs, which isn't a good thing.

Here's an example of how ext2 can tend to encourage this kind of programming.
ext2 is good at storing lots of files of 20 KB or more, but isn't an ideal technology for
storing 2,000 50-byte files. Not only does performance drop significantly when
ext2 has to deal with extremely small files, but storage efficiency drops as
well, since ext2 allocates space in either 1 KB or 4 KB chunks (configurable when the
filesystem is created).

Now, conventional wisdom would say that you aren't supposed to store
that many ridiculously small files on a filesystem. Instead, they should be
stored in some kind of database that runs above the filesystem. In reply, Hans
Reiser would point out that whenever you need to build a layer on top of the
filesystem, it means that the filesystem isn't meeting your needs. If the
filesystem met your needs, then you could avoid using a special-purpose solution
in the first place. You would thus save development time and eliminate the code bloat
that you would have created by hand-rolling your own proprietary storage or
caching mechanism, interfacing with a database library, etc.

Well, that's the theory. But how good is ReiserFS' small file performance in
practice? Amazingly good. In fact, ReiserFS is around eight to fifteen times
faster than ext2 when handling files smaller than 1 KB in size! Even better, these
performance improvements don't come at the expense of performance for other file
types. In general, ReiserFS outperforms ext2 in nearly every area, but really
shines when it comes to handling small files.
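
The storage-efficiency point from the ext2 example earlier in this section (2,000 files of 50 bytes each) is easy to check with a little arithmetic. The Python sketch below assumes each file occupies a whole number of fixed-size blocks and ignores inode and directory overhead. The helper name `allocated_bytes` is hypothetical.

```python
def allocated_bytes(file_size, block_size):
    """Bytes consumed when a file must occupy whole fixed-size blocks."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


FILES = 2000
FILE_SIZE = 50                      # bytes per file, as in the example
data_bytes = FILES * FILE_SIZE      # bytes of actual file content

usage = {}
for block_size in (1024, 4096):
    on_disk = FILES * allocated_bytes(FILE_SIZE, block_size)
    usage[block_size] = on_disk
    print(f"{block_size}-byte blocks: {on_disk} bytes allocated "
          f"for {data_bytes} bytes of data "
          f"({data_bytes / on_disk:.1%} of the space holds real data)")
```

Under these assumptions, the great majority of the allocated space is padding, and the larger block size makes the ratio worse.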
ReiserFS technology
So how does ReiserFS go about offering such excellent small file performance?
ReiserFS uses a specially optimized b* balanced tree (one per
filesystem) to organize all filesystem data. This in itself offers a nice
performance boost, as well as easing artificial restrictions on filesystem
layouts. It's now possible to have a directory that contains 100,000 other
directories, for example. Another benefit of using a b*tree is that ReiserFS,
like most other next-generation filesystems, dynamically allocates inodes as
needed rather than creating a fixed set of inodes at filesystem creation time.
This helps the filesystem adapt to the various storage
requirements that may be thrown at it, while at the same time allowing for some
additional space-efficiency.

ReiserFS also has a host of features aimed specifically at improving small
file performance. Unlike ext2, ReiserFS doesn't allocate storage space in fixed
1 KB or 4 KB blocks. Instead, it can allocate the exact size it needs. And ReiserFS
also includes some special optimizations centered around tails, a name for
files and end portions of files that are smaller than a filesystem block. In
order to increase performance, ReiserFS is able to store files inside the b*tree
leaf nodes themselves, rather than storing the data somewhere else on the disk
and pointing to it.

This does two things. First, it dramatically increases small file
performance. Since the file data and the stat_data (inode) information are
stored right next to each other, they can normally be read with a single disk IO
operation. Second, ReiserFS is able to pack the tails together, saving a lot
of space. In fact, a ReiserFS filesystem with tail packing enabled (the default)
can store six percent more data than the equivalent ext2 filesystem, which is
amazing in itself.

However, tail packing does cause a slight performance hit since it forces
ReiserFS to repack data as files are modified. For this reason, ReiserFS tail
packing can be turned off, allowing the administrator to choose between good
speed and space efficiency, or opt for even more speed at the cost of some
storage capacity.

ReiserFS truly is an excellent filesystem. In my next article, I'll guide you
through the process of setting up ReiserFS under Linux 2.4. We'll also take a
close look at performance tuning, application interactions (and how to work
around them), the best kernels to use, and more.
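
In the meantime, the space-saving idea behind tail packing described in this section can be illustrated with a simplified Python model. It makes two loud assumptions: tails can be packed back to back with no per-item overhead, and the function names and sample file sizes are invented for illustration. It does not reproduce the actual ReiserFS on-disk layout or the six percent figure quoted above.

```python
def blocks_without_packing(sizes, block_size):
    """Every file rounds up to whole blocks, so each tail wastes the rest of its block."""
    return sum(-(-size // block_size) for size in sizes)


def blocks_with_packing(sizes, block_size):
    """Full blocks are stored as usual; the leftover tails share blocks.

    Simplification: tails are packed back to back with no overhead.
    """
    full_blocks = sum(size // block_size for size in sizes)
    tail_bytes = sum(size % block_size for size in sizes)
    return full_blocks + -(-tail_bytes // block_size)


BLOCK = 4096
sizes = [50, 700, 4096, 5000, 9000]   # hypothetical file sizes in bytes

unpacked = blocks_without_packing(sizes, BLOCK)
packed = blocks_with_packing(sizes, BLOCK)
print(f"without tail packing: {unpacked} blocks; with tail packing: {packed} blocks")
```

The model also hints at the cost mentioned above: when a file grows or shrinks, its tail changes size, so the shared blocks have to be repacked.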
About the author
Residing
in Albuquerque, New Mexico, Daniel Robbins is the
President/CEO of Gentoo Technologies, Inc., and the creator of Gentoo Linux, an advanced Linux for the
PC, and the Portage system, a next-generation ports system for
Linux. He has also served as a contributing author for the Macmillan books
Caldera OpenLinux Unleashed, SuSE Linux Unleashed, and
Samba Unleashed. Daniel has been involved with computers in some
fashion since the second grade, when he was first exposed to the Logo
programming language as well as a potentially dangerous dose of Pac Man.
This probably explains why he has since served as a Lead Graphic Artist at
SONY Electronic Publishing/Psygnosis. Daniel enjoys spending time
with his wife, Mary, and his new baby daughter,
Hadassah. You can contact Daniel at drobbins@gentoo.org.