Advanced filesystem implementor's guide, Part 9
In this article, we'll take a look at XFS, SGI's free, 64-bit high-performance filesystem for Linux. First, I'll explain how XFS compares to ext3 and ReiserFS, and describe many of the technologies that XFS uses internally. Then, in the next article, I'll guide you through the process of setting up XFS on your own system, as well as cover XFS tuning tips and useful XFS features like ACLs (access control lists) and extended attribute support.
XFS was originally developed by Silicon Graphics, Inc. back in the early 90s. At that time, SGI found that their existing filesystem (EFS) was quickly becoming unsuitable for tackling the extreme computing challenges of the day. Addressing this problem, SGI decided to design a completely new high-performance 64-bit filesystem rather than attempting to tweak EFS to do something that it was never designed to do. Thus, XFS was born, and was made available to the computing public with the release of IRIX 5.3 in 1994. To this day, it continues to be used as the underlying filesystem for all of SGI's IRIX-based products, from workstations to supercomputers. And now, XFS is also available for Linux. The arrival of XFS for Linux is exciting, primarily because it provides the Linux community with a robust, refined, and very feature-rich filesystem that's capable of scaling to meet the toughest storage challenges.
XFS, ReiserFS, and ext3 performance
Up until now, choosing the appropriate next-generation Linux filesystem has been refreshingly straightforward. Those who were looking for raw performance generally leaned towards ReiserFS, while those more interested in meticulous data integrity features preferred ext3. However, with the release of XFS for Linux, things have suddenly become much more confusing. In particular, it's no longer clear that ReiserFS is still the next-gen performance leader.
Recently, I performed a series of tests in an attempt to figure out how XFS, ReiserFS, and ext3 compare in terms of raw performance. Before I share my results, it's important to understand that my results only highlight general filesystem performance trends under light system loads on a uniprocessor system, and are not an absolute measure of whether a particular filesystem is "better" than another. Despite this, my results should help give you an idea of what filesystem may be best suited for a particular task. Again, my results should not be considered conclusive; the best test is always to try your particular application under each filesystem to see how it performs.
In my tests, I found XFS to be generally quite speedy. XFS consistently won all tests that involved manipulating large files, which should be expected since it has been designed and tuned over the years to do this very well. I also discovered that XFS has a singular performance quirk: it doesn't delete files very quickly; it was easily bested by both ReiserFS and ext3 in this area. According to Steve Lord, the Principal Engineer of filesystem software for SGI, a patch has just been written to address this problem, and it should be available soon.
Other than that, XFS performance was very close to that of ReiserFS and generally surpassed that of ext3. One of the nicest things about XFS is that, like ReiserFS, it doesn't generate a lot of unnecessary disk activity. XFS tries to cache as much data in memory as possible, and generally only writes things out to disk when memory pressure dictates that it do so. When it's flushing data to disk, other IO operations seem largely unaffected. In contrast, when ext3 (in "data=ordered" mode, the default) flushes data to the drive, it can result in a lot of additional seeks and, depending on the IO load, even some unnecessary disk thrashing.
My performance and tuning tests were primarily focused around extracting an uncompressed kernel source tarball from a RAM disk to the test filesystem, and then recursively copying the new source tree to a new directory on the same filesystem. XFS performed these tasks quite well, although initially, XFS performance was slightly worse than that of ReiserFS. However, after tweaking the mount options for my test XFS filesystem, I was able to get XFS to perform slightly better than ReiserFS when handling medium-sized files such as those found in the kernel source tree. That is, except for deletes; both ReiserFS and ext3 delete files much more quickly than XFS, at least for now.
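If you'd like to try a similar comparison yourself, the extract, copy, and delete steps described above can be approximated with a short script. This is only a minimal sketch and not the harness used for the results in this article: the function names (`timed`, `run_benchmark`) and the directory names are made up for illustration, and the timings include whatever the page cache absorbs, since nothing here forces data out to disk.

```python
import shutil
import tarfile
import time
from pathlib import Path


def timed(fn, *args):
    """Run fn(*args) and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


def run_benchmark(tarball, mount_point):
    """Time extracting a tarball, copying the tree, and deleting the copy.

    `tarball` should live on fast storage (a RAM disk, for example) and
    `mount_point` should be a directory on the filesystem under test.
    """
    root = Path(mount_point)
    extracted = root / "extracted"   # hypothetical directory names
    copied = root / "copied"

    def extract():
        with tarfile.open(tarball) as tar:
            tar.extractall(extracted)

    # Dictionary values are evaluated top to bottom, so the steps run in order.
    results = {
        "extract": timed(extract),
        "copy": timed(shutil.copytree, extracted, copied),
        "delete": timed(shutil.rmtree, copied),
    }
    shutil.rmtree(extracted)  # clean up so the run can be repeated
    return results
```

Calling `run_benchmark("/path/to/linux.tar", "/mnt/test")` returns a dictionary of three timings in seconds; both paths here are placeholders. Repeating the run several times on each filesystem and comparing the spread is more informative than a single number.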
I hope I've given you a general idea of what kind of performance you can expect from XFS; my results show that XFS is the best filesystem to use if you need to manipulate large files. For small to medium-sized files, XFS can be competitive with, and sometimes even faster than, ReiserFS if you create and mount your XFS filesystem with some performance-enhancing options. Ext3 in "data=journal" mode offered good performance, but it was difficult to get consistent performance numbers due to apparent irregularities in how ext3 flushed data from previous tests to disk, which would result in some disk thrashing.
In the "Scalability in the XFS Filesystem" paper (see Related topics later in this article) featured at USENIX '96, the SGI engineers explain that XFS was designed with a single main idea: "think big". Indeed, XFS has been designed to eliminate the limitations found in traditional filesystems. Now, let's take a look at some of the intriguing design features behind XFS that make this possible.
Introducing allocation groups
When an XFS filesystem is created, the underlying block device is split into eight or more equally-sized linear regions. You can think of them as "chunks" or "linear ranges", but in XFS terminology each region is called an "allocation group". Allocation groups are unique in that each allocation group manages its own inodes and free space, in effect turning them into a kind of sub-filesystem that exists transparently within the XFS filesystem proper.
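To make the idea concrete, here is a minimal sketch of carving a block device into equal linear regions and finding the region that owns a given block. It is a toy model, not XFS's on-disk layout: the names `AllocationGroup`, `split_into_groups`, and `group_of` are invented for this example, and any leftover blocks that don't divide evenly are simply ignored.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AllocationGroup:
    index: int
    first_block: int   # first block of the region (inclusive)
    block_count: int   # number of blocks in the region


def split_into_groups(total_blocks: int, group_count: int = 8) -> List[AllocationGroup]:
    """Divide a device of total_blocks into group_count equal linear regions."""
    if group_count < 1 or total_blocks < group_count:
        raise ValueError("need at least one block per allocation group")
    size = total_blocks // group_count  # assumption: the remainder is left unused
    return [AllocationGroup(i, i * size, size) for i in range(group_count)]


def group_of(block: int, groups: List[AllocationGroup]) -> AllocationGroup:
    """Return the allocation group that contains the given block number."""
    size = groups[0].block_count
    if not 0 <= block < size * len(groups):
        raise ValueError("block lies outside every allocation group")
    return groups[block // size]
```

With a 1,000,000-block device split eight ways, each group covers 125,000 blocks, so block 500,000 falls in group 4.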
Allocation groups and scalability
So, why exactly does XFS have allocation groups? Primarily, XFS uses allocation groups so that it can efficiently handle parallel IO. Because each allocation group is effectively its own independent entity, the kernel can interact with multiple allocation groups simultaneously. Without allocation groups, the XFS filesystem code could become a performance bottleneck, forcing IO-hungry processes to "get in line" to make inode modifications or perform other kinds of metadata-intensive operations. Thanks to allocation groups, the XFS code will allow multiple threads and processes to continue to run in parallel, even if many of them are performing non-trivial IO on the same filesystem. So, match XFS with some high-end hardware and you'll get high-end results rather than a filesystem bottleneck. Allocation groups also help to optimize parallel IO performance on multiprocessor systems, because more than one metadata update can be "in transit" at the same time.
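The benefit comes down to lock granularity, which the following sketch tries to illustrate. Each group carries its own lock, so threads working in different groups never wait on one another, whereas a single filesystem-wide lock would serialize them all. The `LockedGroup` class and `worker` function are hypothetical, and the example shows the locking structure only; it is not a model of the kernel code and not a performance measurement.

```python
import threading


class LockedGroup:
    """A toy allocation group with its own free-block counter and its own lock."""

    def __init__(self, index, free_blocks):
        self.index = index
        self.free_blocks = free_blocks
        self.lock = threading.Lock()

    def allocate(self, count):
        """Take `count` blocks from this group; only this group's lock is held."""
        with self.lock:
            if count > self.free_blocks:
                return False
            self.free_blocks -= count
            return True


def worker(group, requests):
    """Issue `requests` single-block allocations against one group."""
    for _ in range(requests):
        group.allocate(1)


groups = [LockedGroup(i, 10_000) for i in range(8)]
threads = [threading.Thread(target=worker, args=(g, 1_000)) for g in groups]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the threads finish, every group has handed out exactly 1,000 blocks, and no thread ever had to wait for a lock held on behalf of another group.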
B+ trees everywhere
Internally, allocation groups use efficient B+ trees to keep track of important data such as ranges (also called "extents") of free space, as well as inodes. In fact, each allocation group has two B+ trees used to keep track of free space; one stores the extents of free space ordered by size, and the other tree has the regions ordered by their starting physical location on the block device. The ability to find regions of free space quickly is critical for maximizing write performance, which is something that XFS is very good at.
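Here is a small sketch of why keeping the same free extents in two orderings is useful. To stay short it uses sorted Python lists with `bisect` in place of real B+ trees, so it shows the two lookups rather than the tree structure itself. The class and method names are invented for this example.

```python
import bisect


class FreeSpaceIndex:
    """Free extents indexed twice: by starting block and by length."""

    def __init__(self):
        self.by_start = []  # sorted list of (start, length)
        self.by_size = []   # sorted list of (length, start)

    def add(self, start, length):
        bisect.insort(self.by_start, (start, length))
        bisect.insort(self.by_size, (length, start))

    def _take(self, start, found_len, length):
        """Remove an extent from both indexes and give back any unused tail."""
        self.by_start.remove((start, found_len))
        self.by_size.remove((found_len, start))
        if found_len > length:
            self.add(start + length, found_len - length)
        return start

    def allocate_best_fit(self, length):
        """Use the size index: smallest free extent that is large enough."""
        i = bisect.bisect_left(self.by_size, (length, -1))
        if i == len(self.by_size):
            return None
        found_len, start = self.by_size[i]
        return self._take(start, found_len, length)

    def allocate_near(self, target, length):
        """Use the location index: first big-enough extent at or after target."""
        i = bisect.bisect_left(self.by_start, (target, 0))
        for start, found_len in self.by_start[i:]:
            if found_len >= length:
                return self._take(start, found_len, length)
        return None
```

The size-ordered index answers "where is a hole of at least N blocks?", while the location-ordered index answers "what is free close to block B?", which is what you want when extending a file near its existing data.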
XFS is also very efficient when it comes to the management of inodes. Each allocation group allocates inodes as needed, in groups of 64. An allocation group keeps track of its own inodes by using a B+ tree that records where each particular inode number can be found on disk. You'll find that XFS uses B+ trees as much as possible, due to their excellent performance and tremendous scalability.
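The following sketch shows the general shape of on-demand inode allocation in chunks of 64, with an index that maps an inode number back to its location. It makes several simplifying assumptions that are mine, not XFS's: inode numbers are handed out sequentially, a plain dictionary stands in for the B+ tree, and the `pick_disk_block` callback is a hypothetical stand-in for the free-space allocator.

```python
INODES_PER_CHUNK = 64


class InodeIndex:
    """Toy per-group inode index that grows one 64-inode chunk at a time."""

    def __init__(self):
        self.chunk_block = {}  # first inode number in chunk -> disk block of chunk
        self.allocated = 0     # inodes handed out so far

    def allocate_inode(self, pick_disk_block):
        """Return a new inode number, adding a chunk when the last one is full."""
        ino = self.allocated
        if ino % INODES_PER_CHUNK == 0:
            # No room left (or no chunk yet): allocate space for 64 more inodes.
            self.chunk_block[ino] = pick_disk_block()
        self.allocated += 1
        return ino

    def locate(self, ino):
        """Return (disk block of the chunk, slot within the chunk)."""
        if not 0 <= ino < self.allocated:
            raise KeyError(ino)
        first = ino - ino % INODES_PER_CHUNK
        return self.chunk_block[first], ino - first
```

Because space for inodes is only set aside when a chunk is actually needed, a filesystem holding a few huge files doesn't pay for millions of inodes it will never use.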
Of course, XFS is a journaling filesystem, allowing for fast recovery after an unexpected reboot. Like ReiserFS, XFS uses a logical journal; that is, it does not journal literal filesystem blocks like ext3, and instead uses an efficient on-disk format to log metadata changes. In the case of XFS, logical journaling is a good fit; on high-end hardware, the journal is often the most contentious resource of the entire filesystem. By using a space-efficient logical journal, contention for the journal can be minimized. In addition, XFS allows the journal to be stored on another block device, such as a partition on another disk. This feature works well to improve XFS filesystem performance even further.
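To see why a logical journal is more space-efficient, compare the size of a record that carries a whole modified block with one that carries only the bytes that changed. The record layouts below are invented purely for illustration and bear no relation to the real XFS or ext3 log formats; the 4096-byte block size is likewise an assumption.

```python
import struct

BLOCK_SIZE = 4096  # assumed block size for this example


def physical_record(block_no: int, block: bytes) -> bytes:
    """Block-style journaling: log the full contents of the modified block."""
    return struct.pack("<Q", block_no) + block


def logical_record(block_no: int, offset: int, new_bytes: bytes) -> bytes:
    """Logical-style journaling: log only where the change is and what it is."""
    return struct.pack("<QHH", block_no, offset, len(new_bytes)) + new_bytes


def replay_logical(block: bytearray, record: bytes) -> None:
    """Re-apply a logical record to an in-memory copy of the block."""
    header_size = struct.calcsize("<QHH")
    _block_no, offset, length = struct.unpack_from("<QHH", record)
    block[offset:offset + length] = record[header_size:header_size + length]
```

Changing four bytes of metadata costs 16 bytes of journal space with the logical record here, against 4,104 bytes when the whole block is logged. Smaller records mean less journal traffic, which is the point when the journal is the contended resource.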
Like ReiserFS, XFS only journals metadata, and does not take any special precautions to ensure that the data makes it to disk before metadata is written. This means that with XFS (just like with ReiserFS), it's possible for recently modified data to be lost in the event of an unexpected reboot. However, a couple of properties of XFS' journal make this issue less common than it is with ReiserFS.
With ReiserFS, an unexpected reboot can result in recently modified files containing portions of previously deleted files. Besides the obvious data loss, this could also theoretically pose a security threat. In contrast, XFS ensures that any unwritten data blocks are zeroed on reboot, when the XFS journal is replayed. Thus, missing blocks are filled with null bytes, eliminating the security hole -- a much better approach.
Now, what about the data loss issue itself? In general, this problem is minimized with XFS due to the fact that XFS generally writes pending metadata updates to disk much more frequently than ReiserFS does, especially during periods of high disk activity. Thus, in the event of a lockup, you will generally lose fewer of your recent metadata modifications than you would with ReiserFS. Of course, this does not directly address the problem of not writing data blocks in time, but writing metadata more frequently does encourage data to be written more frequently as well.
We'll finish our technical overview of XFS by taking a look at delayed allocation, a feature unique to XFS. As you probably know, the term allocation refers to the process of finding regions of free space to use for storing new data.
XFS handles allocation by breaking it into a two-step process. First, when XFS receives new data to be written, it records the pending transaction in RAM and simply reserves an appropriate amount of space on the underlying filesystem. However, while XFS reserves space for the new data, it doesn't decide what filesystem blocks will be used to store the data, at least not yet. XFS procrastinates, delaying this decision to the last possible moment, right before this data is actually written to disk.
By delaying allocation, XFS gains many opportunities to optimize write performance. When it comes time to write the data to disk, XFS can now allocate free space intelligently, in a way that optimizes filesystem performance. In particular, if a bunch of new data is being appended to a single file, XFS can allocate a single, contiguous region on disk to store this data. If XFS hadn't delayed its allocation decision, it may have unknowingly written the data into multiple non-contiguous chunks, reducing write performance significantly. But, because XFS delayed its allocation decision, it was able to write the data in one fell swoop, improving write performance as well as reducing overall filesystem fragmentation.
Delayed allocation also has another performance benefit. In situations where many short-lived temporary files are created, XFS may never need to write these files to disk at all. Since no blocks are ever allocated, there's no need to deallocate any blocks, and the underlying filesystem metadata doesn't even get touched.
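The two-step process can be sketched in a few lines. In this toy model, a write only reserves space and remembers how much data is pending; block numbers are chosen at flush time, when all the appends to a file can be covered by one contiguous extent. The `DelayedAllocator` class is hypothetical, its allocator just hands out blocks in ascending order, and deleting files that have already reached the disk is not modeled.

```python
class DelayedAllocator:
    """Toy model: reserve space on write, pick actual blocks only at flush."""

    def __init__(self, total_blocks):
        self.free_blocks = total_blocks  # blocks neither reserved nor allocated
        self.next_block = 0              # toy allocator: hands out blocks in order
        self.pending = {}                # file name -> blocks buffered in RAM
        self.extents = {}                # file name -> list of (start, length)

    def write(self, name, blocks):
        """Step one: reserve space, but don't decide where the data will go."""
        if blocks > self.free_blocks:
            raise OSError("no space left on device")
        self.free_blocks -= blocks
        self.pending[name] = self.pending.get(name, 0) + blocks

    def delete_unflushed(self, name):
        """Drop a file that never reached disk: only the reservation is undone."""
        self.free_blocks += self.pending.pop(name, 0)

    def flush(self):
        """Step two: allocate one contiguous extent per file with pending data."""
        for name, blocks in self.pending.items():
            self.extents.setdefault(name, []).append((self.next_block, blocks))
            self.next_block += blocks
        self.pending.clear()
```

If a file named `log` receives four 16-block appends before the flush, it ends up with the single extent `(0, 64)` rather than four separate ones. A temporary file that is written and then removed with `delete_unflushed` before the flush never appears in `extents` at all.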
I hope you've enjoyed reading about the performance and technical characteristics of XFS, one of Linux's powerful next-generation filesystems. Join me in my next article when I show you how to get XFS up and running on your system. In my next article, we'll also take a look at some of XFS' advanced features, such as ACLs and extended attributes. I'll see you then!
Related topics
- Read Daniel's previous articles in this series, where he described:
- You can learn more about XFS at SGI's XFS page. Read the FAQ. Join the mailing list. And if you're impatient, go ahead and download and install XFS and start playing.
- Browse more Linux resources on developerWorks.
- Browse more Open source resources on developerWorks.