Linux has an interesting relationship with file systems. Because Linux is open, it tends to be a key development platform both for next-generation file systems and for new, innovative file system ideas. Two interesting recent examples include the massively scalable Ceph and the continuous snapshotting file system nilfs2 (and of course, evolutions in workhorse file systems such as the fourth extended file system [ext4]). It's also an archaeological site for file systems of the past: DOS VFAT, Macintosh HFS, VMS ODS-2, and Plan-9's remote file system protocol. But with all of the file systems you'll find supported within Linux, there's one that generates considerable interest because of the features it implements: Oracle's Zettabyte File System (ZFS).
ZFS was designed and developed by Sun Microsystems (under Jeff Bonwick) and was first announced in 2004, with integration into Sun Solaris occurring in 2005. Although pairing the most popular open operating system with the most talked-about, feature-rich file system would be an ideal match, licensing issues have restricted the integration. Linux is protected by the GNU General Public License (GPL), while ZFS is covered by Sun's Common Development and Distribution License (CDDL). These license agreements have different goals and introduce restrictions that conflict. Fortunately, that doesn't mean that you as a Linux user can't enjoy ZFS and the capabilities it provides.
This article explores two methods for using ZFS in Linux. The first uses the Filesystem in Userspace (FUSE) system to push the ZFS file system into user space to avoid the licensing issues. The second method is a native port of ZFS for integration into the Linux kernel while avoiding the intellectual property issues.
Calling ZFS a file system is a bit of a misnomer, as it is much more than a file system in the traditional sense. ZFS combines the concepts of a logical volume manager with a feature-rich and massively scalable file system. Let's begin by exploring some of the principles on which ZFS is based. First, ZFS uses a pooled storage model instead of the traditional volume-based model. This means that ZFS views storage as a shared pool that can be dynamically allocated (and shrunk) as needed. This is advantageous over the traditional model, where file systems reside on volumes and an independent volume manager is used to administer these assets. Embedded within ZFS is an implementation of an important set of features such as snapshots, copy-on-write clones, continuous integrity checking, and data protection through RAID-Z. Going further, it's possible to use your own favorite file system (such as ext4) on top of a ZFS volume. This means that you get ZFS features such as snapshots on an independent file system (one that likely doesn't support them directly).
But ZFS isn't just a collection of features that make up a useful file system. Rather, it's a collection of integrated and complementary features that make it an outstanding file system. Let's look at some of these features, and then see some of them in action.
As discussed earlier, ZFS incorporates a volume-management function to abstract underlying physical storage devices to the file system. Rather than viewing physical block devices directly, ZFS operates on storage pools (called zpools), which are constructed from virtual drives that can physically be represented by drives or portions of drives. Further, these pools can be constructed dynamically, even while the pool is actively in use.
ZFS uses a copy-on-write model for managing data on the storage. This means that data is never written in place (never overwritten); instead, new blocks are written and the metadata is updated to reference them. Copy-on-write is advantageous for a number of reasons (not only for capabilities such as the snapshots and clones that it enables). By never overwriting data, it's simpler to ensure that the storage is never left in an inconsistent state (as the older data remains after the new write operation is complete). This allows ZFS to be transaction based, and it's much simpler to implement features like atomic operations.
An interesting side effect of the copy-on-write design is that all writes to the file system become sequential writes (because remapping is always occurring). This behavior avoids hot spots in the storage and exploits the performance of sequential writes (faster than random writes).
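The copy-on-write idea can be sketched in a few lines of Python. This is only a toy model, not how ZFS is implemented: the `CowStore` class, its append-only `blocks` list, and the `root` table are hypothetical names chosen for illustration. It shows the two properties described above: new data always lands in a fresh location, and the previous state stays readable because nothing is overwritten.

```python
class CowStore:
    """Toy copy-on-write store (illustrative only, not the ZFS on-disk format)."""

    def __init__(self):
        self.blocks = []   # append-only "disk": existing entries are never modified
        self.root = {}     # current metadata: name -> index into self.blocks

    def write(self, name, data):
        # Write the new data to a fresh location first...
        self.blocks.append(bytes(data))
        new_index = len(self.blocks) - 1
        # ...then build a new metadata table that references it.
        new_root = dict(self.root)
        new_root[name] = new_index
        # Swapping the root is the single step that makes the write visible.
        old_root, self.root = self.root, new_root
        return old_root    # the previous root still describes a consistent state

    def read(self, name, root=None):
        table = self.root if root is None else root
        return self.blocks[table[name]]


store = CowStore()
store.write("a.txt", b"version 1")
previous_root = store.write("a.txt", b"version 2")

print(store.read("a.txt"))                  # current contents
print(store.read("a.txt", previous_root))   # older contents, still intact
```

Because every write appends to `blocks`, the toy store also mirrors the observation that copy-on-write turns updates into sequential writes.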
Storage pools made up of virtual devices can be protected using one of ZFS's numerous protection schemes. You can mirror a pool across two or more devices (RAID 1) or protect it with parity (similar to RAID 5) but across dynamic stripe widths (more on this later). ZFS supports a variety of parity schemes based on the number of devices in the pool. For example, you can protect three devices with single-parity RAID-Z (RAID-Z 1); with four devices, you can use RAID-Z 2 (double parity, similar to RAID 6). For even greater protection, you can use RAID-Z 3 with larger numbers of disks for triple parity.
For speed (but no data protection other than error detection), you can employ striping across devices (RAID 0). You can also create striped mirrors (to mirror striped drives), similar to RAID 10.
An interesting attribute of ZFS comes with the combination of RAID-Z, copy-on-write transactions, and dynamic stripe widths. In a traditional RAID 5 architecture, all disks must have their data within the stripe, or the stripe is inconsistent. Because there's no way to update all disks atomically, it's possible to produce the well-known RAID 5 write hole problem (where a stripe is inconsistent across the drives of the RAID set). Given ZFS transactions and never having to write in place, the write hole problem is eliminated. Another convenient quality of this approach is what happens when a disk fails and a rebuild is required. A traditional RAID 5 system uses data from other disks in the set to rebuild data for the new drive. RAID-Z traverses the available metadata to read only the data that's relevant for the geometry and avoids reading the unused space on the disk. This behavior becomes even more important as disks become larger and rebuild times increase.
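The write hole can be made concrete with a small Python sketch. It assumes simple XOR parity over two data blocks, which is a simplification of both RAID 5 and RAID-Z; the helpers `make_stripe` and `is_consistent` are hypothetical names for illustration. The first half interrupts an in-place update between the data write and the parity write. The second half writes a complete new stripe elsewhere and only then switches a pointer, so a crash at any point leaves a consistent stripe live.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(d0, d1):
    return {"d0": d0, "d1": d1, "parity": xor_blocks([d0, d1])}


def is_consistent(stripe):
    return stripe["parity"] == xor_blocks([stripe["d0"], stripe["d1"]])


# In-place update, interrupted between the data write and the parity write.
in_place = make_stripe(b"\x01\x01", b"\x02\x02")
in_place["d0"] = b"\x07\x07"          # data block overwritten
# (simulated crash here: the parity block is never rewritten)
print(is_consistent(in_place))        # the stripe no longer matches its parity

# Copy-on-write update: the new stripe goes to fresh space, then a pointer flips.
stripes = [make_stripe(b"\x01\x01", b"\x02\x02")]
live = 0
stripes.append(make_stripe(b"\x07\x07", b"\x02\x02"))
consistent_before_switch = is_consistent(stripes[live])  # a crash here keeps the old stripe
live = 1                                                 # one pointer update commits the write
print(consistent_before_switch, is_consistent(stripes[live]))
```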
Although data protection provides the ability to regenerate data on a failure, it says nothing about the validity of the data in the first place. ZFS solves this issue by generating a 32-bit checksum (or 256-bit hash) as metadata for each block written. When a block is read, its checksum is verified to avoid the problem of silent data corruption. In a volume that has data protection (mirroring or RAID-Z), the alternate data can be read or regenerated automatically.
Checksums are stored with metadata in ZFS, so phantom writes can be detected and—if data protection is provided (RAID-Z)—corrected.
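A minimal Python sketch of verify-on-read with self-healing follows. It assumes a two-way mirror and uses SHA-256 from the standard library as a stand-in for whichever checksum the pool is configured with; `read_verified` is a hypothetical helper, not a ZFS interface. The key point is that the expected checksum is kept apart from the block it describes, so a corrupted block cannot vouch for itself.

```python
import hashlib


def checksum(data):
    return hashlib.sha256(data).digest()


def read_verified(copies, expected):
    """Return a copy whose checksum matches, repairing any bad copies in place."""
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise OSError("all copies failed checksum verification")
    for i, copy in enumerate(copies):
        if checksum(copy) != expected:
            copies[i] = good          # self-heal: rewrite the bad copy from the good one
    return good


block = b"important data"
pointer_checksum = checksum(block)        # stored with the metadata, not with the block

mirror = [b"important dat\x00", block]    # the first copy has been silently corrupted
result = read_verified(mirror, pointer_checksum)

print(result == block)                    # the caller still gets valid data
print(mirror[0] == block)                 # and the damaged copy has been repaired
```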
Given the copy-on-write nature of ZFS, features like snapshots and clones become simple to provide. Because ZFS never overwrites data but instead writes to a new location, older data can be preserved (but in the nominal case is marked for removal to conserve disk space). A snapshot is a preservation of older blocks to maintain the state of a file system at a given instant in time. This approach is also space efficient, because no copy is required (unless all data in the file system is rewritten). A clone is a writable form of snapshot. In this case, original unwritten blocks are shared by each clone, and blocks that are written are available only to the specific file system clone.
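The relationship between a file system, a snapshot, and a clone can be sketched in Python as three pointer tables over one shared set of immutable blocks. The `Dataset` class and its methods are hypothetical names for illustration and do not correspond to ZFS internals. A snapshot copies only the pointers, and a clone starts from a snapshot's pointers and diverges as it is written.

```python
from types import MappingProxyType


class Dataset:
    """Toy file system: a name -> block-index table over a shared block pool."""

    def __init__(self, pool, table=None):
        self.pool = pool                    # shared list of never-modified blocks
        self.table = dict(table or {})      # this dataset's own pointers

    def write(self, name, data):
        self.pool.append(bytes(data))       # always a new block, never an overwrite
        self.table[name] = len(self.pool) - 1

    def read(self, name):
        return self.pool[self.table[name]]

    def snapshot(self):
        # Read-only copy of the pointers; no data blocks are copied.
        return MappingProxyType(dict(self.table))


pool = []
fs = Dataset(pool)
fs.write("notes", b"old contents")

snap = fs.snapshot()                  # freeze the current state
fs.write("notes", b"new contents")    # the live file system moves on

clone = Dataset(pool, snap)           # writable file system seeded from the snapshot
clone.write("extra", b"clone only")

print(pool[snap["notes"]])            # the snapshot still sees the old block
print(fs.read("notes"))               # the live file system sees the new block
print(clone.read("notes"))            # the clone shares the snapshot's unwritten block
print(len(pool))                      # total blocks stored across all three views
```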
Traditional file systems are made up of statically sized blocks that match the back-end storage (512 bytes). ZFS implements variable block sizes for a variety of uses (commonly up to 128KB in size, but you can change this value). One important use of variable block sizes is compression (because the resulting block size when compressed will ideally be less than the original). This functionality minimizes waste in the storage system in addition to providing better utilization of the storage network (because less data emitted to storage requires less time in transfer).
Outside of compression, supporting variable block sizes also means that you can tune the block size for the particular workload expected for improved performance.
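The link between variable block sizes and compression can be illustrated with a short Python sketch. It makes several assumptions that are not from ZFS itself: `zlib` stands in for the file system's compressor, records are 128KB, and stored blocks are rounded up to 512-byte sectors. With fixed-size blocks, a compressed record would still occupy a full block; with variable sizes, it occupies only the sectors it needs.

```python
import zlib

RECORD_SIZE = 128 * 1024   # assumed logical record size
SECTOR = 512               # assumed smallest allocatable unit


def stored_size(record):
    """Bytes occupied on disk by one record in this toy model."""
    compressed = zlib.compress(record)
    best = min(len(compressed), len(record))   # keep the raw record if compression does not help
    return -(-best // SECTOR) * SECTOR         # round up to whole sectors


# Repetitive sample data standing in for a text file.
data = b"major minor device description\n" * 4000
records = [data[i:i + RECORD_SIZE] for i in range(0, len(data), RECORD_SIZE)]

logical = len(data)
physical = sum(stored_size(r) for r in records)

print("logical bytes: ", logical)
print("physical bytes:", physical)
print("ratio: %.2fx" % (logical / physical))
```

Real data compresses far less dramatically than this repetitive sample, so the printed ratio is not representative of what a pool would report.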
ZFS incorporates many other features, such as de-duplication (to minimize copies of data), configurable replication, encryption, an adaptive replacement cache for cache management, and online disk scrubbing (to identify and fix latent errors while they can still be fixed). It does this with immense scalability, supporting 16 exabytes of addressable storage (2^64 bytes).
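As a quick check of the scalability figure, the following Python lines show how 2 to the power of 64 bytes works out to 16 exabytes when an exabyte is counted in binary units (2 to the power of 60 bytes). The constant names are chosen here for illustration.

```python
EXBIBYTE = 2 ** 60          # bytes in one binary exabyte
addressable = 2 ** 64       # bytes of addressable storage quoted above

print(addressable)               # 18446744073709551616
print(addressable // EXBIBYTE)   # 16
```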
Now that you've seen some of the abstract concepts behind ZFS, let's look at some of them in practice. This demonstration uses ZFS-FUSE. FUSE is a mechanism that allows you to implement file systems in user space without kernel code (other than the FUSE kernel module and existing file system code). The module provides a bridge from the kernel file system interface to user space, where the file system implementation runs. First, install the ZFS-FUSE package (the following demonstration targets Ubuntu).
Installing ZFS-FUSE is simple, particularly on Ubuntu using
apt. The following command line installs
everything you need to begin using ZFS-FUSE:
$ sudo apt-get install zfs-fuse
This command line installs ZFS-FUSE and all other dependent packages (mine
also required libaio1), performs the necessary setup for the new packages,
and starts the zfs-fuse daemon.
In this demonstration, you use the loop-back device to emulate disks as
files within the host operating system. To begin, create these files
(using /dev/zero as the source) with the dd
utility (see Listing 1). With your four disk images
created, use losetup to associate the disk
images with the loop devices.
Listing 1. Setup for working with ZFS-FUSE
$ mkdir zfstest
$ cd zfstest
$ dd if=/dev/zero of=disk1.img bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (67 MB) copied, 1.235 s, 54.3 MB/s
$ dd if=/dev/zero of=disk2.img bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (67 MB) copied, 0.531909 s, 126 MB/s
$ dd if=/dev/zero of=disk3.img bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (67 MB) copied, 0.680588 s, 98.6 MB/s
$ dd if=/dev/zero of=disk4.img bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (67 MB) copied, 0.429055 s, 156 MB/s
$ ls
disk1.img  disk2.img  disk3.img  disk4.img
$ sudo losetup /dev/loop0 ./disk1.img
$ sudo losetup /dev/loop1 ./disk2.img
$ sudo losetup /dev/loop2 ./disk3.img
$ sudo losetup /dev/loop3 ./disk4.img
$
With four devices available to use as your block devices for ZFS (totaling
256MB in size), create your pool using the
zpool command. You use the
zpool command to manage ZFS storage pools, but
as you'll see, you can use it for a variety of other purposes, as well.
The following command requests a ZFS storage pool to be created with four
devices and provides data protection with RAID-Z. You follow this command
with a list request to provide data on your pool (see Listing 2).
Listing 2. Creating a ZFS pool
$ sudo zpool create myzpool raidz /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
$ sudo zfs list
NAME      USED  AVAIL  REFER  MOUNTPOINT
myzpool  96.5K   146M  31.4K  /myzpool
$
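To put the reported capacity in context, here is a rough Python estimate of how much space single-parity protection leaves for data. The function `raidz_usable_bytes` is a hypothetical helper, and the formula (one device's worth of space per parity level) is a simplification that ignores labels, metadata, and any space the file system holds in reserve.

```python
MIB = 1024 * 1024


def raidz_usable_bytes(device_count, device_bytes, parity=1):
    """Rough upper bound on data capacity: total space minus the parity share."""
    if device_count <= parity:
        raise ValueError("need more devices than parity levels")
    return (device_count - parity) * device_bytes


raw = 4 * 64 * MIB
usable = raidz_usable_bytes(4, 64 * MIB, parity=1)

print(raw // MIB)      # 256
print(usable // MIB)   # 192
```

The 146M that `zfs list` reports as available is lower than this upper bound. The listing does not break down the difference, but on devices this small, fixed overheads are likely to account for a noticeable share.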
You can also investigate some of the attributes of your pool, as shown in Listing 3; the values shown are the defaults. Among other things, you can see the available capacity and portion used. (The output has been abbreviated.)
Listing 3. Reviewing the attributes of the storage pool
$ sudo zfs get all myzpool
NAME     PROPERTY              VALUE                  SOURCE
myzpool  type                  filesystem             -
myzpool  creation              Sat Nov 13 22:43 2010  -
myzpool  used                  96.5K                  -
myzpool  available             146M                   -
myzpool  referenced            31.4K                  -
myzpool  compressratio         1.00x                  -
myzpool  mounted               yes                    -
myzpool  quota                 none                   default
myzpool  reservation           none                   default
myzpool  recordsize            128K                   default
myzpool  mountpoint            /myzpool               default
myzpool  sharenfs              off                    default
myzpool  checksum              on                     default
myzpool  compression           off                    default
myzpool  atime                 on                     default
myzpool  copies                1                      default
myzpool  version               4                      -
...
myzpool  primarycache          all                    default
myzpool  secondarycache        all                    default
myzpool  usedbysnapshots       0                      -
myzpool  usedbydataset         31.4K                  -
myzpool  usedbychildren        65.1K                  -
myzpool  usedbyrefreservation  0                      -
$
Now, let's actually use the ZFS pool. First, create a new file system within
your pool (using the zfs create command), and then enable compression within
it (using the zfs set command). Next, copy a file into it.
I've selected a file that's around 120KB in size to see the effect of ZFS
compression. Note that your pool is mounted at the root, so treat it just
like a directory within your root file system. Once the file is copied,
you can list it to see that the file is present (but is the same size as
the original). Using the du command, you can
see that the size of the file on disk is about half the original, indicating
that ZFS has compressed it. You can also look at the
compressratio property to see how much your
pool has been compressed (using the default compressor, lzjb).
Listing 4 shows the compression.
Listing 4. Demonstrating compression with ZFS
$ sudo zfs create myzpool/myzdev
$ sudo zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
myzpool          139K   146M  31.4K  /myzpool
myzpool/myzdev  31.4K   146M  31.4K  /myzpool/myzdev
$ sudo zfs set compression=on myzpool/myzdev
$ ls /myzpool/myzdev/
$ sudo cp ../linux-2.6.34/Documentation/devices.txt /myzpool/myzdev/
$ ls -la ../linux-2.6.34/Documentation/devices.txt
-rw-r--r-- 1 mtj mtj 118144 2010-05-16 14:17 ../linux-2.6.34/Documentation/devices.txt
$ ls -la /myzpool/myzdev/
total 5
drwxr-xr-x 2 root root      3 2010-11-20 22:59 .
drwxr-xr-x 3 root root      3 2010-11-20 22:55 ..
-rw-r--r-- 1 root root 118144 2010-11-20 22:59 devices.txt
$ du -ah /myzpool/myzdev/
60K     /myzpool/myzdev/devices.txt
62K     /myzpool/myzdev/
$ sudo zfs get compressratio myzpool
NAME     PROPERTY       VALUE  SOURCE
myzpool  compressratio  1.55x  -
$
Finally, let's look at the self-repair capabilities of ZFS. Recall that
when you created your pool, you requested RAID-Z over the four devices.
You can check the status of your pool using the
zpool status command, as shown in Listing 5.
As shown, you can see the elements of your
pool (RAID-Z 1 with four devices).
Listing 5. Checking your pool status
$ sudo zpool status myzpool
  pool: myzpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            loop0   ONLINE       0     0     0
            loop1   ONLINE       0     0     0
            loop2   ONLINE       0     0     0
            loop3   ONLINE       0     0     0

errors: No known data errors
$
Now, let's force an error into the pool. For this demonstration, go behind
the scenes and corrupt the disk file that makes up the device (your
disk4.img, represented in ZFS by the loop3
device). Use the dd command to simply zero out
the entire device (see Listing 6).
Listing 6. Corrupting the ZFS pool
$ dd if=/dev/zero of=disk4.img bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (67 MB) copied, 1.84791 s, 36.3 MB/s
$
ZFS is currently unaware of the corruption, but you can force it to see the
problem by requesting a scrub of the pool. As shown in Listing 7,
ZFS now recognizes the corruption (of the
loop3 device) and suggests an action to replace
the device. Note also that the pool remains online, and you can still get
to your data, as ZFS self-corrects through RAID-Z.
Listing 7. Scrubbing and checking the pool
$ sudo zpool scrub myzpool
$ sudo zpool status myzpool
  pool: myzpool
 state: ONLINE
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: scrub completed after 0h0m with 0 errors on Sat Nov 20 23:15:03 2010
config:

        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            loop0   ONLINE       0     0     0
            loop1   ONLINE       0     0     0
            loop2   ONLINE       0     0     0
            loop3   UNAVAIL      0     0     0  corrupted data

errors: No known data errors
$ wc -l /myzpool/myzdev/devices.txt
3340 /myzpool/myzdev/devices.txt
$
As recommended, introduce a new device to your RAID-Z set to act as the new
container. Begin by creating a new disk image and representing it as a
device with losetup. Note that this process is
similar to adding a new physical disk to the set. You then use
zpool replace to exchange the corrupted device
(loop3) with the new device
(loop4). Checking the status of the pool, you
can see your new device with a message indicating that data was rebuilt on
it (called resilvering), along with the amount of data moved
there. Note also that the pool remains online with no errors (visible to
the user). To conclude, you scrub the pool again; after checking its
status, you'll see that no issues exist, as shown in Listing 8.
Listing 8. Repairing the pool using zpool replace
$ dd if=/dev/zero of=disk5.img bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (67 MB) copied, 0.925143 s, 72.5 MB/s
$ sudo losetup /dev/loop4 ./disk5.img
$ sudo zpool replace myzpool loop3 loop4
$ sudo zpool status myzpool
  pool: myzpool
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Sat Nov 20 23:23:12 2010
config:

        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            loop0   ONLINE       0     0     0
            loop1   ONLINE       0     0     0
            loop2   ONLINE       0     0     0
            loop4   ONLINE       0     0     0  59.5K resilvered

errors: No known data errors
$ sudo zpool scrub myzpool
$ sudo zpool status myzpool
  pool: myzpool
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Sat Nov 20 23:23:23 2010
config:

        NAME        STATE     READ WRITE CKSUM
        myzpool     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            loop0   ONLINE       0     0     0
            loop1   ONLINE       0     0     0
            loop2   ONLINE       0     0     0
            loop4   ONLINE       0     0     0

errors: No known data errors
$
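Note that only 59.5K was resilvered onto a 64MB device. The Python sketch below suggests why a metadata-driven rebuild touches so little. It assumes fixed-width stripes with plain XOR parity and a hypothetical list of allocated stripes standing in for the file system's metadata; RAID-Z's real layout uses variable stripe widths and is more involved.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


BLOCK = 4            # bytes per block (tiny, for illustration)
STRIPES = 8          # stripes per device
ZERO = bytes(BLOCK)

# Hypothetical allocation metadata: only stripes 0 and 3 hold live data.
allocated = [0, 3]

d0 = [ZERO] * STRIPES
d1 = [ZERO] * STRIPES
d2 = [ZERO] * STRIPES
d0[0], d1[0], d2[0] = b"aaaa", b"bbbb", b"cccc"
d0[3], d1[3], d2[3] = b"dddd", b"eeee", b"ffff"
parity = [xor_blocks([d0[s], d1[s], d2[s]]) for s in range(STRIPES)]


def resilver(survivors, allocated_stripes):
    """Rebuild a replacement device, visiting only the allocated stripes."""
    replacement = [ZERO] * STRIPES
    rebuilt_bytes = 0
    for s in allocated_stripes:
        replacement[s] = xor_blocks([device[s] for device in survivors])
        rebuilt_bytes += BLOCK
    return replacement, rebuilt_bytes


# Pretend d2 failed and is being replaced by a blank device.
new_d2, rebuilt_bytes = resilver([d0, d1, parity], allocated)

print(new_d2 == d2)
print(rebuilt_bytes, "of", BLOCK * STRIPES, "bytes rebuilt")
```

A rebuild that has no knowledge of allocation would have to reconstruct every stripe on the device, used or not.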
This short demonstration explores the consolidation of volume management with a file system and shows how easy it is to administer ZFS (even in the face of failures).
The advantage of ZFS on FUSE is that it's simple to begin using ZFS, but it has the downside of not being as efficient as it could be. This lack of efficiency is the result of the multiple user-kernel transitions required per I/O. But given the popularity of ZFS, there is another option that provides greater performance.
A native port of ZFS to the Linux kernel is well under way at the Lawrence Livermore National Laboratory. This port still lacks some elements, such as the ZFS POSIX Layer (ZPL), but this is under development. The port provides a number of useful features, particularly if you're interested in using ZFS with Lustre. (See Resources for details.)
Hopefully, this article has whetted your appetite to dig further into ZFS. From the earlier demonstration, you can easily get ZFS up and running on most Linux distributions, even in the kernel, with some limitations. Topics such as snapshots and clones were not demonstrated here, but the Resources section provides links to interesting articles on these topics. In the end, Linux and ZFS are state-of-the-art technologies, and it will be difficult to keep them apart.
Learn
- This exceptional presentation from Jeff
Bonwick and Bill Moore provides a detailed overview of ZFS and why it's the last word in file systems.
- You can learn more about ZFS in the
various Oracle Web sites for Solaris and ZFS. The OpenSolaris ZFS community site provides useful information on ZFS
and where to learn more. Wikipedia also provides a nice, compact introduction to ZFS. You
can read about RAID-Z
from Jeff Bonwick and the specific problems it solves over
traditional RAID 5.
- FUSE provides a user space framework for
the development and execution of file systems. FUSE is used with ZFS, as
demonstrated in this article with ZFS-FUSE, but it's also widely used as a
means of experimenting with file system development. You can learn more
about FUSE and file system development in Develop
your own filesystem with FUSE (Sumit Singh, developerWorks,
February 2006).
- One of the simplest means of integrating
ZFS into Linux is a straight port of the Solaris implementation, but
licensing contention precludes this. You can learn more about the licenses
at Wikipedia for the GPL
and CDDL.
- The FreeBSD Handbook provides a nice introduction to ZFS as it
applies to BSD.
- Outside of running ZFS on FUSE, there is
one native implementation of ZFS within the Linux kernel. The ZFS on Linux project is growing and
already provides an impressive set of features.
- Although ZFS provides checksums on each
block written to storage, there is also a standardized SCSI end-to-end
integrity scheme called DIF. You can learn more about DIF in this
presentation from Oracle on data integrity or in Linux Kernel Advances (M. Tim Jones, developerWorks, March 2009).
- For anyone who has read any of Tim's
other articles on developerWorks, you already know he's a fan of file
systems. Check out these other articles for all aspects of Linux file
systems:
- Anatomy of the Linux file system (October 2007)
- Next-generation Linux file systems: NiLFS(2) and Exofs (October 2009)
- Ceph: A Linux petabyte-scale distributed file system (May 2010)
- Anatomy of ext4 (February 2009)
- Anatomy of Linux journaling file systems (June 2008)
- Anatomy of the Linux virtual file system switch (August 2009)

M. Tim Jones is an embedded firmware architect and the author of Artificial Intelligence: A Systems Approach, GNU/Linux Application Programming (now in its second edition), AI Application Programming (in its second edition), and BSD Sockets Programming from a Multilanguage Perspective. His engineering background ranges from the development of kernels for geosynchronous spacecraft to embedded systems architecture and networking protocols development. Tim is a Senior Architect for Emulex Corp. in Longmont, Colorado.



