Although ZFS exists in an operating system whose future is at risk, it is easily one of the most advanced, feature-rich file systems in existence. It incorporates variable block sizes, compression, encryption, de-duplication, snapshots, clones, and (as the name implies) support for massive capacities. Get to know the concepts behind ZFS and learn how you can use ZFS today on Linux using Filesystem in Userspace (FUSE).

M. Tim Jones, Independent author

M. Tim Jones is an embedded firmware architect and the author of Artificial Intelligence: A Systems Approach, GNU/Linux Application Programming (now in its second edition), AI Application Programming (in its second edition), and BSD Sockets Programming from a Multilanguage Perspective. His engineering background ranges from the development of kernels for geosynchronous spacecraft to embedded systems architecture and networking protocols development. Tim is a Senior Architect for Emulex Corp. in Longmont, Colorado.



19 January 2011

Linux has an interesting relationship with file systems. Because Linux is open, it tends to be a key development platform both for next-generation file systems and for new, innovative file system ideas. Two interesting recent examples include the massively scalable Ceph and the continuous-snapshotting file system nilfs2 (and, of course, evolutions in workhorse file systems such as the fourth extended file system [ext4]). It's also an archaeological site for file systems of the past: DOS VFAT, the Macintosh's HFS, VMS ODS-2, and Plan 9's remote file system protocol. But among all of the file systems you'll find supported within Linux, there's one that generates considerable interest because of the features it implements: Oracle's Zettabyte File System (ZFS).

ZFS was designed and developed by Sun Microsystems (under Jeff Bonwick) and was first announced in 2004, with integration into Sun Solaris following in 2005. Although pairing the most popular open operating system with the most talked-about, feature-rich file system would seem an ideal match, licensing issues have restricted the integration. Linux is protected by the GNU General Public License (GPL), while ZFS is covered by Sun's Common Development and Distribution License (CDDL). These licenses have different goals and introduce restrictions that conflict. Fortunately, that doesn't mean that you as a Linux user can't enjoy ZFS and the capabilities it provides.

This article explores two methods for using ZFS in Linux. The first uses the Filesystem in Userspace (FUSE) system to push the ZFS file system into user space to avoid the licensing issues. The second method is a native port of ZFS for integration into the Linux kernel while avoiding the intellectual property issues.

Where can you find ZFS?

Today, you can find ZFS natively within OpenSolaris (also covered under the CDDL) as well as in other operating systems whose licenses are compatible. For example, ZFS has been part of FreeBSD since 2007. ZFS was once part of Darwin (a derivative of the Berkeley Software Distribution [BSD], NeXTSTEP, and CMU's Mach 3 microkernel) but has since been removed.

Introducing ZFS

Calling ZFS a file system is a bit of a misnomer, as it is much more than a file system in the traditional sense. ZFS combines the concept of a logical volume manager with a very feature-rich and massively scalable file system. Let's begin by exploring some of the principles on which ZFS is based. First, ZFS uses a pooled storage model instead of the traditional volume-based model. This means that ZFS views storage as a shared pool from which space can be dynamically allocated (and shrunk) as needed. This is advantageous over the traditional model, where file systems reside on volumes and an independent volume manager is used to administer these assets. Embedded within ZFS is an implementation of an important set of features such as snapshots, copy-on-write clones, continuous integrity checking, and data protection through RAID-Z. Going further, it's possible to use your own favorite file system (such as ext4) on top of a ZFS volume. This means that you get ZFS features such as snapshots on top of an independent file system that likely doesn't support them directly.
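As a sketch of that last point, ZFS can expose a slice of the pool as an emulated block device (a zvol) onto which any other file system can be formatted. The example below assumes a hypothetical pool named mypool and a ZFS implementation that surfaces zvols as block devices (the native ports do; zfs-fuse's support is more limited), and the device path varies by platform:

$ sudo zfs create -V 1G mypool/extvol             # carve a 1GB virtual block device out of the pool
$ sudo mkfs.ext4 /dev/zvol/mypool/extvol          # put ext4 on it (device path varies by platform)
$ sudo mkdir -p /mnt/ext4
$ sudo mount /dev/zvol/mypool/extvol /mnt/ext4    # mount and use it like any other block device

Snapshots taken at the ZFS layer then capture the volume underneath ext4, even though ext4 itself knows nothing about them.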

But ZFS isn't just a collection of features that make up a useful file system. Rather, it's a collection of integrated and complementary features that make it an outstanding file system. Let's look at some of these features, and then see some of them in action.

Storage pools

As discussed earlier, ZFS incorporates a volume-management function to abstract the underlying physical storage devices from the file system. Rather than operating on physical block devices directly, ZFS operates on storage pools (called zpools), which are constructed from virtual devices that can be physically represented by whole drives or portions of drives. Further, these pools can be built and expanded dynamically, even while the pool is actively in use.
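For example, adding capacity to a live pool is a single command. The pool and device names here are hypothetical; the new space becomes available to every file system in the pool immediately:

$ sudo zpool add mypool /dev/sde     # grow the pool with another device while it's mounted and busy
$ sudo zpool list mypool             # the additional capacity shows up right away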

Copy-on-write

ZFS uses a copy-on-write model for managing data on the storage. This means that data is never written in place (never overwritten); instead, new blocks are written and the metadata is updated to reference them. Copy-on-write is advantageous for a number of reasons beyond the capabilities, such as snapshots and clones, that it enables. Because data is never overwritten, it's simpler to ensure that the storage is never left in an inconsistent state (as the older data remains intact until the new write operation is complete). This allows ZFS to be transaction based and makes it much simpler to implement features like atomic operations.

An interesting side effect of the copy-on-write design is that all writes to the file system become sequential writes (because remapping is always occurring). This behavior avoids hot spots in the storage and exploits the performance of sequential writes (faster than random writes).

Data protection

Storage pools made up of virtual devices can be protected using one of ZFS's numerous protection schemes. You can mirror a pool across two or more devices (RAID 1) or protect it with parity (similar to RAID 5) but across dynamic stripe widths (more on this later). ZFS supports a variety of parity schemes based on the number of devices in the pool. For example, you can protect three devices with single parity using RAID-Z 1; with four devices, you can use RAID-Z 2 (double parity, similar to RAID 6). For even greater protection, you can use RAID-Z 3 with larger numbers of disks for triple parity.

For speed (but no data protection other than error detection), you can employ striping across devices (RAID 0). You can also create striped mirrors (to mirror striped drives), similar to RAID 10.
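All of these layouts are chosen when the pool is built, using the zpool create command. The following command lines are alternative sketches with hypothetical disk names; each would create a pool called tank with one of the schemes described above (you would pick just one):

$ sudo zpool create tank mirror /dev/sdb /dev/sdc                    # two-way mirror (RAID 1)
$ sudo zpool create tank raidz /dev/sdb /dev/sdc /dev/sdd            # single parity (RAID-Z 1)
$ sudo zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde  # double parity (RAID-Z 2)
$ sudo zpool create tank /dev/sdb /dev/sdc                           # simple stripe (RAID 0), no redundancy
$ sudo zpool create tank mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde  # striped mirrors (similar to RAID 10)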

An interesting attribute of ZFS comes with the combination of RAID-Z, copy-on-write transactions, and dynamic stripe widths. In a traditional RAID 5 architecture, all disks must have their data within the stripe, or the stripe is inconsistent. Because there's no way to update all disks atomically, it's possible to produce the well-known RAID 5 write hole problem (where a stripe is inconsistent across the drives of the RAID set). Because ZFS writes are transactional and never performed in place, the write hole problem is eliminated. Another convenient quality of this approach is what happens when a disk fails and a rebuild is required. A traditional RAID 5 system uses data from the other disks in the set to rebuild data for the new drive. RAID-Z traverses the available metadata to read only the data that's relevant to the geometry, avoiding the unused space on the disk. This behavior becomes ever more important as disks become larger and rebuild times increase.

Checksums

Although data protection provides the ability to regenerate data after a failure, it says nothing about the validity of the data in the first place. ZFS addresses this issue by generating a 32-bit checksum (or 256-bit hash) for each block written and storing it with the block's metadata. When a block is read, its checksum is verified to catch silent data corruption. In a volume that has data protection (mirroring or RAID-Z), the alternate copy of the data can be read or the data regenerated automatically.

Standard approaches for integrity

The T10 standards committee defines a similar mechanism for end-to-end integrity called the Data Integrity Field (DIF). This mechanism adds a field containing a cyclic redundancy check of a block, along with other metadata, stored on disk to guard against silent data corruption. An interesting attribute of DIF is that you'll find hardware support for it in a number of storage controllers, so the process can be completely offloaded from the host processor.

Checksums are stored with metadata in ZFS, so phantom writes can be detected and—if data protection is provided (RAID-Z)—corrected.
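The checksum algorithm is a per-dataset property, so it can be strengthened where you want it. A brief sketch, assuming a hypothetical dataset named mypool/data:

$ sudo zfs set checksum=sha256 mypool/data   # use SHA-256 instead of the default checksum algorithm
$ sudo zfs get checksum mypool/data          # confirm the property (newly written blocks use it)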

Snapshots and clones

Given the copy-on-write nature of ZFS, features like snapshots and clones become simple to provide. Because ZFS never overwrites data but instead writes to a new location, older data can be preserved (although in the nominal case it is marked for removal to conserve disk space). A snapshot is a preservation of older blocks that maintains the state of a file system at a given instant in time. This approach is also space efficient, because no copy is required (unless all data in the file system is rewritten). A clone is a snapshot that is writable: the original unwritten blocks are shared by each clone, and blocks that are written become private to the specific file system clone.
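In practice, both operations are one-liners. The dataset name below (mypool/home) is hypothetical; the snapshot is read-only, while the clone is a writable file system backed by the snapshot's blocks:

$ sudo zfs snapshot mypool/home@monday                 # preserve the state of mypool/home at this instant
$ sudo zfs list -t snapshot                            # snapshots are listed alongside file systems
$ sudo zfs clone mypool/home@monday mypool/home-test   # writable clone that shares unmodified blocks
$ sudo zfs rollback mypool/home@monday                 # or revert the live file system to the snapshot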

Variable block sizes

Traditional file systems are made up of statically sized blocks that match the back-end storage (for example, 512 bytes). ZFS implements variable block sizes for a variety of uses (commonly up to 128KB in size, but you can change this value). One important use of variable block sizes is compression (because the resulting block size after compression will ideally be smaller than the original). This functionality minimizes waste in the storage system and also provides better utilization of the storage network (because sending less data to storage takes less time to transfer).

Outside of compression, supporting variable block sizes also means that you can tune the block size for the particular workload expected for improved performance.
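The block size is exposed through the recordsize property and can be set per file system. As a hypothetical example, a database doing 8KB random I/O is often better served by an 8KB record size than by the 128KB default (the change affects newly written files only):

$ sudo zfs set recordsize=8K mypool/db   # match the record size to the application's I/O size
$ sudo zfs get recordsize mypool/db      # verify the new value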

Other features

ZFS incorporates many other features, such as de-duplication (to minimize copies of data), configurable replication, encryption, an adaptive replacement cache for cache management, and online disk scrubbing (to identify latent errors and, where protection allows, repair them while they can still be fixed). It does this with immense scalability, supporting 16 exabytes (2⁶⁴ bytes) of addressable storage.
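Most of these features are again just properties or commands, although which of them are available depends on the pool version your ZFS implementation supports (de-duplication, for example, arrived in a later pool version than the one used in the demonstration that follows). A hedged sketch against a hypothetical pool:

$ sudo zfs set dedup=on mypool        # block-level de-duplication (newer pool versions only)
$ sudo zfs set copies=2 mypool/data   # configurable replication: keep two copies of each block
$ sudo zpool scrub mypool             # walk the pool, verifying checksums and repairing what it can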


Using ZFS on Linux today

Now that you've seen some of the abstract concepts behind ZFS, let's look at some of them in practice. This demonstration uses ZFS-FUSE. FUSE is a mechanism that allows you to implement file systems in user space without kernel code (other than the FUSE kernel module and existing file system code). The FUSE module provides a bridge from the kernel's file system interface to file system implementations running in user space. First, install the ZFS-FUSE package (the following demonstration targets Ubuntu).

Installing ZFS-FUSE

Installing ZFS-FUSE is simple, particularly on Ubuntu using apt. The following command line installs everything you need to begin using ZFS-FUSE:

$ sudo apt-get install zfs-fuse

This command line installs ZFS-FUSE and all dependent packages (mine also required libaio1), performs the necessary setup for the new packages, and starts the zfs-fuse daemon.
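If you'd like to confirm that the daemon is up before continuing, you can look for its process; the init script name below is an assumption and may differ between distributions:

$ pgrep -l zfs-fuse                 # the user-space daemon that implements the file system
$ sudo /etc/init.d/zfs-fuse start   # start it manually if it isn't running (path may vary)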

Using ZFS-FUSE

In this demonstration, you use the loop-back device to emulate disks as files within the host operating system. To begin, create these files (using /dev/zero as the source) with the dd utility (see Listing 1). With your four disk images created, use losetup to associate the disk images with the loop devices.

Listing 1. Setup for working with ZFS-FUSE
$ mkdir zfstest
$ cd zfstest
$ dd if=/dev/zero of=disk1.img bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (67 MB) copied, 1.235 s, 54.3 MB/s
$ dd if=/dev/zero of=disk2.img bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (67 MB) copied, 0.531909 s, 126 MB/s
$ dd if=/dev/zero of=disk3.img bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (67 MB) copied, 0.680588 s, 98.6 MB/s
$ dd if=/dev/zero of=disk4.img bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (67 MB) copied, 0.429055 s, 156 MB/s
$ ls
disk1.img  disk2.img  disk3.img  disk4.img
$ sudo losetup /dev/loop0 ./disk1.img 
$ sudo losetup /dev/loop1 ./disk2.img 
$ sudo losetup /dev/loop2 ./disk3.img 
$ sudo losetup /dev/loop3 ./disk4.img 
$

With four devices available to use as your block devices for ZFS (totaling 256MB in size), create your pool using the zpool command. You use the zpool command to manage ZFS storage pools, but as you'll see, you can use it for a variety of other purposes, as well. The following command requests a ZFS storage pool to be created with four devices and provides data protection with RAID-Z. You follow this command with a list request to provide data on your pool (see Listing 2).

Listing 2. Creating a ZFS pool
$ sudo zpool create myzpool raidz /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
$ sudo zfs list
NAME      USED  AVAIL  REFER  MOUNTPOINT
myzpool  96.5K   146M  31.4K  /myzpool
$

You can also investigate some of the attributes of your pool, as shown in Listing 3, which represent the defaults. Among other things, you can see the available capacity and portion used. (This code has been compressed for brevity.)

Listing 3. Reviewing the attributes of the storage pool
$ sudo zfs get all myzpool
NAME     PROPERTY              VALUE                  SOURCE
myzpool  type                  filesystem             -
myzpool  creation              Sat Nov 13 22:43 2010  -
myzpool  used                  96.5K                  -
myzpool  available             146M                   -
myzpool  referenced            31.4K                  -
myzpool  compressratio         1.00x                  -
myzpool  mounted               yes                    -
myzpool  quota                 none                   default
myzpool  reservation           none                   default
myzpool  recordsize            128K                   default
myzpool  mountpoint            /myzpool               default
myzpool  sharenfs              off                    default
myzpool  checksum              on                     default
myzpool  compression           off                    default
myzpool  atime                 on                     default
myzpool  copies                1                      default
myzpool  version               4                      -
...
myzpool  primarycache          all                    default
myzpool  secondarycache        all                    default
myzpool  usedbysnapshots       0                      -
myzpool  usedbydataset         31.4K                  -
myzpool  usedbychildren        65.1K                  -
myzpool  usedbyrefreservation  0                      -
$

Now, let's actually use the ZFS pool. First, create a new file system within your pool (it appears as a directory under the pool's mount point), and then enable compression within it (using the zfs set command). Next, copy a file into it. I've selected a file that's around 120KB in size so you can see the effect of ZFS compression. Note that your pool is mounted at the root, so treat it just like a directory within your root file system. Once the file is copied, you can list it to see that it's present (and apparently the same size as the original, because ls reports the logical file size). Using the du command, you can see that the file occupies about half its original size on disk, indicating that ZFS has compressed it. You can also look at the compressratio property to see how much your pool has been compressed (using the default compression algorithm). Listing 4 shows the compression.

Listing 4. Demonstrating compression with ZFS
$ sudo zfs create myzpool/myzdev
$ sudo zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
myzpool          139K   146M  31.4K  /myzpool
myzpool/myzdev  31.4K   146M  31.4K  /myzpool/myzdev
$ sudo zfs set compression=on myzpool/myzdev
$ ls /myzpool/myzdev/
$ sudo cp ../linux-2.6.34/Documentation/devices.txt /myzpool/myzdev/
$ ls -la ../linux-2.6.34/Documentation/devices.txt 
-rw-r--r-- 1 mtj mtj 118144 2010-05-16 14:17 ../linux-2.6.34/Documentation/devices.txt
$ ls -la /myzpool/myzdev/
total 5
drwxr-xr-x 2 root root      3 2010-11-20 22:59 .
drwxr-xr-x 3 root root      3 2010-11-20 22:55 ..
-rw-r--r-- 1 root root 118144 2010-11-20 22:59 devices.txt
$ du -ah /myzpool/myzdev/
60K	/myzpool/myzdev/devices.txt
62K	/myzpool/myzdev/
$ sudo zfs get compressratio myzpool
NAME     PROPERTY       VALUE  SOURCE
myzpool  compressratio  1.55x  -
$

Finally, let's look at the self-repair capabilities of ZFS. Recall that when you created your pool, you requested RAID-Z over the four devices. You can check the status of your pool using the zpool status command, as shown in Listing 5. As shown, you can see the elements of your pool (RAID-Z 1 with four devices).

Listing 5. Checking your pool status
$ sudo zpool status myzpool
  pool: myzpool
 state: ONLINE
 scrub: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	myzpool     ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    loop0   ONLINE       0     0     0
	    loop1   ONLINE       0     0     0
	    loop2   ONLINE       0     0     0
	    loop3   ONLINE       0     0     0

errors: No known data errors
$

Now, let's force an error into the pool. For this demonstration, go behind the scenes and corrupt the disk file that makes up the device (your disk4.img, represented in ZFS by the loop3 device). Use the dd command to simply zero out the entire device (see Listing 6).

Listing 6. Corrupting the ZFS pool
$ dd if=/dev/zero of=disk4.img bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (67 MB) copied, 1.84791 s, 36.3 MB/s
$

ZFS is currently unaware of the corruption, but you can force it to see the problem by requesting a scrub of the pool. As shown in Listing 7, ZFS now recognizes the corruption (of the loop3 device) and suggests an action to replace the device. Note also that the pool remains online, and you can still get to your data, as ZFS self-corrects through RAID-Z.

Listing 7. Scrubbing and checking the pool
$ sudo zpool scrub myzpool
$ sudo zpool status myzpool
  pool: myzpool
 state: ONLINE
status: One or more devices could not be used because the label is missing or
	invalid.  Sufficient replicas exist for the pool to continue
	functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: scrub completed after 0h0m with 0 errors on Sat Nov 20 23:15:03 2010
config:

	NAME        STATE     READ WRITE CKSUM
	myzpool     ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    loop0   ONLINE       0     0     0
	    loop1   ONLINE       0     0     0
	    loop2   ONLINE       0     0     0
	    loop3   UNAVAIL      0     0     0  corrupted data

errors: No known data errors
$ wc -l /myzpool/myzdev/devices.txt
3340 /myzpool/myzdev/devices.txt
$

As recommended, introduce a new device to your RAID-Z set to act as the new container. Begin by creating a new disk image and representing it as a device with losetup. Note that this process is similar to adding a new physical disk to the set. You then use zpool replace to exchange the corrupted device (loop3) with the new device (loop4). Checking the status of the pool, you can see your new device with a message indicating that data was rebuilt on it (called resilvering), along with the amount of data moved there. Note also that the pool remains online with no errors (visible to the user). To conclude, you scrub the pool again; after checking its status, you'll see that no issues exist, as shown in Listing 8.

Listing 8. Repairing the pool using zpool replace
$ dd if=/dev/zero of=disk5.img bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (67 MB) copied, 0.925143 s, 72.5 MB/s
$ sudo losetup /dev/loop4 ./disk5.img 
$ sudo zpool replace myzpool loop3 loop4
$ sudo zpool status myzpool
  pool: myzpool
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Sat Nov 20 23:23:12 2010
config:

	NAME        STATE     READ WRITE CKSUM
	myzpool     ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    loop0   ONLINE       0     0     0
	    loop1   ONLINE       0     0     0
	    loop2   ONLINE       0     0     0
	    loop4   ONLINE       0     0     0  59.5K resilvered

errors: No known data errors
$ sudo zpool scrub myzpool
$ sudo zpool status myzpool
  pool: myzpool
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Sat Nov 20 23:23:23 2010
config:

	NAME        STATE     READ WRITE CKSUM
	myzpool     ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    loop0   ONLINE       0     0     0
	    loop1   ONLINE       0     0     0
	    loop2   ONLINE       0     0     0
	    loop4   ONLINE       0     0     0

errors: No known data errors
$

This short demonstration explores the consolidation of volume management with a file system and shows how easy it is to administer ZFS (even in the face of failures).
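If you'd like to tear the demonstration environment back down afterward, the cleanup is just as brief. Note that zpool destroy is irreversible, so make sure you point it at the test pool:

$ sudo zpool destroy myzpool                  # destroy the pool and all of its datasets
$ for d in /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4; do sudo losetup -d $d; done
$ rm disk*.img                                # remove the backing files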


Other Linux-ZFS possibilities

The advantage of ZFS on FUSE is that it's simple to begin using ZFS, but it has the downside of not being as efficient as it could be. This lack of efficiency results from the multiple user-kernel transitions required per I/O. But given the popularity of ZFS, there is another option that provides greater performance.

A native port of ZFS to the Linux kernel is well under way at Lawrence Livermore National Laboratory. This port still lacks some elements, such as the ZFS POSIX (Portable Operating System Interface) Layer, but that piece is under development. The port already provides a number of useful features, particularly if you're interested in using ZFS with Lustre. (See Resources for details.)


Going further

Hopefully, this article has whetted your appetite to dig further into ZFS. As the earlier demonstration shows, you can easily get ZFS up and running on most Linux distributions (and even in the kernel, with some limitations). Topics such as snapshots and clones were only touched on briefly here, but the Resources section provides links to interesting articles on these topics. In the end, Linux and ZFS are state-of-the-art technologies, and it will be difficult to keep them apart.

Resources

Learn

Get products and technologies

  • Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.

Discuss

  • Get involved in the My developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.
