As an architect in the storage industry, I have an affinity to file systems. These systems are the user interfaces to storage systems, and although they all tend to offer a similar set of features, they also can provide notably different features. Ceph is no different, and it offers some of the most interesting features you'll find in a file system.
Ceph began as a PhD research project in storage systems by Sage Weil at the University of California, Santa Cruz (UCSC). But as of late March 2010, you can now find Ceph in the mainline Linux kernel (since 2.6.34). Although Ceph may not be ready for production environments, it's still useful for evaluation purposes. This article explores the Ceph file system and the unique features that make it an attractive alternative for scalable distributed storage.
Developing a distributed file system is a complex endeavor, but it's immensely valuable if the right problems are solved. Ceph's goals can be simply defined as:
- Easy scalability to multi-petabyte capacity
- High performance over varying workloads (input/output operations per second [IOPS] and bandwidth)
- Strong reliability
Unfortunately, these goals can compete with one another (for example, scalability can reduce or inhibit performance or impact reliability). Ceph has developed some very interesting concepts (such as dynamic metadata partitioning and data distribution and replication), which this article explores shortly. Ceph's design also incorporates fault-tolerance features to protect against single points of failure, with the assumption that storage failures on a large scale (petabytes of storage) will be the norm rather than the exception. Finally, its design does not assume particular workloads but includes the ability to adapt to changing distributed workloads to provide the best performance. It does all of this with the goal of POSIX compatibility, allowing it to be transparently deployed for existing applications that rely on POSIX semantics (through Ceph-proposed enhancements). Finally, Ceph is open source distributed storage and part of the mainline Linux kernel (2.6.34).
Now, let's explore the Ceph architecture and its core elements at a high level. I then dig down another level to identify some of the key aspects of Ceph to provide a more detailed exploration.
The Ceph ecosystem can be broadly divided into four segments (see Figure 1): clients (users of the data), metadata servers (which cache and synchronize the distributed metadata), an object storage cluster (which stores both data and metadata as objects and implements other key responsibilities), and finally the cluster monitors (which implement the monitoring functions).
Figure 1. Conceptual architecture of the Ceph ecosystem
As Figure 1 shows, clients perform metadata operations (to identify the location of data) using the metadata servers. The metadata servers manage the location of data and also where to store new data. Note that metadata is stored in the storage cluster (as indicated by "Metadata I/O"). Actual file I/O occurs between the client and object storage cluster. In this way, higher-level POSIX functions (such as open, close, and rename) are managed through the metadata servers, whereas POSIX functions (such as read and write) are managed directly through the object storage cluster.
Another perspective of the architecture is provided in Figure 2. A set of servers access the Ceph ecosystem through a client interface, which understands the relationship between metadata servers and object-level storage. The distributed storage system can be viewed in a few layers, including a format for the storage devices (the Extent and B-tree-based Object File System [EBOFS] or an alternative) and an overriding management layer designed to manage data replication, failure detection, and recovery and subsequent data migration called Reliable Autonomic Distributed Object Storage (RADOS). Finally, monitors are used to identify component failures, including subsequent notification.
Figure 2. Simplified layered view of the Ceph ecosystem
With the conceptual architecture of Ceph under your belts, you can dig down another level to see the major components implemented within the Ceph ecosystem. One of the key differences between Ceph and traditional file systems is that rather than focusing the intelligence in the file system itself, the intelligence is distributed around the ecosystem.
Figure 3 shows a simple Ceph ecosystem. The Ceph Client is the user of the Ceph file system. The Ceph Metadata Daemon provides the metadata services, while the Ceph Object Storage Daemon provides the actual storage (for both data and metadata). Finally, the Ceph Monitor provides cluster management. Note that there can be many Ceph clients, many object storage endpoints, numerous metadata servers (depending on the capacity of the file system), and at least a redundant pair of monitors. So, how is this file system distributed?
Figure 3. Simple Ceph ecosystem
As Linux presents a common interface to the file systems (through the virtual file system switch [VFS]), the user's perspective of Ceph is transparent. The administrator's perspective will certainly differ, given the potential for many servers encompassing the storage system (see the Resources section for information on creating a Ceph cluster). From the users' point of view, they have access to a large storage system and are not aware of the underlying metadata servers, monitors, and individual object storage devices that aggregate into a massive storage pool. Users simply see a mount point, from which standard file I/O can be performed.
The Ceph file system—or at least the client interface—is implemented in the Linux kernel. Note that in the vast majority of file systems, all of the control and intelligence is implemented within the kernel's file system source itself. But with Ceph, the file system's intelligence is distributed across the nodes, which simplifies the client interface but also provides Ceph with the ability to massively scale (even dynamically).
Rather than rely on allocation lists (metadata to map blocks on a disk to a given file), Ceph uses an interesting alternative. A file from the Linux perspective is assigned an inode number (INO) from the metadata server, which is a unique identifier for the file. The file is then carved into some number of objects (based on the size of the file). Using the INO and the object number (ONO), each object is assigned an object ID (OID). Using a simple hash over the OID, each object is assigned to a placement group. The placement group (identified as a PGID) is a conceptual container for objects. Finally, the mapping of the placement group to object storage devices is a pseudo-random mapping using an algorithm called Controlled Replication Under Scalable Hashing (CRUSH). In this way, mapping of placement groups (and replicas) to storage devices does not rely on any metadata but instead on a pseudo-random mapping function. This behavior is ideal, because it minimizes the overhead of storage and simplifies the distribution and lookup of data.
The final component for allocation is the cluster map. The cluster map is an efficient representation of the devices representing the storage cluster. With a PGID and the cluster map, you can locate any object.
The job of the metadata server (cmds) is to manage the file system's namespace. Although both metadata and data are stored in the object storage cluster, they are managed separately to support scalability. In fact, metadata is further split among a cluster of metadata servers that can adaptively replicate and distribute the namespace to avoid hot spots. As shown in Figure 4, the metadata servers manage portions of the namespace and can overlap (for redundancy and also for performance). The mapping of metadata servers to namespace is performed in Ceph using dynamic subtree partitioning, which allows Ceph to adapt to changing workloads (migrating namespaces between metadata servers) while preserving locality for performance.
Figure 4. Partitioning of the Ceph namespace for metadata servers
But because each metadata server simply manages the namespace for the population of clients, its primary application is an intelligent metadata cache (because actual metadata is eventually stored within the object storage cluster). Metadata to write is cached in a short-term journal, which eventually is pushed to physical storage. This behavior allows the metadata server to serve recent metadata back to clients (which is common in metadata operations). The journal is also useful for failure recovery: if the metadata server fails, its journal can be replayed to ensure that metadata is safely stored on disk.
Metadata servers manage the inode space, converting file names to metadata. The metadata server transforms the file name into an inode, file size, and striping data (layout) that the Ceph client uses for file I/O.
Ceph includes monitors that implement management of the cluster map, but some elements of fault management are implemented in the object store itself. When object storage devices fail or new devices are added, monitors detect and maintain a valid cluster map. This function is performed in a distributed fashion where map updates are communicated with existing traffic. Ceph uses Paxos, which is a family of algorithms for distributed consensus.
Similar to traditional object storage, Ceph storage nodes include not only storage but also intelligence. Traditional drives are simple targets that only respond to commands from initiators. But object storage devices are intelligent devices that act as both targets and initiators to support communication and collaboration with other object storage devices.
From a storage perspective, Ceph object storage devices perform the mapping of objects to blocks (a task traditionally done at the file system layer in the client). This behavior allows the local entity to best decide how to store an object. Early versions of Ceph implemented a custom low-level file system on the local storage called EBOFS. This system implemented a nonstandard interface to the underlying storage tuned for object semantics and other features (such as asynchronous notification of commits to disk). Today, the B-tree file system (BTRFS) can be used at the storage nodes, which already implements some of the necessary features (such as embedded integrity).
Because the Ceph clients implement CRUSH and do not have knowledge of the block mapping of files on the disks, the underlying storage devices can safely manage the mapping of objects to blocks. This allows the storage nodes to replicate data (when a device is found to have failed). Distributing the failure recovery also allows the storage system to scale, because failure detection and recovery are distributed across the ecosystem. Ceph calls this RADOS (see Figure 3).
As if the dynamic and adaptive nature of the file system weren't enough, Ceph also implements some interesting features visible to the user. Users can create snapshots, for example, in Ceph on any subdirectory (including all of the contents). It's also possible to perform file and capacity accounting at the subdirectory level, which reports the storage size and number of files for a given subdirectory (and all of its nested contents).
Although Ceph is now integrated into the mainline Linux kernel, it's properly noted there as experimental. File systems in this state are useful to evaluate but are not yet ready for production environments. But given Ceph's adoption into the Linux kernel and the motivation by its originators to continue its development, it should be available soon to solve your massive storage needs.
Ceph isn't unique in the distributed file system space, but it is unique in the way that it manages a large storage ecosystem. Other examples of distributed file systems include the Google File System (GFS), the General Parallel File System (GPFS), and Lustre, to name just a few. The ideas behind Ceph appear to offer an interesting future for distributed file systems, as massive scales introduce unique challenges to the massive storage problem.
Ceph is not only a file system but an object storage ecosystem with enterprise-class features. In the Resources section, you'll find information on how to set up a simple Ceph cluster (including metadata servers, object servers, and monitors). Ceph fills a gap in distributed storage, and it will be interesting to see how the open source offering evolves in the future.
- The Ceph creators' paper "Ceph: A
Scalable, High-Performance Distributed File System" (PDF) and Sage
Weil's PhD dissertation, "Ceph: Reliable,
Scalable, and High-Performance Distributed Storage" (PDF), reveal
the original ideas behind Ceph.
- The Storage Systems
Research Center's Petabyte-Scale Storage site offers
additional technical information about Ceph.
- Visit the Ceph home page for the latest information.
Controlled, Scalable, Decentralized Placement of Replicated Data"
and "RADOS: A Scalable, Reliable Storage Service for Petabyte-scale
Storage Clusters" (PDF) discuss two of the most interesting
aspects of the Ceph file system.
- "The Ceph
filesystem" on LWN.net provides an early take on the
Ceph file system (including a set of entertaining comments).
"Building a Small Ceph Cluster gives instructions
for building a Ceph cluster along with tips for distribution of assets. This article walks you through
getting the Ceph source, building a new kernel, and then deploying the
various elements of the Ceph ecosystem.
- At the Paxos Wikipedia page, learn more about how Ceph metadata servers utilize Paxos as a
consensus protocol among the distributed entities.
- In "Anatomy of the Linux virtual file system switch" (developerWorks,
August 2009), learn more about the VFS, a flexible mechanism that Linux
includes to allow multiple file systems to exist concurrently.
- In "Next-generation Linux file systems: NiLFS(2) and exofs"
(developerWorks, October 2009), learn more about exofs, another Linux file system that
utilizes object storage. exofs maps
object storage device-based storage into a traditional Linux file system.
- At the kernel wiki
site for BTRFS and in "Linux Kernel Advances" (developerWorks, March 2009), you can
learn how to use the BTRFS on individual object storage nodes.
In the developerWorks Linux zone,
find hundreds of how-to
and tutorials, as well as downloads, discussion forums,
and a wealth other resources for Linux developers and administrators.
Stay current with
developerWorks technical events and webcasts focused on a variety of IBM products and IT industry topics.
Attend a free developerWorks Live!
briefing to get up-to-speed quickly on IBM products and tools as well as IT industry trends.
Watch developerWorks on-demand demos
ranging from product installation and setup demos for beginners, to advanced functionality for experienced developers.
Follow developerWorks on Twitter, or subscribe
feed of Linux tweets on developerWorks.
Get products and technologies
Evaluate IBM products
in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the
learning how to implement Service Oriented Architecture efficiently.
Get involved in the My developerWorks community.
Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.
M. Tim Jones is an embedded firmware architect and the author of Artificial Intelligence: A Systems Approach, GNU/Linux Application Programming (now in its second edition), AI Application Programming (in its second edition), and BSD Sockets Programming from a Multilanguage Perspective. His engineering background ranges from the development of kernels for geosynchronous spacecraft to embedded systems architecture and networking protocols development. Tim is a Consultant Engineer for Emulex Corp. in Longmont, Colorado.