Ceph is a POSIX-compliant (Portable Operating System Interface), open source distributed storage system released under the GNU Lesser General Public License. Initially developed by Sage Weil in 2007, the project's philosophy is to provide a cluster without any single point of failure by ensuring permanent data replication across the cluster nodes.
As in any classical distributed file system, files put into the cluster are striped and placed on the cluster nodes according to a pseudo-random data-distribution algorithm known as Controlled Replication Under Scalable Hashing (CRUSH).
Ceph is an interesting storage alternative because of some of the concepts it implements, such as metadata partitioning, and a replication or placement group strategy that aggregates a series of objects into a group and then maps the group to a series of object storage daemons (OSDs).
These features enable self-scaling, self-healing, and self-managing clusters. Ceph provides (at different levels) the following bindings for interacting with your cluster:
- The Reliable Autonomic Distributed Object Store (RADOS) gateway is a Representational State Transfer (REST)-ful interface your applications can talk to for storing objects directly in the cluster.
- The librados library is a convenient way to access RADOS, with support for the PHP, Ruby, Java™, Python, C, and C++ programming languages.
- Ceph's RADOS block device (RBD) is a fully distributed block device with a Linux® kernel driver and a Quick EMUlator (QEMU)/Kernel-based Virtual Machine (KVM) driver.
- The native CephFS is a distributed file system that fully supports Filesystem in Userspace (FUSE).
As shown in Figure 1, the Ceph ecosystem is divided into five components:
- The RADOS gateway
- The librados library
- RBD
- CephFS
- The various nodes in the cluster
Figure 1. The Ceph ecosystem
The Ceph ecosystem supports many ways to interact with it natively, making its integration inside an already-running infrastructure easy and convenient even though it performs the rather complex task of delivering block and object storage in one unified project.
Next, let's see the building blocks Ceph is made of and the role each plays.
The RADOS object store
Figure 1 showed the RADOS object store as the foundation of the storage cluster. For every operation made through the numerous clients or gateways (RADOSGW, RBD, or CephFS), the data goes into RADOS or is read from it. Figure 2 shows the RADOS cluster, which is composed of two types of daemons: the Ceph object storage daemons (OSDs) and the Ceph monitors, which maintain the master copy of the cluster map.
Figure 2. The RADOS object store
The cluster map describes the physical location of the object chunks as well as a list of "buckets" that aggregate the devices into physical locations. The map is governed by Ceph's advanced placement algorithm, which models the logical location on top of the physical one. Figure 3 depicts the "pools" inside the cluster, the logical partitions for storing your objects. Each pool is dynamically mapped to OSDs.
Figure 3. The RADOS placement groups
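The pool-to-placement-group idea can be pictured as a deterministic hash of the object name within a pool. The following Python sketch is illustrative only; the function name and the use of MD5 are my assumptions (real Ceph uses the rjenkins hash and a "stable mod" mask, not a plain modulo):

```python
import hashlib

def object_to_pg(pool_id: int, obj_name: str, pg_num: int) -> str:
    """Map an object name to a placement-group id of the form <pool>.<pg>.

    Illustrative only: real Ceph hashes with rjenkins and masks the
    result rather than using MD5 and a plain modulo as done here.
    """
    h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
    return f"{pool_id}.{h % pg_num:x}"

# The mapping is deterministic: every client computes the same PG.
print(object_to_pg(3, "instance-0001.disk", 64))
```

Because the PG is computed rather than looked up, any client can locate an object's group from the object name and the pool parameters alone.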
Now, let's look at the first set of daemons, the OSDs, then the monitors, and finally Ceph's metadata servers, which belong to the CephFS distributed file system.
An OSD is the daemon that reads and writes data on a file system and provides access to it over the cluster network. For the cluster to operate fully, the Ceph developers recommend either XFS (the Silicon Graphics journaling file system) or the B-tree file system (Btrfs) as the file system for object storage. The fourth extended file system (ext4) is also a possibility, but it doesn't provide the features XFS and Btrfs provide for Ceph.
In this example, XFS was deployed on all storage nodes. Figure 4 shows how the Ceph OSD interacts with physical storage.
Figure 4. The RADOS OSDs
In the RADOS cluster, the Ceph monitor daemons (ceph-mon) reside next to the OSDs. The monitors are the daemons the clients communicate with to manipulate the data stored inside the cluster. This is one of the innovative approaches Ceph proposes: Instead of contacting a centralized metadata server that manages access to the data cluster, the lightweight monitor daemons deliver the cluster map to the clients, which then communicate directly with the OSDs. ceph-mon also manages the consistency of the data inside the cluster. The monitors act according to the Paxos consensus protocol; running at least three instances of ceph-mon is a prerequisite for your cluster setup.
Figure 5 shows the way your clients interact with the cluster through the monitor daemons.
Figure 5. The RADOS monitors
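The recommendation to run at least three ceph-mon instances follows from Paxos majority arithmetic: a quorum needs a strict majority of monitors, so two monitors tolerate no failures while three tolerate one. A minimal Python sketch (illustrative arithmetic only, not Ceph code):

```python
def quorum_size(n_monitors: int) -> int:
    """Smallest strict majority of n monitors (the Paxos quorum)."""
    return n_monitors // 2 + 1

def failures_tolerated(n_monitors: int) -> int:
    """Monitors that can fail while a quorum can still form."""
    return n_monitors - quorum_size(n_monitors)

for n in (1, 2, 3, 5):
    print(n, quorum_size(n), failures_tolerated(n))
# 2 monitors tolerate 0 failures; 3 tolerate 1; 5 tolerate 2.
```

This is why adding a second monitor alone buys no resilience: the jump from one to three is the first step that lets the cluster survive a monitor failure.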
The metadata servers
The last stack Ceph uses is the Ceph metadata server, exposed through the ceph-mds daemon, which stores the metadata for CephFS.
The metadata is the same you'd find in other file systems, data such as file owner, timestamps, permissions, etc. The metadata daemons expose a POSIX-compliant distributed file system and store the metadata inside RADOS.
Note that the metadata server itself does not serve files to clients; this removes any single points of failure inside your cluster.
Figure 6 shows the role ceph-mds plays when you use CephFS.
Figure 6. The Ceph metadata servers
The CRUSH algorithm
Ceph CRUSH (for Controlled Replication Under Scalable Hashing) is the algorithm responsible for data placement and retrieval within your cluster. Both storage clients and OSDs use CRUSH to compute data placement and distribution rather than depending on a central lookup table, which would introduce a single point of failure into the cluster. In this way, CRUSH alleviates the cluster workload by distributing the work to the clients and OSDs in the cluster.
Given the nature of the algorithm, the placement can be deterministically computed by the clients themselves, which removes the need to maintain a highly available placement map of the cluster objects. In other words, your cluster carries less load than a classical cluster.
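The effect of deterministic, table-free placement can be illustrated with rendezvous (highest-random-weight) hashing, a simpler cousin of CRUSH. Everything below is a sketch under my own assumptions (the names and SHA-1 scoring are not Ceph's actual algorithm): any client holding the same OSD list computes the same replica set, and removing an unrelated OSD leaves existing placements untouched.

```python
import hashlib

def place_object(obj_name: str, osds: list[str], replicas: int = 3) -> list[str]:
    """Pick `replicas` OSDs for an object by scoring every (object, OSD)
    pair with a hash and keeping the top scorers. No lookup table is
    needed: every client recomputes the identical placement."""
    return sorted(
        osds,
        key=lambda osd: hashlib.sha1(f"{obj_name}:{osd}".encode()).hexdigest(),
        reverse=True,
    )[:replicas]

osd_map = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]
print(place_object("instance-0001.disk", osd_map))
```

The real CRUSH algorithm adds weighted buckets and failure-domain rules on top of this idea, but the core property is the same: placement is a pure function of the object name and the cluster map.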
Topology and infrastructure awareness is another innovative feature of CRUSH. The OSDs are logically nested into a hierarchy of components such as racks or switches, which makes it possible to isolate a zone of faulty hardware or to distribute data based on client proximity. The CRUSH map describes the placement of objects that both OSDs and clients compute; it is maintained by the lightweight node monitors (the ceph-mon daemons), whose only job is the adjustment and propagation of that map when the infrastructure changes. This scalable model is the opposite of classical data cluster models, in which clients usually do nothing but request the data while the cluster performs all the complex placement computing. Finally, the metadata is handled by the metadata servers and accessed by clients.
The next section provides an example of CephFS usage and integration within a running OpenStack cloud. The first part covers the deployment of CephFS as shared storage for your instances; the second part shows how the Glance imaging service can natively store and retrieve images in RADOS.
CephFS as shared instance storage in the cloud
You can integrate your Ceph cluster easily, in many ways, because of its numerous gateways. For example:
- You can use Ceph as a back end for the instances directory using the native CephFS.
- Or as a back end for your Glance imaging repository (Ceph is now integrated as a pipeline for Glance).
- Or even as a strong and reliable base for your persistent volumes back end (native integration is possible, as well).
The Ceph community's huge effort toward transparent integration makes Ceph a way not only to secure your cloud data, but also to provide a homogeneous solution for management and administration. This gives administrators an opportunity to come up with creative implementations without sacrificing performance or risking instability, since Ceph is designed to eliminate single points of failure.
In this section, two types of implementation are described: using CephFS as a back end for your instances, and the native integration of Ceph into Glance. For this article, I assume that you already have two servers dedicated to your OpenStack instances, that both are able to mount and simultaneously access the CephFS disk, and that you already have a healthy, running Ceph cluster. My setup is based on Ubuntu Server 12.04 (Precise), but CephFS is available on many Linux platforms.
Install the components CephFS requires by running the following command on both compute nodes:
sudo aptitude install ceph-fs-common ceph-fuse
This command installs all the dependent packages. You need a Ceph admin key; for testing purposes, I use the admin user account, but in a production environment you should create a dedicated user. To retrieve your key, run this command:
sudo ceph-authtool --print-key /etc/ceph/keyring.admin
AQDVGc5P0LXzIhAA5C020gbdrgypSFGUpG2cqQ==
To stop the compute services on your nodes, run this command:
sudo service nova-compute stop; sudo service libvirt-bin stop
Mount the CephFS volume temporarily so you can copy the content of your instances directory to it:
sudo mount -t ceph ip-of-ceph-mon1:6789,ip-of-ceph-mon-X:6789:/ /mnt/ \
    -o name=admin,secret=AQDVGc5P0LXzIhAA5C020gbdrgypSFGUpG2cqQ==
If you use the QEMU Copy On Write (Qcow2) format for your base images, the following commands will do the trick:
sudo mkdir /mnt/_base
for i in $( ls /var/lib/nova/instances/_base/ ); do \
    sudo qemu-img convert -O qcow2 /var/lib/nova/instances/_base/$i /mnt/_base/$i; done
sudo cp -r /var/lib/nova/instances/instance-* /mnt
You can now unmount /mnt and remount the CephFS volume at the correct location:
sudo umount /mnt
sudo mount -t ceph ip-of-ceph-mon1:6789,ip-of-ceph-mon-X:6789:/ /var/lib/nova/instances/ \
    -o name=admin,secret=AQDVGc5P0LXzIhAA5C020gbdrgypSFGUpG2cqQ==
sudo chown nova. /var/lib/nova/instances
sudo service libvirt-bin start; sudo service nova-compute start
Voila! You now have highly available shared storage between two compute nodes, making features such as migration or high availability recovery scenarios possible.
The next section shows how to launch a migration from the first compute node to the second.
Live-migrate your instances like a champ
Now that you have shared storage for your instances, you can initiate a live migration between nodes. This is useful if you want to lighten the load on a compute node. Make sure you followed the procedure to enable live migration (check Resources for a link to that information). Retrieve your instance ID with nova list, and initiate the live migration:
nova live-migration 0a2419bf-9254-4e02-98d4-98ef66c43d43 compute-node-B
After a couple of seconds, the instance should be running on the second compute node.
Next, let's look at another interesting Ceph integration case inside your infrastructure.
Glance, the imaging service, is able to use multiple back-end storage systems for its images. Edit /etc/glance/glance-api.conf and update the following configuration options to enable integration between Glance and RADOS:
default_store = rbd
rbd_store_user = ceph-glance
rbd_store_pool = glance-images
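For reference, a slightly fuller glance-api.conf fragment might look like the following. The rbd_store_ceph_conf and rbd_store_chunk_size option names are my assumptions based on the Glance RBD store of that era; check the documentation for your Glance version before relying on them:

```ini
# Store images in RADOS via the RBD back end
default_store = rbd
# Cephx user and pool created for Glance (names used in this article)
rbd_store_user = ceph-glance
rbd_store_pool = glance-images
# Assumed additional options: path to the Ceph config and chunk size in MB
rbd_store_ceph_conf = /etc/ceph/ceph.conf
rbd_store_chunk_size = 8
```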
Next, create a Ceph pool and user:
sudo rados mkpool glance-images
sudo ceph-authtool --create-keyring /etc/ceph/openstack/glance.keyring
sudo ceph-authtool --gen-key --name client.ceph-glance --cap mon 'allow r' \
    --cap osd 'allow rwx pool=glance-images' /etc/ceph/openstack/glance.keyring
sudo ceph auth add client.ceph-glance -i /etc/ceph/openstack/glance.keyring
sudo chown glance:glance /etc/ceph/openstack/glance.keyring
Restart the Glance services:
cd /etc/init.d; for i in $( ls glance-* ); do sudo service $i restart; done
You can now upload images that will be directly distributed and placed within your Ceph cluster!
I introduced you to a Ceph cluster and provided a description of the role of each component in the cluster. The innovative approach Ceph proposes not only makes modeling a highly available and reliable storage architecture possible, it can also empower your infrastructure by scaling out easily.
Ceph is an active project, mature enough to be considered a solid choice for a distributed storage solution. The project is one creative answer for companies looking for an efficient and complete solution to manage their data — from securing to expanding storage offerings, organizations can take advantage of the numerous libraries and gateways Ceph exposes to offer their customers new products and ways to manage their data.
- Resources to expand your knowledge on this topic:
- "Ceph: petabyte-scale storage for large and small deployments" by Sage Weil covers the goals and features of Ceph.
- "Ceph: A Linux petabyte-scale distributed file system" by M. Tim Jones explores in detail the Ceph architecture by presenting the concepts and underlying mechanisms.
- The official OpenStack documentation on live migrations assists you in the configuration of your cloud for live migrations.
- The whitepaper "CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data" (PDF) dives into the CRUSH data-distribution algorithm.
- If you remember nothing else, remember Jim O'Reilly's note that "Ceph uniquely delivers object, block and file storage in one unified system" (from his column, "Why IT Pros Should Check Out Ceph", Enterprise Conversations, Feb. 2013).
- "Cloud computing and storage with OpenStack" by M. Tim Jones presents the benefits of using OpenStack over other Infrastructure as a Service solutions.
- Visit the official IBM OpenStack blog.
- In the developerWorks Cloud computing zone, discover and share knowledge and experience of application and services from developers building their projects for cloud deployment.
- Explore other OpenStack articles at developerWorks.
- Resources to help you complete the tasks in this article:
- The official Ceph website provides all the resources and materials you need to learn and discover Ceph and its components.
- The OpenStack website is the unique source for information on the OpenStack family of projects, news on community projects, documentation, and everything else related to OpenStack.
Get products and technologies
- Get the latest Ceph version, and start to deploy your first distributed storage system.
- Download Ubuntu Precise Pangolin, the Ubuntu server distribution, from the official Ubuntu website.
- Access IBM SmartCloud Enterprise.
- Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.
- Start to discuss, share, or explore Ceph in the IBM developerWorks community.