Integrate a Ceph storage cluster within an OpenStack cloud

Discover Ceph, the open source distributed storage system that empowers your OpenStack environment

For companies deploying and offering cloud computing services such as Infrastructure as a Service (IaaS), data replication and storage mechanisms remain a de facto prerequisite to ensure both integrity and service continuity for their customers. Cloud computing offers a model in which the location of data is not as important as it is in other infrastructure models (such as those in which the company directly owns expensive storage hardware). Ceph is an open source, unified, distributed storage system that offers a convenient way to deploy a low-cost, massively scalable storage platform on commodity hardware. Discover how a Ceph cluster — presenting object, block, and file storage from a single point — is created, how its algorithms and replication mechanisms work, and how it can be integrated with your cloud data architectures and models. The author proposes a simple yet powerful approach for integrating a Ceph cluster within an OpenStack ecosystem.

Razique Mahroua (razique.mahroua@gmail.com), Cloud computing consultant, Independent

Razique Mahroua is a systems administrator and consultant for a hosting company specializing in cloud solutions. Currently involved in several open source projects, he is part of the official OpenStack documentation core team. His experience ranges from cloud solutions and implementations (IaaS, PaaS) and related areas such as data clustering to network high availability and data integrity. He currently assists several companies looking for best practices around cloud solutions.



23 April 2013

Ceph is a POSIX-compliant (Portable Operating System Interface), open source distributed storage system released under the GNU Lesser General Public License. Initially developed by Sage Weil in 2007, the project aims to provide a cluster with no single point of failure by ensuring permanent data replication across the cluster nodes.

As in any classical distributed file system, files put into the cluster are striped and placed across the cluster nodes according to a pseudo-random data-distribution algorithm known as Controlled Replication Under Scalable Hashing (CRUSH).

Ceph is an interesting storage alternative because of some of the concepts it implements, such as metadata partitioning and a placement-group replication strategy that aggregates a series of objects into a group, which is then mapped to a series of object storage daemons (OSDs).

These features enable self-scaling, self-healing, and self-managing clusters, and they provide (on different levels) ways to interact with your Ceph cluster through the following bindings:

  • The Reliable Autonomic Distributed Object Store (RADOS) gateway is a Representational State Transfer (REST)-ful interface your applications can talk to in order to store objects directly in the cluster.
  • The librados library is a convenient way to access RADOS, with support for the PHP, Ruby, Java™, Python, and C/C++ programming languages (a quick sketch of the rados command-line tool, a thin wrapper over this library, follows this list).
  • Ceph's RADOS block device (RBD) is a fully distributed block device that uses a Linux® kernel module and a Quick EMUlator (QEMU)/Kernel-based Virtual Machine (KVM) driver.
  • The native CephFS is a distributed file system that fully supports Filesystem in Userspace (FUSE).
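
For example, here is a minimal sketch of storing and retrieving an object directly in RADOS with the rados command-line tool; the pool name data, the object name my-object, and the file names are assumptions for illustration only:

# store a local file as an object in an (assumed) pool named "data"
sudo rados -p data put my-object ./local-file.txt

# read the object back and list the contents of the pool
sudo rados -p data get my-object /tmp/my-object.txt
sudo rados -p data ls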

As shown in Figure 1, the Ceph ecosystem is divided into five components:

  • librados library
  • RADOS gateway
  • RBD
  • CephFS
  • Various nodes in the cluster
Figure 1. The Ceph ecosystem
Image presenting the Ceph components

The Ceph ecosystem supports many ways to interact with it natively, which makes its integration into an already-running infrastructure easy and convenient, even though it performs the rather complex task of delivering block and object storage in one unified project.

Next, look at the building blocks Ceph is made of and the role each one plays.

The RADOS object store

Figure 1 showed the RADOS object store as the foundation of the storage cluster. Every operation made through the numerous clients or gateways (RADOSGW, RBD, or CephFS) writes data into RADOS or reads data from it. Figure 2 shows the RADOS cluster, which consists of two types of daemons: the Ceph object storage daemons (OSDs) and the Ceph monitors, which maintain the master copy of the cluster map.

Figure 2. The RADOS object store
Image presenting the RADOS object store components

The cluster map describes the physical location of the object chunks, as well as a list of "buckets" that aggregate the devices into physical locations. The map is governed by Ceph's advanced placement algorithm, which models the logical locations on top of the physical ones. Figure 3 depicts the "pools" inside the cluster, the logical partitions in which your objects are stored. Each pool is dynamically mapped to OSDs.

Figure 3. The RADOS placement groups
Image presenting the RADOS placement group algorithm
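
You can inspect and create pools from any node that holds an admin keyring; in this quick sketch, the pool name app-data and the placement-group count of 128 are examples only:

# list the pools that already exist in the cluster
sudo ceph osd lspools

# create a new pool (name and placement-group count are examples)
sudo ceph osd pool create app-data 128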

Now, let's look at the first set of daemons — the OSDs, then the monitors, and finally Ceph's metadata servers which belong to the CephFS distributed file system.


The OSDs

An OSD is the daemon that reads and writes data to a local file system and provides access to that data over the cluster network. For the cluster to operate fully, the Ceph developers recommend either XFS (the journaling file system originally developed by Silicon Graphics) or the B-tree file system (Btrfs) as the file system backing object storage. The fourth extended file system (ext4) is also a possibility, but it doesn't provide the features that XFS and Btrfs provide for Ceph.

In this example, XFS was deployed on all storage nodes. Figure 4 shows how the Ceph OSD interacts with physical storage.

Figure 4. The RADOS OSDs
Image presenting the RADOS OSDs
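
For reference, preparing a storage node's disk for an OSD might look like the following sketch; the device name /dev/sdb1, the OSD ID 0, and the mount point are assumptions that follow Ceph's default layout:

# format the dedicated data disk with XFS (device name is an example)
sudo mkfs.xfs -f /dev/sdb1

# mount it where the OSD with ID 0 expects its data directory
sudo mkdir -p /var/lib/ceph/osd/ceph-0
sudo mount /dev/sdb1 /var/lib/ceph/osd/ceph-0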

The monitors

The Paxos consensus protocols

Paxos is a family of protocols for solving consensus in a network of unreliable processors; consensus is the process of agreeing on one result among a group of participants, and it becomes a hard problem when participants or their means of communication experience failures. Paxos offers a range of trade-offs among the number of processors, the number of message delays before the agreed value is learned, the activity level of individual participants, the number of messages sent, and the types of failures tolerated. Paxos is normally used when durability is required and the amount of durable state can be large, for example when replicating a file or a database.

In the RADOS cluster, the Ceph monitor daemons (ceph-mon) reside next to the OSDs. The monitors are the daemons the clients communicate with to manipulate the data stored in the cluster. This is one of the innovative approaches Ceph proposes: instead of contacting a centralized metadata server that manages access to the data cluster, these lightweight daemons deliver the cluster map to the clients and handle all the communication with external applications.

The ceph-mon daemons also manage the consistency of the data inside the cluster. The monitors reach agreement using the Paxos consensus protocol; running at least three instances of ceph-mon is a prerequisite for your cluster setup.
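
A quick way to see the monitors at work is to query their status and quorum; a minimal sketch run from any node that holds an admin keyring:

# show the monitor map and which monitors are currently in quorum
sudo ceph mon stat
sudo ceph quorum_status

# overall cluster status, including the state of the monitor quorum
sudo ceph -s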

Figure 5 presents an image of the way your clients interact with the cluster through monitor daemons.

Figure 5. The RADOS monitors
Image presenting the RADOS monitor daemons

The metadata servers

Why CephFS is not considered production ready

Although all the other Ceph ecosystem components are considered production ready, CephFS isn't, mainly because of the way its metadata daemons work. Unlike Ceph's other building blocks, they are not highly available: a standby node takes over only if the active node fails (an Active/Passive approach).

The last building block is the Ceph metadata server, exposed through the ceph-mds daemon, which stores the metadata for CephFS.

The metadata is the same you'd find in other file systems: data such as file ownership, timestamps, and permissions. The metadata daemons expose a POSIX-compliant distributed file system and store the metadata inside RADOS.

Note that the metadata server itself does not serve file data to clients; this keeps it out of the data path and removes a potential single point of failure from your cluster.
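
You can check the state of the metadata servers at any time; a minimal sketch:

# show the state of the metadata server cluster (active and standby daemons)
sudo ceph mds stat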

Figure 6 shows the role ceph-mds plays when you use CephFS.

Figure 6. The Ceph metadata servers
Image showing the Ceph metadata servers

The CRUSH algorithm

Ceph CRUSH (Controlled Replication Under Scalable Hashing) is the algorithm responsible for data placement and retrieval within your cluster. Both storage clients and OSDs use CRUSH to compute data placement and distribution rather than depending on a central lookup table, which would likely introduce a single point of failure into the cluster. In this way, CRUSH lightens the cluster workload by distributing the work to the clients and OSDs themselves.

Given the nature of the algorithm, placement can be computed deterministically by the clients themselves, which removes the need to maintain a highly available placement map of the cluster objects. In other words, your cluster carries less load than a classical cluster would.

Topology and infrastructure awareness is another innovative feature of CRUSH. The OSDs are logically nested into a hierarchy of components such as racks or switches, which makes it possible to isolate a zone of faulty hardware or to distribute data based on client proximity. The CRUSH map describes the placement rules that both OSDs and clients compute against; it is maintained by the lightweight monitors (ceph-mon daemons), whose only job is to adjust and propagate that map when the infrastructure changes. This scalable model is the opposite of classical data cluster models, in which clients usually do nothing but request the data while the cluster performs all the complex placement computing. Finally, the metadata is handled by the metadata servers and accessed by the clients.
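
To see how CRUSH is configured in your own cluster, you can extract and decompile the CRUSH map, or ask where a given object would be placed; in this sketch, the pool name data and the object name my-object are assumptions:

# export the binary CRUSH map and decompile it into a readable text file
sudo ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# show which placement group and OSDs an object would map to
sudo ceph osd map data my-object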

The next sections provide examples of Ceph usage and integration within a running OpenStack cloud. The first covers the deployment of CephFS as shared storage for your instances; the second shows how the Glance image service can natively store and retrieve images from RADOS.


CephFS as shared instance storage in the cloud

You can integrate your Ceph cluster easily, and in many ways, because of its numerous gateways. For example, you can use Ceph:

  • As a back end for the instances directory, using the native CephFS.
  • As storage for your Glance image repository (Ceph is now integrated as a back end for Glance).
  • As a strong and reliable base for your persistent volumes back end (native integration is possible as well).

The Ceph community's huge effort toward transparent integration turns Ceph into a way not only to secure your cloud data, but also to provide a homogeneous solution for management and administration. This gives administrators an opportunity to come up with creative implementations without sacrificing performance or risking instability, since Ceph is designed to avoid single points of failure.

In this section, two types of implementation are described: using CephFS as a back end for your instances, and the native integration of Ceph into Glance. For this article, I assume that you already have two servers dedicated to your OpenStack instances and that both are able to mount and simultaneously access the CephFS file system. I also assume that you already have a running, healthy Ceph cluster. My setup is based on Ubuntu Server 12.04 (Precise), but CephFS is available on many Linux platforms.
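
Before going further, it is worth confirming that the cluster really is healthy; a quick check from any node that holds an admin keyring:

# both commands should report HEALTH_OK for a healthy cluster
sudo ceph health
sudo ceph -s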

Install the components CephFS requires by running the following command on both compute nodes:

sudo aptitude install ceph-fs-common ceph-fuse

All the dependent packages are installed. You need a Ceph admin key — for testing purposes, I use the admin user account, but in a production environment you should create a dedicated user. To retrieve your key, run this command:

sudo ceph-authtool --print-key /etc/ceph/keyring.admin
AQDVGc5P0LXzIhAA5C020gbdrgypSFGUpG2cqQ==
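
For a production setup, you could instead create a dedicated CephFS user, mirroring the approach used later in this article for Glance; the client name nova-instances, the keyring path, the capabilities, and the assumption that the default CephFS data pool is named data are all examples:

# generate a key for a dedicated client and register it with the cluster
sudo ceph-authtool --create-keyring /etc/ceph/openstack/nova.keyring
sudo ceph-authtool --gen-key --name client.nova-instances --cap mon 'allow r' \
--cap osd 'allow rwx pool=data' --cap mds 'allow' /etc/ceph/openstack/nova.keyring
sudo ceph auth add client.nova-instances -i /etc/ceph/openstack/nova.keyring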

To stop the compute services on your nodes, run the command:

sudo service nova-compute stop; sudo service libvirt-bin stop

Temporarily mount the CephFS volume, copy the content of your instances directory onto it, and then remount it in the location nova-compute uses. Start by mounting CephFS under /mnt:

sudo mount -t ceph ip-of-ceph-mon1:6789,ip-of-ceph-mon-X:6789:/ \
/mnt/ -o name=admin,secret=AQDVGc5P0LXzIhAA5C020gbdrgypSFGUpG2cqQ==

If you use QEMU Copy On Write (Qcow2) format for your base images, the following command will do the trick:

sudo mkdir /mnt/_base && for i in $( ls /var/lib/nova/instances/_base/ ); \
do sudo qemu-img convert -O qcow2 /var/lib/nova/instances/_base/$i /mnt/_base/$i; done

sudo cp -r /var/lib/nova/instances/instance-* /mnt

You can now unmount /mnt and remount the CephFS volume in the correct location:

sudo umount /mnt
sudo mount -t ceph ip-of-ceph-mon1:6789,ip-of-ceph-mon-X:6789:/ \
/var/lib/nova/instances/ -o name=admin,secret=AQDVGc5P0LXzIhAA5C020gbdrgypSFGUpG2cqQ==

sudo chown -R nova:nova /var/lib/nova/instances; \
sudo service libvirt-bin start; sudo service nova-compute start
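
To make this mount survive a reboot, you could also add an entry to /etc/fstab; a sketch, assuming the admin secret has been saved to /etc/ceph/admin.secret:

# /etc/fstab entry (monitor addresses and secret file path are examples)
ip-of-ceph-mon1:6789,ip-of-ceph-mon-X:6789:/ /var/lib/nova/instances ceph name=admin,secretfile=/etc/ceph/admin.secret,noatime 0 2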

Voila! You now have highly available shared storage between two compute nodes, making features such as migration or high availability recovery scenarios possible.

The next section shows how to launch a migration from the first compute node to the second.


Live-migrate your instances like a champ

Now that you have shared storage for your instances, you can initiate a live migration between nodes. This is useful if you want to lighten the load on a compute node. Make sure you followed the procedure to enable live migration (check Resources for a link to that information). Retrieve your instance ID with nova list, and then initiate the live migration:

nova live-migration 0a2419bf-9254-4e02-98d4-98ef66c43d43 compute-node-B

After a couple of seconds, the instance should be running on the second compute node.
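
You can confirm which compute node now hosts the instance; a quick check, assuming you run the client with administrative credentials:

# the OS-EXT-SRV-ATTR:host field shows the current compute node
nova show 0a2419bf-9254-4e02-98d4-98ef66c43d43 | grep OS-EXT-SRV-ATTR:host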

Next, look at another interesting integration case for Ceph inside your infrastructure.


RADOS-Glance integration

Glance, the image service, can use multiple back-end storage systems for its images. Edit /etc/glance/glance-api.conf and update the following configuration options to enable the integration between Glance and RADOS:

default_store = rbd
rbd_store_user = ceph-glance
rbd_store_pool = glance-images

Next, create a Ceph pool and user:

sudo rados mkpool glance-images
sudo ceph-authtool --create-keyring /etc/ceph/openstack/glance.keyring

sudo ceph-authtool --gen-key --name client.ceph-glance --cap mon 'allow r' \
--cap osd 'allow rwx pool=glance-images' /etc/ceph/openstack/glance.keyring

sudo ceph auth add client.ceph-glance -i /etc/ceph/openstack/glance.keyring
sudo chown glance:glance /etc/ceph/openstack/glance.keyring
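
Glance also needs to be able to find that keyring through the Ceph configuration on the Glance host; one way to do this is a client section in /etc/ceph/ceph.conf (the path below matches the keyring created above, but your layout may differ):

[client.ceph-glance]
    keyring = /etc/ceph/openstack/glance.keyring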

Restart the Glance services:

cd /etc/init.d; for i in $( ls glance-* ); do
sudo service $i restart; done

You can now upload images that will be directly distributed and placed within your Ceph cluster!
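
For example, uploading a small test image and then listing the objects Ceph now stores for Glance might look like the following sketch; the image name and file are assumptions:

# upload a (hypothetical) QCOW2 image through Glance
glance image-create --name cirros-test --disk-format qcow2 \
--container-format bare --is-public True < cirros-0.3.0-x86_64-disk.img

# the image data should now appear as objects in the Glance pool
sudo rados -p glance-images ls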


Conclusion

I introduced you to a Ceph cluster and described the role of each component in the cluster. The innovative approach Ceph proposes not only makes it possible to model a highly available and reliable storage architecture; it can also empower your infrastructure by scaling out easily.

Ceph is an active project, mature enough to be considered a solid choice for a distributed storage solution. The project is one creative answer for companies looking for an efficient and complete way to manage their data: from securing data to expanding storage offerings, organizations can take advantage of the numerous libraries and gateways Ceph exposes to offer their customers new products and new ways to manage their data.

Resources
