BlueStore

BlueStore is the next-generation storage implementation for Ceph. As the market for storage devices now includes solid-state drives (SSDs) and non-volatile memory over PCI Express (NVMe), their use in Ceph reveals some of the limitations of the FileStore storage implementation. While FileStore has many improvements to facilitate SSD and NVMe storage, other limitations remain: increasing the number of placement groups is still computationally expensive, and the double write penalty persists.

Whereas FileStore interacts with a file system on a block device, BlueStore eliminates that layer of indirection and directly consumes a raw block device for object storage. BlueStore uses the very lightweight BlueFS file system on a small partition for its key/value databases. BlueStore eliminates the paradigm of a directory representing a placement group, a file representing an object, and file XATTRs representing metadata. BlueStore also eliminates the double write penalty of FileStore, so write operations are nearly twice as fast with BlueStore under most workloads.

BlueStore stores data as:

Object Data

In BlueStore, Ceph stores objects as blocks directly on a raw block device. The portion of the raw block device that stores object data does NOT contain a file system. Omitting the file system eliminates a layer of indirection and thereby improves performance. However, much of the BlueStore performance improvement comes from the block database and the write-ahead log.

Block Database

In BlueStore, the block database handles the object semantics to guarantee Consistency. An object’s unique identifier is a key in the block database. Each value in the block database consists of a series of block addresses that refer to the stored object data, together with the object’s placement group and the object metadata. The block database may reside on a BlueFS partition on the same raw block device that stores the object data, or it may reside on a separate block device, typically when the primary block device is a hard disk drive (HDD) and placing the database on an SSD or NVMe device will improve performance.

The block database provides a number of improvements over FileStore. The key/value semantics of BlueStore do not suffer from the limitations of file system XATTRs. BlueStore can also assign objects to other placement groups quickly within the block database, without the overhead of moving files from one directory to another, as is the case in FileStore.

BlueStore also introduces new features. The block database can store the checksum of the stored object data and its metadata, allowing full data checksum operations for each read, which is more efficient than periodic scrubbing to detect bit rot. BlueStore can compress an object, and the block database can store the algorithm used to compress it, ensuring that read operations select the appropriate algorithm for decompression.
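The key/value model described above can be illustrated with a small Python sketch. This is a toy model only, not BlueStore's implementation or on-disk format: the names ToyStore, ObjectRecord and BLOCK_SIZE, the 4096-byte block size, the use of CRC-32, and the use of zlib are all assumptions chosen for illustration. A bytearray stands in for the raw block device and a dict stands in for the block database.

```python
import zlib
from dataclasses import dataclass, field

BLOCK_SIZE = 4096  # assumed block size for this toy model


@dataclass
class ObjectRecord:
    """Value stored under an object's key in the toy block database."""
    placement_group: str
    block_addresses: list   # block numbers on the toy device
    length: int             # number of stored (possibly compressed) bytes
    checksum: int           # CRC-32 of the stored bytes
    compression: str        # "zlib" or "none"
    metadata: dict = field(default_factory=dict)


class ToyStore:
    def __init__(self, num_blocks=1024):
        self.num_blocks = num_blocks
        self.device = bytearray(num_blocks * BLOCK_SIZE)  # stands in for the raw device
        self.next_free = 0                                # naive bump allocator
        self.db = {}                                      # stands in for the block database

    def write(self, object_id, pg, data, compress=True, metadata=None):
        stored = zlib.compress(data) if compress else data
        needed = (len(stored) + BLOCK_SIZE - 1) // BLOCK_SIZE
        if self.next_free + needed > self.num_blocks:
            raise IOError("toy device is full")
        addresses = []
        for offset in range(0, len(stored), BLOCK_SIZE):
            chunk = stored[offset:offset + BLOCK_SIZE]
            start = self.next_free * BLOCK_SIZE
            self.device[start:start + len(chunk)] = chunk
            addresses.append(self.next_free)
            self.next_free += 1
        # The record keeps the addresses, placement group, checksum and
        # compression algorithm together under the object's key.
        self.db[object_id] = ObjectRecord(
            placement_group=pg,
            block_addresses=addresses,
            length=len(stored),
            checksum=zlib.crc32(stored),
            compression="zlib" if compress else "none",
            metadata=dict(metadata or {}),
        )

    def read(self, object_id):
        rec = self.db[object_id]
        raw = bytearray()
        for addr in rec.block_addresses:
            start = addr * BLOCK_SIZE
            raw += self.device[start:start + BLOCK_SIZE]
        stored = bytes(raw[:rec.length])
        # Verify the checksum on every read before returning any data.
        if zlib.crc32(stored) != rec.checksum:
            raise IOError(f"checksum mismatch for object {object_id!r}")
        # The recorded algorithm tells the reader how to decompress.
        return zlib.decompress(stored) if rec.compression == "zlib" else stored

    def move_to_pg(self, object_id, new_pg):
        # Reassigning a placement group only updates the record;
        # no object data is moved on the device.
        self.db[object_id].placement_group = new_pg
```

In this sketch, reassigning an object to another placement group touches only the database record, and a corrupted block is caught on the next read rather than waiting for a scrub.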

Write-ahead Log

In BlueStore, the write-ahead log (WAL) ensures Atomicity, similar to the journaling functionality of FileStore. Like FileStore, BlueStore logs all aspects of each transaction. However, the BlueStore WAL can perform this function simultaneously with the write itself, which eliminates the double write penalty of FileStore. Consequently, BlueStore is nearly twice as fast as FileStore on write operations for most workloads. BlueStore can deploy the WAL on the same device that stores the object data, or it may deploy the WAL on another device, typically when the primary block device is a hard disk drive and placing the WAL on an SSD or NVMe device will improve performance.
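How a write-ahead log provides atomicity can be shown with a minimal Python sketch of the general technique. It does not represent BlueStore's WAL format or code: the record layout, the transaction identifiers, and the replay function are hypothetical. The idea is that operations are logged first, a commit record marks the transaction as complete, and recovery applies only transactions whose commit record reached the log.

```python
def replay(log):
    """Rebuild state from a toy write-ahead log.

    Each record is either ("op", txn_id, key, value) or ("commit", txn_id).
    Operations take effect only when their transaction's commit record is
    seen, so a transaction interrupted by a crash leaves no partial changes.
    """
    pending = {}  # txn_id -> list of (key, value) not yet committed
    state = {}
    for record in log:
        if record[0] == "op":
            _, txn_id, key, value = record
            pending.setdefault(txn_id, []).append((key, value))
        elif record[0] == "commit":
            for key, value in pending.pop(record[1], []):
                state[key] = value
    return state


if __name__ == "__main__":
    log = [
        ("op", 1, "obj1.size", 4096),
        ("op", 1, "obj1.pg", "1.a"),
        ("commit", 1),
        ("op", 2, "obj1.size", 8192),  # crash before commit: must be ignored
    ]
    print(replay(log))
```

Transaction 2 in the usage example has no commit record, so replay discards it entirely: either all of a transaction's operations apply or none do.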

Note: Storing a block database or a write-ahead log on a separate block device helps only if the separate device is faster than the primary storage device. For example, SSD and NVMe devices are generally faster than HDDs. Placing the block database and the WAL on separate devices from each other may also have performance benefits due to differences in their workloads.
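As an illustration of the placement choices in the note, the following Python sketch assembles a ceph-volume command line that puts the block database and the WAL on separate devices. The helper name build_ceph_volume_cmd and all device paths are hypothetical, and the function only builds the argument list without running anything. Check the --block.db and --block.wal options against the ceph-volume documentation for your Ceph release before use.

```python
import shlex


def build_ceph_volume_cmd(data, block_db=None, block_wal=None):
    """Build (but do not run) a ceph-volume command for a BlueStore OSD.

    data      -- primary device that stores object data
    block_db  -- optional faster device for the block database
    block_wal -- optional faster device for the write-ahead log
    """
    cmd = ["ceph-volume", "lvm", "create", "--bluestore", "--data", data]
    if block_db:
        cmd += ["--block.db", block_db]
    if block_wal:
        cmd += ["--block.wal", block_wal]
    return cmd


if __name__ == "__main__":
    # Hypothetical layout: an HDD for object data, NVMe partitions for DB and WAL.
    cmd = build_ceph_volume_cmd(
        "/dev/sdb",
        block_db="/dev/nvme0n1p1",
        block_wal="/dev/nvme0n1p2",
    )
    print(shlex.join(cmd))
```

When neither optional device is given, the command names only the data device, which corresponds to keeping the block database and the WAL on the same device as the object data.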