BlueStore
BlueStore is the next-generation storage implementation for Ceph. Now that the
storage device market includes solid state drives (SSDs) and Non-Volatile Memory
Express (NVMe) devices attached over PCI Express, their use in Ceph reveals some
of the limitations of the FileStore storage implementation. While FileStore has
many improvements to facilitate SSD and NVMe storage, other limitations remain:
increasing the number of placement groups is still computationally expensive,
and the double write penalty persists. Whereas FileStore interacts with a file
system on a block device, BlueStore eliminates that layer of indirection and
directly consumes a raw block device for object storage. BlueStore uses the very
lightweight BlueFS file system on a small partition for its key/value databases.
BlueStore eliminates the paradigm of a directory representing a placement group,
a file representing an object, and file XATTRs representing metadata. BlueStore
also eliminates the double write penalty of FileStore, so write operations are
nearly twice as fast with BlueStore under most workloads.
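The double write penalty can be illustrated with a little arithmetic. The following is a minimal sketch, not Ceph code: the function name `device_bytes_written` and the 4 MiB payload are assumptions chosen for illustration. It counts only object data and ignores metadata and log traffic, so it shows the upper bound of the difference rather than a measured result.

```python
def device_bytes_written(payload_bytes: int, journal_first: bool) -> int:
    """Object-data bytes written to storage for one client write.

    Hypothetical model: metadata and log overhead are deliberately ignored.
    """
    # A journal-first store writes the full payload to the journal...
    journal_copy = payload_bytes if journal_first else 0
    # ...and every store writes the payload once to its final location.
    final_copy = payload_bytes
    return journal_copy + final_copy


payload = 4 * 1024 * 1024  # assumed 4 MiB client write

filestore_like = device_bytes_written(payload, journal_first=True)
bluestore_like = device_bytes_written(payload, journal_first=False)

print(f"journal-first store: {filestore_like} bytes")
print(f"single-write store:  {bluestore_like} bytes")
print(f"ratio: {filestore_like / bluestore_like}")
```

In this simplified model the journal-first store writes every payload byte twice, which is the behavior the text calls the double write penalty.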
BlueStore stores data as:
- Object Data
  In BlueStore, Ceph stores objects as blocks directly on a raw block device. The portion of the raw block device that stores object data does NOT contain a filesystem. Omitting the filesystem eliminates a layer of indirection and thereby improves performance. However, much of the BlueStore performance improvement comes from the block database and write-ahead log.
- Block Database
  In BlueStore, the block database handles the object semantics to guarantee Consistency. An object's unique identifier is a key in the block database. The values in the block database consist of a series of block addresses that refer to the stored object data, the object's placement group, and object metadata. The block database may reside on a BlueFS partition on the same raw block device that stores the object data, or it may reside on a separate block device, usually when the primary block device is a hard disk drive and an SSD or NVMe device will improve performance. The block database provides a number of improvements over FileStore; namely, the key/value semantics of BlueStore do not suffer from the limitations of filesystem XATTRs. BlueStore may assign objects to other placement groups quickly within the block database without the overhead of moving files from one directory to another, as is the case in FileStore. BlueStore also introduces new features. The block database can store the checksum of the stored object data and its metadata, allowing full data checksum operations for each read, which is more efficient than periodic scrubbing to detect bit rot. BlueStore can compress an object, and the block database can store the algorithm used to compress it, ensuring that read operations select the appropriate algorithm for decompression.
- Write-ahead Log
  In BlueStore, the write-ahead log ensures Atomicity, similar to the journaling functionality of FileStore. Like FileStore, BlueStore logs all aspects of each transaction. However, the BlueStore write-ahead log (WAL) can perform this function simultaneously, which eliminates the double write penalty of FileStore. Consequently, BlueStore is nearly twice as fast as FileStore on write operations for most workloads. BlueStore can deploy the WAL on the same device that stores object data, or it may deploy the WAL on another device, usually when the primary block device is a hard disk drive and an SSD or NVMe device will improve performance.
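The relationship between object data and the block database can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the BlueStore implementation: the names `ToyBlueStore`, `ObjectRecord`, and `BLOCK_SIZE` are hypothetical, a dictionary stands in for the raw block device, CRC32 stands in for the checksum, and zlib stands in for the compression algorithm. It shows an object ID used as a key, a value holding block addresses, placement group, checksum, and compression algorithm, a checksum verified on every read, and a placement group change that touches only the database entry.

```python
import zlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional

BLOCK_SIZE = 4096  # hypothetical allocation unit for this sketch


@dataclass
class ObjectRecord:
    """Hypothetical value stored under an object's key in the block database."""
    block_addresses: List[int]      # where the stored bytes live on the device
    placement_group: str
    checksum: int                   # CRC32 of the bytes exactly as stored
    compression: Optional[str]      # algorithm name, or None if uncompressed
    metadata: Dict[str, str] = field(default_factory=dict)


class ToyBlueStore:
    def __init__(self) -> None:
        self.device: Dict[int, bytes] = {}           # stand-in for the raw block device
        self.block_db: Dict[str, ObjectRecord] = {}  # object ID -> record
        self._next_block = 0

    def write(self, object_id: str, data: bytes, placement_group: str,
              compress: bool = False) -> None:
        stored = zlib.compress(data) if compress else data
        addresses = []
        # Split the stored bytes into blocks and place them on the "device".
        for offset in range(0, len(stored), BLOCK_SIZE):
            address = self._next_block
            self._next_block += 1
            self.device[address] = stored[offset:offset + BLOCK_SIZE]
            addresses.append(address)
        self.block_db[object_id] = ObjectRecord(
            block_addresses=addresses,
            placement_group=placement_group,
            checksum=zlib.crc32(stored),
            compression="zlib" if compress else None,
        )

    def read(self, object_id: str) -> bytes:
        record = self.block_db[object_id]
        stored = b"".join(self.device[a] for a in record.block_addresses)
        # Verify the checksum on every read, before decompressing.
        if zlib.crc32(stored) != record.checksum:
            raise IOError(f"checksum mismatch for {object_id}")
        # The record says which algorithm (if any) to use for decompression.
        if record.compression == "zlib":
            return zlib.decompress(stored)
        return stored

    def move_to_placement_group(self, object_id: str, new_pg: str) -> None:
        # Only the database entry changes; no object data is moved.
        self.block_db[object_id].placement_group = new_pg


store = ToyBlueStore()
store.write("obj1", b"hello" * 2000, placement_group="1.a", compress=True)
store.write("obj2", b"a" * 10000, placement_group="1.a")
store.move_to_placement_group("obj1", "1.b")
print(store.read("obj1") == b"hello" * 2000)
print(store.block_db["obj1"].placement_group)
```

Because the checksum is stored next to the block addresses, a corrupted block is detected when the object is read rather than at the next scrub, which is the property the Block Database entry describes.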