Ceph BlueStore BlueFS

The BlueStore block database stores metadata as key-value pairs in a RocksDB database. The block database resides on a small BlueFS partition on the storage device. BlueFS is a minimal file system designed to hold the RocksDB files.

BlueFS files

RocksDB produces three types of files:

  • Control files, for example CURRENT, IDENTITY, and MANIFEST-00011.

  • Database (DB) table files, for example 004112.sst.

  • Write-ahead logs (WAL), for example 00038.log.
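As a rough illustration, the three file types above can be told apart by file name. The sketch below is a simplification and not BlueFS code: the function name classify_rocksdb_file is hypothetical, and treating every name without a .sst or .log suffix as a control file is an assumption based on the examples in the list.

```python
def classify_rocksdb_file(name: str) -> str:
    """Return 'wal', 'db', or 'control' for a RocksDB file name.

    Assumption: the suffix alone identifies the type, and any file
    that is neither a .log nor a .sst file counts as a control file.
    """
    if name.endswith(".log"):
        return "wal"      # write-ahead log, for example 00038.log
    if name.endswith(".sst"):
        return "db"       # database table file, for example 004112.sst
    return "control"      # CURRENT, IDENTITY, MANIFEST-00011, ...


if __name__ == "__main__":
    for example in ("CURRENT", "IDENTITY", "MANIFEST-00011",
                    "004112.sst", "00038.log"):
        print(example, "->", classify_rocksdb_file(example))
```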

There is also an internal, hidden file, ino 1, that serves as the BlueFS replay log. It works as the directory structure, file mapping, and operations log.

Fallback hierarchy

With BlueFS, any file can be placed on any device. Parts of a single file can even reside on different devices, that is, on WAL, DB, and SLOW. BlueFS places files in a fixed order: a file goes to secondary storage only when primary storage is exhausted, and to tertiary storage only when secondary storage is exhausted.

The order for each file type is as follows, with devices listed from first choice to last.

Write ahead logs: WAL, DB, SLOW

Replay log ino 1: DB, SLOW

Control and DB files: DB, SLOW

  • Control and DB file order when running out of space: SLOW

IMPORTANT: There is an exception to the control and DB file order. When RocksDB detects that space for DB files is running out, it signals directly that the file should be put on the SLOW device.
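The fallback order and its exception can be sketched as a small lookup. This is an illustrative model only, not the BlueFS allocator: the names FALLBACK_ORDER, pick_device, and db_space_low are hypothetical, free space is passed in as a plain dictionary, and the sketch places a whole allocation on one device even though BlueFS can split a file across devices.

```python
# Fallback order per file type, as listed above (first entry is preferred).
FALLBACK_ORDER = {
    "wal": ["WAL", "DB", "SLOW"],
    "replay_log": ["DB", "SLOW"],
    "control": ["DB", "SLOW"],
    "db": ["DB", "SLOW"],
}


def pick_device(file_type, free_bytes, needed_bytes, db_space_low=False):
    """Return the first device in the fallback order with enough free space.

    free_bytes maps a device name to its free bytes; a device missing
    from the mapping is treated as having no space.
    db_space_low models the exception: when set, control and DB files
    skip the DB device and go straight to SLOW.
    """
    order = FALLBACK_ORDER[file_type]
    if db_space_low and file_type in ("control", "db"):
        order = ["SLOW"]
    for device in order:
        if free_bytes.get(device, 0) >= needed_bytes:
            return device
    raise RuntimeError(f"no device has {needed_bytes} free bytes for {file_type}")


if __name__ == "__main__":
    free = {"WAL": 0, "DB": 100, "SLOW": 1000}
    print(pick_device("wal", free, 50))                     # WAL is full
    print(pick_device("db", free, 50, db_space_low=True))   # exception path
```

In this model, a WAL file falls through to DB once the WAL device is full, while a DB file flagged with db_space_low bypasses DB entirely.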