Input/output operations
Ceph clients retrieve a cluster map from a Ceph monitor, bind to a pool, and perform input/output (I/O) on objects within placement groups in the pool. The pool’s CRUSH ruleset and its number of placement groups are the main factors that determine how Ceph will place the data. With the latest version of the cluster map, the client knows about all of the monitors and OSDs in the cluster and their current state. However, the client doesn’t know anything about object locations.
The only inputs required from the client are the object ID and the pool name: Ceph stores data in named pools. When a client stores a named object in a pool, Ceph takes the object name, a hash of that name, the number of PGs in the pool, and the pool name as inputs; CRUSH (Controlled Replication Under Scalable Hashing) then calculates the ID of the placement group and the primary OSD for that placement group.
Ceph clients use the following steps to compute PG IDs (a short code sketch of the calculation follows the list).
- The client inputs the pool name and the object ID. For example, pool = liverpool and object-id = john.
- CRUSH takes the object ID and hashes it.
- CRUSH calculates the hash modulo the number of PGs to get a PG ID. For example, 58.
- CRUSH calculates the primary OSD corresponding to the PG ID.
- The client gets the pool ID given the pool name. For example, the pool liverpool is pool number 4.
- The client prepends the pool ID to the PG ID. For example, 4.58.
- The client performs an object operation such as write, read, or delete by communicating directly with the primary OSD in the acting set.
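The calculation is small enough to sketch in code. The following Python snippet only illustrates the steps above and is not Ceph's implementation: real clients use librados, which applies the rjenkins hash and a stable variant of the modulo step, so the PG numbers produced here will not match a live cluster. The pool number 4 and the object name john come from the example above; the PG count of 128 is an assumed value.

    # Illustrative sketch only: a generic CRC32 hash and a plain modulo stand in
    # for Ceph's rjenkins hash and stable-mod, so the numbers will not match a
    # real cluster; only the shape of the calculation is shown.
    import zlib

    def compute_pg_id(pool_id, pg_num, object_id):
        """Return a PG ID of the form <pool-id>.<pg-number>."""
        obj_hash = zlib.crc32(object_id.encode("utf-8"))  # hash the object ID
        pg = obj_hash % pg_num                            # hash modulo the PG count
        return "{}.{}".format(pool_id, pg)                # prepend the pool ID

    # Pool 'liverpool' is pool number 4; 128 PGs is an assumed pool size.
    print(compute_pg_id(pool_id=4, pg_num=128, object_id="john"))

On a running cluster, the real mapping for an object can be checked with the ceph osd map command (for example, ceph osd map liverpool john), which prints the PG ID along with the up set and acting set that CRUSH computed.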
The topology and state of the Ceph storage cluster are relatively stable during a session.
Empowering a Ceph client via librados to compute object locations is much faster
than requiring the client to make a query to the storage cluster over a chatty session for each
read/write operation. The CRUSH algorithm allows a client to compute where objects should be
stored, and enables the client to contact the primary OSD in the acting set directly to store
or retrieve the object data. Because a cluster at the exabyte scale has thousands of OSDs and client I/O is spread directly across them rather than funneled through a central gateway, network oversubscription between a client and any single Ceph OSD is not a significant problem. If the cluster state
changes, the client can simply request an updated cluster map from a Ceph monitor.
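As a concrete illustration of such a librados client, the short Python sketch below uses the python-rados bindings to write, read, and delete an object. The pool name liverpool and the object name john come from the earlier example, and the configuration file path assumes a default installation; librados itself computes the placement group and contacts the primary OSD directly for each operation.

    # Sketch of direct object I/O through librados (python-rados bindings).
    # Pool and object names follow the example above; the conffile path is an
    # assumed default.
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()                                 # fetch the cluster map from a monitor
    try:
        ioctx = cluster.open_ioctx("liverpool")       # bind to the pool
        try:
            ioctx.write_full("john", b"hello ceph")   # write goes to the primary OSD
            print(ioctx.read("john"))                 # read from the acting set
            ioctx.remove_object("john")               # delete the object
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

Note that no lookup service is contacted in this sequence: apart from fetching the cluster map from a monitor at connect time, the client talks only to the OSDs serving the object's placement group.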