Erasure-coding
Ceph can load one of many erasure code algorithms. The earliest and most commonly used is
the Reed-Solomon algorithm. An erasure code is a forward error correction (FEC)
code. An FEC code transforms a message of K chunks into a longer message of N
chunks, called a code word, such that Ceph can recover the original message
from a subset of the N chunks.
More specifically, N = K+M where the variable K is the original
number of data chunks. The variable M stands for the extra or redundant chunks that
the erasure code algorithm adds to provide protection from failures. The variable N
is the total number of chunks that are created after the erasure coding process. The value of
M is N-K which means that the algorithm computes
N-K redundant chunks from K original data chunks. This approach
ensures that Ceph can access all the original data: the system is resilient to any
N-K failures. For instance, in a configuration with K = 10 and N = 16,
also written as erasure coding 10/16, the erasure code algorithm adds six extra
chunks to the 10 base data chunks, because M = N-K or
16-10 = 6. Ceph spreads the 16 chunks (N) across
16 OSDs. The original object can be reconstructed from any 10 verified chunks
even if 6 OSDs fail, which ensures that the IBM Storage Ceph cluster
does not lose data and provides a high level of fault tolerance.
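The K/M/N arithmetic above can be sketched numerically. The helper below is illustrative only, not a Ceph API; it summarizes a profile such as the 10/16 example:

```python
def ec_profile(k: int, m: int) -> dict:
    """Summarize an erasure-code profile: K data chunks plus M coding chunks."""
    n = k + m                       # total chunks written per object
    return {
        "n": n,                     # chunks spread across N OSDs
        "tolerated_failures": m,    # any N - K = M chunks can be lost
        "storage_overhead": n / k,  # raw bytes stored per byte of user data
    }

# The 10/16 example from the text: 6 coding chunks on top of 10 data chunks.
profile = ec_profile(10, 6)
# → {'n': 16, 'tolerated_failures': 6, 'storage_overhead': 1.6}
```

Note the storage overhead of 1.6x, compared with 3.0x for a typical three-way replicated pool that tolerates only two failures.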
Like replicated pools, in an erasure-coded pool, the primary OSD in the up set receives all write
operations. In replicated pools, Ceph makes a deep copy of each object in the placement group on the
secondary OSDs in the set. For erasure coding, the process is different: an erasure-coded pool
stores each object as K+M chunks, divided into K data chunks
and M coding chunks. The pool is configured to have a size of K+M
so that Ceph stores each chunk on an OSD in the acting set. Ceph stores the rank of the chunk as an
attribute of the object. The primary OSD is responsible for encoding the payload into
K+M chunks and sending them to the other OSDs. The primary OSD is also responsible
for maintaining an authoritative version of the placement group logs.
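The bookkeeping described above can be sketched as follows. The helper and its names are hypothetical, not Ceph internals; each chunk's rank doubles as its shard_id and maps to the OSD at the same position in the acting set:

```python
def place_chunks(chunks, acting_set):
    """Map each chunk rank (shard_id) to the OSD at the same position
    in the acting set. The pool size must equal K + M."""
    if len(chunks) != len(acting_set):
        raise ValueError("acting set size must equal K + M")
    return {shard_id: (osd, chunk)
            for shard_id, (osd, chunk) in enumerate(zip(acting_set, chunks))}

# Six chunks of a K=4, M=2 object onto a six-OSD acting set
# (OSD numbers here are arbitrary for illustration).
placement = place_chunks(["ABC", "DEF", "GHI", "JKL", "YXY", "QGC"],
                         [5, 1, 3, 2, 4, 0])
# placement[0] → (5, 'ABC'): shard 0 lives on the first OSD of the acting set.
```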
For example, in a typical configuration, a system administrator creates an erasure-coded pool to
use six OSDs and sustain the loss of two of them. That is, K+M = 6 and
M = 2, so K = 4.
When Ceph writes the object NYAN containing ABCDEFGHIJKL to the pool, the
erasure encoding algorithm splits the content into four data chunks by simply dividing the content
into four parts: ABC, DEF, GHI, and
JKL. The algorithm will pad the content if the content length is not a multiple of
K. The function also creates two coding chunks: the fifth with YXY
and the sixth with QGC. Ceph stores each chunk on an OSD in the acting set with a
shard_id corresponding to the acting set position, where it stores the chunks in objects that have
the same name, NYAN, but reside on different OSDs. For example, Chunk 1 contains
ABC and Ceph stores it on OSD5 while chunk 5 contains YXY
and Ceph stores it on OSD4.
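The splitting and padding step can be sketched as follows. Only the K data chunks are produced here; the coding chunks (such as YXY and QGC in the example) come from the erasure-code algorithm itself:

```python
def split_into_data_chunks(payload: bytes, k: int) -> list[bytes]:
    """Divide a payload into k equal data chunks, zero-padding the tail
    if the payload length is not a multiple of k."""
    chunk_len = -(-len(payload) // k)               # ceiling division
    padded = payload.ljust(k * chunk_len, b"\x00")  # pad, as the text describes
    return [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]

chunks = split_into_data_chunks(b"ABCDEFGHIJKL", 4)
# → [b'ABC', b'DEF', b'GHI', b'JKL'], as in the NYAN example
```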
In a recovery scenario, the client attempts to read the object NYAN from the erasure-coded
pool by reading chunks 1 through 6. The OSD informs the decoding algorithm that chunks 2 and 6 are
missing. For example, the primary OSD might not read chunk 6 because OSD6 is out, and might not
read chunk 2 because OSD2 is the slowest and its chunk was not taken into account. As soon as
the algorithm has four chunks, it reads them: chunk 1 containing
ABC, chunk 3 containing GHI, chunk 4 containing
JKL, and chunk 5 containing YXY. It then rebuilds the original
content of the object, ABCDEFGHIJKL, and the original content of chunk 6, which
contained QGC.
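The rebuild principle can be shown with a toy code. Ceph's jerasure plug-in uses Reed-Solomon arithmetic to tolerate M = 2 losses, which is too involved to inline here; a single XOR parity chunk (M = 1) demonstrates the same idea of reconstructing a missing chunk from the survivors:

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))

# Toy code word: K = 4 data chunks plus one XOR parity chunk (M = 1).
data_chunks = [b"ABC", b"DEF", b"GHI", b"JKL"]
parity = reduce(xor, data_chunks)
code_word = data_chunks + [parity]

# Lose any single chunk, for example the chunk at index 1 (b"DEF") ...
survivors = [c for i, c in enumerate(code_word) if i != 1]

# ... and rebuild it from the survivors, just as the decoder rebuilds
# missing chunks once any K of the N chunks are available.
rebuilt = reduce(xor, survivors)
# → b'DEF'
```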
The locally repairable code (lrc) plug-in in the erasure code profile creates extra
chunks and requires fewer OSDs to recover from a failure. For example, in an lrc profile
configuration K=4 M=2 L=3, the algorithm creates six chunks (K+M),
just as the jerasure plug-in would, but the locality value (L=3)
requires that the algorithm create two more chunks locally. The algorithm creates
(K+M)/L additional local chunks. If the OSD containing chunk 0 fails, this chunk can be
recovered by using chunks 1, 2, and the first local chunk. In this case, the algorithm requires
only 3 chunks for recovery instead of 5.
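The locality arithmetic can be sketched as follows. This helper is illustrative, not the lrc plug-in's API: the (K+M)/L groups of L chunks each receive one local parity chunk, so a single lost chunk can be rebuilt from its group alone:

```python
def lrc_groups(k: int, m: int, l: int) -> list[list[int]]:
    """Partition the k + m global chunks into locality groups of size l;
    each group gets its own local parity chunk, (k + m) // l groups in all."""
    n = k + m
    if n % l:
        raise ValueError("k + m must be a multiple of the locality l")
    return [list(range(i, i + l)) for i in range(0, n, l)]

groups = lrc_groups(4, 2, 3)
# → [[0, 1, 2], [3, 4, 5]]: two local parity chunks are added, one per group.
# If chunk 0 is lost, only chunks 1 and 2 plus the first local parity need
# to be read — 3 reads instead of 5.
```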
Reference

- For more information about CRUSH, the erasure-coding profiles, and plug-ins, see Storage strategies.
- For more details on Object Map, see Ceph client object map.