Using the Ceph Manager balancer module

The balancer is a module for Ceph Manager (ceph-mgr) that optimizes the placement of placement groups (PGs) across OSDs in order to achieve a balanced distribution, either automatically or in a supervised fashion.

Note: The balancer module cannot be disabled as a manager module, but balancing can be turned off, for example while you customize the configuration.
The following are the supported balancer modes:
crush-compat
The CRUSH compat mode uses the compat weight-set feature to manage an alternative set of weights for devices in the CRUSH hierarchy.

The normal weights should remain set to the size of the device to reflect the target amount of data that you want to store on the device. The balancer then optimizes the weight-set values, adjusting them up or down in small increments to achieve a distribution that matches the target distribution as closely as possible. Because PG placement is a pseudorandom process, there is a natural amount of variation in the placement; by optimizing the weights, the balancer counteracts that natural variation.

This mode is fully backwards compatible with older clients. When an OSDMap and CRUSH map are shared with older clients, the balancer presents the optimized weights as the real weights.

The primary restriction of this mode is that the balancer cannot handle multiple CRUSH hierarchies with different placement rules if the subtrees of the hierarchy share any OSDs. Because this configuration makes managing space utilization on the shared OSDs difficult, it is generally not recommended. As such, this restriction is normally not an issue.

upmap
Use upmap to optimize PG placement and achieve uniform distribution across OSDs. In most cases, this distribution is "perfect", with an equal number of PGs on each OSD (+/-1 PG, because the PGs might not divide evenly).
  1. Run the following command to check the PG distribution by device class:
    ceph osd df deviceclass
    If the VAR values for hdd or ssd are outside the range [0.95 – 1.05], the balancer may be disabled or misconfigured.
    If any OSD shows a REWEIGHT value other than 1.000, reset the reweight value of that OSD to 1.000:
    ceph osd reweight osd.ID 1.000
  2. Improve balancer effectiveness by ensuring that pools have enough PGs. If PGS per OSD is less than 100, add PGs to the pools served by that device class.
    When PG autoscaler is enabled, apply the following settings:
    ceph config set global mon_max_pg_per_osd 500
    ceph config set global mon_target_pg_per_osd 300
    ceph config set mgr mgr/balancer/upmap_max_deviation 1
    Note: Setting mgr/balancer/upmap_max_deviation to 1 is useful when OSD sizes vary significantly.
  3. Overlapping CRUSH roots can prevent proper balancing.
    Verify the pool rules by using the following command:
    ceph osd pool ls detail
  4. After verifying the pool rules, inspect the CRUSH rule.
    ceph osd crush rule dump | jq 'map(select(.rule_id == RULE_ID) | .steps)'

    Each pool should use a rule that specifies a device class, for example, default~ssd. If any pool uses a rule without a device class, contact IBM Support for guidance.
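
The threshold checks from steps 1 and 2 can be scripted. The following is a minimal sketch, not a supported tool. It assumes that the per-OSD values have already been extracted into lines of the form "NAME VAR REWEIGHT PGS", which is a simplified, hypothetical input format and not the native layout of the ceph osd df output. The function name check_osd_balance is also hypothetical.

```shell
# Hypothetical helper: reads lines of "NAME VAR REWEIGHT PGS" on stdin
# (an assumed, simplified format; not the native `ceph osd df` layout)
# and prints one warning for each threshold from the steps above that is missed.
check_osd_balance() {
  awk '
    NF < 4 { next }                          # skip blank or malformed lines
    $2 < 0.95 || $2 > 1.05 { print $1 ": VAR " $2 " outside 0.95-1.05" }
    $3 != 1                { print $1 ": REWEIGHT " $3 " is not 1.000" }
    $4 < 100               { print $1 ": only " $4 " PGs (fewer than 100)" }
  '
}

# Example with made-up values: osd.0 passes, osd.1 trips all three checks.
printf 'osd.0 1.00 1.000 120\nosd.1 1.12 0.900 80\n' | check_osd_balance
```

An OSD that produces no output meets all three thresholds. Any warning is only a prompt to run the commands from the steps above by hand; the sketch does not change the cluster.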

read
The OSDMap can store explicit mappings for individual primary OSDs as exceptions to the normal CRUSH placement calculation. These pg-upmap-primary entries provide fine-grained control over primary PG mappings. This mode optimizes the placement of individual primary PGs to achieve balanced reads, that is, an even spread of primary PGs, across the cluster. In read mode, upmap behavior is not exercised, so this mode is best for use cases in which only read balancing is desired.
Important: To allow use of this feature, you must tell the cluster that it only needs to support Reef or later clients with the following command:
ceph osd set-require-min-compat-client reef
This command fails if any pre-Reef clients or daemons are connected to the monitors. To work around this issue, use the --yes-i-really-mean-it flag:
ceph osd set-require-min-compat-client reef --yes-i-really-mean-it
You can check which client versions are in use with the ceph features command. For example:
[ceph: root@host01 /]# ceph features
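
Before forcing the setting with --yes-i-really-mean-it, it can help to compare the release names reported for connected clients against Reef. The following is a minimal sketch that assumes the release names have already been read from the ceph features output; the extraction step is left out because the output layout is not shown here. The helper name is_reef_or_later and the CEPH_RELEASES variable are hypothetical and are not part of the ceph CLI.

```shell
# Ceph release names in chronological order. The helper below is a
# hypothetical illustration, not part of the ceph CLI.
CEPH_RELEASES="jewel kraken luminous mimic nautilus octopus pacific quincy reef squid"

# Returns 0 if the given release name is Reef or later, 1 if it is older,
# and 2 if the name does not appear in the list above.
is_reef_or_later() {
  local name="$1" idx=0 target=-1 reef=-1 r
  for r in $CEPH_RELEASES; do
    [ "$r" = "reef" ] && reef=$idx
    [ "$r" = "$name" ] && target=$idx
    idx=$((idx + 1))
  done
  [ "$target" -ge 0 ] || return 2
  [ "$target" -ge "$reef" ]
}

# Example: flag a release name that would block the Reef requirement.
if ! is_reef_or_later quincy; then
  echo "quincy is older than reef"
fi
```

Releases newer than the last name in the list are reported as unknown (return code 2), so the list needs to be extended as new releases appear.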
upmap-read
This balancer mode combines the optimization benefits of both upmap and read mode. As in read mode, upmap-read makes use of pg-upmap-primary.
Use upmap-read to get both the balanced PG distribution of upmap mode and the balanced reads of read mode.
Important: To allow use of this feature, you must tell the cluster that it only needs to support Reef or later clients with the following command:
ceph osd set-require-min-compat-client reef
This command fails if any pre-Reef clients or daemons are connected to the monitors. To work around this issue, use the --yes-i-really-mean-it flag:
ceph osd set-require-min-compat-client reef --yes-i-really-mean-it
You can check which client versions are in use with the ceph features command. For example:
[ceph: root@host01 /]# ceph features
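
Selecting one of the four modes described above can be sketched as a small wrapper. This is an illustration under stated assumptions, not a supported procedure. The function name select_balancer_mode and the CEPH override variable are hypothetical. The ceph balancer mode, ceph balancer on, and ceph balancer status commands come from the upstream Ceph CLI and are not shown elsewhere in this section. Setting CEPH=echo prints the commands instead of running them, which allows a dry run on a host without a cluster.

```shell
# Hypothetical wrapper around the balancer commands. CEPH defaults to the real
# CLI; set CEPH=echo to print the commands instead of running them (dry run).
CEPH="${CEPH:-ceph}"

select_balancer_mode() {
  local mode="$1"
  case "$mode" in
    crush-compat|upmap)
      ;;
    read|upmap-read)
      # These modes rely on pg-upmap-primary, so the cluster must first be
      # told that it only needs to support Reef or later clients.
      $CEPH osd set-require-min-compat-client reef || return 1
      ;;
    *)
      echo "unsupported balancer mode: $mode" >&2
      return 2
      ;;
  esac
  # Set the mode, turn automatic balancing on, then show the result.
  $CEPH balancer mode "$mode" && $CEPH balancer on && $CEPH balancer status
}

# Dry run: print the commands that would be issued for upmap-read.
CEPH=echo
select_balancer_mode upmap-read
```

The sketch deliberately does not pass --yes-i-really-mean-it. If the Reef requirement is rejected, the function stops so that the connected clients can be reviewed first.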