Balancing a Ceph cluster using the read balancer

Balance primary placement groups (PGs) in a cluster. The balancer optimizes the allocation of placement groups across OSDs to achieve a balanced distribution. It can operate automatically (online), offline, or in a supervised fashion.

Online optimization

Balance primary PGs in a cluster by using the balancer module. When you use the balancer module, the steps of the offline optimization are completed automatically.

Before you begin

Before you begin, make sure that you have a running IBM Storage Ceph cluster.

Procedure

Enable the balancer module.

ceph mgr module enable balancer

For example,

[ceph: root@host01 /]# ceph mgr module enable balancer

Turn on the balancer module.

ceph balancer on

For example,

[ceph: root@host01 /]# ceph balancer on

Update the min_compat_client setting to ensure compatibility. Read balancing requires that no clients older than IBM Storage Ceph 7 connect to the cluster. Older clients are not supported and will fail to connect if read balancing is enabled.
To use online optimization, the cluster must be set to require Reef or later clients.
```
ceph osd set-require-min-compat-client reef
```
Note: This command fails if any pre-Reef clients or daemons are connected to the monitors. To work around this issue, use the --yes-i-really-mean-it flag:
```
ceph osd set-require-min-compat-client reef --yes-i-really-mean-it
```
You can check which client versions are in use with the ceph features command. For example,
```
[ceph: root@host01 /]# ceph features
```
Change the mode to upmap-read or read for read balancing.
The default mode is upmap. Enable read balancing by running one of the following commands:
- ```
ceph balancer mode upmap-read
```
- ```
ceph balancer mode read
```

Check the current status of the balancer.

ceph balancer status

For example,

[ceph: root@host01 /]# ceph balancer status
{
"active": true,
"last_optimize_duration": "0:00:00.013640",
"last_optimize_started": "Mon Nov 22 14:47:57 2024",
"mode": "upmap-read",
"no_optimization_needed": true,
"optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
"plans": []
}
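
The online steps above can be collected into one shell function. The following is a minimal sketch under stated assumptions: the function names `enable_read_balancing` and `read_balancing_active` are hypothetical, the commands are only the ones shown in this procedure, and the status check uses plain text matching against the output format in the example rather than a JSON parser.

```shell
# enable_read_balancing MODE
# Runs the online steps in order and stops at the first failure.
# MODE must be upmap-read or read (defaults to upmap-read).
enable_read_balancing() {
    mode=${1:-upmap-read}
    case "$mode" in
        upmap-read|read) ;;
        *) echo "mode must be upmap-read or read" >&2; return 2 ;;
    esac
    ceph mgr module enable balancer &&
    ceph balancer on &&
    ceph osd set-require-min-compat-client reef &&
    ceph balancer mode "$mode"
}

# read_balancing_active
# Reads `ceph balancer status` output on stdin and succeeds only if the
# balancer is active and the mode is upmap-read or read.
# Assumes the key/value layout shown in the example output.
read_balancing_active() {
    status=$(cat)
    printf '%s\n' "$status" | grep -Eq '"active": *true' &&
    printf '%s\n' "$status" | grep -Eq '"mode": *"(upmap-read|read)"'
}

# Usage (on a host with the ceph CLI available):
#   enable_read_balancing upmap-read
#   ceph balancer status | read_balancing_active && echo "read balancing is on"
```

The function deliberately omits the --yes-i-really-mean-it flag, so it stops if older clients are still connected.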

Offline optimization (Technology Preview)

Before you begin

Before you begin, make sure that you have the following prerequisites in place:

A running IBM Storage Ceph cluster.
Before running the offline read balancer, run the capacity balancer to balance PG placement across OSDs. This will ensure optimal results. Execute the following steps:
1. Get the latest copy of your osdmap.
```
[ceph: root@host01 /]# ceph osd getmap -o map
```
2. Run the upmap balancer.
```
[ceph: root@host01 /]# osdmaptool map --upmap balance.sh
```
3. The file balance.sh contains the proposed solution.
  
  The commands in this procedure are normal Ceph CLI commands that are run to apply the changes to the cluster.
  
  Run the following command if there are any recommendations in the balance.sh file.
```
[ceph: root@host01 /]# source balance.sh
```
For more information, see Balancing IBM Storage Ceph cluster using the capacity balancer.

About this task

Important: Technology Preview features are not supported with IBM production service level agreements (SLAs), might not be functionally complete, and IBM does not recommend using them for production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

If you have unbalanced primary OSDs, you can update them with an offline optimizer that is built into the osdmaptool. Run the capacity balancer before running the read balancer to ensure optimal results.

Procedure

Check the read_balance_score, available for each pool.

ceph osd pool ls detail

For example,

[ceph: root@host01 /]# ceph osd pool ls detail

Note:

You can use JSON formatting to get more details.

If the read_balance_score is considerably above 1, your pool has unbalanced primary OSDs.

For a homogeneous cluster, the optimal score is Ceil(number of PGs / number of OSDs) / (number of PGs / number of OSDs).

For example, if you have a pool with 32 PGs and 10 OSDs, then (number of PGs / number of OSDs) = 32/10 = 3.2. If all the devices are identical, the optimal score is the ceiling of 3.2 divided by 3.2, that is, 4/3.2 = 1.25. If another pool in the same system has 64 PGs, its optimal score is 7/6.4 = 1.09375.
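
The arithmetic above can be reproduced with a small shell helper. This is a sketch only: `optimal_read_score` is a hypothetical name, and it assumes a homogeneous cluster, as the formula does.

```shell
# optimal_read_score PG_COUNT OSD_COUNT
# Prints ceil(PGs / OSDs) / (PGs / OSDs), the best achievable
# read_balance_score when all devices are identical.
optimal_read_score() {
    awk -v pgs="$1" -v osds="$2" 'BEGIN {
        if (pgs <= 0 || osds <= 0) exit 1
        avg = pgs / osds                                   # average primaries per OSD
        most = (avg == int(avg)) ? avg : int(avg) + 1      # ceiling of the average
        printf "%.5f\n", most / avg
    }'
}

optimal_read_score 32 10   # prints 1.25000
optimal_read_score 64 10   # prints 1.09375
```

Comparing a pool's reported read_balance_score with this value shows how far the pool is from the best achievable distribution, rather than from 1.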

The following is an example output:

$ ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 17 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 3.00
pool 2 'cephfs.a.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 55 lfor 0/0/25 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs read_balance_score 1.50
pool 3 'cephfs.a.data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 27 lfor 0/0/25 flags hashpspool,bulk stripe_width 0 application cephfs read_balance_score 1.31
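
Because each pool line is long, it can help to print only the pool name and its score. The following sketch assumes the plain-text layout shown in the example output (the pool name is the third field, and the score follows the read_balance_score keyword); `pool_read_scores` is a hypothetical helper name.

```shell
# pool_read_scores
# Reads `ceph osd pool ls detail` text output on stdin and prints
# "POOL_NAME SCORE" for every pool line that carries a read_balance_score.
pool_read_scores() {
    awk -v q="'" '$1 == "pool" {
        name = $3
        gsub(q, "", name)                    # strip the quotes around the pool name
        for (i = 4; i < NF; i++)
            if ($i == "read_balance_score") print name, $(i + 1)
    }'
}

# Usage (on a host with the ceph CLI available):
#   ceph osd pool ls detail | pool_read_scores
```

For the example output, this would list .mgr with 3.00, cephfs.a.meta with 1.50, and cephfs.a.data with 1.31.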

Get the latest copy of your osdmap.

ceph osd getmap -o om

For example,

[ceph: root@host01 /]# ceph osd getmap -o om
got osdmap epoch 56

Run the optimizer.

The balance.sh file contains the proposed solution.

osdmaptool om --read balance.sh --read-pool POOL_NAME [--vstart]

For example,

[ceph: root@host01 /]# osdmaptool om --read balance.sh --read-pool cephfs.a.meta
./bin/osdmaptool: osdmap file 'om'
writing upmap command output to: balance.sh
---------- BEFORE ------------ 
 osd.0 | primary affinity: 1 | number of prims: 4
 osd.1 | primary affinity: 1 | number of prims: 8
 osd.2 | primary affinity: 1 | number of prims: 4
 
read_balance_score of 'cephfs.a.meta': 1.5

---------- AFTER ------------ 
 osd.0 | primary affinity: 1 | number of prims: 5
 osd.1 | primary affinity: 1 | number of prims: 6
 osd.2 | primary affinity: 1 | number of prims: 5
 
read_balance_score of 'cephfs.a.meta': 1.13


num changes: 2
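
In the example output, every OSD has a primary affinity of 1, and the reported scores match the largest number of primaries on any OSD divided by the average number of primaries per OSD (8 / 5.33 = 1.5 before, 6 / 5.33 = 1.125 after, shown as 1.13). The sketch below recomputes that ratio from one section of the output. It is an inference from this example, not a description of how Ceph computes the score in general, and `prims_score` is a hypothetical helper name.

```shell
# prims_score
# Reads lines of the form
#   osd.N | primary affinity: A | number of prims: P
# on stdin and prints max(P) / average(P) to three decimals.
# Assumes equal primary affinity on all OSDs. Feed it the BEFORE or the
# AFTER section separately, not both at once.
prims_score() {
    awk -F'number of prims: ' 'NF == 2 {
        n++
        sum += $2
        if ($2 + 0 > max) max = $2 + 0
    }
    END { if (n > 0 && sum > 0) printf "%.3f\n", max * n / sum }'
}

# Usage with the AFTER section of the example:
#   printf '%s\n' \
#     ' osd.0 | primary affinity: 1 | number of prims: 5' \
#     ' osd.1 | primary affinity: 1 | number of prims: 6' \
#     ' osd.2 | primary affinity: 1 | number of prims: 5' | prims_score
```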

Apply the changes to the cluster. The balance.sh file contains the proposed solution.
The commands in this step are normal Ceph CLI commands that are run to apply the changes to the cluster.
```
source balance.sh
```
For example,
```
[ceph: root@host01 /]# source balance.sh
```
The following shows an example of offline balancer output and execution.
```
$ cat balance.sh
ceph osd pg-upmap-primary 2.3 0
ceph osd pg-upmap-primary 2.4 2

$ source balance.sh
change primary for pg 2.3 to osd.0
change primary for pg 2.4 to osd.2
```
Note: If you are running the ceph osd pg-upmap-primary command for the first time, you might get the following error: Error EPERM: min_compat_client luminous < reef, which is required for pg-upmap-primary. In this case, run the ceph osd set-require-min-compat-client reef command to adjust your cluster's min-compat-client.
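
Because sourcing balance.sh runs every line in it, you might want to inspect the file first. The following is a cautious sketch, not part of the product: `check_balance_script` is a hypothetical helper that accepts only lines of the form shown in the example (`ceph osd pg-upmap-primary PGID OSD_ID`) and reports anything else for manual review.

```shell
# check_balance_script FILE
# Succeeds and prints the number of changes if every non-blank line in FILE
# looks like "ceph osd pg-upmap-primary <pool>.<pg> <osd>".
# Otherwise prints the unexpected lines to stderr and returns 1.
check_balance_script() {
    unexpected=$(grep -Ev '^[[:space:]]*$|^ceph osd pg-upmap-primary [0-9]+\.[0-9a-f]+ [0-9]+$' "$1" || true)
    if [ -n "$unexpected" ]; then
        printf 'unexpected lines in %s:\n%s\n' "$1" "$unexpected" >&2
        return 1
    fi
    changes=$(grep -c '^ceph osd pg-upmap-primary ' "$1" || true)
    printf '%s: %s change(s)\n' "$1" "$changes"
}

# Usage:
#   check_balance_script balance.sh && source balance.sh
```

A line flagged here is not necessarily wrong. The check only recognizes the command form that appears in the example above.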

What to do next

Consider rechecking the scores and rerunning the balancer if the number of PGs changes or if any OSDs are added to or removed from the cluster, because these operations can considerably change the effect of the read balancer on a pool.

Supervised optimization

The read balancer can also be used with supervised optimization. If you use supervised optimization, follow the information detailed in Supervised optimization, and set the mode to either upmap-read or read.