Known issues

This section documents known issues found in this release of IBM Storage Ceph.

cephadm utility

Applying RDMA-enabled NFS spec to existing service causes error state

Applying an RDMA-enabled NFS specification to an existing NFS service or cluster causes the service to enter an error state. As a result, the NFS service or cluster becomes unavailable and cannot function correctly.

As a workaround, redeploy the NFS service.
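
The redeploy can be sketched with the orchestrator CLI as follows. The helper name redeploy_nfs_service and the service name nfs.mynfs are hypothetical; run ceph orch ls nfs to find the actual service name in your cluster.

```shell
# Hypothetical helper: redeploy an NFS service that is managed by cephadm.
# $1 is the orchestrator service name, for example nfs.mynfs (assumed name).
redeploy_nfs_service() {
    service_name="$1"
    if [ -z "$service_name" ]; then
        echo "usage: redeploy_nfs_service SERVICE_NAME" >&2
        return 1
    fi
    # Redeploys every daemon that belongs to the service.
    ceph orch redeploy "$service_name"
}
```

For example, call redeploy_nfs_service nfs.mynfs and then check the daemon state with ceph orch ps.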

(IBMCEPH-14150)

Promtail image continues to appear in cephadm list-images output

The Promtail container image might still appear in cephadm list-images output after upgrading to Alloy.

Currently, during the transition from Promtail to Alloy, cephadm continues to register the Promtail container image to maintain backward compatibility. As a result, the Promtail image remains visible even though Alloy is the default logging solution.

No action is required. You can ignore the Promtail image entry during the transition phase.

(IBMCEPH-13162)

Grafana certificate does not migrate during upgrade

When you upgrade from IBM Storage Ceph 8.1.0 to 9.9.0, the existing user-signed Grafana certificate is not migrated. Instead, Grafana switches to a cephadm-signed certificate. As a result, duplicate certificate entries may appear, and certificate-related health warnings can persist. Manual reconfiguration is required if you want to use custom TLS certificates.

Note: Data services remain unaffected.

To work without custom TLS certificates, you can continue using the cephadm-signed certificate.

As a workaround to use custom TLS certificates, complete the following steps:

  1. Change the Grafana specification to use certificate_source: reference.
  2. Use certmgr to upload a valid user-signed certificate and key for each host.
  3. Run the ceph orch reconfig grafana command.
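
Step 1 of the list above can be sketched as follows. The helper names and the specification file name are hypothetical, and the exact position of certificate_source inside the specification is not shown because it is not stated here. The certmgr upload in step 2 is intentionally left out; complete it before you run ceph orch reconfig grafana.

```shell
# Hypothetical helper: export the current Grafana service specification
# to a file ($1) so that it can be edited.
export_grafana_spec() {
    ceph orch ls grafana --export > "$1"
}

# Hypothetical helper: apply the edited specification ($1), but only if it
# actually sets certificate_source: reference.
apply_grafana_spec() {
    spec_file="$1"
    if ! grep -q 'certificate_source: *reference' "$spec_file"; then
        echo "certificate_source: reference not found in $spec_file" >&2
        return 1
    fi
    ceph orch apply -i "$spec_file"
}
```

A typical sequence is export_grafana_spec grafana.yaml, an edit of the file, and then apply_grafana_spec grafana.yaml.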

(IBMCEPH-13080)

NFS QoS configuration updates do not take effect when reapplying spec

When the NFS specification is reapplied with updated cluster_qos_config values, the new QoS settings appear as updated in cephadm but do not take effect on the NFS client side. This occurs because the updated configuration is not applied to the running NFS service. As a result, NFS clients continue to operate with the previous QoS settings despite the new values being shown as set.

As a workaround, after reapplying the specification with the updated cluster_qos_config, restart the NFS service or cluster to ensure that the new QoS values take effect.
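
A minimal sketch of this sequence, assuming the NFS specification is stored in a file and the service is managed by the orchestrator. The helper name, the file name nfs.yaml, and the service name nfs.mynfs are hypothetical.

```shell
# Hypothetical helper: reapply an NFS specification and restart the service
# so that the updated cluster_qos_config values reach the running daemons.
# $1 is the specification file, $2 is the orchestrator service name.
reapply_nfs_qos() {
    spec_file="$1"
    service_name="$2"
    # Stop here if the specification cannot be applied.
    ceph orch apply -i "$spec_file" || return 1
    ceph orch restart "$service_name"
}
```

For example: reapply_nfs_qos nfs.yaml nfs.mynfs.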

(IBMCEPH-11821)

Monitor configuration updates are not applied on restart

When you update monitor configuration settings, such as public_network, the changes are not applied when the monitor daemon is restarted. This occurs because monitor daemon configurations are not dynamically refreshed during a restart.

As a result, the monitor continues to run with the previous configuration, and the updated values do not take effect.

As a workaround, redeploy the monitor daemon instead of restarting it after updating the configuration. This ensures that the updated configuration is applied successfully.
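
The redeploy can be sketched as follows. The helper name and the daemon name mon.host01 are hypothetical; list the real monitor daemon names with ceph orch ps --daemon_type mon.

```shell
# Hypothetical helper: redeploy a single monitor daemon instead of
# restarting it. $1 is the daemon name as reported by "ceph orch ps".
redeploy_mon_daemon() {
    daemon_name="$1"
    case "$daemon_name" in
        mon.?*) ;;
        *)
            echo "expected a monitor daemon name such as mon.host01" >&2
            return 1
            ;;
    esac
    ceph orch daemon redeploy "$daemon_name"
}
```

For example: redeploy_mon_daemon mon.host01.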

(IBMCEPH-12242)

Crash daemon cannot access crash directory due to permission changes

When certain services, such as Grafana, are deployed, the permissions of the crash directory can change. As a result, the crash daemon cannot access the directory, preventing it from functioning correctly.

As a workaround, manually change the ownership of the crash directory back to user ID and group ID 167 (the ceph user and group). You must repeat this action each time a daemon deployment changes the ownership of the directory to ensure proper access.
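
The following sketch rests on two assumptions that are not stated above: that 167 is the numeric user and group ID of the ceph user, and that the crash directory is located at /var/lib/ceph/FSID/crash on the affected host. Verify both on your system before running it as root.

```shell
# Hypothetical helper: give the crash directory back to UID/GID 167.
# Assumes the cephadm layout /var/lib/ceph/<fsid>/crash (an assumption).
fix_crash_dir_ownership() {
    # "ceph fsid" prints the cluster FSID.
    fsid=$(ceph fsid) || return 1
    crash_dir="/var/lib/ceph/$fsid/crash"
    chown -R 167:167 "$crash_dir"
}
```

Run fix_crash_dir_ownership on each host where the crash daemon reports access errors.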

(IBMCEPH-12678)

Ceph build

HAProxy deployment fails when QAT is enabled with ingress

Deploying HAProxy with QAT enabled fails when using the ingress feature.

Default HAProxy builds no longer support ssl_engine, and newer OpenSSL versions have removed the legacy engine used by QAT. As a result, HAProxy cannot run with QAT enabled, and deployment fails.

As a workaround, disable QAT support in the HAProxy configuration by setting:
haproxy_qat_support: false
ssl: true

(IBMCEPH-13100)

Ceph Dashboard

NVMe‑oF gateway Subsystems and Namespaces tabs load slowly in large clusters

The Ceph Dashboard can experience slow performance when displaying NVMe-oF gateway subsystem and namespace information in large-scale environments.

Currently, there is no workaround.

(IBMCEPH-14406)

Ceph File System (CephFS)

root_squash kernel client may cause data inconsistency and trigger HEALTH_ERR

A bug in the root_squash implementation can cause changes made by a kernel client restricted with root_squash capabilities to be lost. Although the issue is fixed for the FUSE client and the MDS, the kernel client remains affected. As a result, the cluster emits the following error when it detects a client with the broken root_squash implementation:

HEALTH_ERR: MDS_CLIENTS_BROKEN_ROOTSQUASH

The cluster raises this error because of the risk of data inconsistency and lost updates.

To avoid this issue, stop using root_squash with kernel clients until a fix is available.

To prevent affected clients from connecting, you can evict and permanently block them by setting the required client feature on the file system. Replace FS_NAME with the name of the file system:

ceph fs required_client_features FS_NAME add client_mds_auth_caps

This helps protect the cluster from inconsistent behavior caused by affected clients.
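
As a sketch, the check and the block can be wrapped in two small helpers. The helper names and the file system name cephfs are hypothetical. The required_client_features command takes the file system name as its first argument.

```shell
# Hypothetical helper: succeed if the cluster currently reports the
# MDS_CLIENTS_BROKEN_ROOTSQUASH health code.
has_broken_root_squash_clients() {
    ceph health detail | grep -q MDS_CLIENTS_BROKEN_ROOTSQUASH
}

# Hypothetical helper: require the client_mds_auth_caps feature on one
# file system ($1) so that affected clients can no longer connect.
block_broken_root_squash_clients() {
    fs_name="$1"
    ceph fs required_client_features "$fs_name" add client_mds_auth_caps
}
```

For example: has_broken_root_squash_clients && block_broken_root_squash_clients cephfs.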

(IBMCEPH-14902)

Ceph Object Gateway multi-site

Secondary site displays old zonegroup name after rename

After renaming a zonegroup in a multisite configuration, the secondary site might still display the previous zonegroup name.

Currently, when a zonegroup is renamed on the primary site, the old name is not removed from the .rgw.root pool. As a result, both the old and new zonegroup names appear in the radosgw-admin zonegroup list output, and sync operations might be impacted.

As a workaround, perform the following steps:
  1. Verify that the new zonegroup name exists.
    radosgw-admin zonegroup list
  2. List entries in the .rgw.root pool and locate the old zonegroup name.
    rados -p .rgw.root ls
    The old name appears in the format:
    zonegroups_names.OLD_ZONEGROUP_NAME
  3. Remove the old zonegroup name from the pool:
    rados -p .rgw.root rm zonegroups_names.OLD_ZONEGROUP_NAME

    Removing the old zonegroup name restores normal sync operations.
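
Steps 2 and 3 above can be combined into one guarded helper, sketched here. The helper name and the zonegroup name in the usage example are hypothetical. The helper removes the object only if it is present in the pool listing.

```shell
# Hypothetical helper: remove the stale zonegroup name object for $1
# from the .rgw.root pool, but only if it really exists there.
remove_stale_zonegroup_name() {
    old_name="$1"
    object="zonegroups_names.$old_name"
    # -F: fixed string, -x: whole-line match, -q: no output.
    if rados -p .rgw.root ls | grep -Fxq "$object"; then
        rados -p .rgw.root rm "$object"
    else
        echo "$object not found in .rgw.root" >&2
        return 1
    fi
}
```

For example: remove_stale_zonegroup_name old-zonegroup, followed by radosgw-admin zonegroup list to confirm that only the new name remains.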

(IBMCEPH-13140)

RADOS

ceph versions -f xml command produces non-well-formed XML output

When you run the ceph versions -f xml command, the generated output is not well-formed XML and cannot be parsed by standard XML parsers. This occurs because the command uses full Ceph version strings (including special characters such as dots, parentheses, spaces, and hashes) as XML tag names, which violates XML syntax rules.

As a result, XML parsing fails with errors such as not well‑formed (invalid token), preventing automated processing or validation of the output.

Currently, there is no workaround.

(IBMCEPH-13690)

SMB file services

SMB service downtime during Ceph upgrades leads to client disconnects

Currently, rolling updates are not supported for SMB clusters. As a result, the Ceph‑SMB service is brought down and restarted during version updates, causing service downtime and client disconnects.

Currently, there is no workaround.

(IBMCEPH-11758)

MGR SMB module crashes during remote calls under RADOS instability

When the Manager (MGR) SMB module is called remotely during periods of RADOS instability, it can crash with an sqlite3.InternalError. As a result, a crash entry is logged in the cluster. View the error by using the following command:

ceph crash info CRASH_ID

As a workaround, restart the SMB manager module by disabling and re-enabling it:

ceph mgr module disable smb
ceph mgr module enable smb

This clears the issue, and the crash is no longer reported.

(IBMCEPH-16071)

Ceph NVMe-oF gateway

Using --encryption_algorithm option when creating a namespace can lead to failures

When you create a namespace using the --encryption_algorithm option, the operation can lead to issues due to unsupported or incorrectly handled encryption settings. As a result, the namespace may not function as expected, potentially causing failures in deployment or access.

To avoid this issue, do not use the --encryption_algorithm option when creating a namespace. Allow the parameter to use its default value to ensure proper behavior.

(IBMCEPH-14934)

Auto-listener attempts to bind to interface in DOWN state and causes gateway failure

If a network interface is in a DOWN state but still has an IP address within the configured network-mask subnet, the auto-listener attempts to create a listener on that address during redeployment. Because the interface is not active, the operation fails and causes the gateway to crash. As a result, the gateway cannot start successfully after redeployment.

As a workaround, remove the IP address from the interface that is in the DOWN state so that the gateway does not attempt to bind to it.
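
On a Linux gateway host, the affected addresses can be found and removed with the iproute2 ip command, as sketched below. The helper names, the interface name eth1, and the address are hypothetical, and removing an address requires root privileges.

```shell
# Hypothetical helper: print "INTERFACE ADDRESS" for every address that is
# still assigned to an interface in the DOWN state.
list_down_interfaces_with_addresses() {
    # "ip -br addr show" prints one line per interface: NAME STATE ADDRESSES...
    ip -br addr show | awk '$2 == "DOWN" { for (i = 3; i <= NF; i++) print $1, $i }'
}

# Hypothetical helper: remove address $1 (in ADDRESS/PREFIX form) from
# interface $2.
remove_address() {
    ip addr del "$1" dev "$2"
}
```

For example, if the listing shows eth1 10.0.64.11/22, run remove_address 10.0.64.11/22 eth1 before you redeploy the gateway.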

(IBMCEPH-14286)

ceph nvmeof CLI does not validate --server_address values

Currently, when an incorrect value is provided for --server_address, the CLI does not report an error and instead uses the default gateway IP address. As a result, commands may be executed against an unintended gateway.

Currently, there is no workaround.

(IBMCEPH-14187)

Removing one subsystem network mask drops listeners even when a broader mask still applies

When an NVMe-oF subsystem is configured with multiple network masks (for example, a narrower mask such as 10.0.64.0/22 and a broader mask such as 10.0.0.0/16), removing only the narrower mask can cause existing listeners to be removed unexpectedly. This occurs even when the remaining broader mask still covers the listener IP addresses.

As a result, the ceph nvmeof listener list command may return no listeners, and valid listener entries are no longer shown or used, leading to potential connectivity issues.

Currently, there is no workaround.

(IBMCEPH-14023)

ceph nvmeof get_subsystems CLI output is not easily readable

The output of the ceph nvmeof get_subsystems command can be difficult to read.

Currently, the default command output is poorly formatted and hard to interpret. As a result, users might find it difficult to read subsystem details from the CLI output.

As a workaround, use the --format json option to obtain output in a readable format.

(IBMCEPH-12797)

NVMe-oF gateways fail to start when scaling and encryption key is defined in spec

NVMe-oF gateways do not start when you scale them by using the ceph orch apply command with the --placement option while an encryption key is defined in the specification file. This occurs because the key file is not copied to the nodes where the new gateways are deployed. As a result, the new gateways are prevented from running.

As a workaround, when the encryption_key is defined in the specification file, do not use the --placement option. Use the following command to scale up:
ceph orch apply -i spec.yaml

This ensures that the encryption_key is properly propagated to all nodes where gateways are deployed.

(IBMCEPH-15175)