Known issues
This section documents known issues found in this release of IBM Storage Ceph.
cephadm utility
- Applying RDMA-enabled NFS spec to existing service causes error state
Applying an RDMA-enabled NFS specification to an existing NFS service or cluster causes the service to enter an error state. As a result, the NFS service or cluster becomes unavailable and cannot function correctly.
As a workaround, redeploy the NFS service.
(IBMCEPH-14150)
- Promtail image continues to appear in cephadm list-images output
The Promtail container image might still appear in cephadm list-images output after upgrading to Alloy.
Currently, during the transition from Promtail to Alloy, cephadm continues to register the Promtail container image to maintain backward compatibility. As a result, the Promtail image remains visible even though Alloy is the default logging solution.
As a workaround, no action is required. Ignore the Promtail image entry during the transition phase.
(IBMCEPH-13162)
- Grafana certificate does not migrate during upgrade
When you upgrade from IBM Storage Ceph 8.1.0 to 9.9.0, the existing user-signed Grafana certificate is not migrated. Instead, Grafana switches to a cephadm-signed certificate. As a result, duplicate certificate entries might appear, and certificate-related health warnings can persist. Manual reconfiguration is required if you want to use custom TLS certificates.
Note: Data services remain unaffected.
To work without custom TLS certificates, you can continue using the cephadm-signed certificate.
As a workaround to use custom TLS certificates, complete the following steps:
  - Change the Grafana specification to use certificate_source: reference.
  - Use certmgr to upload a valid user-signed certificate and key for each host.
  - Run the ceph orch reconfig grafana command.
(IBMCEPH-13080)
- NFS QoS configuration updates do not take effect when reapplying spec
When the NFS specification is reapplied with updated cluster_qos_config values, the new QoS settings appear as updated in cephadm but do not take effect on the NFS client side. This occurs because the updated configuration is not applied to the running NFS service. As a result, NFS clients continue to operate with the previous QoS settings despite the new values being shown as set.
As a workaround, after reapplying the specification with the updated cluster_qos_config, restart the NFS service or cluster to ensure that the new QoS values take effect.
(IBMCEPH-11821)
- Monitor configuration updates are not applied on restart
When you update monitor configuration settings, such as public_network, the changes are not applied when the monitor daemon is restarted. This occurs because monitor daemon configurations are not dynamically refreshed during a restart.
As a result, the monitor continues to run with the previous configuration, and the updated values do not take effect.
As a workaround, redeploy the monitor daemon instead of restarting it after updating the configuration. This ensures that the updated configuration is applied successfully.
(IBMCEPH-12242)
- Crash daemon cannot access crash directory due to permission changes
When certain services, such as Grafana, are deployed, the permissions of the crash directory can change. As a result, the crash daemon cannot access the directory, preventing it from functioning correctly.
As a workaround, manually update the permissions of the crash directory to 167. You must repeat this action each time a daemon deployment changes the directory permissions to ensure proper access.
(IBMCEPH-12678)
Ceph build
- HAProxy deployment fails when QAT is enabled with ingress
Deploying HAProxy with QAT enabled fails when using the ingress feature.
Currently, HAProxy no longer supports ssl_engine in default builds, and newer OpenSSL versions have removed the legacy engine used by QAT. As a result, HAProxy cannot run with QAT enabled, and deployment fails.
As a workaround, disable QAT support in the HAProxy configuration by setting:
    haproxy_qat_support: false
    ssl: true
(IBMCEPH-13100)
Ceph Dashboard
- NVMe‑oF gateway Subsystems and Namespaces tabs load slowly in large clusters
The Ceph Dashboard can experience slow performance when displaying NVMe-oF gateway subsystem and namespace information in large-scale environments.
Currently, there is no workaround.
(IBMCEPH-14406)
Ceph File System (CephFS)
- root_squash kernel client might cause data inconsistency and trigger HEALTH_ERR
A bug in the root_squash implementation can cause changes made by a kernel client restricted with root_squash capabilities to be lost. Although the issue is fixed for the FUSE client and the MDS, the kernel client remains affected. As a result, the cluster emits the following error when it detects a client with the broken root_squash implementation:
    HEALTH_ERR: MDS_CLIENTS_BROKEN_ROOTSQUASH
The cluster raises this error because of the risk of data inconsistency and lost updates.
To avoid this issue, discontinue using root_squash with kernel clients until a fix is available.
To prevent affected clients from connecting, you can evict and permanently block them by setting the required client feature:
    ceph fs required_client_features add client_mds_auth_caps
This helps protect the cluster from inconsistent behavior caused by affected clients.
(IBMCEPH-14902)
Ceph Object Gateway multi-site
- Secondary site displays old zonegroup name after rename
After renaming a zonegroup in a multisite configuration, the secondary site might still display the previous zonegroup name.
Currently, when a zonegroup is renamed on the primary site, the old name is not removed from the .rgw.root pool. As a result, both the old and new zonegroup names appear in the radosgw-admin zonegroup list output, and sync operations might be impacted.
As a workaround, perform the following steps:
  - Verify that the new zonegroup name exists.
      radosgw-admin zonegroup list
  - List entries in the .rgw.root pool and locate the old zonegroup name.
      rados -p .rgw.root ls
    The old name appears in the format:
      zonegroups_names.OLD_ZONEGROUP_NAME
  - Remove the old zonegroup name from the pool:
      rados -p .rgw.root rm zonegroups_names.OLD_ZONEGROUP_NAME
    Removing the old zonegroup name restores normal sync operations.
(IBMCEPH-13140)
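The lookup in the second step can be scripted. The following is a minimal sketch, not an official tool: it assumes you have already captured the output of rados -p .rgw.root ls as lines of text and that you know the current zonegroup names. The function name and the sample object names are hypothetical and used only for illustration.

```python
from typing import Iterable, List

# Object-name prefix taken from the format shown in the workaround steps.
PREFIX = "zonegroups_names."


def stale_zonegroup_objects(pool_listing: Iterable[str],
                            current_names: Iterable[str]) -> List[str]:
    """Return zonegroups_names.* objects whose suffix is not a current zonegroup name."""
    current = set(current_names)
    stale = []
    for line in pool_listing:
        obj = line.strip()
        if obj.startswith(PREFIX) and obj[len(PREFIX):] not in current:
            stale.append(obj)
    return stale


# Hypothetical listing, standing in for the output of: rados -p .rgw.root ls
listing = [
    "zonegroups_names.us",
    "zonegroups_names.us-renamed",
    "zone_info.1234",
    "periods.abcd",
]

# Print the removal command for review instead of running it.
for obj in stale_zonegroup_objects(listing, ["us-renamed"]):
    print(f"rados -p .rgw.root rm {obj}")
```

The sketch only prints the candidate removal commands, so that each object name can be checked against the renamed zonegroup before anything is deleted from the pool.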
RADOS
- ceph versions -f xml command produces non-well-formed XML output
When you run the ceph versions -f xml command, the generated output is not well-formed XML and cannot be parsed by standard XML parsers. This occurs because the command uses full Ceph version strings (including special characters such as dots, parentheses, spaces, and hashes) as XML tag names, which violates XML syntax rules.
As a result, XML parsing fails with errors such as not well‑formed (invalid token), preventing automated processing or validation of the output.
Currently, there is no workaround.
(IBMCEPH-13690)
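To illustrate why such output is rejected, the following minimal sketch uses the Python standard library parser. The version string and the surrounding element names are hypothetical and are not the actual command output. The second document, which carries the same string as an attribute value, is shown only for contrast and is not a description of how Ceph formats or will format its output.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical version string containing dots, spaces, and parentheses.
version = "ceph version 19.2.1 (0123abcd) squid (stable)"

# The version string used as a tag name: not a legal XML name.
as_tag = f"<versions><{version}>3</{version}></versions>"

# The same string carried as an attribute value (contrast only).
as_attribute = f'<versions><entry version="{version}">3</entry></versions>'

print(is_well_formed(as_tag))        # False
print(is_well_formed(as_attribute))  # True
```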
SMB file services
- SMB service downtime during Ceph upgrades leads to client disconnects
Currently, rolling updates are not supported for SMB clusters. As a result, the Ceph‑SMB service is brought down and restarted during version updates, causing service downtime and client disconnects.
Currently, there is no workaround.
(IBMCEPH-11758)
- MGR SMB module crashes during remote calls under RADOS instability
When the Manager (MGR) SMB module is called remotely during periods of RADOS instability, it can crash with an sqlite3.InternalError. As a result, a crash entry is logged in the cluster. View the error by using the following command:
    ceph crash info CRASH_ID
As a workaround, restart the SMB manager module by disabling and re-enabling it:
    ceph mgr module disable smb
    ceph mgr module enable smb
This clears the issue, and the crash is no longer reported.
(IBMCEPH-16071)
Ceph NVMe-oF gateway
- Using --encryption_algorithm option when creating a namespace can lead to failures
When you create a namespace using the --encryption_algorithm option, the operation can lead to issues due to unsupported or incorrectly handled encryption settings. As a result, the namespace may not function as expected, potentially causing failures in deployment or access.
To avoid this issue, do not use the --encryption_algorithm option when creating a namespace. Allow the parameter to use its default value to ensure proper behavior.
(IBMCEPH-14934)
- Auto-listener attempts to bind to interface in DOWN state and causes gateway failure
If a network interface is in a DOWN state but still has an IP address within the configured network-mask subnet, the auto-listener attempts to create a listener on that address during redeployment. Because the interface is not active, the operation fails and causes the gateway to crash. As a result, the gateway cannot start successfully after redeployment.
As a workaround, remove the IP address from the interface that is in the DOWN state so that the gateway does not attempt to bind to it.
(IBMCEPH-14286)
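Before redeploying, it can help to find interfaces that match this condition. The following is a minimal sketch under stated assumptions: it expects JSON shaped like the output of ip -j addr show, with ifname, operstate, and addr_info[].local fields, and that field layout is an assumption about iproute2 rather than something stated in these notes. The function name, interface names, addresses, and subnet are hypothetical.

```python
import ipaddress
import json
from typing import List, Tuple


def down_addresses_in_subnet(ip_addr_json: str, subnet: str) -> List[Tuple[str, str]]:
    """List (interface, address) pairs on DOWN interfaces that fall inside subnet."""
    network = ipaddress.ip_network(subnet)
    hits = []
    for iface in json.loads(ip_addr_json):
        # Only interfaces reported as DOWN are of interest here.
        if iface.get("operstate") != "DOWN":
            continue
        for info in iface.get("addr_info", []):
            local = info.get("local")
            if local is None:
                continue
            addr = ipaddress.ip_address(local)
            # Compare only addresses of the same IP version as the subnet.
            if addr.version == network.version and addr in network:
                hits.append((iface.get("ifname", "?"), local))
    return hits


# Hypothetical sample, standing in for the output of: ip -j addr show
sample = json.dumps([
    {"ifname": "eth0", "operstate": "UP",
     "addr_info": [{"family": "inet", "local": "10.0.64.10", "prefixlen": 22}]},
    {"ifname": "eth1", "operstate": "DOWN",
     "addr_info": [{"family": "inet", "local": "10.0.65.20", "prefixlen": 22},
                   {"family": "inet6", "local": "fe80::1", "prefixlen": 64}]},
])

for ifname, address in down_addresses_in_subnet(sample, "10.0.64.0/22"):
    print(f"{ifname} is DOWN but still holds {address}")
```

Any address reported by the sketch is a candidate for removal as described in the workaround.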
- ceph nvmeof CLI does not validate --server_address values
Currently, when an incorrect value is provided for --server_address, the CLI does not report an error and instead uses the default gateway IP address. As a result, commands may be executed against an unintended gateway.
Currently, there is no workaround.
(IBMCEPH-14187)
- Removing one subsystem network mask drops listeners even when a broader mask still applies
When an NVMe-oF subsystem is configured with multiple network masks (for example, a narrower mask such as 10.0.64.0/22 and a broader mask such as 10.0.0.0/16), removing only the narrower mask can cause existing listeners to be removed unexpectedly. This occurs even when the remaining broader mask still covers the listener IP addresses.
As a result, the ceph nvmeof listener list command might return no listeners, and valid listener entries are no longer shown or used, leading to potential connectivity issues.
Currently, there is no workaround.
(IBMCEPH-14023)
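The expected coverage can be checked with a short calculation. The following minimal sketch uses the two example masks from the description and a hypothetical listener address. It shows only which masks cover the address, not how the gateway evaluates masks internally.

```python
import ipaddress
from typing import List


def covering_masks(listener_ip: str, masks: List[str]) -> List[str]:
    """Return the masks whose network contains listener_ip."""
    addr = ipaddress.ip_address(listener_ip)
    return [m for m in masks if addr in ipaddress.ip_network(m)]


masks = ["10.0.64.0/22", "10.0.0.0/16"]
listener = "10.0.65.10"  # hypothetical listener address

# Both masks cover the listener while both are configured.
print(covering_masks(listener, masks))      # ['10.0.64.0/22', '10.0.0.0/16']

# After removing only the narrower mask, the broader mask still covers it.
remaining = [m for m in masks if m != "10.0.64.0/22"]
print(covering_masks(listener, remaining))  # ['10.0.0.0/16']
```

Because the remaining mask still covers the address, the listener would be expected to stay in place, which is the behavior that this issue breaks.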
- ceph nvmeof get_subsystems CLI output is not easily readable
Currently, the default output of the ceph nvmeof get_subsystems command is poorly formatted and hard to interpret. As a result, users might find it difficult to read subsystem details from the CLI output.
As a workaround, use the --format json option to obtain output in a readable format.
(IBMCEPH-12797)
- NVMe-oF gateways fail to start when scaling and encryption key is defined in spec
NVMe-oF gateways do not start when you scale them by using the ceph orch apply command with the --placement option while an encryption key is defined in the specification file. This occurs because the key file is not copied to the nodes where the new gateways are deployed. As a result, the new gateways cannot run.
As a workaround, when the encryption_key is defined in the specification file, do not use the --placement option. Use the following command to scale up:
    ceph orch apply -i spec.yaml
This ensures that the encryption_key is properly propagated to all nodes where gateways are deployed.
(IBMCEPH-15175)