Monitoring the endurance of SSD devices

You can monitor the endurance of the SSDs in your system by using the mmhealth command.

An SSD or physical disk has a finite lifetime based on the number of drive writes per day. The SSD endurance is reported as a number between 0 and 255. The ssd-endurance-percentage value indicates the percentage of the drive's life that has been used: the value 0 indicates that full life remains, and 100 indicates that the drive is at or past its end of life. When the endurance number exceeds 100, the mmhealth command displays an ssd_endurance_warn warning that includes the physical disk name and the recovery group name, and reports the health state of the drive as DEGRADED. A drive whose value exceeds 100 must be replaced.
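
The thresholds described above can be sketched as a small shell helper. This is only an illustration of how the paragraph reads, not the logic that mmhealth itself uses; the function name classify_endurance and the labels it prints are hypothetical.

```shell
# Hypothetical helper (not part of mmhealth): classify an
# ssd-endurance-percentage value according to the description above.
# Assumption: the value is a whole number in the range 0-255.
classify_endurance() {
    value=$1
    # Reject empty or non-numeric input.
    case $value in
        ''|*[!0-9]*) echo "invalid"; return 1 ;;
    esac
    if [ "$value" -gt 255 ]; then
        # Outside the documented 0-255 range.
        echo "invalid"; return 1
    elif [ "$value" -gt 100 ]; then
        # Above 100: ssd_endurance_warn is expected and the drive must be replaced.
        echo "replace"
    elif [ "$value" -eq 100 ]; then
        # Exactly 100: the drive is at or past its end of life.
        echo "end-of-life"
    else
        # Below 100: some rated life remains.
        echo "ok"
    fi
}
```

For example, classify_endurance 42 prints ok, and classify_endurance 120 prints replace.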

Issue the following command to display the health status of the NATIVE_RAID component:

[root@client21 ~]# mmhealth node show NATIVE_RAID

If the endurance number exceeds 100, the system gives an output similar to the following:

Node name:      client21.sonasad.almaden.ibm.com

Component             Status        Status Change     Reasons
----------------------------------------------------------------------------------------------------------------
NATIVE_RAID           DEGRADED      Now               ssd_endurance_warn(rg1/n001p013)
  ARRAY               HEALTHY       Now               -
  NVME                HEALTHY       1 hour ago        -
  PHYSICALDISK        DEGRADED      Now               ssd_endurance_warn(rg1/n001p013)
  RECOVERYGROUP       HEALTHY       Now               -
  VIRTUALDISK         HEALTHY       Now               -

To resolve this warning, replace the SSD physical disk. After the SSD is replaced, issue the mmhealth command again to check the health status of the SSD:

[root@client21 ~]# mmhealth node show NATIVE_RAID

After the issue is resolved, the system gives an output similar to the following:

Node name:      client21.sonasad.almaden.ibm.com

Component             Status        Status Change     Reasons
--------------------------------------------------------------------
NATIVE_RAID           HEALTHY       Now               -
  ARRAY               HEALTHY       Now               -
  NVME                HEALTHY       1 hour ago        -
  PHYSICALDISK        HEALTHY       Now               -
  RECOVERYGROUP       HEALTHY       Now               -
  VIRTUALDISK         HEALTHY       Now               -
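
If you want to check for this warning from a script, the text output shown above can be filtered with standard tools. The following is a minimal sketch, not an IBM-provided utility: the function name list_endurance_warnings is hypothetical, and it assumes that each warning appears in the Reasons column in the form ssd_endurance_warn(recoverygroup/pdisk), as in the sample output. It also assumes a grep that supports the -o option, as on most Linux systems.

```shell
# Hypothetical helper (not part of mmhealth): reads the text output of
# "mmhealth node show NATIVE_RAID" on standard input and prints one line
# per affected drive in the form "<recovery group> <physical disk>".
list_endurance_warnings() {
    # Extract every ssd_endurance_warn(...) token from the input,
    # strip the event name and parentheses, split rg/pdisk on the slash,
    # and drop duplicates (the same drive is listed under several components).
    grep -o 'ssd_endurance_warn([^)]*)' |
        sed -e 's/^ssd_endurance_warn(//' -e 's/)$//' -e 's|/| |' |
        sort -u
}
```

A possible invocation is:

mmhealth node show NATIVE_RAID | list_endurance_warnings

For the degraded sample output above, this prints rg1 n001p013. When no drive reports the warning, it prints nothing.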