IBM Support

IBM Elastic Storage Server Alert: ESS may experience data loss during sudden enclosure power cycle

News


Abstract

ESS with HDD drives may experience data loss during sudden enclosure power cycle

Content

Problem:
ESS Spectrum Scale RAID software requires that all HDD drives within its RAID array operate such that write operations acknowledged by the drive to the host are committed to the drive’s persistent storage. This setting is governed by the write cache enable (WCE) parameter in the SCSI caching mode page 0x08. The correct setting for all ESS HDD drives within all enclosures should be WCE=0, meaning that volatile write caching is disabled.
If an ESS with HDD drives is operated with WCE values other than 0, then the system is at risk of data loss and file system corruption in the event of an unexpected enclosure power loss.
While the occurrence of a power loss may be rare, it is recommended that all customers running an ESS with HDD drives verify their systems with the procedure listed below.
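For background, WCE is bit 2 (mask 0x04) of byte 2 of the caching mode page. A minimal sketch of the bit test, using a canned flags byte rather than a real drive (on a live system this byte comes back from a MODE SENSE of page 0x08):

```shell
#!/bin/bash
# Byte 2 of the caching mode page (0x08) carries the cache control flags;
# the WCE bit is 0x04. Canned value for illustration only.
flags=0x04
(( wce = (flags & 0x4) != 0 ))
echo "WCE=$wce"   # WCE=1 here: volatile write caching enabled, which is unsafe for ESS
```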
 
Detecting the problem:
Enter the following lines into an executable shell script and run it on each ESS I/O node. (On ESS 6.1.2.0 or later, the script is already provided as /opt/ibm/ess/tools/samples/ess5000_wce_check.sh.)
#!/usr/lpp/mmfs/bin/mmksh
typeset -A wce_values
current=0     # mode sense page control: current values
default=2     # mode sense page control: default values
saved=3       # mode sense page control: saved values
exit_code=0
thishost=$(hostname)
now=$(date +%Y%m%d%H%M%S)
tmpfile=/tmp/mytmp.${now}
touch $tmpfile
total_bad_wce=0
total_drives=$( tslsenclslot -ad | mmyfields -s slot Devices | awk -F, '{print $1}' | grep -v "^[ ]*$" | wc -l )
echo "WCECHK: Total: $total_drives drives to check at $thishost" >> $tmpfile
for disk in $( tslsenclslot -ad | mmyfields -s slot Devices | awk -F, '{print $1}' | grep -v "^[ ]*$")
do
    set -A wce_values
    value=0
    bad_wce=false
    for i in $current $default $saved
    do
        byte02=0x$(sg_modes -c $i --page=0x08 $disk --raw | hexdump -C | sed -n 2p | awk '{print $4}')
        (( value = (byte02 & 0x4) != 0 ))
        wce_values[$i]=$value
        if [[ $value -ne 0 ]]; then
            bad_wce=true
        fi
    done
    if [[ $bad_wce == true ]]; then
        (( total_bad_wce++ ))
        exit_code=1
        echo "WCECHK: Bad WCE setting on $disk" >> $tmpfile
        printf "WCECHK: WCE current=%d default=%d saved=%d\n" ${wce_values[$current]} ${wce_values[$default]} ${wce_values[$saved]} >> $tmpfile
    fi
done
echo "WCECHK: Total: $total_bad_wce drives with Bad WCE setting" >> $tmpfile
cat $tmpfile | logger
cat $tmpfile
rm -f $tmpfile
exit $exit_code
If the settings are correct then you will see output similar to this:
       # ess5000_wce_check
         WCECHK: Total: 104 drives to check at c145f03n04.gpfs.net
         WCECHK: Total: 0 drives with Bad WCE setting
If the script finds disks with an incorrect setting, they will be listed in the script's output.

The script also saves its output to /var/log/messages with the prefix "WCECHK" to make it easy to find with grep.
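To illustrate pulling the saved lines back out of the log, this sketch uses a stand-in file rather than the real /var/log/messages (the sample log lines are illustrative):

```shell
#!/bin/bash
# Stand-in log file; on a live system you would grep /var/log/messages instead.
log=$(mktemp)
cat > "$log" <<'EOF'
Jan 01 00:00:00 node kernel: unrelated message
Jan 01 00:00:01 node root: WCECHK: Total: 104 drives to check at c145f03n04.gpfs.net
Jan 01 00:00:02 node root: WCECHK: Total: 0 drives with Bad WCE setting
EOF
grep WCECHK "$log"   # prints only the two WCECHK lines
rm -f "$log"
```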

Resolving the problem:
If drives with an incorrect setting are found, and you are running ESS 6.1.2.1 or later with the sdparm tool available, run this command for each affected drive:
         /usr/bin/sdparm --set WCE=0 --save /dev/<device>

If you are running a version of ESS without the sdparm tool, run this command for each affected drive:
        sg_wr_mode -p 8 -c 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 -m 0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 --save /dev/<device>
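The two byte lists span the caching mode page (2 header bytes plus 18 parameter bytes, 20 in all): the contents list writes zeros, and the mask's 0x04 at byte offset 2 selects only the WCE bit for writing, leaving the other caching settings untouched. A sketch that assembles the lists, to make the layout explicit (the zeros helper is illustrative, not part of any tool):

```shell
#!/bin/bash
# Build a comma-separated run of n zeros (illustrative helper).
zeros() { local n=$1 s; s=$(printf '0,%.0s' $(seq "$n")); printf '%s' "${s%,}"; }

contents=$(zeros 20)        # all 20 page bytes written as zero -> WCE cleared
mask="0,0,4,$(zeros 17)"    # 20 mask bytes; only bit 0x04 of byte 2 is selected
echo "sg_wr_mode -p 8 -c $contents -m $mask --save /dev/<device>"
```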

Rerun the detecting script above to ensure the problem has been addressed.

Platforms impacted:
All ESS platforms with HDD drives at all code levels. 


Document Information

Modified date:
29 February 2024

UID

ibm16485915