News
Abstract
ESS with HDD drives may experience data loss during sudden enclosure power cycle
Content
Problem:
ESS Spectrum Scale RAID software requires that every HDD in its RAID array commit write operations to persistent storage before acknowledging them to the host. This behavior is governed by the write cache enable (WCE) bit in the SCSI caching mode page (0x08). The correct setting for all ESS HDD drives in all enclosures is WCE=0, meaning that volatile write caching is disabled.
If an ESS with HDD drives is operated with WCE set to any value other than 0, the system is at risk of data loss and file system corruption if an enclosure loses power unexpectedly.
Although an enclosure power loss is rare, all customers running an ESS with HDD drives are advised to verify their systems with the procedure below.
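The check below comes down to a single bit: WCE is bit 2 (mask 0x04) of byte 2 of the caching mode page. A minimal sketch of the bit test, using a hypothetical byte value for illustration:

```shell
# WCE is bit 2 (mask 0x04) of byte 2 of SCSI caching mode page 0x08.
byte02=0x14                       # hypothetical byte 2 value with WCE set
(( wce = (byte02 & 0x4) != 0 ))   # 1 = volatile write cache enabled (unsafe)
echo "WCE=$wce"                   # prints WCE=1
```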
Detecting the problem:
Copy the following lines into an executable shell script and run it on each ESS I/O node. (On ESS 6.1.2.0 or later, the script is already provided as /opt/ibm/ess/tools/samples/ess5000_wce_check.sh.)
#!/usr/lpp/mmfs/bin/mmksh
typeset -A wce_values
current=0
default=2
saved=3
exit_code=0
thishost=$(hostname)
now=$(date +%Y%m%d%H%M%S)
tmpfile=/tmp/mytmp.${now}
touch $tmpfile
total_bad_wce=0
total_drives=$( tslsenclslot -ad | mmyfields -s slot Devices | awk -F, '{print $1}' | grep -v "^[ ]*$" | wc -l )
echo "WCECHK: Total: $total_drives drives to check at $thishost" >> $tmpfile
for disk in $( tslsenclslot -ad | mmyfields -s slot Devices | awk -F, '{print $1}' | grep -v "^[ ]*$")
do
    unset wce_values
    typeset -A wce_values
    value=0
    bad_wce=false
    for i in $current $default $saved
    do
        byte02=0x$(sg_modes -c $i --page=0x08 $disk --raw | hexdump -C | sed -n 2p | awk '{print $4}')
        (( value = (byte02 & 0x4) != 0 ))
        wce_values[$i]=$value
        if [[ $value -ne 0 ]]; then
            bad_wce=true
        fi
    done
    if [[ $bad_wce == true ]]; then
        (( total_bad_wce++ ))
        exit_code=1
        echo "WCECHK: Bad WCE setting on $disk" >> $tmpfile
        printf "WCECHK: WCE current=%d default=%d saved=%d\n" ${wce_values[$current]} ${wce_values[$default]} ${wce_values[$saved]} >> $tmpfile
    fi
done
echo "WCECHK: Total: $total_bad_wce drives with Bad WCE setting" >> $tmpfile
cat $tmpfile | logger
cat $tmpfile
rm -f $tmpfile
exit $exit_code
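The byte-extraction pipeline used in the script can be exercised against synthetic data. Assuming, as the script does, that the raw sg_modes response is an 8-byte mode parameter header plus an 8-byte block descriptor followed by the page, caching-page byte 2 sits at offset 0x12, which is the fourth awk field on the second hexdump line:

```shell
# Build a 32-byte stand-in for 'sg_modes --raw' output: 8-byte header plus
# 8-byte block descriptor, then the caching page; offset 0x12 holds 0x14
# (an illustrative value with the WCE bit 0x04 set).
tmp=$(mktemp)
printf '\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\024\000\000\000\000\000\000\000\000\000\000\000\000\000' > "$tmp"
# Same extraction pipeline as the script: second hexdump line, fourth field.
byte02=0x$(hexdump -C "$tmp" | sed -n 2p | awk '{print $4}')
(( wce = (byte02 & 0x4) != 0 ))
echo "byte02=$byte02 WCE=$wce"    # prints byte02=0x14 WCE=1
rm -f "$tmp"
```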
If the settings are correct, the output will look similar to this:
# ess5000_wce_check
WCECHK: Total: 104 drives to check at c145f03n04.gpfs.net
WCECHK: Total: 0 drives with Bad WCE setting
If the script finds disks with an incorrect setting, they are listed in its output.
The script also writes its output to /var/log/messages with the prefix "WCECHK" so the results are easy to find with grep.
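Because every line carries the WCECHK prefix, the syslog copy can be pulled out with a single grep. A sketch against a stand-in log file (the path and messages here are illustrative only):

```shell
log=$(mktemp)                      # stand-in for /var/log/messages
cat > "$log" <<'EOF'
Mar  1 10:00:01 io1 root: WCECHK: Bad WCE setting on /dev/sg12
Mar  1 10:00:01 io1 kernel: unrelated message
EOF
grep WCECHK "$log"                 # prints only the WCECHK line
hits=$(grep -c WCECHK "$log")      # number of matching lines
rm -f "$log"
```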
Resolving the problem:
If drives with an incorrect setting are found, ESS 6.1.2.1 or later is installed, and the sdparm tool is available, run this command for each affected drive:
/usr/bin/sdparm --set WCE=0 --save /dev/<device>
If you are running a version of ESS without the sdparm tool, run this command for each affected drive:
sg_wr_mode -p 8 -c 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 -m 0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 --save /dev/<device>
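The sg_wr_mode invocation relies on its mask: only bits set in the -m list are taken from the -c contents, so a mask whose byte 2 is 4 changes the WCE bit and leaves every other caching-page bit untouched (assuming sg_wr_mode's documented mask semantics). The effect on byte 2 can be sketched as:

```shell
# sg_wr_mode mask semantics (sketch): masked bits come from the contents,
# all other bits keep their current value.
old=0x94        # hypothetical current caching-page byte 2 (WCE plus other bits set)
mask=0x04       # only the WCE bit may change
contents=0x00   # desired value for the masked bits: WCE=0
(( new = (old & ~mask) | (contents & mask) ))
printf 'byte2: 0x%02x -> 0x%02x\n' $old $new    # prints byte2: 0x94 -> 0x90
```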
Rerun the detection script above to confirm the problem has been resolved.
Platforms impacted:
All ESS platforms with HDD drives at all code levels.
Document Information
Modified date:
29 February 2024
UID
ibm16485915