IBM Support

IBM Elastic Storage Server Alert: ESS may experience data loss during sudden enclosure power cycle

News


Abstract

ESS with HDD drives may experience data loss during sudden enclosure power cycle

Content

Problem:
ESS Spectrum Scale RAID software requires that all HDD drives within its RAID array operate such that write operations acknowledged by the drive to the host are committed to the drive’s persistent storage. This setting is governed by the write cache enable (WCE) parameter in the SCSI caching mode page 0x08. The correct setting for all ESS HDD drives within all enclosures should be WCE=0, meaning that volatile write caching is disabled.
If an ESS with HDD drives is operated with WCE values other than 0, then the system is at risk of data loss and file system corruption in the event of an unexpected enclosure power loss.
While the occurrence of a power loss may be rare, it is recommended that all customers running an ESS with HDD drives verify their systems with the procedure listed below.
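For background, WCE is bit 2 (mask 0x04) of byte 2 of the caching mode page. A minimal sketch of the bit test, using a canned flags byte rather than a real drive (on a live system this byte comes back from a MODE SENSE of page 0x08):

```shell
#!/bin/bash
# Byte 2 of the caching mode page (0x08) carries the cache control flags;
# the WCE bit is 0x04. Canned value for illustration only.
flags=0x04
(( wce = (flags & 0x4) != 0 ))
echo "WCE=$wce"   # WCE=1 here: volatile write caching enabled, which is unsafe for ESS
```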
 
Detecting the problem:
Enter the following lines into an executable shell script and run it on each ESS I/O node. (On ESS 6.1.2.0 or later, the script is already provided as /opt/ibm/ess/tools/samples/ess5000_wce_check.sh.)
#!/usr/lpp/mmfs/bin/mmksh
typeset -A wce_values
current=0     # mode sense page control: current values
default=2     # mode sense page control: default values
saved=3       # mode sense page control: saved values
exit_code=0
thishost=$(hostname)
now=$(date +%Y%m%d%H%M%S)
tmpfile=/tmp/mytmp.${now}
touch $tmpfile
total_bad_wce=0
total_drives=$( tslsenclslot -ad | mmyfields -s slot Devices | awk -F, '{print $1}' | grep -v "^[ ]*$" | wc -l )
echo "WCECHK: Total: $total_drives drives to check at $thishost" >> $tmpfile
for disk in $( tslsenclslot -ad | mmyfields -s slot Devices | awk -F, '{print $1}' | grep -v "^[ ]*$")
do
    set -A wce_values
    value=0
    bad_wce=false
    for i in $current $default $saved
    do
        byte02=0x$(sg_modes -c $i --page=0x08 $disk --raw | hexdump -C | sed -n 2p | awk '{print $4}')
        (( value = (byte02 & 0x4) != 0 ))
        wce_values[$i]=$value
        if [[ $value -ne 0 ]]; then
            bad_wce=true
        fi
    done
    if [[ $bad_wce == true ]]; then
        (( total_bad_wce++ ))
        exit_code=1
        echo "WCECHK: Bad WCE setting on $disk" >> $tmpfile
        printf "WCECHK: WCE current=%d default=%d saved=%d\n" ${wce_values[$current]} ${wce_values[$default]} ${wce_values[$saved]} >> $tmpfile
    fi
done
echo "WCECHK: Total: $total_bad_wce drives with Bad WCE setting" >> $tmpfile
cat $tmpfile | logger
cat $tmpfile
rm -f $tmpfile
exit $exit_code
If the settings are correct then you will see output similar to this:
       # ess5000_wce_check
         WCECHK: Total: 104 drives to check at c145f03n04.gpfs.net
         WCECHK: Total: 0 drives with Bad WCE setting
If the script finds disks with an incorrect setting, they will be listed in the script's output.

The script also saves its output to /var/log/messages with the prefix "WCECHK" to make it easy to find with grep.
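To illustrate pulling the saved lines back out of the log, this sketch uses a stand-in file rather than the real /var/log/messages (the sample log lines are illustrative):

```shell
#!/bin/bash
# Stand-in log file; on a live system you would grep /var/log/messages instead.
log=$(mktemp)
cat > "$log" <<'EOF'
Jan 01 00:00:00 node kernel: unrelated message
Jan 01 00:00:01 node root: WCECHK: Total: 104 drives to check at c145f03n04.gpfs.net
Jan 01 00:00:02 node root: WCECHK: Total: 0 drives with Bad WCE setting
EOF
grep WCECHK "$log"   # prints only the two WCECHK lines
rm -f "$log"
```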

Resolving the problem:
If drives with an incorrect setting are found, and you are running ESS 6.1.2.1 or later with the sdparm tool available, run this command for each affected drive:
         /usr/bin/sdparm --set WCE=0 --save /dev/<device>

If you are running a version of ESS without the sdparm tool, run this command for each affected drive:
        sg_wr_mode -p 8 -c 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 -m 0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 --save /dev/<device>
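The two byte lists span the caching mode page (2 header bytes plus 18 parameter bytes, 20 in all): the contents list writes zeros, and the mask's 0x04 at byte offset 2 selects only the WCE bit for writing, leaving the other caching settings untouched. A sketch that assembles the lists, to make the layout explicit (the zeros helper is illustrative, not part of any tool):

```shell
#!/bin/bash
# Build a comma-separated run of n zeros (illustrative helper).
zeros() { local n=$1 s; s=$(printf '0,%.0s' $(seq "$n")); printf '%s' "${s%,}"; }

contents=$(zeros 20)        # all 20 page bytes written as zero -> WCE cleared
mask="0,0,4,$(zeros 17)"    # 20 mask bytes; only bit 0x04 of byte 2 is selected
echo "sg_wr_mode -p 8 -c $contents -m $mask --save /dev/<device>"
```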

Rerun the detecting script above to ensure the problem has been addressed.

Platforms impacted:
All ESS platforms with HDD drives at all code levels. 


Document Information

Modified date:
29 February 2024

UID

ibm16485915