IBM Support

IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption

Flashes (Alerts)


Abstract

IBM has identified a problem in IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels: resending an NSD RPC after a network reconnect may result in file system corruption or undetected file data corruption.

Content

Problem Summary:

When a TCP connection breaks between two nodes, an attempt is made to re-establish the TCP connection by initiating a new connection and then restoring the state of the RPC flow, which includes resending RPCs that were pending at the time the connection was broken. A problem has been identified in the logic that handles this resending.

When the nsdMsgWriteExt RPC is submitted to an NSD server asynchronously, the RPC data is copied to a temporary buffer so that it can be resent if a network reconnect takes place. A problem in the logic that handles the original buffer and this temporary buffer may result in important data at the start of the RPC (such as the NSD disk number, sector index, and bytes to write) not being copied to the newly allocated heap buffer. The first part of the user file data is then interpreted as the NSD disk number, sector index, and bytes to write.

The most likely outcome is a daemon crash with 'logAssertFailed: !"Request and queue size mismatch"'. However, the user data may be similar enough to an NSD header that the RPC is interpreted as a correct NSD RPC. In this worst case, which should be rare, the result may be file system corruption or undetected file data corruption.
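
The effect of a missing header can be illustrated with a small Python model. This is only a sketch: the header layout (`HEADER`), the helper names, and the size check are hypothetical stand-ins chosen for illustration and do not reflect the actual GPFS wire format or daemon code. It shows why ordinary file data usually fails a size sanity check, and how file data that happens to begin with header-like bytes could pass it.

```python
import struct

# Hypothetical header layout, NOT the real GPFS wire format:
# disk number (int32), sector index (int64), bytes to write (uint32).
HEADER = struct.Struct("<iqI")


def build_rpc(disk, sector, payload):
    """Build a well-formed message: header followed by user data."""
    return HEADER.pack(disk, sector, len(payload)) + payload


def resend_copy_without_header(rpc):
    """Model the faulty resend copy: the header is lost, only user data remains."""
    return rpc[HEADER.size:]


def parse_rpc(buf):
    """Interpret the first bytes of buf as a header, as a receiver would."""
    disk, sector, nbytes = HEADER.unpack_from(buf)
    return disk, sector, nbytes


def looks_consistent(buf):
    """Simplified stand-in for a size sanity check on the receiving side."""
    _, _, nbytes = parse_rpc(buf)
    return nbytes == len(buf) - HEADER.size


# Typical case: ordinary file data does not resemble a header.
good = build_rpc(3, 1024, b"A" * 64)
bad = resend_copy_without_header(good)

# Rare case: the file data itself starts with bytes that form a plausible header.
crafted_payload = HEADER.pack(7, 2048, 48) + b"x" * 48
unlucky = resend_copy_without_header(build_rpc(3, 1024, crafted_payload))
```

In this model, `looks_consistent(bad)` is false, which corresponds to the assert path, while `looks_consistent(unlucky)` is true and `parse_rpc(unlucky)` yields a disk number and sector index that came from file data rather than from the sender's header.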

Users Affected:

This issue affects customers running any level of IBM Spectrum Scale (GPFS) V4.1 or V4.2, when the following conditions are met:

1) Network instability results in a network reconnect, with the following messages in mmfs.log indicating the network reconnect took place:

[E] Close connection to xxx.xxx.xxx.xxx nodename <c0n0> (Connection reset by peer). Attempting reconnect.
[N] Connecting to xxx.xxx.xxx.xxx nodename <c0n0>
[N] Reconnected to xxx.xxx.xxx.xxx nodename <c0n0>
[...]
[E] Close connection to xxx.xxx.xxx.xxx nodename <c0n0> (Broken pipe). Attempting reconnect.
[I] Accepted and connected to xxx.xxx.xxx.xxx nodename <c0n0>

2) The nsdMsgWriteExt RPC with inline data is resent by the network reconnect function.

a) The most likely result would be a daemon assert in the NSD server, as shown in this example:

2017-09-02_05:20:05.407-0400: [I] Accepted and connected to xxx.xxx.xxx.xxx nodename  <c0n2>
2017-09-02_05:20:06.450-0400: [E] NSD: Request and queue size mismatch: rsize 2010566525, qsize 4194304, queue NSD type NsdQueueTraditional [4], da -668478587:-4880623824366314304, flags 0x7e00fbce, nsdId 21CD1ACF:6F2A3718
2017-09-02_05:20:06.450-0400: [X] logAssertFailed: !"Request and queue size mismatch"
2017-09-02_05:20:06.450-0400: [X] return code 0, reason code 0, log record tag 0
2017-09-02_05:20:06.482-0400: [E] Signal 11 at location 0x7F98CD3A2347 in process 15659, link reg 0xFFFFFFFFFFFFFFFF.
[...]
2017-09-02_05:20:09.249-0400: [X] *** Assert exp(!"Request and queue size mismatch") in line 6612 of file nsdServer.C
2017-09-02_05:20:09.249-0400: [E] *** Traceback:
2017-09-02_05:20:09.249-0400: [E]         2:0x7F98CC699F23 logAssertFailed + 0x3C3 at Logger.C:572
2017-09-02_05:20:09.249-0400: [E]         3:0x7F98CD3C72E4 nsdWorkerThread(void*) + 0x8A4 at nsdServer.C:6612
2017-09-02_05:20:09.249-0400: [E]         4:0x7F98CC0EAAA5 Thread::callBody(Thread*) + 0x45 at thread.C:384
2017-09-02_05:20:09.249-0400: [E]         5:0x7F98CC0D5D00 Thread::callBodyWrapper(Thread*) + 0xC0 at mastdep.C:1072
2017-09-02_05:20:09.249-0400: [E]         6:0x7F98CB5B2DC5 start_thread + 0xC5 at mastdep.C:1072
2017-09-02_05:20:09.249-0400: [E]         7:0x7F98CA8BB1CD __clone + 0x6D at mastdep.C:1072
mmfsd: nsdServer.C:6612: void logAssertFailed(UInt32, const char*, UInt32, Int32, Int32, UInt32, const char*, const char*): Assertion `!"Request and queue size mismatch"' failed.

b) In rare cases, the result may be file system corruption or undetected file data corruption, if the initial portion of the user file data is interpreted as a correct NSD RPC header.
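
To check a node for the messages listed above, a log scan along the following lines may help. This is a minimal sketch: the patterns are derived only from the sample messages in this flash, the function and category names are illustrative, and other code levels may word the messages differently. The "Accepted and connected" pattern is collected for context only, since that message may also appear outside of a reconnect.

```python
import re

# Patterns are taken from the sample messages in this flash; other GPFS
# levels may word them differently, so treat them as a starting point.
PATTERNS = {
    "reconnect_attempt": re.compile(r"\[E\] Close connection to .* Attempting reconnect\."),
    "reconnected": re.compile(r"\[N\] Reconnected to "),
    "accepted": re.compile(r"\[I\] Accepted and connected to "),
    "size_mismatch_assert": re.compile(r'logAssertFailed: !"Request and queue size mismatch"'),
}


def scan_log(lines):
    """Return {category: [matching lines]} for an iterable of log lines."""
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits


def reconnect_seen(hits):
    """True if a reconnect attempt or a completed reconnect was logged."""
    return bool(hits["reconnect_attempt"] or hits["reconnected"])
```

The function accepts any iterable of lines, so an open file object for mmfs.log (commonly found at /var/adm/ras/mmfs.log.latest, which is an assumption about the local installation) can be passed directly. A clean scan does not prove that a file system is unaffected, because older log files may have been rotated away.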

Recommendations:

1. Users running IBM Spectrum Scale V4.2.0.0 through V4.2.3.4 should apply IBM Spectrum Scale V4.2.3.5, available from Fix Central at:
https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Spectrum+Scale&release=4.2.3&platform=All&function=all
Alternatively, contact IBM Service to obtain and apply the efix for their level of code, referencing APAR IJ00398.

2. Users running IBM GPFS V4.1.0.0 through V4.1.0.8, or IBM Spectrum Scale V4.1.1.0 through V4.1.1.17, should apply IBM Spectrum Scale V4.1.1.18, available from Fix Central at:
https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Spectrum+Scale&release=4.1.1&platform=All&function=all
Alternatively, contact IBM Service to obtain and apply the efix for their level of code, referencing APAR IJ00451.

3. If you believe that your Spectrum Scale file system may be affected by this issue, you should contact IBM Service as soon as possible for further guidance and assistance.
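
The version ranges in recommendations 1 and 2 can be expressed as a small lookup, which may be convenient when checking many nodes. This is a sketch only: the function names are illustrative, the installed level has to be obtained separately (for example from the output of `mmdiag --version`), and a node that already has an efix applied may still report a level inside an affected range.

```python
# (first affected, last affected, fix level, APAR), as listed in the recommendations.
AFFECTED_RANGES = [
    ((4, 1, 0, 0), (4, 1, 0, 8), "4.1.1.18", "IJ00451"),
    ((4, 1, 1, 0), (4, 1, 1, 17), "4.1.1.18", "IJ00451"),
    ((4, 2, 0, 0), (4, 2, 3, 4), "4.2.3.5", "IJ00398"),
]


def parse_version(text):
    """Turn '4.2.3.4' (optionally prefixed with 'V') into a tuple of four ints."""
    parts = text.strip().lstrip("Vv").split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {text!r}")
    return tuple(int(p) for p in parts)


def recommended_fix(version_text):
    """Return (fix_level, apar) if the level is in an affected range, else None."""
    version = parse_version(version_text)
    for first, last, fix, apar in AFFECTED_RANGES:
        if first <= version <= last:
            return fix, apar
    return None
```

For example, `recommended_fix("4.2.3.4")` returns the V4.2.3.5 fix level with APAR IJ00398, while a level at or above the fix level returns `None`.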


Document Information

Modified date:
26 September 2022

UID

ssg1S1010668