We've had a cluster since... 2011, with a 5-node quorum, 4 of them holding most of the NSDs. The fifth node is mainly a tiebreaker, with NSDs used only for desc (filesystem descriptors).
The cluster started on GPFS 3.4 and is now at 3.5.11 (edit: finally 3.5.13).
We also have 6 GPFS clients.
Over the past 2 weeks, we've moved a lot of data between our NSDs because we replaced our previous LUNs with new ones: around 40 TB of data moved with mmadddisk and mmdeldisk. On Wednesday, we had a first occurrence of mmfsd crashing on one of the quorum/NSD nodes (which was also the FS manager for 1 of our 4 filesystems). The impact lasted about 1 min 20 s, but everything went back to normal without manual intervention.
Thursday evening, while some mmrestripe jobs were running, the same thing happened on 3 of the quorum nodes (so we lost quorum) and we had an impact of about 5 min 30 s. Again, everything went back to normal without manual intervention.
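For context, the migration was essentially the standard add/delete/restripe cycle. A sketch with made-up names (fs1, old_nsd1, the stanza file path are placeholders, not our actual names):

```shell
# Hedged sketch of the LUN replacement workflow described above
# (device and NSD names are invented; stanza file format per GPFS 3.5).

# 1. Add the new NSDs (backed by the new LUNs) to the filesystem.
mmadddisk fs1 -F /tmp/new_disks.stanza

# 2. Remove the old NSDs; GPFS migrates their data off to the remaining disks.
mmdeldisk fs1 "old_nsd1;old_nsd2"

# 3. Rebalance data evenly across all remaining disks.
mmrestripefs fs1 -b
```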
In our investigation, we found this error in the mmfs log:
Thu May 22 21:15:53.870 2014: GPFS: 6027-611 Recovery: mfg2, delay 10 sec. for safe recovery.
Thu May 22 21:16:31.386 2014: The pagepool size may be too small. Try increasing the pagepool size or adjusting pagepool usage with, for example, nsdBufSpace, nsdRAIDBufferPoolSizePct, or verbsSendBufferMemoryMB).
Thu May 22 21:16:31.396 2014: logAssertFailed: !"More than 22 minutes searching for a free buffer in the pagepool"
Thu May 22 21:16:31.408 2014: return code 0, reason code 0, log record tag 0
The assert subroutine failed: !"More than 22 minutes searching for a free buffer in the pagepool", file ../../../../../../../src/avs/fs/mmfs/ts/bufmgr/bufmgr.C, line 519
Our GPFS servers had a 128 MB pagepool; 3 of our GPFS clients had a 1 GB pagepool, and the others were at 128 MB as well.
So we thought of increasing it to 1 GB, but further investigation led us to think that with nsdBufSpace at 30%, we would need a pagepool of 3 GB on the GPFS servers. That leaves 70% of 3 GB pretty much useless: the remaining pagepool would be usable for GPFS caching, but there is not much client activity on those nodes. So we are thinking of using a 1 GB pagepool but with nsdBufSpace at 90%.
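To make the sizing arithmetic explicit, here is the comparison we did, as a small sketch (the numbers are ours; treat it as an illustration of the trade-off, not official guidance):

```python
# Sketch of the pagepool sizing arithmetic discussed above.
# nsdBufSpace is the percentage of the pagepool reserved for
# NSD server buffers; the rest is available for GPFS caching.

GIB = 1024 ** 3

def nsd_buf_bytes(pagepool_bytes, nsd_buf_space_pct):
    """Bytes of pagepool reserved for NSD server buffers."""
    return pagepool_bytes * nsd_buf_space_pct // 100

# Option A: keep the 30% default -> need a 3 GiB pagepool
# to get ~0.9 GiB of NSD buffer space.
opt_a = nsd_buf_bytes(3 * GIB, 30)

# Option B: 1 GiB pagepool with nsdBufSpace raised to 90%
# -> same NSD buffer space with a third of the memory.
opt_b = nsd_buf_bytes(1 * GIB, 90)

print(opt_a == opt_b)  # both give the same NSD buffer space
```

Option B wastes far less memory on a cache that our NSD servers barely use, which is why we are leaning that way.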
What are your thoughts?
Also, we were surprised today to see the same kind of crash on one of our GPFS clients. We thought the issue would only be on the NSD servers... so we are kinda confused.