YannickBergeron

Pinned topic GPFS 3.5, pagepool and nsdBufSpace

2014-05-23T19:33:23Z

Hi,

We've had a cluster since 2011: a 5-node quorum, 4 of which hold most of the NSDs. The fifth node is mainly a tiebreaker, with NSDs used only for descriptors.

The cluster started on GPFS 3.4 and is now at 3.5.0.11 (edit: actually 3.5.0.13).

We also have 6 GPFS clients.

In the past 2 weeks, we've moved a lot of data around our NSDs because we replaced our previous LUNs with new ones: around 40TB of data moved with mmadddisk and mmdeldisk, roughly as sketched below. On Wednesday, we had a first occurrence of mmfsd crashing on one of the quorum/NSD servers (which was also the FS manager for 1 of our 4 filesystems). It caused roughly 1min20 of impact, but everything went back to normal without manual intervention.
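
For reference, each LUN swap was roughly the sequence below (the file system name, descriptor file and disk names are placeholders, not our real ones):

# add the new NSDs to the file system from a disk descriptor file
mmadddisk gpfs_fs1 -F /tmp/new_disks.txt
# remove the old NSDs; mmdeldisk migrates their data onto the remaining disks
mmdeldisk gpfs_fs1 "old_nsd_01;old_nsd_02"
# rebalance the file system across the new disks
mmrestripefs gpfs_fs1 -b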

Thursday evening, while some mmrestripefs operations were running, the same thing happened on 3 of the quorum nodes (so we lost quorum) and we had roughly 5min30 of impact. Again, everything went back to normal without manual intervention.

 

In our investigation, we found this error in the mmfs log:

Thu May 22 21:15:53.870 2014: GPFS: 6027-611 Recovery: mfg2, delay 10 sec. for safe recovery.
Thu May 22 21:16:31.386 2014: The pagepool size may be too small.  Try increasing the pagepool size or adjusting pagepool usage with, for example, nsdBufSpace, nsdRAIDBufferPoolSizePct, or verbsSendBufferMemoryMB).
Thu May 22 21:16:31.396 2014: logAssertFailed: !"More than 22 minutes searching for a free buffer in the pagepool"
Thu May 22 21:16:31.408 2014: return code 0, reason code 0, log record tag 0
The assert subroutine failed: !"More than 22 minutes searching for a free buffer in the pagepool", file ../../../../../../../src/avs/fs/mmfs/ts/bufmgr/bufmgr.C, line 519

...

 

Our GPFS servers had a 128MB pagepool; 3 of our GPFS clients had a 1GB pagepool and the others were also at 128MB.

So we thought of increasing it to 1GB, but more investigation led us to think that with nsdBufSpace at 30%, we would need a pagepool of 3GB on the GPFS servers. That leaves 70% of the 3GB pretty much useless, since the remaining pagepool would only be usable as GPFS cache and there is not much client activity on those nodes. So we are thinking about using a 1GB pagepool with nsdBufSpace at 90% instead, roughly as shown below.
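
Concretely, and assuming a 90% value is actually accepted for nsdBufSpace (still to be verified), the change would look something like this, with placeholder node names for our 4 NSD servers:

# check the current values
mmlsconfig pagepool
mmlsconfig nsdBufSpace
# raise the pagepool and the NSD buffer share on the 4 NSD servers
mmchconfig pagepool=1G -N nsd1,nsd2,nsd3,nsd4
mmchconfig nsdBufSpace=90 -N nsd1,nsd2,nsd3,nsd4
# restart mmfsd on those nodes, one at a time, so the new values take effect
mmshutdown -N nsd1 && mmstartup -N nsd1
# afterwards, check pagepool usage on a server
mmdiag --memory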

What are your thoughts?

 

Also, we had the surprise today of the same kind of crash on one of our GPFS clients. We thought the issue would only be on the NSD servers... so we are kind of confused.

 

Best regards,

 

Yannick Bergeron

Updated on 2014-05-26T11:41:30Z by YannickBergeron
  • yuri

    Re: GPFS 3.5, pagepool and nsdBufSpace

    2014-05-23T20:42:16Z  in response to YannickBergeron

    This isn't necessarily something that can be entirely solved with tuning.  It's true that a 128M pagepool isn't big by modern standards.  Increasing it to 1G would be a good idea regardless.  I don't think you need to go as high as 3G in your case.  However, I would recommend opening a PMR and uploading a gpfs.snap package, so that the logs and the internal dumps can be analyzed.  There may be a bug in there.  3.4.0.11 is a bit old at this point, so this may be a known issue.
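
    For the snap, something like the following, run as root on one of the affected NSD servers, should be enough to get started; the resulting archive can then be attached to the PMR:

    # collect GPFS logs and internal dump data for the PMR
    /usr/lpp/mmfs/bin/gpfs.snap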

    yuri

  • YannickBergeron

    Re: GPFS 3.5, pagepool and nsdBufSpace

    2014-05-26T11:41:10Z  in response to YannickBergeron

    You probably mean 3.5.0.11 and not 3.4.0.11.

    I've double-checked and we're at 3.5.0.13, not 3.5.0.11.