Topic
  • 7 replies
  • Latest Post - ‏2008-06-27T15:03:57Z by dlmcnabb
SystemAdmin
SystemAdmin
2092 Posts

Pinned topic Measuring GPFS Pagepool Consumption

‏2008-06-26T02:20:13Z |
While validating mission-critical application benchmark improvements is perhaps the best way to determine whether the memory invested in the GPFS pagepool is a good investment, not all applications running on generic compute nodes within a GPFS cluster will be properly profiled, and most will likely exploit the pagepool differently. Does GPFS provide a way to extract actual pagepool consumption versus allocation (pinned memory) metrics?
Updated on 2008-06-27T15:03:57Z at 2008-06-27T15:03:57Z by dlmcnabb
  • dlmcnabb
    dlmcnabb
    1012 Posts

    Re: Measuring GPFS Pagepool Consumption

    ‏2008-06-26T07:43:57Z  
    Since the pagepool is a cache, it only throws things out when something else needs space. Therefore, after filling up, it is always 100% "consumed".

    If you want gory details, "mmfsadm dump pgalloc" will show more than anyone really understands, and there is no description of this dumped data. (WARNING: Be careful when using the mmfsadm dump command on an active system, since it may follow a pointer to deallocated space which may cause a SIGSEGV. So only use it when the system is mostly idle.)
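
    If you want a repeatable way to look at it, a minimal sketch is to capture that dump to a file so successive snapshots can be compared (the file naming below is an assumption, not part of GPFS):

    # Capture a snapshot of the pagepool segment/buffer state; run only while the
    # node is mostly idle, per the warning above.
    mmfsadm dump pgalloc > /tmp/pgalloc.$(hostname).$(date +%s)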
  • gcorneau
    gcorneau
    162 Posts

    Re: Measuring GPFS Pagepool Consumption

    ‏2008-06-26T13:12:08Z  
    One other comment. You stated "...actual pagepool consumption versus allocation (pinned memory) metrics", and I thought I'd point out that the GPFS pagepool's pinned memory is allocated when the GPFS daemon starts. That is, GPFS doesn't allocate just some of the pagepool at the beginning and then allocate more as needs increase. You can see the overall memory utilization for GPFS on AIX via "svmon -P <pid_of_mmfsd>".
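
    A minimal sketch of that check on AIX (locating the mmfsd PID with ps/awk is just one way to do it, not a GPFS-specific method):

    # Find the GPFS daemon's PID, then show its memory footprint, including pinned pages.
    MMFSD_PID=$(ps -ef | awk '/[m]mfsd/ {print $2; exit}')
    svmon -P $MMFSD_PID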

    --
    Glen Corneau
    IBM Power Systems Advanced Technical Support
  • SystemAdmin
    SystemAdmin
    2092 Posts

    Re: Measuring GPFS Pagepool Consumption

    ‏2008-06-27T00:05:49Z  
    Don't worry Glen, I do not currently have any plans to exploit Loadleveler's llsubmit filter to custom tailor GPFS's pagepool on compute nodes for particular batch jobs that could benefit from a larger than normal pagepool versus those that could not, but...

    according to the GPFS documentation, pagepool is advertised to be dynamically changeable these days via mmchconfig -i/-I. This would lead me to believe that you can change the value per node without shutting down and restarting the GPFS daemon. Is this true, false, or misleading? Are there caveats for shrinking versus growing?

    Over time, the pagepool on a given node is not going to be 100% "hot" for all applications despite being 100% allocated, hence the buffer descriptor labels "cold, done, free, hot, and inactive". I agree that mmfsadm probing is far from the preferred solution. But as long as carbon-based units are required to turn the GPFS pagepool/MFTC/MSC cache-tuning knobs (as opposed to GPFS doing real-time cache self-optimization), the required information needs to be accessible in order to make tuning decisions and to monitor the results after those decisions are implemented.

    If mmpmon is the preferred monitoring mechanism, are there plans to incorporate "pagepool, MFTC, and MSC" cache metrics?
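
    For reference, a minimal sketch of what mmpmon reports today (a standard fs_io_s request with the documented -p/-s flags); as far as I can tell it shows per-filesystem I/O counters but no pagepool/MFTC/MSC occupancy:

    # Parseable per-filesystem I/O statistics; no cache-occupancy counters appear here.
    echo fs_io_s | mmpmon -p -s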
  • dlmcnabb
    dlmcnabb
    1012 Posts

    Re: Measuring GPFS Pagepool Consumption

    ‏2008-06-27T01:36:33Z  
    Changing the pagepool is perfectly fine, but you have to be slightly careful when increasing the size beyond any previous size.

    There is a fixed-size table of the managed pagepool segments. When new territory is needed, one or more of the slots in this table are used to describe it. These segment memory areas are never returned to the system; when you reduce the pagepool size, GPFS unpins portions of the pagepool memory and stops using those addresses. If you then make the pagepool bigger, GPFS first repins the areas it had unpinned, and once the existing segments are all pinned it adds new segments.

    So increasing the pagepool 10M at a time will use up the slots fast, whereas increasing it by a 1G chunk will create one big segment. You can then drop the pagepool size back down (by 1G-10M, say) and afterwards bump it up and down freely without consuming more slots.

    Look at the table at the end of the "mmfsadm dump pgalloc" output to see these segments.

    An AIX 64-bit kernel has 1024 entries, and each segment can be 256M, for a maximum pagepool of 256G.
    An AIX 32-bit kernel has only 8 slots, for a 2G maximum.

    Linux 64-bit machines can have 256 entries, and each segment can be 1G, for a 256G maximum.
    Linux 32-bit machines can have 8 slots, for an 8G maximum.

    Once all the slots are filled (or all pinnable memory is used up), increasing the pagepool will just quietly leave it at the size it had before.
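
    A minimal sketch of that pattern (node name and sizes are hypothetical; -I applies the change immediately without touching the permanent configuration, as discussed below):

    # Grow in one large step so only a few segment slots are consumed.
    mmchconfig pagepool=8G -I -N compute01
    # Shrink back down; the segments stay in the table, but the memory is unpinned.
    mmchconfig pagepool=2G -I -N compute01
    # Later grow/shrink cycles within 8G only repin/unpin the existing segments.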
  • SystemAdmin
    SystemAdmin
    2092 Posts

    Re: Measuring GPFS Pagepool Consumption

    ‏2008-06-27T03:23:05Z  
    OK, let me add another question that has been asked often:

    • GPFS docs suggest that pagepool be < 50% of physical RAM on Linux. Is this a guideline or a hard limit originating from the GPFS code? My guess is that the limit ensures that apps running on the node are not starved of memory. Is this correct?

    If I have nodes with, say, 64GB of physical memory, and these nodes do not run anything other than GPFS, can I allot 60GB to GPFS? The OS and daemons need < 500MB (actually measured), so I would still have 3+GB free.

    PS: I did test a cluster of x3755s with a 30GB pagepool on 32GB of physical memory, and the nodes crashed badly (no data corruption, though).

    Regards
    Anand
  • dlmcnabb
    dlmcnabb
    1012 Posts

    Re: Measuring GPFS Pagepool Consumption

    ‏2008-06-27T12:30:55Z  
    I guess a simpler way of putting it:
    Set the permanent configuration to specify the largest pagepool you would ever use:
    mmchconfig pagepool=$biggestpagepool
    This sets up the segments describing the pagepool during daemon startup. Then, in the /var/mmfs/etc/mmfsup user-configurable script, immediately drop the pagepool for the node back down to its normal level using
    mmchconfig pagepool=$normalpagepool -I -N $thisnode
    The -I will change the daemon value but not change the permanent configuration setting.

    Then when you want to change the pagepool for a specific set of nodes:
    mmchconfig pagepool=$newpagepoolvalue -I -N $node1,$node2,...,$nodeN

    The effect of changing the pagepool in this case will just be to pin/unpin parts of the pagepool as needed. Note that when unpinning, the contents of that memory are basically thrown out of the cache (flushed to disk if necessary), so they would have to be read back in if they are still actively being used.
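
    A minimal sketch of such an mmfsup script (the normal-size value, the use of /bin/sh, and hostname as the GPFS node name are assumptions):

    #!/bin/sh
    # /var/mmfs/etc/mmfsup -- invoked when the GPFS daemon comes up on this node.
    # Drop the pagepool from the large permanent setting back to this node's normal size.
    NORMAL_PAGEPOOL=2G    # hypothetical per-node value
    /usr/lpp/mmfs/bin/mmchconfig pagepool=$NORMAL_PAGEPOOL -I -N $(hostname)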
  • dlmcnabb
    dlmcnabb
    1012 Posts

    Re: Measuring GPFS Pagepool Consumption

    ‏2008-06-27T15:03:57Z  
    In release 3.2 there is a GPFS config variable, PagepoolMaxPhysMemPct, which defaults to 75% and can be increased to 90%.

    In release 3.1, GPFS will try to allocate/pin whatever you specify until the operating system refuses.

    Either case may also completely use up the operating system's memory, leaving it no room when other applications need to pin memory. So the documentation is a guideline, but use your best judgment.
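
    A minimal sketch of that tuning on the 64GB node described above (the node name is hypothetical, the variable spelling follows this post, and the release documentation should be checked for whether the new percentage takes effect without a daemon restart):

    # Release 3.2+: raise the cap from the 75% default, then size the pagepool under it.
    mmchconfig PagepoolMaxPhysMemPct=90 -N ionode01
    mmchconfig pagepool=56G -N ionode01    # 56G is below 90% of 64G; the 60G asked about (~94%) is not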