Topic
  • 3 replies
  • Latest Post - 2013-10-10T16:15:01Z by dlmcnabb
db808
87 Posts

Pinned topic Relationship between GPFS storage pool blocksize and file system-level block size (data or metadata)

2013-10-03T16:03:13Z

GPFS 3.5 introduced a blocksize parameter associated with a storage pool.  I believe this parameter is most often used with GPFS Native RAID and/or the new GPFS File Placement Optimizer (FPO).

However, the GPFS manual text for "mmadddisk" does NOT restrict the use of the blocksize parameter within the %pool stanza entry.  The pool-level blocksize parameter also does NOT carry the specific size restrictions that apply to the GPFS-level maximum blocksize parameter, or to the file system-level data block size and metadata block size.
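
For reference, this is the sort of stanza file I am talking about (the pool, NSD, and file system names are placeholders, and the attribute spelling is just my reading of the 3.5 mmadddisk documentation):

    %pool:
      pool=datapool2
      blockSize=2M

    %nsd:
      nsd=nsd17
      usage=dataOnly
      failureGroup=2
      pool=datapool2

    mmadddisk fs0 -F /tmp/pool2.stanza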

If you are using traditional GPFS, without native RAID and without FPO, can you use a separate (different) storage pool-level blocksize along with a file system-specific data blocksize and metadata blocksize?

If my file system data block size is 8MB, but the storage pool block size is 2MB, what happens?

If my file system data block size is 1MB, but the storage pool block size is 8 MB, what happens?

Thanks,

Dave B

 

  • yuri
    277 Posts
    ACCEPTED ANSWER

    Re: Relationship between GPFS storage pool blocksize and file system-level block size (data or metadata)

    2013-10-03T23:05:08Z

    What GPFS really allows is having two block sizes: one for data and one for metadata.  If the two sizes are not the same, the system pool must be metadata-only.  In the future, GPFS may possibly allow specifying different block sizes for different pools, and the stanza syntax rules allow for that, but that's not an option at present.  Supporting different block sizes for different pools while also allowing files to be migrated between pools is actually pretty hard, because it means that during pool migration file metadata would have to address blocks of different sizes.
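
    For example, at file system creation time the two sizes are specified roughly like this (illustrative only; check the mmcrfs man page for your level):

        mmcrfs fs0 -F nsd.stanza -B 8M --metadata-block-size 256K

    with the metadata-only NSDs placed in the system pool.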

    The need for different data and metadata block sizes stems not so much from GNR or FPO as from the fact that on certain storage systems using very large data blocks (e.g. 16MiB) produces tangible performance benefits.  Using jumbo blocks for metadata, on the other hand, doesn't make sense: metadata IO is rarely full-block, so the performance benefit isn't there.  It is also counterproductive for certain metadata structures to be too large (locking granularity becomes too coarse), so they have a cap on their maximum size, and if the block size is so large that even a subblock (1/32nd of a block) is larger than, say, an indirect block, some space is wasted.  So for metadata it's not recommended to use a block size larger than 1MiB; in fact, the default of 256KiB should work well.
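
    To put rough numbers on the subblock point, here is a quick sketch (plain Python, illustrative only; the 1/32 subblock ratio is as described above, and the ~32KiB cap is the directory-block limit mentioned later in this thread):

        # A subblock is 1/32 of the block size.  Capped metadata structures
        # (e.g. directory blocks, ~32 KiB) cannot use the rest of the subblock,
        # so the remainder is dead space.
        CAP_KIB = 32
        for block_kib in (256, 1024, 4096, 16384):
            subblock_kib = block_kib // 32
            wasted_kib = max(subblock_kib - CAP_KIB, 0)
            print("block %5d KiB -> subblock %3d KiB, ~%3d KiB unusable by a capped structure"
                  % (block_kib, subblock_kib, wasted_kib))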

    yuri

  • db808
    87 Posts

    Re: Relationship between GPFS storage pool blocksize and file system-level block size (data or metadata)

    2013-10-08T19:53:33Z
    • yuri
    • 2013-10-03T23:05:08Z

    Hi Yuri,

    Thank you for your quick reply. 

    I understood that GPFS now allows a separate block size for data and metadata, but when I saw the addition of a blocksize parameter in the "%pool" stanza used by several updated GPFS 3.5.x commands, such as mmadddisk, I was hoping that new functionality was being added to enable a per-storage-pool block size.

    From your comment, the answer is apparently no (for now).

    My immediate interest was using a pool-based blocksize to allow a reduction in the metadata blocksize.  Of course, this assumes that metadata-only NSDs could exist in a storage pool other than the System storage pool, which is probably not allowed either.

    We are one of the early adopters of large GPFS blocksizes for data, having started using a 4 MB block size over 4 years ago ... before the impact of the ultra-large blocksize on metadata was well advertised.  These early 4 MB block size GPFS file systems were truly "large file" file systems, with average file sizes measured in gigabytes.  With so few files and directories, the fact that we were wasting metadata space on directories was not significant.

    Well, a small number of these earlier systems were deployed with much smaller file workloads and very "tall" directory structures, with the average number of files per directory being 12 or less.  Thus, we have millions of directories, where each directory occupies 1 fragment, or 1/32 of 4 MB = 128 KB.  Of this 128 KB, I have since learned that GPFS will only use a maximum of 32 KB for directory entries.  So for the few larger directories we have, we have also allocated 4 times more directory metadata than we needed.

    With disk-based metadata, especially if it is well buffered, it is still effectively a non-issue ... assuming the performance is sufficient.  However, as our GPFS workload profiles get broader, and we start deploying systems with smaller files, the attractiveness of using SSDs for metadata also increases ... if metadata is not wastefully allocated.

    In one of our systems with mirrored metadata and 6+ million directories, using a 4 MB blocksize that pre-dates GPFS's support for separate metadata block sizes, we have over 2.5 TB of metadata.  If we could use a 128 KB metadata block size, with the resulting 4 KB directory size, the total metadata would be less than 180 GB and would fit on a single mirrored pair of 200 GB SSDs.  The cost of a single pair of 200 GB SSDs is much lower than that of the 14 x 400 GB SSDs needed to mirror 2.5 TB in the larger case.
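
    (Roughly how I arrive at those figures; the directory count is a round number, and this only covers the directory blocks themselves, so inodes, indirect blocks, etc. come on top:)

        KIB = 1024
        n_dirs = 6 * 10**6                # round number of directories
        replicas = 2                      # mirrored metadata
        for meta_block_kib in (4096, 128):
            fragment_kib = meta_block_kib // 32   # one fragment per small directory
            total_bytes = n_dirs * fragment_kib * KIB * replicas
            print("metadata block %4d KiB: %3d KiB per directory, ~%.0f GiB of directory blocks"
                  % (meta_block_kib, fragment_kib, total_bytes / float(KIB**3)))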

    So, for now, since we cannot afford the outage needed to rebuild the file system, we live within the performance envelope that we have.

    All our new deployments are using a metadata blocksize of 1 MB or less, with 4-8 MB data block sizes.  Our "tall" file systems typically use a 256 KB metadata block size (as you suggest), resulting in an 8 KB directory block that handles ~100 entries with long 40-50 character file names.  We also have some "flat" file systems, with only a few massive directories holding several hundred thousand files each ... for these we want the full-size 32 KB directory block that comes with a 1 MB metadata blocksize.

    We would embrace a technique that would allow us to "trim" the metadata on a few of our older file systems that are running with 4 MB metadata block sizes, and accomplish this "trim" without a file system rebuild.  This would be valuable even if it were only equivalent to a relatively large 1 MB metadata blocksize.  That would save us 96 KB per directory across millions of directories.

    Thanks for your help.

    Dave B

     

  • dlmcnabb
    1012 Posts

    Re: Relationship between GPFS storage pool blocksize and file system-level block size (data or metadata)

    2013-10-10T16:15:01Z
    • db808
    • 2013-10-08T19:53:33Z

    If you upgrade to 3.5, you can start using the "data-in-inode" feature.  This also includes putting small directories directly in the inode.  A 512-byte directory inode can hold roughly 11 entries if the entry names are of normal length.  So many of your small directories may fit.  But you have to recreate each directory to get them packed.  There is no utility to do this; you would have to write something to carefully crawl the filesystem, recreating small directories.
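
    A very rough sketch of what such a crawl could look like (untested outline only; it assumes the tree is quiescent, needs root to restore ownership, makes no attempt to preserve ACLs or extended attributes, and the mount point and 11-entry threshold are placeholders):

        import os
        import stat

        SMALL = 11              # approx. entries that fit in a 512-byte inode
        ROOT = "/gpfs/fs0"      # placeholder mount point

        def repack(path):
            # Recreate a small directory so its entries end up in a freshly
            # created directory that is eligible for data-in-inode.
            names = os.listdir(path)
            if len(names) > SMALL:
                return
            st = os.lstat(path)
            tmp = path + ".repack-tmp"
            os.mkdir(tmp)
            os.chmod(tmp, stat.S_IMODE(st.st_mode))
            os.chown(tmp, st.st_uid, st.st_gid)           # needs root
            for name in names:
                os.rename(os.path.join(path, name), os.path.join(tmp, name))
            os.rmdir(path)                                # old directory is now empty
            os.rename(tmp, path)
            os.utime(path, (st.st_atime, st.st_mtime))    # best-effort timestamps

        # Walk bottom-up so children are repacked before their parents are rebuilt.
        for parent, dirs, _files in os.walk(ROOT, topdown=False):
            for d in dirs:
                repack(os.path.join(parent, d))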