14 replies. Latest post: 2013-02-01T17:40:42Z by db808
sdesmet (7 Posts)

Pinned topic: mmcrfs and metadata blocksize

2012-01-04T09:04:30Z
Hi,

In the man page of mmcrfs there is an extra option:
--metadata-block-size MetadataBlockSize
Specifies the block size for the system storage pool, provided its usage is set to metadataOnly. Valid values are the same as those listed for -B BlockSize in Options.

But this parameter isn't listed in mmcrfs --help.

Is this a supported parameter? It could be very useful: at the moment we run out of metadata space when we create file systems with large block sizes that contain a large number of directories (>10M directories).
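
(For reference, something like the following is how we watch the metadata side; "gpfs1" is just a placeholder for the file system device:)

    mmdf gpfs1     # per-disk and per-pool free space, plus the inode usage summary
    mmlsfs gpfs1   # file system attributes, including the block size (-B)
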
Updated on 2013-02-01T17:40:42Z by db808
  • sxiao (28 Posts)
    Re: mmcrfs and metadata blocksize
    2012-01-04T15:57:55Z, in response to sdesmet
    This is not a generally supported parameter at this time. It is for IBM internal use only for now.
    • HajoEhlers (234 Posts)
      Re: mmcrfs and metadata blocksize
      2012-01-09T17:36:16Z, in response to sxiao
      > this is not a generally supported parameter at this time. It is for IBM internal use only for now.

      Could you give information about when this parameter will be supported? We are going to create new GPFS file systems based on GPFS v3.4.0.9 with dedicated metadataOnly (system) disks in the very near future, and I think using this feature would make sense.

      tia
      Hajo
      • SystemAdmin (2092 Posts)
        Re: mmcrfs and metadata blocksize
        2012-01-09T19:40:38Z, in response to HajoEhlers
        As you may have seen elsewhere in this forum...

        IBM policy specifically prohibits IBM employees from discussing future product plans in any public forum, in any shape or form. One can argue about the merits of such a policy in the age of Twitter, but it is what it is. You'd need to contact an IBM representative and sign an NDA to discuss future product plans.

        Using different block sizes for data and metadata is officially unsupported outside of GNR installs, at this time, because it hasn't been thoroughly tested. It may or may not work well in a given environment. As with any unsupported feature, it's up to you whether the perceived value of the feature is worth losing official support.

        yuri
  • sxiao (28 Posts)
    Re: mmcrfs and metadata blocksize
    2012-01-04T19:12:15Z, in response to sdesmet
    It looks like the man page for mmcrfs was updated with the release of GPFS Native RAID.

    Here is a link to the documentation for mmcrfs in the GPFS Native RAID Administration and Programming Reference.
    • SystemAdmin (2092 Posts)
      Re: mmcrfs and metadata blocksize
      2012-01-04T19:27:35Z, in response to sxiao
      Right. Native RAID supports block sizes of 16 MB, which kind of forces the issue...

      But I don't think you have to deploy the native raid feature to use the metadata-block-size parameter.

      Having a relatively small metadata-block-size might be a good idea, if you've decided to segregate metadata onto its own set of LUNs.

      For example, maybe you're going to put all your metadata on SSDs...
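
        To make that concrete, here is the rough shape of what I mean. The NSD names, device name, and sizes are made up, and the stanza lines use the newer (3.5-style) syntax, so double-check them against the mmcrfs/mmcrnsd docs for your release:

          # nsd_stanzas.txt (excerpt): SSDs carry metadata only, data lives in its own pool
          %nsd: nsd=ssd_meta01  usage=metadataOnly failureGroup=1 pool=system
          %nsd: nsd=sata_data01 usage=dataOnly     failureGroup=2 pool=data1

          # create the file system: big blocks for data, much smaller blocks for metadata
          mmcrfs gpfs1 -F nsd_stanzas.txt -B 4M --metadata-block-size 1M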
      • SystemAdmin (2092 Posts)
        Re: mmcrfs and metadata blocksize
        2012-01-09T19:27:56Z, in response to SystemAdmin
        I don't know the official answer, but if your version of GPFS accepts the new parameter, and it is included in the official docs for your version, go-for-it!

        It seems to me that you don't need to deploy the "Native RAID" feature to use this parameter -- even if you have the Native RAID feature, you would likely be putting the system pool metadata on a set of disks that is not GPFS Native RAID... which kind of proves to me that you should not need Native RAID to use --metadata-block-size.
        • SystemAdmin (2092 Posts)
          Re: mmcrfs and metadata blocksize
          2012-01-09T23:33:11Z, in response to SystemAdmin
          That said, also see Yuri's warning!

          Still, if you have a paid-up license for GPFS Native RAID (GNR) and you're running the code on a GNR-supported OS (AIX for now), you should be good to go. (I'd assume this is true even when you have zero GNR volumes configured?!) Otherwise, read Yuri's warning again...
          • SystemAdmin (2092 Posts)
            Re: mmcrfs and metadata blocksize
            2012-07-20T15:43:19Z, in response to SystemAdmin
            Hi. I didn't want to start a new thread for the same topic, but is the --metadata-block-size feature officially supported in GPFS v3.5.0.2 for non-Native RAID setups yet?

            I don't see anything in the man page anymore about it being GNR only, so I figured I would ask.
            • SystemAdmin (2092 Posts)
              Re: mmcrfs and metadata blocksize
              2012-07-20T16:40:11Z, in response to SystemAdmin
              Read the official current GPFS docs from an IBM publication website.
              Then re-read the above entries.

              I think there is a loophole -- if Native RAID and its associated options are supported for your GPFS release and OS, then it is "legal" to run it with zero (no) Native RAID LUNs configured... that ought to be a supported special-case configuration.
            • dlmcnabb (994 Posts)
              Re: mmcrfs and metadata blocksize
              2012-07-20T22:40:18Z, in response to SystemAdmin
              The metadata block size option was designed because, with data block sizes larger than 1 MB, metadata was wasting space: directories and indirect blocks can use at most 32 KB, but the smallest allocation unit is a subblock (1/32 of a full block), which grows past 32 KB at those larger block sizes. It is only coincidentally related to GNR, because GNR was being targeted at sites with much larger data block sizes.
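
              To put rough numbers on that 1/32 relationship (simple arithmetic based on the 32K limit above):

                  data block size   subblock (1/32)   usable by a directory block   wasted
                  1M                32K               up to 32K                     ~0
                  4M                128K              up to 32K                     ~96K
                  16M               512K              up to 32K                     ~480K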
              • mduff (31 Posts)
                Re: mmcrfs and metadata blocksize
                2013-02-01T01:19:11Z, in response to dlmcnabb
                Are there recommended guidelines for using the --metadata-block-size option? Is the gain only in space, or is it possible to see a performance gain by tuning the metadata block size?

                As a sizing guideline, for example, would a metadata block size of 32K (or possibly 16K) be a good choice for any data block size over 1 MB?

                Thank you
                • dlmcnabb (994 Posts)
                  Re: mmcrfs and metadata blocksize
                  2013-02-01T03:00:22Z, in response to mduff
                  Use 256K. Anything smaller makes allocation blocks for the inode file inefficient. Anything larger wastes space for directories. These are the two largest consumers of metadata space.
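
                  (To put numbers on that: a 256K metadata block size gives 256K / 32 = 8K subblocks, so a small directory costs roughly 8K instead of the 128K it would cost if metadata shared a 4M data block size.)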
                  • mduff (31 Posts)
                    Re: mmcrfs and metadata blocksize
                    2013-02-01T07:31:46Z, in response to dlmcnabb
                    Thank you!
                    • db808 (34 Posts)
                      Re: mmcrfs and metadata blocksize
                      2013-02-01T17:40:42Z, in response to mduff
                      Hello all,

                      If you have the luxury of creating a new GPFS 3.5.x file system, I would suggest that you also review the other GPFS metadata-related enhancements.

                      On the metadata block size topic, one purpose of the new parameter is to undo some of the negative artifacts that occur when you increase the GPFS (data) block size, and the metadata is forced to use the same value.

                      We have been running for 3 years with a GPFS block size of 4 MB on our large-file-optimized storage systems. This is great for bandwidth efficiency (and it cuts block-level metadata overhead to 1/4 of what it is with the default 1 MB GPFS block size), but the minimum storage allocation unit (called a segment) is 1/32 of the block size. For a 4 MB block size, the segment size is 128 KB ... and it is shared with the metadata-only volumes.

                      A segment size of 128 KB is very wasteful on the metadata side. If you create a directory with only a few file names in it, for example, GPFS will still allocate 128 KB for that directory. If you have a "tall" directory structure you end up consuming more metadata space, and the space usage is less dense. If you wanted to use SSDs for metadata, you would be putting a lot of empty space on the SSDs.
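
                      Some rough numbers, using the >10M-directory figure from the original post and assuming mostly small directories: 10M directories x 128 KB is roughly 1.2 TB of metadata spent on directory blocks alone, versus roughly 0.3 TB if the segment size were 32 KB.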

                      The new --metadata-block-size parameter allows you to keep a smaller block size for metadata (such as the 1 MB default) while still increasing the data block size for better bandwidth efficiency.

                      Personally, I like a metadata block size of 1 MB. This results in a 1 MB / 32 = 32 KB segment size, which is often the hardware stripe segment size and also often the storage controller's cache "page" size. An IO of this size or less will likely engage only a single disk and a single cache "page". Also, the long-time GPFS default has been a 1 MB block size, so a 1 MB metadata block size puts you back at the metadata performance levels that many GPFS users are indirectly familiar with.

                      Other significant metadata-related enhancements in GPFS 3.5.x are the striped metadata journal (across all metadata NSDs), which helps avoid metadata hot spots, and the increased inode sizes of 1K, 2K, and 8K. We have found that with an 8 KB inode size, over 90% of our directories fit inside the enlarged inode, significantly reducing the amount of metadata IO needed to read a directory. So, if you were using a 4 MB block size under GPFS 3.4.x, you consumed a minimum of 128 KB of metadata space and 2 IOs to read a small directory. Under GPFS 3.5.x, with a 4 MB data block size, a 1 MB metadata block size, and an 8 KB inode size, you consume only 8 KB of metadata space and use at most 1 IO to read the whole directory. GPFS also has some patented inode prefetch algorithms that may mask the IO needed to read the inode.
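
                      As rough arithmetic on why so many directories fit (assuming ~20-character file names for illustration, the 40-60 bytes of per-entry overhead mentioned below, and ignoring the inode header): 8 KB / (20 + 60) bytes is about 100 entries, so a directory with up to roughly a hundred short names can live entirely inside its inode.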

                      If you are heavily caching inodes in GPFS, there is currently little documentation that identifies how the GPFS buffer pool changes when you increase the inode size. If the whole larger inode is cached, the memory required for the inode buffer pool will increase (visible in some of the mmdiag and mmfsadm dump statistics). This increased inode pool memory usage will displace a trivial amount of "data" buffer space. There were some internal GPFS limits on types of object pools as these pools neared 2GB. You may need to reduce the number of cached inodes by the same factor that you increased the inode size.

                      A useful GPFS tool would be one that computed how much metadata space is allocated but unused because each metadata structure is rounded up to a full segment. A useful metric that I use is (directory size) / (number of file names). In a densely packed directory, the ratio should approach the average file name length plus 40-60 bytes. For small directories you will find it much larger ... due to the padding of the directory to an integer number of metadata segment-size chunks.
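
                      A quick-and-dirty way to compute that ratio for a single directory (the path is hypothetical, and the stat/ls options shown are the GNU/Linux ones):

                          dir=/gpfs/fs1/some/dir                 # hypothetical path
                          size=$(stat -c %s "$dir")              # the directory object's own size in bytes
                          count=$(ls -f "$dir" | wc -l)          # entry count (includes . and ..)
                          echo "$(( size / count )) bytes per entry"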

                      Counterpoint ... if you have very "flat" directory structures, with hundreds of thousands of files in a single folder, a larger metadata block would allow these multi-megabyte directories to be read with fewer IOs.

                      The key point is that in GPFS 3.5.x ... you don't need to accept a compromise. You can set the data and metadata block sizes independently, along with the inode size.
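
                      As a concrete (hypothetical) sketch of the command line this enables -- the device and stanza file names are placeholders, and you should confirm the accepted sizes in the mmcrfs docs for your release:

                          # 4M data blocks, 1M metadata blocks, enlarged inodes, all chosen independently
                          mmcrfs gpfs1 -F nsd_stanzas.txt -B 4M --metadata-block-size 1M -i 4096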

                      Hope this helps!

                      Dave B