• 2 replies
  • Latest Post - 2013-08-29T21:09:28Z by db808

Pinned topic Metadata NSD considerations while using Building Blocks

2013-06-11T17:27:25Z | gpfs lun metadata nsd replication ssd

Hi all,

We are building a new GPFS cluster as an evolution of our old one. The new cluster will consist of 2 building blocks (later it will probably be expanded to 4). Each BB is composed of:

  • 2x NSD servers (4 servers total)
  • 2x IBM DS3512 storage systems (4 total)
  • 48x 3 TB disks per storage system (96 disks per BB, 192 disks, ~430 TB total)
  • 4x SAS cables, connecting the two servers to the two controllers (within the BB).

The building blocks are independent one from the other (servers on BB1 cannot see disks on BB2) and all the servers are connected to a dedicated infiniband network.

In our experience with our current GPFS cluster (2 BBs, each consisting of 2 servers and 1 IBM DS3512 storage system with 36 disks => 4 servers, 72 disks, 150 TB total), we have had some problems with metadata access performance. We reached a point where we had 29 million files in our filesystem, 23 million of which were 4 KB or smaller, and some processes were accessing a significant fraction of those.

Our old filesystem is composed of 6 NSDs. Each NSD was defined as a RAID5 LUN using 8+1 disks. Data and metadata were mixed in the NSDs. The controller cache (2 GB per controller) was enabled for all the LUNs, but, as the NSDs contained both data and metadata, the cache was totally overwhelmed by data, rendering the metadata caching almost useless.

Willing to learn from our past mistakes, we are planning our new filesystem as follows:

  • 20x RAID5 (8+1) groups.
  • Each RAID group contains a big (22 TB) LUN for data and a small (50 GB) LUN for metadata.
  • The controller's cache is enabled only for the small LUNs.
  • Data and metadata are split into different NSDs at the filesystem level.
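
A sketch of what this split might look like as GPFS NSD stanzas (GPFS 3.5 stanza format; the NSD names, device paths, server names, and failure-group numbers here are invented for illustration):

```
%nsd: nsd=meta_bb1_rg01  device=/dev/mapper/rg01_meta  servers=nsd01,nsd02  usage=metadataOnly  failureGroup=101  pool=system
%nsd: nsd=data_bb1_rg01  device=/dev/mapper/rg01_data  servers=nsd01,nsd02  usage=dataOnly      failureGroup=101  pool=data
```

One pair like this per RAID group, fed to mmcrnsd -F; the usage=metadataOnly / usage=dataOnly attributes are what enforce the data/metadata separation at the filesystem level.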

We are aware that carving multiple LUNs out of a single RAID group creates a lot of contention, but this way we can dedicate the controller cache (2 GB x 2 controllers x 4 = 16 GB) exclusively to metadata, while metadata IO still has access to all 180 disk heads.

Is this a good idea? Should we dismiss it, ignore/disable the controller's cache, and mix data and metadata together in a single LUN/NSD? Or should we go for the caching, but use metadata-exclusive RAID groups, even if that decreases the spindle count from 180 to, say, 9?
And what about snapshots or MD replication in any of these scenarios? (we are too poor to afford a proper backup of our data)

We were also thinking about buying some SSD disks for each server and using them as NSDs for metadata with replication among different servers. But we have some concerns about metadata integrity and its behaviour with building blocks:
We want the metadata disks in the 2 servers of BB1 to be identical and to replicate each other. We cannot allow metadata from BB1 to be replicated on BB2's disks, because if a server from BB1 failed and a server from BB2 failed too, all data disks would still be accessible, but the metadata would be compromised.
Is this true or is there a way to enable explicit replication from NSDx to NSDy ?

Thanks in advance,


  • sdenham

    Re: Metadata NSD considerations while using Building Blocks


    Your scheme to divide each RAID group seems like an interesting idea - a trade-off between some possible contention at the RAID-group level and the benefit of devoting the cache exclusively to metadata. I suspect the outcome depends on the workload - if the 16 GB of cache effectively holds most of the active metadata, this could be a win.

    I've implemented an SSD scheme similar to what you describe, only using dedicated, SSD-rich NSD servers exclusively for the metadata.  This resulted in a substantial reduction of service times on metadata transactions over the previous scheme, and works reasonably well.  In the event of an outage of one of these NSD servers, however, half the metadata-only NSDs become unavailable and have to re-sync when the NSD server is restored, and filesystem performance is pretty limited while this is going on.  A scheme that allows these metadata SSD NSDs to be twin-tailed would be much better.

  • db808

    Re: Metadata NSD considerations while using Building Blocks



    First, I highly recommend separating data and metadata into different LUNs.  I like having a 1-to-1 mapping, with 1 metadata LUN for each data LUN.  

    This is true even if you end up co-locating the metadata and data LUNs on the same storage RAID group (which we do).  If you intermingle data and metadata in the same LUNs, you can never separate them later.  If they are separate, you get much better performance reporting, since you can use simple Linux iostat to show the disk IO statistics for the data LUNs separately from the metadata LUNs.  You may need a Perl script to post-process the output of iostat to group things together, but that is not difficult.
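
    A sketch of that post-processing (in Python rather than Perl; the device-name prefixes and column positions are assumptions, so adjust them to your own LUN naming scheme and iostat version):

```python
from collections import defaultdict

def summarize_iostat(lines):
    """Sum per-device r/s and w/s from `iostat -x` device lines,
    grouped into data vs. metadata LUN classes by name prefix."""
    totals = defaultdict(lambda: [0.0, 0.0])  # class -> [reads/s, writes/s]
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        dev = fields[0]
        # assumed naming convention: meta* and data* LUNs
        if dev.startswith("meta"):
            cls = "metadata"
        elif dev.startswith("data"):
            cls = "data"
        else:
            continue
        # assumed column layout: Device r/s w/s ... (varies by iostat version)
        totals[cls][0] += float(fields[1])
        totals[cls][1] += float(fields[2])
    return dict(totals)

sample = [
    "meta01  120.0  35.0    1.2    0.4",
    "data01    8.0  60.0  900.0 4000.0",
    "meta02  100.0  25.0    1.0    0.3",
]
print(summarize_iostat(sample))
```

    Run against real `iostat -x` output, this immediately shows whether metadata IOPS or data bandwidth dominates a given workload.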

    Using separate data and metadata also allows you to tailor each, as GPFS now has separate data and metadata block sizes.

    Handling small files is challenging, as it ultimately turns into an IOPS exercise.  The first thing that I thought about when I read your posting was the new GPFS 3.5.x large-inode metadata feature.  With an inode size of 4K, you can fit about 3800 bytes of file data in the inode itself.  This serves two purposes.  First, it completely eliminates the added IO to fetch the data block, and second, it co-locates the small file's data with its metadata ... so if you deploy your metadata on higher-IOPS storage, you end up also accelerating the small files.

    The large inodes can also be used to hold about 3800 bytes of directory entries for small directories.  Again, this saves an IO.  With filenames averaging 20 characters, the 4K inode handles ~118 file names.
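
    Those capacity figures are easy to sanity-check with back-of-envelope arithmetic (the ~12 bytes of per-entry overhead below is an assumption; the real GPFS directory-entry layout may differ slightly):

```python
# ~3800 bytes of a 4K inode are usable for in-inode file data or
# directory entries; the rest is fixed inode fields.
USABLE = 3800

def names_in_inode(name_len, overhead=12):
    """Directory entries of a given filename length that fit in one inode,
    assuming name_len + ~12 bytes of bookkeeping per entry."""
    return USABLE // (name_len + overhead)

print(names_in_inode(20))   # ~118, matching the figure quoted above
```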

    As to your storage system, you are running some low-end models with the DS35xx.  Do you have the "Turbo" mode license key installed?  If not, your performance will be severely limited; I don't think that non-turbo mode will handle 48 disks well.  You are also using nearline SAS disks ... which is OK ... but you must realize that they are only good for about 75 IOPS each.

    I suggest that you should experiment with the storage cache settings.  I would NOT recommend disabling write caching, especially when using R5 or R6.  With write cache disabled, you turn those data writes into synchronous writes, and prevent any coalescing.

    We have many DCS3700s, with the base controller, which is equivalent to a DS3500 with Turbo mode.  We are quite pleased with the performance levels ... given the cost of the system, and the limitations of the 7200 RPM nearline disks.

    You should enable read caching, but NOT read-ahead cache on the storage controller. GPFS will do a good job with read-ahead, even if the blocks are not contiguous.

    I have found that checking the GPFS IO history when you are running a typical workload (mmdiag --iohist) to be very useful.  By default, GPFS saves the last 512 IO requests, and you can increase the number if you need to.

    Looking at the IO history, you will see all the logical GPFS IOs.  In your case, you will likely see that the "data" IOs are probably a minority.  There will be much metadata IO activity as GPFS is opening/closing files, updating inodes, flushing metadata changes to the metadata log file, etc.

    You will also get a very different IO history when you capture it from the viewpoint of the client GPFS node than from the NSD server itself.

    If you are concerned about metadata performance, then try not to do metadata IO: cache many inodes on the client.  Be aware of the atime and mtime update overheads.  Also ... although it is not well documented, GPFS does some degree of double-buffering ... buffering on the GPFS client node AND buffering on the NSD server node.  From the client GPFS node, any IO in the IO history that takes just a few milliseconds was buffered on the NSD server.

    In general, the best case for a small file read would be a directory entry access, an inode access to open the file, a data access, and an inode update for the timestamps.  That is potentially 3 metadata IOs and one data IO ... at 1/75 of a second each.  Now, what is the likelihood that any one of those IOs was cached?  It depends on whether the directory was already read, or the file was recently opened.  Using large inodes can reduce the number of IOs needed, especially for small files.  For writes of small files, you will see more IO steps, including some metadata journaling.
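
    That arithmetic can be written down as a tiny latency model (a rough estimate only: it assumes ~75 IOPS per nearline disk, treats caching as a flat hit ratio, and ignores queueing; read_latency_ms is a made-up helper, not a GPFS tool):

```python
# ~75 random IOPS per 7200 RPM nearline disk => ~13.3 ms per uncached IO
PER_IO_S = 1 / 75.0

def read_latency_ms(metadata_ios, data_ios, cache_hit_ratio=0.0):
    """Estimated wall time for one small-file read, in milliseconds."""
    uncached_ios = (metadata_ios + data_ios) * (1.0 - cache_hit_ratio)
    return uncached_ios * PER_IO_S * 1000

print(round(read_latency_ms(3, 1), 1))  # 3 metadata IOs + 1 data IO, all uncached
print(round(read_latency_ms(3, 0), 1))  # data served from the inode (large inodes)
```

    Even shaving one IO per file (the large-inode win) cuts a quarter off the uncached single-thread latency.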

    You can also look at the disk IO statistics shown by mmfsadm dump iocounters.  It will show you a count of all the different IO types, and even histograms of the IO sizes.  You can dump the iocounters before and after a workload, subtract, and identify what was done for that workload.
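
    The before/after subtraction is trivial once the counters are parsed into dictionaries (a sketch; the counter names below are illustrative, not the exact labels printed by mmfsadm):

```python
def diff_counters(before, after):
    """Per-counter delta between two snapshots taken around a workload."""
    return {name: after.get(name, 0) - before.get(name, 0) for name in after}

# hypothetical snapshots parsed from "mmfsadm dump iocounters" output
before = {"inode reads": 1000, "data reads": 500, "log writes": 200}
after  = {"inode reads": 1420, "data reads": 600, "log writes": 260}
print(diff_counters(before, after))
```

    The deltas tell you exactly what IO mix a workload generated, which is what you want before deciding where the SSDs or cache should go.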

    Working with small files, there will be significant latency, since there is not much "work" to engage GPFS's read-ahead and write-behind.  You can do many small-file operations in parallel, but the performance of a single thread will be limited by the underlying storage latency for small IOs ... for the IOs that cannot be cached.

    I'm working on a system with medium-size files (a few MB), each accompanied by 10-20 small XML-like description files.  Fortunately, the different types of files have different file name extensions that I can use to trigger GPFS placement policies.  I have 3 storage pools.  The first pool is for metadata, including all directories and files less than 3800 bytes in size (by using 4K inodes).  The second pool is for small files greater than 3800 bytes that have known file extensions.  The third pool is for the multi-MB files, which are also known by their file name extensions.

    Using large inodes, directories smaller than 3800 bytes (about 118 files @ 20 chars) live within the inode.  How many of your directories fit in 3800 bytes?  You can "dial in" the default directory size, since it is equal to one metadata fragment, or 1/32 of the metadata block size.  A metadata block size of 256 KB yields a fragment of 8 KB ... and a default directory size of 8 KB ... all the way to a metadata block size of 1 MB yielding a default directory size of 32 KB.  If you have directories with tens of thousands of files, then you want to use a metadata block size of 1 MB.  Note, GPFS will only use the first 32 KB of the fragment for directory information, so using a metadata block size greater than 1 MB is wasteful.
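
    The fragment arithmetic, with the 32 KB directory cap described above, in a couple of lines:

```python
# default directory size = one metadata fragment = metadata block size / 32,
# of which GPFS uses at most the first 32 KB for directory information
def default_dir_size(metadata_block_size):
    return min(metadata_block_size // 32, 32 * 1024)

for bs_kb in (256, 512, 1024, 4096):
    print(bs_kb, "KB block ->", default_dir_size(bs_kb * 1024) // 1024, "KB directory")
```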

    Metadata, including a replica copy, is very small ... about 0.1% ... and is allocated to multiple equal-size LUNs on portions of RAID1 SSDs.  We determined that the "working set" of small files is about 1 TB, which is in a second storage pool, made of equal-size LUNs on portions of the same RAID1 SSDs used for the metadata.  By using multiple LUNs for both the metadata and small-file pools, GPFS can enqueue many IOs per LUN to drive the SSDs very hard.  The third storage pool is for the medium-to-large files and is on 10-disk 8+2 RAID6.  We use either a 4 MB or 8 MB GPFS data block size, with a 4 MB hardware stripe (512 KB segment size).  If your files, not counting the "small" ones, are less than 2 MB on average, you should use a 2 MB GPFS block size with a 2 MB hardware stripe.

    With 4K inodes and striped metadata logs enabled, we use GPFS placement policies.

    Metadata is in the system pool, and has SSD-based LUNs.

    All files are allocated first to the SSD small-file pool, unless they have a "large file" extension.  When the small-file pool reaches 80-ish percent full, a GPFS callback is generated to sweep the least-used files to the disk-based pool.  SSDs are most beneficial for metadata and small files, and this kind of policy keeps the "warmest" small files on the SSDs.  Large files can be read quickly using large GPFS blocks (~610 MB/s at an 8 MB block size from a single 10-disk RAID group), so they don't need to go on SSDs.
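
    A sketch of what such rules might look like in the GPFS policy language (the pool names, extension, and thresholds below are invented for illustration, not our production rules):

```
/* large files go straight to disk; everything else lands on SSD */
RULE 'large'   SET POOL 'diskpool' WHERE LOWER(NAME) LIKE '%.bin'
RULE 'default' SET POOL 'ssdpool'

/* applied by mmapplypolicy (e.g. from a low-space callback):
   drain ssdpool from 80% down to 60%, least-recently-used files first */
RULE 'sweep' MIGRATE FROM POOL 'ssdpool' THRESHOLD(80,60)
  WEIGHT(CURRENT_TIMESTAMP - ACCESS_TIME) TO POOL 'diskpool'
```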

    You also might want to run the "filehist" script, found in the GPFS samples/debugtools directory.  It reports a histogram of file sizes for the file system.  We have often been surprised by the file-size skew.  "Averages lie."  In one case, the "average" file size was 5.5 MB, but 26% of all the files were less than 128 KB and accounted for 0.6% of the data space; 24% of the files were 129-2304 KB in size and accounted for 4.2% of the space.  Only 9% of the files were 4 MB or larger, but accounted for 70% of all the space.  If you were unaware of a file-size skew like this, your storage design could be less than optimal.
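
    The "averages lie" point is easy to demonstrate with a made-up distribution (the numbers below are invented, not our measured histogram):

```python
# 90% tiny files, 10% medium files: the mean says "half an MB per file",
# but almost all of the space sits in the 10% of medium files.
sizes_kb = [4] * 900 + [5000] * 100

mean_kb = sum(sizes_kb) / len(sizes_kb)
tiny = [s for s in sizes_kb if s <= 128]
tiny_file_pct = 100 * len(tiny) / len(sizes_kb)
tiny_space_pct = 100 * sum(tiny) / sum(sizes_kb)

print(mean_kb, "KB mean;", tiny_file_pct, "% of files hold",
      round(tiny_space_pct, 2), "% of the space")
```

    Sizing a system for the "average" file here would optimize for files that barely exist in either count-weighted or space-weighted terms.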

    Dave B