Topic
9 replies. Latest post 2014-05-27T22:29:48Z by dlmcnabb
mduff (35 Posts)

Pinned topic: Cluster vs Scatter performance on SATA filesystems

2013-03-22T05:20:25Z
Hello,

With respect to choosing either cluster or scatter, I do know that the default is cluster when there are <= 8 nodes and <= 8 LUNs. I'm also aware of the possibility of creating "hot spots" when using cluster on a larger file system.

We recently found that changing from scatter to cluster on a file system with 32 nodes total, 8 of which are NSD servers, and 7 SATA LUNs gave us a jump in performance of over 2X. The drives have a rather high seek time compared to SAS, but does it seem reasonable to see this big an improvement just from changing from cluster to scatter?

At what point does the "hot spot" issue factor in? For example, if we had 12 LUNs instead of 7, would we need to worry about hot spots?

Thank you
Updated on 2013-03-22T19:15:20Z by SystemAdmin
  • SystemAdmin (2092 Posts)

    Re: Cluster vs Scatter performance on SATA filesystems

    2013-03-22T14:50:03Z  in response to mduff
    > mduff wrote:
    > Hello,
    >
    > With respect to choosing either cluster or scatter, I do know that the default is cluster when there are <= 8 nodes and <= 8 LUNs. I'm also aware of the possibility of creating "hot spots" when using cluster on a larger file system.
    >
    > We recently found that changing from scatter to cluster on a file system with 32 nodes total, 8 of those are NSD servers, with
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    Do you mean the opposite, changing from cluster to scatter?
    > 7 SATA LUNs we saw a jump in performance of over 2X. The drives have a rather high seek time compared to SAS, but does it seem reasonable that we see this big of an improvement when just changing from cluster to scatter?

    I recently did 54 filesystem benchmarks (block sizes from 16K through 4MB, both cluster & scatter, different numbers of NSDs) in order to select 'better' settings for a new filesystem.

    Unfortunately, I wasn't able to duplicate our full environment (23 LUNs, a mixture of SAS and SATA, ~50 nodes, 3 NSD servers) while doing the benchmarking. The test filesystem had only 3 LUNs, on SATA drives.

    Depending on the benchmark chosen (sequential read, write, mixed file sizes, system time vs. user time, etc.), I saw anywhere from a 2x slowdown to a 7x-12x improvement when using 'scatter' instead of 'cluster' at the same block size.

    In almost every benchmark, the 'scatter' layout performed 1.25-6X better than the 'cluster' layout. I believe the few instances where performance got worse were due to the particular interaction of blocksize and RAID stripe size.

    I'm moving data to the new filesystem now, and can report on performance when all NSDs are assigned to the new filesystem.

    Your post and my limited testing do bring up some questions for me about methods for tuning GPFS. I know there are a number of tuning guides, and they are very helpful. However, some critical parameters (cluster vs scatter, blocksize, inode size) cannot be changed once the filesystem is created. I wonder if there are any recommendations for:

    • Determining parameters for a new filesystem based on the performance and file characteristics (file sizes, directory sizes, etc., as determined from tools like filehist and tsinode) of an existing filesystem? (A rough sketch of that kind of survey follows this list.)

    • Determining parameters for a new filesystem based on performance and application benchmarks on a prototype filesystem with a different number of LUNs (NSDs) than the final filesystem? For example, it may be impractical for most sites to create a 'test' filesystem with the same number of LUNs (NSDs), same volume of data, same application I/O activity as their production filesystem in order to test different RAID groups, stripe sizes, block sizes, etc. How realistic & valid are the test results from a smaller environment that will later have many more LUNs, more NSD servers attached, etc.?
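
    Purely as a hypothetical sketch of that first kind of survey (not the actual filehist tool, just a generic shell approximation; the mount point /gpfs/oldfs is made up):

    # hypothetical sketch: bucket file sizes of an existing filesystem into powers
    # of two, to get a feel for a sensible blocksize; /gpfs/oldfs is a placeholder
    find /gpfs/oldfs -type f -printf '%s\n' 2>/dev/null | \
      awk '{ b = 4096; while ($1 > b) b *= 2; hist[b]++ }
           END { for (s in hist) printf "%12d bytes or less: %d files\n", s, hist[s] }' | sort -n

    Something like this at least tells you whether the bulk of the data sits in files far larger or far smaller than the candidate blocksizes.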

    >
    > At what point does the "hot spot" issue factor in? For example, if we had 12 LUNs instead of 7, would we need to worry about hot spots?
    >
    > Thank you
    • mduff (35 Posts)

      Re: Cluster vs Scatter performance on SATA filesystems

      2013-03-22T16:19:02Z  in response to SystemAdmin
      Thank you for that.

      I definitely meant going from scatter to cluster.
      • SystemAdmin (2092 Posts)

        Re: Cluster vs Scatter performance on SATA filesystems

        2013-03-22T19:15:20Z  in response to mduff
        It is very difficult to make sensible comments on your situation because you have not given enough information.

        • What is your filesystem blocksize?
        • What is your load: lots of small, random I/Os, or a few large ones?

        Some details don't quite add up either: 8 NSD servers but only 7 LUNs? There must be asymmetries hidden there.

        In general, if your load is optimized for high performance (large blocks, etc.) GPFS gives you what the hardware can do. If your load is an ill fit (thousands of tiny files), you'll suffer.
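
        As a crude, purely hypothetical illustration of the two kinds of load (the paths and counts are invented):

        # contrast one large streaming write with many tiny-file writes; /gpfs/fs1 is a placeholder
        mkdir -p /gpfs/fs1/tiny
        time dd if=/dev/zero of=/gpfs/fs1/stream.dat bs=1M count=10240     # one ~10 GB stream
        time sh -c 'for i in $(seq 1 10000); do
                      dd if=/dev/zero of=/gpfs/fs1/tiny/f$i bs=4k count=1 2>/dev/null
                    done'                                                  # 10,000 tiny files

        The first case mostly measures the disks' streaming rate; the second is dominated by metadata and seek overhead, which no allocation choice can hide.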

        One of the important aspects is long-term consistency. With a scatter configuration you are already working in a worst-case layout: from the point of view of a single LUN, you see mostly random I/O. So the future evolution of your filesystem, fragmentation, etc., will not have an important impact. Three years down the road you'll see the same performance.

        In your small-scale, unbalanced configuration (7 LUNs behind 8 NSD servers) you may well profit from better streaming effects due to data being clustered on the same LUN, but you'll never exceed the speed of a single LUN by much. Over time, with free space spread all over, these clustering effects will help less, and your performance will deteriorate.

        Markus
    • sabujp (12 Posts)

      Re: Cluster vs Scatter performance on SATA filesystems

      2013-10-04T20:47:55Z  in response to SystemAdmin

      Regarding cluster vs scatter: if I run tests using dd on my NSDs, e.g. for a write (using a file that's larger than memory, 128GB below, so I don't get a cached read later)

      dd if=/dev/zero of=someRandomFile-$nsd-$threadID bs=1M count=128000

      and for the read of the same file

      dd if=someRandomFile-$nsd-$threadID of=/dev/null bs=1M

      and then run several of these in parallel across all my NSDs and see that cluster allocation gives me 2-3x better aggregate bandwidth than scatter, is it safe to say that, given enough bandwidth coming into the NSDs from the clients, I'd see the same 2-3x difference in sequential I/O performance between cluster and scatter when running the same tests from GPFS clients?
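
      To make the setup concrete, the parallel run described above is roughly the following (the NSD names, thread count, and target directory are placeholders):

      # rough sketch of the parallel write pass; repeat with if=<file> of=/dev/null bs=1M for the read pass
      for nsd in nsd01 nsd02 nsd03 nsd04; do
          for threadID in 1 2 3 4; do
              dd if=/dev/zero of=/gpfs/test/someRandomFile-$nsd-$threadID bs=1M count=128000 &
          done
      done
      wait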

      Updated on 2013-10-04T20:49:16Z by sabujp
      • dlmcnabb (1012 Posts)

        Re: Cluster vs Scatter performance on SATA filesystems

        2013-10-05T05:13:25Z  in response to sabujp

        Allocation of blocks on disk is always done on the node (client) where the application is running.

        In an empty filesystem, and with a sequential, single-threaded test, cluster will always look better than scatter. With GPFS doing round-robin allocation across the disks, good performance depends on the disk head not having moved, or not having moved very far, since the last time that particular disk was accessed.

        In a large cluster with many applications running, or where there are many LUNs in a filesystem, the likelihood of the disk head being near where it was before can become exceedingly small, and clustering will not provide much advantage. Also, clustering may pick an area of the disk that falls on the slow tracks, so while the seek time is small, the time to read a set of tracks will be longer than if the cluster of allocations were all on the fast tracks of the disk.

        Over time, as files are created and deleted, the runs of contiguous free tracks become much smaller, so cluster performance times will approach scatter performance times.

        The reason scatter is recommended is that even though you will not get the best possible performance, you will get a statistically predictable performance over the life of the filesystem. The performance on the first day will be the same as the performance on the 10,000th day. You don't make promises that you cannot keep.

         

        • b.sqrd (3 Posts)

          Re: Cluster vs Scatter performance on SATA filesystems

          2014-05-24T22:53:27Z  in response to dlmcnabb

          I would like some feedback on my thoughts regarding this critical configuration parameter:

          With many applications running concurrently across our GPFS Client Grids, I would venture to believe that our workloads produce very random, large-block I/O requests to the underlying storage. I can see that the requests are fairly large within the storage (as GPFS does a good job of coalescing I/O requests). Thus I would not expect that we would, in practice, get performance benefits from a "cluster" allocation. We would only really see the performance improvement in synthetic benchmarks (e.g. IOzone) that better emulate the workload Dan McNabb describes (a single thread, or a small number of threads, streaming data off of the file system).

          Another thing to understand is that file data is striped round-robin across all NSDs. We have some file systems with 300+ NSDs in a single file system. Thus a file would have to be 4MB (the FS blocksize) x 300 = 1200MB in size to even have two file data blocks allocated on the same NSD. And many other files will have data written concurrently to the same NSDs, causing a seek away from the place where a file of this magnitude (>1.2GB) was originally written. The result, again, is a basically random, large-block I/O workload at the underlying storage.
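
          (A quick way to sanity-check those numbers on a live file system; the device name "gpfs1" is only a placeholder:)

          mmlsfs gpfs1 -B        # reports the file system blocksize (4 MB in our case)
          mmlsnsd -f gpfs1       # lists the NSDs backing that file system
          echo $((4 * 300)) MB   # file size before a second block can land on the same NSD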

          QFS and Lustre have the general concept of "stripe groups" for their "NSDs", which limit the number of NSDs that a single file is round-robin'd across. Perhaps if "stripe groups" were possible in GPFS, it might better benefit from a "cluster" block allocation within each "stripe group". It would also be nice if each GPFS storage pool could use a different allocation scheme, more suited to the workload and file data being stored.

          Very interesting to read that the client node handles the allocation of file data blocks as well. How is this coordinated with the file system manager for the file system?

          Does it allocate each client a distinct range of NSD blocks to use, so there aren't any likely conflicts with other client nodes for each open file?

          Your explanations and details will be greatly appreciated!!

          -Bryan 

          • yuri (192 Posts)

            Re: Cluster vs Scatter performance on SATA filesystems

            2014-05-27T15:54:44Z  in response to b.sqrd

            GPFS has a fully distributed block allocation mechanism.  The block allocation map is laid out in a way that allows different nodes to lock and access map chunks (regions and segments) in parallel.  The file system manager node acts as the block allocation manager and maintains a high-level view of the block allocation landscape.  Other nodes act as allocation clients and request allocation regions from the manager as needed.  Once a given node has been granted an allocation region to work with, it can do block allocation and deallocation using that region independently, i.e. without other nodes' participation.  So all nodes in the cluster can be allocating blocks simultaneously without stepping on each other's toes.  Each allocation region can be used to allocate blocks on all disks in the storage pool that it belongs to.

            yuri

            • b.sqrd (3 Posts)

              Re: Cluster vs Scatter performance on SATA filesystems

              2014-05-27T16:03:06Z  in response to yuri

              Thanks for the explanation, Yuri!

              I would still appreciate any feedback on my thoughts regarding the following:

              With many applications running concurrently across our GPFS Client Grids, I would venture to believe that our workloads produce very random, large-block I/O requests to the underlying storage. I can see that the requests are fairly large within the storage (as GPFS does a good job of coalescing I/O requests). Thus I would not expect that we would, in practice, get performance benefits from a "cluster" allocation. We would only really see the performance improvement in synthetic benchmarks (e.g. IOzone) that better emulate the workload Dan McNabb describes (a single thread, or a small number of threads, streaming data off of the file system).

              Another thing to understand is that file data is striped round-robin across all NSDs. We have some file systems with 300+ NSDs in a single file system. Thus a file would have to be 4MB (the FS blocksize) x 300 = 1200MB in size to even have two file data blocks allocated on the same NSD. And many other files will have data written concurrently to the same NSDs, causing a seek away from the place where a file of this magnitude (>1.2GB) was originally written. The result, again, is a basically random, large-block I/O workload at the underlying storage.

               

              Do you feel these statements are accurate?

              • dlmcnabb (1012 Posts)

                Re: Cluster vs Scatter performance on SATA filesystems

                2014-05-27T22:29:48Z  in response to b.sqrd

                Your description is correct. It will almost always look like random large block IO.

                That is why -j cluster is only the default in small clusters of 8 or fewer nodes and 8 or fewer LUNs.
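
                (For anyone reading along, a minimal, hypothetical example of choosing the layout explicitly at creation time; the device name and stanza file are made up, and -j cannot be changed after the filesystem exists:)

                mmcrfs gpfs1 -F disks.stanza -B 1M -j scatter   # create with scatter allocation
                mmlsfs gpfs1 -j                                 # verify the block allocation type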