Topic
8 replies Latest Post - ‏2012-10-25T10:54:40Z by HajoEhlers
SystemAdmin
SystemAdmin
2092 Posts

Pinned topic recommended block size, effect of maxblocksize on existing filesystems?

‏2012-10-19T21:33:10Z |
I'm using GPFS 3.5.0-3 under Linux and periodically experience serious performance problems. We've got 3 GPFS servers, exporting a ~33TB filesystem via NFS to ~50 compute nodes in an HPC cluster. The typical file sizes vary from <1MB to several GB, and I/O workloads are highly mixed and unpredictable. We see sporadic, extended load spikes on the GPFS/NFS servers and extremely poor performance on the NFS clients due to multiple compute jobs being in I/O-intensive phases (reading input image files, writing intermediate results, etc.) simultaneously.

Most of the 14 NSDs are 4-disk RAID5 LUNs on an 8Gb/s SAN.
As one part of trying to address this issue, I'm going to create a new filesystem with a larger block size (up from the 512 KB block size on the existing filesystem). I was planning to use 4 MB, but also see that an 8 MB block size is an option.

Question 1: Is there a recommendation for the filesystem block size for highly mixed workloads?

Question 2: Are there tools that I can use on our existing (512 KB) filesystem to determine an optimum block size, based on the existing usage patterns?

Question 3: Changing "maxblocksize" (with mmchconfig) is a prerequisite for creating the new filesystem with a larger block size. Will changing maxblocksize have any effect on the existing 512 KB filesystem being used in production?
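
For reference, the sequence I have in mind looks roughly like the following (the filesystem name, stanza file and mount point are placeholders, and my understanding is that a new maxblocksize value only takes effect once the GPFS daemons have been restarted):

# Raise the cluster-wide ceiling for filesystem block sizes
mmchconfig maxblocksize=8M

# Create the new filesystem with a 4 MB block size from a prepared NSD stanza file
mmcrfs gpfs1 -F new_nsds.stanza -B 4M -T /gpfs1
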
Updated on 2012-10-25T10:54:40Z at 2012-10-25T10:54:40Z by HajoEhlers
  • ezhong
    ezhong
    32 Posts

    Re: recommended block size, effect of maxblocksize on existing filesystems?

    ‏2012-10-20T19:50:37Z  in response to SystemAdmin
    > Question 1: Is there a recommendation for the filesystem block size for highly mixed workloads?

    It's better to use filesystems of different block sizes to meet different needs.

    > Question 2: Are there tools that I can use on our existing (512 KB) filesystem to determine an optimum block size, based on the existing usage patterns?

    Perhaps it's sufficient just to use common sense judgement.

    > Question 3: Changing "maxblocksize" (with mmchconfig) is a prerequisite for creating the new filesystem with a larger block size. Will changing maxblocksize have any effect on the existing 512 KB filesystem being used in production?

    I don't think so.
  • HajoEhlers
    HajoEhlers
    251 Posts

    Re: recommended block size, effect of maxblocksize on existing filesystems?

    ‏2012-10-23T09:28:33Z  in response to SystemAdmin
    Why do you think that a large block size will solve your problem?

    Have you thought about what it means to have 50 nodes accessing around 50 disks?
    Have you thought about your disk layout? (What is the current configuration? Do you have metadataOnly disks, yes/no? Why did you select your current hardware, and so on.)

    Have you done any measurements
    - on your disk subsystem, to determine IOPS, I/O queue depths and so on?
    - with mmpmon or mmdiag --iohist (see the sketch after this list)?
    - and, of course, have you checked your network?
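
    A minimal sketch of how such a sampling could look (the paths are the standard GPFS location; the interval and repeat count are just examples):

    # Sample per-filesystem I/O counters every 5 seconds, 12 times, in parseable form
    echo "fs_io_s" | /usr/lpp/mmfs/bin/mmpmon -p -r 12 -d 5000

    # Quick look at the recent I/O history, split into data and non-data (inode/log/metadata) I/O
    /usr/lpp/mmfs/bin/mmdiag --iohist | grep -w data
    /usr/lpp/mmfs/bin/mmdiag --iohist | grep -vw data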

    So try to find out what's the cause for your bottleneck.
    cheers
    Hajo
    • SystemAdmin
      SystemAdmin
      2092 Posts

      Re: recommended block size, effect of maxblocksize on existing filesystems?

      ‏2012-10-23T18:24:41Z  in response to HajoEhlers
      > HajoEhlers wrote:
      > Why do you think that a large block size will solve your problem ?

      Good question. I'm not sure. Changing the block size is one of several things
      I'm doing to work on improving performance.

      I plan to test a variety of block sizes, using data and processing that
      closely replicates the actual tasks in our lab.

      >
      > Have you thought about that 50 nodes accessing around 50 disks ?

      Yes. What are you suggesting I think about?

      > Have you thought about your disk layout ( What is the current configuration and aka Metadata disks only yes/no , why have you selected your current hw a.s.o )

      There are no metadataOnly disks.

      All MetaData is stored on 600GB 10K RPM SAS disks. There are 6 RAID-1 NSDs in
      the system pool.

      The bulk of the data is stored on either 1TB or 2TB 7.2K RPM SATA disks, in 4-disk RAID5
      groups. There are 11 of these dataOnly NSDs.

      The hardware selection was based on price, capacity, performance.

      >
      > Have you done any measurements
      > - on your disk subsystem to determine iops, io queues and so on ?

      A few...but absolute performance is not the problem--only performance
      under load from simultaneous use from multiple clients...and the problem
      seems to depend a lot on the type of access. We sometimes have 400 jobs
      running on 50 compute nodes with no problem, or 20 jobs running on 10
      nodes and a very high I/O wait.

      I'm planning some more thorough tests, but cannot deliberately stress the I/O
      system until after some lab deadlines (early November).

      > - with mmpmon or mmdiag --iohist ?
      > - and of course have you checked your network.

      The network hardware (SAN and Ethernet) doesn't show any errors, excessive
      retransmits, etc. However, I don't have a mechanism to measure Ethernet
      capacity, and that may be a bottleneck.
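
      One thing I will probably try in the meantime is simply watching per-interface
      throughput against the 1Gb/s line rate with sar from sysstat (a rough sketch;
      the interface name is just a placeholder):

      # Report NIC throughput (rxkB/s, txkB/s) every 5 seconds; values approaching
      # ~118000 kB/s would indicate a saturated 1Gb/s link
      sar -n DEV 5 | grep -E 'IFACE|eth0'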

      >
      > So try to find out what's the cause for your bottleneck.

      Agreed.

      In addition to benchmarking different GPFS block sizes, I'm doing the
      following:

      - Updating the Ethernet network from a flat 1Gb/s layout to a
        10Gb/s backbone between the GPFS servers and multiple switches,
        each serving a group of 8~20 compute nodes (at 1Gb/s). This will
        also give the GPFS servers a 10Gb/s path to each other.

      - Adding 2 more dataOnly NSDs (a rough sketch of the commands I have
        in mind follows this list), with the goals of:
        - moving most data off the system pool (continuing to leave a small
          amount of shared binaries, databases, and other heavily accessed,
          higher-performance data on the system disks);
        - having more NSDs for GPFS to stripe across.

      - Considering getting GPFS client licenses, so that the compute
        nodes can access the GPFS data more directly (still over Ethernet,
        but without the NFS layer).

      - Also considering getting 2 additional GPFS server licenses,
        to spread the load from the compute nodes (whether NFS or GPFS
        clients).
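
      Roughly, the NSD addition and pool setup I have in mind would look like the
      following (device, NSD, server, pool and file names are placeholders, and this
      is only a sketch of my current understanding of the commands):

      # newdisks.stanza
      %nsd: device=/dev/mapper/lun_data12
        nsd=data_nsd12
        servers=gpfs1,gpfs2,gpfs3
        usage=dataOnly
        failureGroup=12
        pool=data

      # Create the NSDs, add them to the existing filesystem, and rebalance
      mmcrnsd -F newdisks.stanza
      mmadddisk gpfs0 -F newdisks.stanza
      mmrestripefs gpfs0 -b

      # A placement policy (policy.rules) would direct new files to the 'data' pool:
      #   RULE 'default' SET POOL 'data'
      mmchpolicy gpfs0 policy.rules
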
      Thanks,

      Mark
      > cheers
      > Hajo
      • chr78
        chr78
        129 Posts

        Re: recommended block size, effect of maxblocksize on existing filesystems?

        ‏2012-10-23T21:30:47Z  in response to SystemAdmin
        Why do you NFS-export your filesystems? Native GPFS access would be much leaner on your HPC clients...
        • chr78
          chr78
          129 Posts

          Re: recommended block size, effect of maxblocksize on existing filesystems?

          ‏2012-10-24T08:30:37Z  in response to chr78
          ah, I see - sorry for my last comment - you are already thinking of getting client licenses.
      • HajoEhlers
        HajoEhlers
        251 Posts

        Re: recommended block size, effect of maxblocksize on existing filesystems?

        ‏2012-10-24T09:10:30Z  in response to SystemAdmin
        > I'm doing to work on improving performance.
        Performance can mean different things.
        For some it is moving data around at 1 TB/s; for others it is a smoothly reacting system...

        > I plan to test a variety of block sizes....
        The block size is like a truck on a highway. The larger the truck, the more it can transport per trip, but the total number of trucks and cars on the highway is limited.
        In the worst case the highway is full of trucks and no space is left for the Porsche that is delivering the paperwork. ;-)

        > There are no metadataOnly disks.
        > All MetaData is stored on 600GB 10K RPM SAS disks.
        > There are 6 RAID-1 NSDs in the system pool.

        Then I assume that you have metadataOnly and dataOnly disks; a poor man's version of what are called storage pools nowadays.
        And with 6 RAID-1 NSDs I assume only about 1500 IOPS.
        For example:
        - Our 1GbE NFS server requires about 1000 IOPS (avg) (use mmdiag --iohist to get an overview).
        With your config I would run into problems pretty fast, and you are using 3 NFS servers.

        > The hardware selection was based on price, capacity, performance.
        I assume in that order, but with one important point left out: "responsiveness".

        > In addition to benchmarking different GPFS block sizes, I'm doing the
        > following:

        I would strongly suggest that you monitor your current environment (servers AND clients), because then you can see what is going on and are (hopefully) able to understand why the system behaves as it does.

        BTW: A picture from the storage to the client can help as well. Like (simplified):

        SAN <----------------> NSD server (2 GB FC), 1 GbE <---> Network <---> Clients
        IOPS: 1000             FC CMD: 50000                     Eth Packets: 50000 (cheap chip)
        STREAM: 1 GB/s         FC Stream: 150 MB/s               Eth Stream: 100 MB/s

        Maybe your network is fine and your disk subsystem is not; in that case spending money on a 10GbE network is spending money in the wrong place.

        Remarks
        1) nmon is a nice tool to monitor your nodes
        http://www.ibm.com/developerworks/forums/post!reply.jspa?messageID=14901115

        2) Also "mmdiag -iohist" gives right away a view about the disk usage in terms of data,inode,log,metadata a.s.o.
        As far as i underatnd the output from mmdiag --iohist everything not being "data" goes to the metadata disks.

        3) Some best practices:
        - Nodes should keep intermediate data locally.
        - If data is used again on a given node, try to keep it cached (keep data as close as possible to the CPU).
        - Think about the usage of jumbo frames (see the sketch after this list).
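
        A minimal sketch of the jumbo-frame part on the Linux side (the interface name is just an example; the switches and every host on the data network must support the larger MTU as well):

        # Raise the MTU on the data interface (must match switch and peer configuration)
        ip link set dev eth1 mtu 9000

        # Verify that 9000-byte frames pass end-to-end without fragmentation
        ping -M do -s 8972 <nfs-client>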

        cheers
        Hajo
        • SystemAdmin
          SystemAdmin
          2092 Posts

          Re: recommended block size, effect of maxblocksize on existing filesystems?

          ‏2012-10-24T14:43:55Z  in response to HajoEhlers
          > HajoEhlers wrote:
          > > I'm doing to work on improving performance.
          SNIP!
          >
          > > There are no metadataOnly disks.
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

          > > All MetaData is stored on 600GB 10K RPM SAS disks.
          > > There are 6 RAID-1 NSDs in the system pool.
          >
          > Then I assume that you have metadataOnly and dataOnly disks; a poor man's version of what are called storage pools nowadays.

          No. There are no "metadataOnly" disks.

          > And with 6 RAID-1 NSDs I assume only about 1500 IOPS.
          > For example:
          > - Our 1GbE NFS server requires about 1000 IOPS (avg) (use mmdiag --iohist to get an overview).

          You seem to be starting with an IOPS requirement, and then designing
          around that. We started with requirements for capacity, cost, and
          performance (from a user perspective, what you term "responsiveness", a
          very good description), not any numerical requirements for the
          performance of different components (disk, SAN, ethernet, etc.).

          > With your config I would run into problems pretty fast, and you are using 3 NFS servers.

          I don't understand your statement, as (1500 iops > 1000 iops) and the
          use of 3 NFS servers should distribute the load (using GPFS' CNFS)
          across the servers, ideally giving better performance for the clients.

          However, in our environment, it seems that either the disk or network
          systems are the bottleneck, not the GPFS servers, as the performance is
          about the same whether 1, 2, or 3 NFS servers are in operation. If the bottleneck
          was internal to the GPFS server (due to the CPU speed, memory, backplane or
          interface card limits in moving data from the NSDs on SAN to the NFS clients via
          Ethernet), then I would expect that adding more GPFS servers would improve
          responsiveness on the compute nodes. This has not been the case for us.

          Another point is that since CNFS uses RR DNS for "load balancing", and since we're using
          static NFS mounts (not automount), the NFS assignments tend to be unevenly distributed
          across the physical servers.

          >
          > > The hardware selection was based on price, capacity, performance.
          > I assume in that order but with one important point left out "responsiveness"

          Pretty much in that order. I'd consider "performance" to be
          "responsiveness". In fact, I like your word choice a lot, as the goal
          is to deliver a certain level of responsiveness to the end user, and
          the various performance components (disk RPM, IOPS, network speed,
          number of file servers, etc.) are a means to that end.

          >
          > > In addition to benchmarking different GPFS block sizes, I'm doing the
          > > following:
          >
          > I would strongly suggest that you monitor your current environment (servers AND clients), because then you can see what is going on and are (hopefully) able to understand why the system behaves as it does.

          I agree completely. I've been seeking a method for correlating the
          performance problems to the I/O characteristics of end-user compute
          jobs, but I haven't found any way to link them together well. Yes, I
          can measure the performance drop on the servers, or on the clients, but
          there's little or no link between the low-level numbers and the application
          behavior. For example, it would be very helpful to know that the system
          performs badly when there are 10 or more applications doing small random
          I/O that is write-intensive from 10 or more compute nodes, accessing
          the same directory at once. It is less helpful to know that when
          the system is performing badly (for some unknown reason), the symptom
          is that a particular NSD is performing 101.15 reads/second with an
          average wait time per IO operation of 1.85ms. Those type of performance
          numbers are fairly easy to obtain, but are very difficult to apply to
          the applications that cause the system to respond poorly.
          >
          > BTW: A picture from the storage to the client can help as well. Like (simplified):
          >
          > SAN <----------------> NSD server (2 GB FC), 1 GbE <---> Network <---> Clients
          > IOPS: 1000             FC CMD: 50000                     Eth Packets: 50000 (cheap chip)
          > STREAM: 1 GB/s         FC Stream: 150 MB/s               Eth Stream: 100 MB/s
          >

          > Maybe your network is fine and your disk subsystem is not; in that case spending money on a 10GbE network is spending money in the wrong place.

          Maybe...but that money has already been spent. :)

          >
          > Remarks
          > 1) nmon is a nice tool to monitor your nodes
          > http://www.ibm.com/developerworks/forums/post!reply.jspa?messageID=14901115

          Thanks for the recommendation. I like the idea that it saves data as a CSV
          file...somewhat easier than post-processing SAR or iostat output.

          >
          > 2) Also "mmdiag -iohist" gives right away a view about the disk usage in terms of data,inode,log,metadata a.s.o.
          > As far as i underatnd the output from mmdiag --iohist everything not being "data" goes to the metadata disks.

          Right, but that's an instantaneous snapshot from the GPFS server point
          of view. I'm not sure how to correlate those low-level numbers
          with the type of client jobs and with the level of responsiveness.

          I've been collecting data (sar, iostat, GPFS "waiters") etc. There doesn't
          seem to be a strong link between those numbers and the poor responsiveness
          experienced by users.

          My general impression is that the performance problems are related
          to the job characteristics (i.e. read-intensive vs. write-intensive,
          sequential I/O vs. random, small files vs. large) and, more importantly,
          to the number of simultaneous jobs. However, there's a huge amount of
          variation (sometimes 400 running jobs are fine, sometimes 20 jobs of a
          different type are a problem).

          The end users have very little knowledge or awareness of the low-level
          I/O behavior of their own jobs (even for the code they are writing). Most
          of our data is medical images. That data--and the analysis results--are
          often stored as large sparse arrays. Different toolkits read/write those
          arrays sequentially vs. randomly, in large streams or as individual
          floating point values.

          The typical compute job will read all files in a directory (usually
          through a library or toolkit that we did not develop), process that
          data in one of many different ways, then write results--which may be a
          small text file of statistics, or new data files that are 10s or 100s of
          times larger than the original. The "write" step also typically uses an
          external library or toolkit, giving us little knowledge or control of
          whether writes are sequential, random, using large blocks or small, etc.

          >
          > 3) Some best practices:
          > - Nodes should keep intermediate data locally.

          We're already doing this where possible. In our environment there are
          about 45 people actively developing or running wildly different types of
          compute jobs. Everyone is strongly encouraged to use local disk space on
          each compute node for temporary results or for data that will be shared
          by different jobs on the same node...but there is no way to enforce
          that suggestion.

          > - If data is used again on a given node try to keep it cached - ( Keep data as close as possible to the CPU )
          > - Think about the usage of jumbo frames.

          Yes, jumbo frames will be enabled on the 10Gb network.

          Thanks again for your suggestions,

          Mark

          >
          > cheers
          > Hajo
          • HajoEhlers
            HajoEhlers
            251 Posts

            Re: recommended block size, effect of maxblocksize on existing filesystems?

            ‏2012-10-25T10:54:40Z  in response to SystemAdmin
            > No. There are no "metadataOnly" disks.
            So a single "system" pool?

            > I don't understand your statement, as (1500 iops > 1000 iops) and the
            > use of 3 NFS servers should distribute the load (using GPFS' CNFS)
            > across the servers, ideally giving better performance for the clients.

            At the backend they are all using the same disk subsystem, thus they share the service capability/power of the storage. If one GPFS server is using "all" of the power, then the others will suffer in your CNFS environment.
            Keep in mind that CNFS is NOT pNFS (parallel NFS). Your load balancing is on the NFS server side and not on the GPFS one.

            > ... I've been seeking a method for correlating the performance problems to the I/O characteristics
            > of end-user compute jobs
            I would do it in a different way, because what if everything is fine but the user program is really a mess? Meaning that the interaction between the storage, GPFS and NFS is fine, fast and powerful, and thus THERE is nothing to improve.

            Example: We had a program which wrote in 1k chunks with a sync for each write.
            The system had no problem writing that to the storage cache, so on the wire everything was fine.
            But still the user complained ...
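
            A rough way to reproduce that kind of pattern for a test (paths and sizes are just placeholders):

            # Pathological pattern: 1 KB writes, each one forced to disk individually
            dd if=/dev/zero of=/gpfs0/scratch/sync1k.dat bs=1k count=10000 oflag=dsync

            # Same amount of data in large chunks, synced once at the end, for comparison
            dd if=/dev/zero of=/gpfs0/scratch/large.dat bs=1M count=10 conv=fsync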

            So I would forget the user for a moment and start looking from the storage side upwards.
            As said before:
            - If possible, measure directly on the storage: I/O queue sizes, cache usage.
            - Measure on the GPFS server with nmon and mmdiag (or mmpmon).

            Use "mmdiag --iohist" in a loop. Even if it shows only the last 500 entries. With a little bit scripting you could check if "time ms" with >0.1 exist and if it is data access (grep -w data ) or none date (MetaData) ( grep -vw data).
            Example from a saturated disk subsystem:

            I/O start time   RW     Buf type  disk:sectorNum    nSec  time ms  Type  Device/NSD ID  NSD server
            ---------------  --  -----------  ----------------  ----  -------  ----  -------------  ----------
            10:32:36.272755   R        inode  2:28244000           1   44.029   srv  hdisk14
            10:32:36.328422   W         data  3:4236679168      2048   29.458   srv  hdisk55
            10:32:36.328777   R         data  7:1963823104      2048   27.249   srv  hdisk59
            10:32:36.330033   R         data  7:217536512       2048   36.494   srv  hdisk59
            10:32:36.330482   R         data  3:2464006144      1638   47.012   srv  hdisk55
            10:32:36.332082   R         data  3:28366848        2048   37.023   srv  hdisk55
            10:32:36.337591   W         data  3:30377984        2048   29.204   srv  hdisk55


            As we can see, the metadata access is slow AND the data r/w is slow.
            The reason was an overloaded storage cache (it could not flush fast enough) for the data.
            The reason for the slow metadata access (different storage) was a change in the usage (starting to have thousands of files within many directories), so the FC metadata disks could not keep up.

            > ... That data--and the analysis results--are often stored as large sparse arrays.
            I would assume that the extension of a sparse file to a "real" file requires quite a lot of metadata updates/changes, thus during this process the access looks blocked.

            Cheers
            Hajo