Topic
  • 8 replies
  • Latest Post - 2012-10-25T10:54:40Z by HajoEhlers
SystemAdmin
2092 Posts

Pinned topic recommended block size, effect of maxblocksize on existing filesystems?

2012-10-19T21:33:10Z
I'm using GPFS 3.5.0-3 under Linux and periodically experience serious performance problems. We've got 3 GPFS servers exporting a ~33TB filesystem via NFS to ~50 compute nodes in an HPC cluster. Typical file sizes vary from <1MB to several GB, and I/O workloads are highly mixed and unpredictable. We see sporadic, extended load spikes on the GPFS/NFS servers and extremely poor performance on the NFS clients when multiple compute jobs are in I/O-intensive phases (reading input image files, writing intermediate results, etc.) simultaneously.

Most of the 14 NSDs are 4-disk RAID5 LUNs on an 8Gb/s SAN.
As one part of trying to address this issue, I'm going to create a new filesystem with a larger block size (up from the 512KB block size on the existing filesystem). I was planning to use 4MB, but I also see that an 8MB block size is an option.

Question 1: Is there a recommendation for the filesystem block size for highly mixed workloads?

Question 2: Are there tools that I can use on our existing (512KB) filesystem to determine an optimum block size, based on the existing usage patterns?

Question 3: Changing "maxblocksize" (with mmchconfig) is a prerequisite for creating the new filesystem with a larger block size. Will changing maxblocksize have any effect on the existing 512KB filesystem being used in production?
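
For reference, this is roughly the sequence I have in mind (the device names gpfs0/gpfs1 and the stanza file are placeholders, and please correct me if I have the order wrong):

    # Current cluster-wide limit and the block size of the existing filesystem.
    mmlsconfig maxblocksize
    mmlsfs gpfs0 -B                  # gpfs0 = the existing 512KB filesystem

    # Raise the limit first; as far as I understand, GPFS has to be down on
    # all nodes while maxblocksize is changed.
    mmchconfig maxblocksize=8M

    # Then create the new filesystem with the larger block size.
    mmcrfs gpfs1 -F new_nsds.stanza -B 4M
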
Updated on 2012-10-25T10:54:40Z by HajoEhlers
  • ezhong
    33 Posts

    Re: recommended block size, effect of maxblocksize on existing filesystems?

    2012-10-20T19:50:37Z
    > Question 1: Is there a recommendation for the filesystem block size for highly mixed workloads?

    It's often better to use separate filesystems with different block sizes to serve different needs.

    > Question 2: Are there tools that I can use on our existing (512KB) filesystem to determine an optimum block size, based on the existing usage patterns?

    Perhaps it's sufficient just to use common sense judgement.
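
    For example, a quick file-size histogram gives a feel for how much of the data would sit in files smaller than a candidate block size (just a sketch; /gpfs/fs0 is a placeholder path, and this assumes GNU find):

        # Bucket the existing files by power-of-two size.
        find /gpfs/fs0 -type f -printf '%s\n' |
          awk '{ b = 1; while (b < $1) b *= 2; hist[b]++ }
               END { for (b in hist) printf "%14d bytes: %d files\n", b, hist[b] }' |
          sort -n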

    > Question 3: Changing "maxblocksize" (with mmchconfig) is a prerequisite for creating the new filesystem with a larger block size. Will changing maxblocksize have any effect on the existing 512KB filesystem being used in production?

    I don't think so.
  • HajoEhlers
    253 Posts

    Re: recommended block size, effect of maxblocksize on existing filesystems?

    2012-10-23T09:28:33Z
    Why do you think that a large block size will solve your problem?

    Have you thought about what it means to have 50 nodes accessing around 50 disks?
    Have you thought about your disk layout? (What is the current configuration? Are there metadataOnly disks, yes/no? Why did you select your current hardware, and so on.)

    Have you done any measurements
    - on your disk subsystem, to determine IOPS, I/O queue depths and so on?
    - with mmpmon or mmdiag --iohist (see the example after this list)?
    - and of course, have you checked your network?
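
    For example, something like this on one of the NSD/NFS servers (just a sketch; mmpmon reads its commands from stdin, and -p gives parseable output):

        # Cumulative GPFS I/O statistics per filesystem on this node.
        echo fs_io_s | /usr/lpp/mmfs/bin/mmpmon -p

        # The most recent I/O requests with their individual service times.
        /usr/lpp/mmfs/bin/mmdiag --iohist | head -30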

    So try to find out what the cause of your bottleneck is.
    cheers
    Hajo
  • SystemAdmin
    2092 Posts

    Re: recommended block size, effect of maxblocksize on existing filesystems?

    2012-10-23T18:24:41Z
    > HajoEhlers wrote:
    > Why do you think that a large block size will solve your problem ?

    Good question. I'm not sure. Changing the block size is one of several things
    I'm doing to work on improving performance.

    I plan to test a variety of block sizes, using data and processing that
    closely replicates the actual tasks in our lab.

    >
    > Have you thought about what it means to have 50 nodes accessing around 50 disks?

    Yes. What are you suggesting I think about?

    > Have you thought about your disk layout? (What is the current configuration? Are there metadataOnly disks, yes/no? Why did you select your current hardware, and so on.)

    There are no metadataOnly disks.

    All metadata is stored on 600GB 10K RPM SAS disks. There are 6 RAID-1 NSDs in
    the system pool.

    The bulk of the data is stored on either 1TB or 2TB 7.2K RPM SATA disks, in 4-disk RAID5
    groups. There are 11 of these dataOnly NSDs.

    The hardware selection was based on price, capacity, performance.

    >
    > Have you done any measurements
    > - on your disk subsystem to determine iops, io queues and so on ?

    A few... but absolute performance is not the problem--only performance
    under simultaneous load from multiple clients--and the problem seems to
    depend a lot on the type of access. We sometimes have 400 jobs running
    on 50 compute nodes with no problem, or 20 jobs running on 10 nodes and
    a very high I/O wait.

    I'm planning some more thorough tests, but cannot deliberately stress the I/O
    system until after some lab deadlines (early November).

    > - with mmpmon or mmdiag --iohist ?
    > - and of course have you checked your network.

    The network hardware (SAN and Ethernet) doesn't show any errors, excessive
    retransmits, etc. However, I don't have a mechanism to measure Ethernet
    capacity, and that may be a bottleneck.

    >
    > So try to find out what's the cause for your bottleneck.

    Agreed.

    In addition to benchmarking different GPFS block sizes, I'm doing the
    following:

    - Updating the Ethernet network from a flat 1Gb/s layout to a 10Gb/s
      backbone between the GPFS servers and multiple switches, each switch
      serving a group of 8-20 compute nodes (at 1Gb/s). This will also give
      the GPFS servers a 10Gb/s path to each other.

    - Adding 2 more dataOnly NSDs, with the goals of (a) moving most data off
      the system pool (continuing to leave a small amount of shared binaries,
      databases, and other heavily accessed / higher-performance data on the
      system disks) and (b) having more NSDs for GPFS to stripe across. (A
      sketch of the placement policy I have in mind follows this list.)

    - Considering getting GPFS client licenses, so that the compute nodes can
      access the GPFS data more directly (still over Ethernet, but without
      the NFS layer).

    - Also considering getting 2 additional GPFS server licenses, to spread
      the load from the compute nodes (whether NFS or GPFS clients).
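
    For the "move data off the system pool" part, this is roughly the placement/migration policy I have in mind (a sketch only; the pool name 'data' and the device name gpfs1 are placeholders):

        # Default placement: new file data goes to the 'data' pool, keeping the
        # system pool (the RAID-1 SAS NSDs) for metadata and a few chosen files.
        printf "RULE 'default' SET POOL 'data'\n" > placement.pol
        mmchpolicy gpfs1 placement.pol

        # One-off migration to drain existing file data out of the system pool.
        printf "RULE 'drain' MIGRATE FROM POOL 'system' TO POOL 'data'\n" > drain.pol
        mmapplypolicy gpfs1 -P drain.pol -I yes
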
    Thanks,

    Mark
    > cheers
    > Hajo
  • chr78
    132 Posts

    Re: recommended block size, effect of maxblocksize on existing filesystems?

    2012-10-23T21:30:47Z
    Why do you NFS-export your filesystems? Native GPFS access would be much leaner on your HPC clients...
  • chr78
    132 Posts

    Re: recommended block size, effect of maxblocksize on existing filesystems?

    2012-10-24T08:30:37Z
    > chr78 wrote on 2012-10-23T21:30:47Z:
    > Why do you NFS-export your filesystems? Native GPFS access would be much leaner on your HPC clients...
    Ah, I see - sorry for my last comment - you are already thinking of getting client licenses.
  • HajoEhlers
    253 Posts

    Re: recommended block size, effect of maxblocksize on existing filesystems?

    2012-10-24T09:10:30Z
    > I'm doing to work on improving performance.
    Performance can mean different things.
    For some it is moving data around at 1 TB/s, for others it is a smoothly reacting system...

    > I plan to test a variety of block sizes....
    The block size is like a truck on a highway. The larger the truck, the more can be transported, but the total number of trucks and cars on the highway is limited.
    In the worst case the highway is full of trucks and no space is left for the Porsche that is carrying the delivery sheet. ;-)

    > There are no metadataOnly disks.
    > All metadata is stored on 600GB 10K RPM SAS disks.
    > There are 6 RAID-1 NSDs in the system pool.

    Then I assume that you have metadataOnly and dataOnly disks -- what is nowadays called the poor man's storage pools.
    And with 6 RAID-1 NSDs I would assume only about 1500 IOPS.
    For example:
    - Our 1GbE NFS server requires about 1000 IOPS on average (use mmdiag --iohist to get an overview).
    With your config I would run into problems pretty fast, and you are using 3 NFS servers.

    > The hardware selection was based on price, capacity, performance.
    I assume in that order, but with one important point left out: "responsiveness".

    > In addition to benchmarking different GPFS block sizes, I'm doing the
    > following:

    I would strongly suggest that you monitor your current environment (servers AND clients), because then you can see what is going on and are (hopefully) able to understand why the system behaves as it does.

    BTW: A picture from the storage to the client can help as well. Like (simplified):

        SAN <----------------> NSD server ( 2 GB FC ), 1 GbE <---> Network <---> Clients
        IOPS: 1000             FC CMD: 50000                       Eth packets: 50000 (cheap chip)
        STREAM: 1GB/s          FC stream: 150MB/s                  Eth stream: 100MB/s

    Maybe your network is fine and your disk subsystem is not -- in that case, spending money on a 10GbE network is spending money in the wrong place.

    Remarks:
    1) nmon is a nice tool to monitor your nodes:
    http://www.ibm.com/developerworks/forums/post!reply.jspa?messageID=14901115

    2) Also, "mmdiag --iohist" gives you right away a view of the disk usage in terms of data, inode, log, metadata and so on.
    As far as I understand the output from mmdiag --iohist, everything not being "data" goes to the metadata disks.

    3) Some best practices:
    - Nodes should keep intermediate data locally.
    - If data is used again on a given node, try to keep it cached (keep data as close as possible to the CPU).
    - Think about using jumbo frames (a minimal example follows this list).
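
    For the jumbo frames, something along these lines on each storage-facing interface (eth2 and nsd-server-1 are placeholders; every switch in the path must also allow the larger MTU):

        # Raise the MTU on the 10GbE interface used for NFS/GPFS traffic.
        ip link set dev eth2 mtu 9000

        # Verify that 9000-byte frames really pass unfragmented end to end.
        ping -M do -s 8972 nsd-server-1    # 8972 + 28 bytes of headers = 9000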

    cheers
    Hajo
  • SystemAdmin
    2092 Posts

    Re: recommended block size, effect of maxblocksize on existing filesystems?

    2012-10-24T14:43:55Z
    > HajoEhlers wrote:
    > > I'm doing to work on improving performance.
    SNIP!
    >
    > > There are no metadataOnly disks.
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    > > All metadata is stored on 600GB 10K RPM SAS disks.
    > > There are 6 RAID-1 NSDs in the system pool.
    >
    > Then I assume that you have metadataOnly and dataOnly disks -- what is nowadays called the poor man's storage pools.

    No. There are no "metadataOnly" disks.

    > And with 6 Raid1 i assume only 1500 iops.
    > For example:
    > - Our 1GbE NFS server requires about 1000 iops (avg) ( use mmdiag --iohist to get an overview )

    You seem to be starting with an IOPS requirement and then designing
    around that. We started with requirements for capacity, cost, and
    performance (from a user perspective, what you term "responsiveness" --
    a very good description), not with numerical requirements for the
    performance of the different components (disk, SAN, Ethernet, etc.).

    > With your config i would run into problems pretty fast and you are using 3 NFS servers.

    I don't understand your statement, as (1500 IOPS > 1000 IOPS), and the
    use of 3 NFS servers should distribute the load (using GPFS' CNFS)
    across the servers, ideally giving better performance for the clients.

    However, in our environment it seems that either the disk or the network
    is the bottleneck, not the GPFS servers, as the performance is about the
    same whether 1, 2, or 3 NFS servers are in operation. If the bottleneck
    were internal to the GPFS servers (due to CPU speed, memory, backplane or
    interface card limits in moving data from the NSDs on the SAN to the NFS
    clients via Ethernet), then I would expect that adding more GPFS servers
    would improve responsiveness on the compute nodes. This has not been the
    case for us.

    Another point is that since CNFS uses RR DNS for "load balancing", and since we're using
    static NFS mounts (not automount), the NFS assignments tend to be unevenly distributed
    across the physical servers.
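
    A quick way to see how the static mounts ended up being distributed would be something like this (a sketch; nodes.txt and the /gpfs mount point are placeholders):

        # Count how many compute nodes mount from each CNFS server address.
        for n in $(cat nodes.txt); do
            ssh "$n" "grep ' /gpfs nfs' /proc/mounts"
        done | cut -d: -f1 | sort | uniq -c | sort -rn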

    >
    > > The hardware selection was based on price, capacity, performance.
    > I assume in that order but with one important point left out "responsiveness"

    Pretty much in that order. I'd consider "performance" to be
    "responsiveness". In fact, I like your word choice a lot, as the goal
    is to deliver a certain level of responsiveness to the end user, and
    the various performance components (disk RPM, IOPS, network speed,
    number of file servers, etc.) are a means to that end.

    >
    > > In addition to benchmarking different GPFS block sizes, I'm doing the
    > > following:
    >
    > I would strongly suggest that you monitor your current environment ( servers AND clients ) because then you see whats going on and you are able ( hopefully) to understand why the system behaves as it does.

    I agree completely. I've been seeking a method for correlating the
    performance problems with the I/O characteristics of end-user compute
    jobs, but I haven't found any way to link them together well. Yes, I
    can measure the performance drop on the servers, or on the clients, but
    there's little or no link between the low-level numbers and the
    application behavior. For example, it would be very helpful to know that
    the system performs badly when there are 10 or more applications doing
    small, random, write-intensive I/O from 10 or more compute nodes, all
    accessing the same directory at once. It is less helpful to know that
    when the system is performing badly (for some unknown reason), the
    symptom is that a particular NSD is performing 101.15 reads/second with
    an average wait time per I/O operation of 1.85ms. That type of
    performance number is fairly easy to obtain, but very difficult to map
    back to the applications that cause the system to respond poorly.
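
    What I may end up doing is simply capturing both sides on the same timeline and lining them up after the fact (a sketch; list_running_jobs stands in for however we query the scheduler):

        # Every minute: timestamped GPFS I/O history, waiters, and the job list,
        # appended to one log so they can be correlated later.
        while true; do
            ts=$(date +%Y-%m-%dT%H:%M:%S)
            /usr/lpp/mmfs/bin/mmdiag --iohist  | sed "s/^/$ts iohist /"
            /usr/lpp/mmfs/bin/mmdiag --waiters | sed "s/^/$ts waiter /"
            list_running_jobs                  | sed "s/^/$ts job /"
            sleep 60
        done >> /var/tmp/io-trace.log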
    >
    > BTW: A picture from the storage to the client can help as well. Like (simplified):
    SNIP!

    > Maybe your network is fine and your disk subsystem not - in this case spending money for a 10GbE network is spending money on the wrong place.

    Maybe...but that money has already been spent. :)

    >
    > Remarks
    > 1) nmon is a nice tool to monitor your nodes
    > http://www.ibm.com/developerworks/forums/post!reply.jspa?messageID=14901115

    Thanks for the recommendation. I like the idea that it saves data as a CSV
    file...somewhat easier than post-processing SAR or iostat output.

    >
    > 2) Also, "mmdiag --iohist" gives you right away a view of the disk usage in terms of data, inode, log, metadata and so on.
    > As far as I understand the output from mmdiag --iohist, everything not being "data" goes to the metadata disks.

    Right, but that's an instantaneous snapshot from the GPFS server's point
    of view. I'm not sure how to correlate those low-level numbers
    with the type of client jobs and with the level of responsiveness.

    I've been collecting data (sar, iostat, GPFS "waiters") etc. There doesn't
    seem to be a strong link between those numbers and the poor responsiveness
    experienced by users.

    My general impression is that the performance problems are related
    to the job characteristics (i.e. read-intensive vs. write-intensive,
    sequential I/O vs. random, small files vs. large) and, more importantly,
    to the number of simultaneous jobs. However, there's a huge amount of
    variation (sometimes 400 running jobs are fine, sometimes 20 jobs of a
    different type are a problem).

    The end users have very little knowledge or awareness of the low-level
    I/O behavior of their own jobs (even for the code they are writing). Most
    of our data is medical images. That data--and the analysis results--are
    often stored as large sparse arrays. Different toolkits read/write those
    arrays sequentially vs. randomly, in large streams or as individual
    floating point values.

    The typical compute job will read all files in a directory (usually
    through a library or toolkit that we did not develop), process that
    data in one of many different ways, then write results--which may be a
    small text file of statistics, or new data files that are tens or
    hundreds of times larger than the original. The "write" step also
    typically uses an external library or toolkit, giving us little
    knowledge or control of whether writes are sequential or random, using
    large blocks or small, etc.

    >
    > 3) Some best practices:
    > - Nodes should keep intermediate data locally.

    We're already doing this where possible. In our environment there are
    about 45 people actively developing or running wildly different types of
    compute jobs. Everyone is strongly encouraged to use local disk space on
    each compute node for temporary results or for data that will be shared
    by different jobs on the same node...but there is no way to enforce
    that suggestion.

    > - If data is used again on a given node try to keep it cached - ( Keep data as close as possible to the CPU )
    > - Think about the usage of jumbo frames.

    Yes, jumbo frames will be enabled on the 10Gb network.

    Thanks again for your suggestions,

    Mark

    >
    > cheers
    > Hajo
  • HajoEhlers
    253 Posts

    Re: recommended block size, effect of maxblocksize on existing filesystems?

    2012-10-25T10:54:40Z
    > No. There are no "metadataOnly" disks.
    So a single "system" pool?

    > I don't understand your statement, as (1500 iops > 1000 iops) and the
    > use of 3 NFS servers should distribute the load (using GPFS' CNFS)
    > across the servers, ideally giving better performance for the clients.

    At the back end they are using the same disk subsystem, so they share the service capability/power of the storage. If one GPFS server is using "all" of that power, the others will suffer in your CNFS environment.
    Keep in mind that CNFS is NOT pNFS (parallel NFS). Your load balancing is on the NFS server side, not on the GPFS side.

    > ... I've been seeking a method for correlating the performance problems to the I/O characteristics
    > of end-user compute jobs
    I would do it in a different way, because what if everything is fine but the user program is really a mess? Meaning that the interaction between the storage, GPFS and NFS is fine, fast and powerful, and thus THERE is nothing to improve.

    Example: We had a program which wrote in 1 KB chunks with a sync after each write.
    The system had no problem writing that to the storage cache, so on the wire everything was fine.
    But still the user complained...

    So I would forget the users for a moment and start looking from the storage side upwards.
    As said before:
    - If possible, measure directly on the storage: I/O queue sizes, cache usage.
    - Measure on the GPFS servers with nmon and mmdiag (or mmpmon).

    Use "mmdiag --iohist" in a loop. Even if it shows only the last 500 entries. With a little bit scripting you could check if "time ms" with >0.1 exist and if it is data access (grep -w data ) or none date (MetaData) ( grep -vw data).
    Example from a saturated disk subsystem:

        I/O start time   RW  Buf type  disk:sectorNum    nSec  time ms  Type  Device/NSD ID  NSD server
        ---------------  --  --------  ----------------  ----  -------  ----  -------------  ----------
        10:32:36.272755  R   inode     2:28244000           1   44.029  srv   hdisk14
        10:32:36.328422  W   data      3:4236679168      2048   29.458  srv   hdisk55
        10:32:36.328777  R   data      7:1963823104      2048   27.249  srv   hdisk59
        10:32:36.330033  R   data      7:217536512       2048   36.494  srv   hdisk59
        10:32:36.330482  R   data      3:2464006144      1638   47.012  srv   hdisk55
        10:32:36.332082  R   data      3:28366848        2048   37.023  srv   hdisk55
        10:32:36.337591  W   data      3:30377984        2048   29.204  srv   hdisk55


    As we see, the metadata access is slow AND the data r/w is slow.
    The reason on the data side was an overloaded storage cache (it could not flush fast enough).
    The reason for the slow metadata access (different storage) was a change in usage (starting to have thousands of files within many directories), so the FC metadata disks could not keep up.

    > ... That data--and the analysis results--are often stored as large sparse arrays.
    I would assume that extending a sparse file into a "real" file requires quite a lot of metadata updates/changes, so during this process the access looks blocked.

    Cheers
    Hajo