Topic
  • 9 replies
  • Latest Post - 2016-04-23T01:56:46Z by db808
SchelePierre
20 Posts

Pinned topic mmdiag --iohist showing outliers on read traffic

2012-04-20T14:29:42Z
Hello,

We are working with a mixed Windows 2008 R2 / SLES 10 SP1 GPFS cluster (3.4.0.11 on SLES, still 3.4.0.3 on Windows, FS level 3.4.0.3).

Two Linux machines are cross-connected to two DS3500 storage systems, each providing 6 data LUNs (RAID6 8+P+Q) and 6 metadata LUNs (RAID1) used as NSDs.
The Linux machines are NSD servers, connected over a 10GbE switch to two Windows cluster nodes, which are NSD clients. A third Linux machine (not directly connected to the storage, but on the same 10GbE network) has a tiebreaker role (server/quorum).

While investigating a performance issue, we stumbled on the following:

Looking at the mmdiag --iohist output on a Windows NSD client, we see:

16:12:17.088625 R data 2:29253025792 4096 2.500 cli 0AF30511:4D9C6C64 172.31.0.13
16:12:17.091625 R data 6:8121241600 4096 1.500 cli 0AF30511:4D9C6C67 172.31.0.13
16:12:17.091625 R data 5:26244268032 4096 2.500 cli 0AF38411:4D9C6C66 172.31.0.12
16:12:17.094625 R data 6:1393774592 4096 2.500 cli 0AF30511:4D9C6C67 172.31.0.13
16:12:17.095125 R data 3:20890284032 4096 2.500 cli 0AF38411:4D9C6C65 172.31.0.12
16:12:17.097125 R data 2:24636592128 4096 2.500 cli 0AF30511:4D9C6C64 172.31.0.13
16:12:17.099125 R data 5:7258157056 4096 902.500 cli 0AF38411:4D9C
What's worrying me is that every now and then the read time for 4096 sectors jumps to >900ms, while it is 1-2ms at other times. There are no in-between values: it is either 1-2ms or 9XY ms.
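
To show there really are no in-between values, this is how I count the two populations (a sketch, assuming the column layout shown above, where field 2 is the R/W flag and field 6 is the time in milliseconds):

mmdiag --iohist | awk '$2 == "R" && $6 ~ /^[0-9]/ {
    total++
    if ($6 + 0 < 10) fast++          # the normal 1-2ms reads
    else if ($6 + 0 >= 100) slow++   # the 9XY ms outliers
} END { printf "reads=%d fast=%d slow=%d\n", total, fast, slow }'

The fast and slow counters together always add up to total; nothing lands in between.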

Would those times correspond to requesting sectors already in the NSD server's pagepool vs requesting uncached sectors from disk?
At the same time, the output of mmdiag --iohist on one of the NSD servers reveals

16:15:23.275696 R data 1:28558798848 4096 21.264 srv sdl
16:15:23.284877 R data 5:17129132032 4096 21.715 srv sdk
16:15:23.320031 R data 5:14012674048 4096 17.181 srv sdk
16:15:23.338776 R data 3:29748744192 4096 18.032 srv sdm
16:15:23.365516 R data 3:13482860544 4096 23.435 srv sdm
16:15:23.376936 R data 3:23377121280 4096 27.741 srv sdm
16:15:23.523171 R data 3:16868118528 4096 18.209 srv sdm
16:15:23.480319 R data 1:12508475392 4096 99.471 srv sdl
16:15:23.564068 R data 1:14530838528 4096 19.497 srv sdl
16:15:23.484793 R data 1:11982221312 4096 108.894 srv sdl
16:15:23.599717 R data 1:17835274240 4096 19.777 srv sdl
16:15:23.555420 R data 3:17808695296 4096 68.679

where disk access times are usually around 20ms, with outliers around 100ms.

I have not yet been able to establish whether or not the 900ms NSD client outliers correspond somehow to the NSD server 100ms outliers.

My questions are:
- Does the mmdiag --iohist output on the NSD server show any prefetching done by GPFS?
- Does the time in mmdiag --iohist on the NSD client mean the time between the request being sent to the server and the response being returned? The times reported on the NSD client are 2ms, but "normal" values on the server are around 20ms read from disk; what does this mean with respect to the cached/non-cached theory?
- How does one explain the 900ms time on the NSD client, given the numbers seen on the server?
- Is a >100ms server disk access time to be expected if the load is << the max throughput of the storage boxes? (I would expect fairly constant disk read times on the server in a low-load situation.)

Filesystem blocksize is 2MB; other config settings include:

minReleaseLevel 3.4.0.3
dmapiFileHandleSize 32
maxblocksize 4M
maxMBpS 1600
prefetchThreads 100
worker1Threads 200

Linux nodes:
pagepool 12G
maxFilesToCache 100000

Windows nodes:
pagepool 4G

Common:
adminMode central
Filesystem:
flag value description

------------------------
-f 65536 Minimum fragment size in bytes
-i 512 Inode size in bytes
-I 32768 Indirect block size in bytes
-m 2 Default number of metadata replicas
-M 2 Maximum number of metadata replicas
-r 2 Default number of data replicas
-R 2 Maximum number of data replicas
-j scatter Block allocation type
-D nfs4 File locking semantics in effect
-k all ACL semantics in effect
-n 30 Estimated number of nodes that will mount file system
-B 2097152 Block size
-Q none Quotas enforced
none Default quotas enabled
--filesetdf no Fileset df enabled?
-V 12.06 (3.4.0.3) File system version
--create-time Fri May 20 15:26:56 2011 File system creation time
-u yes Support for large LUNs?
-z no Is DMAPI enabled?
-L 4194304 Logfile size
-E yes Exact mtime mount option
-S yes Suppress atime mount option
-K always Strict replica allocation option
--fastea yes Fast external attributes enabled?
--inode-limit 134217728 Maximum number of inodes
-P system Disk storage pools in file system
-A yes Automatic mount option
-o none Additional mount options
--mount-priority 0

Any opinions on expected mmdiag --iohist output, and specifically, the large outliers I have observed?

Thanks a lot!
Pieter
Updated on 2012-05-01T20:24:59Z by osc
  • dlmcnabb
    1012 Posts

    Re: mmdiag --iohist showing outliers on read traffic

    2012-04-20T14:33:49Z
    Those IO times are the actual IOs to the disks, not caches. If they are large, then check the disk subsystem for configuration problems or things like RAID rebuilds.
  • SchelePierre
    20 Posts

    Re: mmdiag --iohist showing outliers on read traffic

    2012-04-21T10:40:06Z
    • dlmcnabb
    • 2012-04-20T14:33:49Z
    Those IO times are the actual IOs to the disks, not caches. If they are large, then check the disk subsystem for configuration problems or things like RAID rebuilds.
    @dlmcnabb:

    thank you for your response.
    You say the IO times are the actual times it takes to read data from the disk. Are you talking about the numbers on the NSD server (traffic marked as "srv"), on the NSD client (traffic marked as "cli"), or both?

    I'm confused about correctly interpreting the numbers:

    1) The numbers reported for the "cli" traffic are LOWER than the numbers for the "srv" traffic (i.e. actual disk reads) - but of course, before data ends up on the NSD client it was first read from disk and then sent over the network...

    2) I remember having read somewhere that NSD clients don't actually pull data from the NSD server's pagepool. Is that a correct statement? When is the NSD server's pagepool used in an NSD server - NSD client setup? Is all prefetching done on the NSD client? (In that case, I'm even more confused by the numbers - see question 1.)
    Thanks a lot for your time!
    Pieter
  • dlmcnabb
    1012 Posts

    Re: mmdiag --iohist showing outliers on read traffic

    2012-04-22T05:07:49Z
    • SchelePierre
    • 2012-04-21T10:40:06Z
    @dlmcnabb:

    thank you for your response.
    You say the IO times are the actual times it takes to read data from the disk. Are you talking about the numbers on the NSD server (traffic marked as "srv"), on the NSD client (traffic marked as "cli"), or both?

    I'm confused about correctly interpreting the numbers:

    1) The numbers reported for the "cli" traffic are LOWER than the numbers for the "srv" traffic (i.e. actual disk reads) - but of course, before data ends up on the NSD client it was first read from disk and then sent over the network...

    2) I remember having read somewhere that NSD clients don't actually pull data from the NSD server's pagepool. Is that a correct statement? When is the NSD server's pagepool used in an NSD server - NSD client setup? Is all prefetching done on the NSD client? (In that case, I'm even more confused by the numbers - see question 1.)
    Thanks a lot for your time!
    Pieter
    On a client node, the iohist times are what the client sees for doing an IO, as far as it is concerned. If the IO goes to an NSD server, then it includes the network time between the nodes.

    NSD servers do not cache any data. The pagepool is just for buffers as the NSD worker threads read/write data to the underlying disk subsystem.

    Only client nodes decide what to prefetch and cache. They hold GPFS byte-range tokens on the files in order to know what they are allowed to cache.
  • SchelePierre
    20 Posts

    Re: mmdiag --iohist showing outliers on read traffic

    2012-04-22T09:29:17Z
    • dlmcnabb
    • 2012-04-22T05:07:49Z
    On a client node, the iohist times are what the client sees for doing an IO, as far as it is concerned. If the IO goes to an NSD server, then it includes the network time between the nodes.

    NSD servers do not cache any data. The pagepool is just for buffers as the NSD worker threads read/write data to the underlying disk subsystem.

    Only client nodes decide what to prefetch and cache. They hold GPFS byte-range tokens on the files in order to know what they are allowed to cache.
    Thanks.
    So let me get things straight:

    1) NSD clients do not read cached data from NSD servers, because data is not cached there. This means that if an NSD client asks an NSD server for data, this results in a disk read. So there is sort of a 1:1 map between mmdiag --iohist read requests as shown on the NSD client and as shown on the NSD server?

    2) NSD client mmdiag --iohist output includes both the actual disk read and the network transfer time between NSD server and NSD client.

    This leaves me with the fact that the reported times on the NSD client (Windows, GPFS 3.4.0.3) are SMALLER than the times reported on the NSD server (Linux, GPFS 3.4.0.11). According to my interpretation of the above, these numbers should be BIGGER, as NSD-client-reported times include the server time spent on raw disk IO.
    Is mmdiag --iohist output on Windows reliable? (Being off by a factor of 10 could explain and correlate the numbers between NSD client and NSD server.)
    Pieter
  • dlmcnabb
    1012 Posts

    Re: mmdiag --iohist showing outliers on read traffic

    2012-04-22T16:21:36Z
    • SchelePierre
    • 2012-04-22T09:29:17Z
    Thanks.
    So let me get things straight:

    1) NSD clients do not read cached data from NSD servers, because data is not cached there. This means that if an NSD client asks an NSD server for data, this results in a disk read. So there is sort of a 1:1 map between mmdiag --iohist read requests as shown on the NSD client and as shown on the NSD server?

    2) NSD client mmdiag --iohist output includes both the actual disk read and the network transfer time between NSD server and NSD client.

    This leaves me with the fact that the reported times on the NSD client (Windows, GPFS 3.4.0.3) are SMALLER than the times reported on the NSD server (Linux, GPFS 3.4.0.11). According to my interpretation of the above, these numbers should be BIGGER, as NSD-client-reported times include the server time spent on raw disk IO.
    Is mmdiag --iohist output on Windows reliable? (Being off by a factor of 10 could explain and correlate the numbers between NSD client and NSD server.)
    Pieter
    You are looking at averages that do not represent the same IO.

    Each IO request on a client will match exactly one request on an NSD server. If you have more than one NSD server, the client will on average be talking to each NSD server the same amount of time. If you have some fast and some slow NSD servers, then a particular client could on average have faster IO than a particular slow NSD server.
  • SchelePierre
    20 Posts

    Re: mmdiag --iohist showing outliers on read traffic

    2012-04-22T18:29:35Z
    • dlmcnabb
    • 2012-04-22T16:21:36Z
    You are looking at averages that do not represent the same IO.

    Each IO request on a client will match exactly one request on an NSD server. If you have more than one NSD server, the client will on average be talking to each NSD server the same amount of time. If you have some fast and some slow NSD servers, then a particular client could on average have faster IO than a particular slow NSD server.
    I'm sorry, I'm not following there.

    What I see is:
    On the NSD client: mmdiag --iohist contains one entry per disk read of N (mostly 4096) sectors (which would correspond to my filesystem block size of 2M). Each entry includes such info as the NSD server read from, the NSD disk, and the starting sector of the operation (read), plus the time in milliseconds (so it says). I'm not sure what you mean by "averages": to me it looks like there is one entry for every IO operation - is this correct?

    On the NSD server referenced in the above IO statistics line (i.e. the NSD server corresponding to the line in the client's mmdiag --iohist output), I see the actual physical disk read operation, the Linux device involved (/dev/sdX), and the time it takes to complete the read of 4096 sectors on the disk.
    The reported time for the physical disk read is 10x higher than the corresponding time entry on the NSD client, which I'm unable to interpret.

    I make the following assumptions in my reasoning:

    1) A read of 4096 sectors of an NSD from a specific NSD server gives rise to one line of output in "mmdiag --iohist" on the NSD client, and on the relevant NSD server (mentioned in the mmdiag --iohist line on the NSD client) I should see a corresponding read operation of the same disk:sector number.

    Am I correct so far, that is, is this way (correlating mmdiag --iohist output lines between NSD client and server) a good way of tracing back IO performance issues from NSD client to NSD server?

    2) If 1) is correct, and NSD client mmdiag --iohist output includes the physical disk read time plus the network transfer time between NSD server and NSD client, then it should be physically impossible to observe lower IO completion times on the NSD client than on the NSD server for the same operation.

    So, comparing the mmdiag --iohist output, what I conclude by reasoning (point 2) contradicts my actual observation of the relation between the mmdiag --iohist IO completion times reported on the NSD server and the NSD client for the same operation - which more than likely means I'm making false assumptions here.
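
    If the 1:1 mapping holds, it should be testable by joining the two iohist logs on the disk:sector field (a sketch; it assumes the output is saved to a file on each node, with field 4 the disk:sector and field 6 the time in milliseconds):

    awk '$2 == "R" { print $4, $6 }' client-iohist.txt | sort > client.txt
    awk '$2 == "R" { print $4, $6 }' server-iohist.txt | sort > server.txt
    join client.txt server.txt    # columns: disk:sector, client ms, server ms

    Any row where the client time is lower than the server time would confirm the contradiction.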
  • osc
    11 Posts

    Re: mmdiag --iohist showing outliers on read traffic

    2012-05-01T20:24:59Z
    I see you are using 4M blocks on the Linux boxes. What are the SCSI device settings for:

    /sys/block/${i}/queue/max_sectors_kb
    /sys/block/${i}/queue/read_ahead_kb

    where $i = the /dev/sd?? devices. Linux will suffer greatly if these are smaller than the GPFS blocks (and the defaults can be, depending on your device).
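
    A quick loop to dump the current values (a sketch; the sd* glob is an assumption, adjust to your device names):

    for dev in /sys/block/sd*; do
        printf '%s max_sectors_kb=%s read_ahead_kb=%s\n' "${dev##*/}" \
            "$(cat "$dev/queue/max_sectors_kb")" "$(cat "$dev/queue/read_ahead_kb")"
    done

    If they come back smaller than a filesystem block (2048 KiB here), they can be raised at runtime, e.g. echo 2048 > /sys/block/sdk/queue/max_sectors_kb, keeping in mind that max_sectors_kb is capped by the device's max_hw_sectors_kb.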
  • db808
    87 Posts

    Re: mmdiag --iohist showing outliers on read traffic

    2013-09-04T21:00:39Z

    Was the root cause of the 910 millisecond network IOs ever identified?

    We're seeing similar network traffic patterns from our Windows Server 2008 R2 10GbE-connected GPFS client nodes.  mmdiag --iohist shows less than 4 millisecond times for most IO to the NSD servers, which is faster than possible for the disk storage being used.  We're seeing bursts of 30-400 MB of good network traffic at over 800 MB/sec, then a delay of 901-914 milliseconds.  The delay happens about 9-11 times per 512 IOs, reducing the overall throughput from 840 MB/sec to about 125 MB/sec.

    We have tested 4 GPFS client nodes connected to 2 NSD servers, with the NSDs alternated between the two NSD servers.  So we have a 4-to-2 network topology, with all nodes connected to the same network switch.  No ISLs involved.

    The 910 millisecond delays occur on both read and write, and to/from both NSD servers.

    It appears to be a network congestion problem where there is an n-to-1 burst of traffic.  I suspect that the TCP congestion algorithms are throttling back the bandwidth (by reducing the TCP window size) and then slowly ramping back up.  The 910 millisecond delay is the congestion recovery time.

    This all assumes that the IO times recorded in the GPFS iohistory are accurate.
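
    A back-of-the-envelope check (my own arithmetic, assuming one 2 MB filesystem block per IO and ~10 stalls per 512 IOs) shows the stalls alone are enough to produce a drop of this magnitude:

    awk 'BEGIN {
        mb     = 512 * 2         # data moved per 512 IOs, in MB (2MB blocks)
        xfer   = mb / 840        # seconds on the wire at the full 840 MB/sec
        stalls = 10 * 0.910      # seconds lost to ~10 delayed stalls
        printf "effective rate = %.0f MB/sec\n", mb / (xfer + stalls)
    }'

    which prints roughly 99 MB/sec, the same order as the ~125 MB/sec we are measuring.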

    Is anyone else seeing these ~910 millisecond delays?

    Dave B

  • db808
    87 Posts

    Re: mmdiag --iohist showing outliers on read traffic

    2016-04-23T01:56:46Z

    I know that it has been a while, but I wanted to post what I believe to be the root cause of the 100 millisecond latency reported by GPFS iohist when communicating with Windows-based GPFS clients.

     

    The Windows TCP default delayed ack timeout is 100 milliseconds.  This value is fixed and, to my knowledge, cannot easily be changed on contemporary Windows versions.  Some ancient Windows versions allowed altering it.

     

    BTW: the default TCP delayed ack timeout for Linux is 40 milliseconds; it is fixed and cannot be changed without modifying a kernel config file and recompiling the kernel.

     

    So, to further troubleshoot these 100-ish millisecond times in the GPFS iohist log entries from Windows GPFS clients, you should follow the procedures for troubleshooting Windows TCP delayed ack behavior.  This will likely include some packet traces and analysis with a package such as Wireshark.
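
    For example, capturing on the NSD server with something like the following (a sketch; the interface name and the client address are placeholders, and 1191 is the default GPFS daemon TCP port) gives Wireshark enough to spot the ~100 ms ack gaps:

    tcpdump -i eth0 -s 128 -w gpfs-trace.pcap 'port 1191 and host <windows-client-ip>'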

     

    Note: TCP time stamps are enabled on a TCP session if both sides advertise TCP time stamp capability.  The time stamp option adds 10 bytes to the TCP header (normally padded to 12), and therefore reduces the packet payload by that amount.  For large TCP transfers, like a GPFS data transfer of 4 or 8 MBytes (plus GPFS headers), this reduction of the packet payload size can change the total number of packets used for the transfer.  If the number of packets changes from an odd number to an even number, a potential TCP delayed ack is prevented.
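
    To make the odd/even argument concrete, here is an illustrative packet count for a single 4 MiB transfer (my own arithmetic; it ignores GPFS and link-layer headers and assumes the usual 12 bytes for the padded time stamp option):

    awk 'BEGIN {
        bytes = 4 * 1024 * 1024              # one 4 MiB GPFS data transfer
        for (mtu = 8000; mtu <= 9000; mtu += 1000) {
            plain = mtu - 40                 # MSS with no TCP options
            ts    = mtu - 40 - 12            # MSS with time stamps enabled
            printf "MTU %d: %d packets plain, %d with time stamps\n",
                   mtu, int((bytes + plain - 1) / plain), int((bytes + ts - 1) / ts)
        }
    }'

    With a 9000-byte MTU both cases need 469 packets, but with the 8000-byte MTU mentioned later in this post the count flips from 527 (odd) to 528 (even) - exactly the parity change that can suppress or trigger a delayed ack.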

    I won't debate here the usefulness of TCP time stamps for more accurate congestion handling versus the extra overhead of generating the time stamp and the slight loss of payload.  Many NIC vendors recommend disabling TCP time stamps for improved throughput.  However, both Windows and macOS enable TCP time stamps by default.

     

    A simple packet trace from the Windows system will show whether TCP time stamps are actually being generated.  Whatever the case, toggling the TCP time stamp setting on the Windows or GPFS server side will add or remove those option bytes in the TCP header, changing the payload size and where the packet boundaries fall.

    Toggle the TCP time stamp setting so that the behavior is the opposite of the initial condition, then check the GPFS iohist logs.  Are there still data transfers with 100 millisecond times?  If yes, the TCP delayed ack behavior was not improved by changing the payload size.  On the other hand, the 100 millisecond timeouts may disappear entirely.
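
    For reference, the toggles themselves (a sketch; the Linux change lasts until reboot, and the netsh command needs an elevated prompt):

    # Linux (NSD server side)
    sysctl -w net.ipv4.tcp_timestamps=0

    # Windows (GPFS client side)
    #   netsh interface tcp set global timestamps=disabled

    Re-run the workload afterwards and compare the mmdiag --iohist times against the baseline.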

     

    A second, widely reported source of TCP delayed acks is the TCP segmentation offload (TSO) feature on transmits and the large receive offload (LRO) feature, which coalesces the handling of multiple packets for CPU efficiency.  These may change when acks are sent and processed compared with the native OS TCP stack.  This "different" handling may result in a different packet and ack stream, an odd number of packets, and a delayed ack timeout.

    Experimenting with the TSO and LRO NIC offloads may eliminate the TCP delayed ack.  However, disabling TSO and LRO on high-bandwidth NICs adds significant TCP stack overhead on the CPU, and can lower throughput.
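
    On the Linux side, ethtool can show and toggle these offloads (a sketch; eth0 is a placeholder for the 10GbE interface):

    ethtool -k eth0 | egrep 'segmentation-offload|receive-offload'   # show current state
    ethtool -K eth0 tso off lro off                                  # disable for the test

    Remember to re-enable them with "on" once the experiment is done.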

    There are other techniques to reduce TCP delayed acks that I will not detail here.

    I would also review and re-validate the NIC-vendor-specific networking configuration recommendations for GPFS on both the Windows client and the NSD server.

     

    In my case, we were using a non-standard MTU size of 8000 bytes (not the typical 9000 bytes) due to a restriction of the NetXen NICs being used.  This difference alone could be the cause of the increased TCP delayed acks; I doubt many people are testing 4 MB and 8 MB GPFS NSD protocol sensitivity to TCP delayed acks with 8000-byte MTUs.

     

    So, in summary: if you see 100 millisecond times being reported by GPFS iohist from Windows clients, or 40 millisecond times from Linux clients, you are seeing artifacts of TCP delayed acks.

     

    Dave B.