Topic
8 replies. Latest post 2013-09-04T21:00:39Z by db808
SchelePierre
20 Posts

Pinned topic mmdiag --iohist showing outliers on read traffic

2012-04-20T14:29:42Z
Hello,

We are working with a mixed Windows 2008R2 / SLES10SP1 GPFS cluster (3.4.0.11 on SLES, still 3.4.0.3 on Windows, FS level 3.4.0.3).

Two Linux machines are cross-connected to two DS3500 storage systems, each providing 6 data (RAID6 8+P+Q) LUNs and 6 metadata (RAID1) LUNs used as NSDs.
The Linux machines are NSD servers, connected over a 10GbE switch to two Windows cluster nodes, which are NSD clients. A third Linux machine (not directly connected to the storage, but on the same 10GbE network) has a tiebreaker role (server/quorum).

Upon examining a performance issue, we have stumbled on the following:

Looking at mmdiag --iohist output on a Windows NSD client we see

16:12:17.088625 R data 2:29253025792 4096 2.500 cli 0AF30511:4D9C6C64 172.31.0.13
16:12:17.091625 R data 6:8121241600 4096 1.500 cli 0AF30511:4D9C6C67 172.31.0.13
16:12:17.091625 R data 5:26244268032 4096 2.500 cli 0AF38411:4D9C6C66 172.31.0.12
16:12:17.094625 R data 6:1393774592 4096 2.500 cli 0AF30511:4D9C6C67 172.31.0.13
16:12:17.095125 R data 3:20890284032 4096 2.500 cli 0AF38411:4D9C6C65 172.31.0.12
16:12:17.097125 R data 2:24636592128 4096 2.500 cli 0AF30511:4D9C6C64 172.31.0.13
16:12:17.099125 R data 5:7258157056 4096 902.500 cli 0AF38411:4D9C
What's worrying me is that every now and then the read time for 4096 sectors jumps to >900 ms, while it is 1-2 ms the rest of the time. There are no in-between values: it is either 1-2 ms or 9xx ms.
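(To quantify this, a quick filter like the one below splits the client-side entries into the two populations - a rough sketch, assuming the column layout shown above where the time in milliseconds is the sixth field, and that awk is available on the Windows node or the output is copied to a Linux node first; the 100 ms threshold is arbitrary.)

mmdiag --iohist | awk '$7 == "cli" { if ($6+0 > 100) slow++; else fast++ } END { print "fast:", fast+0, "  slow (>100 ms):", slow+0 }'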

Would those times correspond to requesting sectors already in the NSD server's pagepool vs. requesting uncached sectors from disk?
At the same time, the output of mmdiag --iohist on one of the NSD servers reveals

16:15:23.275696 R data 1:28558798848 4096 21.264 srv sdl
16:15:23.284877 R data 5:17129132032 4096 21.715 srv sdk
16:15:23.320031 R data 5:14012674048 4096 17.181 srv sdk
16:15:23.338776 R data 3:29748744192 4096 18.032 srv sdm
16:15:23.365516 R data 3:13482860544 4096 23.435 srv sdm
16:15:23.376936 R data 3:23377121280 4096 27.741 srv sdm
16:15:23.523171 R data 3:16868118528 4096 18.209 srv sdm
16:15:23.480319 R data 1:12508475392 4096 99.471 srv sdl
16:15:23.564068 R data 1:14530838528 4096 19.497 srv sdl
16:15:23.484793 R data 1:11982221312 4096 108.894 srv sdl
16:15:23.599717 R data 1:17835274240 4096 19.777 srv sdl
16:15:23.555420 R data 3:17808695296 4096 68.679

where disk access times are usually around 20ms, with outliers around 100ms.

I have not yet been able to establish whether or not the 900ms NSD client outliers correspond somehow to the NSD server 100ms outliers.
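One way I can think of to test the correlation (a rough sketch; mmdiag --iohist only keeps a limited history, so both sides would have to be captured close together in time, and slow-reads.txt copied from the client to the server):

# on the NSD client: record the disk:sector address of every read slower than 100 ms
mmdiag --iohist | awk '$7 == "cli" && $6+0 > 100 { print $4 }' > slow-reads.txt
# on the NSD server named in those entries: look for the same disk:sector addresses
mmdiag --iohist | grep -F -f slow-reads.txt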

My questions are:
- Does the mmdiag --iohist output on the NSD server show any prefetching done by GPFS?
- Does the mmdiag --iohist time on the NSD client mean the time between the request being sent to the server and the response being returned? The times reported on the NSD client are ~2 ms, but "normal" values on the server are around 20 ms for reads from disk; what does this mean with respect to the cached/non-cached theory?
- How does one explain the 900 ms time on the NSD client, given the numbers seen on the server?
- Is a >100 ms server disk access time to be expected if the load is well below the maximum throughput of the storage boxes? (I would expect fairly constant disk read times on the server in a low-load situation.)

Filesystem block size is 2MB; other config settings include:
minReleaseLevel 3.4.0.3
dmapiFileHandleSize 32
maxblocksize 4M
maxMBpS 1600
prefetchThreads 100
worker1Threads 200
Linux nodes:
pagepool 12G
maxFilesToCache 100000
Windows nodes:
pagepool 4G
common:
adminMode central
Filesystem:
flag value description

------------------------
-f 65536 Minimum fragment size in bytes
-i 512 Inode size in bytes
-I 32768 Indirect block size in bytes
-m 2 Default number of metadata replicas
-M 2 Maximum number of metadata replicas
-r 2 Default number of data replicas
-R 2 Maximum number of data replicas
-j scatter Block allocation type
-D nfs4 File locking semantics in effect
-k all ACL semantics in effect
-n 30 Estimated number of nodes that will mount file system
-B 2097152 Block size
-Q none Quotas enforced
none Default quotas enabled
--filesetdf no Fileset df enabled?
-V 12.06 (3.4.0.3) File system version
--create-time Fri May 20 15:26:56 2011 File system creation time
-u yes Support for large LUNs?
-z no Is DMAPI enabled?
-L 4194304 Logfile size
-E yes Exact mtime mount option
-S yes Suppress atime mount option
-K always Strict replica allocation option
--fastea yes Fast external attributes enabled?
--inode-limit 134217728 Maximum number of inodes
-P system Disk storage pools in file system
-A yes Automatic mount option
-o none Additional mount options
--mount-priority 0

Any opinions on expected mmdiag --iohist output, and specifically, the large outliers I have observed?

Thanks a lot!
Pieter
Updated on 2012-05-01T20:24:59Z by osc
  • dlmcnabb
    1012 Posts

    Re: mmdiag --iohist showing outliers on read traffic

    2012-04-20T14:33:49Z  in response to SchelePierre
    Those IO times are the actual IOs to the disks, not caches. If they are large, then check out the disk subsystem for configuration problems or things like raid rebuilds.
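    (For example, watching the NSD LUNs on the Linux NSD server with something like the following - iostat is from the sysstat package, and the device names are the ones from the iohist output above - would show whether the outliers are also visible at the OS level, below GPFS:)

    iostat -dx sdk sdl sdm 2
    # if the await column also jumps above 100 ms here, the delay is coming
    # from the disk subsystem itself rather than from GPFS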
    • SchelePierre
      20 Posts

      Re: mmdiag --iohist showing outliers on read traffic

      2012-04-21T10:40:06Z  in response to dlmcnabb
      @dlmcnabb:

      thank you for your response.
      You say the IO times are the actual times it takes to read data from the disk. Are you talking about the numbers on the NSD server (traffic marked as "srv") or on the NSD client (traffic marked as "cli"), or both?

      I'm confused on correctly interpreting the numbers:

      1) The numbers reported for the "cli" traffic are LOWER than the numbers for the "srv" traffic (i.e. actual disk reads) - but of course, before data ends up on the NSD client it is first read from disk and then sent over the network...

      2) I remember having read somewhere that NSD clients don't actually pull data from the NSD server's pagepool. Is that a correct statement? When is the NSD server's pagepool used in an NSD server - NSD client setup? Is all prefetching done on the NSD client? (In that case I'm even more confused about the numbers - see question 1.)
      Thanks a lot for your time!
      Pieter
      • dlmcnabb
        1012 Posts

        Re: mmdiag --iohist showing outliers on read traffic

        2012-04-22T05:07:49Z  in response to SchelePierre
        On a client node the iohist times are what the client sees for doing an IO, as far as it is concerned. If the IO goes to an NSD server, then it includes the network time between the nodes.

        NSD servers do not cache any data. The pagepool is just for buffers as the NSD worker threads read/write data to the underlying disk subsystem.

        Only client nodes decide what to prefetch and cache. They hold GPFS byte-range tokens on the files in order to know what they are allowed to cache.
        • SchelePierre
          20 Posts

          Re: mmdiag --iohist showing outliers on read traffic

          2012-04-22T09:29:17Z  in response to dlmcnabb
          Thanks.
          So let me get things straight:

          1) NSD clients do not read cached data from NSD servers, because data is not cached there. This means that if an NSD client asks an NSD server for data, this results in a disk read. So there is sort of a 1:1 mapping between mmdiag --iohist read requests as shown on the NSD client and as shown on the NSD server?

          2) NSD client mmdiag --iohist output includes both actual disk read and network transfer time between NSD server and NSD client

          This leaves me with the fact that the reported times on the NSD client (Windows GPFS 3.4.0.3) are SMALLER than the times reported on the NSD server (Linux GPFS 3.4.0.11). According to my interpretation of the above, these numbers should be BIGGER as NSD client reported times include server time spent on raw disk IO.
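          (A quick sanity check on the client-side numbers, assuming each entry really is one full 2 MiB block - 4096 sectors x 512 bytes - moved over a single 10GbE link:)

          awk 'BEGIN { printf "%.2f GB/s\n", 4096*512 / 0.0015 / 1e9 }'
          # prints ~1.40 GB/s, more than a 10GbE link (~1.25 GB/s raw) can carry,
          # so a 1.5 ms client-side figure for a full block already looks suspicious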
          Is mmdiag --iohist output on Windows reliable? (Being off by a factor of 10 could explain and correlate the numbers between NSD client and NSD server.)
          Pieter
          • dlmcnabb
            1012 Posts

            Re: mmdiag --iohist showing outliers on read traffic

            2012-04-22T16:21:36Z  in response to SchelePierre
            You are looking at averages that do not represent the same IO.

            Each IO request on a client will match exactly one request on an NSD server. If you have more than one NSD server, the client will on average be talking to each NSD server the same amount of time. If you have some fast and some slow NSD servers, then a particular client could on average have faster IO than a particular slow NSD server.
            • SchelePierre
              20 Posts

              Re: mmdiag --iohist showing outliers on read traffic

              2012-04-22T18:29:35Z  in response to dlmcnabb
              I'm sorry, I'm not following there.

              What I see is:
              On the NSD client: mmdiag --iohist contains one entry per IO operation, i.e. per read of N (mostly 4096) sectors (which would correspond to my filesystem block size of 2M), including such info as the NSD server read from, the NSD disk and the starting sector of the operation (read), plus the time in milliseconds (so it says). I'm not sure what you mean by "averages": to me it looks like there is one entry for every IO operation - is this correct?

              On the NSD server referenced in the above IO statistics line (ie the NSD server corresponding to the line in the client's mmdiag --iohist output), I see the actual physical disk read operation, the Linux device involved (/dev/sdX) and the time it takes to complete the read of 4096 sectors on the disk.
              The reported time for the physical disk read is 10x higher than the corresponding time entry on the NSD client, which I'm unable to interpret.

              I make the following assumptions in my reasoning:

              1) A read of 4096 sectors of an NSD from a specific NSD server gives rise to one line of output in "mmdiag --iohist" on the NSD client, and on the relevant NSD server (mentioned in the mmdiag --iohist line on the NSD client) I should see a corresponding read operation of the same disk:sector number.

              Am I correct so far, that is, is this way (correlating mmdiag --iohist output lines between NSD client and server) a good way of tracing back IO performance issues from NSD client to NSD server?

              2) If 1) is correct and NSD client mmdiag --iohist output includes both the physical disk read time and the network transfer time between NSD server and NSD client, then it should be physically impossible to observe lower IO completion times on the NSD client than on the NSD server for the same operation.

              So, comparing the mmdiag --iohist output, what I conclude by reasoning (point 2) contradicts what I actually observe for the same operation on the NSD server and the NSD client - which more than likely means I'm making false assumptions here.
  • osc
    11 Posts

    Re: mmdiag --iohist showing outliers on read traffic

    2012-05-01T20:24:59Z  in response to SchelePierre
    I see you are using 4M blocks on the linux boxes. What are the scsi device settings for :

    /sys/block/${i}/queue/max_sectors_kb
    /sys/block/${i}/queue/read_ahead_kb

    where $i = the /dev/sd?? devices. Linux will suffer greatly if these are smaller than the GPFS block size (and the defaults can be, depending on your device).
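    (A quick way to check them all, and raise them if needed - a sketch; adjust the sd* pattern to the NSD LUNs, and note that max_sectors_kb cannot be raised above max_hw_sectors_kb:)

    for q in /sys/block/sd*/queue; do
        echo "$q: max_sectors_kb=$(cat $q/max_sectors_kb) read_ahead_kb=$(cat $q/read_ahead_kb)"
    done
    # e.g. to allow IOs up to the 2 MB GPFS block size on one device:
    # echo 2048 > /sys/block/sdk/queue/max_sectors_kb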
  • db808
    86 Posts

    Re: mmdiag --iohist showing outliers on read traffic

    2013-09-04T21:00:39Z  in response to SchelePierre

    Was the root cause of the 910 millisecond network IOs ever identified?

    We're seeing similar network traffic patterns from our Windows Server 2008R2 10GbE-connected GPFS client nodes. mmdiag --iohist shows less than 4 millisecond times for most IO to the NSD servers, which is faster than possible for the disk storage being used. We're seeing bursts of 30-400 MB of good network traffic at over 800 MB/sec, then a delay of 901-914 milliseconds. The delay happens about 9-11 times per 512 IOs, reducing the overall throughput from 840 MB/sec to about 125 MB/sec.
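    (As a rough check that the stalls alone account for the drop - taking an assumed average burst of roughly 130 MB between stalls, somewhere inside the observed 30-400 MB range:)

    awk 'BEGIN { burst = 130; t = burst/840 + 0.91; printf "%.0f MB/s\n", burst/t }'
    # ~122 MB/s, so the ~910 ms pauses by themselves would explain the throughput we see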

    We have tested 4 GPFS client nodes connected to 2 NSD servers, with the NSDs alternated between the two NSD servers.  So we have a 4-to-2 network topology, with all nodes connected to the same network switch.  No ISLs involved.

    The 910 millisecond delays occur on both read and write, and to/from both NSD servers.

    It appears to be a network congestion problem where there is an n-to-1 burst of traffic. I suspect that the TCP congestion algorithms are throttling back the bandwidth (by reducing the TCP window size) and then slowly ramping back up. The 910 millisecond delay is the congestion recovery time.
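    (One way to test that theory - a sketch - would be to snapshot the TCP counters on the Linux NSD servers before and after a test run and see whether they grow in step with the stalls:)

    netstat -s | egrep -i 'retrans|timeout'
    # counters here increasing with the stalls would point at congestion
    # (incast) recovery rather than at slow disks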

    This all assumes that the IO times recorded in the GPFS iohistory are accurate.

    Is anyone else seeing these ~910 millisecond delays?

    Dave B