Pinned topic: mmtrace / direct IO : how to understand
5 replies. Latest post 2013-07-22T14:40:12Z by dlmcnabb

ufa
2013-07-15T12:59:38Z

Dear all,

I am trying to understand a GPFS trace from an AIX Oracle RAC system (I cannot read these traces easily yet, hence my questions). The background is that the system exhibits considerable I/O times for the redo log sync writes.

Direct IO is used.

I have a short trace (default classes and level) that I ran through trsum.awk. Picking out just the records with the disk names of the redo log file system, I get sections like

70124221       0.227214726 QIO: write data tag 331543 14823 buf 0x0700000020AE8E00 disk 8000001200000132 da 37:51170176 nSectors 79 align 0 by (Direct I/O handler) qlen 1
70124221       0.227289218 QIO: write data tag 331543 14823 buf 0x0700000020AE8E00 disk 8000001200000079 da 17:51170304 nSectors 79 align 0 by (Direct I/O handler) qlen 1
70124221       0.231426960 FIO: write data tag 331543 14823 buf 0x0700000020AE8E00 disk 8000001200000132 da 37:51170176 nSectors 79 qlen 0 duration 4212.234 us
70124221       0.231441511 FIO: write data tag 331543 14823 buf 0x0700000020AE8E00 disk 8000001200000079 da 17:51170304 nSectors 79 qlen 0 duration 4152.293 us

I wonder what is involved in these writes: the QIO records do have everything needed to send the request to the SCSI layer, and the request is so small it should easily be absorbed by the write cache of the storage system. Nevertheless it takes more than 4ms to come back.

Am I right that GPFS itself is not involved in servicing these requests between the QIO and the FIO record? I.e., if one sees long durations there, GPFS cannot be blamed and the sources of delay are further down in the stack? Or could there still be something odd in GPFS?

I realise QIO comes with (unbuffered?) writes, DIOQIO with (unbuffered?) reads, and FIO seems to be related to the completion of the I/O.

EDIT / UPDATE Jul 16 2013: I found the assumption that the difference between QIO and FIO is the (GPFS-independent) block device I/O time confirmed in another post: https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014213603.
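This also checks out numerically against the records above: 0.231426960 - 0.227214726 = 0.004212234 s = 4212.234 us, which is exactly the duration printed on the first FIO line. For picking such pairs out in bulk, here is a quick-and-dirty sketch (it only assumes the trsum output format shown above; the file name trsum.out is made up) that pairs QIO and FIO records by tag and disk:

import re

qio_time = {}  # (tag, disk) -> timestamp of the matching QIO record, in seconds

with open("trsum.out") as f:                      # hypothetical file name
    for line in f:
        m = re.search(r"(QIO|FIO): write data tag (\d+ \d+) .* disk (\S+)", line)
        if not m:
            continue
        kind, tag, disk = m.groups()
        tstamp = float(line.split()[1])           # second column: seconds into the trace
        if kind == "QIO":
            qio_time[(tag, disk)] = tstamp
        elif (tag, disk) in qio_time:
            elapsed_us = (tstamp - qio_time.pop((tag, disk))) * 1e6
            print(f"tag {tag} disk {disk}: {elapsed_us:.3f} us from QIO to FIO")

For the four lines above this prints 4212.234 us and 4152.293 us, matching the durations trsum already puts on the FIO lines.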

As for the rdwr entries in the trsum output, that appears to be the time a file was held open in RDWR. E.g., looking in the trace for things related to

91161229       0.024665781 rdwr                    23855.442 us +

I find

307  91161229    0.000810339       0.005214   MMFS VNODE: gpfs_v_rdwr enter: vP 0xF1000A0425A10130 op 0 flags 0x0C400003 offset 0x000000012577A000 len 8192 ext 0 vinfoP 0xF1000A043B71A180

where 0.024665781 - (23855.442 / 1000000) = 0.000810339

and at 0.024665781 the file might have been closed again:

307  91161229    0.024665781       0.003791   MMFS VNODE: gpfs_v_rdwr exit: new offset 0x000000012577C000 resid 0 code 0 rc 0
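So the rdwr line in the trsum summary is just the exit timestamp minus the duration, pointing back at the enter record. A small sketch of how one might match these pairs in the raw trace per thread (assuming the column layout shown above; the file name trcrpt.out is made up):

enter_ts = {}  # thread ID -> timestamp of its last gpfs_v_rdwr enter

with open("trcrpt.out") as f:                     # hypothetical file name
    for line in f:
        if "gpfs_v_rdwr" not in line:
            continue
        cols = line.split()
        tid, tstamp = cols[1], float(cols[2])     # column layout as in the lines above
        if "gpfs_v_rdwr enter" in line:
            enter_ts[tid] = tstamp
        elif "gpfs_v_rdwr exit" in line and tid in enter_ts:
            held_us = (tstamp - enter_ts.pop(tid)) * 1e6
            print(f"thread {tid}: rdwr held for {held_us:.3f} us")

For the two lines above this gives 0.024665781 - 0.000810339 = 0.023855442 s = 23855.442 us, i.e. the value on the rdwr summary line.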
 

Some questions remain:

Which queue length is given by qlen? Is it the queue of the respective block device? I suppose so.

Is there a way to relate a WRITE or a READ to the I/O records it generates? I could check the disk and sector the I/O goes to, find the file occupying that location, and get the file names related to the inodes in the high-level request (BTW: how do I see the file system from the high-level request?). But there might be a simpler way.

 

ufa

Updated on 2013-07-16T12:04:25Z by ufa
  • gcorneau

    Re: mmtrace / direct IO : how to understand

    2013-07-15T22:20:21Z, in response to ufa

    While not an exact answer to your question, have you seen the trsum.awk script shipped with GPFS in /usr/lpp/mmfs/samples/debugtools?

    It's a very useful script for looking at I/O times in conjunction with a GPFS I/O trace (mmtrace trace=io).

    Give it a shot and see if the output answers some of your questions.

    • ufa

      Re: mmtrace / direct IO : how to understand

      2013-07-15T23:06:09Z, in response to gcorneau

      Dear gcorneau, thank you for that suggestion.

      I have used that trsum script a few times so far; it is indeed very nice, but in the end it just evaluates what is in the trace, and I want to better understand what that actually is or means. My question is aimed at finding out whether it could be GPFS's fault if I see long I/O times of the kind I indicated (I guess not, but I want to be sure).

      As always, people are blaming GPFS and telling me "the storage is flawless" and so on - which does not seem to be the case here.

      As for the question of how I could find, from a file I/O operation, the related block I/O operations: there are suspicions that, with the direct I/O used by the DB system and the (not well matching) block sizes in Oracle and on the GPFS file system, we might have unaligned I/O. I am not sure how to verify whether there is unaligned I/O - would there be an error message, or would one just see one block I/O request more than expected from the size of the file I/O request?

      ufa

      • dlmcnabb

        Re: mmtrace / direct IO : how to understand

        2013-07-19T16:02:19Z, in response to ufa

        QIO - FIO is all outside GPFS control. So slowness here is due to (network if NSD)/driver/controller/disk.

        Trsum calculates the "duration" (and adds it to the FIO line) and adds that to the disk I/O histograms. The qlen shown on the QIO line is how many I/Os are already outstanding to the device, according to the trace.

        The trace lines you show indicate that you have replication of data blocks. DIO in this case has to be done by a GPFS daemon thread, since it has to force a log entry saying that it is doing two writes in parallel. When the two I/Os finish, another log record is spooled (not forced) saying that the I/O is done. This is for recovery log replay in case the node dies in the middle of the I/O. If recovery replay does not see a "done" record, it will copy one of the blocks over the other so that both replicas are the same.

        Since the DIO was handled as DIO, it does meet the GPFS requirements of having sector-aligned (512B) I/O requests and sector-aligned memory addresses. If they were not, you would see a trace containing the words "dio .. pretest" showing the unaligned parts. If the pretest fails, GPFS follows the non-DIO path (like a pagepool-buffered request), followed by an FSYNC trace to force the modified sectors to disk.
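        In other words, the pretest boils down to a check along these lines (just a sketch of the rule as described above, not actual GPFS code):

        SECTOR = 512

        def dio_pretest(offset, length, buf_addr):
            # Offset, length and buffer address must all be multiples of the
            # 512-byte sector size, otherwise the request falls back to the
            # buffered (non-DIO) path.
            return offset % SECTOR == 0 and length % SECTOR == 0 and buf_addr % SECTOR == 0

        # Values taken from the trace lines earlier in the thread (combined here only as an example):
        print(dio_pretest(offset=0x000000012577A000, length=8192, buf_addr=0x0700000020AE8E00))  # True
        print(dio_pretest(offset=0x000000012577A064, length=8192, buf_addr=0x0700000020AE8E00))  # False: offset not sector aligned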

        The only way to match an rdwr-enter/READ or WRITE/rdwr-exit with an I/O is to match the inode number from the READ or WRITE line to the first tag on the QIO/FIO lines. For non-DIO, the QIO/FIO lines can happen at very different times after the actual READ/WRITE request.

        QIO: write data tag 331543 14823

        The tag numbers are the inode number and the block number of that file. If there was a lookup for that inode, it will be at the end of the trsum output. Or you can use tsfindinode to get the pathname for a particular inode.
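        For example, you could group the QIO/FIO records by that first tag number and then resolve each inode to a path afterwards (sketch only; it assumes the record format quoted above, and the file name trsum.out is made up):

        import re
        from collections import defaultdict

        ios_by_inode = defaultdict(list)

        with open("trsum.out") as f:              # hypothetical file name
            for line in f:
                m = re.search(r"(?:QIO|FIO): \w+ data tag (\d+) (\d+)", line)
                if m:
                    inode = int(m.group(1))       # first tag number = inode
                    ios_by_inode[inode].append(line.rstrip())

        for inode, recs in sorted(ios_by_inode.items()):
            # Map inode -> pathname via the lookup section at the end of the
            # trsum output, or with tsfindinode, as mentioned above.
            print(f"inode {inode}: {len(recs)} QIO/FIO records")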

        • ufa

          Re: mmtrace / direct IO : how to understand

          2013-07-21T08:04:57Z, in response to dlmcnabb

          Hi dlmcnabb,

          Thank you very much for your answer. So GPFS does not require direct I/O to be aligned to the file system sub-blocks (which are 2 KiB in my case), but just to 512B sectors?

          ufa

          • dlmcnabb

            Re: mmtrace / direct IO : how to understand

            2013-07-22T14:40:12Z, in response to ufa

            Correct. 512B alignment.