7 replies. Latest post: 2013-10-02T19:08:04Z by db808
zero.silikon
3 Posts

Pinned topic: Diagnosing slow metadata performance

2013-06-03T21:19:19Z

All,

Looking for some strategies to debug certain metadata-related performance issues within a small GPFS cluster.

 

Some initial information:

AIX 6.1.8.1

GPFS 3.5.0.9 (32TB filesystem, 6 filesets with no placement policy, 18 disks holding both metadata and data, connected to a DS8300 storage array via multiple 4Gb/s FC)

I'm having trouble tracking down an issue that has been plaguing the environment for some time: data writes and reads are excellent and within expected rates, but when scripts and other programs run things like 'ls -l' and operations of that nature, performance suffers.

How would one best go about determining what is causing this, and what are some potential ideas for remediation?

  • HajoEhlers
    253 Posts

    Re: Diagnosing slow metadata performance

    2013-06-05T16:23:12Z, in response to zero.silikon

    Check the size of the directories being accessed (all directories included in the user's PATH as well) and let us know.

    Also, the file cache settings might not be sufficient. Check via $ mmdiag --stats.

    Keep in mind that 18 disks (I assume 2 TB disk size) cannot support a large IOPS load (18 x 50 IOPS ≈ 900 IOPS, i.e. less than 1,000 IOPS).

    There is a nice wiki with tuning tips: https://www.ibm.com/developerworks/community/wikis/home/wiki/General%20Parallel%20File%20System%20(GPFS)?lang=en
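
    A minimal sketch of what I mean (the directory path is just a placeholder, and the exact mmdiag/mmlsconfig output varies by GPFS level):

        # Count entries in a suspect directory (unsorted, fast)
        $ ls -f /gpfs/fs1/some/dir | wc -l

        # Check the configured caching limits on the client node
        $ mmlsconfig pagepool
        $ mmlsconfig maxFilesToCache
        $ mmlsconfig maxStatCache

        # Compare against actual cache usage
        $ mmdiag --stats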

     

    cheers

    Hajo

    • zero.silikon
      3 Posts

      Re: Diagnosing slow metadata performance

      2013-06-06T23:28:58Z, in response to HajoEhlers

      The mmdiag stuff was the first thing I went to.  I've been working on this for a few weeks; perhaps I didn't properly convey that.  The paths are relatively shallow, no more than 10 subdirectories deep; however, there are hundreds of thousands of files in a handful of places.

      Also, referencing the 18 disks: I'm certain that I misrepresented them.  Those 18 disks (volumes) are attached to a DS8300 storage array in triplicate, and I'm certain that I have tens of thousands of IOPS at my disposal.  As a point of reference, there are also volumes from this storage array configuration attached to other "stuff" (Oracle ASM, etc.), and I can quite easily push a large volume of IOPS.

      I was hoping that others have run into this issue and could share alternative methods, or ideas I have yet to contemplate, for isolating the source of the slow behavior.

      I should stress again that this happens only on metadata-intensive operations...

      • HajoEhlers
        253 Posts

        Re: Diagnosing slow metadata performance

        2013-06-07T06:11:43Z, in response to zero.silikon

        1) > ... 18 disks holding both metadata and data

        Standard practice is to separate data and metadata. Think about what it means if your metadata is scattered across multi-TB LUNs...

        (Put 1,000 marbles on a football field and try to pick them up, versus putting them on your desk and picking them up.)

        2) > ... The paths are relatively shallow;

        I mean: check that the user's PATH variable does NOT include directories with a large number of files.

        3) I have seen that an online defrag can help with metadata operations, since it appears that subblocks scattered across LUNs get placed back together. YMMV.

        4) Monitor the DS8300, since it's the best place to see what's going on (LUN usage, I/O queues, and so on).

        Note: In case you can afford it: go for SSDs for your metadata. If that does not solve your problem, you have a real problem ^_^
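
        If you ever do get dedicated LUNs (or SSDs) for metadata, the rough idea is to mark those NSDs metadataOnly and the rest dataOnly, then restripe. A sketch only; the NSD names and the device are placeholders, and the exact disk-descriptor syntax differs between GPFS releases, so check the mmchdisk and mmrestripefs man pages first:

            # Mark a (placeholder) NSD as metadata-only, and a data NSD as data-only
            $ mmchdisk /dev/gpfs0 change -d "nsd_meta1:::metadataOnly"
            $ mmchdisk /dev/gpfs0 change -d "nsd_data1:::dataOnly"

            # Restripe so existing metadata actually migrates to the metadataOnly disks
            $ mmrestripefs /dev/gpfs0 -r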

        Happy weekend.

        Hajo

        • zero.silikon
          3 Posts

          Re: Diagnosing slow metadata performance

          2013-06-07T14:06:50Z, in response to HajoEhlers

          I'm not sure the analogy you provided for the metadata applies here; regardless of volume, GPFS knows where that metadata is stored and should be able to reference it quickly.  Having it dispersed among multiple volumes should benefit the cluster, as it reduces hotspots, bottlenecks, and things of that nature.

          As a point of reference, these users were working much more quickly with the same scripts performing the same routines: hundreds of records processed per second, versus fewer than 100 now (coming from a JFS2/NFS world).  The only thing that has changed (and I've verified this time and time again) is that they were relocated to a GPFS filesystem, to eliminate the need for all this data to be accessed via an NFS mount and to give them much better access to the data in terms of sheer speed and lower latency.

          Although I would enjoy using SSDs, that's a relative impossibility.  What I am trying to do is pinpoint where the marked decrease in performance lies and work back from there.  If it's determined that it is in fact the volumes that make up the cluster, then fine; but I would really appreciate understanding how people have, in the past, narrowed down a perceived decrease in performance using whatever tools were at their disposal, rather than condemning the current configuration, hardware, or even software without empirical evidence to support the condemnation.

           

          • HajoEhlers
            253 Posts

            Re: Diagnosing slow metadata performance

            2013-06-07T15:14:31Z, in response to zero.silikon

            How are your clients connected? Directly, or via NSD servers?

            Have you checked the cache usage via mmdiag ($ mmdiag --stats)? What's not in the cache must be loaded from disk.

            Have you checked for GPFS waiters during times of slow access, or even monitored the waiters continuously (mmlsnode -N waiters -L, mmdiag --waiters, mmdiag --iohist)? If something must be read from disk, how long does it take? (A small sampling loop is sketched below.)

            Have you monitored your DSxxxx for read/write sizes, queue sizes, wait times, and so on?

            Have you thought about using dedicated LUN(s) for the metadata, in order to

            1) avoid interference with large reads/writes going to the same LUN, and

            2) keep the metadata close together, where it might benefit from read-ahead and dedicated LUN caching?
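
            For the waiters/iohist checks, something as simple as the loop below (run on a client node while a slow 'ls -l' is in progress) is usually enough. Just a sketch; it assumes the default /usr/lpp/mmfs/bin install path:

                # Sample GPFS waiters and recent I/O history every few seconds
                while :; do
                    date
                    /usr/lpp/mmfs/bin/mmdiag --waiters
                    /usr/lpp/mmfs/bin/mmdiag --iohist | tail -20
                    sleep 5
                done > /tmp/gpfs_waiters.out 2>&1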

            Note

            > I'm certain that I have tens of thousands of IOPS at my disposal. 

            Start calculating and verifying who is using them, and when ...

            Happy weekend

            Hajo

             

             

            • dlmcnabb
              1012 Posts

              Re: Diagnosing slow metadata performance

              2013-06-07T17:55:25Z, in response to HajoEhlers

              GPFS is not the same as NFS, since NFS has a single node providing a common directory manager. GPFS is distributed, and the nodes only communicate changes to directories and files via disk IO.

              If you have a very large directory that many nodes are making changes to, then things like readdir (ls) can take a long time revoking tokens held by other nodes and forcing disk IO.

              GPFS has implemented what is called Fine Grain Directory Locking (FGDL), which allows many nodes to lock particular entries in a directory, using the metanode as the real manager of the directory. FGDL allows name lookups, creation, and deletion of files in a directory to be very fast. However, if you run readdir, rename (mv), or mkdir commands in that directory, it has to revoke the FGDL tokens from all the other nodes, and everything slows down.

              To get better performance, it is much preferable to design the applications on each node to work in their own subdirectories.
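
              For example (a sketch only; the mount point and layout are made up), each node confining its writes to a node-specific subdirectory keeps the directory tokens local and avoids most of the cross-node revocation on a single hot directory:

                  # Hypothetical shared parent directory on the GPFS file system
                  WORKDIR=/gpfs/fs1/work/$(hostname)
                  mkdir -p "$WORKDIR"
                  # ... the application then creates and updates files only under $WORKDIR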

              The only other way to make things work faster is to have high speed metadata disks (SSD) so that the IO is much faster.

               

              • db808
                86 Posts

                Re: Diagnosing slow metadata performance

                2013-10-02T19:08:04Z, in response to dlmcnabb

                I'd like to jump in.

                The original poster indicated that performance has dropped to below 100 file operations per second .... using 18+ disks.  He is not complaining about thousands or tens of thousands of file operations per second ... but sub-100 operations per second.  He suspects something is broken (perhaps external to GPFS), and/or grossly misconfigured and is looking for troubleshooting guidance.

                Unfortunately, the poster has not responded to requests for more information about the configuration, so we cannot see if there is some grossly misconfigured value.

                I think that HajoEhlers asked the critical question: how are the clients connected?  Direct FC connections to all disks, or through a network to NSD servers?  Since the poster mentioned using NFS before, there is a critical point that the clients may be connecting via the network ... and apparently this may be the first time that such network connections have been stressed, leading to networking issues potentially restricting these file operations.

                If the clients are accessing the files via an NSD server, how fast do similar scripts run when executed on the NSD servers themselves?  If performance there is "normal", then it is likely a client-to-NSD-server connection issue.
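
                A crude but effective comparison (the path is a placeholder) is to time the same metadata-heavy command on a client node and on an NSD server against the same directory:

                    # Run on a client node, then repeat on an NSD server
                    $ time ls -l /gpfs/fs1/big/dir > /dev/null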

                I have found value in looking at the mmdiag --iohist output ... from the client perspective.  You may want to increase the iohist size from 512 to 1024 to get a longer history.
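
                If I remember correctly, the history depth is controlled by the ioHistorySize configuration parameter; treat that name as an assumption and verify it against your GPFS 3.5 documentation before changing anything:

                    # Assumed parameter name: ioHistorySize (verify in your release's docs)
                    $ mmchconfig ioHistorySize=1024 -i

                    # Then capture the client-side I/O history during a slow period
                    $ mmdiag --iohist > /tmp/iohist.client.out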

                When I was having performance issues with some network-connected clients, the client-side iohist was valuable.  It showed many dozens of requests with very short times (a few milliseconds, and very good performance during this period) and then a "pause", where there were 1 or 2 requests with very long times (hundreds of milliseconds).  This only happened about 1 out of 50 to 1 out of 100 times.  The overall performance impact of the delay was more than 4-fold in my case, but your mileage will vary.

                With the GPFS iohist able to mark the exact time of the pauses, we were then also able to run some packet traces and correlate the network activity with when the "pause" occurred.  It ended up being some packet drops that caused a TCP-level backoff and recovery.  When we diagnosed and corrected the networking issues, the performance anomalies disappeared.

                I have also seen mtime and atime settings have a dramatic impact on performance, as you end up posting timestamp changes cluster-wide and invalidating cached metadata.

                I would also suggest double-checking the GPFS file system mount options, and comparing them to the previous NFS mounts that were being used.  It is likely that the NFS mount had some "go fast" options enabled, and those same options are not enabled on the GPFS file system.
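
                On the GPFS side, the closest equivalents to the usual NFS "go fast" options are the atime/mtime attributes. A hedged example of checking and relaxing them (the device name is a placeholder, and whether suppressing atime or relaxing exact mtime is acceptable depends on your applications):

                    # Show the current atime-suppression (-S) and exact-mtime (-E) settings
                    $ mmlsfs /dev/gpfs0 -S -E

                    # Suppress atime updates and relax exact mtime, if the applications tolerate it
                    $ mmchfs /dev/gpfs0 -S yes -E no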

                You can easily explain sub-100 file operations per second ... from a single thread ... if the metadata is not cached.  Most of these kinds of file operations need to do at least a (get directory entry) and a (read inode) operation per file.  If those pieces of metadata are not cached, you could have 2 metadata IOs per file.  If a single 2 TB disk does 70 IOPS ... then you get about 35 file operations per second ... per thread.

                GPFS has some patented inode prefetch algorithms, but they don't kick in on small directories.  In my case, one of our GPFS clusters has an average of 6 files per directory, and we get just about (1 + 1/6) metadata IOs per file, plus the data IOs, if there are any.  If you have good inode caching, the grandparent and higher-level directory nodes will hopefully be cached.

                When you were experiencing thousands of file operations per second previously (under NFS), you were either accessing cached directory entries and cached inodes, running hundreds of threads in parallel, and/or keeping the metadata on very fast storage devices.  The effectiveness of client-side caching (and how often the client-side cache is invalidated) depends on the client-side GPFS configuration and the atime and mtime settings.  Less-than-efficient settings of these GPFS parameters would have been identified from the GPFS configuration information that was never provided.

                Hope this helps.