Topic
  • 7 replies
  • Latest Post - ‏2013-10-02T19:08:04Z by db808
zero.silikon
3 Posts

Pinned topic Diagnosing slow metadata performance

‏2013-06-03T21:19:19Z |

All,

Looking for some strategies to debug certain metadata-related performance issues within a small GPFS cluster.

 

Some initial information:

AIX 6.1.8.1

GPFS 3.5.0.9 (32TB filesystem, 6 filesets with no placement policy, 18 disks holding both metadata and data, connected to a DS8300 storage array via multiple 4Gb/s FC)

I'm having trouble pinning down a problem that has been plaguing the environment for some time: data reads and writes are excellent and within expected rates, but when scripts and other programs run metadata-heavy operations such as 'ls -l', performance suffers.

How would one best go about determining what is causing this, and what are some potential ideas for remediation?

  • HajoEhlers
    253 Posts

    Re: Diagnosing slow metadata performance

    ‏2013-06-05T16:23:12Z  

    Check the size of the directories being accessed (all directories included in the users' PATH as well) and let us know.

    Also, the file cache settings might not be sufficient. Check via $ mmdiag --stats

    Keep in mind that 18 disks (I assume 2 TB disk size) are not able to support a large IOPS load (18 x 50 IOPS is roughly 1000 IOPS at most).

    The GPFS wiki at https://www.ibm.com/developerworks/community/wikis/home/wiki/General%20Parallel%20File%20System%20(GPFS)?lang=en is worth a look; it includes tuning tips.
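
    For example, something along these lines could collect both data points; the /gpfs/fs1/app/data path is only a placeholder for wherever your slow scripts actually run:

      # entry counts for the directories the slow scripts touch, plus everything on PATH
      # ('ls -f' skips sorting, and without -l no per-file stat is needed, so the count itself stays cheap)
      for d in /gpfs/fs1/app/data $(echo "$PATH" | tr ':' ' '); do
          [ -d "$d" ] && printf '%10d  %s\n' "$(ls -f "$d" | wc -l)" "$d"
      done

      # GPFS file cache / stat cache counters on this node
      mmdiag --stats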

     

    cheers

    Hajo

  • zero.silikon
    3 Posts

    Re: Diagnosing slow metadata performance

    ‏2013-06-06T23:28:58Z  


    The mmdiag output was the first thing I went to; I've been working on this for a few weeks, which perhaps I didn't properly convey. The paths are relatively shallow, no more than 10 subdirectories deep, though a handful of directories hold files numbering in the hundreds of thousands.

    Also, regarding the 18 disks: I'm certain I misrepresented that. Those 18 disks (volumes) are attached to a DS8300 storage array in triplicate, and I'm certain that I have tens of thousands of IOPS at my disposal. As a point of reference, other volumes from this same storage array configuration are attached to other workloads (Oracle ASM, etc.), and there I can quite easily push a large volume of IOPS.

    I was hoping that others have run into this issue and could share insight, or alternative methods I have yet to contemplate, for isolating the source of the slow behavior.

    I should stress again that this happens only on metadata-intensive operations.

  • HajoEhlers
    253 Posts

    Re: Diagnosing slow metadata performance

    ‏2013-06-07T06:11:43Z  


    1) > ... 18 disks holding both metadata and data

    Standard practice is to separate data and metadata. Think about what it means if your metadata is scattered across multi-TB LUNs (a quick way to check the current layout is sketched below).

    (Put 1000 marbles on a football field and try to pick them up, versus putting them on your desk and picking them up.)

    2) > ... The paths are relatively shallow;

    What I meant is that the user PATH variable should NOT include directories with a large number of files.

    3) I have seen an online defrag help metadata operations, since it appears that subblocks scattered across the LUNs get placed back together. YMMV.

    4) Monitor the DS8300, since it is the best place to see what is going on (LUN usage, I/O queue depth, and so on).

    Note: in case you can afford it, go for SSDs for your metadata. If that does not solve your problem, you really have a problem ^_^
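
    As a rough sketch from the GPFS side (the file system name fs1 is just a placeholder, and I am going from memory on the mmdefragfs query flag, so verify against the man page first):

      # which NSDs currently hold metadata vs. data, and how full each one is
      mmlsdisk fs1
      mmdf fs1

      # report the current fragmentation only (query mode), without defragmenting anything
      mmdefragfs fs1 -i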

    Happy weekend.

    Hajo

  • zero.silikon
    3 Posts

    Re: Diagnosing slow metadata performance

    ‏2013-06-07T14:06:50Z  


    I'm not sure the analogy you provided for the metadata applies here; regardless of which volume it lands on, GPFS knows where that metadata is stored and should be able to reference it quickly. Having it dispersed among multiple volumes should benefit the cluster, as it reduces hotspots, bottlenecks, and the like.

    As a point of reference, these users were working much faster with the same scripts performing the same routines: hundreds of records processed per second, versus fewer than 100 now (coming from a JFS2/NFS world). The only thing that has changed (and I've verified this time and time again) is that the data was relocated to a GPFS filesystem, to eliminate the need to access it over an NFS mount and to give the users more speed and lower latency.

    Although I would enjoy using SSDs, that's essentially impossible here; what I am trying to do is pinpoint where the marked decrease in performance occurs and work back from there. If it turns out that the volumes backing the cluster are in fact the cause, fine -- but I would really appreciate hearing how others have narrowed down a perceived decrease in performance using whatever tools were at their disposal, rather than condemning the current configuration, hardware, or software without empirical evidence to support the condemnation.

     

  • HajoEhlers
    253 Posts

    Re: Diagnosing slow metadata performance

    ‏2013-06-07T15:14:31Z  


    How are your clients connected? Directly (SAN-attached) or via NSD servers?

    Have you checked the cache usage via mmdiag (mmdiag --stats)? Whatever is not in the cache must be loaded from disk.

    Have you checked for GPFS waiters during times of slow access, or even monitored the waiters continuously (mmlsnode -N waiters -L, mmdiag --waiters, mmdiag --iohist)? If something must be read from disk, how long does it take? (Example invocations are sketched below.)

    Have you monitored your DS8300 with regard to read/write sizes, queue depths, wait times, and so on?

    Have you thought about using dedicated LUN(s) for the metadata, in order to

    1) avoid interference from large reads/writes going to the same LUN, and

    2) keep the metadata close together, which may benefit from read-ahead and dedicated LUN caching?

    Note:

    > I'm certain that I have tens of thousands of IOPS at my disposal.

    Start calculating, and verify who is using them, and when.
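
    For example, a small capture loop like the following could be left running while one of the slow 'ls -l' jobs is active (the log path and interval are only placeholders):

      # snapshot the waiters and the recent I/O history every few seconds on the slow node;
      # stop it with 'kill %1' once the test run is finished
      while true; do
          date
          mmdiag --waiters
          mmdiag --iohist | tail -40
          sleep 5
      done > /tmp/gpfs_slow_ls.log 2>&1 &

      # cluster-wide view of the current waiters, from any node
      mmlsnode -N waiters -L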

    Happy weekend

    Hajo

     

     

  • dlmcnabb
    1012 Posts

    Re: Diagnosing slow metadata performance

    ‏2013-06-07T17:55:25Z  


    GPFS is not the same as NFS: NFS has a single node acting as the common directory manager, whereas GPFS is distributed and the nodes communicate changes to directories and files only via disk IO.

    If you have a very large directory that many nodes are making changes to, then operations like readdir (ls) can take a long time, because they must revoke tokens held by other nodes and force disk IO.

    GPFS implements what is called Fine Grain Directory Locking (FGDL), which allows many nodes to lock particular entries in a directory, using the metanode as the real manager of the directory. FGDL allows name lookups, creation, and deletion of files in a directory to be very fast. However, if you run readdir, rename (mv), or mkdir commands in that directory, GPFS has to revoke the FGDL tokens from all the other nodes and everything slows down.

    To improve performance, it is much preferable to design the applications so that each node works in its own subdirectories.
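
    For instance, a minimal sketch of that pattern (the /gpfs/fs1/work path is purely hypothetical):

      # each node, and each worker process, creates and stays inside its own subdirectory,
      # so the FGDL tokens on the shared parent directory rarely have to be revoked
      WORKDIR=/gpfs/fs1/work/$(uname -n)/$$
      mkdir -p "$WORKDIR"
      cd "$WORKDIR" || exit 1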

    The only other way to make things faster is to have high-speed metadata disks (SSDs) so that the IO itself is much faster.

     

  • db808
    86 Posts

    Re: Diagnosing slow metadata performance

    ‏2013-10-02T19:08:04Z  

    I'd like to jump in.

    The original poster indicated that performance has dropped to below 100 file operations per second ... using 18+ disks. He is not complaining about thousands or tens of thousands of file operations per second, but about sub-100 operations per second. He suspects something is broken (perhaps external to GPFS) and/or grossly misconfigured, and is looking for troubleshooting guidance.

    Unfortunately, the poster has not responded to requests for more information about the configuration, so we cannot see whether there is some grossly misconfigured value.

    I think the critical question has already been asked: how are the clients connected? Direct FC connections to all disks, or through a network to NSD servers? Since the poster mentioned using NFS before, it is a critical point that the clients may be connecting over the network ... and this may be the first time those network connections have been stressed, so networking issues could be what is restricting these file operations.

    If the clients are accessing the files via an NSD server, how fast do similar scripts run when executed on the NSD servers themselves? If performance there is "normal", then the issue is likely in the client-to-NSD-server connection.
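
    One way to check, assuming I recall the mmlsnsd option correctly (and with /gpfs/fs1/big_dir standing in for one of the directories holding hundreds of thousands of files):

      # does this node see the NSDs as local SAN devices, or only through NSD servers?
      mmlsnsd -m

      # run the same metadata-heavy operation on a client and on an NSD server and compare
      time ls -l /gpfs/fs1/big_dir > /dev/null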

    I have found value in looking at the mmdiag --iohist output ... from the client's perspective. You may want to increase the iohist size from 512 to 1024 entries to get a longer history.
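
    If I remember the tunable's name correctly it is ioHistorySize; please verify with mmdiag --config before changing anything:

      # enlarge the per-node I/O history buffer, effective immediately, then dump it
      mmchconfig ioHistorySize=1024 -i
      mmdiag --iohist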

    When I was having performance issues with some network-connected clients, the client-side iohist was valuable. It showed many dozens of requests with very short times (a few milliseconds, with very good performance during that stretch), and then a "pause" where 1 or 2 requests took very long times (hundreds of milliseconds). This only happened about 1 in 50 to 1 in 100 requests, yet the overall performance impact of the delays was more than 4-fold in my case; your mileage will vary.

    Because the GPFS iohist marks the exact time of the pauses ... we were then also able to run some packet traces and correlate the network activity with when each "pause" occurred. It ended up being packet drops that caused a TCP-level backoff and recovery. Once we diagnosed and corrected the networking issues, the performance anomalies disappeared.

    I have also seen mtime and atime settings have a dramatic impact on performance, since you end up posting timestamp changes cluster-wide and invalidating cached metadata.

    I would also suggest double-checking the GPFS file system mount options and comparing them to the previous NFS mounts. It is likely that the NFS mounts had some "go fast" options enabled, and those same options are not enabled on the GPFS file system.
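
    In particular, the atime/mtime behaviour is worth confirming; assuming fs1 as a placeholder device name:

      # is exact mtime tracking enabled, and are atime updates suppressed?
      mmlsfs fs1 -E
      mmlsfs fs1 -S

      # atime updates can be suppressed (and exact mtime relaxed) if the applications allow it:
      # mmchfs fs1 -S yes
      # mmchfs fs1 -E no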

    You can easily explain sub-100 file operations per second ... from a single thread ... if the metadata is not cached. Most of these file operations need to do at least one (get directory entry) and one (read inode) operation per file. If those pieces of metadata are not cached, you could have 2 metadata IOs per file. If a single 2TB disk does 70 IOPS ... then you get 35 file operations per second ... per thread.
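
    Spelling out the back-of-the-envelope numbers (70 IOPS per 2TB spindle is only the assumption used above):

      # 2 metadata reads per file, one spindle, one thread
      echo $(( 70 / 2 ))            # ~35 files/second
      # even if a single-threaded 'ls -l' could drive all 18 LUNs in parallel (it cannot)
      echo $(( 70 * 18 / 2 ))       # ~630 files/second upper bound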

    GPFS has some patented inode prefetch algorithms, but they don't kick in on small directories. In my case, one of our GPFS clusters has an average of 6 files per directory, and we see just about (1 + 1/6) metadata IOs per file, plus the data IOs, if any. With good inode caching, the grandparent and higher directory inodes will hopefully be cached.

    When you were getting thousands of file operations per second previously (under NFS), you were either accessing cached directory entries and cached inodes, running hundreds of threads in parallel, and/or keeping the metadata on very fast storage devices. The effectiveness of client-side caching (and how often the client-side cache is invalidated) depends on the client-side GPFS configuration and on the atime and mtime settings. Inefficient settings of these GPFS parameters could be identified from the GPFS configuration information that was never provided.
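
    The usual suspects to collect from a slow client would be something like the following (these are the standard GPFS tunable names; the grep pattern is just for convenience):

      # client-side cache sizing that governs how many inodes and directory entries stay cached
      mmdiag --config | egrep -i 'pagepool|maxFilesToCache|maxStatCache'

      # cluster-wide configured (non-default) values
      mmlsconfig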

    Hope this helps.