Topic
  • 13 replies
  • Latest Post - 2016-02-12T17:40:34Z by AcedoM

zero.silikon
3 Posts

Pinned topic Diagnosing slow metadata performance

2013-06-03T21:19:19Z

All,

Looking for some strategies to debug certain metadata-related performance issues within a small GPFS cluster.

 

Some initial information:

AIX 6.1.8.1

GPFS 3.5.0.9 (32TB filesystem, 6 filesets with no placement policy, 18 disks holding both metadata and data, connected to a DS8300 storage array via multiple 4Gb/s FC)

I'm having trouble getting to the bottom of a problem that's been plaguing the environment for some time: data writes and reads are excellent and within expected rates, but when scripts and other programs run things like 'ls -l', performance suffers.

How would one best go about determining what is causing this, and what are some potential ideas for remediation?

  • HajoEhlers
    253 Posts

    Re: Diagnosing slow metadata performance

    2013-06-05T16:23:12Z

    Check the size of the directories being accessed (all directories included in the user's PATH as well) and let us know.

    Also, the file cache settings might not be sufficient. Check via $ mmdiag --stats

    Keep in mind that 18 disks (I assume 2 TB disk size) cannot support large IOPS requests (18 x 50 IOPS <= 1000 IOPS).

    There is a nice wiki, https://www.ibm.com/developerworks/community/wikis/home/wiki/General%20Parallel%20File%20System%20(GPFS)?lang=en, that includes tuning tips.
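
    For reference, a minimal sketch of checking the cache statistics and the related settings, assuming the GPFS 3.5 command set; the mmchconfig values and the node class name "clients" are illustrative only, not recommendations:

    # Hit/miss statistics for the file cache and stat cache
    mmdiag --stats

    # Currently configured cache sizes
    mmlsconfig pagepool maxFilesToCache maxStatCache

    # Illustrative change on the client nodes (hypothetical node class "clients");
    # maxFilesToCache/maxStatCache changes take effect after GPFS is restarted there
    mmchconfig maxFilesToCache=10000,maxStatCache=40000 -N clients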

     

    cheers

    Hajo

  • zero.silikon
    3 Posts

    Re: Diagnosing slow metadata performance

    2013-06-06T23:28:58Z

    The mmdiag output was the first thing I went to; I've been working on this for a few weeks, so perhaps I didn't properly convey that. The paths are relatively shallow, no more than 10 subdirectories deep; however, a handful of directories contain files numbering in the hundreds of thousands.

    Also, regarding the 18 disks: I'm certain I misrepresented that. Those 18 disks (volumes) are attached to a DS8300 storage array in triplicate, and I'm confident I have tens of thousands of IOPS at my disposal. As a point of reference, other volumes from this same storage array configuration are attached to other workloads (Oracle ASM, etc.), and I can quite easily push a large volume of IOPS there.

    I was hoping that others have run into this issue and can share alternative methods, or ideas I have yet to contemplate, for isolating the source of the slow behavior.

    I should stress again that this happens only on metadata-intensive operations...
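
    One way to substantiate the "tens of thousands of IOPS at my disposal" claim while the slow scripts run is to watch the hdisk service and queue times from the AIX side; a rough sketch (the interval and count are arbitrary, and flag support should be verified on AIX 6.1):

    # Extended per-disk statistics for the hdisks backing the GPFS NSDs:
    # watch read/write service times and queue wait times during a slow 'ls -l'
    iostat -D 5 12

    # GPFS view of the same interval on this node
    mmdiag --iohist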

  • HajoEhlers
    253 Posts

    Re: Diagnosing slow metadata performance

    2013-06-07T06:11:43Z

    1) > ... 18 disks holding both metadata and data

    Standard practice is to separate data and metadata; think about what it means if your metadata is scattered across multi-TB LUNs (a sketch of how metadata can be moved to dedicated disks follows below).

    (Put 1000 marbles on a football field and try to pick them up, or put them on your desk and pick them up.)

    2) > ... The paths are relatively shallow;

    What I meant was that the user's PATH variable should NOT include directories with a large number of files.

    3) I have seen that an online defrag can help metadata operations, since it appears that subblocks scattered across LUNs get placed back together. YMMV.

    4) Monitor the DS8300, since it is the best place to see what is going on (LUN usage, I/O queue depth, and so on).

    Note: If you can afford it, go for SSDs for your metadata. If that does not solve your problem, you have a problem ^_^
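
    A rough sketch of the metadata-separation step from point 1, assuming GPFS 3.5 stanza syntax; the device, NSD, server, and file system names (hdisk20, meta01, nsdserver1, gpfs1) are placeholders, and the exact mmchdisk syntax for changing the existing disks to dataOnly should be taken from its man page:

    # Stanza file (meta.stanza) describing new metadata-only NSDs in the system pool
    %nsd: device=/dev/hdisk20 nsd=meta01 servers=nsdserver1 usage=metadataOnly failureGroup=10 pool=system
    %nsd: device=/dev/hdisk21 nsd=meta02 servers=nsdserver2 usage=metadataOnly failureGroup=11 pool=system

    # Create the NSDs and add them to the file system
    mmcrnsd -F meta.stanza
    mmadddisk gpfs1 -F meta.stanza

    # After changing the existing dataAndMetadata disks to dataOnly with mmchdisk,
    # migrate the metadata onto the new disks
    mmrestripefs gpfs1 -r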

    Happy weekend.

    Hajo

  • zero.silikon
    3 Posts

    Re: Diagnosing slow metadata performance

    2013-06-07T14:06:50Z

    I'm not sure the analogy you provided for the metadata applies here; regardless of volume count, GPFS knows where that metadata is stored and should be able to reference it quickly. Having it dispersed among multiple volumes should benefit the cluster, as it reduces hotspots, bottlenecks, and the like.

    As a point of reference, these users were working much faster with the same scripts performing the same routines: hundreds of records processed per second, versus sub-100 now (coming from a JFS2/NFS world). The only thing that has changed (and I've verified this time and time again) is that they were relocated to a GPFS filesystem, to eliminate the need for all this data to be accessed via NFS mounts and to give them faster, lower-latency access to the data.

    Although I would enjoy using SSDs, that is practically impossible here; what I am trying to do is pinpoint where the marked decrease in performance occurs and work back from there. If it turns out to be the volumes that make up the cluster, then fine -- but I would really appreciate understanding how people have narrowed down a perceived decrease in performance using whatever tools were at their disposal, rather than condemning the current configuration, hardware, or software without empirical evidence to support the condemnation.
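
    One way to gather the kind of empirical evidence being asked for is to time a known metadata-heavy operation and snapshot the GPFS diagnostics right afterwards; a minimal sketch, where /gpfs/bigdir stands in for one of the directories with hundreds of thousands of files:

    # Time the operation that users complain about
    time ls -l /gpfs/bigdir > /dev/null

    # Immediately afterwards, capture what GPFS on this node just did
    ts=$(date +%Y%m%d%H%M%S)
    mmdiag --iohist  > /tmp/iohist.$ts
    mmdiag --waiters > /tmp/waiters.$ts

    # Repeat with a cold cache (e.g. from another node): a large gap between the
    # wall-clock time and the per-I/O times in iohist points away from disk latency
    # and towards token/RPC or network waits.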

     

  • HajoEhlers
    253 Posts

    Re: Diagnosing slow metadata performance

    2013-06-07T15:14:31Z

    How are your clients connected? Directly (SAN-attached) or via NSD servers?

    Have you checked the cache usage via mmdiag (mmdiag --stats)? Whatever is not in the cache must be loaded from disk.

    Have you checked for GPFS waiters during times of slow access, or even monitored the waiters (mmlsnode -N waiters -L, mmdiag --waiters, mmdiag --iohist)? If metadata must be read from disk, how long does that take? (A sketch for monitoring the waiters follows below.)

    Have you monitored your DSxxxx regarding read/write sizes, queue sizes, wait times, and so on?

    Have you thought about using dedicated LUN(s) for the metadata, in order to

    1) not interfere with large reads/writes going to the same LUN, and

    2) keep the metadata close together, which may benefit from read-ahead and dedicated LUN caching?

    Note:

    > I'm certain that I have tens of thousands of IOPS at my disposal.

    Start calculating and verifying who is using them, and when.
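
    As a rough illustration of the waiter monitoring suggested above (all commands appear elsewhere in this thread; the 10-second interval and output directory are arbitrary):

    # Sample the cluster-wide and local waiters periodically during a slow period
    mkdir -p /tmp/gpfsmon
    while true; do
        ts=$(date +%Y%m%d%H%M%S)
        mmlsnode -N waiters -L > /tmp/gpfsmon/waiters.cluster.$ts
        mmdiag --waiters       > /tmp/gpfsmon/waiters.local.$ts
        mmdiag --iohist        > /tmp/gpfsmon/iohist.local.$ts
        sleep 10
    done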

    Happy weekend

    Hajo

     

     

  • dlmcnabb
    1012 Posts

    Re: Diagnosing slow metadata performance

    2013-06-07T17:55:25Z

    GPFS is not the same as NFS: NFS has a single node providing a common directory manager, while GPFS is distributed and the nodes only communicate changes to directories and files via disk IO.

    If you have a very large directory that many nodes are making changes to, then operations like readdir (ls) can take a long time revoking tokens held by other nodes and forcing disk IO.

    GPFS implements what is called Fine Grain Directory Locking (FGDL), which allows many nodes to lock particular entries in a directory using the metanode as the real manager of the directory. FGDL allows name lookups, creation, and deletion of files in a directory to be very fast. However, if you run readdir, rename (mv), or mkdir commands in that directory, GPFS has to revoke the FGDL tokens from all the other nodes, and everything slows down.

    To get better performance, it is much preferable to design the applications so that each node works in its own subdirectory.

    The only other way to make things faster is to have high-speed metadata disks (SSD) so that the IO itself is much faster.
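
    A trivial sketch of the per-node subdirectory pattern described above; /gpfs/work is a placeholder path. Each node creates and writes into its own directory, so FGDL tokens on a shared parent are rarely revoked:

    # Run on each node: confine this node's output to its own subdirectory
    WORKDIR=/gpfs/work/$(uname -n)
    mkdir -p "$WORKDIR"

    # Producers on this node write only here ...
    echo "record" > "$WORKDIR/output.$$"

    # ... and only an occasional aggregation step walks the shared parent,
    # so directory-wide token revocations become rare instead of constant.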

     

  • db808
    87 Posts

    Re: Diagnosing slow metadata performance

    2013-10-02T19:08:04Z
    In reply to dlmcnabb, 2013-06-07T17:55:25Z

    I'd like to jump in.

    The original poster indicated that performance has dropped to below 100 file operations per second .... using 18+ disks.  He is not complaining about thousands or tens of thousands of file operations per second ... but sub-100 operations per second.  He suspects something is broken (perhaps external to GPFS), and/or grossly misconfigured and is looking for troubleshooting guidance.

    Unfortunately, the poster has not responded to requests for more information about the configuration, so we cannot see whether some value is grossly misconfigured.

    I think the critical question has already been asked: how are the clients connected? Direct FC connections to all disks, or through a network to NSD servers? Since the poster mentioned using NFS before, it is quite possible that the clients are connecting via the network ... and this may be the first time those network connections have been stressed, so networking issues could be restricting these file operations.

    If the clients are accessing the files via NSD servers, how fast do similar scripts run when executed on the NSD servers themselves? If that performance is "normal", then it is likely a client-to-NSD-server connection issue.

    I have found value in looking at the mmdiag --iohist output ... from the client perspective.  You may want to increase the iohist size from 512 to 1024 to get a longer perspective.

    When I was having performance issues with some network-connected clients, the client-side iohist was valuable. It showed many dozens of requests with very short times (a few milliseconds), and very good performance during that stretch, and then a "pause" where 1 or 2 requests had very long times (hundreds of milliseconds). This only happened about 1 out of 50 to 1 out of 100 times. The overall performance impact of the delay was over 4-fold in my case, but your mileage will vary.

    With the GPFS iohist marking the exact time of the pauses, we were then also able to run some packet traces and correlate the network activity with when each "pause" occurred. It ended up being packet drops that caused a TCP-level backoff and recovery. When we diagnosed and corrected the networking issues, the performance anomalies disappeared.

    I have also seen mtime and atime settings have a dramatic impact on performance, since you end up posting timestamp changes cluster-wide and invalidating cached metadata.

    I would also suggest double-checking the GPFS file system mount options, and comparing them to the previous NFS mounts that were being used.  It is likely that the NFS mount had some "go fast" options enabled, and those same options are not enabled on the GPFS file system.
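
    A short sketch of checking the atime/mtime behaviour on the GPFS side; the -S (suppress atime) and -E (exact mtime) file system attributes are the usual suspects, and the file system name gpfs1 is a placeholder. Whether relaxing them is acceptable depends on the applications:

    # Current settings
    mmlsfs gpfs1 -S -E

    # Possible relaxations (only if the applications tolerate them):
    # suppress atime updates, and allow non-exact mtime
    mmchfs gpfs1 -S yes
    mmchfs gpfs1 -E no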

    You can easily explain sub-100 file operations per second ... from a single thread ... if the metadata is not cached. Most file operations of that style need to do at least a (get directory entry) and a (read inode) operation per file. If those pieces of metadata are not cached, you could have 2 metadata IOs per file. If a single 2TB disk does 70 IOPS ... then you get 35 file operations per second ... per thread.

    GPFS has some patented inode prefetch algorithms, but they don't kick in on small directories. In my case, one of our GPFS clusters has an average of 6 files per directory, and we get just about (1 + 1/6) metadata IOs per file, plus the data IOs, if there are any. If you have good inode caching, the grandparent and higher directories will hopefully be cached.

    When you were experiencing thousands of file operations per second previously (under NFS), you were either accessing cached directory entries and cached inodes, running hundreds of threads in parallel, and/or had the metadata on very fast storage devices. The effectiveness of client-side caching (and how often the client-side cache is invalidated) depends on the client-side GPFS configuration and the atime and mtime settings. Less-than-efficient settings of these GPFS parameters could have been identified from the GPFS configuration information that was never provided.

    Hope this helps.

     

  • AcedoM
    37 Posts

    Re: Diagnosing slow metadata performance

    2016-02-12T16:37:27Z

    Experiencing a hang with the Linux mv command in a GPFS filesystem on an otherwise healthy system. We have observed several times that a 'mv' command performed on a GPFS client node will hang on one specific node but work fine on others. We ran strace and mmtracectl while the hang was occurring. Looking to see whether this particular mv hang has any known workaround. This system has SSD drives and a separate metadata pool. It has maxFilesToCache=40000, but the number of files involved can range from one large directory to several directories of files.

     

    # cd /research/rgs01/resgen/raw_data/fastq/HA/hpcf-download/HJMF7CCXX/lsf/
    # ps -ef | grep 7633
    jmcmurry 7633 7629 0 Feb08 pts/2 00:00:00 mv -v /research/rgs01/resgen/raw_data/fastq/HA/hpcf-download/HJMF7CCXX/lsf/HJMF7CCXX-local.md5 /research/rgs01/resgen/raw_data/fastq/HA/hpcf-download/HJMF7CCXX
    root 8665 959 0 18:40 pts/106 00:00:00 grep 7633
    # pwd
    /research/rgs01/resgen/raw_data/fastq/HA/hpcf-download/HJMF7CCXX/lsf
    [root@r-head002 lsf]# ls -al
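
    A minimal sketch of the strace/mmtracectl capture mentioned in the post, using the pid and node name from the transcript above; the 30-second trace window is arbitrary:

    # While the mv is hung, see which system call it is blocked in
    strace -f -p 7633 -o /tmp/mv.strace

    # Bracket the hang with a GPFS trace on the affected node for later analysis
    mmtracectl --start -N r-head002
    sleep 30
    mmtracectl --stop -N r-head002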

  • oester
    188 Posts

    Re: Diagnosing slow metadata performance

    2016-02-12T16:47:34Z
    In reply to AcedoM, 2016-02-12T16:37:27Z

    Waking up an old thread here!

    It would be interesting to know more about the actual cluster connectivity and NSD configuration. On the slow node in question, an iohist dump ("mmfsadm dump iohist") and a dump of the waiters ("mmfsadm dump waiters") while the hang is occurring might yield some clues about what the wait is. You may have some RPC waiters due to tokens in that directory, for example, or other IO-related delays.
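
    For completeness, the two dumps suggested above, taken on the slow node while the mv is hanging (the output file names are arbitrary); comparing them with the same dumps from a node where the mv completes normally helps narrow down whether it is an RPC/token wait or an IO wait:

    mmfsadm dump waiters > /tmp/waiters.slownode
    mmfsadm dump iohist  > /tmp/iohist.slownode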

     

    Bob Oesterlin

    Nuance HPC Grid

  • AcedoM
    37 Posts

    Re: Diagnosing slow metadata performance

    2016-02-12T16:54:21Z

    Hi Bob, great to hear back. I have a new thread going (link below), but have not heard back from anyone yet. I have actually been hitting this in two different incidents. The waiters looked really good at the time we tested it in one case. I want to do some more metadata tuning, but I'm not sure that will help. This seems like it could be a known issue, but I will also have to check the network as mentioned above. To answer your question: yes, I check the waiters over and over and, as I said, haven't seen any in one of the cases, but I feel that mv is a known high-overhead command.

     

    https://www.ibm.com/developerworks/community/forums/html/topic?id=3531328a-a2a7-4a70-9fa3-97c770bf3121&ps=25

  • oester
    188 Posts

    Re: Diagnosing slow metadata performance

    2016-02-12T17:05:31Z
    In reply to AcedoM, 2016-02-12T16:54:21Z

    When you say "waiters look good" - I'm not sure what that means. If the client is hanging, it's probably an RPC wait or an IO wait, hence my question about what each of those shows. Regarding it being a known issue - can you point me to where you have seen this? I'm running metadata on SSD/Flash, and I never see these long hangs unless there is some other underlying issue (waiters, contention).

     

    Bob

  • AcedoM
    37 Posts

    Re: Diagnosing slow metadata performance

    2016-02-12T17:27:57Z

    Thanks Bob. On the node with the hung mv command the waiters were nonexistent. With mmlsnode collecting waiters cluster-wide, they were all sub-0.05 seconds across the board.

     

    # mmdiag --waiters

    === mmdiag: waiters ===

     

  • AcedoM
    37 Posts

    Re: Diagnosing slow metadata performance

    2016-02-12T17:40:34Z

    Thanks Bob, I will continue to test and debug the network and nodes for more info. Thanks