Topic
  • 28 replies
  • Latest Post - ‏2014-06-06T20:11:49Z by marc_of_GPFS
fleers
24 Posts

Pinned topic delete policy crawling through 200m small files

‏2014-05-31T00:33:50Z |

Hello,

I am watching a delete policy as it churns through 200 million files of roughly 1KB size at a rate of about 350000 files an hour. (linear projection = 23 days to completion)  I have 8 NSD servers employed in the mmapplypolicy job.  GPFS 3.5.0.17, the filesystem blocksize is 1M, scatter alloc., inodes at default 512B, metadata is replicated, data is not.  We have 40 dataAndMetadata NSDs in 2 failure groups.  The GPFS disks are built on h/w RAID6 8+2 SATA LUNs with a 1MB full-stripe size. 
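For reference, that 23-day figure is just the straight-line arithmetic on the numbers above:

awk 'BEGIN { files = 200e6; rate = 350000; printf "%.1f days\n", files / rate / 24 }'
# -> 23.8 days at the current rate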

The 200 million files are arranged such that 150,000 files are contained per directory, all directories are at the same level in the tree.

The rule looks like this:

Evaluating MIGRATE/DELETE/EXCLUDE rules with CURRENT_TIMESTAMP = 2014-05-30@19:16:08 UTC
parsed 0 Placement Rules, 0 Restore Rules, 1 Migrate/Delete/Exclude Rules,
        0 List Rules, 0 External Pool/List Rules
     RULE 'deleteme' DELETE WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) > 3)
        AND PATH_NAME LIKE '%/Summary.within.intermediate-file/%'

The inode and directory scan phases are relatively quick.

I am noticing 100x the IOPS at 32KB (the subblock size) to one of the LUNs/NSDs, rotating through all of the NSDs, dwelling on each for about 20 seconds before moving on to the next.  Similarly for 4KB and smaller (metadata ops), but not as pronounced: maybe a 10x difference between one of the LUNs and the other 39.

I am trying to determine why GPFS is cycling through the NSDs in this manner as opposed to accessing blocks in a more scatter-like fashion.

I am also scratching my head over the rate of progress.  I know there is a lot of locking and overhead in general with a workload like this (small-block metadata ops producing lousy read-modify-write backend I/O), but this seems a bit extreme.

Here's a typical waiters snapshot in time.  Nothing over 0.5s as long as I've been watching.  Seems like mostly overhead -

0.169476086 UnusedInodePrefetchThread: on ThMutex 0x7F708CBABC58
0.167694036 BackgroundFileDeletionThread: on ThCond 0x7F704C0034A8
0.166714624 CreateHandlerThread: on ThMutex 0x7F708CBABC58
0.166476811 CreateHandlerThread: on ThMutex 0x7F708CBABC58
0.160214317 RemoveHandlerThread: on ThMutex 0x7F708CBABC58
0.150010403 CreateHandlerThread: on ThMutex 0x7F708CBABC58
0.076717055 BackgroundFileDeletionThread: on ThCond 0x1F2FC08
0.072386493 RemoveHandlerThread: on ThCond 0x7F7B5C015548
0.064595558 RemoveHandlerThread: on ThCond 0x7F2BEC009478
0.062348572 RemoveHandlerThread: on ThMutex 0x7F7148000C58
0.053785491 CreateHandlerThread: on ThMutex 0x7FFB28004AF8
0.049346755 RemoveHandlerThread: on ThCond 0x7FFB10003888
0.032642000 PrefetchWorkerThread: for I/O completion
0.020745000 PrefetchWorkerThread: for I/O completion
0.014435000 PrefetchWorkerThread: for I/O completion
0.014066000 PrefetchWorkerThread: for I/O completion
0.013694000 PrefetchWorkerThread: for I/O completion
0.013328000 PrefetchWorkerThread: for I/O completion
0.012768000 PrefetchWorkerThread: for I/O completion
0.011909000 PrefetchWorkerThread: for I/O completion
0.011909000 PrefetchWorkerThread: for I/O completion
0.011529000 PrefetchWorkerThread: for I/O completion
0.009646000 WritebehindWorkerThread: for I/O completion
0.008371000 DirBlockReadFetchHandlerThread: for I/O completion
0.007544450 FsyncHandlerThread: on ThCond 0x1800A408730
0.006597000 SharedHashTabFetchHandlerThread: for I/O completion
0.006234137 UnusedInodePrefetchThread: on ThMutex 0x7F1F38000C58
0.006205930 CreateHandlerThread: on ThMutex 0x7F1F38000C58
0.003444985 RemoveHandlerThread: on ThCond 0x2CB7F68
0.003154000 PrefetchWorkerThread: for I/O completion
0.003024000 PrefetchWorkerThread: for I/O completion
0.002758000 PrefetchWorkerThread: for I/O completion
0.001638000 SharedHashTabFetchHandlerThread: for I/O completion
0.000946735 RemoveHandlerThread: on ThCond 0x7F4F4C02D898
0.000351000 InodeAllocRevokeWorkerThread: for I/O completion
0.000226164 RemoveHandlerThread: on ThCond 0x1C00229E430

And a sample of aggregate iohist type and counts from the 8 NSD servers:

   1344 logData W
    688 logWrap W
    593 inode W
    377 iallocSeg W
    335 data W
    199 inode R
    180 iallocSeg R
    134 metadata W
     99 data R
     74 indBlock W
     41 metadata R
     20 allocSeg W
     12 logDesc W
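That tally comes from captures of "mmfsadm dump iohist" on each NSD server; something along these lines will reproduce it (iohist.*.out being those per-server captures; field positions assume the usual iohist layout of timestamp, R/W, buffer type, ... and may differ by release):

awk '$2 == "R" || $2 == "W" { print $3, $2 }' iohist.*.out | sort | uniq -c | sort -rn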

thanks for any insight as I dig into this a bit more ...

  • marc_of_GPFS
    33 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-02T15:04:21Z  

    If you got through the directory and inode scans, then the command is done "churning" through your 200M files. Now it is doing the deletes with unlink(2) system calls.    This is a directory- and directory-lock-intensive workload. How many "compute" servers are you using? What are your -N and -m parameters for mmapplypolicy?  -N governs which nodes will be executing the mmapplypolicy command in parallel, including those unlink()s. If you did not set -N, read the doc... "The default is to run on the node where the mmapplypolicy command is running or the current value of the defaultHelperNodes parameter of the mmchconfig command."

    With multiple nodes trying to do a large number of deletes, in no particular order wrt the directories, you're likely going to have directory lock thrashing. And your strategy of packing 150K files per directory is probably working against you here!

    You can minimize the directory lock thrashing by "bunching" together deletes by directory - thereby amortizing the time and "expense" of obtaining a directory lock.    This is most easily done by adding a WEIGHT(DIRECTORY_HASH) clause to your DELETE rule.  This was introduced in Release 3.5 TL1 (3.5.0.11)

    The Advanced Admin Guide gives this guidance:

    This rule:

    RULE 'purge' DELETE WEIGHT(DIRECTORY_HASH) WHERE (deletion-criteria)

    causes files within the same directory to be grouped and processed together during deletion, which may improve the performance of GPFS directory locking and caching.

    If you have not upgraded to 3.5.0.11 (or higher) -- you still have the alternative of producing the list of files to be deleted with the -I defer or -I prepare option using -f to say where to put the list.  Then sort by pathname, then process the list with a script of your own design (-I defer) OR restart the job with the -r option, specifying the sorted file list.
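    Roughly, as a sketch (device, policy file, and list paths here are placeholders; the exact names of the generated candidate-list files are described in the mmapplypolicy documentation):

    mmapplypolicy /gpfs/fs0 -P delete.pol -I prepare -f /tmp/cands
    # sort the generated candidate list(s) by the pathname field (inspect the record
    # format first; the path is typically the last field), e.g. into /tmp/cands.sorted
    mmapplypolicy /gpfs/fs0 -P delete.pol -r /tmp/cands.sorted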

    yours,

    Marc-the-policy-guy

  • fleers
    24 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-02T17:23:36Z  

    Hi Marc-the-policy-guy,

     

    Thanks for the reply.

    How many "compute" servers are you using? - this is a cNFS cluster with a relatively light client population and workload (fewer than 100 NFS clients).

    What are your -N and -m parameters for mmapplypolicy? - -N is inferred above (the 8 NSD servers, via -N nsdnodes).  -m was initially left at the default (24); we've tuned it down to -m 12 at the moment.
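    (i.e., the job is being driven roughly along these lines - "Device" and "policyfile.pol" stand in for the real names:)

    mmapplypolicy Device -P policyfile.pol -N nsdnodes -m 12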

    If you have not upgraded to 3.5.0.11 (or higher) - we are at .17 as mentioned above.

    Thanks - we will try the WEIGHT(DIRECTORY_HASH) directive to see if grouping files within a directory eases up some of the contention for directory locks.

     

     

     

  • marc_of_GPFS
    33 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-02T17:31:42Z  
    • fleers
    • ‏2014-06-02T17:23:36Z

    Great. There is no harm in cancelling/killing the currently running command and restarting with WEIGHT(DIRECTORY_HASH) added to your DELETE rule.  We've done some lab testing. But we'd like to know how much this helps in practice.

    So please let us know how it works for you. After mmapplypolicy has "churned" through the inodescan it should tell you how many files it is about to delete -- let it go a few more minutes and then you can estimate the deletion rate and project the time until completion.

  • fleers
    24 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-02T20:29:27Z  

    Hi - No relief from adding  WEIGHT(DIRECTORY_HASH)

    Any thoughts on my original observation?  I continue to see this behavior when the policy is running :

    I am noticing 100x the IOPS at 32KB (the subblock size) to one of the LUNs/NSDs, rotating through all of the NSDs, dwelling on each for about 20 seconds before moving on to the next.  Similarly for 4KB and smaller (metadata ops), but not as pronounced: maybe a 10x difference between one of the LUNs and the other 39.

    I am trying to determine why GPFS is cycling through the NSDs in this manner as opposed to accessing blocks in a more scatter-like fashion.

    I've determined that this is because the majority of IOPS (according to iohist) are logData and logWrap writes, which cycle through a pair of GPFS disk numbers sequentially.  I say a pair because we have -m 2 (two metadata replicas) for this filesystem.

    Here is an example:

    sector 22943229580 of disk 24 and sector 22942457484 of disk 19:

    I/O start time RW    Buf type disk:sectorNum     nSec  time ms  Type  Device/NSD ID
    --------------- -- ----------- ----------------- -----  -------  ---- ------------------

    13:16:59.126087  W     logData   24:22943229580      1    0.514  lcl  vd23
    13:16:59.126087  W     logData   19:22942457484      1    0.516  lcl  vd18
    13:16:59.273496  W     logData   24:22943229580      1    0.541  lcl  vd23
    13:16:59.273498  W     logData   19:22942457484      1    0.541  lcl  vd18
    13:16:59.275220  W     logData   24:22943229580      1    0.541  lcl  vd23
    13:16:59.275222  W     logData   19:22942457484      1    0.541  lcl  vd18
    13:16:59.335279  W     logData   24:22943229580      1    2.471  lcl  vd23
    13:16:59.337212  W     logData   19:22942457484      1    0.540  lcl  vd18
    13:16:59.344740  W     logData   24:22943229580      1    0.537  lcl  vd23
    13:16:59.344740  W     logData   19:22942457484      1    0.543  lcl  vd18
    13:16:59.352290  W     logData   24:22943229580      1    0.518  lcl  vd23
    13:16:59.352290  W     logData   19:22942457484      1    0.520  lcl  vd18
    13:16:59.359801  W     logData   24:22943229580      2    1.148  lcl  vd23
    13:16:59.359807  W     logData   19:22942457484      2    1.142  lcl  vd18

    ...then we move on to the next sectors of those two disks, sector 22943229581 of disk 24 and sector 22942457485 of disk 19:
    13:16:59.514335  W     logData   24:22943229581      1    0.536  lcl  vd23
    13:16:59.514335  W     logData   19:22942457485      1    0.538  lcl  vd18
    13:16:59.519955  W     logData   24:22943229581      1    0.543  lcl  vd23
    13:16:59.519955  W     logData   19:22942457485      1    0.549  lcl  vd18
    13:16:59.607419  W     logData   24:22943229581      1    0.543  lcl  vd23
    13:16:59.607422  W     logData   19:22942457485      1    0.542  lcl  vd18
    13:16:59.611041  W     logData   24:22943229581      1    0.539  lcl  vd23
    13:16:59.611043  W     logData   19:22942457485      1    0.538  lcl  vd18
    13:16:59.615485  W     logData   24:22943229581      2    0.545  lcl  vd23
    13:16:59.615487  W     logData   19:22942457485      2    0.545  lcl  vd18

    logWrap seems to be following a similar pattern, although not exactly the same.

    This is obviously one contributor to the bottleneck (not the replication necessarily, but the overhead from the log I/O for all of these small file deletions). 

    BTW, it may not be readily obvious from the above snippet, but the IO type (on all nodes working on this policy) is dominated by logData and logWrap.
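    A quick way to put a number on that from a captured iohist (same caveat as before about field positions varying by release):

    awk '$2 == "R" || $2 == "W" { tot++; if ($3 == "logData" || $3 == "logWrap") nlog++ }
         END { printf "%d of %d IOs (%.0f%%) are log IOs\n", nlog, tot, 100 * nlog / tot }' iohist.out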

    comments? (aside from the obvious of moving metadata to SSD)

    thanks

    Updated on 2014-06-02T20:32:40Z by fleers
  • marc_of_GPFS
    33 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-03T11:36:47Z  
    • fleers
    • ‏2014-06-02T20:29:27Z

    Okay, so now the problem is not policy per se, but that every directory update (unlink in this case) is logged AND the iops to the log files seem to be holding up the works.   My knowledge in this area of GPFS is limited, but here are a few things to ponder and investigate.

    You write that you have 40 nsds - are they all set as 'system' pool?  (Check with mmdf).
    As a rule of thumb for smallish clusters, for good metadata performance on conventional spinning disks, you should have about 2 or more system disks per node. 

    Very important: does your RAID controller have NVRAM that can absorb those write ops quickly?  Are you sure it's activated/enabled/configured?
    Otherwise RAID performance for writes that are less than a complete RAID stripe will be horrifically bad - because each sector update may require a read/update/write cycle over several spinning disks.  If you have enough NVRAM the writes go into that and then the RAID controller can defer the writes...
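    For a sense of scale, the classic cost of a sub-stripe RAID6 update, absent any coalescing in cache, is:

        read old data + read P + read Q    = 3 IOs
        write new data + write P + write Q = 3 IOs

    i.e. roughly 6 back-end disk operations for every single-sector logData write, versus one full-stripe write if the NVRAM can coalesce a whole stripe first.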

    Also, are your nsds=GPFS LUNs each mapped to different spinning disks within your raid box(es)?  If you have 2 LUNs that map onto the same disks, such that an IOP to one may be held up waiting for the disk arms of the other -- then that's useless -- the 2 LUNs cannot perform any better than one -- and probably will perform worse -- because the software above is trying to be smart and schedule disk ops for each device as if they were independent.

    And yes, you may in future be able to accelerate metadata performance by dedicating a set of faster disks (could be SSDs) to system pool.  
    If you do reconfigure your GPFS disks in future, do some testing up front of small write performance - as I wrote just above - RAIDs are not good at this - you're better off just mapping each GPFS disk to one real disk and relying on GPFS replication for fault protection.  Even so, best if you have fast NVRAM buffers - especially on the disks to which GPFS is writing log data.  In this regard GPFS provisioning is similar to any other journaling file system or a database with transaction logging.

  • fleers
    24 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-03T20:12:11Z  

    Thanks for the reply.  This boils down to billions of single-sector log writes into RAID6 LUNs which are on spinning SATA and optimized for 1MB IOs.  Sure, write cache can absorb some of that but the back-end read-modify-write operations become a bottleneck as the cache de-stages.

     Rule#  Hit_Cnt         KB_Hit          Chosen          KB_Chosen       KB_Ill  Rule
      0     240176967       7685688256      240176967       7685688256      0       RULE 'purge' DELETE WEIGHT(.) WHERE(.)

    [I]2014-06-02@20:14:52.282 Policy execution. 481259 files dispatched.  .!......
    [I]2014-06-02@23:28:07.665 Policy execution. 2218366 files dispatched.  ....*...
    [I]2014-06-03@15:01:39.368 Policy execution. 11070475 files dispatched.  \.......
    [I]2014-06-03@20:04:09.786 Policy execution. 13210620 files dispatched.  ......*.

    ~24 hours of progress = 5% ... we're looking at 20 days of runtime
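    (Rate check on those progress lines - the elapsed time between the first and last entries is 23h 49m 17s = 85757s:)

    awk 'BEGIN { done = 13210620 - 481259; rate = done / 85757; left = 240176967 - 13210620;
                 printf "%.0f deletes/sec, %.1f days remaining\n", rate, left / rate / 86400 }'
    # -> ~148 deletes/sec, ~17.7 days remaining at that rate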

  • marc_of_GPFS
    33 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-03T20:22:56Z  
    • fleers
    • ‏2014-06-03T20:12:11Z

    No, what's supposed to happen is that you accumulate enough contiguous sectors in the NVRAM so that the controller recognizes it can flush to disk with single "full track" writes, in parallel to each spinning disk/arm assembly.  That's how the disk controller can "keep up" with the log writes.

    Can your Raid controller do that? Is it configured to do so?  Are the GPFS logfiles mapping to large blocks of contiguous sectors?

    I believe (although I could be wrong ;-( ) that's how the database boys play the game...

  • marc_of_GPFS
    33 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-03T20:27:14Z  

    Hmmm.. I also see you really are trying to delete 240 Million files in one (long slow) blow.  Wow.  How many files will that leave you with? Equivalently, how many files or inodes did policy report inode-scanned?  If 240 Million is most everything, you may be better off copying the good stuff into a new filesystem and then wiping the old filesystem with a quick zero-out of the first few sectors of each drive...

  • db808
    86 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-03T22:18:17Z  

    Interesting discussion. 

    Just a hint that we picked up via some battle scars from doing it the wrong way.

    Do you need your GPFS disk statistics?  If so, save a mmfsadm dump disk output to a file for a reference.

    Here is my thought.

    reset the GPFS disk statistics "mmfsadm resetstatistics"

    Let the system run for a few minutes, with your deletes crawling in the background.

    Run "mmfsadm dump disk > disk file"

    Create a Perl (or whatever) script to read the disk statistics and summarize the metadata and data IO per NSD.

    Is there an NSD hot spot?  It is likely the metadata journal file.

    Even if the metadata journal file is being well handled by write-cache on the disk array, the intense activity will monopolize the channels (SAS or FC), and starve "normal" metadata activity on the same NSD.

    You might be able to retrofit GPFS striped logs onto the file system, IFF the GPFS log file size (default of 4MB) is larger than the metadata block size.

    I have been told by IBM (but never done myself), that if you perform a clean shutdown of the GPFS file system, you can use mmfsck to delete the existing log files (MUST HAVE A CLEAN SHUTDOWN), and then use mmchfs to add the --striped-logs option to the file system.  When you then bring the file system up, GPFS will create new log files, that will be striped.

    This does not eliminate the hot spot, but causes the hot spot to "rotate" across multiple NSDs, such that no one NSD is being starved by competing with intense journalling.  If your block size is 1MB, and the log file size is the default of 4MB ... you will get 4-fold better dispersion of the log file IO.  Also, with multiple NSDs now involved with the journalling, one NSD can be flushing write cache while the next one is servicing the current journalling IO.
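    A quick way to check whether that ratio works in your favor ("fs0" is a placeholder; if your level doesn't accept the individual flags, plain mmlsfs lists everything):

    mmlsfs fs0 -B -L --striped-logs     # compare the log file size (-L) against the block size (-B)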

    We made the mistake of not doing this ... and our metadata block size happened to equal the log file size ... so even with striping ... only one NSD would be used.

    We have 120 metadata NSDs.  50% of the total metadata IO is to the log file, and it is concentrated on 4 NSD servers' journal files, and their replicas.  So in our BAD case, 8 metadata NSDs are handling 50% of the NSD activity, and the other 112 NSDs are handling the other 50%.

    Our new file systems use an explicit 256KB metadata block size, with a 16 MB log file size, allowing a 64-way stripe of the journal file(s).

    Dave B

  • marc_of_GPFS
    33 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-04T02:06:40Z  
    • db808
    • ‏2014-06-03T22:18:17Z

    "50% of the metadata IO is to the log file..." - on first blush may seem a lot -- But most (all?) metadata changes must be journaled, so there you are.

    "clean shutdown" -- should leave the journal empty -- so there you are -- easily deleted and replaced.

  • marc_of_GPFS
    33 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-04T02:52:24Z  

    Also consider "system.log" (see 4.1 Adv Admin Guide) -- I imagine you could dedicate a few MB of NVRAM to the log - and eliminate any journalling to spinning storage.

    Also where is --striped-logs documented?  Nada -- Aside from a few posts on this board (notably and authoritatively by GPFS grandmaster Yuri) https://www.ibm.com/developerworks/community/forums/html/topic?id=341d12d9-5786-4611-9872-426b6a84514f

    And sure enough...

     `mmlsfs d4ki --striped-logs -Y`   yields: "mmlsfs::0:1:::d4ki:stripedLogs:Yes:"
     
  • oester
    108 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-04T13:16:41Z  
    • db808
    • ‏2014-06-03T22:18:17Z

    Hi Dave

     

    Is that command to reset the disk stats correct? It doesn't work for me:

     

    # mmfsadm resetstatistics
    Invalid command "resetstatistics".
    Type "help" or "?" for help.

     

    Bob

  • renarg
    119 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-04T13:36:12Z  
    • oester
    • ‏2014-06-04T13:16:41Z

    Hi Bob,

    hint: mmfsadm vfsstats reset.  I don't know if this is the same as what Dave mentioned.

    Regards Renar

  • db808
    86 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-04T20:14:14Z  
    • renarg
    • ‏2014-06-04T13:36:12Z

    Sorry Bob, my typo

    The syntax is "mmfsadm resetstats"

    The option is described on the GPFS wiki,

    https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/mmfsadm

    Apparently it is a hidden command.  I thought I saw it show up in a mmfsadm help listing, but I don't remember what version or update.  Like you indicated, it is not listed in the more recent GPFS versions of mmfsadm.

    Not that you would *ever* try to find hidden commands, it does show up when you run:

    strings `which mmfsadm` | grep reset      :-)

     

  • db808
    86 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-04T22:04:56Z  
    • db808
    • ‏2014-06-04T20:14:14Z

    A few other hints ... that I have NOT explored ... but seem to be related to a serialization of deletes.

    The delete rate that you mentioned above (about 13.2M files per 24 hours) is about 153 deletes per second.  With 7200 RPM SATA disks, you could expect ~75 small IOs per second per disk.

    You indicated that your configuration had 40 x (8+2) RAID6 LUNs.  Are they on disjoint disks?  If so, you have a total of 400 disks.  Ideally, you could do up to 10 small reads in parallel per raid group, and a small R6 write would take 3-to-6 IO times, resulting in 1.6 to 3.3 parallel writes per raid group.  Scaling this up to 40 raid groups, you have the potential for 400 parallel reads or (23 to 132) parallel writes. (journal hot spot notwithstanding)

    So you see there seems to be something limiting throughput, unless each delete operation needs 10+ IO operations.

    Thinking out loud ... there appears to be an artificial limit ... unsubstantiated at this point.

    With your 400 disk storage complex, you could need as many as 400 read threads or 133 write threads to keep the disks busy ... and these are higher than GPFS defaults.  And GPFS 3.5 completely changed the IO scheduling system, and within GPFS 3.5, the nsdThreadmethod changed from 0 to 1, changing the algorithms further.

    I am also concerned that the "old" default of nsdThreadsPerDisk was 3 ... which is appropriate for a single disk, not a 10-disk RAID group.  I understand that the new GPFS 3.5 IO queuing system is "better", but it is different, and it is not well documented if the old legacy parameters (like nsdThreadsPerDisk) are completely ignored, or must also be set properly.

    The second big problem relates to GPFS systems that have been upgraded. If you had an existing GPFS configuration that set a specific parameter, that parameter (in general) is passed through ... even though the parameter value may be LOWER than the GPFS 3.5 new defaults.  Unchanged parameters "inherit" the new GPFS defaults ... but over-ridden parameters continue ... and their values may not be appropriate for GPFS 3.5.

    So re-validate all the changed parameters.  Do a "mmfsadm dump config > config.out"  ... and the parameters that have a "!" have been modified.  Are they still appropriate?
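    i.e. something like:

    mmfsadm dump config > config.out
    grep '!' config.out      # '!' marks values that have been changed from the built-in defaults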

    With 400 disks, you probably need 400 * 1.7 = 680 nsdWorkerThreads

    worker3Threads controls the amount of metadata prefetching parallelism

    prefetchThreads and worker1Threads would have similar values

    The old nsdThreadsPerDisk would be about 16 if you wanted to use a 10-disk RAID group in individual disk mode.

    maxFileCleaners defaults to 8, which is the number of threads flushing data or metadata.

    maxBackgroundDeletionThreads defaults to 4, which is probably too low for your case.

    I will defer to the IBM document for descriptions on the new queuing system..

    http://www-05.ibm.com/de/events/gpfs-workshop/pdf/pr-11-GPFS_R35_nsdMultipleQ_and_other_enhancmentsv4-OW.pdf

    What does the  "mmfsadm test nsd qcalc" tool show?

    The following performance presentation also has some good descriptions of some of the critical GPFS parameters that need to be scaled for larger configurations:

    http://www.gpfsug.org/wp-content/uploads/2014/05/UG10_GPFS_Performance_Session_v10.pdf

    Lastly ... we don't know how many IO operations are needed for the typical delete.  You can use GPFS io history to get a better understanding.

    mmchconfig ioHistorySize=64k ... increases the iohistory

    "mmdiag dump iohist"     or   "mmfsadm dump iohist"    will dump the iohistory in slightly different forms.

    From analyzing the iohistory, you should be able to determine the "average" sequence of IO events that are needed for a typical delete.  It will be very obvious how many metadata reads and writes are performed per set of log file writes.  ... and yes ... you do need to do "reads" to perform a delete ... to read the directory entries to find the inode to delete ... and "deleting" a file involves deleting the directory entry, and then reading and decrementing the inode use count.  If the use count is zero, then the inode and data block are deallocated.  I'm not sure when/if GPFS zeros the old information.  And all these changes are journalled.

    Once you know the average number of IOs to perform the delete (both log and non-log IOs), you can get a better understanding if you have enough parallel deletes in progress to keep the disks busy.

    Under GPFS 3.4.x, I would be very suspicious of nsdThreadsPerDisk, maxFileCleaners, maxBackgroundDeletionThreads, along with the typical nsdMaxWorkerThreads, prefetchThreads, worker1Threads, and worker3Threads.

    Have you ever run a random small IO stress test?  That is one way to validate that the basic IO scheduling is "opened up" enough for your 400 disk configuration.

    If you have a simple single-threaded program or script doing "direct" IO, with a random IO pattern, you should scale at about 75 reads per second per invocation, up to 400 invocations.  You will see a falloff, and may need 50% to 70% more than 400 invocations to keep 400 disks busy ... reaching around 30,000 IOPs.  If you start falling off much lower, you are hitting some other limit.

    Similarly ... you can run a write test and see if you can achieve (23 to 132) parallel writes * 75/sec = 1725 to 9900 writes/sec.
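    If you don't already have a favorite load generator for that, something like fio can drive the direct, random pattern described above (fio is not a GPFS tool, just a common choice; the directory, sizes, and job counts below are placeholders to adjust for your setup):

    fio --name=randread --directory=/gpfs/fs0/stress --ioengine=psync --direct=1 \
        --rw=randread --bs=4k --size=1g --numjobs=400 --runtime=120 --time_based --group_reporting
    # for the write-side check, swap --rw=randread for --rw=randwrite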

    Good Luck.

    Please let us know what you discover.

    Dave B

     

     

     

    Updated on 2014-06-04T22:25:40Z by db808
  • marc_of_GPFS
    33 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-05T14:07:23Z  

    Regarding system.log - if you can upgrade to a version of gpfs that supports 'system.log' pool - this seems the best way to manage log/journalling performance.  The system.log "device" can be optimized for fast small writes and need be no bigger than the logfile size (mmchfs -L)

    I am not familiar with the current generation of disk controllers - but I would expect there to be some way to dedicate some of the NVRAM to a particular LUN or set of LUNs.   Assuming that to be the case - if the number of bytes of NVRAM dedicated to a LUN equals the number of bytes in the LUN -- then you would effectively have defined a pure NVRAM LUN -- it doesn't matter if/when the controller gets around to flushing the NVRAM; the NVRAM would immediately "absorb" every write to the logfile.

     

  • dlmcnabb
    1012 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-05T14:55:11Z  

    The system.log pool must be large enough to hold ALL the logs for all nodes, so N * logfileSize plus some extras. If there are many nodes, then they will not all have direct access to the system.log device, so they will be accessing the log via NSD requests. The actual device IO latency will be good, but you may not get as much performance as expected because of network latency.

  • GPFSuser
    5 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-05T15:03:43Z  
    • db808
    • ‏2014-06-04T22:04:56Z

    [SNIP]

    I am also concerned that the "old" default of nsdThreadPerDisk was 3 ... which is appropriate for a singe disk, not a 10-disk RAID group.  I understand that the new GPFS 3.5 IO queuing system is "better", but it is different, and it is not well documented if the old legacy parameters (like nsdThreadsPerDisk) are completely ignored, or must also be set properly.

    [SNIP]

    With 400 disks, you probably need 400 * 1.7 = 680 nsdWorkerThreads

    Interesting, and potentially very helpful.

     

    However, while you refer to "GPFS 3.5", there seem to be substantial differences in those parameters within GPFS 3.5. For example, GPFS 3.5.0.9 under Linux has an upper limit of 128 nsdWorkerThreads. I have no idea what minor release changed that limit (if any).

     

    I really wish that IBM would improve the version-to-version changelogs for end users. The detailed listing of bug fixes is useful to individuals that have experienced a particular bug and want to know if it's fixed in a newer version, but it is not very helpful in determining changes to commands or parameters.

     

    A table that lists command and parameter changes, per-version, would be extremely useful.

  • marc_of_GPFS
    33 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-05T17:33:49Z  
    • dlmcnabb
    • ‏2014-06-05T14:55:11Z

    Good points Dan.  If I understand correctly there is a log for each node.  Now suppose I have a local PCI/NVRAM card in each node (I believe you can find such products.)  Is there a way to "encourage" each node to use the local NVRAM?

  • marc_of_GPFS
    33 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-05T17:42:12Z  
    • db808
    • ‏2014-06-04T22:04:56Z

    Yes there are several (many?) parameters one can tweak - in some cases increasing some of the xxxThreads parameters can help.  

    For a complete list try `mmfsadm saferdump config` - BUT warning!  Do not turn any drastically up or down on a production system without testing elsewhere.  Also search this board, and the google (or similar search) for tips on GPFS "tuning".

  • dlmcnabb
    1012 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-05T17:47:16Z  

    Good points Dan.  If I understand correctly there is a log for each node.  Now suppose I have a local PCI/NVRAM card in each node (I believe you can find such products.)  Is there a way to "encourage" each node to use the local NVRAM?

    Nope.  The local NVRAM might be usable by LROC, but it cannot be used for writing permanent filesystem metadata as if it were a disk.

  • marc_of_GPFS
    33 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-05T18:25:47Z  
    • dlmcnabb
    • ‏2014-06-05T17:47:16Z

    Assume there is a device driver to make the NVRAM appear as a block/disk device.  We tell GPFS the device name... Then how would GPFS know the difference?

    My question was more about affinity of particular log devices to particular nodes.  

    To rephrase the question.  If I have a particular NSD(LUN) that I want a particular node to use for its logfile, is there a way to encourage that binding?  Kinda like FPO, but regarding logfiles...

  • marc_of_GPFS
    33 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-05T19:28:28Z  

    Where I wrote NVRAM, you may substitute the newer acronyms (which I just learned) BBWC and FBWC - different technologies but same purpose - an effectively fast, non-volatile storage device.

  • dlmcnabb
    1012 Posts

    Re: delete policy crawling through 200m small files

    ‏2014-06-05T19:57:21Z  

    Assume there is a device driver to make the NVRAM appear as a block/disk device.  We tell GPFS the device name... Then how would GPFS know the difference?

    My question was more about affinity of particular log devices to particular nodes.  

    To rephrase the question.  If I have a particular NSD(LUN) that I want a particular node to use for its logfile, is there a way to encourage that binding?  Kinda like FPO, but regarding logfiles...

    No, a log file must be something the FS manager can read and replay when a node dies. If the log is only available on the node, then it is by definition unreadable when the node dies.