Topic
  • 5 replies
  • Latest Post - 2014-06-03T11:50:21Z by marc_of_GPFS
oester
112 Posts

Pinned topic Placement policy - Multiple Tiers

2014-04-29T17:50:29Z

I want to craft a file placement policy so that new files are placed on SSD until that pool fills up, and then go to disk. Is it as simple as setting up a policy like this? I tried using the "GROUP POOL" clause, but the file placement policy rejected it.

RULE 'DEFAULT'
SET POOL ssd
RULE 'DEFAULT'
SET POOL Disk

Corrections or comments welcome.

 

Bob

  • chr78
    132 Posts

    Re: Placement policy - Multiple Tiers

    2014-04-30T11:31:17Z

    I'm not 100% sure, but I assume you need different names for each rule. If you have dataOnly or dataAndMetadata disks in your system pool, I'd suggest adding a

    RULE 'DEFAULT' SET POOL system

    at the end.

    But, back to your question, yes it should be as simple as your example suggests.
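
    For example, reusing your pool names, the two rules just need distinct rule names (a sketch of what that would look like):

    RULE 'to_ssd' SET POOL ssd
    RULE 'DEFAULT' SET POOL Disk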

    cheers.

  • db808
    86 Posts

    Re: Placement policy - Multiple Tiers

    2014-05-05T20:23:40Z

    It sounds like you will need at least THREE pools.

    First, I would recommend a separate pool just for the metadata.  This MUST be the "system" pool.  By having a separate metadata pool, you can independently set the metadata block size ... which you usually want fairly small, like 256 KB (which results in a 256 KB / 32 = 8 KB directory block size).  By having a separate metadata pool (and the corresponding metadata LUNs and NSDs) you will also be able to accurately monitor its size (via mmdf) and performance (by filtering on the metadata LUNs).
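
    As a sketch only (the file system name, stanza file, and sizes are placeholders; check the mmcrfs man page on your level), on GPFS 3.5 and later the metadata block size can be set separately when the file system is created:

    # hypothetical example: 1 MB data blocks, 256 KB metadata blocks
    mmcrfs fs1 -F nsd.stanzas -B 1M --metadata-block-size 256K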

    The metadata pool will be a portion of the SSD space that you have.  You need to estimate the metadata size.  Subtract this size from the total SSD size to get the size of the "SSD_DATA" pool.

    The third pool will be the traditional disk-based pool. 

    By default, GPFS executes the policy rules in order.  If you attempt to assign a file to a pool that is full, then the rule fails and you proceed on to the next rule.  You use the "DEFAULT" rule to catch all remaining files.

    So ...

     

    Step 1: create NSDs specified as "metadataOnly".  They will be assigned to the "system" pool; you cannot change that name.

    Step 2: create NSDs specified as "dataOnly", assigned to the "SSD_DATA" pool.

    Step 3: create NSDs specified as "dataOnly", assigned to the "DISK_DATA" pool.
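
    A minimal NSD stanza file for mmcrnsd might look like the sketch below (the NSD names, device paths, server names, and failure groups are made-up placeholders; adjust them for your cluster):

    # nsd.stanzas -- hypothetical devices and servers
    %nsd: nsd=md_nsd1   device=/dev/sdb  servers=node1,node2  usage=metadataOnly  failureGroup=1  pool=system
    %nsd: nsd=ssd_nsd1  device=/dev/sdc  servers=node1,node2  usage=dataOnly      failureGroup=2  pool=SSD_DATA
    %nsd: nsd=sata_nsd1 device=/dev/sdd  servers=node1,node2  usage=dataOnly      failureGroup=3  pool=DISK_DATA

    # create the NSDs; the same stanza file is then passed to mmcrfs (or mmadddisk)
    mmcrnsd -F nsd.stanzas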

     

    Then your policy file would look like:

    # Metadata automatically assigned to "SYSTEM" pool due to metadata_only specification when creating the metadata NSDs

    # Assign all files that can fit to the SSD_DATA pool up to 90% full

    RULE 'ssd_first' SET POOL 'SSD_DATA' LIMIT(90)

    RULE 'default' SET POOL 'DISK_DATA'

     

    You probably want to use the "LIMIT" clause to leave some headroom for existing files to grow after they have been initially placed.

    I would highly recommend that you run "filehist" from the /usr/lpp/mmfs/samples/debugtools folder to see if there is significant file skew.  You can also easily modify the underlying awk script to report how many files might fit in large inodes.

    In our case, filehist showed significant file size skew.  68% of our files were less than 4 KB and would fit in 4 KB inodes.  So it was a big win for our skewed distribution to use 4 KB inodes, which saved significant space (since a data block fragment did not need to be allocated).

    We also went the extra step ... filehist showed that just a small number of files accounted for the bulk of the data.  Could these files be easily identified by file name extension and pro-actively placed in the DISK_DATA pool?  Using the IBM awk scripts as examples (in the samples/debugtools and samples/util folders), we created custom scripts to analyze file size usage by file extension.  We found out that a few file name extensions (identifying "large" files) accounted for most of the bulk.
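
    Our scripts were built on the GPFS samples, but even a generic sketch like the one below (GNU find plus awk; the mount point is a placeholder, and a serial directory walk is far slower than the GPFS parallel scan tools) gives a first-cut picture of files and bytes per extension:

    find /gpfs/fs1 -type f -printf '%s|%f\n' | \
      awk -F'|' '{ n = split($2, a, "."); ext = (n > 1) ? tolower(a[n]) : "(none)";
                   bytes[ext] += $1; cnt[ext]++ }
             END { for (e in bytes) printf "%-14s %10d files %16d bytes\n", e, cnt[e], bytes[e] }' | \
      sort -k4,4nr | head -20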

    We then added a rule before the "SET POOL" rule to pro-actively force the large files to the disk pool ... leaving more space in the SSD pool for "small" files.  The rule was like:

    RULE 'large_files' SET POOL 'DISK_DATA' WHERE LOWER(NAME) LIKE '%.mxf' OR LOWER(NAME) LIKE '%.mp4' OR LOWER(NAME) LIKE '%.mov'

    Use the file name extensions that your analysis reported were significant for your case.

    You can also later create file migration RULEs that could migrate less-active files from the SSD pool to the disk pool, freeing up space in the SSD pool for new files.
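
    A hedged sketch of such a migration rule (the thresholds and the "coldest files first" weighting are made up for illustration; see the ILM chapter of the Advanced Administration Guide for the full syntax):

    /* start migrating when SSD_DATA hits 80% full, stop at 60%, oldest-access first */
    RULE 'ssd_to_disk' MIGRATE FROM POOL 'SSD_DATA'
         THRESHOLD(80,60)
         WEIGHT(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME))
         TO POOL 'DISK_DATA'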

    Hope this helps.

    Dave B

     

    Updated on 2014-05-05T20:28:06Z by db808
  • marc_of_GPFS
    33 Posts

    Re: Placement policy - Multiple Tiers

    2014-06-02T16:37:31Z
    • db808
    • 2014-05-05T20:23:40Z


    Thanks for your answer Dave!

    I want to second and EMPHASIZE your remark that customers/admins with multiple data pools will almost certainly want to "craft" MIGRATE rules and arrange to run mmapplypolicy periodically and/or triggered by pool occupancy THRESHOLDs. With MIGRATE rules you can move files from one pool to another based on various file attributes, such as last access time, last modification time, FILE_HEAT, PATH_NAME, GROUP_ID, SIZE, and so on... RTFineM: GPFS Advanced Admin Guide, Chapter 2.
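
    A hedged sketch of the plumbing (the file system name, policy file path, and helper script are placeholders): a MIGRATE rule with a THRESHOLD does nothing by itself until mmapplypolicy runs, so you either schedule it or register a callback on the low-space events:

    # run the policy on demand or from cron
    mmapplypolicy fs1 -P /var/mmfs/etc/tier.pol -I yes

    # or have GPFS invoke a script (which runs mmapplypolicy) when a pool crosses its THRESHOLD
    mmaddcallback MIGRATE_ON_THRESHOLD --command /var/mmfs/etc/run_policy.sh \
        --event lowDiskSpace,noDiskSpace --parms "%eventName %fsName"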

    And/or adopt your strategy of segregating files into different pools at creation time by NAME, GROUP_ID, FILESET, ... 

    IF one only deployed simple rules like:

    RULE 'ssd_first' SET POOL 'SSD_DATA' LIMIT(90)

    RULE 'default' SET POOL 'DISK_DATA'

     

    Then once SSD_DATA reached 90% occupancy, all new files would get tossed into DISK_DATA - which is probably not what you want.

     

     

  • db808
    86 Posts

    Re: Placement policy - Multiple Tiers

    2014-06-02T23:27:01Z


    Hi Marc,

    Thank you for the added feedback about migration rules.  I specifically did not spend much time on that because it is relatively well covered in the manual.

    I would like to re-emphasize the suggestion of PLEASE CHECK FOR SKEW.

    GPFS's migration policies are very powerful, but can incur substantial overhead in a very active file system.  Some of our file systems show 100% churn in less than 3 days.  Remember, the policy must scan ALL the files, just to find a small percentage of the files that meet the rule conditions.  Perhaps it is ok to run the policy once a day, but could you run such a migration policy once every 10 minutes?  You probably don't need to ... but I am trying to illustrate a point.

    Migration policies are useful tools, but they do require "work".  The inodes and perhaps the directories are scanned, and then the resulting files are moved ... one read and one write for each candidate file.

    Policies based on inactivity or aging of the data are one class of policies for cost tiering.

    However ... what can be done to help ensure that the file is placed in the proper tier to begin with?

    PLEASE CHECK FOR SKEW ... you may be surprised .... we were.

    That "average" file size of 1 MB ... is not real.  It may be many "tiny" files combined with a small number of massive files.  The "average" may be very different than the median file size.  With a reporting tool as valuable as "filehist", it is worth the effort to at least check.

    "Small" files are more difficult to handle.  The overhead of reading the metadata to traverse the directory tree, and then open/close the file needs to be amortized over a small amount of "data".  If the file is hundreds of megabytes in size, the directory navigation and open/close overhead is trivial.  If the file is "small", you may actually spend more effort doing metadata IO than "data" IO.  In our shop, we have both extremes.  We have some GPFS file systems that average 2.3 metadata IOs for every data IO.  We have other file systems that have several hundred data IOs per metadata IO.

    The following questions are BIG ifs .... but if they are true, they can dramatically simplify GPFS policies, and result in much lower costs for higher-performance pools, such as SSDs.

    IF you are using strong, consistent naming conventions, you may be able to identify different classes of files by their naming convention ... such as file name extension, or a field within the directory path or file name.  Perhaps you put all your "large" files in a "download" directory, for example, or can deduce the typical file "class" by its name.  Perhaps the date of the file is repeated in the file's name.

    Why is this important?  We often want to create GPFS policies based on expected activity or file size ... but these attributes are NOT available at file creation time ... and thus can NOT be used for determining the initial file placement in the "SET POOL" policy.

    However, the file's name, its directory path, and some basic permissions ARE available at file create time, and can be used to trigger a GPFS policy.  For example, we know the bulk of the files with a .jpg extension are tiny and would not take up much space if stored on SSD.  If accessing these files quickly was important, we could pro-actively force the .jpg files to the SSD pool ... no migration rule needed (until we ran out of space).
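
    A sketch of that kind of placement rule, put ahead of the catch-all rules (the extensions are just the example above):

    /* small, latency-sensitive files go to SSD at creation time */
    RULE 'jpg_to_ssd' SET POOL 'SSD_DATA' WHERE LOWER(NAME) LIKE '%.jpg' OR LOWER(NAME) LIKE '%.jpeg'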

    Similarly, if you can identify ultra-large files by file name extension ... you can pro-actively direct them to disk storage, where large block sizes and read-ahead threads yield very cost-effective performance.

    If you take the care to exploit file naming conventions, you can do amazing things with initial placement.  You may be able to recognize "high priority" files and pro-actively put them on higher-performance storage.  You may be able to recognize low-priority files and pro-actively keep them away from the higher-performance storage ... leaving more room for the high-priority files.

    The "middle" is the grey area.  But what IF .... the number of files that were in the "middle" were small in number ... the impact of the mis-placement could be very small also.

    You will never know if you have a SKEWED file distribution ... if you never look.  GPFS's high performance scanning tools like filehist, tsreaddir, tsinode and others can make the analysis relatively easy.

    IF your files are primarily generated by automated workflows and jobs ... do you have control over the file naming conventions and directory structure?  Then you MAY be able to "classify" a file at creation time, and use that classification to drive a GPFS policy.  It could be as simple as using a "temp" directory for transient files and allocating those files on high-performance storage ... with a migration or deletion policy to police the files that were not properly deleted when the job completed.
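
    A hedged sketch of such a clean-up rule for mmapplypolicy (the path pattern and the 7-day age are invented for illustration):

    /* purge leftover transient files a week after their last update */
    RULE 'purge_temp' DELETE FROM POOL 'SSD_DATA'
         WHERE PATH_NAME LIKE '%/temp/%'
           AND (DAYS(CURRENT_TIMESTAMP) - DAYS(MODIFICATION_TIME)) > 7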

     

    In several of our file systems, we actually use the rule:

    RULE 'ssd_first' SET POOL 'SSD_DATA' LIMIT(90)

     

    because we have already pro-actively classified the large files, and assigned them to the disk-based data pool.  The SSD pool is actually only around 45% full based on pro-active classification ... and pretty much stays that way with the existing workflows and naming conventions.  In our SKEWED distribution ... 98.4% of the file names ... account for ~ 15% of the storage.  1.6% of the file names (23 different extensions) account for over 85% of the storage, and are pro-actively assigned to disk-based pools.

    In this file system, we have over 250 different file name extensions (including "no extension") ... but in reality 3 small-file extensions account for about 2/3 of the files, and 6 large-file extensions account for over 80% of the storage.  If I pro-actively deal with these 9 file classes, the remainder is noise.

    Your mileage WILL vary ... but PLEASE CHECK FOR SKEW.

    If you don't have skew ... traditional GPFS policy management is still available.  However, with skewed distributions and strong naming conventions you can exploit GPFS placement policies.

    We ourselves were so surprised about the amount of "hidden" skew we had, that we are now going back and analyzing all of our existing file systems for skew.  If you have identifiable skew, it could significantly change how additional storage might be deployed when/if you get some budget dollars to add capacity or performance.

    Hope this helps.

    Dave B

  • marc_of_GPFS
    33 Posts

    Re: Placement policy - Multiple Tiers

    2014-06-03T11:50:21Z
    • db808
    • 2014-06-02T23:27:01Z


    Hi Dave!  I entirely agree with you.  But I happen to know (via a non-public channel) that your application of GPFS is for a large "digital media" system.  So the "skew", as you call it, may be more extreme than in other GPFS deployments - which can range from office documents to databases, "big science and/or engineering" data, medical image archiving, and yes, entertainment and sports media.

    Still your use of policy rules will be a great example for many customers and admins -- would you be so kind as to post a sample/example of rules here?  Preferably you have applied my policy rule "optimization" recommendations, which I am happy to post here:

    (A)
     
    RULE 'a1' SET POOL 'A' WHERE (condition_1)
    RULE 'a2' SET POOL 'A' WHERE (condition_2)
     
    can be condensed into one rule
     
    RULE 'a_combo' SET POOL 'A' WHERE (condition_1) OR (condition_2)
     
    and that is very likely somewhat "faster" because OR is evaluated in a "shortcut" style, meaning if (condition_1) is true, we skip the evaluation of (condition_2)
     
    (B)
     
    Instead of many LIKE predicates, it is better to use a single invocation of the RegEx (regular expression) function.
    In fact, when the second argument is a constant expression, we optimize by "pre-compiling" the regular expression.  (For details, see POSIX regcomp(3).)
     
    rule 'several-file-types-with-regex' set pool 'B'  where  RegEx(lower(name),['\.avi$|\.mp[34]$|\.[mj]pe?g$'])
     
     is preferable to the equivalent 7(!) LIKE predicates
     
    rule 'several-file-types-with-or-like' set pool 'B' where lower(name) like '%.avi' OR lower(name) like '%.mp3' OR lower(name) like '%.mp4'
              OR lower(name) like '%.mpeg' OR lower(name) like '%.jpeg' OR lower(name) like '%.mpg' OR lower(name) like '%.jpg'
     
     which by guideline (A) is still preferable to the equivalent 7 rules that one would obtain with just one LIKE predicate per rule.

     

    Updated on 2014-06-03T12:10:29Z by marc_of_GPFS