Topic
  • 3 replies
  • Latest Post - 2012-09-11T19:24:07Z by SystemAdmin
HajoEhlers
253 Posts

Pinned topic Howto concatenate the result of a LIST policy using more than 1 node

2012-09-11T09:11:26Z
The rule looks like:

    RULE EXTERNAL LIST 'listfiles' EXEC '/gpfs/bin/policycat'
    RULE 'xl' LIST 'listfiles' DIRECTORIES_PLUS
      SHOW( CASE WHEN MISC_ATTRIBUTES LIKE '%D%'
              THEN '^_"' || PATH_NAME || '/"^_'
              ELSE '^_"' || SUBSTR(PATH_NAME,1,LENGTH(PATH_NAME)-LENGTH(NAME)-1) || '/"^_'
            END )
      WHERE ( DAYS(CURRENT_TIMESTAMP) - DAYS(MODIFICATION_TIME) < 9999 )
    /* EO POLICY */


The external script /gpfs/bin/policycat looks like:

    #!/usr/bin/ksh

    case $1 in              # We are invoked from the policy
      LIST) cat "$2"; rc=0 ;;
      TEST) rc=0 ;;         # Respond with success
      *)    rc=1 ;;         # Command not supported by this script
    esac
    exit $rc

So far, so good. But how do I get the final result?
A "mmapplypolicy /gpfs/tmp -P ListPolicy -I yes" gives a mixture of info on stdout AND stderr.

A "mmapplypolicy /gpfs/tmp -P ListPolicy -I yes -L 0" gives the output from the external script on STDERR!

My understanding is that -L 0 is the default anyway for mmapplypolicy - why the different behavior?
How can I be sure that with -L 0 I get only the file list and no other error messages from mmapplypolicy?
A "mmapplypolicy /gpfs/tmp -P ListPolicy -I defer" gives a file list, but the external script has not been used, so further processing must be done.

If the external script writes the result to a file - as in the following example:


    ...
    LIST) cat "$2" >> /tmp/policy.result; rc=0 ;;
    ...


I will end up with a /tmp/policy.result on EACH node taking part in the policy run.
Even if the result is written to a shared filesystem, I have no guarantee that the writes from the nodes do not interfere with one another.

So the latter approach can only be used if a single node runs the policy.

So my current options for parallel execution are:
* use "-L 0", with the risk of getting additional output on STDERR:

mmapplypolicy /gpfs/tmp -P ListPolicy -I yes -L 0

* use "-I defer", with the drawback that the result (which might be very large) must be processed further, since the external script is not used at all:

mmapplypolicy /gpfs/tmp -P ListPolicy -I defer
MyScript /tmp/....mapplypolicy.list

So the question arises:
What does the command look like to invoke a list policy on more than one node, letting each node also use the external script, while collecting the final result on a single node?

tia
Hajo
Updated on 2012-09-11T19:24:07Z by SystemAdmin
  • SystemAdmin
    2092 Posts

    Re: Howto concatenate the result of a LIST policy using more than 1 node

    2012-09-11T16:13:56Z

    How to aggregate results from your scripts running in parallel under mmapplypolicy


    • IF your policy EXEC 'my-script' is doing trivial processing of the filelist file (the $2 argument), then you really are best off using -I defer. No complications. You will avoid all the overhead of creating and distributing the (sub)list files, and starting all of those processes.
      • If you've already written your my-script to process a policy file LIST file, you can invoke it AFTER the mmapplypolicy -I defer command has completed. e.g.
        
        policy-file:
          rule 'L1' external list 'XL' exec '/home/me/my-script.sh'
          rule 'L2' list 'XL' SHOW(...whatever...) WHERE ...sql-logical-condition...

        command-line:
          mkdir /yadah/results
          mmapplypolicy /gpfs/tree-of-file -P policy-file -I defer -f /yadah/results -N all -g /gpfs/temp/scratch ...
          /home/me/my-script LIST /yadah/results/list.XL
        

    • ON THE OTHER HAND - If you have some non-trivial processing to do with your 'my-script', and thousands or millions of files to process, then you will likely benefit from the parallel execution facility built into mmapplypolicy. In the common case, where the only expected outputs from 'my-script' to STDOUT or STDERR are informational, diagnostic or error messages, mmapplypolicy makes your programming/admin life relatively easy. Write your my-script to read the pathnames, and any other information about each file you require, from the simple file list format - and do its thing.

    • HOWEVER - If each invocation of my-script produces some results that you wish to aggregate in some way, then you need to do some more design work and programming to accomplish this. I can think of several ways this can be accomplished. Let me show you three here, ranging from simple to more and more powerful and complex:
      • Just "print" (printf, echo) the results you want to aggregate from your my-script, BUT tag each line so you can easily pick out the results from the other messages that might be intermixed in the output stream from mmapplypolicy. For example, you may output lines like this:
        
        [U:1] some important output from my-script
        [U:2] some other important output from my-script
        [U:3] ...

        In this example "U" is a unique code identifying my-script output, and we can further qualify it into categories U:1, U:2, U:3.
        
        Collect ALL of the STDOUT and/or STDERR output from mmapplypolicy into one or two files using the shell '1>file.a', '2>&1', and/or '2>file.b' operators. Then post-process your output files with another script, perhaps using grep, awk, perl to find and process the lines marked with [U:codes]. You can depend on mmapplypolicy to use the [I], [W], [E] codes for its own informational, warning and error messages.
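A small sketch of that post-processing step (the contents of run.out and the tag values are hypothetical samples; in practice run.out would be the capture from 'mmapplypolicy ... 1>run.out 2>&1'):

```shell
# Hypothetical captured output of an mmapplypolicy run; the [U:1] lines
# were printed by my-script, the [I]/[W] lines by mmapplypolicy itself.
cat > run.out <<'EOF'
[I] GPFS Current Data Pool Utilization ...
[U:1] "/gpfs/tmp/dir-a/"
[W] some warning from mmapplypolicy
[U:1] "/gpfs/tmp/dir-b/"
EOF

# Keep only our tagged result lines, stripping the [U:1] marker:
grep '^\[U:1\] ' run.out | sed 's/^\[U:1\] //' > results.txt
cat results.txt
```

The same pattern works per category: one grep per tag (U:1, U:2, ...) splits the stream into separate result files.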
      • Establish an output-results-directory in a shared (gpfs) filesystem, and have each instance of my-script put its results into that directory. You may use the $2 argument to your my-script to form a filename that is guaranteed to be unique for each instantiation. For example:
        
        # $2 argument pathnames always look like /abc/tdir/mmPolicy.ix.3079.155D6D8B.4
        # the first number is a process-id; the second a random number; the third a counter.
        # re-use the numbers to manufacture my own output filename:
        # ... compute some-results; find or determine a shared output directory ...
        echo $some-results > $output-results-directory/result.${2##*.ix.}
        
        After the mmapplypolicy command completes, all your results will be in your output-results-directory; you may then run additional scripts or programs to "reduce" those files to the final results you require.
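The filename trick above can be sketched as follows (the pathname is the sample from the comment; inside my-script the variable would be $2 rather than the hypothetical $filelist):

```shell
# Sample $2 filelist pathname, as described above (hypothetical value):
filelist=/abc/tdir/mmPolicy.ix.3079.155D6D8B.4

# Strip everything up to and including the last ".ix." to get the
# pid.random.counter suffix, unique per script instantiation:
suffix=${filelist##*.ix.}
echo "result.$suffix"
```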
      • my-script may access any file or any database. Aggregate the results of each instance of my-script by having my-script update the appropriate files and/or databases. Use locking and/or communication protocols appropriate for parallel processing, keeping in mind that you may have multiple instances of my-script running, with multiple processes per GPFS node (host machine).

    • Keep in mind that you can control the number of files in the filelist passed to my-script with the -B option; the number of concurrent processes running my-script per node with the -m option; which nodes run my-script with the -N option.
  • HajoEhlers
    253 Posts

    Re: Howto concatenate the result of a LIST policy using more than 1 node

    2012-09-11T18:54:44Z

    Thanks again Marc,

    So even if the policy is run on a single node, because of the -m option (default is 24 threads) the list script will be started up to 24 times?

    This could explain why my output is mangled.

    Cheers
    Hajo
  • SystemAdmin
    2092 Posts

    Re: Howto concatenate the result of a LIST policy using more than 1 node

    2012-09-11T19:24:07Z
    Thanks again Marc,

    So even if the policy is run on a single node, because of the -m option (default is 24 threads) the list script will be started up to 24 times?

    This could explain why my output is mangled.

    Cheers
    Hajo
    Yes, well almost correct.

    -m 24 means 24 concurrently.

    So with the default -B 100, we process about 2400 "at once".
    If you had 24000 files to process, your script would be started 240 times.
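    The arithmetic behind those numbers can be sketched as follows (the 24000-file count is the example from this post; -B and -m values are the defaults stated above):

```shell
# Invocation-count arithmetic for the mmapplypolicy defaults
# discussed above (example numbers, for illustration only):
files=24000      # files matched by the LIST rule
B=100            # -B: files per filelist handed to my-script
m=24             # -m: concurrent my-script processes per node

invocations=$((files / B))   # total script invocations
batch=$((B * m))             # files "in flight" at once per node
echo "invocations=$invocations files-at-once=$batch"
```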