Using data provenance tools
Specify the LSF data provenance tools as bsub job submission options.
Procedure
- Enable data provenance by defining LSB_DATA_PROVENANCE=Y as an environment variable or by using the esub.dprov application.
The esub.dprov script automatically defines LSB_DATA_PROVENANCE=Y for the job and takes input file names as parameters.
- To specify the environment variable at job submission time:
bsub -env "all, LSB_DATA_PROVENANCE=Y" … command
- To specify the esub.dprov application at job submission time:
bsub -a 'dprov(/path/to/input.file)' … command
- To specify the esub.dprov application as a mandatory esub for all job submissions, add dprov to the list of applications in the LSB_ESUB_METHOD parameter in the lsf.conf file:
LSB_ESUB_METHOD="dprov"
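Before relying on the mandatory esub, it can help to confirm that dprov really is listed in LSB_ESUB_METHOD. The following is a minimal sketch, not an LSF-provided tool. The function name has_dprov_esub is hypothetical, and the sketch assumes that the parameter is written on a single line and that the application names are separated by spaces.

```shell
#!/bin/sh
# Hypothetical helper (not part of LSF): succeed if "dprov" is listed in the
# LSB_ESUB_METHOD parameter of the given lsf.conf file.
# Assumption: the parameter is on one line and names are space-separated.
has_dprov_esub() {
    conf_file=$1
    # Keep only uncommented LSB_ESUB_METHOD lines, strip the parameter name
    # and the quotation marks, and use the last definition in the file.
    methods=$(sed -n 's/^[[:space:]]*LSB_ESUB_METHOD[[:space:]]*=[[:space:]]*//p' "$conf_file" \
        | tail -n 1 | tr -d '"')
    # Compare each listed application name against "dprov".
    for m in $methods; do
        if [ "$m" = "dprov" ]; then
            return 0
        fi
    done
    return 1
}
```

A typical call might be has_dprov_esub "$LSF_ENVDIR/lsf.conf" && echo "dprov is a mandatory esub", assuming that LSF_ENVDIR points to the directory that contains lsf.conf.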
- Attach provenance data to the job-generated files by using the predefined script tag.sh as a post-execution script.
- To specify the tag.sh post-execution script at job submission time:
bsub -Ep 'tag.sh' ... command
- To specify the tag.sh post-execution script at the application- or queue-level, specify POST_EXEC=tag.sh in the lsb.applications or lsb.queues file.
For example:
- bsub -env "all, LSB_DATA_PROVENANCE=Y" -Ep 'tag.sh' myjob
- bsub -a 'dprov(/home/userA/test1)' -Ep 'tag.sh' myjob
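When many jobs share the same provenance options, a small wrapper can assemble the command line for review before anything is submitted. This is only a sketch: the function name dprov_bsub_cmd is hypothetical, it handles a single input file, and it prints the command rather than running it, so arguments that contain spaces are shown unquoted.

```shell
#!/bin/sh
# Hypothetical helper (not part of LSF): print, but do not run, a bsub
# command that enables data provenance through esub.dprov and attaches
# tag.sh as the post-execution script.
#   $1      input file to pass to dprov
#   $2...   the job command and its arguments
dprov_bsub_cmd() {
    input_file=$1
    shift
    printf "bsub -a 'dprov(%s)' -Ep 'tag.sh'" "$input_file"
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}
```

For instance, dprov_bsub_cmd /home/userA/test1 myjob prints the second example command shown above.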
You can edit the tag.sh script to customize data provenance for your specific environment.
All environment variables that are set for a job are also set when the data provenance scripts run for that job.
The following additional environment variables apply only to the data provenance environment (that is, they are available to the predefined tag.sh script that is used for data provenance):
- LSB_DP_SUBCMD: The bsub job submission command.
- LSB_DP_STDINFILE: The standard input file for the job, as defined in the bsub -i option.
- LSB_DP_SUBFILES_index: The source files on the submission host (to be copied to the execution host), as defined in the bsub -f option.
- LSB_DP_EXECFILES_index: The destination files on the execution host (copied from the submission host), as defined in the bsub -f option.
- LSB_DP_FILES: The number of files to be copied, as defined in the bsub -f option.
- LSB_DP_INPUTFILES_index: The files that are defined in the esub.dprov script.
- LSB_DP_INPUTFILES: The number of files that are defined in the esub.dprov script.
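A customized tag.sh might need to walk through the indexed variables in the list above. The following sketch shows one way to do that. The function name list_dp_files is hypothetical, and because the starting value of index is not stated here, the sketch matches any numeric suffix instead of assuming that numbering begins at 0 or 1.

```shell
#!/bin/sh
# Hypothetical helper for a customized tag.sh: print the values of all
# environment variables named <prefix>_<index>, one per line, ordered by
# index. The count variable itself (for example LSB_DP_INPUTFILES) has no
# numeric suffix, so it is not matched.
list_dp_files() {
    prefix=$1
    env \
        | sed -n "s/^${prefix}_\([0-9][0-9]*\)=/\1 /p" \
        | sort -n \
        | cut -d' ' -f2-
}
```

Inside a post-execution script, list_dp_files LSB_DP_INPUTFILES would then list the input files that were passed to esub.dprov, and list_dp_files LSB_DP_EXECFILES would list the destination files from the bsub -f option.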
- Optional. Use the showhist.py script to show the history information of the job data file.
showhist.py file_name
showhist.py generates an image that shows the relationships among the data files.
What to do next
The data provenance script files (esub.dprov, tag.sh, and showhist.py) are all located in the LSF_TOP/10.1/misc/examples/data_prov directory. Optionally, you can edit these files to customize the data provenance for your specific environment.