Using data provenance tools

Specify the LSF data provenance tools as bsub job submission options.

Procedure

  1. Enable data provenance by defining LSB_DATA_PROVENANCE=Y as an environment variable or by using the esub.dprov application.

    The esub.dprov script automatically defines LSB_DATA_PROVENANCE=Y for the job and takes input file names as parameters.

    • To specify the environment variable at job submission time:

      bsub -e LSB_DATA_PROVENANCE=y command

    • To specify the esub.dprov application at job submission time:

      bsub -a 'dprov(/path/to/input.file)' command

    • To specify the esub.dprov application as a mandatory esub for all job submissions, add dprov to the list of applications in the LSB_ESUB_METHOD parameter in the lsf.conf file:

      LSB_ESUB_METHOD="dprov"

  2. Attach provenance data to the job-generated files by using the predefined script tag.sh as a post-execution script.
    • To specify the tag.sh post-execution script at job submission time:

      bsub -Ep 'tag.sh' ... command

    • To specify the tag.sh post-execution script at the application- or queue-level, specify POST_EXEC=tag.sh in the lsb.applications or lsb.queues file.
    For example:
    • bsub -e LSB_DATA_PROVENANCE=y -Ep 'tag.sh' myjob
    • bsub -a 'dprov(/home/userA/test1)' -Ep 'tag.sh' myjob
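
    As a sketch of the application- and queue-level settings, the following fragments show where POST_EXEC=tag.sh could be placed. The application profile name dprov_app and the queue name dprov_queue are hypothetical, all other parameters are omitted, and depending on your installation tag.sh might need to be given with its full path.

    ```
    # lsb.applications (hypothetical application profile)
    Begin Application
    NAME      = dprov_app
    POST_EXEC = tag.sh
    End Application

    # lsb.queues (hypothetical queue)
    Begin Queue
    QUEUE_NAME = dprov_queue
    POST_EXEC  = tag.sh
    End Queue
    ```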

    You can edit the tag.sh script to customize data provenance for your specific environment.

    All environment variables that are set for a job are also set in the data provenance environment for that job.

    The following additional environment variables apply only to the data provenance environment (that is, the following environment variables are available to the predefined tag.sh script that is used for data provenance):
    • LSB_DP_SUBCMD: The bsub job submission command.
    • LSB_DP_STDINFILE: The standard input file for the job, as defined in the bsub -i option.
    • LSB_DP_SUBFILES_index: The source files on the submission host (to be copied to the execution host), as defined in the bsub -f option.
    • LSB_DP_EXECFILES_index: The destination files on the execution host (copied from the submission host), as defined in the bsub -f option.
    • LSB_DP_FILES: The number of files to be copied, as defined in the bsub -f option.
    • LSB_DP_INPUTFILES_index: The files that are defined in the esub.dprov script.
    • LSB_DP_INPUTFILES: The number of files that are defined in the esub.dprov script.
  3. Optional. Use the showhist.py script to show the history information of the job data file.
    showhist.py file_name

    showhist.py generates an image that shows the relationships among the data files.

What to do next

The data provenance script files (esub.dprov, tag.sh, and showhist.py) are all located in the LSF_TOP/10.1/misc/examples/data_prov directory. Optionally, you can edit these files to customize the data provenance for your specific environment.
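
When customizing tag.sh, one common need is to read the input files that esub.dprov recorded. The following is a minimal sketch, not the shipped tag.sh logic: list_dp_input_files is a hypothetical helper name, and because this topic does not state whether the index in LSB_DP_INPUTFILES_index starts at 0 or 1, the loop checks every index from 0 up to the count and skips names that are not set.

```shell
#!/bin/sh
# Hypothetical helper for a customized tag.sh: print each input file that
# esub.dprov recorded, one per line.
# Assumptions: LSB_DP_INPUTFILES holds the number of files, and each file name
# is stored in LSB_DP_INPUTFILES_<index>. The starting index is not documented
# here, so indexes 0 through the count are probed and unset ones are skipped.
list_dp_input_files() (
    count="${LSB_DP_INPUTFILES:-0}"
    i=0
    while [ "$i" -le "$count" ]; do
        # i is an integer generated by this loop, so eval only builds a
        # variable name such as LSB_DP_INPUTFILES_1 and reads its value.
        eval "value=\${LSB_DP_INPUTFILES_${i}:-}"
        if [ -n "$value" ]; then
            printf '%s\n' "$value"
        fi
        i=$((i + 1))
    done
)
```

Calling list_dp_input_files inside a post-execution script would then print the recorded input files, for example to feed them to whatever tagging mechanism your environment uses. The function body runs in a subshell, so its working variables do not leak into the calling script.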