-data

Specifies data requirements for a job.

Categories

properties, resource

Synopsis

-data "[host_name:]abs_file_path [[host_name:]abs_file_path ...]" ... [-datagrp "user_group_name"] [-datachk] | -data "tag:[tag_owner@]tag_name [tag:[tag_owner@]tag_name ...]" ...
-data "[host_name:]abs_folder_path[/[*]] [[host_name:]abs_folder_path[/[*]] ...]" ... [-datagrp "user_group_name"] [-datachk] | -data "tag:[tag_owner@]tag_name [tag:[tag_owner@]tag_name ...]" ...

Description

You can specify data requirements for the job in three ways:
  • As a list of files for staging.
  • As an absolute path to a folder that contains files for staging.
  • As a list of arbitrary tag names.

Your job can specify multiple -data options, but all the requested data requirements must be either tags, files, or folders. You can specify individual files, folders, or data specification files in each -data clause. You cannot mix tag names with file specifications in the same submission. The requirements of all -data clauses, including the requirements that are specified inside specification files, are combined into a single space-separated string of file requirements.
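
The rule against mixing tags with file specifications can be checked before submission. The following is a minimal sketch, not part of LSF: the function name data_req_kind is hypothetical, and it assumes that any requirement that starts with tag: is a tag and that everything else (files, folders, and data specification files) counts as a file specification.

```shell
#!/bin/sh
# Hypothetical pre-submission helper (not an LSF command).
# Prints "tags" or "files" when all requirements are of one kind,
# and prints "mixed" with a nonzero status when tags and file
# specifications appear together.
data_req_kind() {
    kind=""
    for req in "$@"; do
        case "$req" in
            tag:*) this="tags" ;;    # assumption: the tag: prefix marks a tag
            *)     this="files" ;;   # files, folders, data specification files
        esac
        if [ -z "$kind" ]; then
            kind="$this"
        elif [ "$kind" != "$this" ]; then
            echo "mixed"
            return 1
        fi
    done
    echo "${kind:-none}"
}

# Usage: both requirements are file specifications.
data_req_kind "hostA:/proj/std/model.tar" "/tmp/dataspec.user1"
```

A wrapper script could call this helper on every -data argument and refuse to run bsub when the result is mixed.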

Valid file or tag names can contain only alphanumeric characters ([A-Z|a-z|0-9]), the period (.), the underscore (_), and the dash (-). The tag or file name cannot contain the special operating system names for parent directory (../), current directory (./), or user home directory (~/). File, folder, and tag names cannot contain spaces. Tag names cannot begin with a dash character.
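
The naming rules for tags can be expressed as a short shell check. This is an illustrative sketch only: is_valid_tag_name is a hypothetical helper, not an LSF command, and it covers tag names rather than full file paths.

```shell
#!/bin/sh
# Hypothetical helper (not an LSF command): succeeds only when the
# argument satisfies the tag naming rules described above.
is_valid_tag_name() {
    case "$1" in
        "")                return 1 ;;  # empty name
        -*)                return 1 ;;  # must not begin with a dash
        *[!A-Za-z0-9._-]*) return 1 ;;  # disallowed character, such as a space or slash
    esac
    return 0
}

# Usage: report whether a candidate tag name is acceptable.
if is_valid_tag_name "SEQ_DATA_READY"; then
    echo "valid"
else
    echo "invalid"
fi
```

Because the slash is not an allowed character, the same check also rejects ../, ./, and ~/ in a tag name.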

A list of files must contain absolute paths to the required source files. The list of files can refer to actual data files your job requires or to a data specification file that contains a list of the data files. The first line of a data specification file must be #@dataspec.
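
A data specification file can be generated by a script. The sketch below assumes only what is stated above, that the first line must be #@dataspec and that each following line holds one requirement. The function name write_dataspec is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not an LSF command): writes a data specification
# file whose first line is #@dataspec, followed by one requirement per line.
write_dataspec() {
    out_file=$1
    shift
    {
        printf '%s\n' '#@dataspec'
        for entry in "$@"; do
            printf '%s\n' "$entry"
        done
    } > "$out_file"
}

# Usage: write two requirements to a temporary file and display it.
spec_file=$(mktemp)
write_dataspec "$spec_file" \
    "hostA:/proj/std/model.tar" \
    "hostB:/data/human/seq342138.dna"
cat "$spec_file"
rm -f "$spec_file"
```

The resulting file path can then be passed to the -data option in place of the individual file paths.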

A data requirement that is a folder must contain an absolute path to the folder. The colon character (:) is supported in the path of a job requirement.

A host name is optional. The default host name is the submission host. The host name is the host that LSF attempts to stage the required data file from. You cannot substitute IP addresses for host names.

By default, if the CACHE_PERMISSIONS=group parameter is specified in lsf.datamanager, the data manager uses the primary group of the user who submitted the job to control access to the cached files. Only users that belong to that group can access the files in the data requirement. Use the -datagrp option to specify a particular user group for access to the cache. The user who submits the job must be a member of the specified group. You can specify only one -datagrp option per job.

The tag_name element is any string that can be used as a valid data tag.

If the CACHE_ACCESS_CONTROL = Y parameter is configured in the lsf.datamanager file, you can optionally specify a user name for the tag owner by using the form tag:[tag_owner@]tag_name.

By default, the following operations are performed in the transfer job rather than by the bsub command: the sanity check that the files or folders exist and that the user can access them, discovery of the size and modification of the files or folders, and generation of the hash. This equalizes submission performance between jobs with and without data requirements. The -datachk option forces the bsub and bmod commands to perform these operations. If the data requirement is for a tag, the -datachk option has no effect.

You can use the %I runtime variable in file paths. The variable resolves to an array index in a job array submission. If the job is not an array, %I in the data specification is interpreted as 0. You cannot use the %I variable in tag names.
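
To preview which paths an array job would request, the %I substitution can be imitated in the shell. LSF performs the real substitution itself. The function expand_array_index below is a hypothetical illustration that replaces %I with a given index and falls back to 0 when no index is supplied, mirroring the non-array case.

```shell
#!/bin/sh
# Hypothetical illustration (not an LSF command): replace every %I in a
# path with the given array index, or with 0 when no index is given.
expand_array_index() {
    index=${2:-0}
    printf '%s\n' "$1" | sed "s/%I/$index/g"
}

# Usage: list the paths that elements 1 to 3 of an array would resolve to.
i=1
while [ "$i" -le 3 ]; do
    expand_array_index "hostA:/proj/std/input_%I.dat" "$i"
    i=$((i + 1))
done
```

A listing like this can help confirm that every indexed input file exists before the array is submitted.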

The path to the data requirement files or the files that are listed inside data specification files can contain wildcard characters. All paths that are resolved from wildcard characters are interpreted as files to transfer. The path to a data specification file cannot contain wildcard characters.

The use of wildcard characters has the following rules:
  • A path that ends in a slash followed by an asterisk (/*) is interpreted as a directory. For example, a path like /data/august/* transfers only the files in the august directory, without recursion into subdirectories.
  • A path that ends in a slash (/) means that you want to transfer all of the files in the folder and all files in all of its subfolders, recursively.
  • You can use the asterisk (*) wildcard character only at the end of the path. For example, a path like /data/*/file_*.* is not supported. The following path is correct: /data/august/*. The * is subject to expansion by the shell and must be quoted properly.
  • When wildcard characters are expanded, symbolic links in a path are expanded. If a broken symbolic link is detected, the job submission fails.
  • When you use the asterisk character at the end of the path, the data file requirements must be in quotation marks.
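
The wildcard rules above can be summarized as a small classifier. This is a sketch under stated assumptions, not LSF behavior: classify_data_path is a hypothetical helper, it treats any asterisk that is not the last character of the path as unsupported, and it treats every path that does not end in /* or / as a plain file path.

```shell
#!/bin/sh
# Hypothetical helper (not an LSF command): classify a data requirement
# path according to the wildcard rules described above.
classify_data_path() {
    path=$1
    stem=${path%\*}              # drop one trailing asterisk, if present
    case "$stem" in
        *\**) echo "unsupported"; return 1 ;;  # asterisk before the end of the path
    esac
    case "$path" in
        */\*) echo "top-level" ;;  # files in the folder only, no recursion
        */)   echo "recursive" ;;  # folder and all of its subfolders
        *)    echo "file" ;;       # assumption: anything else is a plain file path
    esac
}

# Usage: single quotes keep the shell from expanding the asterisk.
classify_data_path '/data/august/*'
classify_data_path '/data/august/'
```

The single quotes in the usage lines reflect the quoting requirement from the rules: an unquoted asterisk would be expanded by the shell before the helper, or bsub, sees it.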

Examples

The following job requests three data files for staging, listed in the -data option:
bsub -o %J.out -data "hostA:/proj/std/model.tar hostA:/proj/user1/case02341.dat hostB:/data/human/seq342138.dna" /share/bin/job_copy.sh
 Job <1962> is submitted to default queue <normal>. 
The following job requests the same data files for staging. Instead of listing them individually in the -data option, the required files are listed in a data specification file named /tmp/dataspec.user1:
bsub -o %J.out -data "/tmp/dataspec.user1" /share/bin/job_copy.sh
 Job <1963> is submitted to default queue <normal>. 
The data specification file /tmp/dataspec.user1 contains the paths to the required files:
cat /tmp/dataspec.user1
 #@dataspec
 hostA:/proj/std/model.tar
 hostA:/proj/user1/case02341.dat
 hostB:/data/human/seq342138.dna
The following job requests all data files in the folder /proj/std/ on hostA. Only the contents of the top level of the folder are requested.
bsub -o %J.out -data "hostA:/proj/std/*" /share/bin/job_copy.sh
 Job <1964> is submitted to default queue <normal>. 
The following job requests all data files in the folder /proj/std/ on hostA, including all files in all of its subfolders, which are requested recursively as the required data files.
bsub -o %J.out -data "hostA:/proj/std/" /share/bin/job_copy.sh
 Job <1964> is submitted to default queue <normal>. 
The following command submits an array job that requests data files by array index. All data files must exist.
bsub -o %J.out -J "A[1-10]" -data "hostA:/proj/std/input_%I.dat" /share/bin/job_copy.sh
 Job <1965> is submitted to default queue <normal>. 
The following job requests data files by tag name.
bsub -o %J.out -J "A[1-10]" -data "tag:SEQ_DATA_READY" -data "tag:SEQ_DATA2" /share/bin/job_copy.sh
 Job <1966> is submitted to default queue <normal>. 
The following job requests the data file /proj/std/model.tar, which belongs to the user group design1:
bsub -o %J.out -data "hostA:/proj/std/model.tar" -datagrp "design1" my_job.sh
 Job <1962> is submitted to default queue <normal>.