LSF uses accounting files to track resource allocation and usage for all finished jobs. This is the primary purpose of the lsb.acct file. This accounting file is a plain text file with one job record per line. Each job record includes job submission options, allocation information, start time, end time and resource usage. By default, there is only one lsb.acct accounting file per cluster, located in the LSF working directory, which is defined by ${LSB_SHAREDIR}/<cluster_name>/logdir.
LSF logs all finished job records to lsb.acct. By default, no automatic file archiving or rollover mechanism is enabled, so the lsb.acct file can grow very large, taking up space in the LSF working directory and slowing down commands that query the accounting file (bacct, for instance).
LSF offers several features to help manage accounting files automatically, including automatic accounting file archiving and automatic deletion of old accounting files.
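To see how large the current accounting file has become, a short script can report its size and record count. The following is a minimal sketch, not an LSF tool: the function names are hypothetical, it assumes LSB_SHAREDIR is set in the environment (or passed in), it relies on the one-record-per-line layout described above, and it treats 1 KB as 1024 bytes.

```python
import os
from pathlib import Path


def acct_file_path(cluster_name, sharedir=None):
    """Build the default location: $LSB_SHAREDIR/<cluster_name>/logdir/lsb.acct."""
    # Assumption: LSB_SHAREDIR is exported unless a directory is passed explicitly.
    base = sharedir if sharedir is not None else os.environ["LSB_SHAREDIR"]
    return Path(base) / cluster_name / "logdir" / "lsb.acct"


def acct_file_stats(path):
    """Return (size in KB, record count), counting one job record per line."""
    size_kb = os.path.getsize(path) / 1024  # assumption: 1 KB = 1024 bytes
    with open(path, "rb") as f:
        records = sum(1 for _ in f)
    return size_kb, records
```

Comparing the reported size against a candidate ACCT_ARCHIVE_SIZE value gives a quick sense of whether archiving is overdue.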
Automatic accounting file archiving
Automatic accounting file archiving can be triggered by several conditions: time of day, size of the accounting file, and age of accounting files. All of these are configured through parameters in lsb.params.
- ACCT_ARCHIVE_AGE=<days> defines an archiving interval. LSF archives the current accounting file if the length of time from its creation date exceeds the specified number of days.
- ACCT_ARCHIVE_SIZE=<kilobytes> defines an accounting file size threshold. LSF archives the current accounting file if its size exceeds the specified number of kilobytes.
- ACCT_ARCHIVE_TIME=<hh:mm> defines a time of day for LSF to automatically archive the current accounting file.
ACCT_ARCHIVE_AGE and ACCT_ARCHIVE_TIME are mutually exclusive, since ACCT_ARCHIVE_AGE is configured in days, while ACCT_ARCHIVE_TIME sets a specific time of day. If ACCT_ARCHIVE_TIME is defined, ACCT_ARCHIVE_AGE is ignored. ACCT_ARCHIVE_SIZE can coexist with the other two parameters. Run badmin reconfig after you change any of the parameters.
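The precedence rule can be expressed as a small check to run against a planned configuration. This is an illustrative sketch only: the function name and the dict representation of lsb.params settings are assumptions, and it does not parse a real lsb.params file.

```python
def effective_archive_triggers(params):
    """Return the archive triggers that remain in effect.

    `params` is a hypothetical dict of lsb.params settings, for example
    {"ACCT_ARCHIVE_AGE": "30", "ACCT_ARCHIVE_SIZE": "500000"}.
    """
    triggers = {}
    # ACCT_ARCHIVE_SIZE can coexist with either of the other two parameters.
    if "ACCT_ARCHIVE_SIZE" in params:
        triggers["ACCT_ARCHIVE_SIZE"] = params["ACCT_ARCHIVE_SIZE"]
    # ACCT_ARCHIVE_TIME wins: if it is defined, ACCT_ARCHIVE_AGE is ignored.
    if "ACCT_ARCHIVE_TIME" in params:
        triggers["ACCT_ARCHIVE_TIME"] = params["ACCT_ARCHIVE_TIME"]
    elif "ACCT_ARCHIVE_AGE" in params:
        triggers["ACCT_ARCHIVE_AGE"] = params["ACCT_ARCHIVE_AGE"]
    return triggers
```

For instance, passing both ACCT_ARCHIVE_AGE and ACCT_ARCHIVE_TIME returns only the time-of-day trigger (plus the size trigger, if set), which makes an accidentally ignored setting easy to spot.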
When LSF archives the accounting file, it saves the old accounting file as lsb.acct.n, where n is an index number of archived files. The bigger the index number, the older the archived file. lsb.acct.1 always represents the latest archived accounting file.
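Because the index is numeric, a plain alphabetical sort would place lsb.acct.10 before lsb.acct.2. The sketch below (the helper name is hypothetical) orders archive file names from newest to oldest by parsing the index.

```python
import re

# Matches archived files only (lsb.acct.<n>), not the current lsb.acct.
ARCHIVE_RE = re.compile(r"^lsb\.acct\.(\d+)$")


def archives_newest_first(filenames):
    """Return archived accounting file names, lowest index (newest) first."""
    indexed = []
    for name in filenames:
        match = ARCHIVE_RE.match(name)
        if match:
            indexed.append((int(match.group(1)), name))
    return [name for _, name in sorted(indexed)]
```

Given `["lsb.acct", "lsb.acct.10", "lsb.acct.2", "lsb.acct.1"]`, the function returns `["lsb.acct.1", "lsb.acct.2", "lsb.acct.10"]`.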
Automatic deletion of old accounting files
Configure MAX_ACCT_ARCHIVE_FILE in lsb.params to enable automatic deletion of archived accounting files. Automatic file deletion controls the total number of archived files, and prevents LSF from archiving accounting files indefinitely. MAX_ACCT_ARCHIVE_FILE defines the maximum number of archived files under the LSF working directory. When the total number of archived files reaches the configured threshold, LSF automatically removes the oldest archived file. With this feature, you can control the total number of accounting archives stored under the working directory and make sure accounting files don't consume too much disk space.
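The interaction between archiving and the file cap can be pictured with an in-memory model. This is a sketch of the behavior described above, not LSF's implementation: the function is hypothetical, and it assumes the cap is enforced right after each archive step, which the text does not specify.

```python
def rotate(archives, current, max_files=None):
    """Model one archive step.

    archives:  archived contents, newest first (archives[0] stands for lsb.acct.1).
    current:   content of the current lsb.acct.
    max_files: stands in for MAX_ACCT_ARCHIVE_FILE; None means no cap.
    Returns the new archive list; a fresh, empty lsb.acct would follow.
    """
    # The current file becomes lsb.acct.1 and every existing index shifts up by one.
    rotated = [current] + list(archives)
    if max_files is not None:
        # Assumption: the oldest archives beyond the cap are removed immediately.
        rotated = rotated[:max_files]
    return rotated
```

With a cap of 2, archiving "day3" on top of `["day2", "day1"]` yields `["day3", "day2"]`: the oldest archive is dropped and the remaining ones are renumbered.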
Guidelines and best practices for automatic archiving
In a real production environment, you need to choose parameters and values based on the specific needs of your site. Here are some general guidelines and best practices:
- The size of a job finish record varies with job characteristics, from a few hundred bytes to tens of thousands of bytes. By default, the bacct command reads only the current lsb.acct file and scans all accounting records in it to generate its output, so both the size of lsb.acct and the number of records it holds affect bacct performance. For example, an accounting file with 400,000 job finish records could be as large as 1 GB, and bacct may take as much as 1 minute to complete. To ensure good bacct performance, the recommended setting for ACCT_ARCHIVE_SIZE is 500 MB (500,000 KB).
- If your cluster runs a large number of jobs on a daily basis (say, a million or more), use ACCT_ARCHIVE_TIME so that the current accounting file does not accumulate too many job records. Set ACCT_ARCHIVE_TIME to midnight, so that accounting file archiving does not affect user queries.
- If the daily job throughput of your cluster is a few thousand or even tens of thousands of jobs, use ACCT_ARCHIVE_AGE. Base the value on the approximate number of job finish records you want to keep in the accounting file. You need to balance bacct command performance against the number of days of records you want to maintain. For instance, if the cluster's daily throughput is 50,000 jobs and you want to keep 1 month of job finish records in the current accounting file, the file will hold around 1.5 million finish records. In this case, the bacct command could take several minutes to complete.
- Set MAX_ACCT_ARCHIVE_FILE based on the total number of archives you plan to maintain in the working directory. Some bacct command options, like -C, -D, or -S, make bacct search through the entire accounting file archive. When these options are used, bacct runs slower the more archives you keep in the directory. Furthermore, since LSF needs to rename archive files during the archiving process, keeping a large number of archives can also impact user interaction. Consider these performance implications when choosing a reasonable value for MAX_ACCT_ARCHIVE_FILE.
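The sizing arithmetic behind these guidelines can be reproduced with a short estimator. This is a rough sketch under stated assumptions: the function names are hypothetical, and the default of 2,500 bytes per record is simply the 1 GB / 400,000 records figure above taken with decimal units; real records range from a few hundred bytes to tens of thousands, so measure your own average before relying on the result.

```python
def estimate_acct_file(jobs_per_day, days, bytes_per_record=2500):
    """Return (record count, approximate size in MB) after `days` of logging."""
    records = jobs_per_day * days
    size_mb = records * bytes_per_record / 1_000_000
    return records, size_mb


def days_until_size_limit(jobs_per_day, limit_kb=500_000, bytes_per_record=2500):
    """Approximate days of logging before the file reaches `limit_kb` (decimal KB)."""
    return limit_kb * 1000 / (jobs_per_day * bytes_per_record)


# 50,000 jobs per day kept for 30 days: 1.5 million records, about 3,750 MB.
print(estimate_acct_file(50_000, 30))
# At the same rate, a 500,000 KB limit is reached after about 4 days.
print(days_until_size_limit(50_000))
```

Under these assumptions, a site with that throughput would hit a 500 MB size threshold long before a 30-day age threshold, which is worth knowing when combining ACCT_ARCHIVE_SIZE with ACCT_ARCHIVE_AGE.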
Examples
- Site A has a high-throughput cluster whose daily throughput reaches 800,000 jobs. The administrator wants the current accounting file to reflect only the current day's information and wants to keep accounting archives for 10 days. Here is the configuration:
ACCT_ARCHIVE_TIME=23:00
MAX_ACCT_ARCHIVE_FILE=10
- Site B runs a lot of parallel workload but fewer than 10,000 jobs daily. The administrator wants to maintain 1 month of accounting records in the current accounting file. Each accounting file should be kept within 500 MB in size, and archives should be kept for 2 years. Here is the configuration:
ACCT_ARCHIVE_AGE=30
ACCT_ARCHIVE_SIZE=500000
MAX_ACCT_ARCHIVE_FILE=24
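The MAX_ACCT_ARCHIVE_FILE values in both examples follow from multiplying the period covered by each archive by the number of archives kept. A minimal sketch of that arithmetic (the function name is hypothetical):

```python
def retention_days(days_per_archive, max_archive_files):
    """Approximate history kept in archives, assuming each archive spans a fixed period."""
    return days_per_archive * max_archive_files


site_a = retention_days(1, 10)   # daily archives, 10 kept: 10 days
site_b = retention_days(30, 24)  # 30-day archives, 24 kept: 720 days, roughly 2 years
print(site_a, site_b)
```

For Site B this figure is an upper bound: if the file reaches the ACCT_ARCHIVE_SIZE threshold before 30 days have passed, archives are created more often and 24 of them cover less than 2 years.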