Writing programs that access large files

AIX® supports files that are larger than 2 gigabytes (2 GB). This section helps programmers understand the implications of large files for their applications and modify those applications accordingly. Application programs can be modified, through programming interfaces, to be aware of large files. The file system programming interfaces are generally based on the off_t data type.

Implications for existing programs

The 32-bit application environment that all applications used prior to AIX 4.2 remains unchanged. However, existing application programs cannot handle large files.

For example, the st_size field in the stat structure, which is used to return file sizes, is a signed, 32-bit long. Therefore, that stat structure cannot be used to return file sizes that are larger than LONG_MAX. If an application attempts to use the stat subroutine with a file that is larger than LONG_MAX, the stat subroutine will fail, and errno will be set to EOVERFLOW, indicating that the file size overflows the size field of the structure being used by the program.
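A program that has not been enabled for large files can at least tell this failure apart from other stat errors. The following is a minimal sketch, not a definitive implementation; get_file_size is a hypothetical helper name introduced only for illustration.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not an AIX interface): stores the size of `path`
 * in *size. Returns 0 on success, 1 if the size does not fit in the stat
 * structure this program was compiled with (EOVERFLOW), and -1 for any
 * other error.
 */
int get_file_size(const char *path, off_t *size)
{
    struct stat sb;

    if (stat(path, &sb) == 0) {
        *size = sb.st_size;
        return 0;
    }
    if (errno == EOVERFLOW) {
        fprintf(stderr, "%s: file size overflows this program's off_t\n", path);
        return 1;
    }
    perror(path);
    return -1;
}
```

A caller can treat a return value of 1 as a signal that the file exists but can only be handled by a large-file enabled program.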

This behavior is significant because existing programs that do not appear to be affected by large files will nevertheless fail in their presence, even when the file size is irrelevant to what the program is doing.

The errno value EOVERFLOW can also be returned by the lseek subroutine and by the fcntl subroutine if the values that need to be returned are larger than the data type or structure that the program is using. For lseek, if the resulting offset is larger than LONG_MAX, lseek fails and errno is set to EOVERFLOW. For the fcntl subroutine, if the caller uses F_GETLK and the blocking lock's starting offset or length is larger than LONG_MAX, the fcntl call fails and errno is set to EOVERFLOW.

Open protection

Many existing application programs could have unexpected behavior, including data corruption, if allowed to operate on large files. AIX uses an open-protection scheme to protect applications from this class of failure.

In addition to open protection, a number of other subroutines offer protection by providing an execution environment that is identical to the environment under which these programs were developed. If an application uses the write family of subroutines and the write request crosses the 2 GB boundary, the write subroutines transfer data only up to 2 GB minus 1. If the application attempts to write at or beyond the 2 GB minus 1 boundary, the write subroutines fail and set errno to EFBIG. The mmap, ftruncate, and fclear subroutines behave similarly.
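Because a write request that crosses the boundary is first shortened and only the next request fails, a write loop sees a short count followed by an error. The sketch below shows one way a caller might handle both cases; write_all is a hypothetical helper name, and the retry on EINTR is an added convention rather than something this section prescribes.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: keeps calling write() until all `len` bytes are
 * transferred. Returns `len` on success, or -1 on error with errno set
 * (for example EFBIG once the offset reaches the limit for this
 * descriptor).
 */
ssize_t write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data moved; retry */
            return -1;          /* caller inspects errno, e.g. EFBIG */
        }
        done += (size_t)n;      /* a short write is not an error by itself */
    }
    return (ssize_t)done;
}
```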

The read family of subroutines also participates in the open-protection scheme. If an application attempts to read a file across the 2 GB threshold, only the data up to 2 GB minus 1 will be read. Reads at or beyond the 2 GB minus 1 boundary will fail, and errno will be set to EOVERFLOW.

Open protection is implemented by a flag associated with an open file description. The current state of the flag can be queried with the fcntl subroutine using the F_GETFL command. The flag can be modified with the fcntl subroutine using the F_SETFL command.
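The query and update described above might look like the following sketch. The text does not name the flag; the code assumes it is exposed as O_LARGEFILE in <fcntl.h> and guards the use with #ifdef in case it is not defined. Both function names are hypothetical.

```c
#include <fcntl.h>

/*
 * Hypothetical helper: reports whether `fd` is enabled for large-file
 * access. Returns 1 if enabled, 0 if not, -1 if fcntl fails.
 * Assumption: the open-protection flag is named O_LARGEFILE.
 */
int is_large_file_enabled(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
#ifdef O_LARGEFILE
    return (flags & O_LARGEFILE) != 0;
#else
    return 0;                   /* flag not defined on this platform */
#endif
}

/*
 * Hypothetical helper: turns on large-file access for `fd` while
 * preserving the other status flags. Returns 0 on success, -1 on error.
 */
int enable_large_file(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
#ifdef O_LARGEFILE
    flags |= O_LARGEFILE;
#endif
    return fcntl(fd, F_SETFL, flags) == -1 ? -1 : 0;
}
```

Reading the current flags before calling F_SETFL avoids clearing unrelated status flags such as O_APPEND or O_NONBLOCK.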

Because open file descriptions are inherited across the exec family of subroutines, application programs that pass file descriptors that are enabled for large-file access to other programs should consider whether the receiving program can safely access the large file.

Porting applications to the large file environment

AIX provides two methods for applications to be enabled for large-file access. Application programmers must decide which approach best suits their needs:
  • Define _LARGE_FILES, which carefully redefines all of the relevant data types, structures, and subroutine names to their large-file enabled counterparts. Defining _LARGE_FILES has the advantage of maximizing application portability to other platforms because the application is still written to the normal POSIX and XPG interfaces. It has the disadvantage of creating some ambiguity in the code because the size of the various data items cannot be determined from looking at the code.
  • Recode the application to explicitly call the large-file enabled subroutines. Recoding the application has the disadvantages of requiring more effort and reducing application portability. It can be used when the redefinition effect of _LARGE_FILES would have a considerable negative impact on the program or when it is desirable to convert only a very small portion of the program.

In either case, the application program must be carefully audited to ensure correct behavior in the new environment.

Using _LARGE_FILES

In the default compilation environment, the off_t data type is defined as a signed, 32-bit long. If the application defines _LARGE_FILES before the inclusion of any header files, then the large-file programming environment is enabled and off_t is defined to be a signed, 64-bit long long. In addition, all of the subroutines that deal with file sizes or file offsets are redefined to be their large-file enabled counterparts. Similarly, all of the data structures with embedded file sizes or offsets are redefined.
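As a minimal sketch of this ordering requirement, the fragment below defines _LARGE_FILES ahead of every header and then reports the width of off_t. The helper name off_t_bits is hypothetical. On AIX the expected result in this environment is 64, while other platforms may ignore the macro and size off_t by their own rules.

```c
#define _LARGE_FILES 1          /* must appear before any #include */

#include <limits.h>
#include <sys/types.h>

/*
 * Hypothetical helper: width in bits of off_t as seen by this
 * compilation unit.
 */
int off_t_bits(void)
{
    return (int)(sizeof(off_t) * CHAR_BIT);
}
```

Defining the macro on the compiler command line (for example with a -D option) instead of in the source is a common way to guarantee that it precedes every header in every file.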

The following table shows the redefinitions that occur in the _LARGE_FILES environment:

Entity                      Redefined as      Header file
off_t Object                long long         <sys/types.h>
fpos_t Object               long long         <sys/types.h>
struct stat Structure       struct stat64     <sys/stat.h>
stat Subroutine             stat64()          <sys/stat.h>
fstat Subroutine            fstat64()         <sys/stat.h>
lstat Subroutine            lstat64()         <sys/stat.h>
mmap Subroutine             mmap64()          <sys/mman.h>
lockf Subroutine            lockf64()         <sys/lockf.h>
struct flock Structure      struct flock64    <sys/flock.h>
open Subroutine             open64()          <fcntl.h>
creat Subroutine            creat64()         <fcntl.h>
F_GETLK Command Parameter   F_GETLK64         <fcntl.h>
F_SETLK Command Parameter   F_SETLK64         <fcntl.h>
F_SETLKW Command Parameter  F_SETLKW64        <fcntl.h>
ftw Subroutine              ftw64()           <ftw.h>
nftw Subroutine             nftw64()          <ftw.h>
fseeko Subroutine           fseeko64()        <stdio.h>
ftello Subroutine           ftello64()        <stdio.h>
fgetpos Subroutine          fgetpos64()       <stdio.h>
fsetpos Subroutine          fsetpos64()       <stdio.h>
fopen Subroutine            fopen64()         <stdio.h>
freopen Subroutine          freopen64()       <stdio.h>
lseek Subroutine            lseek64()         <unistd.h>
ftruncate Subroutine        ftruncate64()     <unistd.h>
truncate Subroutine         truncate64()      <unistd.h>
fclear Subroutine           fclear64()        <unistd.h>
pwrite Subroutine           pwrite64()        <unistd.h>
pread Subroutine            pread64()         <unistd.h>
struct aiocb Structure      struct aiocb64    <sys/aio.h>
aio_read Subroutine         aio_read64()      <sys/aio.h>
aio_write Subroutine        aio_write64()     <sys/aio.h>
aio_cancel Subroutine       aio_cancel64()    <sys/aio.h>
aio_suspend Subroutine      aio_suspend64()   <sys/aio.h>
aio_return Subroutine       aio_return64()    <sys/aio.h>
aio_error Subroutine        aio_error64()     <sys/aio.h>
liocb Structure             liocb64           <sys/aio.h>
lio_listio Subroutine       lio_listio64()    <sys/aio.h>

Using 64-bit file system subroutines

Using the _LARGE_FILES environment may be impractical for some applications due to the far-reaching implications of changing the size of off_t to 64 bits. If the number of changes is small, it may be more practical to convert a relatively small part of the application to be large-file enabled. The 64-bit file system data types, structures, and subroutines are listed below:
<sys/types.h>
typedef long long off64_t;
typedef long long fpos64_t;

<fcntl.h>

extern int      open64(const char *, int, ...);
extern int      creat64(const char *, mode_t);

#define F_GETLK64
#define F_SETLK64
#define F_SETLKW64

<ftw.h>
extern int ftw64(const char *, int (*)(const char *,const struct stat64 *, int), int);
extern int nftw64(const char *, int (*)(const char *, const struct stat64 *, int,struct FTW *),int, int);

<stdio.h>

extern int      fgetpos64(FILE *, fpos64_t *);
extern FILE     *fopen64(const char *, const char *);
extern FILE     *freopen64(const char *, const char *, FILE *);
extern int      fseeko64(FILE *, off64_t, int);
extern int      fsetpos64(FILE *, fpos64_t *);
extern off64_t  ftello64(FILE *);

<unistd.h>

extern off64_t  lseek64(int, off64_t, int);
extern int      ftruncate64(int, off64_t);
extern int      truncate64(const char *, off64_t);
extern off64_t  fclear64(int, off64_t);
extern ssize_t  pread64(int, void *, size_t, off64_t);
extern ssize_t  pwrite64(int, const void *, size_t, off64_t);
extern int      fsync_range64(int, int, off64_t, off64_t);

<sys/flock.h>

struct flock64;

<sys/lockf.h>

extern int lockf64 (int, int, off64_t);

<sys/mman.h>

extern void     *mmap64(void *, size_t, int, int, int, off64_t);

<sys/stat.h>

struct stat64;

extern int      stat64(const char *, struct stat64 *);
extern int      fstat64(int, struct stat64 *);
extern int      lstat64(const char *, struct stat64 *);

<sys/aio.h>

struct aiocb64;
int     aio_read64(int, struct aiocb64 *);
int     aio_write64(int, struct aiocb64 *);
int     aio_listio64(int, struct aiocb64 *[],
        int, struct      sigevent *);
int     aio_cancel64(int, struct aiocb64 *);
int     aio_suspend64(int, struct aiocb64 *[]);

struct liocb64;
int     lio_listio64(int, struct liocb64 *[], int, void *);
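The sketch below shows how a single routine might call the explicit 64-bit interfaces while the rest of the program keeps the default off_t. The helper name file_size64 is hypothetical. The _LARGEFILE64_SOURCE definition is an assumption for C libraries outside AIX that hide the *64 names by default; whether AIX needs a comparable macro depends on the compilation environment and is not stated in this section.

```c
#define _LARGEFILE64_SOURCE 1   /* assumption: needed by some non-AIX C
                                   libraries to declare the *64 names */

#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the size of `path` by way of open64()
 * and fstat64(), or -1 on error.
 */
off64_t file_size64(const char *path)
{
    struct stat64 sb;
    off64_t size = -1;
    int fd = open64(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat64(fd, &sb) == 0)
        size = sb.st_size;      /* st_size is 64 bits wide in struct stat64 */
    close(fd);
    return size;
}
```

Confining the 64-bit types to a small set of functions like this keeps the audit surface small, which is the main reason to prefer this approach over _LARGE_FILES.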

Common pitfalls in using the large file environment

Porting of application programs to the large-file environment can expose a number of different problems in the application. These problems are frequently the result of poor coding practices, which are harmless in a 32-bit off_t environment, but which can manifest themselves when compiled in a 64-bit off_t environment. Some of the more common problems and solutions are discussed in this section.

Note: In the following examples, off_t is assumed to be a 64-bit file offset.

Improper use of data types

A common source of problems with application programs is a failure to use the proper data types. If an application attempts to store file sizes or file offsets in an integer variable, the resulting value will be truncated and lose significance. To avoid this problem, use the off_t data type to store file sizes and offsets.

Incorrect:

int file_size;
struct stat s;

file_size = s.st_size;

Better:

off_t file_size;
struct stat s;
file_size = s.st_size;
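Where a legacy interface still requires an int, an explicit range check makes the loss of significance visible instead of silent. This is a minimal sketch; size_fits_in_int is a hypothetical helper, and long long stands in for a 64-bit off_t.

```c
#include <limits.h>

/*
 * Hypothetical helper: returns 1 if a 64-bit file size can be stored in
 * an int without loss, 0 otherwise.
 */
int size_fits_in_int(long long size)
{
    return size >= INT_MIN && size <= INT_MAX;
}
```

With a 32-bit int, a 3 GB size fails this check, so the caller can report an error rather than continue with a truncated value.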

When you are passing 64-bit integers to functions as arguments or when you are returning 64-bit integers from functions, both the caller and the called function must agree on the types of the arguments and the return value.

Passing a 32-bit integer to a function that expects a 64-bit integer causes the called function to misinterpret the caller's arguments, leading to unexpected behavior. This type of problem is especially severe if the program passes scalar values to a function that expects to receive a 64-bit integer.

You can avoid problems by using function prototypes carefully. In the code fragments below, fexample() is a function that takes a 64-bit file offset as a parameter. In the first example, the compiler generates the normal 32-bit integer function linkage, which would be incorrect because the receiving function expects 64-bit integer linkage. In the second example, the LL suffix is added to the constant, forcing the compiler to use the proper linkage. In the last example, the function prototype causes the compiler to promote the scalar value to a 64-bit integer. This is the preferred approach because the source code remains portable between 32-bit and 64-bit environments.

Incorrect:

fexample(0);

Better:

fexample(0LL);  

Best:
