Considerations for the use of direct I/O (O_DIRECT)

When a file is opened with the O_DIRECT flag (direct I/O mode), GPFS transfers data directly between the user buffer and the file on the disk.

Using direct I/O may provide some performance benefits in the following cases:
  • The file is accessed at random locations.
  • There is no access locality, so caching the data in the page pool provides little benefit.
Direct transfer between the user buffer and the disk can only happen if all of the following conditions are true:
  • The number of bytes transferred is a multiple of 512 bytes.
  • The file offset is a multiple of 512 bytes.
  • The user memory buffer address is aligned on a 512-byte boundary.
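
As an illustration, the following sketch opens a file with O_DIRECT and issues an aligned read. The file name and transfer size are placeholders; posix_memalign() is used to satisfy the buffer-alignment requirement.

    #define _GNU_SOURCE          /* required for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define ALIGN 512            /* alignment for 512-byte sector disks */
    #define XFER  (16 * ALIGN)   /* transfer size: a multiple of 512 bytes */

    int main(void)
    {
        void *buf;

        /* The user buffer address must lie on a 512-byte boundary. */
        if (posix_memalign(&buf, ALIGN, XFER) != 0) {
            perror("posix_memalign");
            return 1;
        }

        /* "datafile" is a placeholder path on a GPFS file system. */
        int fd = open("datafile", O_RDONLY | O_DIRECT);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* The file offset (0 here) and the byte count are both
           multiples of 512, so GPFS can transfer the data directly. */
        if (pread(fd, buf, XFER, 0) < 0)
            perror("pread");

        close(fd);
        free(buf);
        return 0;
    }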

When any of these conditions is not met, the operation still proceeds, but it is treated like normal buffered file I/O with the O_SYNC flag set; that is, the dirty buffers are flushed to disk before the operation completes.

When these conditions are all true, the GPFS page pool is not used, because the data is transferred directly. Therefore, an environment in which most of the I/O volume is due to direct I/O (such as in databases) does not benefit from a large page pool. However, the page pool still needs to be configured to an adequate size, or left at its default value, because it is also used to store file metadata, especially the indirect blocks required for large files.

In IBM Spectrum Scale 5.0.5 and later, the minIndBlkDescs configuration parameter can be used to set the cache size for indirect blocks. Applications that perform direct I/O on very large files might benefit from this parameter: specifying a larger value for minIndBlkDescs ensures that more indirect block descriptors are cached in memory. For more information, see the mmchconfig command.

When direct I/O is done on 4K sector disks, the alignment requirement for the number of bytes, the file offset, and the user memory buffer is 4K bytes instead of 512 bytes. If such an alignment is not in place during an individual direct I/O operation, the request is then treated like normal I/O, as explained earlier.
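
The alignment check itself is straightforward. The helper below is a minimal sketch; the 4096-byte value is assumed for 4K sector disks, and 512 applies otherwise.

    #include <stdint.h>
    #include <sys/types.h>

    /* Returns 1 if the request satisfies the direct I/O alignment rules
       for the given sector size (512 or 4096), 0 otherwise. */
    static int dio_aligned(const void *buf, off_t offset, size_t nbytes,
                           size_t sector)
    {
        return (uintptr_t)buf % sector == 0 &&
               (size_t)offset % sector == 0 &&
               nbytes % sector == 0;
    }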

With direct I/O, the application is responsible for coordinating access to the file; neither the overhead nor the protection of the GPFS locking mechanisms plays a role. In particular, if two threads or nodes perform direct I/O concurrently on overlapping portions of the file, the outcome is undefined. For example, when multiple writes are made to the same file offsets, it is undetermined which of the writes will be reflected in the file when all I/O has completed. In addition, if the file has data replication, it is not guaranteed that all the replicas will contain the data from the same writer; that is, the contents of the replicas could diverge.

Even when the I/O requests are aligned as previously listed, in the following cases GPFS will not transfer the data directly and will revert to the slower buffered behavior:
  • The write causes the file to increase in size.
  • The write is in a region of the file that has been preallocated (via gpfs_prealloc()) but has not yet been written (see the sketch after this list).
  • The write is in a region of the file where a hole is present; that is, the file is sparse and has some unallocated regions.
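
One way to keep subsequent writes on the direct path is to initialize the file up front. The sketch below is a hedged example, assuming the gpfs.h header and the gpfs_prealloc() interface from the GPFS installation (link with -lgpfs); the file size and error handling are simplified. It preallocates the space and then zero-fills it once with ordinary writes, so that later O_DIRECT writes do not land in preallocated-but-unwritten or sparse regions.

    #include <fcntl.h>
    #include <gpfs.h>        /* gpfs_prealloc(); link with -lgpfs */
    #include <unistd.h>

    #define FILESIZE (1024LL * 1024 * 1024)   /* hypothetical 1 GiB file */

    int init_file(const char *path)
    {
        int fd = open(path, O_CREAT | O_WRONLY, 0644);
        if (fd < 0)
            return -1;

        /* Reserve disk space for the whole file. */
        if (gpfs_prealloc(fd, 0, FILESIZE) != 0) {
            close(fd);
            return -1;
        }

        /* Zero-fill once, so later O_DIRECT writes do not fall into
           preallocated-but-unwritten regions (which would force the
           buffered I/O path). */
        static char buf[65536];   /* zero-initialized */
        for (long long off = 0; off < FILESIZE; off += sizeof buf)
            if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf) {
                close(fd);
                return -1;
            }

        close(fd);
        return 0;
    }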
When direct I/O requests are aligned and none of the previously listed conditions (which would cause the buffered I/O path to be taken) are present, the handling is further optimized: the request is handled entirely in kernel mode by the GPFS kernel module, without involving the GPFS daemon. Any of the following conditions, however, still results in the request going through the daemon:
  • The I/O operation needs to be served by an NSD server.
  • The file system has data replication. In the case of a write operation, the GPFS daemon is involved to produce the log records that ensure that the replica contents are identical (in case of a failure while writing the replicas to disk).
  • The operation is performed on the Windows operating system.

Note that setting the O_DIRECT flag on an already open file with fcntl(fd, F_SETFL, ...), which may be allowed on Linux®, is ignored in a GPFS file system.
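
For example, the following attempt to enable direct I/O after the file is open has no effect on GPFS; the flag must be supplied to open() instead (a minimal sketch with a placeholder path):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    void enable_direct_io(const char *path)
    {
        /* Adding O_DIRECT with F_SETFL may be allowed on Linux,
           but GPFS ignores it. */
        int fd = open(path, O_RDWR);
        int flags = fcntl(fd, F_GETFL);
        fcntl(fd, F_SETFL, flags | O_DIRECT);   /* no effect on GPFS */
        close(fd);

        /* The flag must be passed at open() time instead. */
        int dfd = open(path, O_RDWR | O_DIRECT);
        close(dfd);
    }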

Because of a limitation in Linux, I/O operations with O_DIRECT should not be issued concurrently with a fork(2) system call invoked by the same process. Any call to fork() should be made only after all O_DIRECT I/O operations have completed; that is, fork() must not be invoked while O_DIRECT operations are still pending. For more information, see the open(2) system call in the Linux documentation.
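
As an illustration, a program that uses asynchronous I/O would drain all outstanding O_DIRECT requests before forking. The sketch below uses POSIX AIO (link with -lrt on older glibc); the aiocb is assumed to describe an O_DIRECT request that is already in flight.

    #include <aio.h>
    #include <errno.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Wait until the outstanding request completes; only then is it
       safe to call fork(). */
    void finish_then_fork(struct aiocb *cb)
    {
        const struct aiocb *list[1] = { cb };

        /* Block until the asynchronous request is no longer in progress. */
        while (aio_error(cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);
        aio_return(cb);      /* collect the result */

        if (fork() == 0)
            _exit(0);        /* child: no O_DIRECT I/O was pending */
        wait(NULL);
    }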