
Developing a file system for AIX

Srikanth Srinivasan (ssrikanth@in.ibm.com), Staff Software Engineer, IBM
Srikanth Srinivasan is a Staff Software Engineer for the IBM India Systems and Technology Labs. His main focus is on parallel file systems and the Linux kernel. He is part of the IBM General Parallel File systems (GPFS) development team, involved in developing various features for GPFS. Srikanth holds a bachelor's degree in Electronics and Communications Engineering from the University of Madras, India. You can reach him at ssrikanth@in.ibm.com.

Summary:  Learn the intricacies of the AIX® file system framework. Every operating system provides a native kernel framework that kernel developers have to understand and adhere to when developing a piece of a kernel component for that operating system. This article sheds some light on the AIX file system framework. You need to understand the framework in order to develop a new file system, or to port an existing file system to the AIX operating system.

Date:  29 May 2007
Level:  Intermediate

Introduction

AIX 5L™ is an award-winning operating system that delivers superior scalability, reliability, and manageability. It is the default operating system that powers some of the most powerful IBM UNIX® servers in the market.

Typically, a file system can be defined as a piece of software that helps in storing, organizing, and retrieving data from a physical storage medium, be it a hard disk drive, CD-ROM, or any other storage device. The code for such data organization, by its very nature, should be portable. In the real world, though, every operating system provides its own interfaces by which it requests a particular file system operation, and it is expected that the underlying piece of software provides results in the format that the operating system expects. The interfaces vary with different flavors of operating systems, and need to be exported by the file system to be supported on the particular operating system.

In this article, you'll learn about the AIX® operating system file system framework. You'll also get an overview of the IO layer and an explanation of some important concepts. The article also briefly explains the interfaces and methods you need when developing a new file system, or when porting an existing file system to the AIX operating system.

AIX, like many UNIX flavors, hosts the file system as a kernel extension. It is assumed that you have basic knowledge of UNIX programming and file system concepts. It would also be helpful to know how to write kernel extensions for AIX.

Understanding the logical file system and the virtual file system

The logical file system layer is the level of abstraction at which users can request the various file operations, such as read, write, stat, and so on. The logical file system interface supports UNIX-type file access semantics. The logical file system layer acts as a superset of the virtual file system, which encapsulates disparate file systems and provides the kernel with a consistent view of the underlying directory tree. The logical file system is also responsible for managing the kernel's open file table and the per-process file descriptor information.

The virtual file system is an abstraction of the underlying physical file system. The virtual file system provides a standard set of interfaces that you should support in order for your file system to be hosted over the AIX operating system. The virtual file system bridges the underlying disparate physical file system to the logical file system, providing a consistent directory tree hierarchy to the rest of the operating system.

Each unique mount instance of a file system object is represented by a virtual file system structure. A virtual file system can be a physical file system, a network file system, or a logical file system (one that does not have a physical backing store, such as ramfs). Figure 1 shows the AIX file system hierarchy.


Figure 1. The AIX file system hierarchy

As shown in Listing 1, the virtual file system is maintained as a linked list of struct vfs, as denoted by the member vfs_next.
Listing 1. Virtual file system structure

<sys/vfs.h>

struct vfs {
  struct vfs      *vfs_next;      
  struct gfs      *vfs_gfs;       
  struct vnode    *vfs_mntd;      
                                        
  struct vnode    *vfs_mntdover;  
                                        
  struct vnode    *vfs_vnodes;    
  int             vfs_count;      
  caddr_t         vfs_data;       
  unsigned int    vfs_number;     
  int             vfs_bsize;      
#ifdef  _SUN
  short           vfs_exflags;    
  unsigned short  vfs_exroot;     
#else
  short           vfs_rsvd1;      
  unsigned short  vfs_rsvd2;      
#endif  /* _SUN */
  struct vmount   *vfs_mdata;     
  Simple_lock     vfs_lock;       
};

Each entry in the list represents a mounted file system object.

vfs_mntd
vfs_mntd represents the mount point vnode over which this file system has been mounted. For the '/' root file system, this will be the root vnode.
vfs_vnodes
vfs_vnodes is a linked list of all the vnodes for this mounted instance.
vfs_lock
You use vfs_lock to serialize the access to the vnode.
vfs_gfs
vfs_gfs points to the struct gfs structure for the corresponding file system.
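
To make the linked-list arrangement concrete, here is a minimal sketch of walking the list of mounted instances through vfs_next. The struct demo_vfs type and the demo_vfs_find and demo_vfs_count helpers are hypothetical, simplified stand-ins written for this illustration; they are not the AIX definitions and ignore locking.

```c
#include <stddef.h>

/* Simplified stand-in for struct vfs: only the members used here. */
struct demo_vfs {
    struct demo_vfs *vfs_next;    /* next mounted instance in the list */
    unsigned int     vfs_number;  /* identifier of this mount instance */
    int              vfs_bsize;   /* block size of the mounted file system */
};

/* Walk the list from head and return the entry with the given number,
 * or NULL when no mounted instance matches. */
struct demo_vfs *demo_vfs_find(struct demo_vfs *head, unsigned int number)
{
    for (struct demo_vfs *vfsp = head; vfsp != NULL; vfsp = vfsp->vfs_next) {
        if (vfsp->vfs_number == number)
            return vfsp;
    }
    return NULL;
}

/* Count the mounted instances on the list. */
int demo_vfs_count(const struct demo_vfs *head)
{
    int n = 0;
    for (; head != NULL; head = head->vfs_next)
        n++;
    return n;
}
```

A real implementation would presumably hold the appropriate lock while traversing the list; the sketch leaves that out.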

struct gfs contains information pertaining to a file system that is independent of the mounted instances. It contains all the common traits of a file system layout, as shown in Listing 2 below. Each file system registered with the operating system has exactly one struct gfs, while there are one or more struct vfs instances, one for each mounted instance. The important members of the struct gfs are gfs_ops and gn_ops, which represent the virtual file system operations and the vnode operations for the file system. You should supply the virtual file system operations and vnode operations for the file system to be supported on the AIX operating system.


Listing 2. gfs structure
<sys/vfs.h>

struct gfs {
  struct vfsops   *gfs_ops;
  struct vnodeops *gn_ops;
  int             gfs_type;       
  char            gfs_name[16];   
  int             (*gfs_init)(struct gfs *);  
  int             gfs_flags;      
  caddr_t         gfs_data;       
  int             (*gfs_rinit)(void);
  int             gfs_hold;       
};
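
The relationship between one struct gfs and several mounted instances can be sketched as follows. All types here are reduced stand-ins, the "demofs" file system is hypothetical, and treating gfs_hold as a count of mounted instances is an assumption made for the example.

```c
#include <stddef.h>

struct demo_vfs;                       /* forward declaration */

/* Simplified stand-in for the vfs operation table. */
struct demo_vfsops {
    int (*vfs_mount)(struct demo_vfs *);
};

/* One of these exists per file system type. */
struct demo_gfs {
    struct demo_vfsops *gfs_ops;       /* per-type vfs operations */
    char                gfs_name[16];  /* file system type name */
    int                 gfs_hold;      /* assumed: number of mounted instances */
};

/* One of these exists per mounted instance. */
struct demo_vfs {
    struct demo_gfs *vfs_gfs;          /* type descriptor shared by all mounts */
    int              vfs_mounted;      /* hypothetical flag set by mount */
};

/* Hypothetical mount routine for the "demofs" type. */
static int demofs_mount(struct demo_vfs *vfsp)
{
    vfsp->vfs_mounted = 1;
    vfsp->vfs_gfs->gfs_hold++;
    return 0;
}

static struct demo_vfsops demofs_vfsops = { demofs_mount };
struct demo_gfs demofs_gfs = { &demofs_vfsops, "demofs", 0 };

/* Dispatch a mount request through the type's operation table. */
int demo_mount(struct demo_vfs *vfsp)
{
    return vfsp->vfs_gfs->gfs_ops->vfs_mount(vfsp);
}
```

Two mounts of the same type share the single demofs_gfs descriptor, so the operation table is supplied once regardless of how many instances are mounted.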

vnode, gnode, and inode

The vnode represents an open object in the file system. Every attempt to create or open a file results in a vnode object being created for the file. If a vnode already exists, the reference count is incremented. The vnode remains alive in the virtual file system until the last holder goes away, thereby bringing the reference count to 0. The main job of the vnode is to translate the pathname to a physical object in the underlying file system. An object can have more than one pathname if:

  • The file system is mounted more than once over different mount points.
  • There is a soft link or hard link to the object.

However, attempts to open synonymous paths do not result in multiple vnodes being created. A gnode can be referred to by multiple vnodes only when there is more than one mounted instance of the file system. At any given instance, there is at least one vnode referring to a gnode. Listing 3 shows the vnode structure.


Listing 3. vnode structure
<sys/vnode.h>

struct vnode {
  ushort         v_flag;         
  ushort         v_flag2;        
  ulong32int64_t v_count; 
  int            v_vfsgen;       
  Simple_lock    v_lock;     
  struct vfs     *v_vfsp;     
  struct vfs     *v_mvfsp;    
                               
  struct gnode   *v_gnode;  
  struct vnode   *v_next;   
  struct vnode   *v_vfsnext; 
  struct vnode   *v_vfsprev; 
  union v_data {
    void         * _v_socket;      
    struct vnode * _v_pfsvnode;     
  } _v_data;
  char           *v_audit;        
};

v_vfsp
v_vfsp represents the containing vfs object. If this vnode is a mount point for a different file system, v_mvfsp holds the struct vfs of the file system mounted over it.
v_gnode
v_gnode points back to the gnode of the object. Each physical object in a file system is represented by a unique gnode. Unlike the vnode, there is only one gnode per object, regardless of the number of mounted instances of a file system. There is a one-to-one correspondence between a gnode and a file on the disk.

The open object in each mounted instance is represented by a unique vnode. Thus, the vnode can be said to contain the open, instance-specific information, and the gnode encapsulates the object itself.
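
The split between per-instance vnodes and a single per-object gnode, together with reference counting, might look like the following sketch. The demo_ types and helpers are hypothetical simplifications and are not the kernel's own hold and release routines.

```c
#include <stddef.h>

struct demo_vnode;

/* One gnode per object in the file system. */
struct demo_gnode {
    struct demo_vnode *gn_vnode;   /* head of the list of vnodes for this object */
};

/* One vnode per object per mounted instance. */
struct demo_vnode {
    unsigned long      v_count;    /* reference count */
    struct demo_gnode *v_gnode;    /* the object this vnode refers to */
    struct demo_vnode *v_next;     /* next vnode referring to the same gnode */
};

/* Attach a vnode to a gnode with an initial reference count of 1. */
void demo_vnode_attach(struct demo_vnode *vp, struct demo_gnode *gp)
{
    vp->v_count = 1;
    vp->v_gnode = gp;
    vp->v_next = gp->gn_vnode;
    gp->gn_vnode = vp;
}

/* Take an additional reference on the vnode. */
void demo_hold(struct demo_vnode *vp)
{
    vp->v_count++;
}

/* Drop one reference; returns 1 when the last holder has gone away. */
int demo_rele(struct demo_vnode *vp)
{
    if (vp->v_count > 0)
        vp->v_count--;
    return vp->v_count == 0;
}
```

With two mounted instances of the same file system, the same file would be represented by two vnodes chained from one gnode.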


Listing 4. gnode structure
<sys/vnode.h>

struct gnode {
  enum vtype      gn_type;             
  short           gn_flags;              
  vmid_t          gn_seg;                 
  long32int64_t   gn_mwrcnt;      
  long32int64_t   gn_mrdcnt;     
  long32int64_t   gn_rdcnt;     
  long32int64_t   gn_wrcnt;     
  long32int64_t   gn_excnt;     
  long32int64_t   gn_rshcnt;    
  struct vnodeops *gn_ops;
  struct vnode    *gn_vnode; 
  dev_t           gn_rdev;       
  chan_t          gn_chan;       
  Simple_lock     gn_reclk_lock;  
  int             gn_reclk_event; 
  struct filock   *gn_filocks;     
  caddr_t         gn_data;       
};

Some important members of the struct gnode are:

gn_vnode
gn_vnode points to the list of vnodes referring to this gnode.
gn_ops
gn_ops is the standard vnode operations that you should register for your file system.
gn_data
gn_data points to the file system-specific information for the object, generally the inode.

The inode object is a file system data structure that represents the information associated with the file and makes sense only to that file system. Every gnode has an associated inode in the file system. A gnode stores the generic characteristics of the file, and the inode is designed to store information specific to the file system. The inode is encapsulated within the gnode in the gn_data member variable.

AIX does not dictate any interfaces for designing an inode. It is up to you to design your own inode and interpret it per the design. Typically, an inode provides information such as the ownership (group and user), access permissions, creation or modification time, and so on.
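
Since the inode layout is left to the implementer, the following is only one possible arrangement: a hypothetical demofs_inode that embeds a reduced gnode and links back to itself through gn_data. None of these names come from AIX headers.

```c
#include <stddef.h>
#include <sys/types.h>
#include <time.h>

/* Reduced stand-in for struct gnode. */
struct demo_gnode {
    void *gn_data;              /* file system-specific data, here the inode */
};

/* Hypothetical inode chosen by the file system author. */
struct demofs_inode {
    struct demo_gnode i_gnode;  /* generic part embedded in the inode */
    uid_t  i_uid;               /* owning user */
    gid_t  i_gid;               /* owning group */
    mode_t i_mode;              /* type and access permissions */
    time_t i_mtime;             /* last modification time */
};

/* Initialize an inode and point its gnode back at it. */
void demofs_inode_init(struct demofs_inode *ip, uid_t uid, gid_t gid, mode_t mode)
{
    ip->i_uid = uid;
    ip->i_gid = gid;
    ip->i_mode = mode;
    ip->i_mtime = 0;
    ip->i_gnode.gn_data = ip;
}

/* Recover the inode from a gnode handed in by the upper layers. */
struct demofs_inode *demofs_gtoi(struct demo_gnode *gp)
{
    return (struct demofs_inode *)gp->gn_data;
}
```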

The virtual file system operations

The AIX vnode and virtual file system framework provides the well-defined interfaces known as the virtual file system and vnode operations. The virtual file system operations are used to support file system management operations, such as mounting and unmounting of a file system, syncing the file system with its backing store during an unmount operation, and querying the various ancillary provisions available with the file system. Listing 5 shows mount and unmount operations.

All the virtual file system operations return 0 on successful completion. In the event of a failure, an error number from /usr/include/sys/errno.h is returned.


Listing 5. Mount and unmount operations
int (*vfs_mount)(struct vfs *, struct ucred *);

int (*vfs_unmount)(struct vfs *, int, struct ucred *);

vfs_mount
This interface is called by the logical file system layer to invoke the mounting of the requested file system. It is passed an initialized struct vfs, which the routine should fill and return. The vfs_mount interface returns 0 on a successful mount and an error on failure.
vfs_unmount
This routine is called for unmounting a file system, either through an explicit call to the umount command or during operating system shutdown.
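
As a rough illustration of the return conventions, the sketch below shows a mount routine that fills in a simplified vfs and an unmount routine that tears it down. The signatures are deliberately reduced (no struct ucred, no flags), the demofs_ names are hypothetical, and the use of vfs_count as an "active users" counter is an assumption of this example.

```c
#include <errno.h>
#include <stdlib.h>

/* Reduced stand-in for struct vfs. */
struct demo_vfs {
    void *vfs_data;   /* private data of the file system implementation */
    int   vfs_bsize;  /* native block size */
    int   vfs_count;  /* assumed here: number of active users of this mount */
};

/* Hypothetical per-mount private data. */
struct demofs_mountinfo {
    int blocks_total;
};

/* Fill in the vfs handed down by the upper layer; 0 on success, errno on failure. */
int demofs_mount(struct demo_vfs *vfsp)
{
    struct demofs_mountinfo *mi;

    if (vfsp->vfs_data != NULL)
        return EBUSY;             /* already mounted */
    mi = calloc(1, sizeof(*mi));
    if (mi == NULL)
        return ENOMEM;
    mi->blocks_total = 1024;      /* placeholder value */
    vfsp->vfs_data = mi;
    vfsp->vfs_bsize = 4096;       /* placeholder block size */
    return 0;
}

/* Release the private data unless the mount is still in use. */
int demofs_unmount(struct demo_vfs *vfsp)
{
    if (vfsp->vfs_data == NULL)
        return EINVAL;            /* nothing mounted here */
    if (vfsp->vfs_count > 0)
        return EBUSY;             /* objects are still referenced */
    free(vfsp->vfs_data);
    vfsp->vfs_data = NULL;
    return 0;
}
```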

The vfs_sync routine, shown in Listing 6 below, is called to sync the in-memory file system data with its backing store.


Listing 6. Sync operations
#if defined(__64BIT_KERNEL) || defined(__FULL_PROTO)
  int (*vfs_sync)(struct gfs *);
#else
  int (*vfs_sync)();
#endif

int (*vfs_syncvfs)(struct gfs *, struct vfs *, int, struct ucred *);

vfs_sync
This interface is passed the struct gfs that describes the file system type and not the struct vfs, as in other virtual file system operations. The sync routine is called once per file system type and not for each vfs instance.
vfs_syncvfs
vfs_syncvfs is used to sync a specific mount instance, unlike vfs_sync, which is called once for all mounted instances of that particular virtual file system type. The vfs_syncvfs routine can be called with several options that control the sync operation.

Listing 7. Quota and Access Control Lists management
int (*vfs_quotactl)(struct vfs *, int, uid_t, caddr_t, struct ucred *);

int (*vfs_aclxcntl)(struct vfs *, struct vnode *, int, struct uio *, size_t *, 
                    struct ucred *);

vfs_quotactl
The logical file system layer invokes the vfs_quotactl interface to perform quota-related control operations of the file system.
vfs_aclxcntl
vfs_aclxcntl is invoked to perform various Access Control List (ACL)-specific control operations on the file system. If a file system has quota and ACL support, then it needs to adhere to the framework defined for the AIX operating system.

The routines return 0 on a successful operation. If the operations are not supported by the underlying file system, the routines should return EINVAL.


Listing 8. Other control operations
int (*vfs_root)(struct vfs *, struct vnode **, struct ucred *);

int (*vfs_statfs)(struct vfs *, struct statfs *, struct ucred *);

int (*vfs_vget)(struct vfs *, struct vnode **, struct fileid *, struct ucred *);

int (*vfs_cntl)(struct vfs *, int, caddr_t, size_t, struct ucred *);    

vfs_root
vfs_root is used to get the root vnode of the file system. The routine returns a pointer to the root vnode in the struct vnode ** parameter.
vfs_statfs
vfs_statfs is used to get the file system characteristics. On a successful completion, the struct statfs is filled with the appropriate file system characteristics. Table 1 describes the various file system characteristics.

Table 1. File system characteristics
Field        Description
f_blocks     Specifies the number of blocks
f_files      Specifies the total number of file system objects
f_bsize      Specifies the file system block size
f_bfree      Specifies the number of free blocks
f_ffree      Specifies the number of free file system objects
f_fname      Specifies a 32-byte string indicating the file system name
f_fpack      Specifies a 32-byte string indicating a pack ID
f_name_max   Specifies the maximum length of an object name

vfs_vget
vfs_vget is used to get the vnode of an object in the file system identified by the virtual file system and fileid. The fileid is created during a call to the vn_fid vnode operation. If a vnode exists in the virtual file system layer for the corresponding object, the reference count is incremented and the vnode is returned. Otherwise, a new vnode referencing the object is created through the vn_get kernel service and returned after its count is set to 1. The fileid parameter at any given instance is used to uniquely identify an object in the file system.
vfs_cntl
vfs_cntl is used to implement miscellaneous user-specified control operations. The logical file system invokes the vfs_cntl routine with a control ID and the necessary arguments. The vfs_cntl is invoked by the fscntl subroutine. There can be a maximum of 32768 control operations specified by the user.
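
The characteristics listed in Table 1 could be filled in along the lines of the sketch below. The struct demo_statfs type carries only the fields from Table 1, struct demofs_super is a hypothetical superblock, and the block size and name length are placeholder values.

```c
#include <string.h>

/* Reduced stand-in holding only the fields from Table 1. */
struct demo_statfs {
    long f_blocks;      /* number of blocks */
    long f_bfree;       /* number of free blocks */
    long f_files;       /* total number of file system objects */
    long f_ffree;       /* number of free file system objects */
    long f_bsize;       /* file system block size */
    char f_fname[32];   /* file system name */
    char f_fpack[32];   /* pack ID */
    long f_name_max;    /* maximum length of an object name */
};

/* Hypothetical in-memory superblock of the file system. */
struct demofs_super {
    long nblocks;
    long nfree;
    long ninodes;
    long nifree;
};

/* Report the characteristics of the file system; returns 0 on success. */
int demofs_statfs(const struct demofs_super *sb, struct demo_statfs *out)
{
    memset(out, 0, sizeof(*out));
    out->f_blocks = sb->nblocks;
    out->f_bfree = sb->nfree;
    out->f_files = sb->ninodes;
    out->f_ffree = sb->nifree;
    out->f_bsize = 4096;       /* placeholder block size */
    out->f_name_max = 255;     /* placeholder name limit */
    strncpy(out->f_fname, "demofs", sizeof(out->f_fname) - 1);
    return 0;
}
```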

The vnode operations

Like the virtual file system operations, the AIX file system framework exports a set of interfaces to do the various file system operations, such as read, write, stat, and so on. These interfaces are the vnode operations. The AIX Version 5.3 kernel exports 56 interfaces in all for the various file system operations. It is not necessary to support all of them; it depends on what purpose your file system driver serves. Some of the interfaces are provided purely for backward compatibility with older versions of the AIX operating system, while others are callout interfaces to aid the virtual memory manager in paging operations. The rest of the routines provide the backbone support for the various UNIX file access semantics.


Listing 9. Creating, naming, and deletion operations
int (*vn_link)(struct vnode *dvp, struct vnode *vp, char *name, struct ucred *cred);

int (*vn_mkdir)(struct vnode *dvp, char *name, int32long64_t mode, struct ucred *cred);

int (*vn_mknod)(struct vnode *dvp, caddr_t name, int32long64_t mode, dev_t dev,
                struct ucred *cred);
                
int (*vn_remove)(struct vnode *vp, struct vnode *dvp, char *name, struct ucred *cred);

int (*vn_rename)(struct vnode *srcVp, struct vnode *srcDvp, caddr_t oldName,
                 struct vnode *destVp, struct vnode *destDvp, caddr_t newName,
                 struct ucred *cred);
                 
int (*vn_rmdir)(struct vnode *vp, struct vnode *dvp, char *name, struct ucred *cred);

The routines are used to create, name, and delete objects in the underlying file system. All the routines are passed at least the vnode of the parent directory in which the operation for the child object is to be performed. The logical file system layer ensures that the parent directory vnode does not lie in a read-only file system. The entry points are:

vn_link
vn_link is invoked to create a new hard link to an existing object as part of the link subroutine's job. The routine creates a hard link in the dvp directory with the name name for the vnode vp. The logical file system ensures that the objects referred to by the dvp and vp parameters reside in the same virtual file system.
vn_mkdir
vn_mkdir is used to create a new named directory under the directory specified by the vnode dvp. The directory mode permissions are passed as an int32long64_t variable and the name of the new directory is passed in the name parameter.
vn_mknod
vn_mknod is used to create a new file with name name in the dvp directory. The mode carries the type of the file (regular, special files) and the file access permissions as a combination of bit masks. In the case of special files (such as device files), the dev parameter carries the device number.
vn_remove
vn_remove is invoked by the logical file system layer on behalf of the unlink subroutine. This interface is used to remove a directory entry or a link specified by the vnode vp lying under the dvp directory. It is required that the user calls the vn_rele function to release a reference to the vnode. If this is the last reference to the object, the physical disk resources occupied by the file are released.
vn_rename
vn_rename is called by the logical file system to rename a file or a directory. The source object (oldName) lying under the srcDvp directory is referred to by the vnode srcVp. The new name is specified by the newName parameter and the new destination directory is specified by destDvp. In case an object of the same name already exists in the destination path, the object's vnode is passed in destVp.
vn_rmdir
vn_rmdir is used to remove a directory entry specified by the vnode vp lying in the directory dvp. The logical file system ensures that the vnode to be removed is a directory and not the current directory or the root directory. To remove a directory, the directory should be empty and should not have any child objects.
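
The mode argument of vn_mknod combines a file type and permission bits. A small sketch of how an implementation might separate the two follows, using the standard S_IS* macros from <sys/stat.h>. The demo_kind enumeration and both helper names are hypothetical and only show the decoding step.

```c
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical classification used by this sketch. */
enum demo_kind { DEMO_REGULAR, DEMO_DEVICE, DEMO_FIFO, DEMO_UNSUPPORTED };

/* Classify the file type bits of a mknod-style mode word. */
enum demo_kind demo_mknod_kind(mode_t mode)
{
    if (S_ISREG(mode))
        return DEMO_REGULAR;
    if (S_ISCHR(mode) || S_ISBLK(mode))
        return DEMO_DEVICE;       /* the dev parameter would carry the device number */
    if (S_ISFIFO(mode))
        return DEMO_FIFO;
    return DEMO_UNSUPPORTED;      /* for example, directories go through vn_mkdir */
}

/* Extract the permission bits (including setuid, setgid, and sticky). */
mode_t demo_mknod_perms(mode_t mode)
{
    return mode & 07777;
}
```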

Listing 10. Lookup and filehandle operations
int (*vn_lookup)(struct vnode *dvp, struct vnode **vpp, char *name,
                 int32long64_t vflag, struct vattr *attr, struct ucred *cred);
                 
int (*vn_fid)(struct vnode *vp, struct fileid *fidp, struct ucred *cred);

vn_lookup
The logical file system uses this entry point to look up the file with name name under the dvp directory. If found, the vnode of the file looked up is returned in vpp.

Every lookup of a file results in the reference count being incremented. So, you should ensure that a successful lookup operation is followed by a close that will decrement the reference count. If the attr field is not null, the routine should also return the file attributes in the attr parameter.

vn_fid
vn_fid is invoked to build a file identifier for the vnode vp. This file identifier is used by the vfs_vget routine to get the vnode of the same object for which the file identifier was constructed in the first place. Hence, the file identifier should have enough information to successfully identify the right object.
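
The round trip between building a file identifier and resolving it again can be sketched as below. The layout of struct demo_fileid (an inode number plus a generation number) is an assumption made for illustration, as are the demo_table array and the choice of ESTALE for identifiers that no longer match.

```c
#include <errno.h>
#include <stddef.h>

#define DEMO_NINODES 4

/* Hypothetical file identifier: enough to find the object again. */
struct demo_fileid {
    unsigned int fid_ino;   /* inode number */
    unsigned int fid_gen;   /* generation, changes when the inode is reused */
};

/* Reduced stand-in for a vnode. */
struct demo_vnode {
    unsigned int  ino;
    unsigned int  gen;
    unsigned long v_count;  /* reference count */
};

/* Hypothetical table of vnodes, indexed by inode number. */
struct demo_vnode demo_table[DEMO_NINODES];

/* Build a file identifier for vp. */
int demo_fid(const struct demo_vnode *vp, struct demo_fileid *fidp)
{
    fidp->fid_ino = vp->ino;
    fidp->fid_gen = vp->gen;
    return 0;
}

/* Resolve a file identifier back to its vnode and take a reference. */
int demo_vget(struct demo_vnode **vpp, const struct demo_fileid *fidp)
{
    struct demo_vnode *vp;

    *vpp = NULL;
    if (fidp->fid_ino >= DEMO_NINODES)
        return ESTALE;
    vp = &demo_table[fidp->fid_ino];
    if (vp->gen != fidp->fid_gen)
        return ESTALE;      /* identifier refers to an older object */
    vp->v_count++;
    *vpp = vp;
    return 0;
}
```

Including a generation number is one common way to keep an old identifier from resolving to a different object that later reuses the same inode.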

Listing 11. File access operations
int (*vn_open)(struct vnode *vp, int32long64_t flag, ext_t dev, caddr_t * vinfo,
               struct ucred *cred);
               
int (*vn_create)(struct vnode *dvp, struct vnode **vpp, int32long64_t flags,
                 caddr_t name, int32long64_t mode, caddr_t *vinfo,
                 struct ucred *cred);
                 
int (*vn_hold)(struct vnode *vp);

int (*vn_rele)(struct vnode *vp);

int (*vn_close)(struct vnode *vp, int32long64_t flag, caddr_t vinfo,
                struct ucred *cred);
                
int (*vn_map)(struct vnode *vp, caddr_t addr, uint32long64_t length,
              uint32long64_t offset, uint32long64_t flags, struct ucred *cred);
              
int (*vn_unmap)(struct vnode *vp, int32long64_t flag, struct ucred *cred);

vn_open
An entry point used to open a file with the file open parameters specified in the flag parameter. The vnode of the file to be opened is passed in the vp parameter.

Typically, the vnode is created during the file lookup operation or the file create operation (if open is called with the O_CREAT flag). It is up to you to decide what needs to be done during a vn_open call. A successful open should result in the reference count being incremented. In case of opens for a device file, the dev parameter has the device-specific information.

vn_create
This routine is called to create a regular file with the name name in the dvp directory. The new file's access mode is passed in the mode parameter. The flags parameter carries the open flag options for the succeeding open call. On a successful create, a new vnode entry is created in the virtual file system and the reference count is set to 1. The newly created vnode is returned in the vpp parameter.
vn_hold / vn_rele
The vn_hold routine is used to increase the reference count of the vnode. This is to ensure that the vnode is not deleted accidentally underneath the caller without the caller's knowledge. All calls to vn_hold should promptly be followed by a vn_rele after completing the job to ensure that the reference count is decremented.
vn_close
vn_close is an entry point called by the close subroutine to close the vnode vp. This routine is called only when the last reference on the vnode goes away. No further operations can be performed on the vnode once a vn_close has been called for the vnode.
vn_map
vn_map is a routine used to validate file mapping requests resulting from an mmap call or shmat call for the object referred to by the vnode vp. The addr parameter specifies the address in the requesting process's address space where the object is to be mapped. The length, offset, and flags define the length of the mapping, the offset within the file from which the file is to be mapped, and a bit mask of flags that defines the type of mapping, respectively.

It is expected that the underlying file system store the gn_seg field of the gnode for the file that is mapped. The logical file system layer creates the virtual memory object if one doesn't already exist and increments the count.

vn_unmap
The vn_unmap entry is called to unmap a file that has already been mapped to the process's address space. The vnode vp specifies the target file that needs to be unmapped. The flag parameter specifies the bit mask of flags that defines the type of mapping.

The file system implementation is required to perform only file system-specific operations on both the vn_map and vn_unmap calls. The logical file system layer takes the responsibility of handling the virtual memory operations.


Listing 12. Attribute manipulation operations
int (*vn_access)(struct vnode *vp, int32long64_t mode, int32long64_t who,
                 struct ucred *cred);
                 
int (*vn_getattr)(struct vnode *vp, struct vattr *attr, struct ucred *cred);

int (*vn_setattr)(struct vnode *vp, int32long64_t cmd, int32long64_t arg1,
                  int32long64_t arg2, int32long64_t arg3, struct ucred *cred);

vn_access
vn_access is an entry point used by the logical file system to validate access to the vnode vp. This entry point is used to implement the access subroutine and to check the permissions, such as read, write, execute, and so on.

The who parameter specifies the user for whom the check needs to be done. The mode parameter specifies the type of check that needs to be done.

vn_getattr
vn_getattr is an entry point used to get the various file attributes (specified in the vattr structure) for the given vnode. This entry point supports the stat, fstat, and lstat subroutines.
vn_setattr
vn_setattr is an entry point used to set the various file attributes (specified in the vattr structure) for the vnode vp. The values of the arg1, arg2, and arg3 arguments depend on the cmd parameter.

Table 2 specifies interpretation of these arguments based on the command value. This entry point supports the chmod, chownx, and utime subroutines.


Table 2. Possible command values for vn_setattr
Command   V_OWN        V_UTIME                V_MODE
arg 1     int flag;    int flag;              int mode;
arg 2     int uid;     timestruct_t *atime;   Unused
arg 3     int gid;     timestruct_t *mtime;   Unused
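
The way the three arguments are reinterpreted per command, as laid out in Table 2, can be sketched with a simple switch. The DEMO_V_* constants use placeholder values rather than the real V_MODE, V_OWN, and V_UTIME definitions, struct demo_time stands in for timestruct_t, and the flag argument is ignored in this example.

```c
#include <errno.h>
#include <stdint.h>

/* Placeholder command values; the real ones come from the AIX headers. */
enum { DEMO_V_MODE = 1, DEMO_V_OWN, DEMO_V_UTIME };

/* Stand-in for timestruct_t. */
struct demo_time {
    long sec;
    long nsec;
};

/* Hypothetical attribute store for one file. */
struct demo_attr {
    int mode;
    int uid;
    int gid;
    struct demo_time atime;
    struct demo_time mtime;
};

/* Interpret arg1..arg3 according to cmd, following Table 2. */
int demo_setattr(struct demo_attr *a, int cmd,
                 intptr_t arg1, intptr_t arg2, intptr_t arg3)
{
    switch (cmd) {
    case DEMO_V_MODE:             /* arg1 = mode, arg2 and arg3 unused */
        a->mode = (int)arg1;
        return 0;
    case DEMO_V_OWN:              /* arg1 = flag, arg2 = uid, arg3 = gid */
        a->uid = (int)arg2;
        a->gid = (int)arg3;
        return 0;
    case DEMO_V_UTIME:            /* arg1 = flag, arg2 = atime, arg3 = mtime */
        a->atime = *(const struct demo_time *)arg2;
        a->mtime = *(const struct demo_time *)arg3;
        return 0;
    default:
        return EINVAL;
    }
}
```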

Listing 13. Data update operations
int (*vn_fclear)(struct vnode *vp, int32long64_t flags, offset_t offset, offset_t len,
                 caddr_t vinfo, struct ucred *cred);
                 
int (*vn_fsync)(struct vnode *vp, int32long64_t flags, int32long64_t fd,
                struct ucred *cred);
                
int (*vn_ftrunc)(struct vnode *vp, int32long64_t flags, offset_t length, caddr_t vinfo,
                 struct ucred *cred);
                 
int (*vn_rdwr)(struct vnode *vp, enum uio_rw op, int32long64_t flag,
               struct uio *uiop, caddr_t dev, struct vattr *attr,
               struct ucred *cred);
               
int (*vn_lockctl)(struct vnode *vp, offset_t offset, struct eflock *lckdata,
                  int32long64_t cmd, int (*retry_fcn)(), ulong *retry_id,
                  struct ucred *cred);   

vn_fclear
vn_fclear is an entry point used to clear a portion of the file and release whole free blocks back to the underlying file system. The offset parameter defines the offset at which the clearing should start. The len parameter specifies the number of bytes that need to be cleared. The flag with which the file was opened is passed in the flags parameter. The logical file system updates the file size to reflect the number of bytes cleared.
vn_fsync
vn_fsync is a routine called to request the file system to flush all modified data back to the backing store for the given vnode vp. This call must be completed synchronously so that the caller can be assured that all I/O has completed successfully. The flags parameter specifies the various sync options. The different sync options are available in the fcntl.h header file.
vn_ftrunc
vn_ftrunc is an entry point used to truncate the file specified by the vnode vp. The length parameter indicates the size of the file after the truncation operation is complete. If the new length is less than the previous length, the data between the two is removed. If the new length is greater than the existing length, zeros are added to extend the file size.

When truncation is complete, whole blocks are returned to the file system and the file size is updated. The operation fails if the range to be truncated is locked.

vn_rdwr
vn_rdwr is one of the most commonly supported vnode operations. vn_rdwr performs the file IO operations for the file specified by the vnode vp. The op parameter indicates whether the request is for a read operation (UIO_READ) or a write operation (UIO_WRITE). The uiop parameter carries the user IO data structure that describes a memory buffer to be used in a data transfer. The dev parameter carries the device-specific information if the IO is directed to special files. If the attr parameter is not NULL, then the attributes of the file should be passed back in the attr parameter.
vn_lockctl
vn_lockctl is an entry point used to control record-based locking for the file targeted by the vnode vp. The struct eflock (lckdata) defines the locking information to do the necessary record locking. The cmd parameter defines the type of lock operation that needs to be performed.

An optional retry function can be passed in the retry_fcn parameter that can be used to retry locks if the lock is not granted immediately. The retry_id stores the value that correlates a retry operation with a specific set of locks. The locking implementation calls the operating system-provided common_reclock() routine, which hides the complexities of locking operations from the end user.
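
To illustrate how a single vn_rdwr-style entry point can serve both directions, here is a sketch that dispatches on the operation against a small in-memory file. The struct demo_uio type is a heavily reduced stand-in for struct uio with a single buffer, the 64-byte struct demo_file is hypothetical, and the file is assumed to start zero-filled.

```c
#include <errno.h>
#include <string.h>

enum demo_uio_rw { DEMO_UIO_READ, DEMO_UIO_WRITE };

/* Reduced stand-in for struct uio: a single buffer. */
struct demo_uio {
    char  *buf;     /* current position in the caller's buffer */
    size_t resid;   /* bytes still to be transferred */
    size_t offset;  /* current offset within the file */
};

/* Hypothetical in-memory file, assumed to be zero-initialized. */
struct demo_file {
    char   data[64];
    size_t size;
};

/* Transfer data between the file and the buffer described by uiop. */
int demo_rdwr(struct demo_file *fp, enum demo_uio_rw op, struct demo_uio *uiop)
{
    size_t n = uiop->resid;

    if (op == DEMO_UIO_READ) {
        if (uiop->offset >= fp->size)
            return 0;                          /* at end of file: nothing moved */
        if (n > fp->size - uiop->offset)
            n = fp->size - uiop->offset;       /* short read at end of file */
        memcpy(uiop->buf, fp->data + uiop->offset, n);
    } else if (op == DEMO_UIO_WRITE) {
        if (uiop->offset > sizeof(fp->data) ||
            n > sizeof(fp->data) - uiop->offset)
            return EFBIG;                      /* would exceed the file's capacity */
        memcpy(fp->data + uiop->offset, uiop->buf, n);
        if (uiop->offset + n > fp->size)
            fp->size = uiop->offset + n;       /* the write extended the file */
    } else {
        return EINVAL;
    }
    uiop->buf += n;
    uiop->offset += n;
    uiop->resid -= n;
    return 0;
}
```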


Listing 14. Extension operations
int (*vn_ioctl)(struct vnode *vp, int32long64_t cmd, caddr_t arg, size_t flags,
                ext_t dev, struct ucred *cred);
                
int (*vn_readlink)(struct vnode *vp, struct uio *uio, struct ucred *cred);

int (*vn_select)(struct vnode *vp, int32long64_t corel_id, ushort req_event, 
                 ushort *ret_event, void (*notify)(), caddr_t vinfo,
                 struct ucred *cred);
                 
int (*vn_symlink)(struct vnode *dvp, char *link, char *target, struct ucred *cred);

int (*vn_readdir)(struct vnode *vp, struct uio *uio, struct ucred *cred);

vn_ioctl
vn_ioctl is an entry point used by the logical file system to perform miscellaneous user-specified IO control operations on special files. If the file system has support for special files, the information is passed on to the intended device driver referred to by the vnode vp.

The cmd parameter identifies which IOCTL (IO Control) this routine has been called for. The arg parameter carries the necessary arguments.

vn_readlink
vn_readlink is an entry point used to read the contents of a symbolic link for the vnode vp. The logical file system takes the responsibility of locating the vnode corresponding to the link. This routine simply does the reading of the data blocks for the link.
vn_select
vn_select is an entry point invoked by the logical file system to poll the vnode vp to determine if it is immediately ready for I/O. It is used to implement the select and poll subroutines. File system implementation can support constructs, such as devices or pipes, that support the select semantics.

The fp_select kernel service provides more information about select and poll requests. The req_event specifies the events requested for polling, and notify is the callback routine that is called when the event occurs.

vn_symlink
vn_symlink is used to create a symbolic link for an object whose absolute path is specified in the target parameter. The new link is created with the name link in the dvp directory.
vn_readdir
vn_readdir is an interface used to read the directory entries of the directory referred to by the vnode vp. The entries are returned as struct dirent in the uio structure. The read starts at the first directory entry from the address specified in the uio_offset member of struct uio.

When the uio buffer is full, the uio_offset is updated with the starting address of the directory entry that was not pushed in the present buffer. The uiop->uio_resid field is updated with the number of bytes that have been read into the uio structure. The end of read operation is specified by an empty read into the uio structure.
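
The resumable nature of directory reads can be sketched as follows. This is a simplified model rather than the AIX behavior: entries are fixed-size struct demo_dirent records, the offset is treated as an entry index instead of an address, and resid is modeled as the space remaining in the caller's buffer, which is the usual convention for uio-style transfers and an assumption of this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size directory entry. */
struct demo_dirent {
    unsigned int d_ino;
    char         d_name[28];
};

/* Reduced stand-in for struct uio: a single buffer. */
struct demo_uio {
    char  *buf;     /* current position in the caller's buffer */
    size_t resid;   /* space remaining in the caller's buffer */
    size_t offset;  /* index of the next entry to return */
};

/* Copy entries starting at uiop->offset until the buffer is full or the
 * directory is exhausted. A call that copies nothing marks the end. */
int demo_readdir(const struct demo_dirent *dir, size_t nentries,
                 struct demo_uio *uiop)
{
    size_t idx = uiop->offset;

    while (idx < nentries && uiop->resid >= sizeof(struct demo_dirent)) {
        memcpy(uiop->buf, &dir[idx], sizeof(struct demo_dirent));
        uiop->buf += sizeof(struct demo_dirent);
        uiop->resid -= sizeof(struct demo_dirent);
        idx++;
    }
    uiop->offset = idx;   /* the next call resumes from here */
    return 0;
}
```

A caller would reset buf and resid before each call and stop once a call leaves resid unchanged.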


Listing 15. Buffer operations
int (*vn_strategy)(struct vnode *vp, struct buf *buf, struct ucred *cred);

vn_strategy
vn_strategy is the routine responsible for reading data from the block device. This entry point is intended to provide a block-oriented interface for servers for efficiency in paging. The vnode vp provides the file information for which the block read needs to be done. The buf parameter contains the struct buf that describes the buffer.

Listing 16. Security-related operations
int (*vn_revoke)(struct vnode *vp, int32long64_t cmd, int32long64_t flags,
                 struct vattr *attr, struct ucred *cred);
                 
int (*vn_getacl)(struct vnode *vp, struct uio *uio, struct ucred *cred);

int (*vn_setacl)(struct vnode *vp, struct uio *uio, struct ucred *cred);

int (*vn_getpcl)(struct vnode *vp, struct uio *uio, struct ucred *cred);

int (*vn_setpcl)(struct vnode *vp, struct uio *uio, struct ucred *cred);

int (*vn_seek)(struct vnode *vp, offset_t *offset, struct ucred *cred);

vn_revoke
vn_revoke is an entry point used to revoke all access to the object specified by the vnode vp. The cmd parameter, which indicates whether the calling process has the file open, can have the following values:
  • 0—The process did not have the file open.
  • 1—The process had the file open.
  • 2—The process had the file open and the reference count in the file structure was greater than one.
vn_getacl / vn_setacl
vn_getacl is an entry point used by the logical file system to retrieve the access control list (ACL) for a file in order to implement the getacl subroutine. vn_setacl, in turn, is used to set the ACL for the file. These routines provide the backbone support for the chacl, chown, chmod, and statacl subroutines.
vn_getpcl / vn_setpcl
vn_getpcl is an entry point used by the logical file system to retrieve the privilege control list (PCL) for a file in order to implement the getpcl subroutine. vn_setpcl is used to set the privilege control list for a file and supports the setpcl subroutine.
vn_seek
vn_seek is an entry point used to validate the offset of a seek operation. The vnode for which the seek operation is to be validated is passed in vp, and the offset is passed in offset. Typically, if the offset is not negative and is within the maximum length of the file, the routine returns EOK; otherwise, it returns EINVAL.

Listing 17. External pager callout operations
int (*pagerBackRange)(struct gnode *gnp, offset_t offset, caddr_t dest, 
                      size_t *nBytesOfRange, size_t *nBytesBacked, uint *flags);
                      
int64_t (*pagerGetFileSize)(struct gnode *gnp);

void (*pagerReadAhead)(struct gnode *gnp, vpn_t pFault, vpn_t * pFirst,
                       vpn_t *nPage, vpn_t *pTripWire, boolean_t tripWire);
                       
void (*pagerReadWriteBehind)(struct gnode *gnp, int64_t offset, int64_t length,
                             uint flags);
                             
void (*pagerEndCopy)(struct gnode *gnp, offset_t offset, size_t nBytesMoved,
                     size_t nBytesBacked, uint flags);

AIX has a provision for external pager callout routines, which the virtual memory manager consults when paging files in or out for that particular file system.

pagerBackRange
pagerBackRange is a callback used to request that the file system back the in-core pages with storage in the backing physical store. A call to pagerBackRange is followed by a call to the pagerEndCopy callback.
pagerEndCopy
pagerEndCopy performs the post-processing after the copy completes.
pagerReadAhead, pagerReadWriteBehind
pagerReadAhead and pagerReadWriteBehind are callbacks that let the virtual memory manager consult the file system about its read-ahead and write-behind policies.

All of these callout operations are optional. It is up to you to provide those that are needed.

In addition to these, vnodeops_t provides a number of interfaces called 421 extensions, which are provided for backward compatibility with AIX Version 4.2.1.

File system helpers and mount helpers

To enable support for multiple file systems, many of the file system commands do not process a request themselves. Instead, they collect the arguments passed to the command and send them on to file system-specific back-end programs that do the real processing. These back-end programs are known as file system helpers and mount helpers; a file system developer must provide them.

The helper programs are in /sbin/helpers/<vfs_type>, where the vfs_type matches the file system type for which the command is being invoked. The program name must match the name of the command being executed.
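As a rough sketch of how a front-end command might locate its helper, the function below builds the path from the file system type and the command name. The helper_path function and its validation rules are assumptions made for this illustration; the actual AIX commands may resolve helpers differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "/sbin/helpers/<vfs_type>/<command>" in out.
 * Returns 0 on success, or -1 if a name is empty, contains '/',
 * or the result does not fit in outlen bytes. */
int helper_path(char *out, size_t outlen,
                const char *vfs_type, const char *command)
{
    int n;

    assert(out != NULL && vfs_type != NULL && command != NULL);
    if (*vfs_type == '\0' || *command == '\0' ||
        strchr(vfs_type, '/') != NULL || strchr(command, '/') != NULL)
        return -1;

    n = snprintf(out, outlen, "/sbin/helpers/%s/%s", vfs_type, command);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Rejecting names that contain a slash keeps a malformed type or command name from escaping the helpers directory.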

The mount command is the front-end program for the mount helper provided by each file system. The back-end support programs provided for the mount and unmount commands are the mount helpers. The front-end mount program collects the parameters passed to it, then looks in the /etc/filesystems file to determine the virtual file system type of the target file system. It calls /sbin/helpers/<vfs_type>/mount with the collected parameters to process the command. Listing 18 shows a typical file system entry in the /etc/filesystems configuration file.


Listing 18. Sample entry in /etc/filesystems
/data:
    dev             = /dev/fslv00
    vfs             = jfs2
    log             = /dev/hd8
    mount           = true
    options         = rw

The vfs attribute determines the virtual file system type for the file system (<vfs_type>). The mount attribute determines the default mount behavior for this file system. It can have the following values:

  • Automatic: The file system is mounted automatically when the system is started.
  • False: The file system is not mounted by default.
  • Readonly: The file system is mounted as read only.
  • True: The file system is mounted by the mount all command and unmounted by the unmount all command.
  • Nodename: Used by the mount command to determine which node contains the remote file system. If this attribute is not present, then the mount is a local mount.

The options attribute specifies any additional options to be passed to the back-end mount processor. Both the mount and unmount commands take six parameters; the first four are common to both, and the last two are specific to each command.
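To make the stanza lookup concrete, here is a minimal sketch of how the vfs attribute could be read from text in the /etc/filesystems layout. The stanza_attr function is hypothetical and is not how the AIX mount command is implemented. It assumes that stanza headers start in column one and end with a colon, that attribute lines are indented, and that lines beginning with an asterisk are comments.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Look up attr inside the stanza named stanza (for example "/data").
 * Returns 0 and copies the value into out, or -1 if it is not found
 * or does not fit in outlen bytes. */
int stanza_attr(const char *text, const char *stanza, const char *attr,
                char *out, size_t outlen)
{
    int in_stanza = 0;
    size_t slen, alen;

    assert(text && stanza && attr && out && outlen > 0);
    slen = strlen(stanza);
    alen = strlen(attr);

    while (*text != '\0') {
        const char *eol = strchr(text, '\n');
        size_t len = eol ? (size_t)(eol - text) : strlen(text);
        const char *p = text, *end = text + len;

        if (len > 0 && *p == '*') {
            /* Comment line (assumed convention): ignore it. */
        } else if (len > 0 && !isspace((unsigned char)*p)) {
            /* Stanza header of the form "<name>:" in column one. */
            in_stanza = (len == slen + 1 && p[slen] == ':' &&
                         strncmp(p, stanza, slen) == 0);
        } else if (in_stanza) {
            while (p < end && isspace((unsigned char)*p))
                p++;
            if ((size_t)(end - p) > alen && strncmp(p, attr, alen) == 0) {
                p += alen;
                while (p < end && isspace((unsigned char)*p))
                    p++;
                if (p < end && *p == '=') {
                    p++;
                    while (p < end && isspace((unsigned char)*p))
                        p++;
                    while (end > p && isspace((unsigned char)end[-1]))
                        end--;
                    if ((size_t)(end - p) >= outlen)
                        return -1;
                    memcpy(out, p, (size_t)(end - p));
                    out[end - p] = '\0';
                    return 0;
                }
            }
        }
        text = eol ? eol + 1 : text + len;
    }
    return -1;
}
```

Given the entry in Listing 18, looking up vfs in the /data stanza would yield jfs2, which is the <vfs_type> used to select the helper directory.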

Building and configuring the file system

The file system component is built as a kernel extension in AIX. The kernel maintains a list of active file system types registered with it. For the AIX operating system to honor your file system, it must be registered with AIX. AIX provides two kernel services, gfsadd and gfsdel, for adding and deleting the file system.

Every file system should provide a configuration routine that can be called to configure the kernel extension as a file system. This routine should fill in the file system information in struct gfs and call gfsadd. The gfsadd kernel service uses the information in the struct gfs to register it in the global file system table and calls the initialization routine specified in gfs_init. The initialization routine does the rest of the file system initialization job.

To load the file system from a user perspective:

  • Have a user program or script call the sysconfig subroutine to load the kernel extension.
  • Call the sysconfig subroutine again to configure the kernel extension as a virtual file system by specifying the configuration routine.
  • The configuration routine calls the gfsadd kernel service to register the kernel extension as a recognized file system for AIX.

Once this is done, the file system becomes operational.
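The registration flow above can be simulated in user space to show its shape. Everything in this sketch is a stand-in: struct mock_gfs carries only a few illustrative fields rather than the real struct gfs layout, mock_gfsadd and mock_gfsdel imitate the gfsadd and gfsdel kernel services with assumed error codes, and myfs_config and myfs_init are hypothetical names for the configuration and initialization routines.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MOCK_MAX_GFS 8   /* hypothetical size of the file system table */

/* Reduced stand-in for struct gfs. */
struct mock_gfs {
    int         gfs_type;                     /* file system type number */
    const char *gfs_name;                     /* file system name */
    int       (*gfs_init)(struct mock_gfs *); /* initialization routine */
};

struct mock_gfs *gfs_table[MOCK_MAX_GFS];     /* mock global table */
int init_calls;                               /* counts gfs_init calls */

/* Register the type if the slot is free, then run its init routine. */
int mock_gfsadd(int type, struct mock_gfs *gfsp)
{
    assert(gfsp != NULL);
    if (type < 0 || type >= MOCK_MAX_GFS)
        return EINVAL;                        /* assumed error code */
    if (gfs_table[type] != NULL)
        return EBUSY;                         /* assumed error code */
    gfs_table[type] = gfsp;
    return gfsp->gfs_init ? gfsp->gfs_init(gfsp) : 0;
}

/* Remove a registered type from the mock table. */
int mock_gfsdel(int type)
{
    if (type < 0 || type >= MOCK_MAX_GFS || gfs_table[type] == NULL)
        return ENOENT;                        /* assumed error code */
    gfs_table[type] = NULL;
    return 0;
}

/* Hypothetical initialization routine: does the remaining setup. */
int myfs_init(struct mock_gfs *gfsp)
{
    (void)gfsp;
    init_calls++;
    return 0;
}

/* Hypothetical configuration routine: fill in the gfs, then register it. */
int myfs_config(int type)
{
    static struct mock_gfs myfs;

    myfs.gfs_type = type;
    myfs.gfs_name = "myfs";
    myfs.gfs_init = myfs_init;
    return mock_gfsadd(type, &myfs);
}
```

In a real kernel extension, the configuration routine is the entry point that the second sysconfig call invokes, and the gfs structure must stay allocated for as long as the file system remains registered, which is why the sketch keeps it in static storage.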

Conclusion

File system development is one of the most challenging aspects of kernel development. By its nature, code for the data management in a file system should be portable and platform-independent. However, each operating system has its own I/O framework and interfaces that bridge your file system with the kernel. It is this framework that you should thoroughly understand in order to make your file system work on a particular operating system.

You should also have a basic understanding of the kernel design and the interfaces it exports for various support operations, such as memory allocation, data copy routines, and so on. There is no single library or standard that defines these support routines. As a general rule, designing generic interfaces for such common support routines and having platform-specific files defining these operations for each platform should make life easier. A well-designed file system should be easily portable, at least across the various UNIX flavors.



