
GPFS Internal and External mount states

Question & Answer


Question

When is GPFS file system 'mounted'? What is an 'internal mount'? What is a 'file system panic'?

Answer

Seemingly simple concepts of 'mount' and 'unmount' often generate confusion among GPFS users. How come my file system is shown as mounted, but I cannot access any files? I have just unmounted my file system, but mmlsmount shows that it is 'internally mounted' - why? I am trying to unmount my file system, and the darn thing just will not unmount - how can I force it? Can I use the mount command provided by the OS, or do I have to use mmmount? I do not use NFS, and yet I see error messages about 'Stale NFS file handles' - why?

At its core, GPFS is a Unix file system, and so it follows the standard Unix conventions when it comes to mount and unmount semantics. However, GPFS is orders of magnitude more complex than a typical local Unix file system, and that complexity changes the picture. For example, an ext3 file system is either mounted or not mounted; there is nothing in between. If the OS kernel is alive, a mounted ext3 file system is accessible, because all of the ext3 code lives in the OS kernel.

GPFS, being a cluster file system, is not blessed with such simple semantics. There are important things to worry about that take place outside the OS kernel. For instance, in order to provide any data access on a given node, a node quorum must be reached, and the current node must be an active member of the cluster. If quorum is lost, data access is impossible, no matter what the mount command output says. On the other hand, since GPFS is a distributed application, there are situations when a node that does not have the file system mounted needs to read internal data structures from disk, to allow the node to act as, for example, a file system manager or a token manager. This creates a dichotomy: the external mount, initiated through the OS mount command, is a system administrator-driven operation that cannot be easily reversed, but GPFS needs a way to manipulate file system accessibility in response to internal triggers. This situation creates the need for two distinct states: an external mount and an internal mount. The choice of terminology is, unfortunately, confusing, but it is well established at this point, and attempting to change it now would only breed more confusion.

An external mount is the traditional Unix mount: the file system is present in mount and df output, and the files belonging to the file system can be accessed under its mount point. An internal mount is not really a mount in the traditional sense. That is, if a file system is only internally mounted, it does not show up in mount or df output, and there is no way to access any of its files from that node. It is possible for a file system to be internally mounted without being externally mounted, for example if a node has been appointed a file system manager (as mmlsmgr would show) or a token manager (as mmdiag --tokenmgr would show). Such appointments are usually managed automatically, and there is rarely a need to manage them manually. Another way to think about an internal mount is to consider the file system to be 'open' by the main GPFS daemon process (mmfsd). Even though users cannot do any IO to this file system from this node without an external mount, GPFS itself may need to do IO, for example when replaying the log (journal) of a failed node during recovery.
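To make the distinction concrete, the hedged sketch below compares the two views from a single node; 'fs1' and '/gpfs/fs1' are example names, and the exact output format varies by GPFS release.

    mount | grep /gpfs/fs1    # an external mount shows up here and in df
    df /gpfs/fs1
    mmlsmount fs1 -L          # GPFS view: nodes that have the file system mounted,
                              # including nodes that only have it internally mounted
    mmlsmgr fs1               # which node is currently the file system manager
    mmdiag --tokenmgr         # token manager appointments, as seen from this node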

So what happens if a file system is externally mounted, but conditions conspire to make data access impossible (for example, quorum is lost due to a bad network link)? In GPFS lingo, the file system is panicked. A file system panic is also known as an internal unmount. When a panic occurs, the external mount persists, but all internal in-memory data structures are thrown out, and the file system is 'closed' by mmfsd. So what happens if you try to access a file system in that state? Any attempt to access the file system fails with the ESTALE error code. On more modern operating systems, the error code is translated into a 'Stale file handle' error message. On older operating systems, the corresponding error message string is 'Stale NFS file handle', owing to NFS pioneering the use of this error code. Note that the error message string is produced by the OS libraries, not GPFS, and there is nothing GPFS can do about an error message citing 'NFS' when there is no NFS in the picture. A file system panic can occur for a number of reasons: loss of quorum, a disk going down, an unrecoverable error encountered in the code, an mmfsd crash, or lack of free memory. The only way to tell is to examine the GPFS log; the offending events may even have taken place on another node in the cluster. The real cause of the file system panic is known to GPFS, and is shown in the GPFS log, but the interface between GPFS and the OS kernel only allows passing back an integer error code when an error is encountered. This naturally means that when multiple conditions can result in the same error code, a precise diagnosis is not possible on the basis of the error code alone. So a bare report of a 'Stale file handle' error, however exotic it may look to the person reporting it, will not be met with a torrent of useful advice; you need to supply additional detail about the problem when asking for help with this type of error. Again, a panicked file system remains externally mounted, so it shows up in mount and df output, although df reports 'Stale file handle' for each panicked file system.
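As a hedged illustration (the names are examples, and the log location assumes the default GPFS log directory), a panicked file system typically looks like this from the affected node:

    mount | grep /gpfs/fs1       # the external mount is still listed
    df /gpfs/fs1                 # reports 'Stale file handle' for a panicked file system
    grep -iE 'panic|quorum|unmount' /var/adm/ras/mmfs.log.latest
                                 # the GPFS log names the actual trigger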

What can be done about a panicked file system? That depends on the cause of the panic. If quorum is lost and then re-established, GPFS automatically performs a remount operation. During a remount, an internal mount is done, and the corresponding data structures are reconnected to the existing external mount point. The same happens if the GPFS daemon (mmfsd) crashes and then automatically restarts. This is generally useful, but in itself it can lead to confusion: a file system is fine for a long time, then suddenly everything fails with 'Stale file handle', and then it is fine again - what happened? Again, you have to look in the GPFS log (or syslog, on GPFS V4.1 and newer) to ascertain the cause of the transient loss of access. So if the file system is automatically remounted after a temporary disruption, and remains externally mounted throughout, and if the application does not do any IO until the file system is available again, everything is good, right? Unfortunately, no. During panic processing, all internal state is summarily thrown out. This includes all open file information, cache content, local locks and tokens. If a file remained open through a panic, it is no longer safe to access once a remount takes place, because GPFS cannot know what happened to that file while the file system was panicked. The file may not even be there anymore, because it got deleted on a different node. So any attempt to do IO using a cached open file descriptor after a panic fails with ESTALE; the file needs to be closed and re-opened to be safely accessed again. While many users would prefer semantics akin to an NFS hard mount, that would be very hard to do - unlike NFSv3, GPFS is very, very stateful.
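The 'cached file descriptor' pitfall can be reproduced from a shell; the sketch below uses an example file path:

    exec 3</gpfs/fs1/data.txt    # open the file and keep descriptor 3 across the panic
    # ... file system panics and is later remounted ...
    cat <&3                      # fails with 'Stale file handle': the old descriptor is no longer valid
    exec 3<&-                    # close the stale descriptor
    exec 3</gpfs/fs1/data.txt    # a fresh open succeeds once the remount has completed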

A mounted file system may eventually need to be unmounted. The unmount scenario is not as simple as it may seem. This is where the boundary between GPFS proper and the OS kernel gets very fuzzy, and things get even more complicated, especially on Linux. GPFS can manipulate its internal state any way it wants, but the same cannot be said about the data structures managed by the OS kernel. When it comes to unmount processing, GPFS code does have a role, but it really takes a back seat to the VFS layer. On Linux, the VFS layer has very particular semantics when it comes to handling an unmount request. If a given file system appears to be 'in use' (according to a kernel-maintained 'use count'), the unmount fails with the EBUSY ('Device is busy') error before the actual file system code is even notified. Several different things can cause a file system to be 'in use': an open file, a process using a directory in the file system as its current working directory, or an NFS export, among others. Standard Unix reporting tools like lsof and fuser can report some of those conditions, but not necessarily all. Walking the /proc/<pid>/fd subdirectories can help identify the users of a given file system, but that too does not necessarily produce an exhaustive list (see the sketch below). All in all, figuring out why a given file system will not unmount can be pretty hard on Linux. This is a Linux architecture feature, not something unique to GPFS. The corollary is that there is no way to 'fix' GPFS to allow a true forced unmount on Linux. On AIX, a forced unmount does work, although it does not have the exact same effect as a regular unmount (see below).
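In practice, hunting down the users of a busy file system on Linux looks roughly like the sketch below; '/gpfs/fs1' is an example mount point, and none of these commands is guaranteed to produce a complete list:

    umount /gpfs/fs1                 # may fail with 'target is busy' (EBUSY)
    lsof /gpfs/fs1                   # open files and working directories on the mount
    fuser -vm /gpfs/fs1              # processes using the mount, by another route
    find /proc/[0-9]*/fd /proc/[0-9]*/cwd -lname '/gpfs/fs1*' 2>/dev/null
                                     # walk /proc for descriptors and working directories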

So why would you want to unmount a file system in the first place? There are many possibilities, but the two most common are: (1) upgrading GPFS code, and (2) running an offline mmfsck. The former is the simpler scenario: the file system only needs to be unmounted externally, and only on a single node. You need to perform a clean external unmount in order to unload the GPFS kernel modules, and thus avoid the need to reboot the node. Again, those are not GPFS-specific semantics; that is how kernel module management works on Unix (on Windows, kernel modules cannot be unloaded, period, so a reboot is needed to upgrade GPFS). Note that a forced unmount on AIX does not help the matter: even though the mount point is not visible after a forced unmount, the kernel does not allow the GPFS kernel module to be unloaded if it is still referenced somehow. So you either need to track down all file system users and take care of them (kill off some processes, cd out of a given directory from a shell session), or resort to rebooting the node. So what should you do if the file system is successfully unmounted externally, but is shown as being internally mounted by mmlsmount? Nothing. When you shut down GPFS for the upgrade, whatever node role was causing the need for the internal mount fails over to another node. Internal mounts do not affect your ability to unload the kernel modules and upgrade GPFS without a reboot.

The offline mmfsck case is more complicated, because the subject file system must be unmounted on all nodes, and here internal mounts do count. However, it is important to understand where the internal mounts came from in the first place. If a node is acting as a token server or a file system manager, the need for those services is driven by external mounts. If the file system is externally unmounted on all nodes, the internal mounts go away too, although this takes some time if there is recovery in progress. Still, there is no need to fight internal mounts themselves; you just need to take care of all external mounts.
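For reference, here is a hedged sketch of the two scenarios, using 'fs1' as an example file system name:

    # (1) Upgrade GPFS on one node: clean external unmount on that node, then stop GPFS
    #     so the kernel modules can be unloaded; internal-mount roles fail over on shutdown.
    mmumount fs1                 # or: umount /gpfs/fs1
    mmshutdown
    # (2) Offline mmfsck: the file system must be unmounted everywhere first.
    mmumount fs1 -a              # external unmount on all nodes
    mmlsmount fs1 -L             # wait until no mounts, internal or external, remain
    mmfsck fs1                   # offline check once all mounts are gone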

As with most rules, there are exceptions. For example, the SoNAS and Storwize V7000 Unified offerings, which use GPFS internally, present an unmount challenge. In order to externally unmount a file system on those systems, you need to shut down various file export services (for example, NFS and Samba). With the file-exporting daemons gone, external unmounts should succeed. However, you may still have a hard time starting mmfsck, because the file system is internally mounted on one of the nodes. Those internal mounts are often caused by a constant trickle of GPFS management commands (mmlsfs, mmlsfileset, mmlssnapshot, to name a few) issued frequently by the monitoring framework running on the SoNAS/V7000U platforms. So you need to shut down the management process to quiet things down enough to allow mmfsck to start.

GPFS comes with a number of tools to make life simpler for system administrators. For mount management, the mmmount and mmumount commands are provided. Those commands offer a level of cluster awareness, and try to hide some of the sharp edges. For instance, you can mount a file system on all nodes in the cluster using mmmount -a. If a file system is externally mounted but panicked, you can attempt a manual remount operation using mmmount. This can be useful when the conditions that led to the panic have changed, but GPFS is not aware of the change. For example, if GPFS accesses disks through a SAN, and the SAN goes down, GPFS loses access to the disks, which eventually leads to a file system panic. GPFS does not automatically try to rediscover whether the disks are back, because that is potentially very expensive (when there is a disruption in a storage networking fabric, IO timeouts can run into many minutes). When the SAN is known to be healthy again, you can simply issue mmmount; if an external mount is already there, GPFS attempts an internal mount, and if that succeeds, a remount operation is performed. A word of caution: mmmount and mmumount consciously try to insulate users from a flood of annoying error messages (for example, 'File system is already mounted'), and do not always show all error messages. When troubleshooting, it is a good idea to use the OS mount and umount commands, the same way you would with any other file system. In general, you do not have to use mmmount and mmumount; they are provided for added convenience, and it is perfectly OK to use the OS commands for GPFS mount management.
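A brief sketch of the above, with 'fs1' as an example device name (exact messages differ across releases):

    mmmount fs1        # with an external mount already in place, this attempts an
                       # internal mount and, if that succeeds, performs a remount
    # When troubleshooting, the OS commands report errors that mmmount/mmumount may suppress:
    mount -t gpfs      # list GPFS mounts as the OS sees them (Linux)
    umount /gpfs/fs1   # failures (for example, EBUSY) are reported verbatim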

If the OS mount command is used on a GPFS file system, is there flexibility in what arguments the command can take? For example, can a GPFS file system be mounted under a different mount point on different nodes? Yes. Again, GPFS is a Unix file system, and thus it supports the standard Unix mount semantics. By default, the mount command uses the mount options and the mount point specified in the global configuration file (/etc/fstab on Linux, /etc/filesystems on AIX). However, if desired, you can manually craft a mount command with all parameters specified on the command line. It is critical to pass the 'dev=devicename' option to the mount command, but otherwise the mount options and mount point can be varied. Since providing a global namespace is one of the main benefits of GPFS, mounting a file system under a different mount point on different nodes rarely makes sense, but it is possible. Note that a custom mount configuration of this kind is external to GPFS, and will not be restored automatically by GPFS if a node reboots. GPFS manages its own entries in /etc/fstab and /etc/filesystems, so a manual edit to the GPFS entries in those files will be overwritten when local GPFS configuration data is refreshed (for example, when GPFS restarts). If a custom mount configuration is desired, it has to be established through other means.
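A hedged example of such a manual mount on Linux follows; the device and mount point names are illustrative, and the exact option syntax may vary between GPFS releases, but the key point from the text holds: the dev=devicename option must be passed.

    mkdir -p /gpfs/alternate
    mount -t gpfs /dev/fs1 /gpfs/alternate -o rw,dev=fs1
    # Note: GPFS does not record this mount, so it will not be re-established
    # automatically after a reboot or a configuration refresh.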

[{"Product":{"code":"SSFKCN","label":"General Parallel File System"},"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Component":"--","Platform":[{"code":"PF002","label":"AIX"},{"code":"PF016","label":"Linux"},{"code":"","label":"System x"},{"code":"PF033","label":"Windows"}],"Version":"3.5.0;3.4.0;3.3.0;3.2.1;3.1.0;4.1.0","Edition":"","Line of Business":{"code":"","label":""}}]

Document Information

Modified date:
25 June 2021

UID

isg3T1021965