A hidden and dangerous side effect of tuning j2_inodeCacheSize and the memory DR operation

In this article, we reveal a hidden and dangerous side effect of tuning j2_inodeCacheSize and the memory dynamic reconfiguration (DR) operation. This harmful side effect might cause serious consequences including inode cache exhaustion, file system corruption, and system failure. We demonstrate and explain how it occurs and how to prevent it. We also introduce some methods to detect and fix such problems.


Jian Jun Wu (bjwjianj@cn.ibm.com), Advisory Software Engineer, IBM China

Jian Jun Wu is a software engineer who has worked at IBM for the past five years, focusing on IBM AIX development support. He holds a PhD in Computer Science from Zhejiang University in China.



24 June 2013


Introduction

An appropriately sized enhanced journaled file system (JFS2) inode cache is critical to the performance and stability of an IBM® AIX® system. The j2_inodeCacheSize tunable is frequently adjusted to control the maximum memory usage of the inode cache. The inode cache size can also be changed by a memory dynamic reconfiguration (DR) operation. In AIX 6.1 (later than 6100-04) and AIX 7.1, there is a hidden side effect: after j2_inodeCacheSize is tuned or a dynamic logical partition (DLPAR) memory operation is performed, the maximum pile size of an inode cache class can only decrease. In this article, we demonstrate how this can lead to inode cache exhaustion, and we introduce some methods to handle such problems.


JFS2 inode cache

The inode is the fundamental structure of JFS2. Every inode has a 512-byte on-disk structure. When inodes are worked on in memory, JFS2 tracks more than just the on-disk fields. The in-core inode, including the on-disk piece and the working piece, is currently around 1 KB. The AIX kernel caches in-core inodes for the sake of performance.

To reduce contention among processors, the inode cache is divided into several cache classes. During system initialization, the AIX kernel creates two cache classes per processor plus one, that is, a total of [n(processor) * 2 + 1] cache classes. The iCacheClass and iCache structures are defined in /usr/include/j2/j2_inode.h:

typedef struct iCacheClass {
        MUTEXLOCK_T cc_lock;
        int32 cc_nInode;                    /* # of inode in cacheList  */
        CDLL_HEADER(inode) cc_cacheList; /* cacheList header */
        struct pile *cc_pile;             /* inode pile */
        boolean_t pileFull;               /* pile is full */
} iCacheClass_t;

struct iCache {
        int32 nInode;                 /* # of in-memory inode */
        int16 nCacheClass;            /* # of cacheClass */
        struct iCacheClass *cacheTable;
        int32 nInodePerCacheClass;    /* # of inode per cacheClass */
        int32 nHashClass;             /* # of hashClass - 1 */
        int32 nNewHashClass;          /* # of hashClass - 1 */
        int32 nInodePerHashClass;     /* # of inode per hashClass */
        struct iHashClass **hashTable;
        int32 nPagesPerCacheClass; /* # of pile pages per cacheClass */
        int32 nMaxInode;            /* nInode at initialization time */
};

Each cache class is given a pile for inode allocations:

struct pile {
        eye_catch_t pile_eyec;         /* 8: pile eye-catcher */
        uint32_t flags;             /* 4: guarded by pile_lock */
        uint16_t obj_size;          /* 2: opaque object size */
        uint16_t align;             /* 2: object align (offset mask) */
        uint16_t slab_size;         /* 2: alloc slab size in pages */
        ......
        uint64_t max_total_pages;   /* 8: max total pages, ideally */
        uint64_t min_total_pages;
        uint64_t cur_total_pages;   /* 8: real world value */
        ......      
};

A pile can be configured with a maximum number of pages. This max_total_pages field determines how many inodes can be allocated before the pile is full and starts recycling inodes from the cache list. The pile can be forced to shrink and can be allowed to expand. This happens during memory DR and j2_inodeCacheSize tuning.


Tuning j2_inodeCacheSize

The inode cache size can be tuned by changing the j2_inodeCacheSize tunable by using the ioo command. It defaults to 400 in AIX 6.1 and defaults to 200 in AIX 7.1. The value does not explicitly indicate the amount that will be used, but is instead a scaling factor. It is used in combination with the size of the main memory to determine the maximum memory usage for the inode cache. The current formula is:

(inode cache memory) = (system memory)*(j2_inodeCacheSize)/4000

We can run the following command to display the current value of j2_inodeCacheSize:

#ioo -a |grep j2_inodeCacheSize
j2_inodeCacheSize = 400

We can get details about the inode cache by using the kdb command:

(0)> i2 -c
iCache:
  nInode:         0xB3306 (733958)
  nMaxInode:      0xB3306 (733958)
  nCacheClass:    17
  nHashClass:     0xFFFF (65535)
  nNewHashClass:  0xFFFF (65535)
  cacheTable:     0xF10001003B4FC000
  hashTable:      0xF10001003B54B000

Cache table:
   CLASS      LOCK    INODES    CACHELIST.HEAD              PILE  FULL
       0         0       260  F10001003FD92080  F10001003B502300  0
       1         0       273  F10001003D4A2880  F10001003B502600  0
       ……
      16         0       260  F10001003FF17880  F10001003B503400  0

(0)> dw iCache 12
iCache+000000: 000B3306 00110000 F1000100 3B4FC000  ..3.........;O..
iCache+000010: 0000A8A6 0000FFFF 0000FFFF 00000010  ................
iCache+000020: F1000100 3B54B000 000029D5 000B3306  ....;T....)...3.

The output shows that nInodePerCacheClass is 0xA8A6 and nPagesPerCacheClass is 0x29D5.

We can examine the pile of each cache class through the kdb command:

(0)> pile F10001003B502300
name........iCache
prev........0xF100010034832800 next........0xF10001003B502600
eyec........0x4C465361 objectsize..0x0400     align.......0x007F
slabsize....0x0010     intpri......0x000B
flags.......0x00000026 SLAB_PINNED ZEROED PROTECTED
pa_slabs....0x0000
paq_next....0x0000000000000000 paq_prev....0x0000000000000000
pa_flags....0x00000000
maxtotalpg..0x00000000000029D5                mintotalpg..0x0000000000000000
curtotalpg..0x0000000000000060

The output shows that maxtotalpg is 0x29D5, which is equal to nPagesPerCacheClass.

Now, let us decrease the current value of j2_inodeCacheSize using the ioo command:

#ioo -o j2_inodeCacheSize=300
Setting j2_inodeCacheSize to 300

(0)> i2 -c
iCache:
  nInode:         0x86609 (550409)
  nMaxInode:      0xB3306 (733958)
  nCacheClass:    17
  nHashClass:     0xFFFF (65535)
  nNewHashClass:  0xFFFF (65535)
  cacheTable:     0xF10001003B4FC000
  hashTable:      0xF10001003B54B000

Cache table:
   CLASS      LOCK    INODES    CACHELIST.HEAD              PILE  FULL
       0         0       271  F10001003FF2C480  F10001003B502300  0
       1         0       282  F10001003D4A5880  F10001003B502600  0
       ……
      16         0       272  F10001003FF1C480  F10001003B503400  0

The output shows that nInode is decreased to 0x86609.

(0)> dw iCache 12
iCache+000000: 00086609 00110000 F1000100 3B4FC000  ..f.........;O..
iCache+000010: 00007E79 0000FFFF 0000FFFF 00000010  ..~y............
iCache+000020: F1000100 3B54B000 00001F5F 000B3306  ....;T....._..3.

The output shows that nPagesPerCacheClass is decreased to 0x1F5F.

(0)> pile F10001003B502300
name........iCache
……
maxtotalpg..0x0000000000001F5F                mintotalpg..0x0000000000000000
curtotalpg..0x0000000000000060

The output shows that maxtotalpg is also decreased to 0x1F5F, which is equal to nPagesPerCacheClass.

Now let us increase the current value of j2_inodeCacheSize using the ioo command:

#ioo -o j2_inodeCacheSize=500
Setting j2_inodeCacheSize to 500

(0)> i2 -c
iCache:
  nInode:         0xE0003 (917507)
  nMaxInode:      0xE0003 (917507)
  nCacheClass:    17
  nHashClass:     0xFFFF (65535)
  nNewHashClass:  0xFFFF (65535)
  cacheTable:     0xF10001003B4FC000
  hashTable:      0xF10001003B54B000

Cache table:
   CLASS      LOCK    INODES    CACHELIST.HEAD              PILE  FULL
       0         0      1628  F10001004339D080  F10001003B502300  0
       1         0      1640  F100010042B78480  F10001003B502600  0
        ……
      16         0      1629  F10001004338D080  F10001003B503400  0

The output shows that nInode is increased to 0xE0003.

(0)> dw iCache 12
iCache+000000: 000E0003 00110000 F1000100 3B4FC000  ............;O..
iCache+000010: 0000D2D3 0000FFFF 0000FFFF 00000010  ................
iCache+000020: F1000100 3B54B000 0000344B 000E0003  ....;T....4K....

The output shows that nPagesPerCacheClass is increased to 0x344B.

(0)> pile F10001003B502800
name........iCache
……
maxtotalpg..0x0000000000001F5F                mintotalpg..0x0000000000000000
curtotalpg..0x00000000000001B0

The output shows that maxtotalpg is much less than nPagesPerCacheClass. In fact, it still keeps its previous value of 0x1F5F.


Memory DR operation

After a memory DR operation, we can check the maxtotalpg value using the same method. We find that maxtotalpg decreases after we remove some memory from the LPAR, but it never increases after we add memory to the LPAR.


Trace log and report

A trace can help us to understand this problem further.

Trace for decreasing inodeCacheSize


#ioo -a | grep j2_inodeCacheSize
             j2_inodeCacheSize = 500

#trace -anl -C all -T100M -L200M -K vmm -o trace.raw;ioo -o j2_inodeCacheSize=300; trcstop
Setting j2_inodeCacheSize to 300

#trcrpt -C all -o trc.out trace.raw
4DC ioo 0222842221758033 close 0.1229326312 pile_config_max: pile= F10001003B502300, pflags=0026, origmax=344B, newmax=1F5F: rc=0000

From the trace log, we can find that the ioo command called the pile_config_max() function to decrease the maxtotalpg value.

Trace for increasing inodeCacheSize

#ioo -a | grep j2_inodeCacheSize
             j2_inodeCacheSize = 300

#trace -anl -C all -T100M -L200M -K vmm -o trace.raw;
 ioo -o j2_inodeCacheSize=400; trcstop
Setting j2_inodeCacheSize to 400

#trcrpt -C all -o trc.out trace.raw
#grep pile_config_max trc.out

From the trace log, we find that the pile_config_max() function is not called when we increase j2_inodeCacheSize, and that is why maxtotalpg was left untouched.

We can get a similar trace log for the memory DR operation: the pile_config_max() function is called when we remove memory, but it is not called when we add memory.


Inode cache exhaustion

Because maxtotalpg can only decrease, the inode cache can be exhausted after j2_inodeCacheSize has been decreased or memory has been removed through DLPAR, even if j2_inodeCacheSize is later increased or memory is later added back.

The following test demonstrates the inode cache exhaustion.

# ioo -o j2_inodeCacheSize=50
Setting j2_inodeCacheSize to 50
# ioo -o j2_inodeCacheSize=400
Setting j2_inodeCacheSize to 400
#kdb
(0)> i2 -c
iCache:
  nInode:         0xB3306 (733958)
  nMaxInode:      0xB3306 (733958)
  nCacheClass:    17
  nHashClass:     0xFFFF (65535)
  nNewHashClass:  0xFFFF (65535)
  cacheTable:     0xF10001003A70F000
  hashTable:      0xF10001003B57D000

Cache table:
   CLASS      LOCK    INODES    CACHELIST.HEAD              PILE  FULL
       0         0       282  F10001003FAFA080  F10001003B50D300  0
       1         0       280  F10001003D4B4480  F10001003B50D600  0
    ……
      16         0       281  F10001003FAEA080  F10001003B510500  0

(0)> dw iCache 12
iCache+000000: 000B3306 00110000 F1000100 3A70F000  ..3.........:p..
iCache+000010: 0000A8A6 0000FFFF 0000FFFF 00000010  ................
iCache+000020: F1000100 3B57D000 000029D5 000B3306  ....;W....)...3.

(0)> pile F10001003B50D300
name........iCache
……
maxtotalpg..0x0000000000000539                mintotalpg..0x0000000000000000
curtotalpg..0x0000000000000060

Then, we write a program to open a lot of files (see openfile.c) and run it:

#./openfile 1000 100 /home/testdir

The following error message gets displayed:

open 90 failed Resource temporarily unavailable

From kdb, we can find that all piles are full and all cache lists are empty, that is, the inode cache has been exhausted:

(0)> i2 -c
iCache:
  nInode:         0xB3306 (733958)
  nMaxInode:      0xB3306 (733958)
  nCacheClass:    17
  nHashClass:     0xFFFF (65535)
  nNewHashClass:  0xFFFF (65535)
  cacheTable:     0xF10001003A70F000
  hashTable:      0xF10001003B57D000

Cache table:
   CLASS      LOCK    INODES    CACHELIST.HEAD              PILE  FULL
       0         0         0  F10001003A70F010  F10001003B50D300  1
       1         0         0  F10001003A70F040  F10001003B50D600  1
       ……
      16         0         0  F10001003A70F310  F10001003B510500  1

The value of curtotalpg is almost the same as the value of maxtotalpg:

(0)> pile F10001003B510300 | grep totalpg
maxtotalpg..0x0000000000000539                mintotalpg..0x0000000000000000
curtotalpg..0x0000000000000530

Because the inode cache has been exhausted, we cannot open any new file and cannot start any new process. Also, no user can log in to the system. Furthermore, it may cause more serious consequences, such as file system corruption and even system failure depending on system configuration.


Detect and fix incorrect maxtotalpg

There are several methods to detect and fix the incorrect maxtotalpg value for inode cache piles.

Using kdb

First, we can use kdb and run the pile command to check whether maxtotalpg is equal to nPagesPerCacheClass, as shown previously. If the two values are inconsistent, we can restart the system so that maxtotalpg returns to its default value. We can also manually modify maxtotalpg to its correct value in kdb.

There is also a more convenient method. We can first increase j2_inodeCacheSize to a larger value and then decrease it to the required value. Similarly, we can first add more memory and then remove memory down to the required amount. Either way, maxtotalpg ends up with the correct value.

Writing a program to fix maxtotalpg

We can also write a program to automatically check and fix the maxtotalpg value (see pilefix.c).

First, we get the memory address of iCache using kdb:

(0)> ns
Symbolic name translation off
(0)> dd iCache
0285C158

Then, we read and compare the values of maxtotalpg and nPagesPerCacheClass through the /dev/kmem interface. If the value of maxtotalpg is different from that of nPagesPerCacheClass, we write back the correct value:

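/* Excerpt from pilefix.c. kread() and kwrite() are that program's helpers,
   presumably reading and writing kernel memory through /dev/kmem. */
open("/dev/kmem", O_RDWR, 0);
/* Read the iCache structure from the address found with kdb. */
kread((unsigned long long)icachep, (char *)&icache, sizeof(icache));
ccp = (iCacheClass_t *)icache.cacheTable;
/* Walk every cache class and compare its pile limit with the target. */
for (i = 0; i < icache.nCacheClass; i++, ccp++) {
    kread((unsigned long long)ccp, (char *)&cc, sizeof(iCacheClass_t));
    /* pmaxtp: kernel address of max_total_pages in the pile at cc.cc_pile
       (its computation is not shown in this excerpt). */
    kread((unsigned long long)pmaxtp, (char *)&max_total_pages, 8);
    if (icache.nPagesPerCacheClass != max_total_pages) {
        /* Write the expected limit back into the pile. */
        max_total_pages = icache.nPagesPerCacheClass;
        kwrite((unsigned long long)pmaxtp, (char *)&max_total_pages, 8);
    }
}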

Conclusion

In AIX 6.1 (later than 6100-04) and AIX 7.1, the maximum pile size of an inode cache class can only be decreased by tuning j2_inodeCacheSize or by DLPAR memory operations. IBM has recently provided authorized program analysis report (APAR) IV41462, which avoids this problem. If IV41462 has not been applied to your AIX system, the approaches introduced in this article can be used to avoid inode cache exhaustion.


Resources

  • IV41462: Provides details about IV41462.
  • mem and kmem: Provides more information relating to mem and kmem special files.
  • The ioo command: Provides more information relating to the ioo command.

Downloads

Description                                           Name         Size
Program to open a lot of files                        openfile.c   3 KB
Program to automatically check and fix the maxtota    pilefix.c    7 KB
