10 replies. Latest post: 2011-08-02T14:51:04Z by MatthewBourne
MatthewBourne
25 Posts

Pinned topic AIX61 TL03 SP03 Kernel Memory Leak?

‏2011-03-01T18:02:06Z |
Folks

Been struggling to help my customer with this issue for some months now - unfortunately nothing tremendously useful coming out of the PMR route just yet. Wondering if anyone has seen similar symptoms?

Please help!

Thanks

Scenario
Uptime is less than 60 days, yet memory is all but exhausted, including paging space. Left alone, the LPAR will probably crash, logging "out of resource" type messages in the error log.


> oslevel -s
6100-03-03-0943



> lsattr -El mem0
ent_mem_cap         I/O memory entitlement in Kbytes           False
goodsize       4096 Amount of usable physical memory in Mbytes False
size           4096 Total amount of physical memory in Mbytes  False
var_mem_weight      Variable memory capacity weight            False



> lsps -a
Page Space      Physical Volume   Volume Group    Size %Used Active Auto  Type Chksum
hd6             hdisk0            rootvg        4160MB    56   yes   yes    lv     0


NMON tells me that the kernel is the biggest consumer of memory pages:


FileSystemCache(numperm) 11.1%
Process                  16.3%
System                   71.9%
Free                      0.8%


SVMON says pgsp is over 50% utilised...


> svmon -G -O unit=MB
Unit: MB
-------------------------------------------------------------------------------
               size       inuse        free         pin     virtual  available
memory      4096.00     4060.28        35.7      985.27     5497.27     367.96
pg space    4160.00     2314.65

               work        pers        clnt       other
pin          846.77           0           0      138.50
in use      3441.47           0      618.81


... but I've got nearly 2GB occupancy in pgsp (7x256MB) due to kernel segments that show a minimal if not zero count of pages in use (~110MB). Over time, we'll observe the number of segments that look like this increase - until eventually the LPAR becomes unresponsive and ultimately crashes.


> svmon -S -t 10 -O unit=MB,filtercat=kernel,sortseg=pgsp
Unit: MB

    Vsid      Esid Type Description              PSize  Inuse   Pin   Pgsp Virtual
   46008         - work kernel heap                  m      0     0 256.00  256.00
   5600a         - work kernel heap                  m      0     0 256.00  256.00
   3e007         - work kernel heap                  m      0     0 256.00  256.00
   4e009         - work kernel heap                  m      0     0 256.00  256.00
   36006         - work kernel heap                  m      0     0 256.00  256.00
   2e005         - work kernel heap                  m   15.0     0 241.00  256.00
   5e00b         - work kernel heap                  m 100.88  0.06 155.12  256.00
    6a00         - work kernel heap                  m 106.94  8.88 106.25  115.25
    4000         - work page table area              s   9.45  0.17   23.7    23.9
   28005  9ffffffd work shared library              sm   0.28     0   7.66    7.66
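
For anyone wanting to put numbers on the two svmon listings above, here is a minimal sketch. It assumes the column layouts shown in this post ("pg space" followed by size and inuse in the -G output; Pgsp as the second-to-last column in the -S output). The helper names pgsp_pct and kernel_heap_pgsp are made up for illustration and are not part of AIX.

```shell
# Percentage of paging space in use, from "svmon -G -O unit=MB" on stdin.
# Assumes a line of the form: pg space <size> <inuse>
pgsp_pct() {
    awk '$1 == "pg" && $2 == "space" {
             printf "pg space used: %.1f%%\n", 100 * $4 / $3
         }'
}

# Count the "kernel heap" segments and total their Pgsp column (MB),
# from "svmon -S -O unit=MB,filtercat=kernel" on stdin.
# Assumes Pgsp is the second-to-last field on each segment line.
kernel_heap_pgsp() {
    awk '/kernel heap/ { n++; total += $(NF - 1) }
         END { printf "%d segments, %.2f MB pgsp\n", n, total }'
}

# Intended use on the affected LPAR (AIX only):
#   svmon -G -O unit=MB | pgsp_pct
#   svmon -S -O unit=MB,filtercat=kernel,sortseg=pgsp | kernel_heap_pgsp
```

Fed the figures listed above, the first helper gives about 55.6% (in line with the 56% from lsps -a) and the second totals 1782.37 MB across the 8 kernel heap segments shown, which is where the "nearly 2GB" comes from.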


VMO settings are as per recommendation, I believe:


> vmo -F -L
NAME                      CUR    DEF    BOOT   MIN    MAX    UNIT           TYPE DEPENDENCIES
--------------------------------------------------------------------------------
ams_loan_policy           n/a    1      1      0      2      numeric           D
force_relalias_lite       0      0      0      0      1      boolean           D
kernel_heap_psize         64K    0      0      0      16M    bytes             B
lgpg_regions              0      0      0      0      8E-1                     D lgpg_size
lgpg_size                 0      0      0      0      16M    bytes             D lgpg_regions
low_ps_handling           1      1      1      1      2                        D
maxfree                   1088   1088   1088   16     838860 4KB pages         D minfree memory_frames
maxperm                   899409        899409                                 S
maxpin                    845956        845956                                 S
maxpin%                   80     80     80     1      100    % memory          D pinnable_frames memory_frames
memory_frames             1M            1M                   4KB pages         S
memplace_data             2      2      2      1      2                        D memory_affinity
memplace_mapped_file      2      2      2      1      2                        D memory_affinity
memplace_shm_anonymous    2      2      2      1      2                        D memory_affinity
memplace_shm_named        2      2      2      1      2                        D memory_affinity
memplace_stack            2      2      2      1      2                        D memory_affinity
memplace_text             2      2      2      1      2                        D memory_affinity
memplace_unmapped_file    2      2      2      1      2                        D memory_affinity
minfree                   960    960    960    8      838860 4KB pages         D maxfree memory_frames
minperm                   29980         29980                                  S
minperm%                  3      3      3      1      100    % memory          D maxperm% maxclient%
nokilluid                 0      0      0      0      4G-1   uid               D
npskill                   8320   8320   8320   1      1M-1   4KB pages         D
npswarn                   33280  33280  33280  1      1M-1   4KB pages         D
numpsblks                 1040K         1040K                4KB blocks        S
pinnable_frames           796362        796362               4KB pages         S
relalias_percentage       0      0      0      0      32K-1                    D
scrub                     0      0      0      0      1      boolean           D
v_pinshm                  0      0      0      0      1      boolean           D
vmm_default_pspa          0      0      0      -1     100    numeric           D
wlm_memlimit_nonpg        1      1      1      0      1      boolean           D

##Restricted tunables
--------------------------------------------------------------------------------
cpu_scale_memp            8      8      8      4      64                       B
data_stagger_interval     161    161    161    0      4K-1   4KB pages         D lgpg_regions
defps                     1      1      1      0      1      boolean           D
framesets                 2      2      2      1      10                       B
htabscale                 n/a    -1     -1     -4     0                        B
kernel_psize              64K    0      0      0      16M    bytes             B
large_page_heap_size      0      0      0      0      8E-1   bytes             B lgpg_regions
lru_file_repage           0      0      0      0      1      boolean           D
lru_poll_interval         10     10     10     0      60000  milliseconds      D
lrubucket                 128K   128K   128K   64K    1M     4KB pages         D
maxclient%                90     90     90     1      100    % memory          D maxperm% minperm%
maxperm%                  90     90     90     1      100    % memory          D minperm% maxclient%
mbuf_heap_psize           64K    0      0      0      16M    bytes             B
memory_affinity           1      1      1      0      1      boolean           B
npsrpgmax                 65K    65K    65K    1      1M-1   4KB pages         D npsrpgmin
npsrpgmin                 49920  49920  49920  1      1M-1   4KB pages         D npsrpgmax
npsscrubmax               65K    65K    65K    1      1M-1   4KB pages         D npsscrubmin
npsscrubmin               49920  49920  49920  1      1M-1   4KB pages         D npsscrubmax
num_spec_dataseg          0      0      0      0      8E-1                     B
page_steal_method         1      1      1      0      1      boolean           B
psm_timeout_interval      20000  20000  20000  0      60000  milliseconds      D
rpgclean                  0      0      0      0      1      boolean           D
rpgcontrol                2      2      2      0      3                        D
scrubclean                0      0      0      0      1      boolean           D
soft_min_lgpgs_vmpool     0      0      0      0      90     %                 D lgpg_regions
spec_dataseg_int          512    512    512    0      8E-1                     B
strict_maxclient          1      1      1      0      1      boolean           D strict_maxperm
strict_maxperm            0      0      0      0      1      boolean           D strict_maxclient
vm_modlist_threshold      -1     -1     -1     -2     2G-1                     D
vmm_fork_policy           1      1      1      0      1      boolean           D
vmm_mpsize_support        2      2      2      0      2      numeric           B
vmm_vmap_policy           0      0      0      0      1      boolean           D

and I have the most up to date fixes I can find for this TL/SP combination:


> emgr -l

ID  STATE LABEL      INSTALL TIME      UPDATED BY ABSTRACT
=== ===== ========== ================= ========== ======================================
1    S    IZ60190    12/02/10 14:40:41            Ifix for an apar IZ60190
2    S    z8703761F3 12/02/10 14:44:35            Ifix for apar IZ87037 at 61F SP03.
Updated on 2011-08-02T14:51:04Z by MatthewBourne
  • MatthewBourne
    25 Posts

    Re: AIX61 TL03 SP03 Kernel Memory Leak?

    ‏2011-03-02T18:10:34Z  in response to MatthewBourne
    Any thoughts anybody?
  • MatthewBourne
    25 Posts

    Re: AIX61 TL03 SP03 Kernel Memory Leak?

    ‏2011-03-08T17:30:22Z  in response to MatthewBourne
    Howdy folks - just wondering if anyone has seen anything similar?
    • dmj12031
      2 Posts

      Re: AIX61 TL03 SP03 Kernel Memory Leak?

      ‏2011-03-08T19:54:43Z  in response to MatthewBourne
      Are you running the audit subsystem?

      IZ87818: MEMORY INCREASING FOR EVENT MANAGEMENT IN AUDITING
      • MatthewBourne
        25 Posts

        Re: AIX61 TL03 SP03 Kernel Memory Leak?

        ‏2011-03-09T09:53:10Z  in response to dmj12031
        Funnily enough, we've had a hint that auditing might be something to look at - I'm expecting to disable it on a few LPARs today and see what happens - I'll keep the thread up to date. Thanks for your reply.
        • MatthewBourne
          25 Posts

          Re: AIX61 TL03 SP03 Kernel Memory Leak?

          ‏2011-03-17T11:32:29Z  in response to MatthewBourne
          Update:

          commented out "audit" from inittab, and rebooted. I found the following snippet handy for tracking VM/PGSP usage so I thought I'd share it:

          
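          svmon -G -O unit=MB,timestamp=on -i 24 600 | awk '
          BEGIN { "hostname | cut -c1-8" | getline y }   # y = first 8 characters of the hostname
          {
              if ($0 ~ /^U/) {
                  x = $4      # 4th field of the "Unit:" line (the timestamp when timestamp=on)
              } else if (($0 ~ /size/) || ($0 ~ /^m/) || ($0 ~ /^pg/)) {
                  sub(/g s/, "g_s", $0)      # "pg space" -> "pg_space" so it stays a single field
                  print y " " x " " $0
              }
          }'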
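
If the output of the snippet above is redirected to a file, the paging-space trend can be summarised afterwards. This is only a sketch: it assumes each logged paging-space line has the shape "host time pg_space size inuse", which is what the snippet should produce, and pgsp_growth and the file name svmon.log are hypothetical.

```shell
# Report the first and last paging-space "inuse" values (MB) in a log
# written by the tracking snippet, plus the difference between them.
# Assumed line shape: <host> <HH:MM:SS> pg_space <size> <inuse>
# Reads the named file(s), or stdin when no file is given.
pgsp_growth() {
    awk '$3 == "pg_space" {
             if (!seen) { first = $5; seen = 1 }
             last = $5
         }
         END {
             if (seen)
                 printf "pgsp inuse: %.2f -> %.2f MB (delta %.2f)\n", first, last, last - first
         }' "$@"
}

# Example (hypothetical file name):
#   pgsp_growth svmon.log
```

A steadily positive delta across reboots-free intervals would be consistent with the leak described earlier in the thread; a flat one suggests the change being tested is helping.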
          


          If it turns up anything interesting I'll let you know ...
          • MatthewBourne
            25 Posts

            Re: AIX61 TL03 SP03 Kernel Memory Leak?

            ‏2011-03-25T15:55:02Z  in response to MatthewBourne
            Update:
            audit disabled on a number of LPARs but kernel memory footprint still seems to be growing. Nothing noticeable in paging space yet.

            We've also been asked about the "filepath" kernel extension and why we're running it. Looks like it's part of a TSM client installation to support Journal-based backups. Anyone else running TSM client 6.1.0.0 with similar issues?

            M.
            • flodstrom
              113 Posts

              Re: AIX61 TL03 SP03 Kernel Memory Leak?

              ‏2011-03-28T15:15:56Z  in response to MatthewBourne
              Our systems are slightly older than yours (AIX 6.1 TL3 SP1) and we have a slightly newer TSM client (v6.2.2). I know that the TSM client can take up a surprising amount of memory, around 1.2G on each host. We don't really notice it since they have 32G RAM and more, but I can imagine it being a burden for a smaller host with only 4G RAM.

              I don't think we are using that filepath kernel extension. However, are there any quick ways of checking that?
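
One possible quick check, offered as a sketch rather than a definitive answer: AIX's genkex command lists the kernel extensions currently loaded, so its output can be filtered by name. The helper name has_kext is made up, and matching on the string "filepath" is an assumption about how the TSM extension shows up in that list.

```shell
# Filter a kernel-extension listing read from stdin for a given name.
# Prints the matching lines; the exit status is non-zero when nothing matches.
has_kext() {
    grep -i -- "$1"
}

# Intended use (AIX only; "filepath" is an assumed match string):
#   genkex | has_kext filepath
```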

              Overall we don't have any problems with growing kernel sizes on our hosts!
  • niella
    37 Posts

    Re: AIX61 TL03 SP03 Kernel Memory Leak?

    ‏2011-03-31T07:22:17Z  in response to MatthewBourne
    Hi there,

    I would strongly recommend that you (always) apply the latest SP when faced with such a problem, in this case AIX 6100-03-09. There are many references to leaks, including a pinned memory leak (IZ94045), and a HIPER relating to large pages.

    There is not much risk in applying the latest SP; service packs contain fixes, not new functionality.

    Hope this helps.

    Niel
    • MatthewBourne
      25 Posts

      Re: AIX61 TL03 SP03 Kernel Memory Leak?

      ‏2011-04-27T10:18:18Z  in response to niella
      Thanks Niel

      Unfortunately, installation of service packs & maintenance levels is controlled very tightly here. If it were a small number of servers, I'd be suggesting the same as you - but I don't see how the customer could maintain a consistent estate and follow such a reactive approach.

      For interest, there are 2 updates:

      (1) the hosts on which the audit subsystem has been disabled are stable. No undue memory pressure or exceptionally high kernel memory usage

      (2) hosts on which the full kernel debug option has been enabled are also stable - despite still running the audit subsystem

      M.
  • MatthewBourne
    25 Posts

    Re: AIX61 TL03 SP03 Kernel Memory Leak?

    ‏2011-08-02T14:51:04Z  in response to MatthewBourne
    Hi

    For anyone interested, the cause of the issue has been located and resolved:

    https://www-304.ibm.com/support/docview.wss?uid=isg1IZ91406

    Thanks to you all for your feedback.

    M.