Troubleshooting
Problem
RHEL 7.4 and 7.5 systems on Power Little Endian (LE) hang because they run out of memory.
You might see high kswapdX activity and/or oom-killer related system messages.
This has been observed on ESS nodes but affects other Power LE systems as well.
If you find a node in a hang condition, please trigger a kernel dump to get a vmcore for later analysis.
Symptom
System hang. Unable to ssh into the node. No console login possible.
The node will most likely still respond to ping over network.
Reported free memory is significantly higher than vm.min_free_kbytes.
dmesg might show oom-killer messages:
[466376.234149] JIT Sampler invoked oom-killer: gfp_mask=0x80d0, order=0, oom_score_adj=0
[466376.234157] JIT Sampler cpuset=/ mems_allowed=0
[466376.234161] CPU: 48 PID: 3961 Comm: JIT Sampler Tainted: G W OE ------------ 3.10.0-693.33.1.el7.ppc64le #1
[466376.234165] Call Trace:
[466376.234170] [c0000006ee923750] [c00000000001b3a0] show_stack+0x80/0x330 (unreliable)
[466376.234176] [c0000006ee923800] [c0000000009ea3f0] dump_stack+0x30/0x44
[466376.234179] [c0000006ee923820] [c0000000009e3780] dump_header+0xc4/0x264
[466376.234183] [c0000006ee9238f0] [c0000000009e39b8] oom_kill_process.part.5+0x98/0x474
[466376.234187] [c0000006ee9239c0] [c0000000002604bc] out_of_memory+0x6ac/0x760
[466376.234190] [c0000006ee923a90] [c00000000026a49c] __alloc_pages_nodemask+0xc4c/0xc70
[466376.234194] [c0000006ee923c80] [c0000000002ddcb0] alloc_pages_current+0x1f0/0x430
[466376.234198] [c0000006ee923d00] [c0000000002611c4] get_zeroed_page+0x24/0xa0
[466376.234202] [c0000006ee923d20] [c0000000003f91cc] sysfs_read_file+0x1cc/0x210
[466376.234206] [c0000006ee923dd0] [c0000000003289a8] SyS_read+0x138/0x390
[466376.234210] [c0000006ee923e30] [c00000000000a284] system_call+0x38/0xfc
[466376.234213] Mem-Info:
[466376.234218] active_anon:51878 inactive_anon:11914 isolated_anon:0
active_file:0 inactive_file:70 isolated_file:0
unevictable:379620 dirty:0 writeback:0 unstable:0
slab_reclaimable:1296 slab_unreclaimable:7171
mapped:2505 shmem:10144 pagetables:309 bounce:0
free:33969 free_pcp:0 free_cma:26368
[466376.234228] Node 0 DMA free:2174016kB min:512000kB low:640000kB high:768000kB active_anon:3320192kB inactive_anon:762496kB active_file:0kB inactive_file:4480kB unevictable:24295680kB isolated(anon):0kB isolated(file):0kB present:33554432kB managed:32394624kB mlocked:24295680kB dirty:0kB writeback:0kB mapped:160320kB shmem:649216kB slab_reclaimable:82944kB slab_unreclaimable:458944kB kernel_stack:44592kB pagetables:19776kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:1687552kB writeback_tmp:0kB pages_scanned:383341 all_unreclaimable? yes
[466376.234242] lowmem_reserve[]: 0 0 0
[466376.234246] Node 0 DMA: 977*64kB (UEM) 443*128kB (UEM) 394*256kB (UEM) 180*512kB (UEM) 87*1024kB (UEM) 32*2048kB (UEM) 4*4096kB (M) 0*8192kB 103*16384kB (C) = 2170816kB
[466376.234260] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16384kB
[466376.234262] 10235 total pagecache pages
[466376.234265] 0 pages in swap cache
[466376.234267] Swap cache stats: add 0, delete 0, find 0/0
[466376.234269] Free swap = 8191936kB
[466376.234270] Total swap = 8191936kB
[466376.234272] 524288 pages RAM
[466376.234274] 0 pages HighMem/MovableOnly
[466376.234276] 18122 pages reserved
Cause
By default, RHEL on Power LE systems reserves 5% of the installed memory for kvm_cma.
This memory is reported as free memory but cannot be used by normal applications.
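The effect can be seen in the oom-killer report from the Symptom section: the zone line lists both free and free_cma, and subtracting one from the other gives the memory that is free outside the CMA reservation. The following is a minimal sketch, not an official tool. The function name noncma_free_vs_min is hypothetical, and the parsing assumes the zone line format shown in that report.

```shell
#!/bin/sh
# Sketch: read oom-killer output on stdin and, for each zone line,
# print free memory, the CMA part of it, the remainder, and the min watermark.
# noncma_free_vs_min is a hypothetical helper name, not an existing command.
noncma_free_vs_min() {
    LC_ALL=C awk '
    / free:[0-9]+kB/ && / min:[0-9]+kB/ {
        free_kb = cma_kb = min_kb = 0
        for (i = 1; i <= NF; i++) {
            n = $i
            gsub(/[^0-9]/, "", n)          # keep only the digits of the field
            if ($i ~ /^free:[0-9]+kB$/)     free_kb = n
            if ($i ~ /^free_cma:[0-9]+kB$/) cma_kb  = n
            if ($i ~ /^min:[0-9]+kB$/)      min_kb  = n
        }
        printf "free=%dkB free_cma=%dkB usable=%dkB min=%dkB\n",
               free_kb, cma_kb, free_kb - cma_kb, min_kb
    }'
}

# Usage (assumes the oom-killer report is still in the kernel ring buffer):
#   dmesg | noncma_free_vs_min
```

For the report above, 2174016 kB - 1687552 kB = 486464 kB, which is below the min watermark of 512000 kB even though the total free value looks healthy.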
Environment
Confirmed with kernels:
3.10.0-514.28.1.el7.ppc64le
3.10.0-693.33.1.el7.ppc64le (ESS 5.3.1)
3.10.0-862.14.4.el7.ppc64le (ESS 5.3.2)
3.10.0-862.27.1.el7.ppc64le (ESS 5.3.3)
Diagnosing The Problem
This line can be found in dmesg (from a system with 32GB of RAM):
[ 0.000000] kvm_cma: CMA: reserved 1648 MiB
# cat /proc/vmstat |grep cma
nr_free_cma 26368 (64K pages = 1648MB)
To calculate the amount of free memory that applications can actually use, subtract nr_free_cma from nr_free_pages:
[root@ems1 crash]# cat /proc/vmstat
nr_free_pages 43750
(43750 - 26368) * 64 / 1024 = 1086.375 MB
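The same calculation can be scripted. This is a minimal sketch: usable_free_mb is a hypothetical helper name, and the page size is taken from getconf PAGESIZE unless it is passed explicitly (64 KB in the example above).

```shell
#!/bin/sh
# Sketch: free memory excluding the CMA reservation, in MB.
# usable_free_mb is a hypothetical helper name, not an existing command.
#   $1: file in /proc/vmstat format (default: /proc/vmstat)
#   $2: page size in KB (default: derived from getconf PAGESIZE)
usable_free_mb() {
    vmstat_file=${1:-/proc/vmstat}
    page_kb=${2:-$(( $(getconf PAGESIZE) / 1024 ))}
    LC_ALL=C awk -v page_kb="$page_kb" '
        $1 == "nr_free_pages" { free = $2 }
        $1 == "nr_free_cma"   { cma  = $2 }
        END { printf "%.3f\n", (free - cma) * page_kb / 1024 }
    ' "$vmstat_file"
}

# Usage on a live system:
#   usable_free_mb
```

With the values shown above (nr_free_pages 43750, nr_free_cma 26368, 64 KB pages) the function prints 1086.375.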
Resolving The Problem
kvm_cma allocation during boot can be disabled by adding the kernel option: kvm_cma_resv_ratio=0
Example on an ESS EMS node:
vi /etc/default/grub
GRUB_CMDLINE_LINUX="crashkernel=auto console=hvc0 kvm_cma_resv_ratio=0"
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot
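After the reboot, it can be worth confirming that the option reached the running kernel. This is a minimal sketch: has_cma_disabled is a hypothetical helper name, and the expectation that nr_free_cma then drops to 0 is an assumption that holds only if no other CMA area is configured.

```shell
#!/bin/sh
# Sketch: succeed if the kernel command line contains kvm_cma_resv_ratio=0.
# has_cma_disabled is a hypothetical helper name, not an existing command.
#   $1: file holding the kernel command line (default: /proc/cmdline)
has_cma_disabled() {
    cmdline_file=${1:-/proc/cmdline}
    # Put each option on its own line, then require an exact match.
    tr ' ' '\n' < "$cmdline_file" | grep -qx 'kvm_cma_resv_ratio=0'
}

# Usage after the reboot:
#   has_cma_disabled && echo "kvm_cma reservation disabled"
#   grep nr_free_cma /proc/vmstat
```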
Document Information
Modified date:
06 May 2019
UID
ibm10730789