We have 4 GPFS servers accessing 8 storage boxes (DS3400).
Since we upgraded from RHEL 5.4 to RHEL 6.2, we see the servers slow down frequently.
Then we get messages of this kind in the kernel ring buffer:
INFO: task sh:10559 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sh D 0000000000000005 0 10559 10558 0x00000080
ffff88032593bd38 0000000000000086 0000000000000000 ffff88032ce5e780
ffff88032593bcc8 ffffffffa06f4ccc ffff88032593bcc8 ffffffff81193bfa
ffff880373ca6638 ffff88032593bfd8 000000000000fb88 ffff880373ca6638
<ffffffffa06f4ccc> ? gpfs_i_permission_noacl+0x4c/0xe0 mmfslinux
<ffffffff81193bfa> ? dput+0x9a/0x150
<ffffffff811638b2> ? kmem_cache_alloc+0x182/0x190
<ffffffff8104457c> ? __do_page_fault+0x1ec/0x480
<ffffffff81500bf5> ? page_fault+0x25/0x30
<ffffffff811984f2> ? alloc_fd+0x92/0x160
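The `D` after the command name in the report is the task state: uninterruptible sleep, typically waiting on I/O or a lock. The kernel's hung-task detector warns about tasks that stay in that state longer than `hung_task_timeout_secs` (120 seconds here). Because the process that hangs varies, it can help to see which tasks are blocked at the same moment. Below is a minimal Python sketch, assuming a standard Linux `/proc`; `parse_stat` and `blocked_tasks` are illustrative helpers, not part of any GPFS tool.

```python
import os
from typing import List, Tuple


def parse_stat(line: str) -> Tuple[int, str, str]:
    """Return (pid, comm, state) from the contents of one /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split after the first " (" and before the LAST ") ".
    pid_part, rest = line.split(" (", 1)
    comm, after = rest.rsplit(") ", 1)
    state = after.split()[0]
    return int(pid_part), comm, state


def blocked_tasks(proc: str = "/proc") -> List[Tuple[int, str]]:
    """List (pid, comm) of tasks currently in uninterruptible sleep ('D').

    Only thread-group leaders are scanned; individual threads of a
    multithreaded process live under /proc/<pid>/task/<tid>/stat.
    """
    result = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except OSError:
            continue  # the process exited between listdir() and open()
        if state == "D":
            result.append((pid, comm))
    return result
```

Calling `blocked_tasks()` on an affected node while it is slow shows every task stuck at that moment. For the kernel stacks of those tasks, `echo w > /proc/sysrq-trigger` (as root, with SysRq enabled) writes the blocked tasks and their traces to the kernel log.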
The process that hangs varies at random.
However, the first function in the stack trace is always gpfs_i_permission_noacl.
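To check that observation across many reports, and to see whether the offset inside `gpfs_i_permission_noacl` is also identical each time, the trace lines can be parsed mechanically. Below is a minimal Python sketch, assuming frames formatted exactly as in the log above (`<address> ? function+offset/size module`); `parse_trace` and `top_frames` are hypothetical helpers. The kernel prints `?` for addresses it found on the stack but could not confirm as part of the call chain, so such frames deserve some caution.

```python
import re
from typing import Iterable, List, NamedTuple, Optional, Set, Tuple

# Matches frames such as:
#   <ffffffffa06f4ccc> ? gpfs_i_permission_noacl+0x4c/0xe0 mmfslinux
# [ \t] is used instead of \s so a match never spills onto the next line.
FRAME_RE = re.compile(
    r"<(?P<addr>[0-9a-f]+)>[ \t]+(?P<unreliable>\?[ \t]+)?"
    r"(?P<func>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:[ \t]+\[?(?P<module>\w+)\]?)?"
)


class Frame(NamedTuple):
    addr: int
    func: str
    offset: int            # bytes into the function
    size: int              # total size of the function
    module: Optional[str]  # None for core-kernel symbols
    unreliable: bool       # True when the kernel printed '?'


def parse_trace(text: str) -> List[Frame]:
    """Extract every stack frame from a block of kernel log text, in order."""
    return [
        Frame(
            addr=int(m["addr"], 16),
            func=m["func"],
            offset=int(m["off"], 16),
            size=int(m["size"], 16),
            module=m["module"],
            unreliable=m["unreliable"] is not None,
        )
        for m in FRAME_RE.finditer(text)
    ]


def top_frames(reports: Iterable[str]) -> Set[Tuple[str, int]]:
    """(function, offset) of the first frame of each report.

    A single-element result means every report stopped at the same spot.
    """
    result = set()
    for report in reports:
        frames = parse_trace(report)
        if frames:
            result.add((frames[0].func, frames[0].offset))
    return result
```

If `top_frames` returns a single pair for all collected reports, the hang is always at the same instruction, which is the case where disassembling the module is most useful.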
Sometimes the machines crash completely without any hint of what happened.
We are already on the latest patch level for the machine and for the kernel/drivers.
Has anyone got a clue?
Re: x3650M2 slows down frequently (reply by esj, 2012-10-26T16:47:58Z, accepted answer)
Here is what I was told by somebody who knows what he is talking about:
This is an OS call into the GPFS permission operation, which does very little before turning around
and calling "generic_permission" back in Linux. The traceback is strange in that dput does not make
such a call. If the offset is always the same (it sounds like this happens frequently), an objdump
of the mmfslinux module may help locate the instruction we are hung on. We may also need
the entire Linux log file, along with a dump of the kernel threads, to look for a deadlock.
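The objdump suggestion can be narrowed to the few instructions around the reported offset. Below is a minimal Python sketch, assuming you first look up the value of `gpfs_i_permission_noacl` in the same module file with `nm`, and assuming the function lives in the `.text` section (in a relocatable `.ko`, that value is section-relative, which matches the addresses `objdump -d` prints for that section). `objdump_window` is a hypothetical helper and the module path is a placeholder; the helper only builds a command line from objdump's `-d`, `-j`, `--start-address` and `--stop-address` options. Stack-trace addresses are normally return addresses, so the instruction of interest is often the one just before `+0x4c`.

```python
from typing import List


def objdump_window(symbol_addr: int, offset: int, size: int,
                   module_path: str = "mmfslinux.ko",  # placeholder path
                   context: int = 0x20) -> List[str]:
    """Return an objdump argv that disassembles func+offset +/- context bytes.

    symbol_addr:  value of the function symbol from `nm <module_path>`.
    offset, size: the '+0x4c/0xe0' pair from the stack trace.
    Assumes the function is in the module's .text section.
    """
    if not 0 <= offset < size:
        raise ValueError("offset must lie inside the function")
    # Clamp the window to the function's own bounds.
    start = symbol_addr + max(0, offset - context)
    stop = symbol_addr + min(size, offset + context)
    return [
        "objdump", "-d", "-j", ".text",
        f"--start-address=0x{start:x}",
        f"--stop-address=0x{stop:x}",
        module_path,
    ]
```

For example, `subprocess.run(objdump_window(sym, 0x4c, 0xe0), check=True)`, with `sym` taken from `nm` output. If the module happens to be built with debug information, running `gdb` on it and entering `list *(gpfs_i_permission_noacl+0x4c)` maps the offset to a source line instead.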
This cannot be handled in this forum. You are better off opening a PMR.