APAR (Authorized Program Analysis Report)
Abstract
IPL-SRCC9002910-LOOP Loop sending message
Error Description
The IPL after a warm flash copy is essentially the same as the
IPL after a system termination in which the data in memory is
not preserved. For example, losing power without Uninterruptible
Power Supply (UPS) protection could prevent the data in memory
from being written to disk. In this case, the QSYSOPR message
queue was being recovered during the first IPL after the warm
copy. However, an unexpected condition was encountered and the
system looped with SRC C9002910 on the panel. A Main Store Dump
was taken and a step mode IPL was started. When the DST signon
screen appeared, service damaged the QSYSOPR message queue and
continued the IPL. This APAR is being opened to see if there is
a way to enhance QSYSOPR recovery during the IPL after losing
memory.
Problem Summary
****************************************************************
* PROBLEM: (SE51961) Licensed Program = 5770SS1 *
* Looping Condition *
****************************************************************
* USERS AFFECTED: All IBM i operating system users for i 7.1. *
****************************************************************
* RECOMMENDATION: Apply PTF SI46804 for i 7.1. *
****************************************************************
A job will go into a loop while sending a message to a
nonprogram message queue, like QHST or QSYSOPR. It does not
affect messages sent to a job log or program message queue.
Frequently, the problem occurs after a system crash, forced IPL,
or use of flash copy support. During the IPL after one of those
functions, the SCPF job will go into a loop while sending a
message. Usually, program QMHSNSTQ appears in the stack as the
program that is looping. One symptom was that the target of the
warm flash copy function stayed at SRC C9002910 for hours. On a
different customer system, a job went into a loop sending a
message to a user profile message queue. That case also involved
QMHSNSTQ looping, but it did not occur during an IPL.
These functions (system crash, forced IPL, and use of flash
copy) result in the loss of data in memory, so it appears as if
the update of QSYSOPR was interrupted in the middle of adding a
message to the message queue. It is similar to a system losing
power without Uninterruptible Power Supply (UPS) protection,
which could prevent the data in memory from being written to
disk. With flash copy, the QSYSOPR message queue was being used
during the first IPL after the flash copy when the loop was
noticed. A Main Store Dump was taken and a step mode IPL was
started. When the DST signon screen appeared, service damaged
the QSYSOPR message queue to continue the IPL. This APAR is
being opened to see if there is a way to enhance QSYSOPR
recovery during the IPL after losing memory.
Problem Conclusion
The message queue being used during the loop had logical damage
or corruption. For example, a message chain would be corrupt
when a message entry points to a wrong location. The corruption
can occur when the updating of a message queue is interrupted in
the middle of making a change. The system is designed to detect
several types of message queue corruption or damage, and would
force an MSGMCH0601 to cause message queue cleanup to occur. The
loop occurred because the message queue cleanup was not getting
invoked as expected. This was caused by a change in a low-level
system function that no longer produced an error as it did in
the past. This fix changes the method used to force the
MCH0601, so the message queue cleanup will once again be called
as expected when logical damage or corruption is detected.
This fixes message handling send operations that are sending
messages to a nonprogram message queue. The fix will not prevent
message queue logical damage or corruption from occurring. It
only forces the MCH0601 to be generated, so the cleanup program
can be invoked as it was in the past, before the low-level
system change was made to eliminate the error.
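
The failure mode and the fix described above can be illustrated
with a minimal sketch. It is not IBM's implementation: the chain
layout (a list of "next entry" indexes), the END marker, and the
names QueueDamaged, walk_chain, clean_up, and send_message are
all hypothetical and chosen only for this illustration. The
QueueDamaged exception stands in for the MCH0601 that triggers
cleanup. The sketch shows why a chain entry that points to a
wrong location makes a walker loop unless the damage is signaled,
and how signaling it lets cleanup run.

```python
END = -1  # assumed marker for "no next entry" in this sketch


class QueueDamaged(Exception):
    """Hypothetical stand-in for the exception that triggers queue cleanup."""


def walk_chain(next_entry, head):
    """Follow the chain from head and return the entries visited, in order.

    next_entry[i] is the index of the entry that follows entry i.
    A pointer that leaves the list or revisits an entry is treated as
    logical damage and is signaled instead of being followed forever.
    """
    seen = set()
    order = []
    current = head
    while current != END:
        if current in seen or not 0 <= current < len(next_entry):
            raise QueueDamaged(f"bad chain pointer: {current}")
        seen.add(current)
        order.append(current)
        current = next_entry[current]
    return order


def clean_up(next_entry, head):
    """Return a repaired copy whose chain ends at the last valid entry."""
    repaired = list(next_entry)
    seen = set()
    last_good = None
    current = head
    while current != END and 0 <= current < len(repaired) and current not in seen:
        seen.add(current)
        last_good = current
        current = repaired[current]
    # If the walk stopped on a bad pointer, cut the chain before it.
    if current != END and last_good is not None:
        repaired[last_good] = END
    return repaired


def send_message(next_entry, head):
    """Walk the chain; on signaled damage, clean up once and walk again."""
    try:
        return next_entry, walk_chain(next_entry, head)
    except QueueDamaged:
        repaired = clean_up(next_entry, head)
        return repaired, walk_chain(repaired, head)
```

For example, in the chain [1, 2, 0] the last entry points back
to the first, so a walker that never signals the damage would
cycle indefinitely. In this sketch send_message signals the
damage, truncates the chain after entry 2, and then completes.
As in the real fix, the damage itself is not prevented; only the
path from detection to cleanup is restored.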
Temporary Fix
*********
* HIPER *
*********
Comments
Circumvention
PTFs Available
Affected Modules
Affected Publications
Summary Information
Status:            CLOSED UR1
HIPER:             Yes
Component:         5770SS1WM
Failing Module:    RCHMGR
Reported Release:  R710
Duplicate Of:
Document Information
Modified date: 13 October 2012