Fixes are available
APAR status
Closed as program error.
Error description
v3.9 FP1 - platform independent.

In a failover configuration with a large domain, if the ncp_virtualdomain process on the Primary dies or stops while it is transferring data to the Backup domain, the Backup ncp_virtualdomain process stays up but fails to resync and download the topology after re-establishing the connection with the Primary.

Replication steps (reproducible ONLY with a large cache):

1) Load a large Model cache on the Primary (set the virtualdomain retryCount to 0).

2) Bring up the Backup as normal. Wait until the following messages appear in the Backup ncp_virtualdomain.<BACKUPDOMAIN>.trace, which show that it is actively downloading and processing data from the Primary:

Adding chopped packets - Packet A len: 35415 Packet B len: 1024
Adding chopped packets - Packet A len: 36439 Packet B len: 1024
Adding chopped packets - Packet A len: 37463 Packet B len: 1024
CVirtDomSockClient::RivOnClientIO
CVirtDomProtocol::ProcessResponse
{
requestSubj='ITNM/MODEL/QUERY/NCP_PUP';
responseSubj='ITNM/MODEL/UPDATE/NCP_PUF';
responseData=<Opaque Data>;
}
Adding chopped packets - Packet A len: 69 Packet B len: 1024
Adding chopped packets - Packet A len: 1093 Packet B len: 1024
Adding chopped packets - Packet A len: 2117 Packet B len: 1024
Adding chopped packets - Packet A len: 3141 Packet B len: 1024

3) At this point, stop the virtual domain on the Primary. The Backup trace then shows:

Adding chopped packets - Packet A len: 1515 Packet B len: 1024
Adding chopped packets - Packet A len: 2539 Packet B len: 464
CRivSockEngine::OnIO: 0 length packet recived
CVirtDomSockClient::ROSCOnClientDisconnect
Thu Jun 14 12:30:59 2012 Info: A generic warning has occurred found in file CVirtDomSockClient.cc at line 315 - CVirtDomSockClient::ROSCOnClientDisconnect: Connection to primary domain lost
raising alert:
{
EventName='ItnmFailoverConnection',
Severity=3,
EntityName='NCP_PUF',
Description='Connection to Primary domain NCP_PUP lost',
ExtraInfo={
..........deleted few lines......
CRivSockEnd::RSESetEndPointAddress(10.10.10.10, 46729)
Error on socket 16. Connect Failed. Reason: Connection refused Error No: 111
Thu Jun 14 12:30:59 2012 Info: Attempt to connect to server failed found in file CRivObjSockClient.cc at line 1080
Could not connect to a server on domain NCP_PUV - retrying!
[CNotificationBuffer::FlushNotifications] sending 1 notifications
.
.
46384 Could not connect to a server on domain NCP_PUV - retrying!
46385 Receiving query: update state.domains set m_HealthStatus=1 where m_Domain='NCP_PUF';
46386 Receiving query: update state.domains set m_HealthStatus=0 where m_Domain='NCP_PUP';
46387 CVirtDomFailOver::FailOver: Initiating fail over
46388 raising alert:
.
.

4) Restart the Primary virtual domain (just reinsert it into Ctrl). The Backup trace now shows the following (the Backup thinks the Primary is healthy):

leaving CRivSockEngine::RSEAddClientIO().
leaving CRivObjSockClient::ROSCConnectIO().
Receiving query: update state.domains set m_HealthStatus=1 where m_Domain='NCP_PUP';
CVirtDomFailOver::FailBack: Initiating fail back
raising alert:
{
EventName='ItnmFailover',
.
.

A couple of minutes later, the problem appears on the Backup:

.
.
CRivSnoop::OnData: Snooped a packet onITNM/CTRL/STATE/NCP_PUF
CRivUpdateThread::RUTUpdateThread: Processing update in thread pool
CRivUpdateThread::RUTUpdateThread: Got final packet,decoding all...
ROSC Send waiting for connect.
Thu Jun 14 12:37:19 2012 Warning: Failed to send on transport layer found in file CRivObjSockClient.cc at line 1280 - Client ncp_virtualdomain is not connected to service ncp_virtualdomain
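The symptom sequence in the trace above (disconnect from the Primary, a later fail back, then a send failure on the transport layer) can be checked for mechanically. The following is a minimal Python sketch, not an IBM-supplied tool: the function name find_stuck_resync and the three marker strings are assumptions taken only from the trace excerpts in this APAR, so treat a match as a hint to investigate rather than a definitive diagnosis.

```python
from typing import Iterable, Optional

# Marker substrings copied from the trace excerpts in this APAR (assumed to be
# stable enough to match on; other ITNM levels may word them differently).
DISCONNECT = "ROSCOnClientDisconnect"
FAILBACK = "CVirtDomFailOver::FailBack"
SEND_FAILED = "Failed to send on transport layer"


def find_stuck_resync(lines: Iterable[str]) -> Optional[int]:
    """Return the 1-based line number of the first send failure that follows
    a disconnect and a subsequent fail back, or None if that sequence is absent."""
    state = "connected"
    for number, line in enumerate(lines, start=1):
        if DISCONNECT in line:
            # Backup lost its connection to the Primary.
            state = "disconnected"
        elif FAILBACK in line and state == "disconnected":
            # Backup believes the Primary is healthy again.
            state = "failed_back"
        elif SEND_FAILED in line and state == "failed_back":
            # Backup still cannot send after fail back: the reported symptom.
            return number
    return None
```

To scan a real trace, open the Backup ncp_virtualdomain.<BACKUPDOMAIN>.trace file and pass the file object to find_stuck_resync; a non-None result points at the first line where the Backup reports the send failure after fail back.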
Local fix
Restart backup ncp_virtualdomain process.
Problem summary
****************************************************************
* USERS AFFECTED:                                              *
* All ITNM 3.9 Users                                           *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* BACKUP NCP_VIRTUALDOMAIN FAILS TO RESYNC WITH PRIMARY UPON  *
* LOSING CONNECTION WHILE RECEIVING AND PROCESSING TOPOLOGY    *
****************************************************************
* RECOMMENDATION:                                              *
* | fix pack | 3.9.0-ITNMIP-FP0003                             *
****************************************************************
Problem conclusion
The following issue has been fixed: in a large domain environment, if the ncp_virtualdomain process on the Primary dies or stops while it is transferring data to the Backup domain, the Backup ncp_virtualdomain process stays up but fails to resync and download the topology after re-establishing the connection with the Primary.
Temporary fix
Comments
APAR Information
APAR number
IV23136
Reported component name
NC/PRECISIONIP
Reported component ID
5724O52RC
Reported release
390
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt
Submitted date
2012-06-18
Closed date
2013-01-14
Last modified date
2013-01-14
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
NC/PRECISIONIP
Fixed component ID
5724O52RC
Applicable component levels
R390 PSN
UP
R390 PSY
UP
Document Information
Modified date:
14 January 2013