IBM Support

IV23136: BACKUP NCP_VIRTUALDOMAIN FAILS TO RESYNC WITH PRIMARY UPON LOSING CONNECTION WHILE RECEIVING AND PROCESSING TOPOLOGY


APAR status

  • Closed as program error.

Error description

  • v3.9 FP1, platform independent.
    
    In a failover setup with a large domain, if the
    ncp_virtualdomain process on the Primary dies or stops while it
    is transferring data to the Backup domain, the Backup
    ncp_virtualdomain process stays up but fails to resync and
    download the topology after the connection with the Primary is
    re-established.
    
    Replication steps:
    This is reproducible ONLY with a large cache:
    
    1) Load a large Model cache on the Primary (set the virtualdomain
    retryCount to 0).
    2) Bring up the Backup as normal.
     - Wait until the following messages appear in the Backup
    ncp_virtualdomain.<BACKUPDOMAIN>.trace (i.e. it is actively
    downloading and processing data from the Primary):
    
    Adding chopped packets - Packet A len: 35415 Packet B len: 1024
    Adding chopped packets - Packet A len: 36439 Packet B len: 1024
    Adding chopped packets - Packet A len: 37463 Packet B len: 1024
    CVirtDomSockClient::RivOnClientIO
    CVirtDomProtocol::ProcessResponse
    {
            requestSubj='ITNM/MODEL/QUERY/NCP_PUP';
            responseSubj='ITNM/MODEL/UPDATE/NCP_PUF';
            responseData=<Opaque Data>;
    }
    Adding chopped packets - Packet A len: 69 Packet B len: 1024
    Adding chopped packets - Packet A len: 1093 Packet B len: 1024
    Adding chopped packets - Packet A len: 2117 Packet B len: 1024
    Adding chopped packets - Packet A len: 3141 Packet B len: 1024
    
    3) At this point, stop the virtual domain on the Primary.
    
    The following then appears on the Backup:
    Adding chopped packets - Packet A len: 1515 Packet B len: 1024
    Adding chopped packets - Packet A len: 2539 Packet B len: 464
    CRivSockEngine::OnIO: 0 length packet recived
    CVirtDomSockClient::ROSCOnClientDisconnect
    Thu Jun 14 12:30:59 2012  Info: A generic warning has occurred
    found in file CVirtDomSockClient.cc at line 315 -
    CVirtDomSockClient::ROSCOnClientDisconnect: Connection to
    primary domain lost
    raising alert:
    {
            EventName='ItnmFailoverConnection',
            Severity=3,
            EntityName='NCP_PUF',
            Description='Connection to Primary domain NCP_PUP lost',
            ExtraInfo={
    ..........deleted few lines......
    CRivSockEnd::RSESetEndPointAddress(10.10.10.10, 46729)
    Error on socket 16.  Connect Failed.  Reason: Connection refused
    Error No: 111
    Thu Jun 14 12:30:59 2012  Info: Attempt to connect to server
    failed found in file CRivObjSockClient.cc at line 1080
    Could not connect to a server on domain NCP_PUV - retrying!
    [CNotificationBuffer::FlushNotifications] sending 1
    notifications
    .
    .
    46384 Could not connect to a server on domain NCP_PUV -
    retrying!
      46385 Receiving query: update state.domains set
    m_HealthStatus=1 where m_Domain='NCP_PUF';
      46386 Receiving query: update state.domains set
    m_HealthStatus=0 where m_Domain='NCP_PUP';
      46387 CVirtDomFailOver::FailOver: Initiating fail over
      46388 raising alert:
    .
    .
    4) Restart the Primary virtual domain (just reinsert it into Ctrl).
    
    The following then appears in the Backup trace (the Backup
    thinks the Primary is healthy):
    leaving CRivSockEngine::RSEAddClientIO().
    leaving CRivObjSockClient::ROSCConnectIO().
    Receiving query: update state.domains set
    m_HealthStatus=1 where m_Domain='NCP_PUP';
    CVirtDomFailOver::FailBack: Initiating fail back
    raising alert:
    {
            EventName='ItnmFailover',
    .
    .
    A couple of minutes later, the problem appears on the Backup:
    .
    .
    CRivSnoop::OnData: Snooped a packet onITNM/CTRL/STATE/NCP_PUF
    CRivUpdateThread::RUTUpdateThread: Processing update in thread
    pool
    CRivUpdateThread::RUTUpdateThread: Got final packet,decoding
    all...
    ROSC Send waiting for connect.
    Thu Jun 14 12:37:19 2012  Warning: Failed to send on
    transport layer found in file CRivObjSockClient.cc at line 1280
    - Client ncp_virtualdomain is not connected to service
    ncp_virtualdomain
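
The symptom described above (a fail back followed by a send failure on the Backup) can be checked for by scanning the Backup trace. The following is a minimal sketch, not an IBM tool: it assumes the trace is plain text and that the message strings match the excerpts quoted in this APAR. The function names `count_chopped_packets` and `find_resync_failure` are hypothetical and are not part of ITNM.

```python
import re

# Message fragments copied from the trace excerpts in this APAR.
# Assumption: the real trace file uses the same wording.
FAILBACK = "CVirtDomFailOver::FailBack: Initiating fail back"
SEND_FAILED = "Failed to send on"
CHOPPED = re.compile(
    r"Adding chopped packets - Packet A len: (\d+) Packet B len: (\d+)"
)


def count_chopped_packets(lines):
    """Count 'Adding chopped packets' lines (evidence of an active transfer)."""
    return sum(1 for line in lines if CHOPPED.search(line))


def find_resync_failure(lines):
    """Return the index of the first send failure logged after a fail back.

    Returns None if no fail back was seen, or if no send failure follows it.
    """
    failback_seen = False
    for index, line in enumerate(lines):
        if FAILBACK in line:
            failback_seen = True
        elif failback_seen and SEND_FAILED in line:
            return index
    return None
```

To use it, read the Backup trace into a list of lines (for example with `open(path, errors="replace").read().splitlines()`) and pass that list to both functions. A non-None result from `find_resync_failure` only indicates that the log pattern from this APAR is present; it does not by itself prove the Backup failed to download the topology.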
    

Local fix

  • Restart backup ncp_virtualdomain process.
    

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * All ITNM 3.9 Users                                           *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * BACKUP NCP_VIRTUALDOMAIN FAILS TO RESYNC WITH PRIMARY UPON   *
    * LOSING CONNECTION WHILE RECEIVING AND PROCESSING TOPOLOGY    *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Fix pack 3.9.0-ITNMIP-FP0003                                 *
    ****************************************************************
    

Problem conclusion

  • Fixed: in a large domain environment, if the ncp_virtualdomain
    process on the Primary died or stopped while it was transferring
    data to the Backup domain, the Backup ncp_virtualdomain process
    stayed up but failed to resync and download the topology after
    re-establishing the connection with the Primary.
    

Temporary fix

Comments

APAR Information

  • APAR number

    IV23136

  • Reported component name

    NC/PRECISIONIP

  • Reported component ID

    5724O52RC

  • Reported release

    390

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt

  • Submitted date

    2012-06-18

  • Closed date

    2013-01-14

  • Last modified date

    2013-01-14

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    NC/PRECISIONIP

  • Fixed component ID

    5724O52RC

Applicable component levels

  • R390 PSN

       UP

  • R390 PSY

       UP


Document Information

Modified date:
14 January 2013