IBM Support

LO57703: SERVER BECOMES UNRESPONSIVE WITH THOUSANDS OF ESTABLISHED AND MANY CLOSE_WAIT APPEARING BOGUS CONNECTIONS MAKEPWENTRY THREAD

APAR status

  • Closed as unreproducible.

Error description

  • Server becomes unresponsive with thousands of ESTABLISHED and
    many CLOSE_WAIT connections that appear to be bogus. A worker
    thread is seen in MakePWEntry:
    
    ############################################################
    ### thread 12/97: [ nSERVER:  1198:  1dcc]
    ### FP=0x0a90f008, PC=0x7c82860c, SP=0x0a90efa0
    ### stkbase=0x0a910000, total stksize=262144, used stksize=4192
    ############################################################
     [ 1] 0x7c82860c ntdll.KiFastSystemCallRet+0
    (3e8,0,a90f028,600997ca)
     [ 2] 0x77e424fd kernel32.Sleep+15
    (3e8,38ae76d8,a90f370,600807e4)
    @[ 3] 0x600997ca nnotes.OSDelayThread@4+42 (3e8)
    @[ 4] 0x600807e4 nnotes.NIFUpdateCollectionNext@8+1732
    (38ae8208,37e1c9f8)
    @[ 5] 0x60047572 nnotes.NIFUpdateCollection@4+466 (a90112f)
    @[ 6] 0x60ad6492 nnotes.NIFGetCollectionUpdated@12+402
    (38ae76d8,0,a90f570)
    @[ 7] 0x60ad7c56 nnotes.NIFOpenCollectionExtended4@60+3414
    (1136,1136,2d2,20,0,a90f5b0,f10f10,ffffffff,0,0,0,0,0,0,0)
    @[ 8] 0x60059702 nnotes.NIFOpenCollectionExtended3@56+66
    (5c8,5c8,2d2,20,0,a90f5f4,f10f10,ffffffff,0,0,0,0,0,0)
    @[ 9] 0x600596bc nnotes.NIFOpenCollectionExtended2@48+60
    (5c8,5c8,2d2,20,0,a90f634,f10f10,ffffffff,0,0,0,0)
    @[10] 0x600653a4 nnotes.NIFOpenCollection@40+52
    (5c8,5c8,2d2,20,0,a90f66c,f10f10,ffffffff,0,0)
    @[11] 0x6035a966 nnotes.AdminpFindProxyDbEntry@28+102
    (5c8,60fa63ec,a90f808,a90f78c,0,a90f760,f10f10)
    @[12] 0x603e879a nnotes.FindProxyEntry@40+410
    (5c8,60fa63ec,a90fa74,60fa66d8,0,a90f874,f10f10,ffffffff,0,0)
    @[13] 0x603e934f nnotes.MakePWEntry@32+63
    (60fa63ec,a90fa5c,a90fa20,2be52f26,0,a90f9d8,f10f10,ffffffff)
    @[14] 0x603eb9a7 nnotes.SECMakeProxyEntry@40+423
    (8,0,0,a90fa74,0,a90f9f4,f10f10,ffffffff,0,0)
    @[15] 0x60b897e7 nnotes.MakeNewPWEntry@4+535 (a90fb80)
    @[16] 0x60b89a5a nnotes.Parse_PWNewHashSig@4+26 (a90fb80)
    @[17] 0x60b75a2e nnotes.AuthServerDialog@12+4350
    (a90fb80,1,38f60000)
    @[18] 0x600ca176 nnotes.AuthStateMachine@4+342 (a90fb80)
    @[19] 0x60b52057 nnotes.AUTHProcessNetbfr@16+199
    (da1c0010,10025de0,a90fe58,a90fc10)
    @[20] 0x100217ca nserverl.DbServer@8+1226 (968401ab,22d8001a)
    @[21] 0x100371f5 nserverl.WorkThreadTask@8+1621 (6035884,0)
    @[22] 0x10001a2e nserverl.Scheduler@4+750 (0)
    @[23] 0x6010cd0f nnotes.ThreadWrapper@4+175 (0)
     [24] 0x77e6482f kernel32.GetModuleHandleA+223 (0,0,0,0)
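
When an NSD contains many worker threads, it can help to check mechanically which of them are sitting in a given function. The following is a minimal sketch, assuming the frame layout shown in the trace above (`[nn] 0xADDRESS module.Function...`). The helper names and the shortened sample trace are illustrative only and are not part of NSD or Domino.

```python
import re

# Matches the "module.Function" token that follows a frame address, e.g.
# "@[13] 0x603e934f nnotes.MakePWEntry@32+63". Argument lines such as
# "(3e8,0,a90f028,600997ca)" and the "### thread" header do not match.
FRAME_RE = re.compile(r"\[\s*\d+\]\s+0x[0-9a-fA-F]+\s+(\w+)\.(\w+)")


def frame_functions(stack_text):
    """Return (module, function) pairs in the order they appear in the trace."""
    return FRAME_RE.findall(stack_text)


def thread_is_in(stack_text, function_name):
    """True if any frame of the trace is inside the exactly named function."""
    return any(func == function_name for _, func in frame_functions(stack_text))


# Shortened excerpt of the trace above, used only to exercise the helpers.
SAMPLE_TRACE = """\
 [ 1] 0x7c82860c ntdll.KiFastSystemCallRet+0
(3e8,0,a90f028,600997ca)
@[ 3] 0x600997ca nnotes.OSDelayThread@4+42 (3e8)
@[13] 0x603e934f nnotes.MakePWEntry@32+63
@[15] 0x60b897e7 nnotes.MakeNewPWEntry@4+535 (a90fb80)
"""

if __name__ == "__main__":
    print(thread_is_in(SAMPLE_TRACE, "MakePWEntry"))
```

Running the same check over every thread section of an NSD would show how many worker threads are in MakePWEntry at the same time.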
    
    - Clients connect to the server and open sessions normally for
    about 30 minutes.
    
    - After that, the number of connections from a particular user
    keeps rising. The connections stay in the ESTABLISHED state, and
    some of them move to CLOSE_WAIT, so that count grows as well.
    
    - The client side looks different: the client has only 1 or 2
    open sessions to the server, each, of course, on a different
    source port.
    
    - Domino (or the network driver) seems unable to close the
    connections. An administrator cannot close the open connections
    with tools either.
    
    - The network connection is fine: the network cards were swapped
    and another switch (Cisco) was used, no switchport security is
    enabled, all settings are set automatically to a 1 Gb
    connection, the network card self-test passes, and ping and
    trace run fine.
    
    When the system fails to respond to clients:
    - trace does not connect to the server
    - telnet to the server on port 1352 opens and stays stable
    - the number of sessions originating from the same client (on the
    same source port) increases (so Domino is not closing
    connections, but trying to establish an old connection that
    was previously good)
    - on the client side no new connections are established
    - Domino leaves many connections in the CLOSE_WAIT state, and
    some of them in FIN_WAIT_2, which is very bad
    - connections cannot be closed manually
    - disconnecting the network card slowly closes the connections,
    and after the network card is re-enabled Domino responds again
    - the same happens if the Domino service is restarted
    - the Server_Session_Timeout=10 parameter does not close idle
    connections, which means that those connections are not
    considered idle. This is strange, because no bytes are
    transferred IN/OUT on them
    - when the client initiates the connection to the server, it
    shows only one socket pair, which is normal, and the exchange
    of bytes takes place on it
    - besides that active connection, the number of connections
    starts to multiply, with no activity on the newly created
    sockets (and the original one stops exchanging bytes as well)
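
The growth in ESTABLISHED and CLOSE_WAIT connections described above can be tracked by counting connections per remote address and state. The following is a minimal sketch, assuming the usual Windows `netstat -an` column layout (Proto, Local Address, Foreign Address, State) and the Notes port 1352 mentioned above. The function name and the IP addresses in the sample are hypothetical.

```python
from collections import Counter


def count_states(netstat_text, local_port=1352):
    """Count TCP connections to local_port, keyed by (remote_ip, state)."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected row: Proto, Local Address, Foreign Address, State
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue
        local, remote, state = fields[1], fields[2], fields[3]
        # Keep only rows whose local port is the one of interest.
        if local.rsplit(":", 1)[-1] != str(local_port):
            continue
        remote_ip = remote.rsplit(":", 1)[0]
        counts[(remote_ip, state)] += 1
    return counts


# Hypothetical sample output, not taken from the reported system.
SAMPLE = """
  TCP    10.0.0.5:1352    10.0.0.77:2301    ESTABLISHED
  TCP    10.0.0.5:1352    10.0.0.77:2302    ESTABLISHED
  TCP    10.0.0.5:1352    10.0.0.77:2303    CLOSE_WAIT
  TCP    10.0.0.5:445     10.0.0.88:5000    ESTABLISHED
"""

if __name__ == "__main__":
    for (ip, state), n in sorted(count_states(SAMPLE).items()):
        print(ip, state, n)
```

Comparing these counts with the 1 or 2 sessions the client itself reports makes the mismatch between server-side and client-side connections visible.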
    
    Bogus sessions seen in NSD:
    
    <@@ ------ Notes Data -> Server Data -> Server Task Vars (Time 13:45:42) ------ @@>
    
    Indx       TaskId       VarBlock      SessionID    Ver Proto|ST TrId Fnc TS|#Dbs #DocR #DocW|Trans    NetW |Session Duration|UserName
    ----       ------       --------      ---------    --- -----|-- ----  -- --|---- ----- -----|----- ------|----------------|--------
       1  [2515: 38458]   [175: 46786]  [2437:  1706]    0  0: 0| 2 0 142  0|   0     0     0|    0        0| 474558h:40m:38s  |
       2  [2514: 38458]   [175: 42154]  [2438:  1706]    0  0: 0| 2 0 142  0|   0     0     0|    0        0| 474558h:40m:38s  |
       3  [2516: 38458]   [175: 37522]  [2440:  1706]    0  0: 0| 2 0 142  0|   0     0     0|    0        0| 474558h:40m:38s  |
       4  [2517: 38458]   [175: 32890]  [2439:  1706]    0  0: 0| 2 0 142  0|   0     0     0|    0        0| 474558h:40m:38s  |
       5  [2518: 38458]   [175: 28258]  [2441:  1706]    0  0: 0| 2 0 142  0|   0     0     0|    0        0| 474558h:40m:38s  |
       6  [2520: 38458]   [175: 23626]  [2442:  1706]    0  0: 0| 2 0 142  0|   0     0     0|    0        0| 474558h:40m:38s  |
       7  [2519: 38458]   [175: 18994]  [2443:  1706]    0  0: 0| 2 0 142  0|   0     0     0|    0        0| 474558h:40m:38s  |
       8  [2521: 38458]   [175: 14362]  [2445:  1706]    0  0: 0| 2 0 142  0|   0     0     0|    0        0| 474558h:40m:38s  |
       9  [2522: 38458]   [175:  9730]  [2444:  1706]    0  0: 0| 2 0 142  0|   0     0     0|    0        0| 474558h:40m:38s  |
      10  [2523: 38458]   [175:  5098]  [2446:  1706]    0  0: 0| 2 0 142  0|   0     0     0|    0        0| 474558h:40m:38s  |
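
The rows above share two features: an implausibly long Session Duration and an empty UserName column. The following is a minimal sketch for flagging such rows, assuming each table row is on a single line with `|` column separators and a duration in `NNNh:NNm:NNs` form. The one-year threshold and the helper names are arbitrary illustrative choices, not values from NSD or Domino.

```python
import re

# Session Duration as printed in the Server Task Vars table, e.g. 474558h:40m:38s
DURATION_RE = re.compile(r"(\d+)h:(\d+)m:(\d+)s")


def duration_hours(row):
    """Return the Session Duration of a table row in hours, or None if absent."""
    match = DURATION_RE.search(row)
    if not match:
        return None
    hours, minutes, seconds = map(int, match.groups())
    return hours + minutes / 60 + seconds / 3600


def looks_bogus(row, max_hours=24 * 365):
    """Heuristic: duration above max_hours and nothing in the UserName column."""
    hours = duration_hours(row)
    if hours is None:
        return False
    # UserName is the text after the last "|" separator.
    user = row.rsplit("|", 1)[-1].strip()
    return hours > max_hours and user == ""


# First data row of the table above, joined onto one line.
BOGUS_ROW = (
    "   1  [2515: 38458]   [175: 46786]  [2437:  1706]    0  0: 0| 2 0 142  0|"
    "   0     0     0|    0        0| 474558h:40m:38s  |"
)

if __name__ == "__main__":
    print(looks_bogus(BOGUS_ROW))
```

A duration of 474558 hours is more than 54 years, so any sensible threshold separates these entries from real sessions.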
    
    Data review also suggests that this could be caused by an
    issue similar to SPR JRED7SNU25:
    ID Vault: Multiple "Change HTTP Password in Domino Directory"
    requests in the admin4 database.
    and FLII8BYBFY:  Server hang because all server working threads
    are doing MakeHttpPWChange (however the thread is not httpPW
    
    The workaround here is to disable the "Update Internet Password
    when Notes Client Password Changes" option in the security
    settings document, but this is not suitable.
    

Local fix

  • As per doc 1385788:
    WSKDMN_DEBUG_DONTLINGER=1 can be used to work around the issue.
    The Domino server must be restarted for the change to take
    effect. The setting disables the TCP SO_LINGER option so that
    bogus sessions do not cause the Domino server to hang. No side
    effects, such as a performance impact, have been observed at
    customer sites where it was implemented (confirmed to solve the
    problem).
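
Before restarting the server, it may be worth confirming that the setting is actually present. The following is a minimal sketch, assuming the parameter is stored as a `NAME=value` line in the server's notes.ini (the document above does not name the file, so this is an assumption). The helper name and the sample file content are hypothetical.

```python
def ini_setting(ini_text, name):
    """Return the value of the first NAME=value line (case-insensitive name)."""
    for line in ini_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == name.lower():
            return value.strip()
    return None


# Hypothetical notes.ini excerpt, used only to exercise the helper.
SAMPLE_INI = """\
[Notes]
Server_Session_Timeout=10
WSKDMN_DEBUG_DONTLINGER=1
"""

if __name__ == "__main__":
    print(ini_setting(SAMPLE_INI, "WSKDMN_DEBUG_DONTLINGER"))
```

In practice the text would be read from the server's own notes.ini. A result of `None` means the line is missing.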
    
    Workaround #2:
    
    Rebuild admin4.nsf:
    1. Issue "Tell AdminP quit" on the Domino server console.
    2. Issue the following command on the Domino server console:
    "dbcache flush Admin4.NSF" (this may need to be done multiple
    times in order to release the server's handle on the database).
    3. Then quickly rename the current Admin4.NSF to Admin4.OLD.
    4. Replicate a new copy of Admin4.NSF from a server known to
    have a good copy.
    5. Lastly, issue the command "load AdminP" on the Domino server
    console.
    

Problem summary

  • This APAR is closed as FIN. We have deferred the fix to a
     future release.
    

Problem conclusion

Temporary fix

Comments

  • This APAR is associated with SPR# JFRA8D7EZB.
    The problem will be fixed in the next release of the product.
    

APAR Information

  • APAR number

    LO57703

  • Reported component name

    DOMINO SERVER

  • Reported component ID

    5724E6200

  • Reported release

    851

  • Status

    CLOSED UR5

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt

  • Submitted date

    2011-01-17

  • Closed date

    2012-09-13

  • Last modified date

    2012-09-13

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

Applicable component levels

  • R851 PSN

       UP

Document Information

Modified date:
13 September 2012