Fixes are available
IBM Tivoli Monitoring 6.2.3 Fix Pack 5 (6.2.3-TIV-ITM-FP0005)
IBM Tivoli Monitoring 6.2.3 Fix Pack 2 (6.2.3-TIV-ITM-FP0002)
IBM Tivoli Monitoring 6.2.3 Fix Pack 4 (6.2.3-TIV-ITM-FP0004)
Tivoli Log File Agent, Version 6.3.0 Fix Pack 01 (6.3.0-TIV-ITM_LFA-FP0001)
Tivoli Log File Agent, Version 6.3.0 Interim Fix 04 (6.3.0-TIV-ITM_LFA-IF0004)
Tivoli Log File Agent, Version 6.3.0 Fix Pack 02 (6.3.0-TIV-ITM_LFA-FP0002)
Tivoli Log File Agent, Version 6.3.0 Interim Fix 05 (6.3.0-TIV-ITM_LFA-IF0005)
APAR status
Closed as program error.
Error description
1) When the error code is 67 (i.e. E_IPC_BROKEN), the EIF sender should try to switch over, or at the very least resend the event.
2) The head pointer keeps being moved even though the event has not been sent.

I think both issues can be addressed if we add checks for E_IPC_BROKEN.

With several events written to the cache, it looks like this:

+4F97A3BD.00C1 maxsz: 65536
+4F97A3BD.00C1 head : 54
+4F97A3BD.00C1 tail : 746

For the first event I see:

(4F97A3BD.00F6-2:sockeif.c,814,"_imp_eipc_recv_data") <0x41EEBB50,0x0> recv on fd 5, sock_error 0xFFFFFFFF, error 67
(4F97A3BD.00F7-2:eipc.c,564,"get_peer_response_timed") peer response PEER_RESPONSE_UNKNOWN
+4F97A3BD.010B maxsz: 65536
+4F97A3BD.010B head : 227
+4F97A3BD.010B tail : 746

The head has moved up by the number of bytes in the first event, which was 173. Even though an error was returned, it seems to have been ignored and processing carried on.

The same happens for the second event:

(4F97A3BD.012B-2:sockeif.c,814,"_imp_eipc_recv_data") <0x41EEBB50,0x0> recv on fd 5, sock_error 0xFFFFFFFF, error 67
(4F97A3BD.012C-2:eipc.c,564,"get_peer_response_timed") peer response PEER_RESPONSE_UNKNOWN

I did not expect this, as in general the first event gets lost and the connection is made for the second to be sent. Again the error appears not to be caught, and the cache moves the head up by one event's worth:

+4F97A3BD.013A maxsz: 65536
+4F97A3BD.013A head : 400
+4F97A3BD.013A tail : 746

What I really didn't expect was this for the third event:

(4F97A3BD.015A-2:sockeif.c,814,"_imp_eipc_recv_data") <0x41EEBB50,0x0> recv on fd 5, sock_error 0xFFFFFFFF, error 67
(4F97A3BD.015B-2:eipc.c,564,"get_peer_response_timed") peer response PEER_RESPONSE_UNKNOWN

But this time it is rapidly followed by:

(4F97A3BD.015D-2:sockeif.c,338,"_imp_do_send") send 40 bytes
(4F97A3BD.015E-2:socket_imp.c,1741,"send_to") 174 bytes on send rc=-1
(4F97A3BD.015F-2:socket_imp.c,1639,"socket_put_event_conn") Connection Oriented send failed will wait 120 seconds before resend.

which I didn't see for the first two. It then does a countdown:

(4F97A3C4.0000-2:socket_imp.c,1658,"socket_put_event_conn") resend approximate time remaining: 110 seconds
(4F97A3CB.0000-2:socket_imp.c,1658,"socket_put_event_conn") resend approximate time remaining: 99 seconds
(4F97A3D2.0000-2:socket_imp.c,1658,"socket_put_event_conn") resend approximate time remaining: 89 seconds
(4F97A3D9.0000-2:socket_imp.c,1658,"socket_put_event_conn") resend approximate time remaining: 78 seconds
(4F97A3E0.0000-2:socket_imp.c,1658,"socket_put_event_conn") resend approximate time remaining: 68 seconds
(4F97A3E7.0000-2:socket_imp.c,1658,"socket_put_event_conn") resend approximate time remaining: 57 seconds
(4F97A3EE.0000-2:socket_imp.c,1658,"socket_put_event_conn") resend approximate time remaining: 47 seconds
(4F97A3F5.0000-2:socket_imp.c,1658,"socket_put_event_conn") resend approximate time remaining: 36 seconds
(4F97A3FC.0000-2:socket_imp.c,1658,"socket_put_event_conn") resend approximate time remaining: 26 seconds
(4F97A403.0000-2:socket_imp.c,1658,"socket_put_event_conn") resend approximate time remaining: 15 seconds
(4F97A40A.0000-2:socket_imp.c,1658,"socket_put_event_conn") resend approximate time remaining: 5 seconds

Then it looks like it closes the connection:

(4F97A40D.0004-2:sockeif.c,255,"_imp_eipc_shutdown") _imp_eipc_shutdown fd 5 option 2 rc=-1
(4F97A40D.0005-2:sockeif.c,259,"_imp_eipc_shutdown") _imp_eipc_shutdown shutdown - [sys errno 107] fd 5 option 2 rc=-1

(Surely rc=-1 means the connection failed to close?)

Then the connection is created:

(4F97A40D.001E-2:socket_imp.c,1920,"_create_eipc_client") Connected to [legacy_01] fujiobj <fujiobj.test.com@10.22.58.99>:9998 1

The third event gets sent:

+4F97A40D.0034 maxsz: 65536
+4F97A40D.0034 head : 573
+4F97A40D.0034 tail : 919

(More events were added.) The first AND second events both return the failure error 67, BUT only the third event does anything about it: it closes the connection and re-establishes a good one. It's purely my opinion (DS), but I think it should do for the first event what it did for the third.

This is a really good log. It seems to do the right thing for the third event: when the connection is detected as bad, it does not remove the current event from the cache, but makes the connection and tries again. For events one and two it seems to ignore that the connection was bad, moves the cache marker up, and effectively loses/deletes the event.

The customer's environment is ITM 6.2.2 FP03.

I am curious about the 120 second delay to re-establish a connection and unsure what the original intention might have been. Surely it should a) detect that the connection has gone when dealing with the first event, and b) remake the connection without a 120 second delay.

All files are on ecurep under the PMR. RHEL 5.5 64-bit is given as the customer environment in the PMR.
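The head movement described above can be checked with a short script. The following is only an illustrative sketch, not an IBM tool: the function name `head_deltas` and the regular expression are assumptions made for this example, and the sample text contains just the four cache dumps quoted in the description. It extracts each `head` value and reports how far the pointer moved between dumps.

```python
import re

# Matches lines such as "+4F97A3BD.00C1 head : 54" (pattern assumed for this sketch).
HEAD_RE = re.compile(r"head\s*:\s*(\d+)")


def head_deltas(trace):
    """Return the number of bytes the cache head advanced between successive dumps."""
    heads = [int(m.group(1)) for m in HEAD_RE.finditer(trace)]
    return [later - earlier for earlier, later in zip(heads, heads[1:])]


# The four cache dumps quoted in the error description.
TRACE = """
+4F97A3BD.00C1 maxsz: 65536
+4F97A3BD.00C1 head : 54
+4F97A3BD.00C1 tail : 746
+4F97A3BD.010B maxsz: 65536
+4F97A3BD.010B head : 227
+4F97A3BD.010B tail : 746
+4F97A3BD.013A maxsz: 65536
+4F97A3BD.013A head : 400
+4F97A3BD.013A tail : 746
+4F97A40D.0034 maxsz: 65536
+4F97A40D.0034 head : 573
+4F97A40D.0034 tail : 919
"""

print(head_deltas(TRACE))  # [173, 173, 173]
```

Each step is 173 bytes, the size of one event. The first two steps follow an error 67 with no resend, which is where the events are lost; only the third step follows a successful reconnect and send.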
Local fix
Problem summary
EIF: Error code 67 is not handled while sending events. If error code 67 (connection is broken) is seen while sending events to the Event Integration Facility (EIF) receiver, the EIF sender ignores it and keeps moving on to the next event, even though the events are not being received by the EIF receiver.
Problem conclusion
The code has been modified to check for this error code and take action accordingly. The action is either to try to connect to a failover EIF receiver, if one is configured, or to keep the event in the cache file and mark it as unsent. This ensures that events are not lost under this error condition. This fix is not applicable to 32-bit UNIX/Linux or Windows platforms.

The fix for this APAR is contained in the following maintenance packages:
 | fix pack | 6.2.3-TIV-ITM-FP0002
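The behaviour described in the problem conclusion can be pictured with a small model. This is a hedged sketch only and does not reflect the actual EIF source: apart from the error code 67 and the name E_IPC_BROKEN, which come from this APAR, every class, method and parameter name below is hypothetical. It shows a sender that tries each configured receiver in turn and removes an event from its cache only after a send succeeds.

```python
from collections import deque

E_IPC_BROKEN = 67  # error code named in the APAR; all other names are hypothetical


class BrokenConnection(Exception):
    """Raised by a (hypothetical) transport when the peer connection is broken."""

    def __init__(self, code=E_IPC_BROKEN):
        super().__init__(f"connection broken, error {code}")
        self.code = code


class CachedSender:
    """Toy sender: an event leaves the cache only after a confirmed send."""

    def __init__(self, receivers):
        self.receivers = list(receivers)  # primary first, then any failover receivers
        self.cache = deque()              # unsent events, oldest first

    def put(self, event):
        self.cache.append(event)

    def flush(self):
        """Send cached events in order; return how many were delivered."""
        delivered = 0
        while self.cache:
            event = self.cache[0]         # peek, do not remove yet
            if not self._send_with_failover(event):
                break                     # event stays cached as unsent
            self.cache.popleft()          # advance the head only on success
            delivered += 1
        return delivered

    def _send_with_failover(self, event):
        for send in self.receivers:
            try:
                send(event)
                return True
            except BrokenConnection as exc:
                if exc.code != E_IPC_BROKEN:
                    raise                 # other errors are out of scope here
                # broken connection: fall through to the next receiver
        return False
```

With a single unreachable receiver, `flush()` delivers nothing and every event stays in the cache. Adding a working failover receiver to the list lets the same events go out in order.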
Temporary fix
Comments
APAR Information
APAR number
IV21752
Reported component name
TEC GUI INTEGRA
Reported component ID
5724C04TG
Reported release
622
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt
Submitted date
2012-05-25
Closed date
2012-09-14
Last modified date
2012-10-08
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
TEC GUI INTEGRA
Fixed component ID
5724C04TG
Applicable component levels
R623 PSY
UP
Document Information
Modified date:
08 October 2012