APAR status
Closed as program error.
Error description
Environment: ITM 6.2 FP1, Solaris OS Agent 6.2 FP1
Do you think latest patch applied was involved? No.
Problem Description: When running the Solaris OS Agent, the forked child process spawned by the Agent occasionally deadlocks waiting for a mutex to become available. The OS Agent continues to spawn additional children, all of which wait on the same mutex.
Detailed Recreation Procedure: Start the Solaris OS Agent, running all factory-provided Situations. Wait for the condition to arise, potentially for days.
Related Files and Output: Log files are available on ECuRep under this PMR.
Approver: RL
Local fix
Problem summary
When running the Solaris OS Agent and executing situations, multiple copies of an OS Agent process sometimes exist at the same time in a suspended wait state. The Solaris OS Agent spawns external programs to collect its metrics, such as ifconfig, stat_daemon, vm_stat, proc_stat, and nfs_stat. To do this, the OS Agent creates a copy of its process using the Unix fork() system call and directs the child process to execute the external program. Because forking a process copies all of the parent's memory, excluding mutexes, the state of the memory location of a mutex in the child is indeterminate. As a result, a child process can wait for the mutex to clear when it is not truly set, purely because of the value in that memory location. When the parent process returns and forks itself again to spawn another external program, the new child also waits indefinitely for the mutex to clear, which never happens. The result is many OS Agent "children" forked by the parent OS Agent process, all waiting for the mutex to clear so that they can spawn their external programs.
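The hazard described above can be reproduced in miniature. The following is a minimal sketch, not the OS Agent's code: it assumes one common way a forked child ends up with a "stuck" mutex under POSIX threads, namely that another thread in the parent holds the mutex at the moment of fork(). The names list_lock, holder, and child_sees_locked_mutex are hypothetical. The child uses pthread_mutex_trylock() so that the demonstration reports the condition instead of hanging; strictly, POSIX only guarantees async-signal-safe calls in the child of a multithreaded process, so this is for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the mutex that guards the shared list. */
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Handshake state so the parent forks only while list_lock is held. */
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t state_cv = PTHREAD_COND_INITIALIZER;
static int holder_has_lock;
static int holder_may_release;

/* Helper thread: takes list_lock and keeps it until told to let go. */
static void *holder(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&list_lock);
    pthread_mutex_lock(&state_lock);
    holder_has_lock = 1;
    pthread_cond_broadcast(&state_cv);
    while (!holder_may_release)
        pthread_cond_wait(&state_cv, &state_lock);
    pthread_mutex_unlock(&state_lock);
    pthread_mutex_unlock(&list_lock);
    return NULL;
}

/* Returns 1 if the forked child found list_lock locked, 0 if it found it
 * free, and -1 on error. */
int child_sees_locked_mutex(void)
{
    pthread_t tid;
    int status = 0;
    int result = -1;

    holder_has_lock = 0;
    holder_may_release = 0;
    if (pthread_create(&tid, NULL, holder, NULL) != 0)
        return -1;

    /* Wait until the helper thread really owns list_lock. */
    pthread_mutex_lock(&state_lock);
    while (!holder_has_lock)
        pthread_cond_wait(&state_cv, &state_lock);
    pthread_mutex_unlock(&state_lock);

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: only the forking thread exists here. The copied mutex is
         * still marked locked, and no thread remains to unlock it, so a
         * blocking pthread_mutex_lock() would wait forever. */
        int rc = pthread_mutex_trylock(&list_lock);
        _exit(rc == EBUSY ? 1 : 0);
    }
    if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = WEXITSTATUS(status);

    /* Let the helper thread release list_lock and finish. */
    pthread_mutex_lock(&state_lock);
    holder_may_release = 1;
    pthread_cond_broadcast(&state_cv);
    pthread_mutex_unlock(&state_lock);
    int rc = pthread_join(tid, NULL);
    assert(rc == 0);
    (void)rc;
    return result;
}
```

Build with thread support enabled (for example, the -pthread option on common compilers). If the child called pthread_mutex_lock() instead of pthread_mutex_trylock(), it would block indefinitely, which matches the symptom of accumulating child processes.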
Problem conclusion
When the OS Agent is forked to create a child OS Agent process, all of the parent process's memory is duplicated except for the state of mutexes. In the OS Agent child process, the KBB_RAS1= environment variable is set to a new trace file name. This redirects the RAS1 tracing output of the spawned program to the new trace file rather than to the original trace file reserved for the parent OS Agent. When assigning this environment variable, the NLS2_toUTF16 and NLS2_fromUTF16 functions are used to construct the string that is assigned to it. These functions use a mutex to guard a linked list that is used in validating the NLS2_Locale object. It is this mutex that can, under rare circumstances, be in an indeterminate state when the child OS Agent is forked. If the mutex is unlocked in the parent OS Agent process, and the child is forked with only a 50% chance of the mutex being in the correct state, a child whose copy of the mutex happens to be set to the locked state waits forever when it attempts to acquire the lock, because the lock is never freed.
The fix for this APAR is contained in the following maintenance packages:
| fix pack | 6.2.0-TIV-ITM-FP0003
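This record does not describe how the fix pack resolves the problem. As a general illustration only, one widely used way to avoid this class of deadlock is to do all lock-taking work (such as building the environment string) in the parent before fork(), so that the child does nothing but call execve(). The sketch below assumes that approach; the function name spawn_with_trace_env, the 512-byte buffer, and the trace file value are hypothetical, and only the KBB_RAS1 variable name comes from the description above.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs argv[0] with an environment containing only KBB_RAS1=<trace_value>.
 * Returns the program's exit status, or -1 on error. */
int spawn_with_trace_env(const char *trace_value, char *const argv[])
{
    char env_entry[512];
    int status = 0;

    assert(trace_value != NULL);
    assert(argv != NULL && argv[0] != NULL);

    /* Build the environment entry BEFORE forking, while every thread that
     * might own a lock still exists and can release it. */
    int n = snprintf(env_entry, sizeof env_entry, "KBB_RAS1=%s", trace_value);
    if (n < 0 || (size_t)n >= sizeof env_entry)
        return -1;
    char *const envp[] = { env_entry, NULL };

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: no string conversion and no locking here, only
         * async-signal-safe calls. */
        execve(argv[0], argv, envp);
        _exit(127); /* reached only if execve() failed */
    }

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A caller would pass the path of the external program as argv[0] and the desired trace setting as trace_value. Because the child's environment here contains only the one variable, a real caller that needs the parent's other variables would have to copy them into envp as well.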
Temporary fix
Comments
APAR Information
APAR number
IZ42267
Reported component name
TEMS
Reported component ID
5724C04MS
Reported release
620
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt
Submitted date
2009-01-21
Closed date
2009-04-22
Last modified date
2009-04-22
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
TEMS
Fixed component ID
5724C04MS
Applicable component levels
R620 PSY
UP
Document Information
Modified date:
22 April 2009