IBM Support

Server performance degrades significantly during client backups

Troubleshooting


Problem

During client backups, the performance of the IBM Spectrum Protect server can degrade due to lock contention between the various threads attempting to allocate memory.

Symptom

Client backup sessions take an excessive amount of time to complete. In addition, CPU utilization is consistently at or near 100% during the client backup window, with programs running in kernel/system mode accounting for the largest share of CPU usage. The dsmserv process is identified as the largest consumer of CPU.

Cause

This behavior may be seen if a large number of threads on the IBM Spectrum Protect server are attempting to allocate memory concurrently.

Diagnosing The Problem

To verify the source of the problem it is necessary to capture performance data on the IBM Spectrum Protect server at a time when the problem condition is occurring. This performance data can be captured via the servermonV6.pl monitoring script, which can be obtained from the following URL:


In reviewing the data gathered by the servermonV6.pl script, the output from the QUERY SESSION command will likely show the client backup sessions in a 'Run' state for an extended period of time, as seen in the following example:

      Sess Comm.  Sess  Wait Bytes Bytes Sess Platform Client Name
    Number Method State Time Sent  Recvd Type
    ------ ------ ----- ---- ----- ----- ---- -------- -----------
    38,551 Tcp/Ip Run    0 S  8.0M 10.9K Node    WinNT NODE_A

The output from the same QUERY SESSION command more than 20 minutes later shows that the session is still in a 'Run' state, yet no data has been backed up to the server (Bytes Recvd is unchanged) and only about 40 MB of data has been sent to the client during this time:

      Sess Comm.  Sess  Wait Bytes Bytes Sess Platform Client Name
    Number Method State Time Sent  Recvd Type
    ------ ------ ----- ---- ----- ----- ---- -------- -----------
    38,551 Tcp/Ip Run    0 S 47.7M 10.9K Node    WinNT NODE_A
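
The comparison between the two samples can be sketched as follows. This is a minimal illustration, not part of servermonV6.pl: the helper names are hypothetical, and it assumes that the K/M/G suffixes in the QUERY SESSION output are 1024-based.

```python
# Sketch: compare two QUERY SESSION samples for the same session.
# Assumption: size suffixes are powers of 1024.
UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(value):
    """Convert a size such as '47.7M' or '10.9K' to a number of bytes."""
    value = value.strip().replace(",", "")
    suffix = value[-1].upper() if value[-1].isalpha() else ""
    number = value[:-1] if suffix else value
    return float(number) * UNITS[suffix]


def session_progress(first, second):
    """Return (sent_delta, recvd_delta) in bytes between two samples.

    Each sample is a (bytes_sent, bytes_recvd) pair of strings copied
    from the 'Bytes Sent' and 'Bytes Recvd' columns.
    """
    sent = to_bytes(second[0]) - to_bytes(first[0])
    recvd = to_bytes(second[1]) - to_bytes(first[1])
    return sent, recvd


# Values taken from the two samples of session 38,551 shown above.
sent, recvd = session_progress(("8.0M", "10.9K"), ("47.7M", "10.9K"))
print(f"sent to client: {sent / 1024**2:.1f} MB")
print(f"received from client: {recvd:.0f} bytes")
```

A received delta of zero over a long interval, combined with a 'Run' state, is the pattern described in this section.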

The corresponding instrumentation trace data shows that this session is spending the vast majority of its time in an 'Unknown' state:

    Thread 69581 psSessionThread parent=113 18:20:10.787-->18:41:23.607
    Operation        Count   Tottime  Avgtime  Maxtime  InstTput  Total KB
    ----------------------------------------------------------------------
    Network Send      1205     0.103    0.000    0.000  373137.2     38560
    DB2 MFtch Prep   77392    20.144    0.000    0.007
    DB2 Fetch Exec   39536    10.146    0.000    0.027
    DB2 MFtch Exec   77672    42.991    0.001    0.094
    DB2 Fetch        39536     0.440    0.000    0.002
    DB2 MFetch      313858    38.087    0.000    3.514
    DB2 Reg Exec     38696    28.060    0.001    3.633
    DB2 Reg Fetch    77390     0.441    0.000    0.000
    Unknown                 1132.402
    ----------------------------------------------------------------------
    Total                   1272.819                        30.3     38560
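
To put a number on "the vast majority", the Tottime values can be converted to shares of the thread's total time. The sketch below hard-codes the values from the trace above; the function name is hypothetical and the script is only an illustration of the arithmetic.

```python
# Sketch: share of thread time per operation, using the Tottime values
# (in seconds) from the instrumentation trace for thread 69581.
tottime = {
    "Network Send": 0.103,
    "DB2 MFtch Prep": 20.144,
    "DB2 Fetch Exec": 10.146,
    "DB2 MFtch Exec": 42.991,
    "DB2 Fetch": 0.440,
    "DB2 MFetch": 38.087,
    "DB2 Reg Exec": 28.060,
    "DB2 Reg Fetch": 0.441,
    "Unknown": 1132.402,
}


def time_shares(rows):
    """Return each operation's fraction of the summed time."""
    total = sum(rows.values())
    return {operation: seconds / total for operation, seconds in rows.items()}


shares = time_shares(tottime)
# Print operations from largest to smallest share.
for operation, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{operation:<16}{share:6.1%}")
```

With these values, 'Unknown' accounts for roughly 89% of the interval, while all DB2 and network activity combined is about 11%.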

The output from the SHOW THREADS command shows us that the thread associated with this client backup session is waiting to allocate memory:

    Thread 69581, Parent 113: psSessionThread, Storage 8129911,
    AllocCnt 1079668 HighWaterAmt 8130678
     tid=73cd, ptid=2671, det=1, zomb=0, join=0, result=0, sess=38551
      Stack trace:
      0x090000000049f738 _global_lock_common
      0x0900000000099ff4 malloc_y
      0x0900000000010bcc malloc_common@AF102_86 
      0x0000000100003c04 pkAllocTracked
      0x000000010073d70c ImGetAllGroupMemberships
      0x0000000100865300 SetBackupQueryResponse
      0x000000010086945c imGetNextGroupMember
      0x0000000100699494 DoBackQryGroups
      0x0000000100685f3c SmNodeSession
      0x0000000100527d64 SmSchedSession
      0x000000010053d1f8 HandleNodeSession
      0x0000000100534f74 DoNodeSched
      0x000000010052fbf0 smExecuteSession
      0x0000000100177d5c psSessionThread
      0x000000010000c264 StartThread

The stack trace above shows that the ImGetAllGroupMemberships() function has been called to obtain information about all of the groups with which a particular object is associated. This function has issued a memory allocation request to the AIX host (malloc_y), and the request is waiting to acquire the required lock. Lock contention can occur if a large number of threads attempt to allocate or release (free) memory at the same time. For example, this behavior may be seen when a large number of Windows system state backups are performed concurrently.
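
When the SHOW THREADS output is large, it can help to count how many threads are stopped at the same lock. The following is a rough sketch only: it assumes the output layout shown above (a "Thread N, Parent M:" header followed by a "Stack trace:" line with the innermost frame listed first), the function name is hypothetical, and the second thread in the sample text is made up for illustration.

```python
import re

# Innermost frame observed in the stack trace for thread 69581.
HEAP_LOCK_FRAME = "_global_lock_common"


def threads_blocked_on_heap_lock(show_threads_text):
    """Return the IDs of threads whose innermost frame is the heap lock."""
    blocked = []
    thread_id = None
    want_top_frame = False
    for line in show_threads_text.splitlines():
        header = re.match(r"\s*Thread (\d+), Parent \d+:", line)
        if header:
            thread_id = header.group(1)
            want_top_frame = False
        elif line.strip() == "Stack trace:":
            want_top_frame = True
        elif want_top_frame and line.strip():
            # The first frame after 'Stack trace:' is the innermost one.
            symbol = line.split()[-1]
            if thread_id is not None and symbol == HEAP_LOCK_FRAME:
                blocked.append(thread_id)
            want_top_frame = False
    return blocked


# First thread: excerpt from the output above.
# Second thread: hypothetical, included only to show a non-match.
SAMPLE = """\
Thread 69581, Parent 113: psSessionThread, Storage 8129911,
AllocCnt 1079668 HighWaterAmt 8130678
 tid=73cd, ptid=2671, det=1, zomb=0, join=0, result=0, sess=38551
  Stack trace:
  0x090000000049f738 _global_lock_common
  0x0900000000099ff4 malloc_y
Thread 70001, Parent 113: psSessionThread, Storage 1024,
  Stack trace:
  0x0000000000000000 hypotheticalOtherFrame
"""

blocked = threads_blocked_on_heap_lock(SAMPLE)
print(f"{len(blocked)} thread(s) waiting on the heap lock: {blocked}")
```

Many session threads reported at this frame in the same capture would be consistent with the contention described here.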

Resolving The Problem

The default memory allocator on AIX (Yorktown) uses a single application heap. During periods of heavy, concurrent memory allocation requests, this can result in contention for the internal locks used by malloc. Some AIX environment variables (e.g. 'MALLOCOPTIONS=multiheap') allow AIX to use multiple heaps to satisfy memory allocation requests, which should eliminate the lock contention. It is strongly recommended that any proposed changes to the AIX memory management options be discussed first with the appropriate system administrator and/or AIX support team to ensure that these changes are appropriate for your environment.
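
An environment variable such as MALLOCOPTIONS only takes effect if it is present in the environment from which the process is started. The sketch below illustrates that mechanism in general terms; it is not the documented procedure for starting the server. The function name is hypothetical, and the child command is a stand-in that simply echoes the variable. The variable itself only influences the allocator on AIX.

```python
import os
import subprocess
import sys


def launch_with_multiheap(command):
    """Run a command with MALLOCOPTIONS=multiheap added to its environment."""
    env = dict(os.environ)  # copy, so the parent environment is untouched
    env["MALLOCOPTIONS"] = "multiheap"
    return subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )


# Stand-in child process (assumption): prints the value it inherited.
# In practice the command would be whatever starts the server process.
result = launch_with_multiheap(
    [sys.executable, "-c", "import os; print(os.environ['MALLOCOPTIONS'])"]
)
print(result.stdout.strip())
```

Because the setting is inherited at process start, a running process has to be restarted from an environment that contains the variable before the change applies.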

Product: Tivoli Storage Manager (SSGSG7); Component: Server; Platform: AIX; Version: 6.3, 7.1; Line of Business: Storage

Product Synonym

TSM

Document Information

Modified date:
17 June 2018

UID

swg21671759