IBM Tivoli Storage Manager, Version 7.1

Scenario: Improving performance of restore operations for database applications

This scenario illustrates how to interpret API client instrumentation data and server monitoring data to improve database application restore operations.

Problem

Kate, a Tivoli® Storage Manager administrator, notices a performance degradation in restore operations for an SAP application database.

Goal

Kate wants to improve the performance of database restore operations.

Data collection

Kate collects the following data:
  • API client instrumentation data
  • Server monitoring data, which is generated by running the server monitoring script (servermonv6.pl)
Kate uses IBM® Tivoli Storage Manager for Enterprise Resource Planning to back up and restore the database for an SAP application. To collect the data, she runs a restore operation during a time of little or no other activity on the Tivoli Storage Manager server, and runs the API client instrumentation and the server monitoring script for the entire duration of the restore operation.

Analysis and determination of the bottleneck

During the restore operation, data is read from the tape drives, sent over the network to the Tivoli Storage Manager client, and written to the client disk. The data is transferred from the tape system through a host bus adapter (HBA) that connects the tape library to the server. The server transfers the data to the client system through a local area network (LAN) connection. A network interface card (NIC) connects the server to the LAN, and a separate NIC connects the client to the LAN. From the client, the data is written to disk through a separate HBA device.

Figure 1. Data flow for restore operations for database applications

Following the flow of data, Kate starts by reviewing the server instrumentation data. First, she notes the time when the restore operation started by reviewing the Tivoli Storage Manager activity log. Then, she locates the files that were collected by the server monitoring script and selects the ones that are time-stamped around the time when the restore operation was slow. The instrumentation file names have the following structure: YYYYMMDD-HHMM-YYYYMMDD-HHMM-instr.txt.
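The file-name structure makes it straightforward to script this lookup. The following Python sketch finds the server monitoring files whose time-stamped interval covers the restore start time. Only the file-name pattern comes from this scenario; the directory path and the example date are placeholders.

# Minimal sketch: locate the server monitoring instrumentation files whose
# time-stamped interval covers the start of the slow restore operation.
# The file-name pattern (YYYYMMDD-HHMM-YYYYMMDD-HHMM-instr.txt) is taken from
# this scenario; the directory path and the example date are placeholders.
import re
from datetime import datetime
from pathlib import Path

PATTERN = re.compile(r"(\d{8}-\d{4})-(\d{8}-\d{4})-instr\.txt$")

def files_covering(restore_start, directory="./servermon"):
    """Yield instrumentation files whose collection interval contains restore_start."""
    for path in Path(directory).glob("*-instr.txt"):
        match = PATTERN.search(path.name)
        if not match:
            continue
        begin = datetime.strptime(match.group(1), "%Y%m%d-%H%M")
        end = datetime.strptime(match.group(2), "%Y%m%d-%H%M")
        if begin <= restore_start <= end:
            yield path

# Example: a restore that started at about 13:16; the date is only a placeholder.
for instr_file in files_covering(datetime(2013, 6, 14, 13, 16)):
    print(instr_file)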

Kate examines the instrumentation output to find a thread that was reading from tape, and finds Thread 131787. The data for the thread shows that only 9.100 seconds were spent in the Tape Read category, with a fast instantaneous throughput (InstTput) of 269584.5 KB per second. She reviews the server instrumentation categories in Table 1 and confirms that there is no problem reading from tape. Kate also sees that 99% of the thread's time was spent in the Thread Wait category (1199.192 of 1208.335 seconds). The high instantaneous throughput and the large amount of thread wait time indicate that tape read is not the bottleneck:

Thread 131787 AgentThread parent=131782  13:16:25.938-->13:36:34.274
 Operation       Count  Tottime  Avgtime  Mintime  Maxtime InstTput  Total KB
 ----------------------------------------------------------------------------
 Tape Read        9583    9.100    0.001    0.000    0.354 269584.5   2453248
 Thread Wait      9585 1199.192    0.125    0.000  267.561
 Unknown                  0.042
 ----------------------------------------------------------------------------
 Total                 1208.335                              2030.3   2453248
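A quick way to confirm this reading is to compute each category's share of the thread's total time, and the effective throughput, from the Tottime and Total KB columns. The Python sketch below is illustrative only; the helper names are arbitrary, and the figures are taken from the Thread 131787 summary above.

# Minimal sketch: a category's share of a thread's total time, and its
# effective throughput, from the Tottime and Total KB columns of the server
# instrumentation report. The figures are from Thread 131787 above.
def share_of_total(category_seconds, thread_total_seconds):
    return 100.0 * category_seconds / thread_total_seconds

def throughput_kb_per_sec(total_kb, category_seconds):
    return total_kb / category_seconds

print(share_of_total(1199.192, 1208.335))      # Thread Wait: about 99.2%
print(share_of_total(9.100, 1208.335))         # Tape Read: under 1%
print(throughput_kb_per_sec(2453248, 9.100))   # about 269,588 KB/s, in line with
                                               # the reported InstTput of 269584.5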

Following the data flow, she looks for a thread with a Network Send category that processed approximately the same amount of data (2453248 KB). She finds Thread 131781, which processed a total of 2452368 KB. Thread 131781 is a psSessionThread thread, which shows the timing of the data as it was sent from the server, through the network, to the client. Kate sees that the InstTput for the thread was 2052.8 KB per second, which was slower than expected. The Thread Wait category took only 0.786 seconds, less than 1% of the total time, which indicates that most of the time was spent in the Network Send category. The findings suggest that the bottleneck is in the network or on the client side.

Thread 131781 psSessionThread parent=299  13:16:25.938-->13:36:34.274
 Operation       Count  Tottime  Avgtime  Mintime  Maxtime InstTput  Total KB
 ----------------------------------------------------------------------------
 Data Copy           1    0.000    0.000    0.000    0.000
 Network Recv        5   12.778    2.556    0.001   12.719      0.0         0
 Network Send    19170 1194.666    0.062    0.000  267.561   2052.8   2452368
 DB2 Fetch Prep      1    0.003    0.003    0.003    0.003
 DB2 Fetch Exec      8    0.003    0.000    0.000    0.002
 DB2 MFtch Exec      3    0.008    0.003    0.000    0.004
 DB2 Fetch           8    0.000    0.000    0.000    0.000
 DB2 MFetch          5    0.000    0.000    0.000    0.000
 DB2 Commit          2    0.000    0.000    0.000    0.000
 DB2 Reg Prep        2    0.003    0.002    0.000    0.003
 DB2 Reg Exec        2    0.000    0.000    0.000    0.000
 DB2 Reg Fetch       2    0.000    0.000    0.000    0.000
 Thread Wait      9584    0.786    0.000    0.000    0.351
 Unknown                  0.085
 ----------------------------------------------------------------------------
 Total                 1208.335                              2029.5   2452368
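The pairing step that Kate performs, matching the tape-read thread to the session thread that moved roughly the same amount of data, can be expressed as a simple comparison of the Total KB values. The sketch below is an illustration under stated assumptions: the hand-built thread list and the 1% tolerance are not from the report, and only the thread names and totals come from the two summaries above.

# Minimal sketch of the pairing step: match threads along the data flow by
# comparing the Total KB that each thread moved. The list is a hand-built
# stand-in for parsed report data; the totals come from the summaries above.
threads = [
    {"id": 131787, "name": "AgentThread",     "category": "Tape Read",    "total_kb": 2453248},
    {"id": 131781, "name": "psSessionThread", "category": "Network Send", "total_kb": 2452368},
]

def roughly_equal(a_kb, b_kb, tolerance=0.01):
    """True if the two totals differ by less than the tolerance (1% by default)."""
    return abs(a_kb - b_kb) <= tolerance * max(a_kb, b_kb)

tape_reader = threads[0]
for candidate in threads[1:]:
    if candidate["category"] == "Network Send" and roughly_equal(tape_reader["total_kb"], candidate["total_kb"]):
        print(f"Thread {tape_reader['id']} feeds session thread {candidate['id']}")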

Next, Kate looks at the API client instrumentation report.

Detailed Instrumentation statistics for

Thread: 2911  Elapsed time =  1669.061 sec

Section                  Actual(sec)  Average(msec)  Frequency used
------------------------------------------------------------------
Waiting on App              1435.153           94.8           15131
API Get Data                 233.909           15.5           15131
Other                          0.000            0.0               0

------------------------------------------------------------------

Detailed Instrumentation statistics for

Thread: 2902  Elapsed time =  1913.868 sec

Section                  Actual(sec)  Average(msec)  Frequency used
------------------------------------------------------------------
Waiting on App              1681.437          110.4           15224
API Get Data                 232.432           15.3           15224
Other                          0.000            0.0               0

------------------------------------------------------------------

Detailed Instrumentation statistics for

Thread: 2893  Elapsed time =  2093.726 sec

Section                  Actual(sec)  Average(msec)  Frequency used
------------------------------------------------------------------
Waiting on App              1926.681          127.1           15153
API Get Data                 167.045           11.0           15153
Other                          0.000            0.0               0

------------------------------------------------------------------

By reviewing the data from the API client instrumentation, Kate sees that the largest amount of time was spent in the Waiting on App category. This category shows the time that the application spent processing the restored data. Kate therefore determines that most of the restore time is spent waiting for the application database. As a next step, Kate could verify the performance of the disk system.
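The same percentage check applies to the API client report. The short Python sketch below uses the figures from the three session reports above; the dictionary is a hand-built stand-in for parsed output. Roughly 86% to 92% of each session's elapsed time falls in the Waiting on App section, which points past the server and the network to the application and its disk system.

# Minimal sketch: fraction of each API session's elapsed time that was spent
# in the "Waiting on App" section, using the three reports above.
sessions = {
    2911: (1435.153, 1669.061),
    2902: (1681.437, 1913.868),
    2893: (1926.681, 2093.726),
}

for thread_id, (waiting_on_app, elapsed) in sessions.items():
    percent = 100.0 * waiting_on_app / elapsed
    print(f"Thread {thread_id}: {percent:.1f}% of elapsed time waiting on the application")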

Problem resolution

Kate contacts the storage administrator and the SAP administrator to help resolve the problem. After further investigation, the storage administrator determines that the database layout is the source of the problem, and corrects it.


