APAR status
INTRAN
Error description
The purpose of this INFO APAR is to document the procedure to follow and the documentation to gather if SMSVSAM hangs. It also documents basic recovery procedures for most types of hangs or other RLS problems.

The basic procedure for diagnosing RLS problems is as follows:
1. Issue diagnostic commands to understand the scope of the problem
2. Dump affected systems and regions
3. Restart affected regions

We will go over each item in detail below:

-------------------------------
1. DIAGNOSTIC COMMANDS
-------------------------------
When a hang involving RLS is encountered, the first and best action is to issue the following diagnostic commands.

- D GRS,C - displays ENQ contention
- D SMS,SMSVSAM,DIAG(C) - displays RLS latch contention
- D SMS,SMSVSAM,QUIESCE - shows any outstanding quiesces
- D SMS,CFLS - displays lock structure information
- D XCF,STR,STRNM=IGWLOCK00 - another display of the lock structure

The first three commands above are the most important. Each of them helps identify which system is not making progress.

D GRS,C COMMAND
---------------------------
There is a wide array of RLS ENQs, but the main ones to watch are major names SYSVSAM and SYSZIGW3. The D GRS,C command shows which systems and TCBs are holding these ENQs, which helps narrow down which system is at fault. Take note of any systems or TCBs that appear repeatedly as holders. Note that RLS ENQs are plex-wide in scope.

D SMS,SMSVSAM,DIAG(C) COMMAND
-------------------------------
The DIAG(C) command reports only on the system on which it was issued, so this command needs to be run on each system. You can also use RO *ALL,D SMS,SMSVSAM,DIAG(C) to simplify the process. The report lists all of the latches within RLS that have contention. Take note of the TCBs and compare them with the TCBs holding the ENQs from the GRS display. At this point, you may notice a deadlock between two or three TCBs holding latches and/or ENQs.
This information can be used to search the APAR database for potential fixes. Also, if a deadlock is found, the only solution is to restart RLS (see section 3, RESTART AFFECTED REGIONS, below).

D SMS,SMSVSAM,QUIESCE COMMAND
------------------------------------
The QUIESCE command is very important for identifying whether RLS is waiting on a registered CICS region. Note that this command is also system-specific, so it needs to be run on each system, or routed to all systems. The output lists the data set name and quiesce event, followed by all of the registered CICS regions and their current status. Watch for any regions that have not responded in the same timeframe as the others, or in a timely manner. These are the problem CICS regions. Note these regions for the dump stage.

Here is an example of the display:

IGW540I 09.36.44 DISPLAY SMS,SMSVSAM,QUIESCE
SPHERE NAME: MY.DATA.SET
SYSTEM NAME: ST8  START TIME: 09.33.50  TOTAL ELAPSE TIME: 00.02.54
PARTICIPATING SUB-SYSTEM STATUS:  SCHEDULED:  COMPLETED:  ELAPSE:
SUB-SYSTEM NAME: ........         09.33.50    09.33.51    00.00.00
SUB-SYSTEM NAME: CICS1            09.33.50    09.33.50    00.00.00
SUB-SYSTEM NAME: CICS2            09.33.50    00.00.00    02.54.00
SUB-SYSTEM NAME: CICS3            09.33.50    09.33.50    00.00.00

Note that for CICS2 the COMPLETED column shows 00.00.00 and the ELAPSE time is 02.54.00. This indicates that the CICS region CICS2 has not responded in adequate time. Note any unresponsive regions such as this one and include them in the dumps.

-------------------------------
2. DUMP AFFECTED REGIONS
-------------------------------
Using the commands above, you should have a list of the systems and regions which are involved. Now dump SMSVSAM and any affected CICS regions on the involved systems for later review. For simplicity and completeness, capturing RLS on all systems of the plex is ideal. On each system, dump SMSVSAM and any affected CICS regions, making sure to include the data spaces for RLS and the SDATA parameters GRSQ and XESDATA.
Here is an example command to dump RLS and a CICS region on one system:

DUMP COMM=(some meaningful dump title)
R xx,JOBNAME=(SMSVSAM,XCFAS,CICS1),CONT
R yy,DSPNAME=('SMSVSAM'.*,'XCFAS'.*),CONT
R nn,SDATA=(PSA,NUC,SQA,LSQA,SUM,RGN,GRSQ,LPA,TRT,CSA,XESDATA),END

A similar command dumps SMSVSAM around the plex. Include 'CICS*' in the REMOTE SYSLIST to capture similarly named CICS regions around the plex.

DUMP COMM=(some meaningful dump title)
R xx,JOBNAME=(SMSVSAM,XCFAS),CONT
R yy,DSPNAME=('SMSVSAM'.*,'XCFAS'.*),CONT
R nn,SDATA=(PSA,NUC,SQA,LSQA,SUM,RGN,GRSQ,LPA,TRT,CSA,XESDATA),CONT
R zz,REMOTE=(SYSLIST=(*('SMSVSAM')),DSPNAME,SDATA),END

You can also simplify the process by including an entry similar to this example in your IEADMCxx PARMLIB member:

JOBNAME=(*MASTER*,SMSVSAM),DSPNAME=('SMSVSAM'.*),
SDATA=(COUPLE,PSA,NUC,SQA,LSQA,SUM,RGN,GRSQ,LPA,TRT,
CSA,XESDATA),REMOTE=(SYSLIST=(*('SMSVSAM')),DSPNAME,SDATA)

Then issuing DUMP COMM=(title),PARMLIB=xx will dump RLS all around the plex.

Please see the z/OS MVS System Commands manual (book SA22-7627) for more information on dumping.

In addition to the dumps, it is also very helpful to have:
- SYSLOG from all systems involved
- LOGREC for some time prior to the error
- JOBLOGs, if available, of affected regions

-------------------------------
3. RESTART AFFECTED REGIONS
-------------------------------
To decide which regions to restart, follow the flow below. After each step, check whether RLS is still hung. If so, reissue the commands from section 1 to see whether the nature of the problem has shifted, then return to the steps below.

1. Restart any CICS regions that are holding up QUIESCE requests (from the D SMS,SMSVSAM,QUIESCE output)
2. Restart SMSVSAM on any system that showed latch contention
3. Restart SMSVSAM on any system that was holding an ENQ
4. Restart SMSVSAM on all systems, one at a time

Note that it may not be necessary to do all steps.
Depending on the hang, SMSVSAM in the plex may free up after any step above.

The commands to restart SMSVSAM are:
- V SMS,SMSVSAM,TERMINATESERVER .. watch for IGW408I
- V SMS,SMSVSAM,ACTIVE .. watch for IGW414I

In rare cases where TERMINATESERVER fails, use
- FORCE SMSVSAM,ARM
to restart SMSVSAM. Note that this is reserved as a last resort and is not recommended in most cases.

Please contact your support representatives if the above does not solve the problem, and send the documentation for review.

Additional keywords: RLS SMSVSAM VSAMRLS QUIESCE UNQUIESCE DOCUMENTATION DUMP DUMPS DOC DEBUG DIAGNOSIS DIAG LATCH ENQ CONT CONTENTION HANG DEADLOCK HOLD

Related INFO APARs:
II12927 - DOC COLLECTION GUIDELINES FOR SEVERAL COMPONENTS
II12603 - RLS INIT AND RECOVERY BASICS
II14171 - GETTING STARTED WITH VSAMRLS
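
The check of the D SMS,SMSVSAM,QUIESCE output described in section 1 could be scripted against captured SYSLOG text. The following Python is only a minimal sketch: the function name unresponsive_regions is hypothetical, the row layout is assumed to match the sample display in this APAR, and a COMPLETED value of 00.00.00 is assumed to mean the region has not yet responded. Real output may differ, so verify against your own displays before relying on it.

```python
import re

# Sample text copied from the example display in this APAR.
SAMPLE = """\
IGW540I 09.36.44 DISPLAY SMS,SMSVSAM,QUIESCE
SPHERE NAME: MY.DATA.SET
SYSTEM NAME: ST8 START TIME: 09.33.50 TOTAL ELAPSE TIME: 00.02.54
PARTICIPATING SUB-SYSTEM STATUS: SCHEDULED: COMPLETED: ELAPSE:
SUB-SYSTEM NAME: ........ 09.33.50 09.33.51 00.00.00
SUB-SYSTEM NAME: CICS1 09.33.50 09.33.50 00.00.00
SUB-SYSTEM NAME: CICS2 09.33.50 00.00.00 02.54.00
SUB-SYSTEM NAME: CICS3 09.33.50 09.33.50 00.00.00
"""

# One row: subsystem name, then SCHEDULED, COMPLETED, ELAPSE as hh.mm.ss-style fields.
TIME = r"(\d\d\.\d\d\.\d\d)"
ROW = re.compile(r"SUB-SYSTEM NAME:\s+(\S+)\s+" + TIME + r"\s+" + TIME + r"\s+" + TIME)


def unresponsive_regions(display_text):
    """Return subsystem names whose COMPLETED field is 00.00.00.

    Assumption (not stated by the APAR): 00.00.00 in the COMPLETED
    column means the quiesce has not completed for that region.
    """
    pending = []
    for line in display_text.splitlines():
        match = ROW.search(line)
        if match is None:
            continue  # header or unrelated SYSLOG line
        name, _scheduled, completed, _elapse = match.groups()
        if completed == "00.00.00":
            pending.append(name)
    return pending


if __name__ == "__main__":
    print(unresponsive_regions(SAMPLE))
```

Regions returned by such a check are candidates for the JOBNAME list in the dump commands of section 2 and for the first restart step in section 3.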
Local fix
Problem summary
Problem conclusion
Temporary fix
Comments
APAR Information
APAR number
II14597
Reported component name
V2 LIB INFO ITE
Reported component ID
INFOV2LIB
Reported release
001
Status
INTRAN
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt
Submitted date
2010-09-29
Closed date
Last modified date
2012-12-19
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Applicable component levels
Document Information
Modified date:
19 December 2012