Unplanned outage test scenarios

This topic describes the unplanned outage test scenarios that you can perform to verify the *SAPSRV add-on policy.

How SA z/OS® reacts to the failure of an SAP resource depends on the severity of the failure. For example, if the enqueue server or the message server fails (each is a single point of failure), the SAP system is no longer operable. In contrast, a failure of the enqueue replication server has no direct impact on a running SAP system.

The following scenarios simulate an unplanned outage of an SAP resource in two ways:

  1. Sending a kill -2 signal to the process. This simulates, for example, an operator intervention. For a z/OS UNIX process it means a normal stop for the process, allowing the process to perform cleanup actions, if any are implemented.
  2. Sending a kill -9 signal to the process. This simulates, for example, a program crash. For z/OS UNIX this means that the operating system does not give back any control to the process.
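The difference between the two signals can be reproduced with any UNIX shell process. The following sketch is not SAP-specific: the helper name run_with_signal and the message texts are hypothetical and serve only as an illustration. A child shell installs a cleanup handler for signal 2 (SIGINT) and then sends the requested signal to itself.

```shell
#!/bin/sh
# Hypothetical demo helper: run a child shell that traps signal 2 (SIGINT)
# and then sends the signal number given as $1 to itself.
run_with_signal() {
  sh -c '
    # The handler can run for signal 2 only; signal 9 cannot be trapped.
    trap "echo cleanup done; exit 0" INT
    # Deliver the requested signal to this child shell.
    kill -"$1" "$$"
    # Reached only if the signal was neither handled nor fatal.
    sleep 1
    echo "signal had no effect"
  ' sh "$1"
}
```

Under these assumptions, run_with_signal 2 prints "cleanup done" and ends with exit status 0, whereas run_with_signal 9 prints nothing and the calling shell typically reports exit status 137 (128 + 9), because the child gets no chance to run its handler.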

This topic describes the following test scenarios:

For each test scenario, the following is documented:

  • Purpose of the test
  • Expected behavior
  • Initial setup
  • Preparation for the test
  • Phases of the execution
  • Observed results

Verifying the resource status describes the verification tasks that can be performed before and after each test to check the status of the SAP-related components. These steps are not repeated in this section. However, the description of each test may contain additional verification tasks that are specific to the scenario.

Failure of the SAP enqueue server with active ERS instance

This scenario simulates the failure of the enqueue server when an ERS instance is active, and tests the behavior of SA z/OS. You can also measure the impact of the failure on the SAP workload. For a scenario where replication into the coupling facility is used (no ERS active), see Failure of the SAP enqueue server with active CF replication.

The following table summarizes the execution of the test.

Table 1. Failure of the SAP enqueue server with active ERS instance
Scenario characteristics Description
Purpose Scope: Enqueue server

Action: Unplanned outage

Expected behavior Example ABAP enqueue server:

SA z/OS shows a PROBLEM or HARDDOWN status for the failed resource SAPHA2AEN and restarts its ASCS group (SAPHA2ASCS) on the LPAR where the enqueue replication server is running. Before the restart, the corresponding SAP start service (SAPHA2A_SRV) moves because of its HasParent relationship to SAPHA2ASCS.

The enqueue replication server stops and restarts on COH1. Before that restart, the corresponding SAP start service (SAPHA2AR_SR) moves to COH1 because of its HasParent relationship to SAPHA2AERS.

The failure has no impact on the SAP workload.

Setup COH1, COH2, and COH3 must be running, including all required z/OS resources and SAP-related resources, with:
  • the enqueue server running on COH2
  • the enqueue replication server running on another LPAR (COH3)
Preparation
  • Log on to an SAP application server.
  • Create a workload on one SAP application server. In Figure 1 you can see that we used p570coh1v.
  • Create entries in the enqueue table.
Execution Use the UNIX command kill once with signal -9 and once with signal -2 to kill the enqueue server process outside of SA z/OS.
Verifications
  • Check that the workload is still running (SM66).
  • Verify the number of entries in the enqueue table (SM12).
  • Look for error messages in the enqueue log file, in the file dev_enqserv, in the developer traces dev_disp and dev_wx, and in the system log (SM21).
Observed results,
if unexpected:
 

Before the test, all SAP-related resources are in AVAILABLE status. The SAP enqueue server is running on COH2, and the enqueue replication server is running on a different LPAR (in this test, COH3).

To simulate the failure, kill the enqueue server process (en.sapHA2_ASCS20), using the UNIX command kill -9 <pid>.
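The <pid> value can be taken from ps -ef output. The following is a minimal sketch, assuming the eight-column ps -ef layout that appears later in this topic (UID, PID, PPID, C, STIME, TTY, TIME, CMD); the helper name pid_of is hypothetical.

```shell
#!/bin/sh
# pid_of NAME: read `ps -ef` output on stdin and print the PID of each
# process whose command name (column 8) equals NAME exactly.
# Assumes the layout UID PID PPID C STIME TTY TIME CMD.
pid_of() {
  awk -v name="$1" '$8 == name { print $2 }'
}

# Sample line in the layout used by this topic (the PID values are examples).
sample='  ha2adm   16908559   33686161  - 13:38:15 ?         0:13 en.sapHA2_ASCS20 pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv'
printf '%s\n' "$sample" | pid_of en.sapHA2_ASCS20
```

On the LPAR, the same helper could be fed from a live listing, for example pid=$(ps -ef | pid_of en.sapHA2_ASCS20), followed by kill -9 "$pid". Matching the command name exactly, rather than using grep, avoids picking up the grep process itself.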

The MVS log shows the failure of the enqueue server on COH2 and its restart on COH3:

COH2 12264 15:13:33.05 STC08573 00000210 BPXP023I THREAD 2129870000000001, IN PROCESS 16974063, WAS 719
                            719 00000210 TERMINATED BY SIGNAL SIGKILL, SENT FROM THREAD
                            719 00000210 2128440000000002, IN PROCESS 16973917, UID 40312, IN JOB HA2ADM.
COH2 12264 15:13:33.09 STC08497 00000000 AOF571I 15:13:33 : SAPHA2AEN SUBSYSTEM STATUS FOR JOB SAPHA2AE IS 720
                            720 00000000 ABENDING - SUBSYSTEM HAS SUFFERED A RECOVERABLE ERROR
COH2 12264 15:13:33.12 STC08497 00000000 AOF571I 15:13:33 : SAPHA2AEN SUBSYSTEM STATUS FOR JOB SAPHA2AE IS 721
                            721 00000000 STOPPED - ABENDED, RESTARTOPT=NEVER SPECIFIED
COH2 12264 15:13:33.30 STC08497 00000000 AOF571I 15:13:33 : SAPHA2ACV SUBSYSTEM STATUS FOR JOB SAPHA2AV IS 722
                            722 00000000 AUTODOWN - SET BY SHUTDOWN
COH2 12264 15:13:33.33 STC08497 00000000 AOF743I SHUTDOWN WILL NOT (RE)PROCESS SUBSYSTEM SAPHA2ACV AS IT IS 723
                            723 00000000 AUTODOWN
COH3 12264 15:13:33.35 STC08713 00000000 AOF570I 15:13:33 : ISSUED "INGUSS JOBNAME=HA2CPAS,/bin/tcsh -c 901
                            901 00000000 '/bin/cp -p ~/start_ASCS20_srv.COH3.log 
                            901 00000000 ~/start_ASCS20_srv.COH3.log.old'" FOR SUBSYSTEM SAPHA2A_SRV -
                            901 00000000 MSGTYPE IS PRESTART
COH3 12264 15:13:33.36 STC08713 00000010 *HSAL6010A SAPHA2AEN/APL/COH2; INTERVENTION REQUIRED; BEYOND AUTOMATION
COH1 12264 15:13:33.37 STC08275 00000010 *HSAL6010A SAPHA2AEN/APL/COH2; INTERVENTION REQUIRED; BEYOND AUTOMATION
COH2 12264 15:13:33.42 STC08497 00000010 *HSAL6010A SAPHA2AEN/APL/COH2; INTERVENTION REQUIRED; BEYOND AUTOMATION
COH3 12264 15:13:33.43 STC01743 00000201 $HASP100 BPXAS ON STCINRDR
COH3 12264 15:13:33.44 STC08713 00000000 AOF571I 15:13:33 : SAPHA2A_SRV SUBSYSTEM STATUS FOR JOB SHA2ASR IS 904
                            904 00000000 STARTED - STARTUP FOR SAPHA2A_SRV/APL/COH3 IN PROGRESS
COH3 12264 15:13:33.48 STC01743 00000010 $HASP373 BPXAS STARTED
COH2 12264 15:13:33.48 STC08497 00000000 AOF571I 15:13:33 : SAPHA2AMS SUBSYSTEM STATUS FOR JOB SAPHA2AM IS 726
                            726 00000000 AUTOTERM - SET BY SHUTDOWN
COH3 12264 15:13:33.48 STC01743 00000210 BPXP024I BPXAS INITIATOR STARTED ON BEHALF OF JOB NETVIEW RUNNING IN ASID 001F
COH2 12264 15:13:33.49 STC08497 00000000 AOF571I 15:13:33 : SAPHA2AGW SUBSYSTEM STATUS FOR JOB SAPHA2AW IS 727
                            727 00000000 AUTOTERM - SET BY SHUTDOWN
COH2 12264 15:13:33.49 STC08497 00000000 AOF571I 15:13:33 : SAPHA2AST SUBSYSTEM STATUS FOR JOB SHA2AST IS 728
                            728 00000000 AUTOTERM - SET BY SHUTDOWN
COH3 12264 15:13:33.51 STC08713 00000000 AOF570I 15:13:33 : ISSUED "INGUSS JOBNAME=HA2ASCS,/bin/tcsh -c 908
                            908 00000000 '~/start_sapsrv HA2 ASCS20 ha2ascsv SHA2ASR 9 >&
                            908 00000000 ~/start_ASCS20_srv.COH3.log'" FOR SUBSYSTEM SAPHA2A_SRV - MSGTYPE
                            908 00000000 IS STARTUP
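To follow the status transitions in an excerpt like this one, the AOF571I messages can be condensed to one line each. The following sketch assumes the two-line layout shown above, where the resource name precedes the words SUBSYSTEM STATUS and the new status is the third word of the continuation line; the helper name status_changes is hypothetical, and a log with interleaved continuation lines would need a more careful parser.

```shell
#!/bin/sh
# status_changes: read an MVS log excerpt on stdin and print
# "SYSTEM RESOURCE STATUS" for every AOF571I status message.
status_changes() {
  awk '
    /AOF571I/ {
      sys = $1                       # system name is the first column
      for (i = 2; i < NF; i++)       # resource name precedes "SUBSYSTEM STATUS"
        if ($i == "SUBSYSTEM" && $(i + 1) == "STATUS") res = $(i - 1)
      pending = 1                    # the status follows on the next line
      next
    }
    pending { print sys, res, $3; pending = 0 }
  '
}

status_changes <<'EOF'
COH2 12264 15:13:33.09 STC08497 00000000 AOF571I 15:13:33 : SAPHA2AEN SUBSYSTEM STATUS FOR JOB SAPHA2AE IS 720
                            720 00000000 ABENDING - SUBSYSTEM HAS SUFFERED A RECOVERABLE ERROR
COH2 12264 15:13:33.12 STC08497 00000000 AOF571I 15:13:33 : SAPHA2AEN SUBSYSTEM STATUS FOR JOB SAPHA2AE IS 721
                            721 00000000 STOPPED - ABENDED, RESTARTOPT=NEVER SPECIFIED
EOF
```

For the two sample messages, the sketch prints COH2 SAPHA2AEN ABENDING followed by COH2 SAPHA2AEN STOPPED.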

After the failure, the resource SAPHA2AEN on COH2 has the status PROBLEM or HARDDOWN.

On COH3, all resources of SAPHA2ASCS and its corresponding SAP start service SAPHA2A_SRV are in AVAILABLE status. The enqueue replication server and its corresponding SAP start service have stopped on COH3.

When the enqueue server restarts on COH3, it reads the enqueue replication table from shared memory and rebuilds the enqueue table. Use the transaction SM12 to verify that the 10 lock entries you generated are still in the enqueue table.

Look at the enqueue server log file (enquelog) to verify that the enqueue server restarted and the enqueue replication server is not running (there is no message specifying that replication is active).

Look at the developer trace file dev_disp to verify that the dispatcher lost its connection with the message server and reconnected later on.

The following log output shows the messages of the SAP system log (SM21) during the test interval.

Figure 1. SAP system log (SM21) from Application server p570coh1v

Failure of the SAP enqueue server with active CF replication

This scenario simulates the failure of the enqueue server when coupling facility replication is active. It also tests the behavior of SA z/OS. In addition, you can measure the impact of the failure on the SAP workload. For a scenario where an ERS instance is active, see Table 1.

This scenario covers two tests. The first verifies that SAP restarts a failed enqueue server in place and that SA z/OS does not intervene unnecessarily. The second shows that SA z/OS takes action if SAP cannot restart the enqueue server in place.

Table 2 summarizes the execution of the test. For this scenario, a sample SAP System ID (SAPSID) of HA1 is used instead of HA2, which is used for the other scenarios.

Table 2. Failure of the SAP enqueue server with active CF replication
Scenario characteristics Description
Purpose Scope: Enqueue server

Action: Unplanned outage

Expected behavior Example ABAP enqueue server:
  1. The enqueue server is automatically restarted by its parent SAP process sapstart. SA z/OS notices the outage, but does not intervene, because the enqueue server is already restarted at that time.
  2. If SAP cannot restart the enqueue server, for example because the SAP restart limit has been reached or the LPAR is going down, SA z/OS restarts the enqueue server on another LPAR. SA z/OS shows a PROBLEM or HARDDOWN status for the failed resource SAPHA1AEN and restarts its ASCS group (SAPHA1ASCS) on any other eligible LPAR. Before the restart, the corresponding SAP start service (SAPHA1A_SRV) moves because of its HasParent relationship to SAPHA1ASCS.

The failure has no impact on the SAP workload.

Setup COH1, COH2, and COH3 must be running, including all required z/OS resources and SAP-related resources, with the enqueue server running on COH2.
Preparation
  • Log on to an SAP application server.
  • Create a workload on one SAP application server. In Figure 2 you can see that we used ihlscoh1v.
  • Create entries in the enqueue table.
Execution For the first test, use the UNIX command kill once with signal -9, then perform the verifications listed in the following row. For the second test, kill the enqueue server seven times with signal -2. This exceeds SAP’s default restart limit, so SA z/OS moves the ASCS instance, including the enqueue server.
Verifications
  • Check that the workload is still running (SM66).
  • Verify the number of entries in the enqueue table (SM12).
  • Look for error messages in the enqueue log file, in the file dev_enqserv, in the developer traces dev_disp and dev_wx, and in the system log (SM21).
Observed results,
if unexpected:
 

Before the test, all SAP-related resources are in AVAILABLE status. The SAP enqueue server is running on COH2.

First Test:
To simulate the failure, kill the enqueue server process (en.sapHA1_ASCS10), using the UNIX command kill -9 <pid>.

The MVS log shows the failure of the enqueue server on COH2 and its restart in place:


COH2 17242 15:13:09.78 STC07876 00000210 BPXP023I THREAD 21B5500000000001, IN PROCESS 33751804, WAS 032
                            032 00000210 TERMINATED BY SIGNAL SIGKILL, SENT FROM THREAD
                            032 00000210 21B5380000000001, IN PROCESS 33751108, UID 0, IN JOB VSCH.
COH2 17242 15:13:09.80 STC07815 00000000 AOF571I 15:13:09 : SAPHA1AEN SUBSYSTEM STATUS FOR JOB SAPHA1AE IS 034
                            034 00000000 ABENDING - SUBSYSTEM HAS SUFFERED A RECOVERABLE ERROR
COH2 17242 15:13:09.87 STC07815 00000000 AOF571I 15:13:09 : SAPHA1AEN SUBSYSTEM STATUS FOR JOB SAPHA1AE IS 035
                            035 00000000 RESTART - RESTARTING AFTER A RECOVERABLE ERROR
COH2 17242 15:13:09.94 STC07815 00000000 AOF313I 15:13:09 : START FOR SUBSYSTEM SAPHA1AEN (JOB SAPHA1AE) 036
                            036 00000000 WAS NOT ATTEMPTED - STATUS MISMATCH FIXED - SUBSYSTEM IS NOW
                            036 00000000 "EXTSTART".
COH2 17242 15:13:09.94 STC07815 00000000 AOF571I 15:13:09 : SAPHA1AEN SUBSYSTEM STATUS FOR JOB SAPHA1AE IS 037
                            037 00000000 EXTSTART - RESTART FOUND APPLICATION MONITOR STATUS TO BE ACTIVE
COH2 17242 15:13:12.66 00000210 IXC473I NOTE PAD SAPHA1.ENQUEUE.10 HAS BEEN DELETED 038
                            038 00000210 NOTE PAD CREATION TOD: 08/24/2017 16:55:20.543037
                            038 00000210 REQUESTER JOB NAME: HA1AS101 SYSTEM NAME:COH2
                            038 00000210 REASON: USER REQUEST
COH2 17242 15:13:12.97 STC07815 00000000 AOF571I 15:13:12 : SAPHA1AEN SUBSYSTEM STATUS FOR JOB SAPHA1AE IS 039
                            039 00000000 UP - UP MESSAGE RECEIVED
COH2 17242 15:13:14.01 00000210 IXL014I IXLCONN REQUEST FOR STRUCTURE IXCNP_SAPHA100 040
                            040 00000210 WAS SUCCESSFUL. JOBNAME: XCFAS ASID: 0006
                            040 00000210 CONNECTOR NAME: NOTEPAD_03000351 CFNAME: CF01
COH2 17242 15:13:14.02 00000210 IXC472I NOTE PAD SAPHA1.ENQUEUE.10 HAS BEEN CREATED 041
                            041 00000210 REQUESTER JOB NAME: HA1AS101 SYSTEM NAME: COH2
                            041 00000210 NOTE PAD CREATION TOD: 08/30/2017 15:13:13.776288
                            041 00000210 NUMBER OF NOTES: 56415 HOST STRUCTURE: IXCNP_SAPHA101
COH2 17242 15:13:14.04 STC07876 00000211 BPXF024I (HA1ADM) SAP HA1 instance ASCS10 enqueue replication 042
                            042 00000211 started

The enqueue server is restarted by SAP and SA z/OS does not intervene.

Second Test:
To simulate a failure where SAP can no longer restart the enqueue server in place, kill the enqueue server process (en.sapHA1_ASCS10) seven times using the UNIX command kill -2 <pid>.

The subsequent excerpt from the MVS log shows the failure of the enqueue server on COH2 and its restart by SA z/OS on another LPAR, here COH1.


COH2     17242 15:48:47.11 STC07815 00000000  AOF571I 15:48:47 : SAPHA1AEN SUBSYSTEM STATUS FOR JOB SAPHA1AE IS 399  
                                399 00000000   STOPPING - SUBSYSTEM SHUTDOWN OUTSIDE OF AUTOMATION                   
COH2     17242 15:48:47.16 STC07815 00000000  AOF577E 15:48:47 : RECOVERY FOR SUBSYSTEM SAPHA1AEN (JOB SAPHA1AE) 400 
                                400 00000000   HALTED - CRITICAL THRESHOLD EXCEEDED                                  
COH2     17242 15:48:47.16 STC07815 00000000 *AOF575A 15:48:47 : JOB SAPHA1AE HAS ENDED - AUTOMATED RECOVERY NOT 401 
                                401 00000000   IN PROGRESS - OPERATION INTERVENTION REQUIRED                         
COH2     17242 15:48:47.17 STC07815 00000000  AOF571I 15:48:47 : SAPHA1AEN SUBSYSTEM STATUS FOR JOB SAPHA1AE IS 402  
                                402 00000000   STOPPED - SHUTDOWN OUTSIDE OF AUTOMATION, AOFRESTARTALWAYS IS OFF     
COH2     17242 15:48:47.20 STC07815 00000000  AOF571I 15:48:47 : SAPHA1ACV SUBSYSTEM STATUS FOR JOB SAPHA1AV IS 403  
                                403 00000000   AUTODOWN - SET BY SHUTDOWN                                            
COH1     17242 15:48:47.20 STC07540 00000010 *HSAL6010A SAPHA1AEN/APL/COH2; INTERVENTION REQUIRED; BEYOND AUTOMATION 
COH1     17242 15:48:47.22 STC07540 00000000  AOF571I 15:48:47 : SAPHA1A_SRV SUBSYSTEM STATUS FOR JOB SHA1ASR IS 259 
                                259 00000000   RESTART - PREPARE SAPHA1A_SRV/APL/COH1 FOR STARTUP                    
COH2     17242 15:48:47.22 STC07815 00000000  AOF743I SHUTDOWN WILL NOT (RE)PROCESS SUBSYSTEM SAPHA1ACV AS IT IS 404 
                                404 00000000   AUTODOWN                                                              
COH2     17242 15:48:47.23 STC07815 00000010 *HSAL6010A SAPHA1AEN/APL/COH2; INTERVENTION REQUIRED; BEYOND AUTOMATION 
COH2     17242 15:48:47.25 STC07815 00000000  AOF571I 15:48:47 : SAPHA1AMS SUBSYSTEM STATUS FOR JOB SAPHA1AM IS 406  
                                406 00000000   AUTOTERM - SET BY SHUTDOWN                                            
COH1     17242 15:48:47.25 STC07540 00000000  AOF570I 15:48:47 : ISSUED "INGUSS JOBNAME=HA1CPAS,/bin/tcsh -c 260     
                                260 00000000   '/bin/cp -p ~/start_ASCS10_srv.COH1.log                               
                                260 00000000   ~/start_ASCS10_srv.COH1.log.old'" FOR SUBSYSTEM SAPHA1A_SRV -         
                                260 00000000   PHASE IS PRESTART                                                     
COH1     17242 15:48:47.27 STC07540 00000000  AOF571I 15:48:47 : SAPHA1A_SRV SUBSYSTEM STATUS FOR JOB SHA1ASR IS 261 
                                261 00000000   STARTED - STARTUP FOR SAPHA1A_SRV/APL/COH1 IN PROGRESS                
COH2     17242 15:48:47.28 STC07815 00000000  AOF571I 15:48:47 : SAPHA1AST SUBSYSTEM STATUS FOR JOB SHA1AST IS 407   
                                407 00000000   AUTOTERM - SET BY SHUTDOWN                                            
COH1     17242 15:48:47.32 STC07540 00000000  AOF570I 15:48:47 : ISSUED "INGUSS JOBNAME=HA1ASCS,/bin/tcsh -c 262     
                                262 00000000   '~/start_sapsrv HA1 ASCS10 ha1ascsv SHA1ASR 9 >&                      
                                262 00000000   ~/start_ASCS10_srv.COH1.log'" FOR SUBSYSTEM SAPHA1A_SRV - PHASE       
                                262 00000000   IS STARTUP 
...
...
... 
COH1     17242 15:49:24.48 STC04513 00000211  BPXF024I (HA1ADM) SHA1AST ACTIVE                                        
COH1     17242 15:49:24.50 STC07540 00000000  AOF571I 15:49:24 : SAPHA1AST SUBSYSTEM STATUS FOR JOB SHA1AST IS 331    
                                331 00000000   UP - UP MESSAGE RECEIVED                                               
COH1     17242 15:49:24.52 STC07540 00000000  AOF313I 15:49:24 : START FOR SUBSYSTEM SAPHA1AEN (JOB SAPHA1AE) 332     
                                332 00000000   WAS NOT ATTEMPTED - STATUS MISMATCH FIXED - SUBSYSTEM IS NOW           
                                332 00000000   "EXTSTART".                                                            
COH1     17242 15:49:24.52 STC07540 00000000  AOF313I 15:49:24 : START FOR SUBSYSTEM SAPHA1AMS (JOB SAPHA1AM) 333     
                                333 00000000   WAS NOT ATTEMPTED - STATUS MISMATCH FIXED - SUBSYSTEM IS NOW           
                                333 00000000   "EXTSTART".                                                            
COH1     17242 15:49:24.53 STC07540 00000000  AOF571I 15:49:24 : SAPHA1AEN SUBSYSTEM STATUS FOR JOB SAPHA1AE IS 334   
                                334 00000000   EXTSTART - RESTART FOUND APPLICATION MONITOR STATUS TO BE ACTIVE       
COH1     17242 15:49:24.53 STC07540 00000000  AOF571I 15:49:24 : SAPHA1AMS SUBSYSTEM STATUS FOR JOB SAPHA1AM IS 335   
                                335 00000000   EXTSTART - RESTART FOUND APPLICATION MONITOR STATUS TO BE ACTIVE       
COH1     17242 15:49:27.57 STC07540 00000000  AOF571I 15:49:27 : SAPHA1AEN SUBSYSTEM STATUS FOR JOB SAPHA1AE IS 336
                                336 00000000   UP - UP MESSAGE RECEIVED                                             
COH1     17242 15:49:27.57 STC07540 00000000  AOF571I 15:49:27 : SAPHA1AMS SUBSYSTEM STATUS FOR JOB SAPHA1AM IS 337 
                                337 00000000   UP - UP MESSAGE RECEIVED

After the failure, the resource SAPHA1AEN on COH2 has the status PROBLEM or HARDDOWN.
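The HSAL6010A messages in the excerpt name the resource that SA z/OS flags as requiring intervention. The following sketch lists each flagged resource and its system once, assuming the NAME/TYPE/SYSTEM notation shown in the log; the helper name flagged_resources is hypothetical.

```shell
#!/bin/sh
# flagged_resources: read an MVS log excerpt on stdin and print
# "RESOURCE SYSTEM" once for every resource named in an HSAL6010A message.
flagged_resources() {
  awk '
    /HSAL6010A/ {
      for (i = 1; i < NF; i++) {
        if ($i ~ /HSAL6010A$/) {
          key = $(i + 1)
          sub(/;$/, "", key)            # drop the trailing semicolon
          n = split(key, part, "/")     # NAME/TYPE/SYSTEM
          if (n == 3 && !seen[key]++) print part[1], part[3]
        }
      }
    }
  '
}

flagged_resources <<'EOF'
COH1     17242 15:48:47.20 STC07540 00000010 *HSAL6010A SAPHA1AEN/APL/COH2; INTERVENTION REQUIRED; BEYOND AUTOMATION
COH2     17242 15:48:47.23 STC07815 00000010 *HSAL6010A SAPHA1AEN/APL/COH2; INTERVENTION REQUIRED; BEYOND AUTOMATION
EOF
```

For the two sample lines, which report the same resource from two systems, the sketch prints the single line SAPHA1AEN COH2.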

On COH1, all resources of SAPHA1ASCS and its corresponding SAP start service SAPHA1A_SRV are in AVAILABLE status.

When the enqueue server restarts on COH1, it reads the enqueue replication table from the coupling facility and rebuilds the enqueue table. Use the transaction SM12 to verify that the 10 lock entries you generated are still in the enqueue table.

Look at the developer trace file dev_disp to verify that the dispatcher lost its connection with the message server and reconnected later on.

The following log output shows the messages of the SAP system log (SM21) during the test interval.

Figure 2. SAP system log (SM21) from two Application servers

Failure of the message server

This scenario simulates the failure of the message server and tests the behavior of SA z/OS.

The following table summarizes the execution of the test.

Table 3. Failure of the message server
Scenario characteristics Description
Purpose Scope: Message server

Action: Unplanned outage

Expected behavior Example ABAP message server:

SA z/OS waits until the message server process is automatically restarted by SAP itself. The sapstart process of the ASCS instance restarts the message server. This is caused by the Restart_Program_<xx> entry in the ASCS profile. For details see SAP Note 2177923: Processes started by SAP start service are not auto-restarted when terminated due to error.

The short interruption until the message server is restarted can have an impact on the workload; see Table 2.
Setup The LPARs COH1, COH2, and COH3 must be running, including all required z/OS resources and SAP-related resources, with:
  • The SAP Central Services, including the message server, running on COH3
  • The enqueue replication server running on COH1
Preparation
  • Log on to an SAP application server.
  • Create workload (SGEN) and entries in the enqueue table.
Execution Use the UNIX command kill once with signal -9 and once with signal -2 to kill the message server process outside of SA z/OS.

Both signals have the same effect on SA z/OS processing.

Verifications
  • Check that the workload is still running (SM66).
  • Verify the number of entries in the enqueue table (SM12).
  • Look for error messages in the developer trace dev_disp and in the system log (SM21).
Observed results, if unexpected:

Before the test, all SAP-related resources are in AVAILABLE status. The message and enqueue servers are running on COH3, and the enqueue replication server is running on COH1.

The workload and enqueue entries are created.

Then, simulate the failure of the ABAP message server process (ms.sapHA2_ASCS20) using the UNIX command kill -9 <pid>:

coh3vipa,~,7:15pm,1,#ps -ef | grep -i ha2
  ha2adm   16908559   33686161  - 13:38:15 ?         0:13 en.sapHA2_ASCS20 pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv
    root     131345     131386  - 19:15:28 ttyp0000  0:00 grep -i ha2
  ha2adm   16908570          1  - 13:38:02 ?         0:15 /usr/sap/HA2/ASCS20/exe/sapstartsrv pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2
  ha2adm    67240345    33686161  - 13:38:15 ?         0:05  ms.sapHA2_ASCS20  pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv
  ha2adm   33686161          1  - 13:38:14 ?         0:00 sapstart pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv

coh3vipa,~,7:15pm,2,#kill -9 67240345 

coh3vipa,~,7:20pm,3,#ps -ef | grep -i ha2
  ha2adm   16908559   33686161  - 13:38:15 ?         0:14 en.sapHA2_ASCS20 pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv
    root     131347     131386  - 19:22:32 ttyp0000  0:00 grep -i ha2
  ha2adm   16908570          1  - 13:38:02 ?         0:15 /usr/sap/HA2/ASCS20/exe/sapstartsrv pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2
  ha2adm   84017561   33686161  - 19:20:50 ?         0:00 ms.sapHA2_ASCS20 pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv
  ha2adm   33686161          1  - 13:38:14 ?         0:00 sapstart pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv

The sapstart process immediately restarts the SAP message server in place on COH3.
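The two ps -ef listings can also be compared mechanically: a restart in place shows a new PID for the message server under the same parent PID (the sapstart process). The following is a minimal sketch, assuming the eight-column ps -ef layout shown above; the helper names proc_ids and restarted_in_place are hypothetical.

```shell
#!/bin/sh
# proc_ids NAME: read `ps -ef` output on stdin and print "PID PPID" for the
# process whose command name (column 8) equals NAME exactly.
proc_ids() {
  awk -v name="$1" '$8 == name { print $2, $3 }'
}

# restarted_in_place BEFORE AFTER: succeed if the PID changed while the
# parent PID stayed the same. Each argument is one "PID PPID" pair.
restarted_in_place() {
  set -- $1 $2                 # split both pairs into four words
  [ "$#" -eq 4 ] || return 2   # reject missing or ambiguous input
  [ "$1" != "$3" ] && [ "$2" = "$4" ]
}

# Message server lines taken from the two listings above.
before_line='ha2adm 67240345 33686161 - 13:38:15 ? 0:05 ms.sapHA2_ASCS20 pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv'
after_line='ha2adm 84017561 33686161 - 19:20:50 ? 0:00 ms.sapHA2_ASCS20 pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv'
before=$(printf '%s\n' "$before_line" | proc_ids ms.sapHA2_ASCS20)
after=$(printf '%s\n' "$after_line" | proc_ids ms.sapHA2_ASCS20)

if restarted_in_place "$before" "$after"; then
  echo "restarted in place under the same parent"
fi
```

In a live test, the two pairs would come from ps -ef | proc_ids ms.sapHA2_ASCS20, run once before and once after the kill command.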

The failure is transparent: the workload is still running (SM66), and the lock entries that were generated are still in the enqueue table (SM12).

Look at the trace file of the dispatcher (dev_disp) to verify that it lost its connection with the message server and reconnected a few seconds later.

Failure of the enqueue replication server

This scenario simulates the failure of the enqueue replication server. The following table summarizes the execution of the test.

Table 4. Failure of the enqueue replication server
Scenario characteristics Description
Purpose Scope: Enqueue replication server

Action: Unplanned outage

Expected behavior SA z/OS restarts the enqueue replication server in place. The failure has no impact on the SAP workload.
Setup The LPARs COH1, COH2, and COH3 must be running, including all required z/OS resources and SAP-related resources, with:
  • The enqueue server, running on COH3.
  • The enqueue replication server running on COH1.
Preparation
  • Log on to an SAP application server.
  • Create entries in the enqueue table.
Execution Use the UNIX command kill once with signal -9 and once with signal -2 to kill the enqueue replication server process outside of SA z/OS.
Verifications
  • Check that the workload is still running (SM66).
  • Verify the number of entries in the enqueue table (SM12).
  • Look for error messages in the developer trace dev_disp and in the system log (SM21).
Observed results,
if unexpected:

Before the test, all SAP-related resources are in AVAILABLE status. The message and enqueue servers are running on COH3, and the enqueue replication server is running on COH1.

As described in Preparing for the test, log on to all SAP application servers and generate 10 lock entries in the enqueue table.

Then, simulate the failure: kill the enqueue replication server process using the UNIX command kill -9 <pid>. Use the ps -ef and kill commands as shown in Failure of the message server to retrieve the process ID of the enqueue replication server to stop it, and to verify that the process is restarted.

SA z/OS immediately restarts the enqueue replication server in place on COH1. After restarting, the enqueue replication server reconnects to the enqueue server and rebuilds its enqueue replication table in shared memory.

The failure is transparent: the workload is still running (SM66), and the lock entries that you generated are still in the enqueue table (SM12). Additionally, you can check the connection between the enqueue server and the enqueue replication server with the SAP utility program ensmon. Call ensmon with option 2 (Get replication information):

ha2adm> ensmon pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv 2

This provides the following output:

Try to connect to host ha2ascsv service sapdp20
get replinfo request executed successfully
Replication is enabled in server, repl. server is connected
Replication is active
…
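This check can be scripted by looking for the two status lines in the ensmon output. The following sketch relies only on the message texts shown above; the helper name replication_active is hypothetical, and other ensmon versions might word the messages differently.

```shell
#!/bin/sh
# replication_active: read ensmon output on stdin; succeed only if it
# reports both a connected replication server and active replication.
# The two message texts are copied from the sample output in this topic.
replication_active() {
  awk '
    $0 == "Replication is enabled in server, repl. server is connected" { connected = 1 }
    $0 == "Replication is active"                                       { active = 1 }
    END { exit !(connected && active) }
  '
}

if replication_active <<'EOF'
Try to connect to host ha2ascsv service sapdp20
get replinfo request executed successfully
Replication is enabled in server, repl. server is connected
Replication is active
EOF
then
  echo "replication is active"
fi
```

In a live test, the helper would read the utility output directly, for example ensmon pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv 2 | replication_active. Comparing whole lines keeps a differently worded status message from being mistaken for a connected replication server.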

Failure of the SAP start service

This scenario simulates the failure of the SAP start service process sapstartsrv. The following table summarizes the execution of the test.

Table 5. Failure of the SAP start service
Scenario characteristics Description
Purpose Scope: SAP start service of ASCS, SCS or ERS.

Action: Unplanned outage

Expected behavior SA z/OS restarts the SAP start service.

The failure has no impact on the SAP workload if sapstartsrv is killed with signal -9. The failure is also transparent if the sapstartsrv of the ERS is killed with signal -2. In both cases, the enqueue and message servers are not restarted.

The workload can be impacted if the sapstartsrv of the SCS or ASCS group is stopped with kill -2, because the enqueue and message servers are then restarted; see Table 2.
Setup The LPARs COH1, COH2, and COH3 must be running, including all required z/OS resources and SAP-related resources, with:
  • The ABAP enqueue server running on COH1.
  • The ABAP enqueue replication server running on COH2.
Preparation
  • Log on to an SAP application server.
  • Create workload (SGEN) and entries in the enqueue table.
Execution Use the UNIX command kill once with signal -9 and once with signal -2 to kill the SAP start service outside of SA z/OS. The signals have different effects on the processing of SA z/OS.
Verifications
  • Check that the workload is still running (SM66).
  • Verify the number of entries in the enqueue table (SM12).
  • Look for error messages in the developer trace dev_disp and in the system log (SM21).
  • Check the SAP System log. There should be no error about the ERS failure (SM21).
Observed results,
if unexpected:

Before the test, all SAP-related resources are in AVAILABLE status. The sapstartsrv and the ABAP SCS run on COH1. The workload and enqueue entries are created.

Then, simulate the failure: kill the SAP start service process sapstartsrv of the ABAP Central Services using the UNIX command kill -9 <pid>. Use the ps -ef and kill commands as shown in Failure of the message server to retrieve the process ID of the process, to stop it, and to verify that the process is moved or restarted.

You see that only the sapstartsrv of the ABAP SCS has been restarted in place. All other resources of the ABAP SCS Group are still running.

Stopping the sapstartsrv of an inactive ASCS, SCS, or ERS with the command kill -2 results in the same SA z/OS processing: sapstartsrv is restarted in place.

The result of stopping sapstartsrv with kill -2 of an active ASCS, SCS, or ERS is described in Failure of the sapstart process of the SAP Central Services and Failure of the sapstart process of the enqueue replication server.

Failure of the sapstart process of the SAP Central Services

This scenario simulates the failure of the sapstart process from the SAP Central Services and tests the behavior of SA z/OS. The following table summarizes the execution of the test.

Table 6. Failure of the sapstart process of the SAP Central Services
Scenario characteristics Description
Purpose Scope: sapstart process of ASCS or SCS.

Action: Unplanned outage

Expected behavior SA z/OS shows a PROBLEM/HARDDOWN status for the failed resource SAPHA2AST and restarts the ABAP Central Services (SAPHA2ASCS) on the LPAR where the enqueue replication server was active before the test. Before the restart, the corresponding SAP start service (SAPHA2A_SRV) moves because of its HasParent relationship to SAPHA2ASCS.

The enqueue replication server stops and restarts on a free LPAR. Before the restart, its corresponding SAP start service moves to the free LPAR because of the HasParent relationship to SAPHA2AERS or SAPHA2JERS.

The short interruption until the message server is started on the other LPAR can have an impact on the workload; see Table 2.
Setup The LPARs COH1, COH2, and COH3 must be running, including all required z/OS resources and SAP-related resources, with:
  • The ABAP SAP Central Services, running on COH1.
  • The ABAP enqueue replication server, running on COH2.
Preparation
  • Log on to an SAP application server.
  • Create workload (SGEN) and entries in the enqueue table.
Execution Use the UNIX command kill once with signal -9 and once with signal -2 to kill the sapstart process outside of SA z/OS. Both signals result in the same SA z/OS reaction.
Verifications
  • Check that the workload is still running (SM66).
  • Verify the number of entries in the enqueue table (SM12).
  • Look for error messages in the developer trace dev_disp and in the system log (SM21).
  • Check in the system log (SM21) when the SAP application server lost and re-established the connection to the message server.
Observed results,
if unexpected:

Before the test, all SAP-related resources are in AVAILABLE status. The enqueue servers are running on COH1, and the enqueue replication server is running on COH2. Workload and enqueue entries are created.

Then, simulate the failure: kill the sapstart process of the ABAP SAP Central Services using the UNIX command kill -9 <pid>. Use the ps -ef and kill commands as shown in Failure of the message server to retrieve the process ID of the sapstart process, to stop it, and to verify that the ASCS instance and its corresponding ERS instance have moved.

You will see that the sapstart process of the ABAP SAP Central Services has been restarted on LPAR COH2 (where the ERS was running). All other SAP resources of the ABAP SAP Central Services group have been restarted on LPAR COH2 at the same time. The ERS for ABAP has moved from LPAR COH2 to COH3.

Note: After the failure, the resource SAPHA2AST on COH1 has the SA z/OS agent status PROBLEM/HARDDOWN. With this status, the ASCS does not restart on COH1. To make COH1 available again for ASCS, you must update the SA z/OS agent status to AUTODOWN for SAPHA2AST on COH1.

In a production environment, you should always verify the cause of the process failure before you change the agent status from HARDDOWN to AUTODOWN.

Failure of the sapstart process of the enqueue replication server

This scenario simulates the failure of the sapstart process of the enqueue replication server and tests the behavior of SA z/OS. The following table summarizes the execution of the test.

Table 7. Failure of the sapstart process of the enqueue replication server
Scenario characteristics Description
Purpose Scope: sapstart process of ABAP or Java™ ERS.

Action: Unplanned outage

Expected behavior SA z/OS restarts the sapstart process in place.
  • If the service was stopped by a kill -2 command, then the enqueue replication server is also restarted in place.
  • If the service was stopped by a kill -9 command, then SA z/OS restarts neither the sapstart process nor the enqueue replication server itself.

The failure has no impact on the SAP workload.

Setup The LPARs COH1, COH2, and COH3 must be running, including all required z/OS resources and SAP-related resources, with:
  • The ABAP SAP Central Services running on COH1
  • The ABAP enqueue replication server running on COH2.
Preparation
  • Log on to an SAP application server.
  • Create entries in the enqueue table.
Execution Use the UNIX command kill once with signal -9 and once with signal -2 to kill the SAP start service process outside of SA z/OS. The signals have different effects on the processing of System Automation.
Verifications
  • Check that the workload is still running (SM66).
  • Verify the number of entries in the enqueue table (SM12).
  • Look for error messages in the developer trace dev_disp and in the system log (SM21).
Observed results,
if unexpected:

Before the test, all SAP-related resources are in AVAILABLE status. The ABAP enqueue replication server is running on COH3. The enqueue entries are created.

Then, simulate the failure: kill the sapstart process (the SAP start service) of the ABAP enqueue replication server using the UNIX command kill -2 <pid>. Use the ps -ef and kill commands as shown in Failure of the message server to retrieve the process ID of the sapstart process and to stop it. Afterward, verify that the ERS instance was restarted in place.

You will see that the sapstart process of the ABAP ERS has been restarted on the same LPAR COH3. Also, the enqueue replication server process has been restarted in place at the same time.

Note: If the sapstart process of the ERS has been killed with signal -9, then the sapstart process is not restarted by SA z/OS. This is because the sapcontrol GetProcessList command, which checks whether the instance is running OK, returns no errors (rc=3), even without the sapstart process running. The sapstart process is indeed not needed for running the ERS process properly. To clear this situation, perform the following steps:
  1. Additionally, kill the ERS process with signal -9. It can take up to 30 seconds until the ERS status changes to HARDDOWN.
  2. Change (for example, in SA z/OS) the agent status from HARDDOWN to AUTODOWN. The ERS instance, including the sapstart process, then restarts on the same LPAR.
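The return code mentioned in the note can be interpreted with a small helper such as the one below. This is a sketch, not part of the *SAPSRV policy: the function name describe_rc is hypothetical, and only the two GetProcessList codes that matter here are decoded (3 for all processes running correctly, 4 for all processes stopped). The instance number 21 in the commented usage is an assumption.

```sh
#!/bin/sh
# describe_rc RC: interpret the exit code of a sapcontrol GetProcessList call.
# Only the two codes relevant to this scenario are decoded; anything else is
# reported as-is for manual inspection.
describe_rc() {
    case "$1" in
        3) echo "all processes running correctly" ;;
        4) echo "all processes stopped" ;;
        *) echo "rc=$1: not a clean all-running or all-stopped result" ;;
    esac
}

describe_rc 3
describe_rc 4

# On a real system (not run here; instance number 21 is an assumption):
#   sapcontrol -nr 21 -function GetProcessList
#   describe_rc $?
```

Because rc=3 is still returned when only the sapstart process is missing, a check that relies on this code alone sees a healthy instance, which is consistent with the behavior described in the note.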

Failure of the NFS server

This scenario simulates the failure of the NFS server and tests the behavior of SA z/OS. It also measures the impact of the failure on the SAP workload.

The following table summarizes the execution of the test.

Table 8. Failure of the NFS server
Scenario characteristics Description
Purpose Scope: NFS server

Action: Unplanned outage

Expected behavior SA z/OS should restart the NFS server.

Existing NFS mounts should be reestablished.

The global file systems can be accessed from each SAP application server.

Setup COH1, COH2, and COH3 must be up, including all required z/OS resources and SAP-related resources. The NFS server runs on COH1. The address space name of the NFS server is assumed to be MVSNFSHA.
Preparation
  • Log on to an SAP application server.
Execution Cancel the address space MVSNFSHA on COH1.
Verifications
  • Check that the file systems are accessible (AL11).
  • Look for error messages in the system log (SM21).
Observed results,
if unexpected:

Before the test, all SAP-related resources are in AVAILABLE status. The NFS and enqueue servers are running on COH1.

Simulate the failure by canceling the address space of the NFS server on COH1 using the following command:

/C MVSNFSHA

Because the effective preference of COH2 was higher than that of COH3 at the time of the test, SA z/OS immediately restarted the NFS server on COH2 (along with its VIPA) and put the resource MVSNFSHA on COH3 in a RESTART status:

AOFKSTA5                 SA z/OS  - Command Dialogs      Line  1    of 3
Domain ID   = IPXFO     -------- DISPSTAT ----------     Date = 03/30/10
Operator ID = HEIKES                                     Time = 15:37:01
 A dispflgs  B setstate  C ingreq-stop  D thresholds  E explain  F info G tree
 H trigger   I service  J all children  K children  L all parents  M parents
CMD  RESOURCE     STATUS    SYSTEM   JOB NAME  A I S R T RS TYPE     Activity
---  -----------  --------  -------- --------  ------------ -------- ---------
     NFSSERV      DOWN      COH1     MVSNFSHA  - - - - - -  MVS      --none--
     NFSSERV      UP        COH2     MVSNFSHA  - - - - - -  MVS      --none--
     NFSSERV      RESTART   COH3     MVSNFSHA  - - - - - -  MVS      --none-- 

The SAP global file systems that are NFS-mounted on AIX® machine p570coh2v are accessible with SAP transaction AL11. No error messages are written to the SAP system log (SM21).
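If the DISPSTAT rows shown above are captured as plain text, the system on which the resource is UP can be extracted with a short filter. The sketch below assumes that such a capture is available (how the panel text is obtained is not covered here) and that the columns appear in the order RESOURCE, STATUS, SYSTEM, as in the panel. The function name system_with_status is hypothetical.

```sh
#!/bin/sh
# system_with_status RESOURCE STATUS: read DISPSTAT rows on stdin and print
# the SYSTEM column of each row for RESOURCE whose STATUS column matches.
system_with_status() {
    awk -v res="$1" -v st="$2" '$1 == res && $2 == st { print $3 }'
}

# Rows copied from the DISPSTAT panel above.
rows='     NFSSERV      DOWN      COH1     MVSNFSHA  - - - - - -  MVS      --none--
     NFSSERV      UP        COH2     MVSNFSHA  - - - - - -  MVS      --none--
     NFSSERV      RESTART   COH3     MVSNFSHA  - - - - - -  MVS      --none--'

printf '%s\n' "$rows" | system_with_status NFSSERV UP    # prints COH2
```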

Failure of a TCP/IP stack

This scenario simulates the failure of the TCP/IP stack on the system where the enqueue server and the NFS server are running, and tests the behavior of SA z/OS. It also measures the impact of the failure on the SAP workload.

The samples from the scenario in this section use a TCP/IP stack name of TCPIPA.

The following table summarizes the execution of the test.

Table 9. Failure of a TCP/IP stack
Scenario characteristics Description
Purpose Scope: TCP/IP stack

Action: Unplanned outage

Expected behavior SA z/OS restarts the TCP/IP stack. OMPROUTE goes into a PROBLEM/HARDDOWN status, and therefore the NFS server fails and SA z/OS moves it.

Because of the HasParent relationship definition in the SAP policy, SA z/OS stops SCS and restarts it on the LPAR where the enqueue replication server is running.

As a consequence, the enqueue replication server starts on a different LPAR, also because of the HasParent relationship definition in the SAP policy.

For the remote SAP application server connected to the database server running on the LPAR where the failure occurs, running transactions should be rolled back and work processes should either reconnect to the same database server or fail over to the standby database server.

For the SAP application server running on the other LPAR, the failure should have no impact.

Setup LPARs COH1, COH2, and COH3 must be up, including all required z/OS resources and SAP-related resources, with:
  • The enqueue server running on COH1
  • The enqueue replication server running on COH2
  • The NFS server running on COH1
Preparation
  • Log on to an SAP application server that is connected to a Db2® subsystem on COH1.
  • Create workload with client copy.
Execution Cancel the address space TCPIPA on COH1.
Verifications
  • Check whether the workload is still running (SM50/SM66).
  • Look for error messages in the enqueue log file, in the developer traces dev_disp and dev_w<x>, where x is the number of the work process it belongs to, and in the system log (SM21).
Observed results,
if unexpected:

Before the test, all SAP-related resources are in AVAILABLE status. The NFS server and the SCS (with enqueue and message servers) are running on COH1, and the enqueue replication server runs on COH2.

Simulate the failure by stopping TCPIPA on COH1 using the following command:

/C TCPIPA

Because the critical threshold is not reached, SA z/OS immediately restarts TCPIPA on COH1.

The failure of the TCP/IP stack leads to the failure of the NFS server and SCS on COH1.

SA z/OS immediately restarts the NFS server on another LPAR (COH2).

SA z/OS restarts SCS on the LPAR where the enqueue replication server is running, that is, COH2. The enqueue replication server was restarted on COH3:

INGKYST0                  SA z/OS  - Command Dialogs     Line  1     of 18
Domain ID   = IPXFO      -------- INGLIST   ---------    Date = 03/05/10
Operator ID = HEIKES          Sysplex = COHPLEX          Time = 14:00:25
 A Update   B Start    C Stop     D INGRELS  E INGVOTE  F INGINFO  G Members
 H DISPTRG  I INGSCHED J INGGROUP K INGCICS  L INGIMS   M DISPMTR  T INGTWS
 U User     X INGLKUP  / scroll
CMD Name         Type System    Compound      Desired      Observed    Nature
--- ------------ ---- --------  ------------  -----------  ----------  ------
    SAPHA2AER    APL  COH1      SATISFACTORY  UNAVAILABLE  SOFTDOWN
    SAPHA2AER    APL  COH2      SATISFACTORY  UNAVAILABLE  SOFTDOWN
    SAPHA2AER    APL  COH3      SATISFACTORY  AVAILABLE    AVAILABLE
    

The client copy is started interactively (not as a background job). The client copy fails because it cannot handle the TCP/IP failure transparently. After TCP/IP is up again and the SCS and enqueue replication server are restarted, the client copy can be restarted and finishes without errors and without lock entries in the lock table (this can be checked with SAP transaction SM12).

In SAP transaction dbacockpit under Performance the Thread Activity shows that work processes from SAP AS ihlscoh1 are connected to DB host coh1vipa. Verify with Db2 command -DIS THREAD(*) that all the threads are connected to COH1. Connection information for each work process can be found in the developer trace file dev_w<x>, where x is the number of the work process it belongs to.

The developer trace dev_disp shows that the dispatcher lost its connection with the message server and reconnected later on.

The SAP system log (SM21) shows error and recovery messages that are issued during the interval of the test, similar to the ones shown in Figure 1.

All the SAP-related resources are in AVAILABLE status after the failover. The NFS and enqueue servers are running on COH2. The enqueue replication server is running on COH3.
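The final placement of the enqueue replication server can also be checked mechanically against captured INGLIST rows. The following sketch assumes that the panel rows are available as plain text and that the columns appear in the order Name, Type, System, Compound, Desired, Observed, as in the panel above. The function name single_active_system is hypothetical. It prints the system only if exactly one row of the resource has the observed status AVAILABLE, and returns a nonzero exit status otherwise.

```sh
#!/bin/sh
# single_active_system RESOURCE: read INGLIST rows on stdin. If exactly one
# row for RESOURCE has Observed = AVAILABLE, print its System column and
# return 0; otherwise print nothing and return 1.
single_active_system() {
    awk -v res="$1" '
        $1 == res && $6 == "AVAILABLE" { n++; sys = $3 }
        END {
            if (n == 1) {
                print sys
                exit 0
            }
            exit 1
        }'
}

# Rows copied from the INGLIST panel above.
rows='    SAPHA2AER    APL  COH1      SATISFACTORY  UNAVAILABLE  SOFTDOWN
    SAPHA2AER    APL  COH2      SATISFACTORY  UNAVAILABLE  SOFTDOWN
    SAPHA2AER    APL  COH3      SATISFACTORY  AVAILABLE    AVAILABLE'

printf '%s\n' "$rows" | single_active_system SAPHA2AER    # prints COH3
```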

Failure of an LPAR

This scenario simulates the failure of the LPAR where the enqueue server and the NFS server are running and tests the behavior of SA z/OS. It also measures the impact of the failure on the SAP workload.

Table 10 summarizes the execution of the test.

Table 10. Failure of the LPAR where the ES and NFS servers are running
Scenario characteristics Description
Purpose Scope: One LPAR

Action: Unplanned outage

Expected behavior SA z/OS should restart the master of the failing Db2® subsystem on another LPAR in light mode. The Db2 subsystem goes down after successful startup.

SA z/OS should restart the NFS server on another LPAR.

SA z/OS should restart SCS on the LPAR where the enqueue replication server is running.

The enqueue replication server should stop or move to another LPAR if more than two LPARs are available.

For the SAP application server connected to the database server that is running on the failing LPAR, running transactions should be rolled back and work processes should fail over to the standby database server.

For the SAP application server that is connected to the database server running on the other LPAR, the failure should have no impact.

Setup The LPARs COH1, COH2, and COH3 must be up, including all required z/OS resources and SAP-related resources, with:
  • The SCS with enqueue server, running on COH2.
  • The enqueue replication server, running on another LPAR (COH3).
Preparation
  • Log on to an SAP application server that is connected to the database server on COH2.
  • Create a workload with client copy on the application server that is connected to the database server on COH2.
Execution System reset at the HMC for COH2 (Primary Automation Manager (PAM) in SA z/OS).
Verifications
  • Check whether the workload is still running (SM50, SM66).
  • Verify the number of entries in the enqueue table (SM12).
  • Look for error messages in the enqueue log file, in the developer traces dev_disp and dev_w<x>, where x is the number of the work process it belongs to, and in the system log (SM21).
Observed results,
if unexpected:

Before the test, all SAP-related resources are in UP status. The NFS server was started on COH2. The SCS was started on COH2 (the LPAR where the enqueue replication server was running before), and the enqueue replication server is running on COH1.

Simulate the failure by doing a system reset at the HMC.

Use the SA z/OS command INGLIST */*/COH2 to display the status of the resources on COH2. They are all displayed with a status INHIBITED/SYSGONE.

SA z/OS restarts the Db2 subsystem HA22 on COH1 in light mode in order to quickly release the retained locks. When the LPAR startup is complete, the Db2 subsystem in light mode stops and the primary Db2 subsystem on COH2 starts.

SA z/OS restarts the NFS server on COH1.

SA z/OS restarts SCS on the LPAR COH1, where the enqueue replication server is running.

SA z/OS restarts the enqueue replication server on COH3.

The transaction Db2 showed that the current DB host was now coh1vipa. Check with the Db2 command -DIS THREAD(*) that all the threads are connected to COH1. Connection information for each work process can be found in the developer trace files.

All SAP-related resources are in AVAILABLE status after the failover and are running on COH1, including the enqueue servers. The enqueue replication server is running on COH3.