Unplanned outage test scenarios
This topic describes the unplanned outage test scenarios that you can perform to verify the *SAPSRV add-on policy.
How SA z/OS® reacts to the failure of an SAP resource depends on the severity of the failure. For example, if the enqueue server or the message server fails (each is a single point of failure), the SAP system is no longer operable. By contrast, a failure of the enqueue replication server has no direct impact on a running SAP system.
The scenarios in this topic simulate an unplanned outage of an SAP resource in one of two ways:
- Sending a kill -2 signal to the process. This simulates, for example, an operator intervention. For a z/OS UNIX process, it means a normal stop that allows the process to perform cleanup actions, if any are implemented.
- Sending a kill -9 signal to the process. This simulates, for example, a program crash. For z/OS UNIX, this means that the operating system does not give control back to the process.
This topic describes the following test scenarios:
- Failure of the SAP enqueue server with active ERS instance
- Failure of the SAP enqueue server with active CF replication
- Failure of the message server
- Failure of the enqueue replication server
- Failure of the SAP start service
- Failure of the sapstart process of the SAP Central Services
- Failure of the sapstart process of the enqueue replication server
- Failure of the NFS server
- Failure of a TCP/IP stack
- Failure of an LPAR
For each test scenario, the following is documented:
- Purpose of the test
- Expected behavior
- Initial setup
- Preparation for the test
- Phases of the execution
- Observed results
Verifying the resource status describes the verification tasks that can be performed before and after each test to check the status of the SAP-related components. These steps are not repeated in this section. However, the description of each test may contain additional verification tasks that are specific to the scenario.
Failure of the SAP enqueue server with active ERS instance
This scenario simulates the failure of the enqueue server when an ERS instance is active, and tests the behavior of SA z/OS. You can also measure the impact of the failure on the SAP workload. For a scenario where replication into the coupling facility is used (no ERS active), see Failure of the SAP enqueue server with active CF replication.
The following table summarizes the execution of the test.
Scenario characteristics | Description |
---|---|
Purpose | Scope: Enqueue server Action: Unplanned outage |
Expected behavior | Example ABAP enqueue server: SA z/OS shows a PROBLEM or HARDDOWN status for the failed resource SAPHA2AEN and restarts its ASCS group (SAPHA2ASCS) on the LPAR where its enqueue replication server runs. Before that, its corresponding SAP start service (SAPHA2A_SRV) moves because of the HasParent relationship. The enqueue replication server stops and restarts on COH1. Before that, its corresponding SAP start service (SAPHA2AR_SR) moves to COH1 because of the HasParent relationship. The failure has no impact on the SAP workload. |
Setup | COH1, COH2, and COH3 must be running, including all required z/OS resources and SAP-related resources, with: |
Preparation | |
Execution | Use the UNIX command kill, once with signal -9 and once with signal -2, to kill the enqueue server process outside of SA z/OS. |
Verifications | |
Observed results, if unexpected: | |
Before the test, all SAP-related resources are in AVAILABLE status. The SAP enqueue server is running on COH2, and the enqueue replication server is running on a different LPAR (in this test, COH3).
To simulate the failure, kill the enqueue server process (en.sapHA2_ASCS20) using the UNIX command kill -9 <pid>.
The MVS log shows the failure of the enqueue server on COH2 and its restart on COH3:
COH2 12264 15:13:33.05 STC08573 00000210 BPXP023I THREAD 2129870000000001, IN PROCESS 16974063, WAS 719
719 00000210 TERMINATED BY SIGNAL SIGKILL, SENT FROM THREAD
719 00000210 2128440000000002, IN PROCESS 16973917, UID 40312, IN JOB HA2ADM.
COH2 12264 15:13:33.09 STC08497 00000000 AOF571I 15:13:33 : SAPHA2AEN SUBSYSTEM STATUS FOR JOB SAPHA2AE IS 720
720 00000000 ABENDING - SUBSYSTEM HAS SUFFERED A RECOVERABLE ERROR
COH2 12264 15:13:33.12 STC08497 00000000 AOF571I 15:13:33 : SAPHA2AEN SUBSYSTEM STATUS FOR JOB SAPHA2AE IS 721
721 00000000 STOPPED - ABENDED, RESTARTOPT=NEVER SPECIFIED
COH2 12264 15:13:33.30 STC08497 00000000 AOF571I 15:13:33 : SAPHA2ACV SUBSYSTEM STATUS FOR JOB SAPHA2AV IS 722
722 00000000 AUTODOWN - SET BY SHUTDOWN
COH2 12264 15:13:33.33 STC08497 00000000 AOF743I SHUTDOWN WILL NOT (RE)PROCESS SUBSYSTEM SAPHA2ACV AS IT IS 723
723 00000000 AUTODOWN
COH3 12264 15:13:33.35 STC08713 00000000 AOF570I 15:13:33 : ISSUED "INGUSS JOBNAME=HA2CPAS,/bin/tcsh -c 901
901 00000000 '/bin/cp -p ~/start_ASCS20_srv.COH3.log
901 00000000 ~/start_ASCS20_srv.COH3.log.old'" FOR SUBSYSTEM SAPHA2A_SRV -
901 00000000 MSGTYPE IS PRESTART
COH3 12264 15:13:33.36 STC08713 00000010 *HSAL6010A SAPHA2AEN/APL/COH2; INTERVENTION REQUIRED; BEYOND AUTOMATION
COH1 12264 15:13:33.37 STC08275 00000010 *HSAL6010A SAPHA2AEN/APL/COH2; INTERVENTION REQUIRED; BEYOND AUTOMATION
COH2 12264 15:13:33.42 STC08497 00000010 *HSAL6010A SAPHA2AEN/APL/COH2; INTERVENTION REQUIRED; BEYOND AUTOMATION
COH3 12264 15:13:33.43 STC01743 00000201 $HASP100 BPXAS ON STCINRDR
COH3 12264 15:13:33.44 STC08713 00000000 AOF571I 15:13:33 : SAPHA2A_SRV SUBSYSTEM STATUS FOR JOB SHA2ASR IS 904
904 00000000 STARTED - STARTUP FOR SAPHA2A_SRV/APL/COH3 IN PROGRESS
COH3 12264 15:13:33.48 STC01743 00000010 $HASP373 BPXAS STARTED
COH2 12264 15:13:33.48 STC08497 00000000 AOF571I 15:13:33 : SAPHA2AMS SUBSYSTEM STATUS FOR JOB SAPHA2AM IS 726
726 00000000 AUTOTERM - SET BY SHUTDOWN
COH3 12264 15:13:33.48 STC01743 00000210 BPXP024I BPXAS INITIATOR STARTED ON BEHALF OF JOB NETVIEW RUNNING IN ASID 001F
COH2 12264 15:13:33.49 STC08497 00000000 AOF571I 15:13:33 : SAPHA2AGW SUBSYSTEM STATUS FOR JOB SAPHA2AW IS 727
727 00000000 AUTOTERM - SET BY SHUTDOWN
COH2 12264 15:13:33.49 STC08497 00000000 AOF571I 15:13:33 : SAPHA2AST SUBSYSTEM STATUS FOR JOB SHA2AST IS 728
728 00000000 AUTOTERM - SET BY SHUTDOWN
COH3 12264 15:13:33.51 STC08713 00000000 AOF570I 15:13:33 : ISSUED "INGUSS JOBNAME=HA2ASCS,/bin/tcsh -c 908
908 00000000 '~/start_sapsrv HA2 ASCS20 ha2ascsv SHA2ASR 9 >&
908 00000000 ~/start_ASCS20_srv.COH3.log'" FOR SUBSYSTEM SAPHA2A_SRV - MSGTYPE
908 00000000 IS STARTUP
After the failure, the resource SAPHA2AEN on COH2 has the status PROBLEM or HARDDOWN.
On COH3, all resources of SAPHA2ASCS and its corresponding SAP start service SAPHA2A_SRV are in AVAILABLE status. The enqueue replication server and its corresponding SAP start service have stopped on COH3.
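The status changes in such an MVS log excerpt can be extracted mechanically. The following is a minimal sketch, assuming the layout seen in the excerpt above: an AOF571I line that ends in a multi-line message number, followed by a continuation line that starts with the same number. The function name status_changes and the regular expressions are illustrative, not part of SA z/OS.

```python
import re

# Layout assumed from the excerpt: the primary line ends in a number that is
# repeated at the start of the continuation line carrying the status.
PRIMARY = re.compile(
    r"^(?P<system>\S+) \d+ (?P<time>\d\d:\d\d:\d\d)\.\d\d \S+ \d+ "
    r"AOF571I \S+ : (?P<resource>\S+) SUBSYSTEM STATUS FOR JOB \S+ IS (?P<id>\d+)$"
)
CONTINUATION = re.compile(r"^(?P<id>\d+) \d+ (?P<status>[A-Z]+) - ")

# Two messages copied from the log excerpt.
SAMPLE = """\
COH2 12264 15:13:33.09 STC08497 00000000 AOF571I 15:13:33 : SAPHA2AEN SUBSYSTEM STATUS FOR JOB SAPHA2AE IS 720
720 00000000 ABENDING - SUBSYSTEM HAS SUFFERED A RECOVERABLE ERROR
COH2 12264 15:13:33.12 STC08497 00000000 AOF571I 15:13:33 : SAPHA2AEN SUBSYSTEM STATUS FOR JOB SAPHA2AE IS 721
721 00000000 STOPPED - ABENDED, RESTARTOPT=NEVER SPECIFIED
"""


def status_changes(log_text):
    """Return (system, time, resource, status) tuples in log order."""
    changes = []
    pending = None
    for raw in log_text.splitlines():
        line = raw.strip()
        head = PRIMARY.match(line)
        if head:
            pending = head
            continue
        tail = CONTINUATION.match(line)
        if pending and tail and tail["id"] == pending["id"]:
            changes.append(
                (pending["system"], pending["time"],
                 pending["resource"], tail["status"])
            )
        pending = None
    return changes


if __name__ == "__main__":
    for change in status_changes(SAMPLE):
        print(change)
```

Lines that do not belong to an AOF571I message, such as BPXP023I or HSAL6010A, are ignored by this sketch.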
When the enqueue server restarts on COH3, it reads the enqueue replication table from shared memory and rebuilds the enqueue table. Use the transaction SM12 to verify that the 10 lock entries you had generated are still in the enqueue table.
Look at the enqueue server log file (enquelog) to verify that the enqueue server restarted and the enqueue replication server is not running (there is no message specifying that replication is active).
Look at the developer trace file dev_disp to verify that the dispatcher lost its connection with the message server and reconnected later on.
The following log output shows the messages of the SAP system log (SM21) during the test interval.

Failure of the SAP enqueue server with active CF replication
This scenario simulates the failure of the enqueue server when coupling facility replication is active. It also tests the behavior of SA z/OS. In addition, you can measure the impact of the failure on the SAP workload. For a scenario where an ERS instance is active, see Table 1.
The scenario comprises two tests. The first test verifies that SAP's own restart mechanism restarts a failed enqueue server in place and that SA z/OS does not intervene inadvertently. The second test shows that SA z/OS takes action if SAP cannot restart the enqueue server in place.
Table 2 summarizes the execution of the test. For this scenario, a sample SAP System ID (SAPSID) of HA1 is used instead of HA2, which is used for the other scenarios.
Scenario characteristics | Description |
---|---|
Purpose | Scope: Enqueue server Action: Unplanned outage |
Expected behavior | Example ABAP enqueue server: The failure has no impact on the SAP workload. |
Setup | COH1, COH2, and COH3 must be running, including all required z/OS resources and SAP-related resources, with the enqueue server running on COH2. |
Preparation | |
Execution | First test: use the UNIX command kill once with signal -9, and then perform the verifications listed in the following row. Second test: kill the enqueue server seven times with signal -2. This exceeds SAP’s default restart limit, so SA z/OS moves the ASCS instance, including the enqueue server. |
Verifications | |
Observed results, if unexpected: | |
Before the test, all SAP-related resources are in AVAILABLE status. The SAP enqueue server is running on COH2.
First Test:
To simulate the failure, kill the enqueue server process (en.sapHA1_ASCS10) using the UNIX command kill -9 <pid>.
The MVS log shows the failure of the enqueue server on COH2 and its restart in place:
COH2 17242 15:13:09.78 STC07876 00000210 BPXP023I THREAD 21B5500000000001, IN PROCESS 33751804, WAS 032
032 00000210 TERMINATED BY SIGNAL SIGKILL, SENT FROM THREAD
032 00000210 21B5380000000001, IN PROCESS 33751108, UID 0, IN JOB VSCH.
COH2 17242 15:13:09.80 STC07815 00000000 AOF571I 15:13:09 : SAPHA1AEN SUBSYSTEM STATUS FOR JOB SAPHA1AE IS 034
034 00000000 ABENDING - SUBSYSTEM HAS SUFFERED A RECOVERABLE ERROR
COH2 17242 15:13:09.87 STC07815 00000000 AOF571I 15:13:09 : SAPHA1AEN SUBSYSTEM STATUS FOR JOB SAPHA1AE IS 035
035 00000000 RESTART - RESTARTING AFTER A RECOVERABLE ERROR
COH2 17242 15:13:09.94 STC07815 00000000 AOF313I 15:13:09 : START FOR SUBSYSTEM SAPHA1AEN (JOB SAPHA1AE) 036
036 00000000 WAS NOT ATTEMPTED - STATUS MISMATCH FIXED - SUBSYSTEM IS NOW
036 00000000 "EXTSTART".
COH2 17242 15:13:09.94 STC07815 00000000 AOF571I 15:13:09 : SAPHA1AEN SUBSYSTEM STATUS FOR JOB SAPHA1AE IS 037
037 00000000 EXTSTART - RESTART FOUND APPLICATION MONITOR STATUS TO BE ACTIVE
COH2 17242 15:13:12.66 00000210 IXC473I NOTE PAD SAPHA1.ENQUEUE.10 HAS BEEN DELETED 038
038 00000210 NOTE PAD CREATION TOD: 08/24/2017 16:55:20.543037
038 00000210 REQUESTER JOB NAME: HA1AS101 SYSTEM NAME:COH2
038 00000210 REASON: USER REQUEST
COH2 17242 15:13:12.97 STC07815 00000000 AOF571I 15:13:12 : SAPHA1AEN SUBSYSTEM STATUS FOR JOB SAPHA1AE IS 039
039 00000000 UP - UP MESSAGE RECEIVED
COH2 17242 15:13:14.01 00000210 IXL014I IXLCONN REQUEST FOR STRUCTURE IXCNP_SAPHA100 040
040 00000210 WAS SUCCESSFUL. JOBNAME: XCFAS ASID: 0006
040 00000210 CONNECTOR NAME: NOTEPAD_03000351 CFNAME: CF01
COH2 17242 15:13:14.02 00000210 IXC472I NOTE PAD SAPHA1.ENQUEUE.10 HAS BEEN CREATED 041
041 00000210 REQUESTER JOB NAME: HA1AS101 SYSTEM NAME: COH2
041 00000210 NOTE PAD CREATION TOD: 08/30/2017 15:13:13.776288
041 00000210 NUMBER OF NOTES: 56415 HOST STRUCTURE: IXCNP_SAPHA101
COH2 17242 15:13:14.04 STC07876 00000211 BPXF024I (HA1ADM) SAP HA1 instance ASCS10 enqueue replication 042
042 00000211 started
The enqueue server is restarted by SAP and SA z/OS does not intervene.
Second Test:
To simulate a failure where SAP can no longer restart the enqueue server in place, kill the enqueue server process (en.sapHA1_ASCS10) seven times using the UNIX command kill -2 <pid>.
The subsequent excerpt from the MVS log shows the failure of the enqueue server on COH2 and its restart by SA z/OS on another LPAR, here COH1.
COH2 17242 15:48:47.11 STC07815 00000000 AOF571I 15:48:47 : SAPHA1AEN SUBSYSTEM STATUS FOR JOB SAPHA1AE IS 399
399 00000000 STOPPING - SUBSYSTEM SHUTDOWN OUTSIDE OF AUTOMATION
COH2 17242 15:48:47.16 STC07815 00000000 AOF577E 15:48:47 : RECOVERY FOR SUBSYSTEM SAPHA1AEN (JOB SAPHA1AE) 400
400 00000000 HALTED - CRITICAL THRESHOLD EXCEEDED
COH2 17242 15:48:47.16 STC07815 00000000 *AOF575A 15:48:47 : JOB SAPHA1AE HAS ENDED - AUTOMATED RECOVERY NOT 401
401 00000000 IN PROGRESS - OPERATION INTERVENTION REQUIRED
COH2 17242 15:48:47.17 STC07815 00000000 AOF571I 15:48:47 : SAPHA1AEN SUBSYSTEM STATUS FOR JOB SAPHA1AE IS 402
402 00000000 STOPPED - SHUTDOWN OUTSIDE OF AUTOMATION, AOFRESTARTALWAYS IS OFF
COH2 17242 15:48:47.20 STC07815 00000000 AOF571I 15:48:47 : SAPHA1ACV SUBSYSTEM STATUS FOR JOB SAPHA1AV IS 403
403 00000000 AUTODOWN - SET BY SHUTDOWN
COH1 17242 15:48:47.20 STC07540 00000010 *HSAL6010A SAPHA1AEN/APL/COH2; INTERVENTION REQUIRED; BEYOND AUTOMATION
COH1 17242 15:48:47.22 STC07540 00000000 AOF571I 15:48:47 : SAPHA1A_SRV SUBSYSTEM STATUS FOR JOB SHA1ASR IS 259
259 00000000 RESTART - PREPARE SAPHA1A_SRV/APL/COH1 FOR STARTUP
COH2 17242 15:48:47.22 STC07815 00000000 AOF743I SHUTDOWN WILL NOT (RE)PROCESS SUBSYSTEM SAPHA1ACV AS IT IS 404
404 00000000 AUTODOWN
COH2 17242 15:48:47.23 STC07815 00000010 *HSAL6010A SAPHA1AEN/APL/COH2; INTERVENTION REQUIRED; BEYOND AUTOMATION
COH2 17242 15:48:47.25 STC07815 00000000 AOF571I 15:48:47 : SAPHA1AMS SUBSYSTEM STATUS FOR JOB SAPHA1AM IS 406
406 00000000 AUTOTERM - SET BY SHUTDOWN
COH1 17242 15:48:47.25 STC07540 00000000 AOF570I 15:48:47 : ISSUED "INGUSS JOBNAME=HA1CPAS,/bin/tcsh -c 260
260 00000000 '/bin/cp -p ~/start_ASCS10_srv.COH1.log
260 00000000 ~/start_ASCS10_srv.COH1.log.old'" FOR SUBSYSTEM SAPHA1A_SRV -
260 00000000 PHASE IS PRESTART
COH1 17242 15:48:47.27 STC07540 00000000 AOF571I 15:48:47 : SAPHA1A_SRV SUBSYSTEM STATUS FOR JOB SHA1ASR IS 261
261 00000000 STARTED - STARTUP FOR SAPHA1A_SRV/APL/COH1 IN PROGRESS
COH2 17242 15:48:47.28 STC07815 00000000 AOF571I 15:48:47 : SAPHA1AST SUBSYSTEM STATUS FOR JOB SHA1AST IS 407
407 00000000 AUTOTERM - SET BY SHUTDOWN
COH1 17242 15:48:47.32 STC07540 00000000 AOF570I 15:48:47 : ISSUED "INGUSS JOBNAME=HA1ASCS,/bin/tcsh -c 262
262 00000000 '~/start_sapsrv HA1 ASCS10 ha1ascsv SHA1ASR 9 >&
262 00000000 ~/start_ASCS10_srv.COH1.log'" FOR SUBSYSTEM SAPHA1A_SRV - PHASE
262 00000000 IS STARTUP
...
...
...
COH1 17242 15:49:24.48 STC04513 00000211 BPXF024I (HA1ADM) SHA1AST ACTIVE
COH1 17242 15:49:24.50 STC07540 00000000 AOF571I 15:49:24 : SAPHA1AST SUBSYSTEM STATUS FOR JOB SHA1AST IS 331
331 00000000 UP - UP MESSAGE RECEIVED
COH1 17242 15:49:24.52 STC07540 00000000 AOF313I 15:49:24 : START FOR SUBSYSTEM SAPHA1AEN (JOB SAPHA1AE) 332
332 00000000 WAS NOT ATTEMPTED - STATUS MISMATCH FIXED - SUBSYSTEM IS NOW
332 00000000 "EXTSTART".
COH1 17242 15:49:24.52 STC07540 00000000 AOF313I 15:49:24 : START FOR SUBSYSTEM SAPHA1AMS (JOB SAPHA1AM) 333
333 00000000 WAS NOT ATTEMPTED - STATUS MISMATCH FIXED - SUBSYSTEM IS NOW
333 00000000 "EXTSTART".
COH1 17242 15:49:24.53 STC07540 00000000 AOF571I 15:49:24 : SAPHA1AEN SUBSYSTEM STATUS FOR JOB SAPHA1AE IS 334
334 00000000 EXTSTART - RESTART FOUND APPLICATION MONITOR STATUS TO BE ACTIVE
COH1 17242 15:49:24.53 STC07540 00000000 AOF571I 15:49:24 : SAPHA1AMS SUBSYSTEM STATUS FOR JOB SAPHA1AM IS 335
335 00000000 EXTSTART - RESTART FOUND APPLICATION MONITOR STATUS TO BE ACTIVE
COH1 17242 15:49:27.57 STC07540 00000000 AOF571I 15:49:27 : SAPHA1AEN SUBSYSTEM STATUS FOR JOB SAPHA1AE IS 336
336 00000000 UP - UP MESSAGE RECEIVED
COH1 17242 15:49:27.57 STC07540 00000000 AOF571I 15:49:27 : SAPHA1AMS SUBSYSTEM STATUS FOR JOB SAPHA1AM IS 337
337 00000000 UP - UP MESSAGE RECEIVED
After the failure, the resource SAPHA1AEN on COH2 has the status PROBLEM or HARDDOWN.
On COH1, all resources of SAPHA1ASCS and its corresponding SAP start service SAPHA1A_SRV are in AVAILABLE status.
When the enqueue server restarts on COH1, it reads the enqueue replication table from the Coupling Facility and rebuilds the enqueue table. Use the transaction SM12 to verify that the 10 lock entries that you generated are still in the enqueue table.
Look at the developer trace file dev_disp to verify that the dispatcher lost its connection with the message server and reconnected later on.
The following log output shows the messages of the SAP system log (SM21) during the test interval.

Failure of the message server
This scenario simulates the failure of the message server and tests the behavior of SA z/OS.
The following table summarizes the execution of the test.
Scenario characteristics | Description |
---|---|
Purpose | Scope: Message server Action: Unplanned outage |
Expected behavior | Example ABAP message server: SA z/OS waits until SAP itself automatically restarts the message server process. The short interruption until the message server is restarted can have an impact on the workload, see Table 2. |
Setup | The LPARs COH1, COH2, and COH3 must be running, including all required z/OS resources and SAP-related resources, with: |
Preparation | |
Execution | Use the UNIX command kill, once with signal -9 and once with signal -2, to kill the message server process outside of SA z/OS. Both signals have the same effect on the SA z/OS processing. |
Verifications | |
Observed results, if unexpected: | |
Before the test, all SAP-related resources are in AVAILABLE status. The message and enqueue servers are running on COH3, and the enqueue replication server is running on COH1.
The workload and enqueue entries are created.
Then, simulate the failure of the ABAP message server process (ms.sapHA2_ASCS20) using the UNIX command kill -9 <pid>:
coh3vipa,~,7:15pm,1,#ps -ef | grep -i ha2
ha2adm 16908559 33686161 - 13:38:15 ? 0:13 en.sapHA2_ASCS20 pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv
root 131345 131386 - 19:15:28 ttyp0000 0:00 grep -i ha2
ha2adm 16908570 1 - 13:38:02 ? 0:15 /usr/sap/HA2/ASCS20/exe/sapstartsrv pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2
ha2adm 67240345 33686161 - 13:38:15 ? 0:05 ms.sapHA2_ASCS20 pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv
ha2adm 33686161 1 - 13:38:14 ? 0:00 sapstart pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv
coh3vipa,~,7:15pm,2,#kill -9 67240345
coh3vipa,~,7:20pm,3,#ps -ef | grep -i ha2
ha2adm 16908559 33686161 - 13:38:15 ? 0:14 en.sapHA2_ASCS20 pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv
root 131347 131386 - 19:22:32 ttyp0000 0:00 grep -i ha2
ha2adm 16908570 1 - 13:38:02 ? 0:15 /usr/sap/HA2/ASCS20/exe/sapstartsrv pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2
ha2adm 84017561 33686161 - 19:20:50 ? 0:00 ms.sapHA2_ASCS20 pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv
ha2adm 33686161 1 - 13:38:14 ? 0:00 sapstart pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv
The sapstart process immediately restarts the SAP message server in place, on COH3.
The failure is transparent: the workload is still running (SM66), and the lock entries that were generated are still in the enqueue table (SM12).
Look at the trace file of the dispatcher (dev_disp) to verify that it lost its connection with the message server and reconnected a few seconds later.
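The comparison of the two ps -ef listings can also be scripted. The sketch below assumes the column order of the excerpt (UID, PID, PPID, C, STIME, TTY, TIME, CMD). The function names are illustrative, and treating "new PID, same parent PID" as a restart in place by the same sapstart process is a heuristic, not an SAP-defined check.

```python
# Rows copied from the two listings in the excerpt (before and after the kill).
BEFORE = """\
ha2adm 67240345 33686161 - 13:38:15 ? 0:05 ms.sapHA2_ASCS20 pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv
ha2adm 33686161 1 - 13:38:14 ? 0:00 sapstart pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv
"""
AFTER = """\
ha2adm 84017561 33686161 - 19:20:50 ? 0:00 ms.sapHA2_ASCS20 pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv
ha2adm 33686161 1 - 13:38:14 ? 0:00 sapstart pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv
"""


def find_processes(ps_output, command_prefix):
    """Return (pid, ppid) for each row whose command starts with the prefix."""
    matches = []
    for line in ps_output.splitlines():
        # Assumed columns: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if (len(fields) == 8 and fields[1].isdigit() and fields[2].isdigit()
                and fields[7].startswith(command_prefix)):
            matches.append((int(fields[1]), int(fields[2])))
    return matches


def restarted_in_place(ps_before, ps_after, command_prefix):
    """Heuristic: exactly one process each time, new PID, unchanged parent."""
    before = find_processes(ps_before, command_prefix)
    after = find_processes(ps_after, command_prefix)
    if len(before) != 1 or len(after) != 1:
        return False
    (old_pid, old_ppid), (new_pid, new_ppid) = before[0], after[0]
    return old_pid != new_pid and old_ppid == new_ppid


if __name__ == "__main__":
    print(find_processes(BEFORE, "ms.sapHA2_ASCS20"))
    print(restarted_in_place(BEFORE, AFTER, "ms.sapHA2_ASCS20"))
```

In the excerpt, the message server PID changes from 67240345 to 84017561 while the parent stays 33686161, the sapstart process, which matches the restart in place described above.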
Failure of the enqueue replication server
This scenario simulates the failure of the enqueue replication server. The following table summarizes the execution of the test.
Scenario characteristics | Description |
---|---|
Purpose | Scope: Enqueue replication server Action: Unplanned outage |
Expected behavior | SA z/OS restarts the enqueue replication server in place. The failure has no impact on the SAP workload. |
Setup | The LPARs COH1, COH2, and COH3 must be running, including all required z/OS resources and SAP-related resources, with: |
Preparation | |
Execution | Use the UNIX command kill, once with signal -9 and once with signal -2, to kill the enqueue replication server process outside of SA z/OS. |
Verifications | |
Observed results, if unexpected: | |
Before the test, all SAP-related resources are in AVAILABLE status. The message and enqueue servers are running on COH3, and the enqueue replication server is running on COH1.
As described in Preparing for the test, log on to all SAP application servers and generate 10 lock entries in the enqueue table.
Then, simulate the failure: kill the enqueue replication server process using the UNIX command kill -9 <pid>. Use the ps -ef and kill commands as shown in Failure of the message server to retrieve the process ID of the enqueue replication server, to stop it, and to verify that the process is restarted.
SA z/OS immediately restarts the enqueue replication server in place on COH1. After restarting, the enqueue replication server reconnects to the enqueue server and rebuilds its enqueue replication table in shared memory.
The failure is transparent: the workload is still running (SM66), and the lock entries that you generated are still in the enqueue table (SM12). Additionally, you can check the connection between the enqueue server and the enqueue replication server with the SAP utility program ensmon.
Call ensmon with option 2 (that is: Get replication information):
ha2adm> ensmon pf=/usr/sap/HA2/SYS/profile/HA2_ASCS20_ha2ascsv 2
This provides the following output:
Try to connect to host ha2ascsv service sapdp20
get replinfo request executed successfully
Replication is enabled in server, repl. server is connected
Replication is active
…
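If the ensmon output is captured in a file or variable, the two replication facts can be checked with a few lines of code. This is a minimal sketch that relies only on the message texts shown in the excerpt above. The function name replication_state is hypothetical, and the wording that ensmon prints when replication is not active is not covered by this document, so the sketch only tests for the presence of the known lines.

```python
# Output lines copied from the ensmon excerpt.
SAMPLE = """\
Try to connect to host ha2ascsv service sapdp20
get replinfo request executed successfully
Replication is enabled in server, repl. server is connected
Replication is active
"""


def replication_state(ensmon_output):
    """Report whether the two expected replication messages are present."""
    lines = [line.strip() for line in ensmon_output.splitlines()]
    return {
        "server_connected": any(
            "repl. server is connected" in line for line in lines
        ),
        "active": "Replication is active" in lines,
    }


if __name__ == "__main__":
    print(replication_state(SAMPLE))
```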
Failure of the SAP start service
This scenario simulates the failure of the SAP start service process sapstartsrv. The following table summarizes the execution of the test.
Scenario characteristics | Description |
---|---|
Purpose | Scope: SAP start service of ASCS, SCS, or ERS. Action: Unplanned outage |
Expected behavior | SA z/OS restarts the SAP start service. The failure has no impact on the SAP workload, except that the workload can be impacted if sapstartsrv of the SCS or ASCS group was stopped with kill -2, because the enqueue and message servers are then restarted, see Table 2. |
Setup | The LPARs COH1, COH2, and COH3 must be running, including all required z/OS resources and SAP-related resources, with: |
Preparation | |
Execution | Use the UNIX command kill, once with signal -9 and once with signal -2, to kill the SAP start service outside of SA z/OS. The signals have different effects on the processing of SA z/OS. |
Verifications | |
Observed results, if unexpected: | |
Before the test, all SAP-related resources are in AVAILABLE status. The sapstartsrv and the ABAP SCS run on COH1. The workload and enqueue entries are created.
Then, simulate the failure: kill the SAP start service sapstartsrv of the ABAP Central Services using the UNIX command kill -9 <pid>. Use the ps -ef and kill commands as shown in Failure of the message server to retrieve the process ID of the process, to stop it, and to verify that the process is moved or restarted.
You see that only the sapstartsrv of the ABAP SCS has been restarted in place. All other resources of the ABAP SCS group are still running.
Stopping sapstartsrv of an inactive ASCS, SCS, or ERS with the command kill -2 causes the same SA z/OS processing: sapstartsrv is restarted in place.
The result of stopping sapstartsrv with kill -2 of an active ASCS, SCS, or ERS is described in Failure of the sapstart process of the SAP Central Services and Failure of the sapstart process of the enqueue replication server.
Failure of the sapstart process of the SAP Central Services
This scenario simulates the failure of the sapstart process of the SAP Central Services and tests the behavior of SA z/OS. The following table summarizes the execution of the test.
Scenario characteristics | Description |
---|---|
Purpose | Scope: sapstart process of ASCS or SCS. Action: Unplanned outage |
Expected behavior | SA z/OS shows a PROBLEM/HARDDOWN status for the failed resource SAPHA2AST and restarts the ABAP Central Services (SAPHA2ASCS) on the LPAR where the enqueue replication server was active before the test. Before the restart, its corresponding SAP start service (SAPHA2A_SRV) moves because of the HasParent relationship to SAPHA2ASCS. The enqueue replication server stops and restarts on a free LPAR. Before the restart, its corresponding SAP start service moves to the free LPAR because of the HasParent relationship. The short interruption until the message server is started on the other LPAR can have an impact on the workload, see Table 2. |
Setup | The LPARs COH1, COH2, and COH3 must be running, including all required z/OS resources and SAP-related resources, with: |
Preparation | |
Execution | Use the UNIX command kill, once with signal -9 and once with signal -2, to kill the sapstart process outside of SA z/OS. Both signals result in the same SA z/OS reaction. |
Verifications | |
Observed results, if unexpected: | |
Before the test, all SAP-related resources are in AVAILABLE status. The enqueue servers are running on COH1, and the enqueue replication server is running on COH2. Workload and enqueue entries are created.
Then, simulate the failure: kill the sapstart process of the ABAP SAP Central Services using the UNIX command kill -9 <pid>. Use the ps -ef and kill commands as shown in Failure of the message server to retrieve the process ID of the sapstart process, to stop it, and to verify that the ASCS instance and its corresponding ERS instance have moved.
You will see that the sapstart process of the ABAP SAP Central Services has been restarted on LPAR COH2 (where the ERS was running). All other SAP resources of the ABAP SAP Central Services group have been restarted on LPAR COH2 at the same time. The ERS for ABAP was moved from LPAR COH2 to COH3.
In a productive environment, you should always verify the cause of the process failure before you change the agent status from HARDDOWN to AUTODOWN.
Failure of the sapstart process of the enqueue replication server
This scenario simulates the failure of the sapstart process of the enqueue replication server and tests the behavior of SA z/OS. The following table summarizes the execution of the test.
Scenario characteristics | Description |
---|---|
Purpose | Scope: sapstart process of ABAP or Java™ ERS. Action: Unplanned outage |
Expected behavior | SA z/OS restarts the sapstart process in place. The failure has no impact on the SAP workload. |
Setup | The LPARs COH1, COH2, and COH3 must be running, including all required z/OS resources and SAP-related resources, with: |
Preparation | |
Execution | Use the UNIX command kill, once with signal -9 and once with signal -2, to kill the sapstart process outside of SA z/OS. The signals have different effects on the processing of SA z/OS. |
Verifications | |
Observed results, if unexpected: | |
Before the test, all SAP-related resources are in AVAILABLE status. The ABAP enqueue replication server is running on COH3. The enqueue entries are created.
Then, simulate the failure: kill the sapstart process of the ABAP ERS using the UNIX command kill -2 <pid>. Use the ps -ef and kill commands as shown in Failure of the message server to retrieve the process ID of the sapstart process, to stop it, and to verify that the ERS instance was restarted in place.
You will see that the sapstart process of the ABAP ERS has been restarted on the same LPAR, COH3. The enqueue replication server process has also been restarted in place at the same time.
If the sapstart process of the ERS has been killed with signal -9, the sapstart process is not restarted by SA z/OS. This is because the sapcontrol GetProcessList command, which checks whether the instance is running OK, returns no errors (rc=3), even without the sapstart process running. The sapstart process is indeed not needed for running the ERS process properly. To clear this situation, perform the following steps:
- Additionally, kill the ERS process with signal -9. It can take up to 30 seconds until the ERS status changes to HARDDOWN.
- Change (for example, in SA z/OS) the agent status from HARDDOWN to AUTODOWN. Then, the ERS instance, including the sapstart process, restarts on the same LPAR.
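The role of the return code can be illustrated with a short sketch. The exit-code meanings below are an assumption taken from the sapcontrol usage text and should be verified against your SAP kernel release. The function names and the instance number parameter are hypothetical, and the sketch does not show how SA z/OS itself evaluates the result.

```python
import subprocess

# Assumed meaning of selected sapcontrol exit codes (verify for your release).
EXIT_CODES = {
    0: "last web method call successful",
    1: "last web method call failed or invalid parameter",
    3: "GetProcessList succeeded, all processes running correctly",
    4: "GetProcessList succeeded, all processes stopped",
}


def describe_rc(rc):
    """Translate a sapcontrol exit code into a readable text."""
    return EXIT_CODES.get(rc, f"unexpected exit code {rc}")


def instance_reported_ok(rc):
    """True if GetProcessList reports every instance process as running."""
    return rc == 3


def get_process_list_rc(instance_number):
    """Run sapcontrol GetProcessList for one instance and return its exit code.

    instance_number is a placeholder string; use your ERS instance number.
    """
    result = subprocess.run(
        ["sapcontrol", "-nr", instance_number, "-function", "GetProcessList"],
        capture_output=True,
        text=True,
    )
    return result.returncode
```

Because sapstart is not one of the processes that GetProcessList reports, a check of this kind keeps returning rc=3 after sapstart alone has been killed, which is why the additional steps above are needed.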
Failure of the NFS server
This scenario simulates the failure of the NFS server and tests the behavior of SA z/OS. It also measures the impact of the failure on the SAP workload.
The following table summarizes the execution of the test.
Scenario characteristics | Description |
---|---|
Purpose | Scope: NFS server Action: Unplanned outage |
Expected behavior | SA z/OS should restart the NFS server. Existing NFS mounts should be reestablished. The global file systems can be accessed from each SAP application server. |
Setup | COH1, COH2, and COH3 must be up, including all required z/OS resources and SAP-related resources. The NFS server runs on COH1. The address space name of the NFS server is assumed to be MVSNFSHA. |
Preparation | |
Execution | Cancel the address space MVSNFSHA on COH1. |
Verifications | |
Observed results, if unexpected: | |
Before the test, all SAP-related resources are in AVAILABLE status. The NFS and enqueue servers are running on COH1.
Simulate the failure by canceling the address space of the NFS server on COH1 using the following command:
/C MVSNFSHA
Because at the time of the test, the effective preference of COH2 was higher than that of COH3, SA z/OS immediately restarted the NFS server on COH2 (along with its VIPA) and put the resource MVSNFSHA on COH3 in a RESTART status:
AOFKSTA5 SA z/OS - Command Dialogs Line 1 of 3
Domain ID = IPXFO -------- DISPSTAT ---------- Date = 03/30/10
Operator ID = HEIKES Time = 15:37:01
A dispflgs B setstate C ingreq-stop D thresholds E explain F info G tree
H trigger I service J all children K children L all parents M parents
CMD RESOURCE STATUS SYSTEM JOB NAME A I S R T RS TYPE Activity
--- ----------- -------- -------- -------- ------------ -------- ---------
NFSSERV DOWN COH1 MVSNFSHA - - - - - - MVS --none--
NFSSERV UP COH2 MVSNFSHA - - - - - - MVS --none--
NFSSERV RESTART COH3 MVSNFSHA - - - - - - MVS --none--
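If the panel text is saved, for example from a screen capture, the system on which a resource is active can be picked out with a few lines of code. The sketch assumes the flattened row layout shown above (resource, status, system, job name, then further columns). The function name systems_with_status is illustrative and not an SA z/OS interface.

```python
# Rows copied from the DISPSTAT excerpt, including the column header line.
PANEL = """\
CMD RESOURCE STATUS SYSTEM JOB NAME A I S R T RS TYPE Activity
NFSSERV DOWN COH1 MVSNFSHA - - - - - - MVS --none--
NFSSERV UP COH2 MVSNFSHA - - - - - - MVS --none--
NFSSERV RESTART COH3 MVSNFSHA - - - - - - MVS --none--
"""


def systems_with_status(panel_text, resource, status):
    """Return the systems on which `resource` is shown with `status`."""
    systems = []
    for line in panel_text.splitlines():
        fields = line.split()
        # Assumed order in a data row: RESOURCE STATUS SYSTEM JOBNAME ...
        if len(fields) >= 4 and fields[0] == resource and fields[1] == status:
            systems.append(fields[2])
    return systems


if __name__ == "__main__":
    print(systems_with_status(PANEL, "NFSSERV", "UP"))
```

For the excerpt, the only system that shows NFSSERV with status UP is COH2, which matches the restart described above.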
The SAP global file systems that are NFS-mounted on AIX® machine p570coh2v are accessible with SAP transaction AL11. No error messages are written to the SAP system log (SM21).
Failure of a TCP/IP stack
This scenario simulates the failure of the TCP/IP stack on the system where the enqueue server and the NFS server are running, and tests the behavior of SA z/OS. It also measures the impact of the failure on the SAP workload.
The samples from the scenario in this section use a TCP/IP stack name of TCPIPA.
The following table summarizes the execution of the test.
Scenario characteristics | Description |
---|---|
Purpose | Scope: TCP/IP stack Action: Unplanned outage |
Expected behavior | SA z/OS restarts the TCP/IP stack. OMPROUTE goes into a PROBLEM/HARDDOWN status, and therefore the NFS server fails and SA z/OS moves it. Because of the HasParent relationship definition in the SAP policy, SA z/OS stops SCS and restarts it on the LPAR where the enqueue replication server is running. As a consequence, the enqueue replication server starts on a different LPAR, also because of the HasParent relationship definition in the SAP policy. For the remote SAP application server connected to the database server running on the LPAR where the failure occurs, running transactions should be rolled back, and work processes should either reconnect to the same database server or fail over to the standby database server. For the SAP application server running on the other LPAR, the failure should have no impact. |
Setup | LPARs COH1, COH2, and COH3 must be up, including all required z/OS resources and SAP-related resources, with: |
Preparation | |
Execution | Cancel the address space TCPIPA on COH1. |
Verifications | |
Observed results, if unexpected: | |
Before the test, all SAP-related resources are in AVAILABLE status. The NFS server and the SCS (with enqueue and message servers) are running on COH1, and the enqueue replication server runs on COH2.
Simulate the failure by stopping TCPIPA on COH1 using the following command:
/C TCPIPA
Because the critical threshold is not reached, SA z/OS immediately restarts TCPIPA on COH1.
The failure of the TCP/IP stack leads to the failure of the NFS server and SCS on COH1.
SA z/OS immediately restarts the NFS server on another LPAR (COH2).
SA z/OS restarts SCS on the LPAR where the enqueue replication server is running, that is, COH2. The enqueue replication server is restarted on COH3:
INGKYST0 SA z/OS - Command Dialogs Line 1 of 18
Domain ID = IPXFO -------- INGLIST --------- Date = 03/05/10
Operator ID = HEIKES Sysplex = COHPLEX Time = 14:00:25
A Update B Start C Stop D INGRELS E INGVOTE F INGINFO G Members
H DISPTRG I INGSCHED J INGGROUP K INGCICS L INGIMS M DISPMTR T INGTWS
U User X INGLKUP / scroll
CMD Name Type System Compound Desired Observed Nature
--- ------------ ---- -------- ------------ ----------- ---------- ------
SAPHA2AER APL COH1 SATISFACTORY UNAVAILABLE SOFTDOWN
SAPHA2AER APL COH2 SATISFACTORY UNAVAILABLE SOFTDOWN
SAPHA2AER APL COH3 SATISFACTORY AVAILABLE AVAILABLE
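When a panel capture like the one above is saved as text, the placement of a resource can be checked with a small script. The following sketch assumes that data rows consist of exactly the six populated columns shown (Name, Type, System, Compound, Desired, Observed) separated by whitespace. The helper names `parse_inglist_rows` and `active_system` are hypothetical and not part of SA z/OS.

```python
from typing import List, NamedTuple, Optional


class InglistRow(NamedTuple):
    name: str
    type: str
    system: str
    compound: str
    desired: str
    observed: str


def parse_inglist_rows(text: str) -> List[InglistRow]:
    """Extract data rows from a saved INGLIST capture.

    Assumption: a data row has six whitespace-separated fields and its
    second field is a resource type (only APL and APG are recognized here).
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 6 and fields[1] in {"APL", "APG"}:
            rows.append(InglistRow(*fields))
    return rows


def active_system(rows: List[InglistRow], name: str) -> Optional[str]:
    """Return the one system where `name` is observed AVAILABLE, else None."""
    systems = [r.system for r in rows
               if r.name == name and r.observed == "AVAILABLE"]
    return systems[0] if len(systems) == 1 else None


capture = """\
CMD Name Type System Compound Desired Observed Nature
SAPHA2AER APL COH1 SATISFACTORY UNAVAILABLE SOFTDOWN
SAPHA2AER APL COH2 SATISFACTORY UNAVAILABLE SOFTDOWN
SAPHA2AER APL COH3 SATISFACTORY AVAILABLE AVAILABLE
"""
print(active_system(parse_inglist_rows(capture), "SAPHA2AER"))
```

Returning None when zero or several systems report AVAILABLE keeps an ambiguous capture from being mistaken for a successful failover.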
The client copy is started interactively (not as a background job). The client copy fails because it cannot handle the TCP/IP failure transparently. After TCP/IP is up again and SCS and the enqueue replication server are restarted, the client copy can be restarted and finishes without errors and without lock entries in the lock table (this can be checked with SAP transaction SM12).
In SAP transaction dbacockpit, Thread Activity under Performance shows that work processes from SAP AS ihlscoh1 are connected to DB host coh1vipa. Verify with the Db2 command -DIS THREAD(*) that all the threads are connected to COH1. Connection information for each work process can be found in the developer trace file dev_w<x>, where x is the number of the work process.
The developer trace dev_disp shows that the dispatcher lost its connection with the message server and reconnected later on.
The SAP system log (SM21) shows error and recovery messages that are issued during the interval of the test, similar to the ones shown in Figure 1.
All the SAP-related resources are in AVAILABLE status after the failover. The NFS and enqueue servers are running on COH2. The enqueue replication server is running on COH3.
Failure of an LPAR
This scenario simulates the failure of the LPAR where the enqueue server and the NFS server are running, and tests the behavior of SA z/OS. It also measures the impact of the failure on the SAP workload.
Table 10 summarizes the execution of the test.
Scenario characteristics | Description |
---|---|
Purpose | Scope: One LPAR Action: Unplanned outage |
Expected behavior | SA z/OS should restart the master of the failing Db2® subsystem on another LPAR in light mode. The Db2 subsystem goes down after successful startup. SA z/OS should restart the NFS server on another LPAR. SA z/OS should restart SCS on the LPAR where the enqueue replication server is running. The enqueue replication server should stop, or move to another LPAR if more than two LPARs are available. For the SAP application server connected to the database server running on the failing LPAR, running transactions should be rolled back and work processes should fail over to the standby database server. For the SAP application server that is connected to the database server running on the other LPAR, the failure should have no impact. |
Setup | The LPARs COH1, COH2, and COH3 must be up, including all required z/OS resources and SAP-related resources, with:
|
Preparation |
|
Execution | System reset at the HMC for COH2 (Primary Automation Manager (PAM) in SA z/OS). |
Verifications |
|
Observed results, if unexpected: |
Before the test, all SAP-related resources are in UP status. The NFS server is running on COH2. SCS is running on COH2 (the LPAR where the enqueue replication server was running before), and the enqueue replication server is running on COH1.
Simulate the failure by doing a system reset at the HMC.
Use the SA z/OS command INGLIST */*/COH2 to display the status of the resources on COH2. They are all displayed with the status INHIBITED/SYSGONE.
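The same check can be scripted against a saved list of resource names and statuses. The sketch below approximates the name/type/system wildcard of the command above with Python's `fnmatch` module, which is an assumption and not a statement about how INGLIST matches names. The function name `all_in_status` and the sample statuses for COH1 and COH3 are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Dict


def all_in_status(statuses: Dict[str, str], pattern: str, expected: str) -> bool:
    """Return True if at least one resource matches `pattern` and every
    matching resource has the `expected` compound/observed status."""
    matched = [name for name in statuses if fnmatchcase(name, pattern)]
    return bool(matched) and all(statuses[name] == expected for name in matched)


# Hypothetical sample data in name/type/system form.
sample = {
    "SAPHA2AER/APL/COH1": "SATISFACTORY/SOFTDOWN",
    "SAPHA2AER/APL/COH2": "INHIBITED/SYSGONE",
    "SAPHA2AER/APL/COH3": "SATISFACTORY/AVAILABLE",
}
print(all_in_status(sample, "*/*/COH2", "INHIBITED/SYSGONE"))
```

Requiring at least one match avoids reporting success when the pattern selects nothing, for example because of a mistyped system name.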
SA z/OS restarts the Db2 subsystem HA22 on COH1 in light mode in order to quickly release the retained locks. When the LPAR startup is complete, the Db2 subsystem in light mode stops and the primary Db2 subsystem on COH2 starts.
SA z/OS restarts the NFS server on COH1.
SA z/OS restarts SCS on the LPAR COH1, where the enqueue replication server is running.
SA z/OS restarts the enqueue replication server on COH3.
The SAP transaction Db2 shows that the current DB host is now coh1vipa. Check with the Db2 command -DIS THREAD(*) that all the threads are connected to COH1. Connection information for each work process can be found in the developer trace files.
All SAP-related resources are in AVAILABLE status after the failover and running on COH1, including the enqueue servers. The enqueue replication server is running on COH3.
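The placement rules exercised by this scenario can be summarized in a small model: SCS restarts where the enqueue replication server was running, and the replication server then moves to a further LPAR if one exists, or stops otherwise. The following sketch is illustrative only and does not reflect how SA z/OS evaluates its policy. The function name `plan_failover` is hypothetical, and choosing the first surviving LPAR in list order (for the NFS server and the replication server) is an assumption.

```python
from typing import Dict, List, Optional


def plan_failover(
    lpars: List[str],
    placement: Dict[str, str],
    failed: str,
) -> Dict[str, Optional[str]]:
    """Return the expected placement after the LPAR `failed` goes down.

    `placement` maps "NFS", "SCS" and "ERS" to the LPAR each runs on.
    Assumes SCS and ERS run on different LPARs before the failure.
    A value of None means the resource stops.
    """
    survivors = [lpar for lpar in lpars if lpar != failed]
    result: Dict[str, Optional[str]] = dict(placement)

    if placement["SCS"] == failed:
        # SCS restarts where the enqueue replication server was running.
        result["SCS"] = placement["ERS"]

    if placement["SCS"] == failed or placement["ERS"] == failed:
        # The replication server needs a surviving LPAR other than the
        # SCS host; with only two LPARs there is none, so it stops.
        candidates = [lpar for lpar in survivors if lpar != result["SCS"]]
        result["ERS"] = candidates[0] if candidates else None

    if placement["NFS"] == failed:
        # Assumption: the first surviving LPAR in list order is chosen.
        result["NFS"] = survivors[0] if survivors else None

    return result


before = {"NFS": "COH2", "SCS": "COH2", "ERS": "COH1"}
after = plan_failover(["COH1", "COH2", "COH3"], before, failed="COH2")
print(after)
```

With the initial setup of this scenario, the model places the NFS server and SCS on COH1 and the enqueue replication server on COH3, which matches the restarts listed in the steps above.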