Troubleshooting for replication problems
With EnqCF replication enabled, problems can occur during the startup of the enqueue server, or the server can encounter replication problems at run time. This topic discusses how to handle such problems.
For information about failover and recovery when you use EnqCF replication, see Failover and recovery of SAP Central Services using EnqCF replication.
Problems during enqueue server startup
Make sure that the patch level of the enqueue server is at least the patch level that is documented in SAP Kernel: Important News. Also, define a monitor resource for the enqueue server as described in Monitoring the health state of SAP enqueue replication.
An enqueue server that is started with EnqCF replication might encounter problems when accessing the z/OS coupling facility. The reaction to such errors depends on the underlying scenario. This topic discusses the following situations:
- Scenario 1: enqueue server cannot access the CF
- Scenario 2: enqueue server can access the CF, but during startup does not find any replication information in the CF. Creation of the new XCF note pad fails.
- Scenario 3: enqueue server can access the CF, and there is valid replication information in the CF. However, creation of the new XCF note pad fails.
Scenario 1 – no access to CF:
The enqueue server will start but replication will not be active. This is
indicated by System Automation showing a Warning health
state for the enqueue server that is triggered by its monitoring resource. Also, an error message is
written into the SAP developer trace file dev_enqrepl.
Problem resolution: Resolve the underlying problem that prevents the enqueue server from accessing the CF, the note pad structure, or the note pad.
Scenario 2 – no replication information available:
This scenario might occur in the following situation:
- The enqueue server is started with EnqCF replication enabled.
- It does not find any replication information in the CF. This is the case, for example, when:
  - EnqCF replication is turned on for the first time.
  - The CF structures were deleted before starting the enqueue server (by using the cleanrepstzOSCf tool or by entering original z/OS XCF commands).
- An error message is written into the SAP developer trace file dev_enqrepl.
- The SAP enqueue server will start, but replication will not be active.
- System Automation detects this termination and tries to start the enqueue server on a different LPAR (or attempts a restart in place).
- System Automation will show a Warning health state for the enqueue server that is triggered by its monitoring resource.
Problem resolution: The problem may be caused by wrong structure definitions in the CFRM policy, which defines the CF structures and their sizes. For definition and sizing of CF structures, see the PDF file that is attached to SAP Note 1753638. If the underlying problem (for example, a wrong CFRM policy) cannot be resolved quickly enough, a short-term solution might be to change the SAP profile so that replication is temporarily disabled.
Scenario 3 – with replication information available:
This scenario may occur in the following situation:
- The enqueue server is started with EnqCF replication enabled.
- It does find valid replication information in the CF, but it is not able to create the
new XCF note pad, for example, because of:
- Changes in the enqueue table size that require a larger CF structure than the one defined in the current CFRM policy
- CF structures being deleted or redefined.
The most likely reasons for these errors are SAP profile parameter changes. If you, for example,
increase the SAP enqueue table size parameter enque/table_size in such a way that
the CF structures are no longer large enough to hold the replication information for the new enqueue
table size, then the (re)start of your enqueue server with EnqCF replication enabled
will fail. The enqueue server does not terminate in this case, so that the enqueue locks,
encountered in the old CF structure at startup, are not lost. Instead, the enqueue server continues,
with replication being temporarily disabled. Messages indicating this situation are written into the
SAP developer trace file dev_enqrepl. The enqueue server tries to re-establish
replication into CF at intervals defined via the SAP profile variable
enque/enrep/stop_timeout_s.
The default for this parameter is 300 seconds. You may want to set this timeout to a lower value to allow for a quicker restart of the replication mechanism after the cause for the CF problem has been resolved.
Problem resolution: Define a CF structure that is large enough to hold the replication information for the new enqueue table size, so that replication into the CF is enabled again.
Replication problems at runtime
As with traditional TCPIP-based replication, a disruption of the replication is not visible in System Automation unless you use the monitor mechanism described in Monitoring the health state of SAP enqueue replication. The enqueue server resources continue to be shown with status AVAILABLE. Messages indicating the failing replication are written to the SAP developer trace file dev_enqrepl. Like TCPIP-based replication, EnqCF replication tries to re-establish the replication into the coupling facility at regular intervals. The interval length is specified by the SAP profile variable enque/enrep/stop_timeout_s.
You should consider setting the value for enque/enrep/stop_timeout_s lower than the default of 300 seconds. This enables a quicker restart of the replication mechanism after the cause for the CF problem has been resolved.
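To check which retry interval is in effect, you can read the value from the instance profile. The following is a minimal sketch, assuming the usual `name = value` profile line format with `#` comment lines. The helper names `parse_profile` and `effective_stop_timeout` and the sample values are hypothetical and not part of any SAP tooling.

```python
from typing import Dict

# Default of enque/enrep/stop_timeout_s as stated in this topic.
DEFAULT_STOP_TIMEOUT_S = 300

# Illustrative profile excerpt; the values are examples only.
SAMPLE_PROFILE = """\
# enqueue replication settings (illustrative values)
enque/table_size = 64000
enque/enrep/stop_timeout_s = 60
enque/enrep/stop_retries = 1
"""


def parse_profile(text: str) -> Dict[str, str]:
    """Collect 'name = value' pairs, skipping blank and comment lines."""
    params: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        params[name.strip()] = value.strip()
    return params


def effective_stop_timeout(params: Dict[str, str]) -> int:
    """Return the configured retry interval, or the default if it is unset."""
    return int(params.get("enque/enrep/stop_timeout_s", DEFAULT_STOP_TIMEOUT_S))


if __name__ == "__main__":
    timeout = effective_stop_timeout(parse_profile(SAMPLE_PROFILE))
    print(f"replication restart interval: {timeout} s")
```

If the parameter is missing from the profile, the sketch falls back to 300 seconds, which is the default named above.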
With TCPIP-based replication, the total number of retries is limited by the SAP profile variable enque/enrep/stop_retries. When this limit is reached, the enqueue server stops its attempts to re-establish replication until the replication server is restarted.
With EnqCF replication, the value of the stop_retries profile variable specifies the number of attempts to reuse the old CF note pad. When this number of retries is exhausted, the enqueue server tries to create a new CF note pad. The recommendation is to leave this variable at its default value of 1 to allow for fast recovery.
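The interplay of the two profile variables for EnqCF replication can be illustrated with a toy model. This is a sketch of the behavior described above, not SAP code. The class `ReplicationRestarter` and its two callbacks are hypothetical, and the sketch assumes that the first attempt happens immediately and that each failed attempt is followed by a wait of enque/enrep/stop_timeout_s seconds.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ReplicationRestarter:
    """Toy model of the EnqCF restart loop (illustrative only)."""

    reuse_old_note_pad: Callable[[], bool]   # hypothetical: True if the old note pad could be reused
    create_new_note_pad: Callable[[], bool]  # hypothetical: True if a new note pad was created
    stop_timeout_s: int = 300                # enque/enrep/stop_timeout_s
    stop_retries: int = 1                    # enque/enrep/stop_retries
    log: List[str] = field(default_factory=list)

    def run(self, max_attempts: int) -> int:
        """Return the simulated seconds waited until replication resumes, or -1."""
        waited = 0
        for attempt in range(1, max_attempts + 1):
            # The first stop_retries attempts try to reuse the old note pad;
            # every later attempt tries to create a new one.
            if attempt <= self.stop_retries:
                action, ok = "reuse", self.reuse_old_note_pad()
            else:
                action, ok = "create", self.create_new_note_pad()
            self.log.append(f"attempt {attempt}: {action} -> {'ok' if ok else 'failed'}")
            if ok:
                return waited
            waited += self.stop_timeout_s  # simulated wait; a real loop would sleep
        return -1


if __name__ == "__main__":
    # Old note pad cannot be reused, but a new one can be created.
    restarter = ReplicationRestarter(lambda: False, lambda: True, stop_timeout_s=60)
    print(restarter.run(max_attempts=5), restarter.log)
```

In this model, a larger stop_retries value delays the first attempt to create a new note pad by one interval per additional retry, which is consistent with the recommendation to keep the default of 1.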
How to resume replication after a CF outage
- If you have a secondary CF, APAR OA61404 is applied, and the CFRM policy is set up to allow takeover of the SAP note pad structures to the secondary CF, then replication automatically resumes into the secondary CF.
- If APAR OA61404 is not applied, be aware of the following. Depending on how the enqueue server handles the restart of a failed replication into the CF, and on the timing during the CF outage, different manual operator interventions can be needed. If the enqueue server loses the connection to its note pad, it suspends replication and starts the loop to restart and resume it. For replica consistency, the first thing it does during a restart is to try to delete the old note pad before it tries to create a new one. During a CF outage, the connection to the note pad is dropped. This triggers the restart loop and the deletion of the note pad. Now two things can happen:
  - The enqueue server LPAR has lost connectivity to the CF when it tries to delete the note pad. This is the 'normal' case. The enqueue server fails to delete the note pad and waits for enque/enrep/stop_timeout_s seconds before the next restart attempt. This continues until either the CF comes back or a manual SETXCF FORCE,STRUCTURE,STRNAME=strname is issued against the note pad structure. Successful processing of the SETXCF FORCE command cleans up the structure and implicitly the note pad. For details, see How to clean up the note pad structure and implicitly the SAP note pad itself. Then, at the next restart attempt, the deletion of the note pad succeeds (the note pad no longer exists), the creation of the note pad in the 'old/primary' note pad structure in the secondary CF succeeds, and replication resumes.
  - The enqueue server LPAR still has connectivity to the CF long enough to get the deletion of the note pad through. This is a 'rare' case, but it can happen because of the timing of the CF outage. In this case, the enqueue server continues and tries to create the note pad again. Two cases are possible:
    1. There is no secondary note pad structure defined in CFRM that can host the note pad. The 'old/primary' note pad structure is inaccessible in the failed CF and cannot be reused in the secondary CF. The creation of the note pad fails because there is no structure to hold it in the second CF. The enqueue server stays up and running, but replication is not restarted as long as the primary note pad structure is not cleaned up manually with the SETXCF FORCE command. For details, see How to clean up the note pad structure and implicitly the SAP note pad itself.
    2. There is a secondary note pad structure defined in CFRM that can host the note pad. In contrast to case 1, the creation of the note pad succeeds because the secondary structure can host it in the second CF. Replication resumes automatically. Be aware that the note pad now resides in the secondary structure.
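The cases described in this list can be summarized as a small decision function. This is only a sketch for orientation: the names `Outcome` and `outcome_after_cf_outage` and the three boolean inputs are hypothetical. As a simplification, the first input combines the three conditions of the first list item (secondary CF available, APAR OA61404 applied, CFRM policy allows takeover). Combinations that this topic does not describe are not modeled separately.

```python
from enum import Enum


class Outcome(Enum):
    AUTO_RESUME_SECONDARY_CF = "replication resumes automatically into the secondary CF"
    WAIT_FOR_CF_OR_FORCE = "retries continue until the CF returns or SETXCF FORCE is issued"
    MANUAL_FORCE_REQUIRED = "no restart until the primary structure is cleaned up with SETXCF FORCE"
    AUTO_RESUME_SECONDARY_STRUCTURE = "replication resumes; note pad now resides in the secondary structure"


def outcome_after_cf_outage(
    automatic_takeover_possible: bool,
    delete_succeeded: bool,
    secondary_structure_defined: bool,
) -> Outcome:
    """Map the situation during a CF outage to the behavior described in the text."""
    if automatic_takeover_possible:
        return Outcome.AUTO_RESUME_SECONDARY_CF
    if not delete_succeeded:
        # 'Normal' case: the LPAR had already lost connectivity to the CF.
        return Outcome.WAIT_FOR_CF_OR_FORCE
    # 'Rare' case: the old note pad was deleted before connectivity was lost.
    if secondary_structure_defined:
        return Outcome.AUTO_RESUME_SECONDARY_STRUCTURE
    return Outcome.MANUAL_FORCE_REQUIRED


if __name__ == "__main__":
    print(outcome_after_cf_outage(False, True, False).value)
```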
How to clean up the note pad structure and implicitly the SAP note pad itself
- Enter the following MVS command to display the structure that currently hosts the note pad for the SAP system HA1:
  D XCF,NOTEPAD,NOTEPADNAME=SAPHA1.ENQUEUE.*
  Alternatively, you can issue:
  D XCF,NP SAP*
  which shows the hosting structures for all SAP note pads. Note the structure name, in this example IXCNP_SAPHA100.
- Run the following command to display which CF (CF name) currently hosts the structure:
  D XCF,STRUCTURE,STRNAME=IXCNP_SAPHA100
- If the structure is in the failed CF, delete the note pad structure (and implicitly the note pad) by using the following command:
  SETXCF FORCE,STRUCTURE,STRNAME=IXCNP_SAPHA100
The last step is necessary to clean out old information that exists for the failed coupling facility. The command is asynchronous, so you need to wait until it is successfully processed. Then, the structure can be reused in the secondary CF. The next attempt to create the SAP note pad succeeds and replication resumes. For more information, see: Deleting XCF note pad structures.
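For runbooks that cover several SAP systems, the command sequence can be generated from the system ID. The following is a minimal sketch. The function `cleanup_commands` is hypothetical, and it assumes that the note pad name follows the pattern SAP<SID>.ENQUEUE.* shown in the HA1 example. The structure name is not derived but must be taken from the output of the first display command.

```python
from typing import List


def cleanup_commands(sid: str, structure_name: str) -> List[str]:
    """Build the MVS commands for the cleanup procedure as plain strings.

    sid            -- three-character SAP system ID, for example 'HA1'
    structure_name -- hosting structure as reported by D XCF,NOTEPAD
    """
    sid = sid.upper()
    if len(sid) != 3 or not sid.isalnum():
        raise ValueError("SAP system ID must be three alphanumeric characters")
    if not structure_name:
        raise ValueError("structure name is required")
    return [
        # Step 1: display the structure that hosts the note pad (assumed name pattern).
        f"D XCF,NOTEPAD,NOTEPADNAME=SAP{sid}.ENQUEUE.*",
        # Step 2: display which CF currently hosts the structure.
        f"D XCF,STRUCTURE,STRNAME={structure_name}",
        # Step 3: only if the structure is in the failed CF, force-delete it.
        f"SETXCF FORCE,STRUCTURE,STRNAME={structure_name}",
    ]


if __name__ == "__main__":
    for command in cleanup_commands("HA1", "IXCNP_SAPHA100"):
        print(command)
```

The sketch only formats the command text. Whether the third command should be issued still depends on the operator's check that the structure resides in the failed CF.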