Improved streaming cluster replication during server restarts

IBM® Domino® 10 implements streaming cluster replication (SCR) improvements that help preserve the state of the SCR queue across Domino server restarts. With these improvements, SCR can be used after a server restart without first being re-initialized for databases, a step that can delay replication.

SCR is a fast and efficient replication method that Domino uses to replicate within clusters. SCR is a push replication method that captures changes on the local server and pushes the changes to other replicas within the cluster. The following improvements are new in this release.

Local server restarts

Once every minute, SCR saves its current state to the file scrstate.dat in the data directory. When the server shuts down and then restarts, SCR reads scrstate.dat to determine its last known state before the restart. Prior to Domino 10, the SCR state was lost after server restarts and SCR had to be re-initialized for all databases, potentially delaying replication until the re-initialization completed.
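
As a rough illustration of the once-a-minute save, the following Python sketch reports how long ago scrstate.dat was last modified. It assumes that the file's modification time changes with each save, which this topic does not state. The function name and the data directory path are hypothetical. The sketch does not read the file's contents, because its format is not documented here.

```python
import os
import time

def scr_state_age_seconds(data_dir, now=None):
    """Return the seconds since scrstate.dat was last modified, or None if it is missing."""
    path = os.path.join(data_dir, "scrstate.dat")
    try:
        modified = os.path.getmtime(path)
    except FileNotFoundError:
        return None
    current = time.time() if now is None else now
    return current - modified

# Hypothetical data directory; substitute the path used by your server.
age = scr_state_age_seconds("/local/notesdata")
if age is None:
    print("scrstate.dat not found")
else:
    print(f"scrstate.dat last modified {age:.0f} seconds ago")
```

On a running server, an age much greater than a minute might suggest that the state is not being saved, although that interpretation rests on the assumption above.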

When a server restarts, the server console log output shows information about restoration of the SCR state:
[001460:000002-0000000000001650] RestoreSCRState: Starting SCR restore at 04/16/2018 03:51:48 PM
[001460:000002-0000000000001650] RestoreSCRState: Finished SCR restore at 04/16/2018 03:51:48 PM
[001460:000002-0000000000001650] RestoreSCRState: Input Lines = 642, Destinations Restored = 636

Input Lines = 642 means that the last saved scrstate.dat file contained 642 items of SCR state. Destinations Restored = 636 means that only 636 of those 642 items were restored to the SCR queue. The lower restored number is expected and results from local database changes that occurred after scrstate.dat was last saved.
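
To compare the two counts across many restarts, the summary line can be extracted from a saved copy of the console log. The following Python sketch is a minimal example that assumes the line format matches the sample output shown above. The function name is hypothetical and is not part of Domino.

```python
import re

# Matches the summary line as it appears in the sample console output.
SUMMARY_RE = re.compile(
    r"RestoreSCRState: Input Lines = (\d+), Destinations Restored = (\d+)"
)

def scr_restore_summary(log_text):
    """Return (input_lines, restored, not_restored) for the last summary line, or None."""
    matches = SUMMARY_RE.findall(log_text)
    if not matches:
        return None
    input_lines, restored = (int(value) for value in matches[-1])
    return input_lines, restored, input_lines - restored

sample = (
    "[001460:000002-0000000000001650] RestoreSCRState: "
    "Input Lines = 642, Destinations Restored = 636"
)
print(scr_restore_summary(sample))  # (642, 636, 6)
```

The third value is the number of items that were not restored, which is 6 for the sample output.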

Remote server restarts

When SCR detects that a remote server is down, it allows changes targeted for that server to remain in its SCR queue for up to 20 minutes. Prior to Domino 10, when a server detected that a remote server was down, SCR did not recognize the changes queued for that remote server, potentially causing replication delays while SCR was re-initialized.

When a remote server begins to restart, the local server logs information similar to this:
[001460:000008-00000000000015DC] ClientSCRDestHandler: Starting wait for server CN=ServerB/O=Domino10 to restart at 04/17/2018 08:12:57 AM

When the remote server comes back up and the SCR connection is restored, the local server logs information such as this:
[001460:000008-00000000000015DC] ClientSCRDestHandler: Connection re-established to server CN=ServerB/O=Domino10 at 04/17/2018 08:13:59 AM
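
The two log entries can be paired to estimate how long a remote server was unavailable and whether that time fell within the 20-minute queue window. The following Python sketch is a minimal example under several assumptions: the entries match the samples above, the timestamps use the month/day/year format shown, and the window is compared against the difference between the two logged times. How Domino measures the window internally is not described here. All names in the code are hypothetical.

```python
import re
from datetime import datetime, timedelta

TIMESTAMP = r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} [AP]M)"
# \s+ lets each pattern match whether the entry is on one line or wrapped.
WAIT_RE = re.compile(
    r"ClientSCRDestHandler: Starting wait for server\s+(.+?) to restart at " + TIMESTAMP
)
BACK_RE = re.compile(
    r"ClientSCRDestHandler: Connection re-established to server\s+(.+?) at " + TIMESTAMP
)
TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"  # assumed from the sample output
QUEUE_WINDOW = timedelta(minutes=20)  # retention period stated in this topic

def outages(log_text):
    """Return (server, downtime, within_window) for each completed wait, in log order."""
    events = [(m.start(), "wait", m) for m in WAIT_RE.finditer(log_text)]
    events += [(m.start(), "back", m) for m in BACK_RE.finditer(log_text)]
    waiting = {}  # server name -> time the wait started
    results = []
    for _, kind, match in sorted(events, key=lambda event: event[0]):
        server = match.group(1)
        when = datetime.strptime(match.group(2), TIME_FORMAT)
        if kind == "wait":
            waiting[server] = when
        elif server in waiting:
            downtime = when - waiting.pop(server)
            results.append((server, downtime, downtime <= QUEUE_WINDOW))
    return results

sample = """[001460:000008-00000000000015DC] ClientSCRDestHandler: Starting wait for server
CN=ServerB/O=Domino10 to restart at 04/17/2018 08:12:57 AM
[001460:000008-00000000000015DC] ClientSCRDestHandler: Connection re-established to server
CN=ServerB/O=Domino10 at 04/17/2018 08:13:59 AM"""

for server, downtime, within_window in outages(sample):
    print(server, downtime.total_seconds(), within_window)
```

For the sample entries, the remote server was unavailable for 62 seconds, well within the 20-minute window.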