Monitoring and diagnosing advanced replication problems
An LDAP administrator can monitor the state of advanced replication
processing and troubleshoot problems by using LDAP search requests
to retrieve operational attributes available for the roots of the
replication contexts (entries with an objectclass of ibm-replicationContext)
and replication agreements (entries with an objectclass of ibm-replicationAgreement).
Because these are operational attributes, either the + attribute
or each individual attribute must be requested on a search request
in order to be returned. Also, operational attributes cannot be used
in search filters.
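As a minimal sketch of the request described above, the following Python helper assembles an ldapsearch argument list that asks for operational attributes, either all of them with + or a named subset. The function name and the placeholder bind values are illustrative assumptions; the flags are the ones used by the ldapsearch examples later in this section.

```python
def build_agreement_search(host, port, bind_dn, bind_pw, base_dn, attrs=None):
    """Return an ldapsearch argument list for replication agreement entries.

    Operational attributes are returned only when requested, so the list
    ends with either "+" or the explicitly named attributes.
    """
    requested = list(attrs) if attrs else ["+"]
    return [
        "ldapsearch",
        "-p", str(port),
        "-h", host,
        "-D", bind_dn,
        "-w", bind_pw,
        "-b", base_dn,
        # The filter uses objectclass, not an operational attribute.
        "(objectclass=ibm-replicationAgreement)",
    ] + requested


# Hypothetical values; substitute the real host and credentials.
args = build_agreement_search(
    "server1.us.ibm.com", 389, "adminDn", "adminPw", "o=ibm,c=us"
)
print(" ".join(args))
```

Passing a list such as ["ibm-replicationState", "ibm-replicationFailedChangeCount"] as attrs requests only those attributes instead of +.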
The following tables describe the operational attributes for the replication context and replication agreement entries. Replication context entries use the auxiliary objectclass of ibm-replicationContext and replication agreement entries use the structural objectclass ibm-replicationAgreement. See Table 1 for the operational attributes for the ibm-replicationContext objectclass. See Table 2 for the operational attributes for the ibm-replicationAgreement objectclass.
When retrieved for a replication context or replication agreement entry, the operational attributes provide information concerning that entry. It is important to take notice of attributes that have values that contain failureId or changeId values. The failureId and changeId numbers increase sequentially. However, some numbers might be skipped by the server for various reasons. For example, if Db2® is restarted while the server is running, the changeId might skip numbers. These IDs are often required when working with the Control replication error log and the Control replication queue extended operations with the ldapexop utility. See ldapexop utility for more information about the ldapexop utility.
Attribute and description |
---|
ibm-replicationThisServerIsMaster A boolean (true or false) indicating whether the server is the master of the replication context. If set to true, the server is the master of the replication context. If set to false, the server is not the master of the replication context. |
ibm-replicationIsQuiesced A boolean (true or false) indicating whether the replication context is quiesced. If set to true, the replication context is quiesced. If set to false, the replication context is not quiesced. Updates under a quiesced replication context are restricted to an LDAP root administrator if using the Server Administration control (OID 1.3.18.0.2.10.15), and any replication master DNs with authority under this context. Advanced replication continues for a quiesced context. If the server is restarted, all replication contexts are then unquiesced. |
See Table 1 for the optional non-operational attribute for the ibm-replicationContext objectclass.
Attribute and description |
---|
ibm-replicationChangeLDIF The LDIF representation of the next pending change that has not yet been replicated and has resulted in advanced replication being stalled to the consumer server. If there is not a stalled replication change, the value is N/A. Examples of when an advanced replication queue might be stalled include:
|
ibm-replicationFailedChangeCount Specifies the number of advanced replication operations that have failed in this replication agreement. This number is shared among all replication agreement entries on the backend level by the ibm-slapdReplMaxErrors attribute in the CDBM backend configuration entry cn=Replication, cn=Configuration. See Table 2 for more information about the ibm-slapdReplMaxErrors attribute value. |
ibm-replicationFailedChanges A multi-valued attribute that lists all the logged replication operations that have failed. The number of attribute values is shared among all replication agreement entries on the backend level by the ibm-slapdReplMaxErrors attribute in the CDBM backend configuration entry cn=Replication, cn=Configuration. See Table 2 for more information about the ibm-slapdReplMaxErrors attribute value. A string value of the form: failureId timestamp returnCode numOfAttempts changeId operation entryDn The failureId identifies the update that has failed to replicate to the consumer server. The failureId is used with the Control replication error log extended operation to display, delete, or retry the failing replication update. The ldapexop utility supports the Control replication error log extended operation. See ldapexop utility for more information about the ldapexop utility. The timestamp is the time in Zulu format when this operation was last attempted to be replicated to the consumer server. The returnCode is the LDAP return code from the consumer server. The numOfAttempts is the number of times the error has been tried again on the consumer server. The changeId is the ID that this failureId had when it was in the pending replication queue. The operation indicates the update operation that encountered the failure. It has one of the following values: add, delete, modify, or modifydn The entryDn indicates the distinguished name of the entry that caused the failure. Example:
|
ibm-replicationLastActivationTime Specifies the Zulu format timestamp when advanced replication actively began replicating queued updates. |
ibm-replicationLastChangeID Specifies the replication change ID of the last successfully completed advanced replication update. |
ibm-replicationLastFinishTime Specifies the Zulu format timestamp when advanced replication updates in the queue were all attempted and the server awaits a new scheduled start time or more operations to appear in the advanced replication queue. See Schedule entries for more information about replication schedule entries. |
ibm-replicationLastResult A description of the result from the last advanced replication operation or connection attempt to a consumer server. A string value of the form: timestamp changeId returnCode operation entryDn The timestamp is the time in Zulu format when this operation was last attempted to be replicated to the consumer server. The changeId is the ID of the last replication update. The returnCode is the LDAP return code from the consumer server. The operation indicates the last LDAP operation. It has one of the following values: add, connect, delete, modify, or modifydn The entryDn indicates the distinguished name of the entry that was last added, deleted, modified, or renamed. If operation is connect, entryDn is set to NULL. Example:
|
ibm-replicationLastResultAdditional The descriptive reason code message text that supplements the return code message with the purpose of providing additional information from the last replication attempt. |
ibm-replicationNextTime Specifies the Zulu format timestamp of the next time advanced replication would begin if pending changes existed. When this value is set to 19000101000000z, replication begins immediately when a change is ready to be replicated if the ibm-replicationState operational attribute is set to active. |
ibm-replicationPendingChangeCount The number of replication operations that are waiting to be replicated to a consumer server. |
ibm-replicationPendingChanges A multi-valued attribute that lists all changes waiting to be replicated to a consumer server. A string value of the form: changeId operation entryDn The changeId is the ID of the pending replication update. The operation indicates the LDAP operation that is pending. It has one of the following values: add, delete, modify, or modifydn. The entryDn indicates the distinguished name of the entry that is to be added, deleted, modified, or renamed. Example:
|
ibm-replicationState Identifies the current state of the advanced replication queue. It has one of the following values:
|
See Table 4 for the required non-operational attributes for the ibm-replicationAgreement objectclass. See Table 5 for the optional non-operational attributes for the ibm-replicationAgreement objectclass.
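The ibm-replicationFailedChanges value format described in the table above lends itself to simple parsing. The following Python sketch splits one value into its seven fields and converts the Zulu timestamp; the class and function names are illustrative assumptions, and the sample value is the one shown in the recovery example later in this section.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class FailedChange(NamedTuple):
    failure_id: int
    timestamp: datetime
    return_code: int
    attempts: int
    change_id: int
    operation: str
    entry_dn: str


def parse_zulu(value):
    """Convert a Zulu timestamp such as 20090206144954Z to an aware datetime."""
    parsed = datetime.strptime(value, "%Y%m%d%H%M%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def parse_failed_change(value):
    """Split one ibm-replicationFailedChanges value into its seven fields."""
    # The DN comes last and may contain spaces, so split at most six times.
    fid, ts, rc, attempts, cid, op, dn = value.split(None, 6)
    return FailedChange(
        int(fid), parse_zulu(ts), int(rc), int(attempts), int(cid), op, dn
    )


sample = "12 20090206144954Z 32 1 45 add cn=entry,ou=sub,o=ibm,c=us"
change = parse_failed_change(sample)
print(change.failure_id, change.return_code, change.operation, change.entry_dn)
```

The failure_id field is the value that the Control replication error log extended operation expects when a single failure is displayed, retried, or deleted.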
Recovering from advanced replication errors
Replication errors can be handled proactively, before they are allowed to accumulate, or reactively, after replication has already stalled. Replication stalls occur when the number of failures reaches the limit as specified by the ibm-slapdReplMaxErrors attribute value in the cn=Replication,cn=configuration entry. See Table 2 for more information about the cn=Replication,cn=configuration entry.
When replication is stalled, the latest failed change occupies the beginning of the pending changes queue. The latest failed change gets retried every minute until it succeeds or the failed change is removed from the queue by an LDAP administrator with the appropriate authority. See Administrative group and roles for more information about administrative role authority. When this failed change occupies the lead position in the pending replication queue, all other replication updates are blocked and replication is stalled.
- Increase the size of the ibm-slapdReplMaxErrors attribute in the cn=Replication,cn=configuration entry. This allows more replication failures to be stored in the backend where the replication agreement entry exists.
- Delete or retry one or more failed replication changes.
- Skip the latest failed replication change.
- If the stalled replication problem is severe enough, the entire replication context where the replication agreement entry exists might need to be resynchronized. In order to do this, you must:
  - Quiesce the replication context
  - Suspend replication for all replication agreements
  - Delete all failed replication changes for all replication agreements
  - Skip all pending changes for all replication agreements
  - Resynchronize the replication context
  - Resume replication for the suspended replication agreements
  - Unquiesce the replication context
- The ibm-replicationChangeLdif operational attribute in the replication agreement entry shows the LDIF representation of the latest failure. The ibm-replicationLastResult and ibm-replicationLastResultAdditional operational attributes in the replication agreement have further detail for the reason the change failed.
- The ibm-replicationPendingChanges operational attribute in the replication agreement shows the change ID, the operation type, and the target DN of the next changes to be replicated. The number of pending changes that are displayed is limited by the ibm-slapdMaxPendingChangesDisplayed attribute in the cn=Replication,cn=configuration entry. See Table 2 for more information about the ibm-slapdMaxPendingChangesDisplayed attribute. See Table 2 for more information about the ibm-replicationPendingChanges operational attribute.
- The ibm-replicationFailedChanges operational attribute in the replication agreement shows each of the failed changes, including the failure ID. See Table 2 for more information about the ibm-replicationFailedChanges operational attribute.
- The Control replication error log extended operation can be used to display information about a failure by providing the failureId obtained from the ibm-replicationFailedChanges operational attribute. The controlreplerr extended operation -show option in the ldapexop utility can be used to display the latest failure. See ldapexop utility for more information about the ldapexop utility.
- Increase the size of the ibm-slapdReplMaxErrors attribute in the cn=Replication,cn=configuration entry. This allows more replication failures to be stored in the backend where the replication agreement entry exists. See Table 2 for more information about the ibm-slapdReplMaxErrors attribute.
- Delete or retry one or more failed changes for the replication agreement by using the Control replication error log extended operation with the ldapexop utility. The -retry option on the controlreplerr extended operation in the ldapexop utility allows a single failure (identified by its failureId) to be retried or all failures to be retried. The ability to retry all failures is especially useful when you have corrected the problem that caused a change to fail the first time. When a failed change is retried successfully, it is removed from the list of failed changes and there is space for a new one. The -delete option on the controlreplerr extended operation in the ldapexop utility allows a single failure (identified by its failureId) to be deleted or all failures to be deleted. This delete option is especially useful when a change is deemed to be unnecessary, the problem has been fixed manually, or a synchronization tool such as the ldapdiff utility has been used to resynchronize the directories. Deleting a failed change frees space in the list of failed changes so that a new failure can be added. See ldapexop utility for more information about the ldapexop utility. See ldapdiff utility for more information about the ldapdiff utility.
- Skip the latest failure for the replication agreement by using the Control replication queue extended operation. The ldapexop utility supports the Control replication queue extended operation that allows the next pending change (identified by its changeId) or all pending changes to be skipped. This extended operation is useful when the ibm-slapdReplMaxErrors attribute in the cn=Replication,cn=configuration entry is set to 0 in which case the replication failure is not allowed and replication stalls on the first failure. Also, the Control replication queue extended operation is useful when replication failures are not deleted, the ibm-slapdReplMaxErrors attribute value is increased, or after using the ldapdiff utility to resynchronize the replication context. See ldapexop utility for more information about the ldapexop utility.
- If there are multiple failed and pending replication changes, the entire replication context where the replication agreement entry exists might need to be resynchronized. In order to do this, you must:
  - Quiesce the replication context on all servers in the replication topology by using the Cascading control replication extended operation on the ldapexop utility. The Cascading control replication extended operation is targeted against the master server, which in turn quiesces the replication context on all consumer servers. A quiesced replication context only accepts updates from an LDAP root administrator when using the Server Administration control and any replication master server DNs with authority under this context. See Cascading control replication for more information about the Cascading control replication extended operation. See ldapexop utility for more information about the ldapexop utility.
  - Suspend replication for all replication agreements in the replication context by using the Control replication extended operation on the ldapexop utility. A suspended replication agreement queues all replication updates until it is resumed. See Control replication for more information about the Control replication extended operation.
  - Use the ldapdiff utility with the -L option to compare the replication contexts on each of the servers within the replication context. The -L option allows the entry differences to be written to an output LDIF file. See ldapdiff utility for more information about the ldapdiff utility.
  - Delete all failed replication changes for all replication agreements by using the Control replication error log extended operation on the ldapexop utility. See Control replication error log for more information about the Control replication error log extended operation.
  - Skip all pending replication changes by using the Control replication queue extended operation on the ldapexop utility. See Control replication queue for more information about the Control replication queue extended operation.
  - Resynchronize the replication context by using the -F (fix) option on the ldapdiff utility.
  - Resume replication for all suspended replication agreements by using the Control replication extended operation on the ldapexop utility.
  - Unquiesce the replication context on all servers in the replication topology by using the Cascading control replication extended operation on the ldapexop utility.
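The resynchronization steps above must run in a fixed order, which the following Python sketch makes explicit by building the command list without executing anything. The function name and the (host, port, bind DN, password) tuple layout are illustrative assumptions. The option spellings are taken from the recovery example later in this section, except for the fix run: reusing the comparison arguments with -F is an assumption that should be checked against the ldapdiff utility documentation.

```python
import shlex


def resync_plan(context_dn, agreement_dns, master, replica,
                diff_file="differences.ldif"):
    """Return the resynchronization commands, in order, as argument lists.

    master and replica are (host, port, bind_dn, bind_pw) tuples.
    """
    m_host, m_port, m_dn, m_pw = master
    r_host, r_port, r_dn, r_pw = replica
    exop = ["ldapexop", "-p", str(m_port), "-h", m_host, "-D", m_dn, "-w", m_pw]
    diff = ["ldapdiff", "-b", context_dn,
            "-sh", m_host, "-sp", str(m_port), "-sD", m_dn, "-sw", m_pw,
            "-ch", r_host, "-cp", str(r_port), "-cD", r_dn, "-cw", r_pw]

    plan = [
        exop + ["-op", "cascrepl", "-action", "quiesce", "-rc", context_dn],
        exop + ["-op", "controlrepl", "-action", "suspend", "-rc", context_dn],
        diff + ["-a", "-L", diff_file],
    ]
    for ra in agreement_dns:
        plan.append(exop + ["-op", "controlreplerr", "-delete", "all", "-ra", ra])
        plan.append(exop + ["-op", "controlqueue", "-skip", "all", "-ra", ra])
    # Assumption: the fix run reuses the comparison arguments with -F.
    plan.append(diff + ["-F"])
    for ra in agreement_dns:
        plan.append(exop + ["-op", "controlrepl", "-action", "resume", "-ra", ra])
    plan.append(exop + ["-op", "cascrepl", "-action", "unquiesce", "-rc", context_dn])
    return plan


# Hypothetical hosts and credentials.
master = ("server1.us.ibm.com", 389, "adminDn", "adminPw")
replica = ("server2.us.ibm.com", 389, "adminDn", "adminPw")
agreement = ("cn=Replica, ibm-replicaServerId=Master, "
             "ibm-replicaGroup=default, o=ibm, c=us")
plan = resync_plan("o=ibm,c=us", [agreement], master, replica)
for step, command in enumerate(plan, 1):
    print(step, shlex.join(command))
```

Printing the plan first gives an administrator a chance to review each command before running it; the sketch deliberately stops short of invoking the utilities.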
The other methodology for handling replication failures is to take a proactive, preventive approach. An LDAP administrator monitors the replication failure queue and resolves problems before the queue reaches capacity and replication stalls. An LDAP administrator with the appropriate authority can use the Control replication error log extended operation and the ibm-replicationFailedChanges and ibm-replicationState operational attributes in the replication agreement entry to monitor the current replication status. See Administrative group and roles for information about administrative authority.
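A proactive check of this kind can be sketched in a few lines of Python. The function below takes values already retrieved from a replication agreement entry and from the ibm-slapdReplMaxErrors setting, and reports why the agreement might need attention. The function name and the warn_ratio threshold are hypothetical choices, not part of the product.

```python
def needs_attention(state, failed_count, max_errors, warn_ratio=0.8):
    """Return a list of reasons why a replication agreement needs review.

    state        -- value of ibm-replicationState
    failed_count -- value of ibm-replicationFailedChangeCount
    max_errors   -- value of ibm-slapdReplMaxErrors
    warn_ratio   -- hypothetical fraction of max_errors that triggers a warning
    """
    reasons = []
    if state.lower() == "retrying":
        reasons.append("replication is stalled (state is retrying)")
    if max_errors > 0 and failed_count >= warn_ratio * max_errors:
        reasons.append(
            f"failure log holds {failed_count} of {max_errors} allowed entries"
        )
    return reasons


# Values from the recovery example: one failure logged, limit set to one.
print(needs_attention("retrying", 1, 1))
```

An empty list means that neither condition was met; how often to run such a check is a local operational decision.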
Advanced replication error recovery example
This advanced replication error recovery example uses the master-replica topology that has been configured in Creating a master-replica topology. This example assumes the ibm-slapdReplMaxErrors attribute value in the cn=Replication,cn=configuration entry is set to one.
Check the status of replication in the o=ibm,c=us replication context by querying the replication agreement operational attribute values. See Table 2 for more information about the replication agreement operational attributes. Operational attributes are returned only when the + attribute is specified or each operational attribute is requested.
ldapsearch -p 389 -h server1.us.ibm.com -D adminDn -w adminPw -b o=ibm,c=us
"(objectclass=ibm-replicationAgreement)" "*" ibm-replicationChangeLdif
ibm-replicationFailedChangeCount ibm-replicationFailedChanges ibm-replicationLastActivationTime
ibm-replicationLastChangeID ibm-replicationLastFinishTime ibm-replicationLastResult
ibm-replicationLastResultAdditional ibm-replicationNextTime ibm-replicationPendingChangeCount
ibm-replicationPendingChanges ibm-replicationState
The ldapsearch command returns the following entry:
cn=Replica, ibm-replicaServerId=Master, ibm-replicaGroup=default, o=ibm, c=us
objectclass=top
objectclass=ibm-replicationAgreement
ibm-replicaconsumerid=Replica
ibm-replicaurl=ldap://server2.us.ibm.com:389
ibm-replicacredentialsdn=cn=ReplicaBindCredentials,o=ibm, c=us
description=Replication agreement from master to replica
cn=Replica
ibm-replicationonhold=FALSE
ibm-replicationstate=retrying
ibm-replicationpendingchanges=46 modify OU=SUB,O=IBM,C=US
ibm-replicationpendingchangecount=1
ibm-replicationnexttime=19000101000000
ibm-replicationlastresultadditional=R004071 DN 'OU=SUB,O=IBM,C=US' does not exist
(ldbm_process_request:406)
ibm-replicationlastresult=20090206145054Z 46 32 modify OU=SUB,O=IBM,C=US
ibm-replicationlastfinishtime=20090206144954Z
ibm-replicationlastchangeid=45
ibm-replicationlastactivationtime=20090206144354Z
ibm-replicationfailedchanges=12 20090206144954Z 32 1 45 add cn=entry,ou=sub,o=ibm,c=us
ibm-replicationfailedchangecount=1
ibm-replicationchangeldif=
dn: ou=sub,o=ibm,c=us
control: 2.16.840.1.113730.3.4.2 true
control: 1.3.18.0.2.10.19 false:: MIGPMCAKAQIwGwQNbW9kaWZpZXJzTmFtZTEKBAhjbj1
hZG1pbjAwCgECMCsED21vZGlmeVRpbWVzdGFtcDEYBBYyMDA5MDIwNjE0NDg1My41ODM4MjVaMDk
KAQIwNAQYUmVwbGljYXRpb25CYXNlVGltZXN0YW1wMRgEFjIwMDkwMjA2MTM0ODQ2Ljc4Njg4NFo
=
changetype: modify
add: description
description: A small division
- The ibm-replicationState operational attribute value is set to retrying, which indicates replication is currently stalled. Replication is stalled because the number of replication failures exceeds one. (The ibm-slapdReplMaxErrors attribute value has been set to one in the cn=Replication,cn=configuration entry.)
- The ibm-replicationChangeLdif operational attribute in the replication agreement shows the LDIF representation of the latest failure. The LDIF shows that the last failure is a modify of the ou=sub,o=ibm,c=us entry on the consumer server. The ibm-replicationLastResult and ibm-replicationLastResultAdditional operational attributes in the replication agreement indicate that the modify failed on the consumer server because the ou=sub,o=ibm,c=us entry does not exist.
- The ibm-replicationPendingChanges operational attribute in the replication agreement shows the changeId of the next pending update is 46. The next pending change is also the same modify operation of the ou=sub,o=ibm,c=us entry. It will be replicated to the consumer server after the add failure in the ibm-replicationFailedChanges operational attribute is resolved.
- The ibm-replicationFailedChanges operational attribute in the replication agreement shows one failed replication update. The attribute value indicates that the failureId is 12, the LDAP return code from the consumer server is 32, it is an add operation of the cn=entry,ou=sub,o=ibm,c=us entry, and the supplier server has tried once to replicate the update.
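The reading of the sample values above can be reproduced with a short Python sketch that parses the ibm-replicationLastResult and ibm-replicationPendingChanges values and checks whether the change at the head of the pending queue is the one that last failed. The function names are illustrative assumptions; the field order follows the attribute descriptions in Table 2.

```python
def parse_last_result(value):
    """Split an ibm-replicationLastResult value into named fields."""
    # Form: timestamp changeId returnCode operation entryDn
    timestamp, change_id, return_code, operation, entry_dn = value.split(None, 4)
    return {
        "timestamp": timestamp,
        "change_id": int(change_id),
        "return_code": int(return_code),
        "operation": operation,
        "entry_dn": entry_dn,
    }


def parse_pending_change(value):
    """Split an ibm-replicationPendingChanges value into a tuple."""
    # Form: changeId operation entryDn
    change_id, operation, entry_dn = value.split(None, 2)
    return int(change_id), operation, entry_dn


# Values copied from the ldapsearch output in this example.
last = parse_last_result("20090206145054Z 46 32 modify OU=SUB,O=IBM,C=US")
pending = parse_pending_change("46 modify OU=SUB,O=IBM,C=US")

# A nonzero return code on the change at the head of the queue matches
# the stalled condition described above.
head_is_failing = last["return_code"] != 0 and last["change_id"] == pending[0]
print(head_is_failing)
```

In this example both values carry changeId 46, which is consistent with the retrying state reported by ibm-replicationState.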
To determine why the addition of the cn=entry,ou=sub,o=ibm,c=us entry failed, the ldapexop utility can be used to perform a Control replication error log extended operation to show the failed replication change. See ldapexop utility for more information about the ldapexop utility.
ldapexop -p 389 -h server1.us.ibm.com -D adminDn -w adminPw -op controlreplerr
-ra "cn=Replica, ibm-replicaServerId=Master, ibm-replicaGroup=default,
o=ibm,c=us" -show 12
dn: cn=entry,ou=sub,o=ibm,c=us
control: 2.16.840.1.113730.3.4.2 true
control: 1.3.18.0.2.10.19 false:: MIGnMDAKAQAwKwQPbW9kaWZ5dGltZXN0YW1wMRgEFjI
wMDkwMjA2MTQ0MzU0LjY1NzcwMFowIAoBADAbBA1tb2RpZmllcnNuYW1lMQoECGNuPWFkbWluMDA
KAQAwKwQPY3JlYXRldGltZXN0YW1wMRgEFjIwMDkwMjA2MTQ0MzU0LjY1NzcwMFowHwoBADAaBAx
jcmVhdG9yc25hbWUxCgQIY249YWRtaW4=
changetype: add
cn: entry
ibm-entryuuid: A091A000-4CAA-198C-8D7D-402084027431
sn: entry
objectclass: person
objectclass: top
An LDAP administrator can either fix the replication differences manually or use the ldapdiff utility to resynchronize the replication contexts on all servers in the replication topology. The ldapdiff utility is a useful tool for comparing and verifying that the entries within a replication context on the supplier and consumer servers are synchronized. For the purposes of this example, an LDAP administrator has chosen to resynchronize the replication context by using the ldapdiff utility. See ldapdiff utility for more information about the ldapdiff utility.
Before you use the ldapdiff utility to compare or fix entries within a replication context, quiesce the replication context on all servers within the replication topology by using the Cascading control replication extended operation quiesce option on the ldapexop utility. See ldapexop utility for more information about the ldapexop utility.
The following command quiesces the o=ibm,c=us replication context on the master and replica servers in the replication topology:
ldapexop -p 389 -h server1.us.ibm.com -D adminDn -w adminPw -op cascrepl -action
quiesce -rc "o=ibm,c=us"
After the replication context is quiesced on all servers, the Control replication extended operation can be used to suspend replication for all replication agreements within the replication context.
The following command suspends replication for all replication agreements within the replication context o=ibm,c=us. The cn=Replica, ibm-replicaServerId=Master, ibm-replicaGroup=default, o=ibm, c=us entry is the only replication agreement within the o=ibm,c=us replication context, so it is the only agreement that is suspended.
ldapexop -p 389 -h server1.us.ibm.com -D adminDn -w adminPw -op controlrepl
-action suspend -rc "o=ibm,c=us"
The following ldapdiff command compares the replication context o=ibm,c=us on the master server on server1.us.ibm.com and the replica server on server2.us.ibm.com. If there are any differences between the two servers, they are written to the output LDIF file called differences.ldif. The ldapdiff -a option is specified to write the Server Administration control to the output LDIF for each entry that is different between the two servers. See Server Administration for more information about the Server Administration control.
ldapdiff -a -b "o=ibm,c=us" -L differences.ldif -sh server1.us.ibm.com -sp 389
-sD adminDn -sw adminPw
-ch server2.us.ibm.com -cp 389 -cD adminDn -cw adminPw
where differences.ldif contains:
dn: ou=sub,o=ibm,c=us
control: 1.3.18.0.2.10.15 true
control: 1.3.18.0.2.10.19 false::
MIGnMB8KAQAwGgQMY3JlYXRvcnNOYW1lMQoECGNuPWFkbWluMDAKAQAwKwQP
Y3JlYXRlVGltZVN0YW1wMRgEFjIwMDkwMjA2MTM0ODQ2Ljc4Njg4NFowIAoB
ADAbBA1tb2RpZmllcnNOYW1lMQoECGNuPWFkbWluMDAKAQAwKwQPbW9kaWZ5
VGltZVN0YW1wMRgEFjIwMDkwMjA2MTQ0ODUzLjU4MzgyNVo=
changeType: add
ibm-entryuuid: C01B9000-3FBE-198C-98A7-402084027431
ou: sub
description: A small division
objectclass: organizationalUnit
objectclass: top

dn: cn=entry,ou=sub,o=ibm,c=us
control: 1.3.18.0.2.10.15 true
control: 1.3.18.0.2.10.19 false::
MIGnMB8KAQAwGgQMY3JlYXRvcnNOYW1lMQoECGNuPWFkbWluMDAKAQAwKwQP
Y3JlYXRlVGltZVN0YW1wMRgEFjIwMDkwMjA2MTQ0MzU0LjY1NzcwMFowIAoB
ADAbBA1tb2RpZmllcnNOYW1lMQoECGNuPWFkbWluMDAKAQAwKwQPbW9kaWZ5
VGltZVN0YW1wMRgEFjIwMDkwMjA2MTQ0MzU0LjY1NzcwMFo=
changeType: add
ibm-entryuuid: A091A000-4CAA-198C-8D7D-402084027431
objectclass: person
objectclass: top
sn: entry
cn: entry
The contents of the differences.ldif file indicate that the ou=sub,o=ibm,c=us entry does not exist on the consumer server. This explains why the addition of the child entry cn=entry,ou=sub,o=ibm,c=us failed on the consumer server.
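The same conclusion can be drawn programmatically from an LDIF difference file. The following Python sketch lists each record's DN and change type and flags additions whose parent entry is also being added, which is the situation in this example. It is a simplified reader, not a full LDIF parser: it assumes plain dn: lines (no base64 dn:: values, no folded lines) and DNs without escaped commas. The sample text is an abbreviated form of the differences.ldif contents shown above, and the function names are illustrative.

```python
def ldif_changes(text):
    """Return (dn, changetype) pairs from LDIF text, in file order."""
    changes = []
    dn = None
    for line in text.splitlines():
        if line.startswith("dn: "):
            dn = line[4:].strip()
        elif dn is not None and line.lower().startswith("changetype:"):
            changes.append((dn, line.split(":", 1)[1].strip().lower()))
            dn = None
    return changes


def parent_dn(dn):
    """Return the DN without its leftmost RDN (naive comma split)."""
    return dn.split(",", 1)[1] if "," in dn else ""


# Abbreviated sample: controls and most attributes are omitted.
sample = """\
dn: ou=sub,o=ibm,c=us
changeType: add
ou: sub

dn: cn=entry,ou=sub,o=ibm,c=us
changeType: add
cn: entry
"""

changes = ldif_changes(sample)
added = {dn.lower() for dn, kind in changes if kind == "add"}
dependent = [dn for dn, kind in changes
             if kind == "add" and parent_dn(dn).lower() in added]
for dn in dependent:
    print(dn, "depends on a parent entry that is also missing")
```

Here cn=entry,ou=sub,o=ibm,c=us is reported because its parent ou=sub,o=ibm,c=us also appears as an add in the same file.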
Before synchronizing entries within a replication context on the master and replica servers, all replication failures are deleted and all pending replication changes are skipped. Replication failures are deleted by using the Control replication error log extended operation on the ldapexop utility. Pending replication changes are skipped by using the Control replication queue extended operation on the ldapexop utility.
The following commands delete all replication failures and skip all pending replication changes for the replication agreement cn=Replica, ibm-replicaServerId=Master, ibm-replicaGroup=default, o=ibm, c=us:
ldapexop -p 389 -h server1.us.ibm.com -D adminDn -w adminPw -op controlreplerr
-delete all -ra "cn=Replica, ibm-replicaServerId=Master, ibm-replicaGroup=default,
o=ibm, c=us"
ldapexop -p 389 -h server1.us.ibm.com -D adminDn -w adminPw -op controlqueue
-skip all -ra "cn=Replica, ibm-replicaServerId=Master, ibm-replicaGroup=default,
o=ibm, c=us"
To synchronize the o=ibm,c=us replication context on the master and replica servers, run the ldapdiff utility again with the -F (Fix) option specified or use the ldapmodify command to add the entries in the differences.ldif file to the consumer server.
Because the master and replica servers are now synchronized, the replication agreement can now be resumed and the replication context unquiesced. The replication agreement is resumed by using the Control replication extended operation on the ldapexop utility. The replication context is unquiesced on all servers in the replication topology by using the Cascading control replication extended operation on the ldapexop utility.
The following command resumes replication for the replication agreement cn=Replica, ibm-replicaServerId=Master, ibm-replicaGroup=default, o=ibm, c=us:
ldapexop -p 389 -h server1.us.ibm.com -D adminDn -w adminPw -op controlrepl
-action resume -ra "cn=Replica, ibm-replicaServerId=Master, ibm-replicaGroup=default,
o=ibm, c=us"
The following command unquiesces the replication context o=ibm,c=us on all servers in the replication topology:
ldapexop -p 389 -h server1.us.ibm.com -D adminDn -w adminPw -op cascrepl
-action unquiesce -rc "o=ibm,c=us"
Query the replication agreement operational attributes again to verify the state of replication:
ldapsearch -p 389 -h server1.us.ibm.com -D adminDn -w adminPw -b "o=ibm,c=us"
"(objectclass=ibm-replicationAgreement)" "*" ibm-replicationChangeLdif
ibm-replicationFailedChangeCount ibm-replicationFailedChanges ibm-replicationLastActivationTime
ibm-replicationLastChangeID ibm-replicationLastFinishTime ibm-replicationLastResult
ibm-replicationLastResultAdditional ibm-replicationNextTime ibm-replicationPendingChangeCount
ibm-replicationPendingChanges ibm-replicationState
The ldapsearch command returns the following entry:
cn=Replica, ibm-replicaServerId=Master, ibm-replicaGroup=default, o=ibm, c=us
objectclass=top
objectclass=ibm-replicationAgreement
ibm-replicaconsumerid=Replica
ibm-replicaurl=ldap://server2.us.ibm.com:389
ibm-replicacredentialsdn=cn=ReplicaBindCredentials,o=ibm, c=us
description=Replication agreement from master to replica
cn=Replica
ibm-replicationonhold=FALSE
ibm-replicationstate=ready
ibm-replicationpendingchangecount=0
ibm-replicationnexttime=19000101000000
ibm-replicationlastfinishtime=20090206165454Z
ibm-replicationlastchangeid=46
ibm-replicationlastactivationtime=20090206144354Z
ibm-replicationfailedchangecount=0
ibm-replicationchangeldif=N/A
Because the ibm-replicationState operational attribute value in the replication agreement entry is set to ready, replication from the master to the replica is now no longer stalled.
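The final verification can also be scripted. The following Python sketch parses search output in the attribute=value form shown above and checks that the agreement is no longer retrying and that both the pending and failed counts are zero. The function names are illustrative assumptions, and the parser handles only single-line values, so it would not cope with a multi-line ibm-replicationChangeLdif value such as the one in the stalled output earlier in this example.

```python
def parse_entry(output):
    """Parse one entry printed as a DN line followed by attribute=value lines."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    entry = {"dn": lines[0]}
    for line in lines[1:]:
        name, sep, value = line.partition("=")
        if sep:
            # Attribute names are stored in lowercase; values may repeat.
            entry.setdefault(name.lower(), []).append(value)
    return entry


def is_recovered(entry):
    """True when the agreement is not retrying and both counts are zero."""
    def first(name):
        return entry.get(name, [""])[0]

    state = first("ibm-replicationstate")
    return (state not in ("", "retrying")
            and first("ibm-replicationpendingchangecount") == "0"
            and first("ibm-replicationfailedchangecount") == "0")


# Abbreviated copy of the final search output in this example.
output = """\
cn=Replica, ibm-replicaServerId=Master, ibm-replicaGroup=default, o=ibm, c=us
objectclass=top
objectclass=ibm-replicationAgreement
ibm-replicationstate=ready
ibm-replicationpendingchangecount=0
ibm-replicationfailedchangecount=0
ibm-replicationchangeldif=N/A
"""

entry = parse_entry(output)
print(entry["ibm-replicationstate"], is_recovered(entry))
```

Treating any state other than retrying as healthy is a simplification made for this sketch; the full list of ibm-replicationState values is described in Table 2.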