Debugging Replication in IBM Tivoli Directory Server

Replication: Basics and advanced concepts


Replication is a technique used by directory servers to improve performance and reliability. The replication process keeps the data in multiple directories synchronized. Replication provides two main benefits:

  • Redundancy of information - replicas back up the content of their supplier servers.
  • Faster searches - search requests can be spread among several different servers, all having the same content, instead of a single server. This improves the response time for the request completion.

IBM Tivoli Directory Server (TDS), version 5.1 onward, supports a form of replication called subtree replication. Subtree replication can be defined as follows: "A portion of the DIT is replicated from one server to another. Under this design, a given subtree can be replicated to some servers and not to others." This article is written with reference to TDS V5.2, so the word directory refers to TDS V5.2 throughout.

This article assumes that the reader is comfortable with directory server concepts like the Directory Information Tree. This article will:

  1. Provide a clear logical view of subtree replication so as to enable directory administrators to troubleshoot their replication setups and maintain them.
  2. Provide users who are familiar with subtree replication with information, best practices and methods to deal with errors that can hinder their replication setup.
  3. Provide a clear idea of how to report replication problems to TDS support so as to get faster and accurate responses from their personnel.

Replication Concepts

As mentioned, TDS replication works on the concept of subtree replication. In this section, we will cover some terms related to subtree replication which we will need in the rest of this writeup.

  • Consumer server: A server which receives changes through replication from another (supplier) server.
  • Supplier server: A server which sends changes to another (consumer) server.
  • Replication context: Identifies the root of a replicated subtree. The ibm-replicationContext auxiliary object class may be added to an entry to mark it as the root of a replicated area. The configuration information related to replication is maintained in a set of entries created below a replication context.
  • Replica group: The first entry created under a replication context has objectclass ibm-replicaGroup and represents a collection of servers participating in replication. It provides a convenient location to set ACLs to protect the replication topology information. The administration tools currently support one replica group under each replication context, named ibm-replicagroup=default.
  • Replica subentry: Below a replica group entry, one or more entries with objectclass ibm-replicaSubentry may be created; one for each server participating in replication as a supplier. The replica subentry identifies the role the server plays in replication: master or read-only. A read-only server might, in turn, have replication agreements to support cascading replication.
  • Replicated subtree: A portion of the DIT that is replicated from one server to another. Under this design, a given subtree can be replicated to some servers and not to others. A subtree can be writable on a given server, while other subtrees may be read-only.
  • Replication agreement: Information contained in the directory that defines the connection or replication path between two servers. One server is called the supplier (the one that sends the changes) and the other is the consumer (the one that receives the changes). The agreement contains all the information needed for making a connection from the supplier to the consumer and scheduling replication.
  • Credentials: Identify the method and required information that the supplier uses in binding to the consumer. For simple binds, this includes the DN and password. The credentials are stored in an entry the DN of which is specified in the replication agreement.
  • Schedule: Replication can be scheduled to occur at particular times, with changes on the supplier accumulated and sent in a batch. The replication agreement contains the DN for the entry that supplies the schedule.

Specific entries in the directory are identified as the roots of replicated subtrees, by adding the ibm-replicationContext objectclass to them. Each subtree is replicated independently. The subtree continues down through the directory information tree (DIT) until reaching the leaf entries or other replicated subtrees. Entries are added below the root of the replicated subtree to contain the replication configuration information. These entries are one or more replica group entries, under which are created replica subentries. Associated with each replica subentry are replication agreements that identify the servers that are supplied (replicated to) by each server, as well as defining the credentials and schedule information. Through replication, a change made to one directory is propagated to one or more additional directories. In effect, a change to one directory shows up on multiple different directories.

I supply, you consume

Whenever a change (an addition, deletion or modification) affects an entry under a replication context on, say, server A, the server picks up that change and replicates it to the servers named in the replication agreements under the subentry for server A. The server on which the change occurs supplies the change to the servers with which it has agreements. A server that receives changes from the supplier is said to consume them and applies them to its own DIT; hence the term consumer.

The agreement and the replication queue

To make things easier, let us consider a simple example. Our example setup is made up of one master and one replica. Masters are servers which accept client updates; replicas do not. In other words, our master can be updated by a client, and the replica acts as if it is read-only for that context.

Figure 1. Master-Replica setup

The Master and the Replica in the above setup are both TDS servers. Note the arrow between the two blocks. It indicates that the changes that will occur for a certain replication context will be sent across to the replica server. We have the following information:

  1. o=ibm,c=us is the replication context.
  2. The master is on and is listening on port 389 and the replica is on and listening on the same port.
  3. The server ID of the master is Master and the server ID of the replica is Replica.
  4. We will be using non-secure means for the master to contact and communicate with the replica.

The following entries would go into the DIT of the master and the replica:

Listing 1. Entries in Master and Replica server
dn: o=ibm, c=us
objectclass: top
objectclass: organization
objectclass: ibm-replicationContext
o: ibm

dn: ibm-replicaGroup=default, o=ibm, c=us
objectclass: top
objectclass: ibm-replicaGroup
ibm-replicaGroup: default

dn: ibm-replicaServerId=Master,ibm-replicaGroup=default,o=ibm,c=us
objectclass: top
objectclass: ibm-replicaSubentry
ibm-replicaServerId: Master
ibm-replicationServerIsMaster: true
cn: Master
description: Master

dn: cn=ReplicaBindCredentials, o=ibm,c=us
objectclass: ibm-replicationCredentialsSimple
cn: ReplicaBindCredentials
replicaBindDN: cn=master
replicaCredentials: master
description: Bind method for the master to connect to the replica.

dn:cn=Replica, ibm-replicaServerId=Master, ibm-replicaGroup=default,o=ibm,c=us
objectclass: top
objectclass: ibm-replicationAgreement
cn: Replica
ibm-replicaConsumerId: Replica
ibm-replicaUrl: ldap:// 389
ibm-replicaCredentialsDN: cn=ReplicaBindCredentials,o=ibm,c=us
description: Replication agreement describing how the master will bind to the replica.

The following entry should go in the replica's configuration file:

Listing 2. Entries in Replica server configuration file
dn: cn=master server, cn=configuration
cn: master server
ibm-slapdMasterDN: cn=master
ibm-slapdMasterPW: master
ibm-slapdMasterReferral: ldap://
objectclass: ibm-slapdReplication

Let us take a close look at the entries above:

  1. The entry o=ibm,c=us is a replication context. This means that if a client binds to the master and updates the DIT below o=ibm,c=us, these changes will be propagated to the servers under the subentry for the master.
  2. The entry ibm-replicagroup=default,o=ibm,c=us presents a convenient location to keep ACLs and has no special significance in replication as such.
  3. The entry ibm-replicaServerId=Master,ibm-replicaGroup=default,o=ibm, c=us is the subentry. Notable among the attributes are ibm-replicaServerId and ibm-replicationServerIsMaster. The former attribute is the server ID (Master in this case, as we had already mentioned) and the latter is a Boolean attribute indicating that the server mentioned by the ibm-replicaServerId attribute will be a read-write or read-only copy. The value true indicates that the server Master will be a read-write copy.
  4. The entry cn=Replica, ibm-replicaServerId=Master, ibm-replicaGroup=default, o=ibm, c=us is perhaps the most important entry in the above set (though the importance of the other entries should not be underestimated). It is the agreement which enables the master server to contact the replica server, authenticate if needed, authorize and replicate. Authentication will be necessary if we are using secure means like SSL or Kerberos, which we are not, for the purpose of simplicity. The following attributes are worth a look in the agreement:
    • ibm-replicaConsumerId: This attribute indicates the ID of the consumer for this agreement
    • ibm-replicaUrl: This is an LDAP URL which indicates that the master will be supplying to an LDAP server on host name which is listening on port 389.
    • ibm-replicaCredentialsDN: This is the DN for the entry which holds the credentials that the master will use to bind to the replica. Please note that when we mean bind, we actually mean an LDAP bind, and as all LDAP binds require a bind DN and a bind password, the master also needs the bind DN and bind password that it is supposed to use to bind to the replica. The entry cn=ReplicaBindCredentials,o=ibm,c=us has a replicaBindDN and a replicaCredentials attribute which the master uses.
  5. Also worthy of a look is the cn=master server,cn=configuration entry. We need to put this in the replica configuration file. This is the entry the replica (consumer) uses to authorize the master (supplier) when the master binds to the replica.

Please note that your replica ALSO NEEDS the DIT entries. This enables it to realize that it is a read-only copy; the absence of its subentry and/or the value of the ibm-replicationServerIsMaster attribute, in case it is a forwarder, is an indicator that it is indeed read-only.
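
The agreement entry discussed above follows a regular pattern, so it can be generated rather than typed by hand. The following is only a minimal sketch in Python, not an IBM tool: the function name agreement_ldif and the host name replica.example.com are assumptions made for illustration, and the sketch assumes (as in Listing 1) that the cn of the agreement is the same as the consumer ID.

```python
def agreement_ldif(context_dn, supplier_id, consumer_id, consumer_url, credentials_dn):
    """Render a replication agreement entry shaped like the one in Listing 1.

    Hypothetical helper: the attribute names come from Listing 1, everything
    else (function name, argument names) is illustrative only.
    """
    # The agreement lives under the supplier's subentry, which lives under
    # the default replica group, which lives under the replication context.
    dn = (
        f"cn={consumer_id},ibm-replicaServerId={supplier_id},"
        f"ibm-replicaGroup=default,{context_dn}"
    )
    lines = [
        f"dn: {dn}",
        "objectclass: top",
        "objectclass: ibm-replicationAgreement",
        f"cn: {consumer_id}",
        f"ibm-replicaConsumerId: {consumer_id}",
        f"ibm-replicaUrl: {consumer_url}",
        f"ibm-replicaCredentialsDN: {credentials_dn}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # replica.example.com is a placeholder host name, not taken from the article.
    print(agreement_ldif(
        "o=ibm,c=us",
        "Master",
        "Replica",
        "ldap://replica.example.com:389",
        "cn=ReplicaBindCredentials,o=ibm,c=us",
    ))
```

Generating the entry this way keeps the supplier ID, consumer ID and context consistent between the DN and the attributes, which is where hand-edited topologies tend to drift.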

Every agreement has an associated queue. This queue is necessary so that changes that the supplier failed to send to the consumer can be stored in it. Another need for the queue is to queue updates which are waiting on a schedule. These changes can be propagated when the communication actually begins or in the latter case, when the scheduled time arrives.
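
To make the behaviour of the queue concrete, here is a toy model in Python. It is a sketch under stated assumptions and not how TDS is implemented: the class name AgreementQueue and its methods are hypothetical, and the model assumes that changes are sent strictly in order, so that one failing change holds back everything queued behind it.

```python
from collections import deque


class AgreementQueue:
    """Toy model of the queue attached to one replication agreement."""

    def __init__(self, send):
        # send is a callable taking (change_id, operation, dn) as one tuple
        # and returning True if the consumer accepted the change.
        self.send = send
        self.pending = deque()

    def enqueue(self, change_id, operation, dn):
        """Record a change made on the supplier that still has to be sent."""
        self.pending.append((change_id, operation, dn))

    def flush(self):
        """Send pending changes in order and return how many went through.

        Sending stops at the first failure; that change stays at the head
        of the queue and blocks the changes behind it.
        """
        sent = 0
        while self.pending:
            if not self.send(self.pending[0]):
                break
            self.pending.popleft()
            sent += 1
        return sent

    def skip(self, change_id):
        """Drop one pending change and return how many were removed."""
        before = len(self.pending)
        self.pending = deque(c for c in self.pending if c[0] != change_id)
        return before - len(self.pending)


if __name__ == "__main__":
    # Pretend the consumer rejects change 20 and accepts everything else.
    queue = AgreementQueue(lambda change: change[0] != 20)
    queue.enqueue(19, "add", "cn=test1,o=ibm,c=us")
    queue.enqueue(20, "modify", "cn=test1,o=ibm,c=us")
    queue.enqueue(21, "delete", "cn=test1,o=ibm,c=us")
    print(queue.flush(), list(queue.pending))
    queue.skip(20)
    print(queue.flush(), list(queue.pending))
```

In this model a consumer that is down corresponds to a send function that always returns False, so changes simply accumulate until a later flush succeeds.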

Figure 2. Replication queue

Bind Credentials

We have already looked at bind credentials. We will do a refresh here. The supplier needs a bind DN and a bind password to bind to the consumer. On the consumer side, there needs to be a means against which this bind can be verified. The supplier in the above case had the following entry:

Listing 3. Bind credentials
dn: cn=ReplicaBindCredentials,o=ibm,c=us
objectclass: ibm-replicationCredentialsSimple
cn: ReplicaBindCredentials
replicaBindDN: cn=master
replicaCredentials: master
description: Bind method for the master to connect to the replica.

This indicates that our supplier will use the DN cn=master and the password master to bind to the consumer in whose agreement the DN for the above entry is included. It should be obvious now that bind credential entries are reusable.

On the consumer side there are two types of entries that can be used to verify the bind credentials of any supplier:

  1. This one comes from the days before IBM Directory Server 4.1. Including this entry in the configuration file of the consumer creates a user with DN cn=master and password master. Please note that this user is more powerful than the root administrator itself. Here is the entry:

    Listing 4. Master Server entry
    dn: cn=master server, cn=configuration
    cn: master server
    ibm-slapdMasterDN: cn=master
    ibm-slapdMasterPW: master
    ibm-slapdMasterReferral: ldap://
    objectclass: ibm-slapdReplication
  2. The other type of entry which can be used to verify the bind credentials of a supplier is as follows:

    Listing 5. Supplier entry
    dn: cn=Supplier Master, cn=configuration
    cn: Supplier Master
    ibm-slapdMasterDN: cn=master
    ibm-slapdMasterPW: master
    ibm-slapdReplicaSubtree: o=IBM,c=US
    objectclass: ibm-slapdSupplier

This indicates that a bind made using the DN cn=master and password master has authority to make changes to the subtree o=ibm,c=us, and nowhere else.
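
The scoping rule for ibm-slapdSupplier entries can be illustrated with a small check of whether a DN falls under a given ibm-slapdReplicaSubtree value. This is a sketch only: the function names are hypothetical, and the DN handling is deliberately naive (it splits on commas and ignores case, and does not handle escaped commas or other special characters that a real DN parser must handle).

```python
def normalize_dn(dn):
    """Naive DN normalization: split on commas, trim spaces, lowercase."""
    return [rdn.strip().lower() for rdn in dn.split(",") if rdn.strip()]


def covers(replica_subtree, target_dn):
    """Return True if target_dn is the subtree root or lies below it."""
    subtree = normalize_dn(replica_subtree)
    target = normalize_dn(target_dn)
    if not subtree or len(target) < len(subtree):
        return False
    # A DN is inside the subtree when its trailing RDNs equal the subtree DN.
    return target[-len(subtree):] == subtree


if __name__ == "__main__":
    print(covers("o=IBM,c=US", "cn=test1, o=ibm, c=us"))
    print(covers("o=IBM,c=US", "cn=ibm-policies"))
```

With the entry from Listing 5, a change to cn=test1,o=ibm,c=us is covered, while a change under a different context such as cn=ibm-policies is not and would need its own supplier entry.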

Stages in which replication works

Replication works in three stages:

  1. Connect: In the connect phase, the LDAP URL in the agreement is analyzed and an attempt is made to connect to the consumer. If the connection fails, an appropriate error is logged in the slapd log file and an attempt is made to reconnect at intervals.
  2. Bind: In the bind phase, the bind credentials on the supplier are used to bind to the consumer. Please note that this can be a simple bind, an SSL authenticated bind or a Kerberos authenticated bind.
  3. Replicate: The actual replication can occur on either a simple connection or a secure connection. The replication can be immediate or scheduled. It is in this phase that the changes are actually propagated.

Replication Debugging practices

These are some of the best practices that we advise administrators to follow when setting their replication setup right:

  1. Divide and conquer. If replication is failing at multiple points, it is best to get one supplier-consumer link working properly and then move to the other failure points.
  2. Your ibmslapd.log file is the best place to start troubleshooting. Please refer to the TDS documentation to find the exact location of this log file. In version 6.0, TDS comes with a full-fledged message guide, which explains what each entry in your ibmslapd.log file means and what response is expected from the operator. We will be covering quite a few errors later in this writeup, but the message guide can be treated as a reference.
  3. Start from the supplier, as it is the server which initiates the replication steps. Check whether it is able to connect. If it is able to connect, check whether it is able to bind, and if it is able to bind correctly, check the reason replication is failing. The first of these checks that fails is the real problem. Resolve it, and then go to the next problem area if replication still does not work.

Some common replication errors and their solutions

  • Unable to connect to replica host name on xxx port number. Please verify that the replica is started. The consumer is down or is not reachable. Always try to use the fully qualified domain name of the consumer in the LDAP URL in the agreement; using the IP address of the consumer is an even better option. If a short name is to be used, then the relevant entries should go into the hosts file (/etc/hosts on UNIX).

  • The DN of the credential entry 'entry name' defined for the replication agreement entry name cannot be found. Please check whether the credential entry which you have provided in the agreement actually exists. It might also be the case that the DN that you have provided in the agreement is not correct. Please rectify this DN.

  • Error 'error' string occurred for replica <entry name>: operation failed for entry <entry name> change ID change eid. There might be multiple problems when you encounter this message. For example, the supplier is trying to add an entry to the consumer, but the parent of that entry does not exist on the consumer. Or the supplier is trying to propagate an entry deletion, but the entry does not exist. Please check the error string and take appropriate actions.

  • Error 'error' string occurred for replica <entry name>: bind failed using masterDn <DN>. Check the credential object defined in the replication agreement and make sure the DN used to bind is correct. In addition, make sure the consumer has the proper master DN defined in the configuration file. This is particularly true for credentials of the type ibm-slapdSupplier objectclass. If an entry of this objectclass is defined in the configuration file of the consumer, it is only for that subtree. So an entry defined for say o=ibm,c=us will NOT be used to authorize a replicated change for say cn=ibm-policies. You will need to define another such entry specifically for cn=ibm-policies.

  • Replication for replica <replication agreement DN> will continue to retry the same update after receiving an error. Your replication queue has a blocked update, which is keeping other updates from flowing to the consumer. Check what is wrong, and clear the update. We will be discussing some tools that come in handy when debugging a replication setup in the next section. We will see then how we can skip a blocked update.
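
The messages above map naturally onto the three stages of replication, which suggests a simple triage helper for log lines. The sketch below is illustrative only: the substrings are taken from the messages quoted in this article, while the function name classify, the rule table and the exact layout of a real ibmslapd.log line are assumptions.

```python
# (substring to look for, replication stage, short hint), based on the
# messages listed in this article. Matching is case-insensitive.
RULES = [
    ("unable to connect to replica", "connect",
     "Check that the consumer is started and that the host in the agreement URL resolves."),
    ("credential entry", "bind",
     "Check that the credentials DN named in the agreement exists and is spelled correctly."),
    ("bind failed using masterdn", "bind",
     "Compare the credential entry with the master DN in the consumer configuration file."),
    ("will continue to retry the same update", "replicate",
     "A blocked update is holding up the queue; inspect the pending changes."),
    ("operation failed for entry", "replicate",
     "Read the error string; the entry or its parent may be missing on the consumer."),
]


def classify(log_line):
    """Return (stage, hint) for a known replication message, else None."""
    text = log_line.lower()
    for needle, stage, hint in RULES:
        if needle in text:
            return stage, hint
    return None


if __name__ == "__main__":
    sample = "Error 'error' string occurred for replica X: bind failed using masterDn cn=master."
    print(classify(sample))
```

Running every line of the supplier log through such a helper and looking at the earliest stage that reports a problem follows the connect, bind, replicate order recommended in the debugging practices.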

Some advanced replication errors and solutions

  • If you get a "Transaction Log Full" error or the server takes a long time to start, look into the replication queue. This happens in scenarios where the change queue grows a lot due to stuck replication changes.

  • When replication is set up through the Web Administration Tool, then by default all the replication queues go in suspended mode. You need to explicitly resume replication.

  • While setting up a replication topology, there might be instances where the topology setup process times out. This happens if cascaded replication is involved, where it is expected that all the replication queues are flushed out, before the setup could be completed.

Your replication toolbox

TDS, in its current form, provides operational attributes and extended operations which you can use not only for debugging replication but also for monitoring the health of your replication setup. Operational attributes are maintained by the server and hold information that affects server operation. These attributes are not returned by a search unless they are specifically requested in the search request. LDAP also includes an extension mechanism that allows additional operations to be defined for services not available elsewhere in the protocol. An extended operation allows clients to make requests and receive responses with predefined syntaxes and semantics. These may be defined in RFCs or be private to particular implementations.

Operational attributes

The following operational attributes can help administrators in debugging replication. They are extensively used by the Web Administration Tool:

  • ibm-replicationChangeLDIF: This attribute provides the pending changes in an LDIF format. Pending changes are the updates in the DIT that are to be propagated to the consumer, but are still pending in the queue. For example if we fire the following ldapadd operation on a supplier whose consumer is down:

    Listing 6. Sample ldapadd entry, when consumer is down:
    ldapadd -D cn=root -w root
    dn: cn=test1,o=ibm,c=us
    objectclass: person
    sn: test1

    The ldapsearch operation for the ibm-replicationChangeLDIF is fired as shown below, and the results follow:

    Listing 7. Querying ibm-replicationChangeLDIF:
    $ ldapsearch -D cn=root -w root -b "o=ibm,c=us" -s sub objectclass=* ibm-replicationchangeldif 
    dn: cn=test1,o=ibm,c=us
    objectclass: person
    objectclass: top
    sn: test1
    cn: test1
    ibm-entryuuid: 687fb540-3fb4-1029-88fa-9694cc6dfa8b
    control: false

    Note: The control components that you see are LDAP v3 controls that the server internally propagates with the replicated update.

  • ibm-replicationPendingChangeCount: This indicates the number of changes pending on an agreement. If there are pending changes, as in the above example, the following ldapsearch operation retrieves the pending change count for that agreement:

    Listing 7. Querying ibm-replicationPendingChangeCount:
    $ ldapsearch -D cn=root -w root -b "o=ibm,c=us" -s sub objectclass=* ibm-replicationpendingchangecount

    If any further changes are made and if they accumulate in the pending change queue, the value of ibm-replicationPendingChangeCount goes up. You can write scripts which can monitor the pending change count and alert the administrator in case it rises above a threshold value.

  • ibm-replicationPendingChanges: When an agreement is queried with this attribute, the search returns the changes that are pending on that agreement.

    Listing 8. Querying ibm-replicationPendingChanges:
    $ ldapsearch -D cn=root -w root -b "o=ibm,c=us" -s sub objectclass=* ibm-replicationpendingchanges
    ibm-replicationpendingchanges=19 add cn=test1,o=ibm,c=us
    ibm-replicationpendingchanges=20 modify cn=test1,o=ibm,c=us
    ibm-replicationpendingchanges=21 delete cn=test1,o=ibm,c=us

    This is the pending queue that we have been mentioning. Please note the output that we get when we fire an ldapsearch operation with this attribute:

    • ibm-replicationpendingchanges=19: This is a change that is pending; it has a change ID of 19.
    • add cn=test1,o=ibm,c=us: This indicates that an add operation is pending and shows the DN of the entry whose addition needs to be propagated to the replica.

    This is one of the important operational attributes which can help you to zero in on a blocking update and clear it.

  • ibm-replicationState: This attribute can be used to query the state of replication for a certain agreement.

    Listing 9. Querying ibm-replicationState:
    $ ldapsearch -D cn=root -w root -p 3389 -b "o=ibm,c=us" -s sub objectclass=* ibm-replicationstate

    A value of ready for the agreement cn=Replica,ibm-replicaServerId=Master,ibm-replicaGroup=default,o=ibm,c=us indicates that the agreement is ready to send updates as they occur. The following are the values that ibm-replicationState can take:

    • Active means that replication is going on over this agreement.
    • Waiting indicates that the agreement is currently waiting for the supplier to connect to the consumer.
    • Suspended indicates that the agreement is suspended and no more replication updates will be sent to the consumer by this agreement (until it returns to the ready state).
    • Full indicates that the queue for this agreement is full and also displays a value which indicates the amount of progress.
    • Ready indicates immediate replication mode, ready to send updates as they occur.
  • ibm-replicationThisServerIsMaster: This attribute can be queried to check whether a certain server is a master server in the topology. If "o=ibm,c=us" is the replication context, then the following search on "o=ibm,c=us" reveals whether the servers in the topology are masters or not.

    Listing 10. Querying ibm-replicationThisServerIsMaster:
    $ ldapsearch -D cn=root -w root -b "o=ibm,c=us" -s sub objectclass=ibm-replicasubentry ibm-replicationThisServerIsMaster

    This attribute tells you whether a certain server is a master in this topology. This can come in handy if you need to verify whether the topology that you loaded is valid.

  • ibm-replicationLastResult: The results of the last attempted update to this consumer, in the form: timestamp changeid resultcode operation entry DN

    Listing 11. Querying ibm-replicationLastResult:
    $ ldapsearch -D cn=root -w root -b "o=ibm,c=us" -s sub objectclass=* ibm-replicationLastResult
    ibm-replicationLastResult=20050412140436Z 19 81 add cn=testpendingchange,o=ibm,c=us
    $ ldapsearch -D cn=root -w root -p 3389 -b "o=ibm,c=us" -s sub objectclass=* ibm-replicationLastResult
    ibm-replicationLastResult=20050412142136Z 21 0 delete cn=testpendingchange,o=ibm,c=us
  • ibm-replicationLastResultAdditional: The consumer server returns an LDAP Result when replication is finished. This attribute will provide the text from the message component of the LDAP Result PDU.

  • ibm-replicationIsQuiesced: By quiescing, we mean making a replication context read-only. This attribute returns whether a certain context is quiesced or not. For example, the following LDAP extended operation quiesces the context o=ibm,c=us:

    Listing 12. Quiescing replication context:
    $ ldapexop -D cn=root -w root -op quiesce -rc o=ibm,c=us
    Operation completed successfully.

    In a quiesced state, an ldapadd operation done on the quiesced context returns an error "ldap_add: DSA is unwilling to perform". The search on the replication context looks like the following:

    Listing 13. Querying ibm-replicationIsQuiesced:
    $ ldapsearch -D cn=root -w root -b "o=ibm,c=us" -s sub objectclass=ibm-replicationcontext ibm-replicationIsQuiesced

    The following LDAP extended operation unquiesces the context:

    Listing 14. Unquiescing replication context:
    $ ldapexop -D cn=root -w root -op quiesce -rc o=ibm,c=us -end
    Operation completed successfully.
  • ibm-replicationLastFinishTime: This attribute returns the time when the last pending change was dispatched by this agreement to the consumer.

    Listing 15. Querying ibm-replicationLastFinishTime:
    $ ldapsearch -D cn=root -w root -p 3389 -b "o=ibm,c=us" -s sub objectclass=* ibm-replicationLastFinishTime

    This attribute can come in handy if you need to see when replication was last working. This can help to relate the times in the ibmslapd.log file to replication failures.
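
The monitoring scripts mentioned above mostly come down to parsing ldapsearch output. The following Python sketch shows one way to do that. It assumes the attribute=value output style seen in Listings 8 and 11; the line format for ibm-replicationPendingChangeCount is an assumption by analogy, since no sample output is shown for it, and all function names are hypothetical.

```python
def attr_values(output, attr):
    """Collect the values of one attribute from attribute=value output lines."""
    prefix = attr.lower() + "="
    values = []
    for line in output.splitlines():
        line = line.strip()
        if line.lower().startswith(prefix):
            values.append(line[len(prefix):])
    return values


def parse_pending_change(value):
    """Split '19 add cn=test1,o=ibm,c=us' into (19, 'add', 'cn=test1,o=ibm,c=us')."""
    change_id, operation, dn = value.split(" ", 2)
    return int(change_id), operation, dn


def parse_last_result(value):
    """Split 'timestamp changeid resultcode operation entryDN' into a dict."""
    timestamp, change_id, result_code, operation, dn = value.split(" ", 4)
    return {
        "timestamp": timestamp,
        "change_id": int(change_id),
        "result_code": int(result_code),
        "operation": operation,
        "dn": dn,
    }


def over_threshold(output, threshold):
    """Return True if any agreement reports more pending changes than threshold."""
    counts = [int(v) for v in attr_values(output, "ibm-replicationPendingChangeCount")]
    return any(count > threshold for count in counts)


if __name__ == "__main__":
    listing_8 = (
        "ibm-replicationpendingchanges=19 add cn=test1,o=ibm,c=us\n"
        "ibm-replicationpendingchanges=20 modify cn=test1,o=ibm,c=us\n"
        "ibm-replicationpendingchanges=21 delete cn=test1,o=ibm,c=us\n"
    )
    for value in attr_values(listing_8, "ibm-replicationPendingChanges"):
        print(parse_pending_change(value))
```

The first pending change returned for an agreement is the natural candidate when looking for a blocking update, and a nonzero result code in ibm-replicationLastResult for the same change ID supports that suspicion.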

Extended Operations

The following extended operations can be used to manage replication:

  • controlqueue: The usage for the control replication queue operation is:

    Listing 16. controlqueue exop usage:
    -op controlqueue -skip skipValue -ra agreementDn
     skipValue    all        skip all pending changes for this agreement
                  change-id  skip the specified change
     agreementDn  DN of the replication agreement

    This extended operation can be used to skip a change in the replication pending queue. This can be handy if a pending change is blocking your replication from proceeding. An example of how to use this extended operation is shown below:

    Listing 17. controlqueue exop example:
    $ ldapexop -D cn=root -w passwd123 -op controlqueue -skip all -ra \
    1 changes skipped.
  • controlrepl: The usage for the control replication operation is:

    Listing 18. controlrepl exop usage:
    -op controlrepl -action actionValue -ra agreementDn
    -op controlrepl -action actionValue -rc contextDn
     actionValue  suspend    suspend replication
                  resume     resume replication
                  replnow    start immediate replication
     contextDn    specifies the root of the replication context. The action will be performed for all agreements for this context
     agreementDn  specifies the replication agreement. The action will be performed for the specified agreement.

    When blocked, replication retries the pending updates after a certain amount of time. If you need to replicate the changes now, you can use the replnow option of the controlrepl extended operation.

    Listing 19. controlrepl exop example:
    $ ldapexop -D cn=root -w passwd123 -op controlrepl -action replnow -rc o=ibm,c=us
    Operation completed successfully.
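
When these extended operations are driven from a script, it helps to build the ldapexop argument list in one place and validate it before running anything. The following Python sketch does only that. The options -op, -skip, -ra, -rc and -action are the ones shown in the usage listings above; the function names and default bind values are assumptions for illustration, and passing a password on the command line is done here only to mirror the listings.

```python
VALID_ACTIONS = ("suspend", "resume", "replnow")


def controlqueue_command(agreement_dn, skip="all", bind_dn="cn=root", password="passwd123"):
    """Build the ldapexop arguments that skip one change, or all changes, on an agreement."""
    if skip != "all":
        skip = str(int(skip))  # a change ID must be a number
    return ["ldapexop", "-D", bind_dn, "-w", password,
            "-op", "controlqueue", "-skip", skip, "-ra", agreement_dn]


def controlrepl_command(action, agreement_dn=None, context_dn=None,
                        bind_dn="cn=root", password="passwd123"):
    """Build the ldapexop arguments for suspend, resume or replnow.

    Exactly one of agreement_dn (-ra) or context_dn (-rc) must be given.
    """
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if (agreement_dn is None) == (context_dn is None):
        raise ValueError("give exactly one of agreement_dn or context_dn")
    target = ["-ra", agreement_dn] if agreement_dn is not None else ["-rc", context_dn]
    return ["ldapexop", "-D", bind_dn, "-w", password,
            "-op", "controlrepl", "-action", action] + target


if __name__ == "__main__":
    agreement = "cn=Replica,ibm-replicaServerId=Master,ibm-replicaGroup=default,o=ibm,c=us"
    print(" ".join(controlqueue_command(agreement, skip=20)))
    print(" ".join(controlrepl_command("replnow", context_dn="o=ibm,c=us")))
```

The returned list can be handed to subprocess.run as is, which avoids shell quoting problems with DNs that contain spaces.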

How to report replication problems

You can provide the following information to help support personnel effectively solve your replication problems:

  1. Please provide a dump of the topology information. This can be easily obtained by running an ldapsearch such as ldapsearch -D cn=root -w root -b <replication context> -s sub objectclass=ibm-replica*. Also provide the credential entries that the topology is using.
  2. The ibmslapd.log files for all the servers in the topology are necessary to find the problem. If you are sure of the servers that seem to have a problem, provide the logs for those servers. If you are not sure, send the logs for all the servers. Please name these files so that they can be told apart.
  3. If the replication seems to be blocking, send the following listings:
    • ldapsearch -D cn=root -w root -b <replication context> -s sub objectclass=* ibm-replicationpendingchanges
    • ldapsearch -D cn=root -w root -b <replication context> -s sub objectclass=* ibm-replicationstate
    • ldapsearch -D cn=root -w root -b <replication context> -s sub objectclass=* ibm-replicationlastresult
    These searches need to be fired on the server which you think is failing to replicate. Further information can be provided as the support personnel need it. Providing the above information would help the support personnel to respond very quickly if it is a trivial/common problem.
