Troubleshooting multiple data center configurations
Use this information to troubleshoot multiple data center configurations, including linking between catalog service domains.
Before you begin
Procedure
- Problem: You must determine whether data replication is synchronized across container servers and catalog service domains.
Solution: Run the xscmd -c showReplicationState or xscmd -c showDomainReplicationState command. These commands display the status of replication in the environment. For more information, see Monitoring with the xscmd utility.
- Problem: You must check which catalog service domains are linked to your local catalog service domain.
Solution: Run the xscmd -c showLinkedDomains command. This command lists the foreign catalog service domains that are linked to the local catalog service domain.
- Problem: You want to detect any configuration problems with your primary shard links to catalog service domains, without going through the entire output of the xscmd -c showLinkedPrimaries command.
Solution: Use the -hc or the --linkHealthCheck option with this command. For example, xscmd -c showLinkedPrimaries -hc or xscmd -c showLinkedPrimaries --linkHealthCheck. The command verifies that the primary shards have the appropriate number of catalog service domain links and lists any primary shards that have the wrong number of links. If all primary shards are linked correctly (for example, if your domain is linked to one other domain, each primary shard is expected to have one link), you get a message that indicates they are linked:
CWXSI0092I: All primary shards for {0} data grid and {1} map set have the correct number of links to foreign primary shards.
If you discover problems, try some of the following possible solutions:
- Review your network and firewall settings to ensure that the servers that are hosting container servers in the domains can communicate with each other.
- Review the SystemOut and FFDC logs for the primary shards with the incorrect links for more specific error messages.
- Close and re-establish the link between the domains.
- Problem: Data is missing in one or more catalog service domains. For example, you might run the xscmd -c establishLink command. When you look at the data for each linked catalog service domain, for example with the xscmd -c showMapSizes command, the data looks different.
Solution: You can troubleshoot this problem with the xscmd -c showLinkedPrimaries command. This command prints each primary shard, including which foreign primaries are linked.
In the described scenario, you might discover from running the xscmd -c showLinkedPrimaries command that the first catalog service domain primary shards are linked to the second catalog service domain primary shards, but the second catalog service domain does not have links to the first catalog service domain. You might consider rerunning the xscmd -c establishLink command from the second catalog service domain to the first catalog service domain.
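The one-directional link in this scenario can be found mechanically once the links reported by each domain are written down. The following Python sketch is one way to do that comparison. The function name and the dictionary layout are hypothetical; the data is whatever you transcribe from the showLinkedPrimaries output of each catalog service domain, and nothing here calls the product.

```python
from typing import Dict, Set, Tuple

# (grid, map set, partition) identifies one primary shard.
Shard = Tuple[str, str, int]


def find_one_way_links(
    links: Dict[str, Dict[Shard, Set[str]]]
) -> Set[Tuple[str, str, Shard]]:
    """Return (from_domain, to_domain, shard) links that have no reverse link.

    links[domain][shard] is the set of foreign domains that the primary
    shard in `domain` reports as linked. This layout is an assumption made
    for illustration, not a format that xscmd produces.
    """
    one_way = set()
    for domain, shards in links.items():
        for shard, foreign_domains in shards.items():
            for foreign in foreign_domains:
                # Look up what the foreign domain reports for the same shard.
                reverse = links.get(foreign, {}).get(shard, set())
                if domain not in reverse:
                    one_way.add((domain, foreign, shard))
    return one_way


if __name__ == "__main__":
    # Hypothetical observation: domain1 links to domain2, but not the reverse.
    observed = {
        "domain1": {("inventory", "aSet", 0): {"domain2"}},
        "domain2": {("inventory", "aSet", 0): set()},
    }
    for src, dst, shard in sorted(find_one_way_links(observed)):
        print(f"{src} -> {dst} for {shard}: no link back from {dst}")
```

Each reported entry names a domain pair for which rerunning the xscmd -c establishLink command from the second domain might be worth considering.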
- Problem: The catalog service domains are not replicating data. The output of the showMapSizes or showDomainReplicationState command does not match between the catalog service domains as expected. The showLinkedPrimaries command shows links in the recovery state instead of the online state.
Diagnosis: Investigate the multi-master links between the primary shards in the recovery state. The recovery state indicates that WebSphere eXtreme Scale cannot successfully replicate between the primary shards in each catalog service domain. When a primary shard encounters an exception, it goes into an auto-recovery state and sends a ping to the foreign primary shard. If the ping is successful, replication starts again. If the ping fails, the primary shard sleeps and pings again in the future. Each primary shard is responsible for maintaining replication with its foreign primary in the foreign domain. For example, the primary shard for partition 1 in domain 1 replicates directly with the primary shard for partition 1 in domain 2.
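The recovery behavior described in this diagnosis (ping, resume on success, otherwise sleep and ping again) can be pictured with a small Python model. This is only an illustration of the described loop, not the product's implementation. The function name, the delay, and the attempt limit are assumptions; the text does not state the real interval, and it does not describe an attempt limit.

```python
import time
from typing import Callable


def wait_for_recovery(
    ping: Callable[[], bool],
    retry_delay_seconds: float = 1.0,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Model the recovery loop: ping, resume on success, else sleep and retry.

    Returns "online" once a ping succeeds, or "recovery" if every attempt
    fails. The delay and the attempt limit are illustrative assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        if ping():
            return "online"  # replication starts again
        if attempt < max_attempts:
            sleep(retry_delay_seconds)  # wait before the next ping
    return "recovery"


if __name__ == "__main__":
    # Hypothetical foreign primary that answers on the third ping.
    answers = iter([False, False, True])
    print(wait_for_recovery(lambda: next(answers), sleep=lambda _: None))
```

A link that stays in the recovery state corresponds to the case where every ping keeps failing, which is why the following steps focus on finding the exception behind the failure.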
- Review the output for the command showLinkedPrimaries and locate a shard in recovery state. Example output:
    CWXSI0068I: Executing command: showLinkedPrimaries
    CWXSI0091I: Verifying the primary shards have the correct number of links to foreign primary shards.
    *** Displaying results for inventory data grid and aSet map set.
    Expected number of online links: 1.
    *** Listing Primary Shards with the incorrect number of links for local domain: domain1, Container: server0_C-0, Server: server0, Host: myHost.rchland.ibm.com ***
    Grid Name  Map Set Name  Partition  Domain   Container     Status
    ---------  ------------  ---------  ------   ---------     -------
    inventory  aSet          0          domain2  server20_C-1  recovery
    inventory  aSet          1          domain2  server20_C-1  recovery
- Review the SystemOut or JVM logs and FFDC of a link in recovery state. In the showLinkedPrimaries example that is provided, take note of the first entry, that is partition 0, for the grid inventory and map set aSet. The local primary shard for partition 0 runs on server0 and the foreign primary shard for partition 0 runs on server20. To find out more information about the link, locate the SystemOut or JVM log file for server0. Search the file for the inventory grid for partition 0. To aid in the search, the shard identification string is formatted as objectGridName:mapSetName:partitionID in the log. In this case, the shard identification string is inventory:aSet:0. You should search for several messages in the CWOBJ1500-CWOBJ1599 range. The relevant messages for this showLinkedPrimaries example include CWOBJ1511I, CWOBJ1542I, CWOBJ1550W, and CWOBJ1551I.
Example log messages:
    ReplicatedPar I CWOBJ1511I: inventory:aSet:0 (primary) is open for business.
    PrimaryShardI I CWOBJ1542I: Primary inventory:aSet:0 started or continued replicating from foreign primary (domain2:server20_C-1). Replicating for maps: [movie, book]
    PrimaryShardI W CWOBJ1550W: The primary (inventory:aSet:0) shard received exceptions while replicating from the primary shard on the domain2:server20_C-1 primary container. The primary shard continues to poll the primary shard. Exception received: org.omg.CORBA.NO_RESPONSE: Request 180 timed out vmcid: IBM minor code:B01 completed: Maybe
        at com.ibm.rmi.iiop.Connection.getCallStream(Connection.java:2339)
        at com.ibm.rmi.iiop.Connection.send(Connection.java:2266)
        at com.ibm.rmi.iiop.ClientRequestImpl.invoke(ClientRequestImpl.java:330)
        at com.ibm.rmi.corba.ClientDelegate.invoke(ClientDelegate.java:445)
        at com.ibm.CORBA.iiop.ClientDelegate.invoke(ClientDelegate.java:1193)
        at com.ibm.rmi.corba.ClientDelegate.invoke(ClientDelegate.java:800)
        at com.ibm.CORBA.iiop.ClientDelegate.invoke(ClientDelegate.java:1223)
        at org.omg.CORBA.portable.ObjectImpl._invoke(ObjectImpl.java:484)
        at com.ibm.ws.objectgrid.partition._IDLPrimaryShardStub.queryRevision(_IDLPrimaryShardStub.java:420)
        at com.ibm.ws.objectgrid.partition.IDLPrimaryShardWrapperImpl.queryRevision(IDLPrimaryShardWrapperImpl.java:96)
        at com.ibm.ws.objectgrid.replication.PrimaryShardImpl$RevisionQueryHandler.run(PrimaryShardImpl.java:4209)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at com.ibm.ws.objectgrid.thread.XSThreadPool$Worker.run(XSThreadPool.java:309)
    CWOBJ1551I: Primary inventory:aSet:0 successfully recovered and replicated after several exceptions from the primary on domain2:server20_C-1.
When an MMR link is in recovery state, look for CWOBJ1550W messages. These messages contain the exception received during replication. If the primary shard automatically recovers, a CWOBJ1551I message occurs.
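The log search in this step can be scripted. The following Python sketch builds the shard identification string in the objectGridName:mapSetName:partitionID format and keeps only the log lines that mention that shard and carry a message ID in the CWOBJ1500-CWOBJ1599 range. The function names are hypothetical, and the sample lines are abbreviated stand-ins for real log content.

```python
import re
from typing import Iterable, List

# Message IDs in the CWOBJ1500-CWOBJ1599 range, for example CWOBJ1550W.
REPLICATION_MESSAGE = re.compile(r"\bCWOBJ15\d{2}[A-Z]\b")


def shard_id(grid: str, map_set: str, partition: int) -> str:
    """Build the objectGridName:mapSetName:partitionID string."""
    return f"{grid}:{map_set}:{partition}"


def find_link_messages(lines: Iterable[str], shard: str) -> List[str]:
    """Return log lines for `shard` whose message ID is CWOBJ1500-CWOBJ1599."""
    # The lookarounds keep inventory:aSet:1 from also matching
    # inventory:aSet:10 or a longer grid name that ends in "inventory".
    shard_pattern = re.compile(r"(?<!\w)" + re.escape(shard) + r"(?!\d)")
    return [
        line.rstrip("\n")
        for line in lines
        if REPLICATION_MESSAGE.search(line) and shard_pattern.search(line)
    ]


if __name__ == "__main__":
    # Abbreviated sample lines; the partition 1 line is hypothetical.
    sample_log = [
        "ReplicatedPar I CWOBJ1511I: inventory:aSet:0 (primary) is open for business.",
        "PrimaryShardI W CWOBJ1550W: The primary (inventory:aSet:0) shard received exceptions ...",
        "PrimaryShardI W CWOBJ1550W: The primary (inventory:aSet:1) shard received exceptions ...",
    ]
    for line in find_link_messages(sample_log, shard_id("inventory", "aSet", 0)):
        print(line)
```

Because find_link_messages accepts any iterable of lines, an open log file object can be passed in place of the sample list.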
- Review the SystemOut or JVM logs and FFDC of a link in recovery state on the foreign domain side. It is important to review the foreign primary side as well to see whether there are companion messages. If an org.omg.CORBA.NO_RESPONSE or com.ibm.ws.xsspi.xio.exception.MessageTimeOutException exception occurs, then general network issues, hung threads, database problems, or other exceptions that prevent a timely response to the caller might be the cause of the problem. To review the foreign primary side, return to the showLinkedPrimaries command output and find the server name from the foreign domain. In the provided example, the foreign primary is running on server server20 in domain2. Search on the same shard identification inventory:aSet:0 in the SystemOut or JVM logs and the FFDC. Also, look for CWOBJ7853W messages that indicate hung threads. You should also look for HMGR0152W messages that indicate processor starvation that can prevent the server from operating efficiently. In this example, searching through the FFDC revealed database exceptions.
Example FFDC:
    key = java.lang.reflect.InvocationTargetException com.ibm.ws.xs.osgi.service.BackingMapServiceHandler.invoke 90
    Exception = java.lang.reflect.InvocationTargetException
    Source = com.ibm.ws.xs.osgi.service.BackingMapServiceHandler.invoke
    probeid = 90
    Stack Dump = java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedMethodAccessor67.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
        at java.lang.reflect.Method.invoke(Method.java:611)
        at com.ibm.ws.xs.osgi.service.XSServiceHandler.invoke(XSServiceHandler.java:87)
        at com.ibm.ws.xs.osgi.service.BackingMapServiceHandler.invoke(BackingMapServiceHandler.java:74)
        at $Proxy39.batchUpdate(Unknown Source)
        at com.ibm.ws.objectgrid.map.BaseMap.applyCacheLoader(BaseMap.java:1410)
        at com.ibm.ws.objectgrid.ObjectMapImpl$CacheLoaderApplyPrivilegedAction.run(ObjectMapImpl.java:2189)
        at java.security.AccessController.doPrivileged(AccessController.java:251)
        at com.ibm.ws.objectgrid.ObjectMapImpl.internalFlush(ObjectMapImpl.java:1684)
        at com.ibm.ws.objectgrid.SessionImpl.internalFlush(SessionImpl.java:2770)
        at com.ibm.ws.objectgrid.SessionImpl.commit(SessionImpl.java:1566)
        at com.ibm.ws.objectgrid.ObjectGridImpl.applyRevision(ObjectGridImpl.java:5923)
        at com.ibm.ws.objectgrid.replication.PrimaryShardImpl$RevisionQueryHandler.run(PrimaryShardImpl.java:4138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:897)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:919)
        at java.lang.Thread.run(Thread.java:736)
    Caused by: com.ibm.websphere.objectgrid.plugins.LoaderException: ...
    Caused by: java.sql.BatchUpdateException: ORA-01013: user requested cancel of current operation ...
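To decide where to look on the foreign side, the rows of the showLinkedPrimaries output can be reduced to a list of foreign containers and shard identification strings. The following Python sketch does this under two assumptions: the output has one table row per line, and the six columns appear in the order that the example shows. The function names are hypothetical.

```python
from typing import Dict, List

COLUMNS = ("grid", "map_set", "partition", "domain", "container", "status")


def parse_link_rows(output: str) -> List[Dict[str, str]]:
    """Parse the data rows that follow the dashed separator line."""
    rows = []
    in_table = False
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if all(set(field) == {"-"} for field in fields):
            in_table = True  # the dashed line marks the start of the rows
            continue
        # A data row has six fields and a numeric partition column.
        if in_table and len(fields) == len(COLUMNS) and fields[2].isdigit():
            rows.append(dict(zip(COLUMNS, fields)))
    return rows


def foreign_search_targets(output: str) -> List[Dict[str, str]]:
    """List the foreign container and shard ID for each link in recovery."""
    return [
        {
            "domain": row["domain"],
            "container": row["container"],
            "shard_id": f"{row['grid']}:{row['map_set']}:{row['partition']}",
        }
        for row in parse_link_rows(output)
        if row["status"] == "recovery"
    ]


if __name__ == "__main__":
    sample = "\n".join([
        "Grid Name Map Set Name Partition Domain Container Status",
        "--------- ------------ --------- ------ --------- -------",
        "inventory aSet 0 domain2 server20_C-1 recovery",
        "inventory aSet 1 domain2 server20_C-1 recovery",
    ])
    for target in foreign_search_targets(sample):
        print(target["domain"], target["container"], target["shard_id"])
```

Each result names a foreign domain, the container that hosts the foreign primary, and the string to search for in that server's logs and FFDC.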
Exceptions and workarounds
(ORB) org.omg.CORBA.NO_RESPONSE
(XIO) com.ibm.ws.xsspi.xio.exception.MessageTimeOutException
These messages indicate that the transport layer could not determine whether a connection was made successfully, or that a connection was made but a response did not arrive within the configured timeout. Consider checking for the following issues:
- Are there any network problems that prevent connections? For example, the network might be intermittently down, the firewall might block ports, or the DNS service might have intermittent problems. The ORB or XIO port must be open between the two containers that are replicating data in a multi-master environment, because the primary shards on the container servers connect directly to each other.
- Are there CWOBJ messages that indicate hung threads on the remote container, such as CWOBJ7853W? If the domain uses a database, then search for database-related exceptions on the container servers, for example, com.ibm.websphere.objectgrid.plugins.LoaderException or java.sql.BatchUpdateException. Resolve the database problem.
(XIO) com.ibm.ws.xsspi.xio.exception.ConnectionRefusedException
(ORB) org.omg.CORBA.TRANSIENT
(ORB) org.omg.CORBA.COMM_FAILURE
These messages indicate that the remote server could not be contacted and that the JVM process is gone. This exception is normally temporary: the remote primary shard fails over to a new location and the links are updated. If the link does not recover, then consider the following steps:
- Check whether either domain has quorum enabled and whether the system is out of quorum. Issue the showQuorumStatus command. For more information, see Managing data center failures when quorum is enabled.
- If the domain is out of quorum, placement changes are not done.
- If the link does not recover and quorum is not the issue, check if the foreign primary is placed in a new location.
- Review the showPlacement and routetable command output for the foreign primary shard.
- If the foreign primary is not placed, or is marked as "not reachable" in the routetable output, then run the triggerPlacement command in the foreign domain.
- If the foreign primary shard is placed and reachable on a new container server, then run triggerPlacement on the local domain.
(ORB) org.omg.CORBA.OBJECT_NOT_EXIST
(XIO) com.ibm.ws.xsspi.xio.exception.ActorNotFoundException
(XIO) com.ibm.ws.xsspi.xio.exception.InvalidXIORefException
These messages indicate that the remote server can be contacted, but the foreign primary shard was not found. This exception is normally temporary: the remote primary shard fails over to a new location and the links are updated. If the link does not recover, consider the following steps:
- Check whether either domain has quorum enabled and whether the system is out of quorum. Issue the showQuorumStatus command. For more information, see Managing data center failures when quorum is enabled.
- If the domain is out of quorum, placement changes are not done.
- If the link does not recover and quorum is not the issue, check whether the foreign primary is placed in a new location.
- Review the showPlacement and routetable command output for the foreign primary shard.
- If the foreign primary is not placed, or is marked as "not reachable" in the routetable output, issue the triggerPlacement command in the foreign domain.
- If the foreign primary shard is placed and reachable on a new container server, run triggerPlacement on the local domain.
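The three exception groups above lend themselves to a simple lookup when you triage many CWOBJ1550W messages. The following Python sketch maps an exception class name that is found in a log line to its group and to a one-line reminder of the first checks. The group labels, the reminder wording, and the function name are hypothetical; only the exception class names and the commands come from the text above.

```python
from typing import Optional

# Exception class names, grouped as in the workaround descriptions.
EXCEPTION_GROUPS = {
    "timeout": (
        "org.omg.CORBA.NO_RESPONSE",
        "com.ibm.ws.xsspi.xio.exception.MessageTimeOutException",
    ),
    "unreachable": (
        "com.ibm.ws.xsspi.xio.exception.ConnectionRefusedException",
        "org.omg.CORBA.TRANSIENT",
        "org.omg.CORBA.COMM_FAILURE",
    ),
    "shard_not_found": (
        "org.omg.CORBA.OBJECT_NOT_EXIST",
        "com.ibm.ws.xsspi.xio.exception.ActorNotFoundException",
        "com.ibm.ws.xsspi.xio.exception.InvalidXIORefException",
    ),
}

# Short reminders that paraphrase the workarounds; the labels are illustrative.
FIRST_CHECKS = {
    "timeout": "Check network, firewall, DNS, hung threads (CWOBJ7853W), "
               "and database exceptions.",
    "unreachable": "Check quorum with showQuorumStatus, then review "
                   "showPlacement and routetable.",
    "shard_not_found": "Check quorum with showQuorumStatus, then review "
                       "showPlacement and routetable.",
}


def classify_exception(log_text: str) -> Optional[str]:
    """Return the group label for the first known exception in the text."""
    for group, names in EXCEPTION_GROUPS.items():
        if any(name in log_text for name in names):
            return group
    return None


if __name__ == "__main__":
    line = "Exception received: org.omg.CORBA.NO_RESPONSE: Request 180 timed out"
    group = classify_exception(line)
    print(group, "-", FIRST_CHECKS.get(group, "No matching workaround group."))
```

An exception that is not in any group returns None, which is a signal to read the full stack trace rather than rely on the lookup.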
- Problem: The multimaster replication link was dismissed, but the foreign domain or collective could not be contacted. The link is in the DISMISSING_LINK state in the monitoring console, or the link is displayed in the DISMISSING_LINK state when you run the xscmd -c showLinkedDomains -v command. The foreign domain or collective cannot be restarted or contacted to resolve the dismiss link request. The link stays in DISMISSING_LINK state because the local domain keeps trying to connect to the foreign domain to complete the dismissal request.
Solution: Run the xscmd -c dismissLink command with the -force option to dismiss the link once with the foreign domain and then clean up the local domain.