Hardware and software failures

You can use the information in this topic to identify and resolve hardware and software issues.

Table 1. Hardware and software issues
Issue: An NPS® node experiences a host failover.
Detection: Detection is automatic, through HA functionality.
Resolution: Resolution is automatic: the secondary host takes over and starts NPS.
Issue: An NPS node experiences a hardware or software failure, resulting in a temporary inability to process query or update transactions.
Detection: Node failure might be detected through a combination of eventmgr reporting, state transition events, hardware notification events, and user-developed monitoring solutions. Replication adds no additional detection or automated recovery capabilities.
Resolution: Use one of the following approaches:
  • Resolve the problem to bring the NPS node back online.
  • If the NPS node is a primary, modify the replication set configuration (by using the provided management commands) to demote the old primary and promote a new primary, as in the sketch that follows this entry.
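The exact management commands for changing node roles are release-specific. As a minimal sketch only, assuming that roles are changed with the same ALTER REPLICATION NODE ... STATE syntax that appears later in this table, and that state names for demotion and promotion exist (an assumption, not documented syntax), a manual failover might look like the following:

  -- Sketch only: the state names SUBORDINATE and PRIMARY are assumptions;
  -- consult the replication SQL reference for the states that your release supports.
  ALTER REPLICATION NODE <replsetname>.<oldprimarynodename> STATE SUBORDINATE;
  ALTER REPLICATION NODE <replsetname>.<newprimarynodename> STATE PRIMARY;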
Issue: The replication capture or apply agent on an NPS primary node fails, resulting in the inability to replicate transactions.
Detection: The relevant capture and apply processes are managed by the local startupsvr, which detects the problem.
Resolution: Resolution is automatic: the startupsvr restarts the replcapture and replapply processes.
Issue: An NPS node is unable to write to or read from the replication queue manager, because of a connectivity issue or a hardware or software issue with the local log server component.
Detection: The capture/apply agent receives an error from an API call when it attempts to read or write a transaction or to fetch the latest metadata information. A ReplPTSError event is reported through eventmgr.

All update transactions on a primary fail until this issue is resolved. On both a primary and a replica, replication is effectively blocked until the problem is resolved; all query transactions continue to work properly.

Resolution: Use one of the following approaches:
  • Resolve the issue so that replication can resume automatically.
  • If the node is a primary, manually fail over to a new primary on a subnet that does not have the connectivity or replication queue manager issue.
Issue: A communication failure occurs between replication queue manager components.
Detection: The system generates ReplMissedMetadataHeartbeat events when it detects a communication problem between the nodes in a replication set, that is, when a particular number of consecutive metadata heartbeats are not received or when a metadata heartbeat is received late. For more information about the ReplMissedMetadataHeartbeat event and how to configure the conditions for generating it, see Table 1.

You can display detailed missed-heartbeat information by using the nzreplstate -heartbeat command on each of the affected nodes; check the command output to determine which heartbeats were recently sent and received.

Resolution: Resolve the underlying network issues. You can temporarily suspend replication at the primary node to avoid building up a backlog of unprocessed transactions. The RQM (replication queue manager) software automatically and repeatedly attempts to reconnect if a connection is dropped or stops responding, and it continues processing normally after the connection is reestablished.
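For example, run the following command on each node in the replication set, then compare the sent and received heartbeat timestamps across the outputs to isolate which link is failing:

  nzreplstate -heartbeat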
Issue: An NPS node's data is damaged or corrupted.
Detection: NPS detects and reports this type of problem.
Resolution: After the NPS node is restored to service, use the nzreplanalyze, nzreplbackup, and nzreplrestore commands to restore the damaged databases, as in the sketch that follows this entry.
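The options for these commands vary by release; the following sequence is a hypothetical sketch in which the argument form is an assumption, not documented syntax (see the command reference for the actual usage):

  nzreplanalyze <dbname>    # assess the damaged database (argument form assumed)
  nzreplbackup <dbname>     # back up the database from a healthy node (argument form assumed)
  nzreplrestore <dbname>    # restore the database on the repaired node (argument form assumed)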
Issue: A replication queue manager host's data is irreparably damaged or corrupted.
Detection: A number of conditions might indicate this problem, including the following ones:
  • The machine no longer boots.
  • The operating system reports errors when reading from the partition or drive where the RQM is installed.
  • The RQM daemon does not run and reports exceptions or error messages that indicate problems.
Resolution: Reinitialize the replication queue manager host, or initialize a new one and resynchronize its contents from the other replication queue manager hosts. For details, see Initializing a replica node. If the corresponding NPS node is the current primary, you must demote it to a replica and use the nzreplanalyze, nzreplbackup, and nzreplrestore commands to restore the damaged databases.
Issue: A table becomes a versioned table when users add columns to or drop columns from it. Updating rows in a versioned table can cause the replica to suspend.
Detection: The replica suspends with the following error: Versioned tables do not support DELETE operations that join against the versioned table.
Resolution: On the replica node, perform the following steps:
  1. Issue the GROOM TABLE command with the VERSIONS option to merge the table versions.
  2. Activate the replica and restart replication by issuing the ALTER REPLICATION NODE <replsetname>.<subordinatenodename> STATE ACTIVE command.
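For example, for a hypothetical versioned table named sales in a replication set named replset1 with a subordinate node named node2, the recovery sequence is:

  GROOM TABLE sales VERSIONS;                          -- merge the table versions into one
  ALTER REPLICATION NODE replset1.node2 STATE ACTIVE;  -- reactivate the replica and restart replication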