IBM Support

Replication May Fail with Transaction Identifiers Greater Than 2.147 Billion

Troubleshooting


Problem

On a "master" PureData for Analytics (PDA) node running Netezza Replication Services, if the "postgres" transaction id counter crosses 2,147,483,647 (2^31 - 1) and then a replicated transaction is attempted, errors that contain text like (for example) "this transaction (<negative number>) was started before replication was activated at transaction <positive number>" will appear.  Replication of new transactions will be unable to proceed when this happens (although subordinates will be able to catch up to existing replicated transactions from the master that had transaction ids smaller than 2^31, barring other errors).

The following procedure can be run to determine whether a customer's replication system is at risk of running into this issue and to take corrective action if the risk is present.  This makes use of the script nzreplpgxid attached to this technical note.  The script is also available under /nz/kit/bin/adm/tools in PDA Release 7.2.1.7-P2 and above.  

Symptom

"ERROR: XactSetTransactionReplicated: this transaction (<negative number>) was started before replication was activated at transaction <positive number>"
If you see the above error, your system has already run into this problem.  Contact IBM Support immediately.  If you have not encountered this error, the diagnosis procedure below determines whether or not your system will susceptible to this problem in the near future.
Please note that many Netezza transactions have both a "postgres" transaction id and a "dbos" transaction id.  The latter typically appear in dbos.log and may legitimately be greater than 2^31 (0x80000000) or 2^32 (0x100000000); they are not related to and do not contribute to this issue.

Environment

IBM PureData for Analytics with Netezza Replication Services

Diagnosing The Problem

To determine if your system is at risk of the above issue, run nzreplpgxid -check or nzreplpgxid -check -sendmail emailaddress at each replication node, master and subordinate(s).  The nzreplpgxid script can be obtained from /nz/kit/bin/adm/tools either in Software Support Tools release 7.2.1.8 or in PDA Release 7.2.1.7-P2. This command will report whether the given node may be subject to this issue in the near future.

  • If the node is not a replication node, the command will exit immediately with the message:
    Not a replication node, nothing to do.

     
  • If the node is not at risk, the command will display the following output (numbers are examples only):
    System has 2140381210 postgres transaction ids remaining before replication may be unable to continue.

    NOTICE: Current postgres transaction id  7102431 is below the warning limit 2000000000.
     
  • If the node is at risk, the command will display the following output:
    System has 381210 postgres transaction ids remaining before replication may be unable to continue.
    Please run or schedule a convenient time to run
    'nzreplpgxid -reset [hostBackupFile]' to reset the postgres transaction id.

     
  • If the node is at risk and the -sendmail option was specified, an email will be sent to emailaddress.  The body of the email will instruct the recipient to run nzreplpgxid -check, which if followed will check the node status again (it's possible that the remediation procedure below may already have been run by someone before the recipient acts on the email) and provide necessary further instructions.

If a given node is reported as at risk, schedule a time for the remediation procedure (below) at that node.  This remediation will involve an NPS stop and start, i.e., a temporary interruption of service.  The urgency of remediation is somewhat dependent on the magnitude of the number of "postgres transaction ids remaining" reported by nzreplpgxid -check and on the rate at which transaction ids (for replicated and non-replicated transactions) are consumed at the node.  The smaller the value of the number of postgres transaction ids remaining (especially on the master node - see note below regarding subordinate nodes), the higher and more immediate the risk and the need for remediation.  A number of less than one million, say, should be considered very critical, considering that the transaction id counter is consumed by the software at a much higher rate than one per SQL transaction, and remediation should not be put off for very long.  A number of 50 million or more, say, suggests that there is more leeway for scheduling the remediation at a convenient time.   If the number is zero or negative, contact IBM Support immediately and do not proceed further.

If nzreplpgxid -check reports a node as not currently at risk, or if the node is reported as at risk but remediation is not planned very soon, nzreplpgxid -check should be run periodically at the given node, e.g., daily or weekly depending on the number of postgres transaction ids remaining.
 

Remediation may be less urgent at an at-risk subordinate node (which does not issue replicated transaction ids) than at a master node, but remember that a subordinate node may become master in the future (if/when a failover is needed, for disaster recovery or any other reason) and issue replicated postgres transaction ids which could then cross the 2^31-1 threshold.  Again, remediation should not be put off for very long.

Resolving The Problem

If a replication node has been identified as being at risk, the remediation procedure should be run at a time when interruption of service can be tolerated.  This must be run at all nodes, master and/or subordinate, that were reported by nzreplpgxid -check as at risk.  Such nodes (if there are more than one) do not all need to run the procedure at the same time. 

The remediation procedure at a node will involve multiple steps:

  1. As user nz, run nzreplpgxid -reset [hostbackupfile].  This will do the following (with appropriate feedback along the way):
    - Do the equivalent of nzreplpgxid -check, and stop if the node is not currently at risk (or is not a replication node).  This allows for the case where the user wants to skip the nzreplpgxid -check step and is willing to proceed with remediation immediately if needed, as well as for the case where remediation was already carried out.
    - Suspend replication on the node.  In the case of a master, no further replicated transactions will be permitted.  In the case of a subordinate, execution of replicated transactions from the master will suspend.
    - Write a Netezza host backup (in "tar" format) to hostbackupfile. If hostbackupfile is not specified, the host backup is written to /tmp/nzreplpgxid.hostbackup.tar.gz.
    - Clear postgres transaction ids from several replication catalog tables (by truncating these tables).
    - Vacuum the system catalog.
    - Output a message to run the remaining steps of the procedure.

     
  2. As root, run /nz/kit/sbin/nzupgrade -r -T catupgrade upgrade DOCATUPGRADE=TRUE.  This will stop NPS, and then the catalog "re-upgrade" will reset the postgres transaction id back to a small value.
     
  3. As user nz, start NPS: nzstart.
     
  4. As user nz, run nzreplpgxid -resume.  This will report an ERROR if earlier steps were not performed or failed to reset the postgres transaction id, otherwise it will activate replication on the node and display the current replication synchronization state (from view _v_replication_sync).

If execution of any step above reports an error, contact IBM Support and do not proceed further.

Document Location

Worldwide

Following zip file contains the required script:

[{"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Product":{"code":"SSULQD","label":"IBM PureData System"},"Component":"","Platform":[{"code":"PF004","label":"Appliance"}],"Version":"All Versions","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

Modified date:
17 October 2019

UID

ibm10870574