In each column, Mission: Messaging discusses topics designed to encourage you to re-examine your thinking about IBM® WebSphere® MQ, its role in your environment, and why you should pay attention to it on a regular basis.
Why the debate?
IBM WebSphere MQ logs enable recovery of persistent messages from various types of failure. When the system is running properly, the logging process is an overhead that reduces the peak messaging capacity of the system in return for increased reliability. Circular logging enables the queue manager to reconcile the status of any outstanding transactions on restart. Linear logging enables recovery from this and more drastic outages such as loss of the queue file.
If that were all there was to it, the obvious choice would always be linear logs. After all, making the messages persistent implies some expectation that they are recoverable. But the increased reliability of linear logging does not come free: it is slower and requires regular maintenance. Ultimately, the choice of logging mode is a compromise between reliability and cost. In this installment of Mission: Messaging you will learn how to evaluate the risks and costs of both so you can make the appropriate decision.
Comparing logging modes
Before proceeding, let's take a closer look at the different log modes. Protection against application, software, or power failure can be achieved with circular logs. Linear logs provide the same functionality, plus protection against media failure (a damaged queue file). Circular logging requires minimum human intervention because the queue manager automatically cycles through log extents, reusing them as needed. Linear logs are never reused and must be deleted or archived periodically. Circular logs also provide faster throughput. The additional performance cost of linear logs is from creating and formatting new extents and, if the logs are saved rather than deleted, moving the log extent to long term storage.
The following table compares the two options:
Table 1. Circular or linear logs?
| Category | Circular logs | Linear logs |
|---|---|---|
| Recovery | Circular logs are used to reconcile units of work that were outstanding at the time of failure. No provision to recover from damaged queue files. | Linear logs contain a copy of all persistent messages that are queued. In a normal restart, linear logs perform the same function as circular logs -- recovery of outstanding units of work. In addition, linear logs support recovery of data when queue files are damaged. |
| Performance | Circular logs are allocated once and then reused. Therefore no time is required to allocate and format new log extents or to delete or archive them. | New linear logs must be allocated periodically which degrades performance. In addition, the logs must be deleted or moved to prevent filling the underlying file system. Drive head contention during archive operations reduces performance. |
| Overhead | No administrative overhead is required during normal operations. | Administrators must provide for management of the log files. In addition, the file system must be monitored to prevent the log files from consuming all available space. Human processes touch the administration, operations and support teams. |
| Operational risk | The loss of a queue file results in loss of all messages on that queue. Loss of a disk partition under the queue files results in loss of all messages on that queue manager. | A normally running queue manager will eventually fill all available disk space if log files are not managed regularly. This will result in an outage of the queue manager if allowed to happen. |
This high-level comparison should give you an idea of which mode you would like to use, but to make a sound decision you need to understand the costs involved and the probability of each risk. To help with that, the next sections discuss what happens internally.
Queue file operations
There are two sets of files used by the queue manager to store message and transaction data. The queue files are the primary storage area for persistent messages and for non-persistent messages that overflow the in-memory buffer. The queue files are where persistent messages are hardened to survive planned or unplanned restart of the queue manager. There is usually one queue file for each queue. In some cases there are more because dynamic queues leave files behind that the queue manager will reuse.
Recoverability is assured by making sure the message data is on disk before allowing any subsequent operations that might jeopardize it. In general, WebSphere MQ flushes messages to disk prior to returning control to the calling program. Exceptions to this rule are made to optimize performance where transactionality is not affected.
For example, when messages are written inside of a unit of work, the disk write can be cached safely up to the point in time where the transaction is committed. When the commit call is made, all pending writes are flushed to disk before the call completes. Similarly, a persistent message written outside of a transaction to a waiting getter might never be written to the queue file at all.
Because the queue files contain all recoverable messages, the size of the queue file depends on the number of messages queued up at any point in time. As new messages are queued, the queue file size tends to grow. As messages are removed, the queue file size tends to shrink. To optimize disk operations, resizing of queue files does not occur in real time. The shrinking of the queue file always lags behind the removal of messages. As a rule of thumb, the queue file for any queue is at least as large as the amount of persistent message data currently on that queue, and because shrinking lags behind message removal it can be considerably larger. Depending on the settings for MAXMSGL (the largest allowable message) and MAXDEPTH (the maximum number of messages that a queue can hold), that can be quite large.
Non-persistent messages are normally not written to the queue file but are held in memory. As messages begin to accumulate, eventually the in-memory buffers fill and non-persistent messages spill over to the queue files. During restart, non-persistent messages in the queue are discarded unless the queue's NPMCLASS attribute is set to HIGH. In that case, the queue manager will make an effort to restore any non-persistent messages found in the log file.
Circular logs are used to contain the messages while they are inside of a transaction. During restart of the queue manager, the log files are reconciled against the queue files to determine the disposition of these transactional messages. Any messages enqueued under syncpoint are automatically removed, as if a rollback had been issued. Similarly, any messages dequeued under syncpoint are returned to the queue, as if a rollback had occurred, and become available for subsequent processing. The exception to this is when the queue manager acts as an XA resource manager. In this case, messages are still replayed from the logs, but their final disposition is determined by the system acting as the transaction manager.
Circular logs are allocated once and then reused as needed. Because the total log allocation is finite, there is no danger of the logs growing to exceed the allotted file space. Of course, this assumes that the log partition is mounted to a dedicated file system of sufficient size to contain all of the log extents. If the underlying file system is too small or is shared with other queue managers or applications, then it is possible to consume all the file space causing a queue manager outage.
The important thing to remember about circular logs is that the only messages guaranteed to be in them are those written inside a transaction. Without a copy of every persistent message, the circular logs cannot repair a damaged queue file.
Linear logging provides a superset of the functionality of circular logging. Queue manager restart operations using linear logs function the same as with circular logs: the log files are reconciled against the queue files to determine the disposition of transactional messages. In addition to the transactions under syncpoint, linear logs also contain a copy of all persistent messages. If one or more queue files are damaged, the queue can be recovered to the last known good state by replaying the linear logs. This is known as media recovery.
Unlike circular logs which are reused, the number of linear logs increases without limit as messages move through the queue manager. The amount of log data produced in a daily processing cycle is proportional to the amount of data processed as persistent messages during that same period. Each gigabyte of persistent message data processed will generate slightly more than a gigabyte of log files. The file system under the log partition must therefore be sized to hold all of the persistent messages that might pass through the queue manager in a typical processing day.
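The sizing rule above can be turned into a quick back-of-the-envelope calculation. The following Python sketch is illustrative only: the function name, the 10% log overhead (standing in for "slightly more than a gigabyte"), and the safety factor are assumptions, not values taken from the product documentation.

```python
def log_partition_gib(persistent_gib_per_day,
                      log_overhead=1.1,
                      days_between_cleanups=1.0,
                      safety_factor=1.5):
    """Estimate the log partition size in GiB for a linear logged queue manager.

    persistent_gib_per_day -- persistent message data processed per day
    log_overhead           -- assumed ratio of log data to message data
    days_between_cleanups  -- how long extents accumulate before removal
    safety_factor          -- assumed headroom for peaks and growth
    """
    daily_log_gib = persistent_gib_per_day * log_overhead
    return daily_log_gib * days_between_cleanups * safety_factor


# Example: 20 GiB of persistent traffic per day, logs cleaned up nightly.
print(f"{log_partition_gib(20):.1f} GiB")
```

Measure your own ratio of log data to message data before relying on a figure like this; the overhead depends on message sizes and how much work is done under syncpoint.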
A newly created log extent is eligible to participate in both transaction recovery and queue recovery. Suppose an application writes a persistent message to the queue under syncpoint, but no application is there to consume it. The log extent first participates in the transaction during the enqueue operation. If the queue manager fails at this point, the message will be rolled back. Next the application issues a commit and the message becomes available on the queue. The log extent still contains the message and a record of the completed transaction. It can be used to recover the message, but it is no longer needed for transaction recovery. Eventually, the message is removed from the queue. At that point, the message is no longer recoverable. When all messages in a log extent are no longer recoverable, that extent becomes inactive and is eligible for archival or deletion.
Log file operations
Now let's see how all the pieces fit together. The transactions under syncpoint at any given time are tracked using a log head pointer and a log tail pointer. New put or get activity advances the head pointer while commit or rollback calls advance the tail pointer. The maximum distance between the head pointer and the tail pointer is calculated as the size of a single log file multiplied by the number of primary and secondary log extents. This value represents the maximum amount of data that can be held under syncpoint by the queue manager at one time.
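The calculation described above is simple enough to sketch in a few lines of Python. This sketch assumes the log file size is given in 4 KB pages, as with the LogFilePages setting in the queue manager's Log stanza; the function name and the sample values are chosen for illustration only.

```python
PAGE_BYTES = 4096  # assumption: log file size is configured in 4 KB pages


def max_syncpoint_bytes(log_file_pages, primary_files, secondary_files):
    """Upper bound on the data that can be held under syncpoint at one time.

    This is the size of one log extent multiplied by the total number of
    primary and secondary extents.
    """
    extent_bytes = log_file_pages * PAGE_BYTES
    return extent_bytes * (primary_files + secondary_files)


# Example: 16384-page extents (64 MiB each), 3 primary and 2 secondary.
limit = max_syncpoint_bytes(16384, 3, 2)
print(f"{limit / 2**20:.0f} MiB")
```

In practice the usable figure is somewhat lower than this bound, because the log also carries record headers and checkpoint data.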
In the case of a long-running transaction, it is possible to exhaust all of the primary and secondary log extents. Consider the case of an application that gets a message under syncpoint and then never calls commit. Eventually, all primary and secondary extents will be used and the long-running transaction will prevent the tail pointer from advancing. When this occurs, the oldest outstanding transaction is rolled back. This frees the log tail pointer to advance, making room for the new transaction. The application holding the rolled-back transaction receives an appropriate return code and is free to retry the operation.
It is important to understand that space available for transaction recovery is always bounded by the number of log extents. This is true for linear as well as for circular logged queue managers. Because transaction recovery impacts restart time, it is necessary for WebSphere MQ to enable tuning of the maximum size and number of extents of log files. The queue manager might take a while to restart if there are many gigabytes of messages in the transaction logs. Many have wondered why it is necessary to specify primary and secondary log extents with linear logs, since they can grow indefinitely. The ability to trade off restart times against the amount of simultaneous transaction data is the reason why.
When a queue file is damaged, the administrator of a linear-logged queue manager can issue a command to recover messages in the queue. In this operation, the damaged queue file is deleted, an empty queue file is created, and the log file is parsed from the last known point of consistency. All put, get, and commit operations for that queue are replayed until the queue is restored to the state it was in after its last successful put, get, backout, or commit. The queue is then made available to the queue manager and the applications wishing to access it.
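To make the replay idea concrete, here is a toy model in Python. It is not how the queue manager is implemented; the record layout and the `replay` function are invented for this illustration. It starts from an empty queue, applies work done outside syncpoint immediately, and applies work done under syncpoint only when the matching commit record is seen.

```python
def replay(log_records):
    """Rebuild a queue's contents from an ordered list of log records.

    Each record is a tuple (op, txn, msg):
      op  -- "put", "get", "commit" or "backout"
      txn -- unit-of-work id, or None for work outside syncpoint
      msg -- message id (None for commit and backout records)
    """
    queue = []      # starts empty, like the freshly created queue file
    pending = {}    # txn id -> operations logged but not yet committed

    def apply(op, msg):
        if op == "put":
            queue.append(msg)
        else:  # "get"
            queue.remove(msg)

    for op, txn, msg in log_records:
        if op == "commit":
            for pending_op, pending_msg in pending.pop(txn, []):
                apply(pending_op, pending_msg)
        elif op == "backout":
            pending.pop(txn, None)   # discard the unit of work
        elif txn is None:
            apply(op, msg)           # outside syncpoint: takes effect at once
        else:
            pending.setdefault(txn, []).append((op, msg))

    # Work still pending at the end of the log was in flight and is
    # treated as rolled back, so it never reaches the queue.
    return queue
```

For example, replaying a put outside syncpoint, a committed put, an uncommitted get, and a backed-out put leaves only the first two messages on the rebuilt queue.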
Determining the appropriate logging mode
Circular logging sacrifices the ability to recover persistent messages from a damaged queue file in return for performance and automated log file management. The queue manager can process many more messages per second, but if a queue file is lost, so are all the messages that were on it. Selection of this logging method is most appropriate when the messages are easily recreated or when the applications involved can automatically reconcile their state.
The use of persistent messages is often an indication that neither of these conditions is true and that linear logging might be required. However, there is a cost associated with linear logging, and there are cases when this cost exceeds the financial impact of losing persistent messages. Making that determination requires a fairly accurate estimation of the costs of linear logging and the possible financial impact of losing one or more queues full of messages. If losing messages is the lesser impact, choose circular logging.
Similarly, linear logging should not be selected without a thorough understanding of its costs and risks. A normally operating linear logged queue manager will eventually consume all available disk space if the logs are not managed. If this is permitted to occur, the entire queue manager ceases to function. For some applications, the temporary unavailability of the queue manager is far worse than the loss of messages. For these applications circular logging may be indicated.
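One simple way to frame that comparison is as an expected yearly cost for each mode: likelihood multiplied by impact, plus any fixed running cost. The Python sketch below is only a framing aid; the function names are invented and every figure in it is a placeholder, not data from any real installation.

```python
def expected_annual_cost(incidents_per_year, cost_per_incident, running_cost=0.0):
    """Expected yearly cost: likelihood times impact, plus fixed upkeep."""
    return incidents_per_year * cost_per_incident + running_cost


def cheaper_mode(circular_cost, linear_cost):
    """Name the logging mode with the lower expected yearly cost."""
    return "circular" if circular_cost <= linear_cost else "linear"


# Placeholder figures only; substitute estimates for your own shop.
circular = expected_annual_cost(0.2, 50_000)           # occasional message loss
linear = expected_annual_cost(0.05, 80_000, 15_000)    # rare outage plus upkeep
print(cheaper_mode(circular, linear))
```

The arithmetic is trivial; the hard part is estimating the inputs honestly, which is what the following sections are about.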
To help make that determination, the next sections discuss the costs and risks in greater detail.
Risks of circular logging
The most common cause of damaged queue files is human error. A few real-life examples include:
- Queue files that were deleted by a system administrator responding to a disk space alarm.
- Scheduled deletion of a queue file, automated by an administrator who intended to clear messages from the queue.
- Attempting to start a primary and secondary queue manager against the same set of files.
- Backing up WebSphere MQ files while the queue manager is running.
- Users opening queue files under edit and acquiring locks on them.
- Changing the group membership of the MQ service account.
Such risks are mitigated through training and practice. When evaluating the likelihood of human error, there is a tendency to underestimate both the impact and the probability of an occurrence. For example, the impact estimate is often based on normal operation of the system, when the queues are empty or nearly so. However, the queues and logs are usually sized to hold a significant amount of data in the event of an outage. It is during such an event -- when the queues are at their high water marks and tensions are running high -- that human error is most likely to occur. When responding to a critical outage, people often take expedient actions to prevent escalation. This is precisely the time when queues tend to get lost due to disk space alarms, improper access, or improper restart of contingency systems.
The likelihood of damaged queues also tends to rise over time as the implementation matures in the organization. Although the systems and requirements tend to be well understood at deployment time, it is only routine operations that are practiced on a daily basis. These routine operations are thus reinforced while the exception procedures fade from memory. Normal staff turnover has a tendency to replace formal training with on-the-job training focused on routine tasks. This further dilutes knowledge of non-routine procedures, causing the probability of human error to tend to rise over time.
Of course, even the best trained staff can make mistakes under completely routine circumstances. Because the system depends on humans for its health and welfare, a certain amount of human error is inevitable. This is especially true where there is little or no excess capacity in the operational teams. The more time the team spends in triage mode and the less time spent on strategic activity, the more likely human error becomes.
A less common cause of damaged queue files is system error. From time to time problems are identified in which there is a possibility of damage to queue files through no fault of the human participants. When these result in changes to the code, they are identified as APARs and published. The most recent APAR that affects logging is IC60063. This APAR describes a situation in which a new circular log extent is formatted and written with disk caching enabled. If the server fails while disk writes remain cached, data might be lost or queue files damaged.
Other APARs describing conditions resulting in damaged queue files include:
- APAR SE28955: During "Record MQ Object Image," queues are compacted if necessary. While a queue is being compacted, the in-memory image and the disk image can diverge because certain flags related to group messages are not propagated correctly. The error is raised on the Record MQ Object Image request because WebSphere MQ ensures that the bad queue image is not recorded.
- APAR IC51598: Damaged object following log errors caused by a sharing violation on the log itself. A transaction is rolled back, but a STOP_ALL (and end queue manager) failure occurs between logging a CLR log record and the end transaction, so the end transaction log record is never written. This STOP_ALL also does not result in the queue manager terminating. The transaction then appears as active in a subsequent checkpoint when it should in fact have been rolled back, which causes problems during recovery and results in a damaged object.
- APAR IC53204: Damaged objects following queue file resizing. File pointers in use by queue manager threads could be left invalid following reduction of the file. This could result in data in subsequent writes to the file being lost.
Although such APARs are rare and the possibility of triggering the sequence of events to cause the problem is remote, the fact remains that queue files do occasionally become damaged without disk crashes or human error.
Risks of linear logging
Linear logging dramatically reduces the chance of lost data. The possibility is not eliminated altogether because there is always a chance of losing both the queue files and log files. This is why IBM recommends that queue files and log files are mounted on separate dedicated partitions. If the file systems are mounted to separate partitions and IBM recommendations are followed, a very high degree of recoverability can be achieved.
This recoverability comes at a price in terms of system performance and administrative overhead. Linear logging also introduces an additional risk which can potentially cause a complete outage of the queue manager. While the additional risk is well understood, the cost of implementing linear log maintenance correctly is often underestimated.
A normally operating linear logged queue manager will definitely experience an outage unless the logs are actively managed. Depending on message traffic load and disk space allocations, it might take a day or it might take a year, but eventually the logs will grow to consume all available space and the queue manager will crash unless you take steps to prevent it.
To understand the risk, it is necessary to understand the maintenance procedures. IBM provides several SupportPacs that manage linear log files. Typically, these tools use a scripting language to identify inactive log extents and dispose of them. The queue manager provides commands to inquire on which extents are active, as an option for users who wish to provide their own instrumentation for this process. Whatever tooling is used, it is typically automated to run nightly.
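The core of such a cleanup tool is small. The Python sketch below assumes extents are named in the form S0000000.LOG and that you have already obtained the two extent names the queue manager reports as the oldest required for restart and for media recovery (for example, from messages AMQ7467 and AMQ7468 in the error log). The function names are invented for this illustration, and the sketch only lists candidates; it does not delete anything.

```python
import re

EXTENT_NAME = re.compile(r"^S(\d{7})\.LOG$")


def extent_number(name):
    """Return the sequence number embedded in an extent name."""
    match = EXTENT_NAME.match(name)
    if match is None:
        raise ValueError(f"not a log extent name: {name!r}")
    return int(match.group(1))


def inactive_extents(all_extents, oldest_for_restart, oldest_for_media_recovery):
    """List extents older than both 'oldest required' extents.

    An extent is only safe to archive or delete when it is needed
    neither for restart nor for media recovery.
    """
    cutoff = min(extent_number(oldest_for_restart),
                 extent_number(oldest_for_media_recovery))
    return sorted(name for name in all_extents if extent_number(name) < cutoff)
```

Taking the minimum of the two sequence numbers is the conservative choice: an extent that is still needed for either purpose is kept.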
When evaluating the cost of linear log maintenance, most assessments consider only the scripting and automation. But like any other critical system process, something needs to make sure the script actually runs. Issues commonly encountered include:
- Due to heavy load, the file system fills before the next log archive interval.
- The log archive job or automation fails silently, allowing the file system to fill.
- Excessive MAXDEPTH and MAXMSGL settings let the queues hold more data than the capacity of the log partition.
Safe implementation of linear logs requires additional instrumentation and regular human oversight of the system. In practice, the most significant risk is failure to recognize this requirement and commit sufficient resources to implement a robust archive process. Linear logging is not unreliable, but implementing the archive process on a shoestring budget can make it appear so.
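The last issue in the list above lends itself to an automated sanity check. The following Python sketch flags queues whose worst-case contents (MAXDEPTH multiplied by MAXMSGL) would not fit in the log partition. The function names and the dictionary layout are assumptions made for this example; in a real tool the attribute values would come from the queue definitions.

```python
def worst_case_queue_bytes(maxdepth, maxmsgl):
    """Largest amount of message data a queue could hold when full."""
    return maxdepth * maxmsgl


def queues_exceeding_partition(queues, log_partition_bytes):
    """Return names of queues that could outgrow the log partition.

    queues -- dict mapping queue name to a (MAXDEPTH, MAXMSGL) tuple
    """
    return sorted(
        name
        for name, (maxdepth, maxmsgl) in queues.items()
        if worst_case_queue_bytes(maxdepth, maxmsgl) > log_partition_bytes
    )


# Example with made-up queue names and a 10 GiB log partition.
queues = {
    "APP.IN": (5000, 4 * 1024 * 1024),
    "APP.OUT": (100, 1024),
}
print(queues_exceeding_partition(queues, 10 * 2**30))
```

A stricter version would compare the sum across all queues against the partition size, since several queues can fill at once during an outage.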
Below are a couple of use cases describing how linear logging was successfully implemented. These are at opposite ends of the spectrum when it comes to complexity, but both proved very successful.
A complex linear log use case
The first case is a system that I implemented some years ago. It had a rather large footprint but it was robust and supported hundreds of linear logged queue managers. The basis of the system is SupportPac MS62: Linear Log Cleanup Utility, a Perl script that parses the error logs to identify the inactive extents and provides options to archive or delete them. In the implementation described here, the inactive logs were deleted.
The SupportPac was bundled into a wrapper script which performed three functions:
- Take a checkpoint using the rcdmqimg command.
- Execute the log archive script from SupportPac MS62.
- Report any errors.
The combined script was scheduled nightly. If we had stopped there, outages would have been a fairly common occurrence. Initially, that is exactly what happened. Eventually I added negative notifications. The original error reports let us know when something was wrong, but lack of notification could either mean that the system was healthy or it could mean that the archive job failed. The negative notifications were daily e-mails to the team that reported success of all archive jobs. If the notice did not arrive, we knew something was wrong.
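The idea behind negative notification can be sketched in a few lines of Python: compare the queue managers that should have reported against those that actually did, and treat silence as a problem. The function name and data shapes here are invented for illustration and are not taken from the system described in this article.

```python
def reconcile(expected_qmgrs, reports):
    """Split queue managers into failed and silent ones.

    expected_qmgrs -- iterable of queue manager names that should report
    reports        -- dict mapping name to True (success) or False (failure)

    Returns (failed, missing). A queue manager that never reported is
    listed as missing, which is the negative notification.
    """
    failed = sorted(q for q in expected_qmgrs if reports.get(q) is False)
    missing = sorted(q for q in expected_qmgrs if q not in reports)
    return failed, missing


# Example: QM2's archive job failed and QM3's never reported in.
failed, missing = reconcile({"QM1", "QM2", "QM3"}, {"QM1": True, "QM2": False})
print("failed:", failed, "missing:", missing)
```

The report built from this should be sent even when both lists are empty, so that a missing report is itself a signal.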
The final implementation included many components and touched many processes:
- A Web service received updates from all log archive jobs and raised an alarm if there were problems.
- A report program reconciled all of the updates from the previous night against the database of queue managers. Any failed archive jobs from the previous night were listed. In addition, the report also listed linear logged queue managers where the archive job failed to report in. This provided negative notification.
- The report was e-mailed to all MQ administrators and was also available online using a Web browser. Failure to receive the e-mail provided additional negative notification.
- System monitors reported log partitions that fell low on free space.
- All of the human processes to support the system were documented in the various teams that participated in routine and exception procedures. These included the MQ administrators, platform OS support teams, and the staff of the operations command center.
- The deployment window for new queue managers was extended slightly to include setup and testing of the linear log automation.
- The provisioning process was modified to include message traffic profiling and sizing for log files in addition to the existing message traffic profiling that was in place to size queue files.
- A sandbox environment was provided to test the archive system and to practice recovery exercises.
- A development environment was provided to house a non-production version of the system.
- New, dedicated servers were provided for the Web server front-end and database. One of these was in the production data center and one in the disaster recovery data center.
In addition, organizational commitment was required to maintain two-deep expertise on staff capable of performing maintenance of the log archiving and reporting system. As these were fairly complex, the requirement was not simply having someone on staff who knew Perl and Web services, but specifically two people who were familiar with the system and qualified to modify it to resolve an outage.
When I implemented this system, we had already developed an online tool to administer the queue managers and leveraged much of that infrastructure to build out the log archive automation. This made the system a little more complex than something you might build from scratch, but the real improvement was when we mapped out all the human processes and formalized the touchpoints to properly provision, deploy, and administer the system.
A slightly less complex linear log use case
One of my clients had a different approach that enjoyed the benefit of simplicity and was just as robust as the first case. The basic problem to be solved is that an unmonitored linear logged queue manager will eventually run out of disk space. This shop had modest space requirements, so their approach was to massively over-allocate log file storage. They calculated the amount of storage a day's worth of message traffic would consume and then added a generous margin to allow for growth. They then multiplied that by 15 to come up with a two-week buffer.
It was the job of the on-call person to monitor the disk partitions a couple of times a week. If one person neglected that duty, the on-call person the following week would be likely to catch the error. To add one more safeguard, the file system monitors were set at a very low threshold, calculated to signal a problem after two to three days. When I first heard of this system, it had already been in place for several years without any incidents.
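The monitor threshold in this approach follows directly from the buffer size: with a 15-day buffer and an alert wanted after about three days of accumulated logs, the monitor should fire at roughly 20% usage. The Python sketch below shows that arithmetic and a basic check using the standard library's `shutil.disk_usage`. The function names are invented, and the check assumes the log partition is dedicated to the logs.

```python
import shutil


def alert_threshold(buffer_days=15, alert_after_days=3):
    """Fraction of the partition in use at which the monitor should fire."""
    return alert_after_days / buffer_days


def log_partition_needs_attention(path, threshold):
    """Return True when the file system holding 'path' is at or over threshold.

    Assumes the partition is dedicated to log files, so that used space
    is a fair proxy for accumulated log extents.
    """
    usage = shutil.disk_usage(path)
    return usage.used / usage.total >= threshold


# Example: check the file system holding the current directory.
print(log_partition_needs_attention(".", alert_threshold()))
```

A threshold this low would be noisy on a tightly sized partition, but it is exactly what makes the over-allocation approach tolerant of lapses in manual checking.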
Putting it all together
There is a school of thought which holds that circular logging is generally preferred, due to queue manager outages that can occur with linear logging. However, the perceived instability of linear logged systems is not due to any frailty in the systems themselves, but largely due to failure to fund the implementation sufficiently. If the system is robustly instrumented and the human processes accounted for, linear logging can provide exceptional levels of reliability for those applications that require it.
Unfortunately, there have been many linear log implementations that lacked human oversight, redundancy, or negative notification, and which subsequently failed. This has led to the popular wisdom that linear logging introduces more risk than it mitigates. But that need not be the case. I have provided two examples where linear logging was extremely reliable. The first was complex but allowed fine tuning of disk space and early warning of problems. The other was strikingly simple in using relatively cheap disk storage to account for week-long lapses in human oversight. Yours is likely to be somewhere in the middle of these extremes.
To make the decision of linear versus circular logging you will need to calculate the costs and risks for your shop and applications. Which is more costly to your business: message loss or a queue manager outage? What would it cost in your shop to implement a truly robust process for log maintenance? Do you need the throughput that only a circular logged queue manager can provide? When you can answer these questions with confidence, you will be well equipped to make the right logging decision for your business.
- WebSphere MQ System Administration Guide, Recovery and Restart
- APAR IC60063: Potential MQ data integrity issue
- APAR SE28955: Queue becomes damaged at least 2 to 3 times a week
- APAR IC51598: Damaged object following log errors caused by a sharing violation on the log itself
- APAR IC53204: WebSphere MQ V6: Repeated queue damaged instances
- Podcast: The Deep Queue
- Author's Web page: T-Rob.net
- IBMers' Blog on Messaging
- The Vienna WebSphere MQ List server
- developerWorks WebSphere MQ forum