Recovering damaged objects
There are ways in which an IBM® MQ object can become unusable, for example because of inadvertent damage. You must then recover either your complete system or some part of it. The action required depends on when the damage is detected, whether the log method selected supports media recovery, and which objects are damaged.
You can record media images for objects so that they can be recovered if damaged. This feature is available only on queue managers that use linear logging or replicated logging and, for linear logging, only for objects that are defined as recoverable. You define which types of object are recoverable by using the IMGRCOVO and IMGRCOVQ queue manager attributes; see ALTER QMGR. If an object that is not defined as recoverable is damaged, the recovery options are the same as for circular logging.
Media recovery re-creates objects from information recorded in a linear log or replicated log. For example, if an object file is inadvertently deleted, or becomes unusable for some other reason, media recovery can re-create it. The information in the log required for media recovery of an object is called a media image.
A media image is a sequence of log records containing an image of an object from which the object itself can be re-created.
The first log record required to re-create an object is known as its media recovery record; it is the start of the latest media image for the object. The media recovery record of each object is one of the pieces of information recorded during a checkpoint.
When an object is re-created from its media image, it is also necessary to replay any log records describing updates performed on the object since the last image was taken.
Consider, for example, a local queue that has an image of the queue object taken before a persistent message is put onto the queue. In order to re-create the latest image of the object, it is necessary to replay the log entries recording the putting of the message to the queue, in addition to replaying the image itself.
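The replay described above can be illustrated with a toy model. This is only a sketch: the LogRecord type, its "image", "put", and "get" kinds, and the recreate_queue function are hypothetical names invented for illustration and do not reflect the real IBM MQ log format.

```python
from dataclasses import dataclass, field


@dataclass
class LogRecord:
    """Hypothetical, simplified log record (not the real IBM MQ format)."""
    kind: str    # "image", "put" or "get"
    queue: str   # name of the queue the record applies to
    payload: list = field(default_factory=list)  # messages carried by the record


def recreate_queue(log, queue):
    """Rebuild a queue's messages from its latest image plus later updates.

    Raises ValueError if the log holds no image for the queue.
    """
    # The latest image marks where replay has to start for this queue.
    start = max(
        i for i, rec in enumerate(log)
        if rec.queue == queue and rec.kind == "image"
    )
    messages = list(log[start].payload)
    # Replay every update to this queue that was logged after the image.
    for rec in log[start + 1:]:
        if rec.queue != queue:
            continue
        if rec.kind == "put":
            messages.extend(rec.payload)
        elif rec.kind == "get":
            for msg in rec.payload:
                messages.remove(msg)
    return messages
```

In this model, an image of an empty queue followed by two puts and one get yields the single remaining message, which mirrors why the updates logged after the image must be replayed as well as the image itself.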
At each shutdown, the queue manager automatically records media images as follows:
- Images of all process objects and queues that are not local
- Images of empty local queues
Media images can also be recorded manually using the rcdmqimg command, described in rcdmqimg. This command writes a media image of the IBM MQ object.
The queue manager records media images automatically if IMGSCHED(AUTO) is set. For information on the related IMGINTVL and IMGLOGLN attributes, see ALTER QMGR.
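As a rough sketch of both approaches, the helpers below build an rcdmqimg command line and an ALTER QMGR command for automatic images. The helper names, the queue manager name QM1, and the queue name ORDERS.IN are assumptions made for illustration. Nothing is executed here, and the exact syntax and permitted values should be checked against the rcdmqimg and ALTER QMGR reference topics.

```python
def rcdmqimg_argv(qmgr, object_type, name="*"):
    """Argument list for recording media images manually.

    The result can be passed to subprocess.run() on a system where the
    IBM MQ control commands are installed and on the PATH.
    """
    return ["rcdmqimg", "-m", qmgr, "-t", object_type, name]


def image_schedule_mqsc(interval_minutes=None, log_megabytes=None):
    """MQSC text that switches on automatic media images.

    None is rendered as OFF, which disables that particular trigger.
    """
    interval = "OFF" if interval_minutes is None else str(int(interval_minutes))
    log_len = "OFF" if log_megabytes is None else str(int(log_megabytes))
    return f"ALTER QMGR IMGSCHED(AUTO) IMGINTVL({interval}) IMGLOGLN({log_len})"
```

For example, rcdmqimg_argv("QM1", "qlocal", "ORDERS.IN") describes a manual image of one local queue, and image_schedule_mqsc(60) produces a command that could be entered in runmqsc to request an automatic image every 60 minutes.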
When a media image has been written, only the logs that hold the media image, and all the logs created after this time, are required to re-create damaged objects. The benefit of creating media images depends on such factors as the amount of free storage available, and the speed at which log files are created.
Recovering from media images
A queue manager automatically recovers some objects from their media image during startup of the queue manager. It recovers a queue automatically if it was involved in any transaction that was incomplete when the queue manager last shut down, and is found to be corrupted or damaged during the restart processing.
You must recover other objects manually, using the rcrmqobj command, which replays the records in the log to re-create the IBM MQ object. The object is re-created from its latest image found in the log, together with all applicable log events between the time the image was saved and the time the re-create command was issued. If an IBM MQ object becomes damaged, the only valid actions that can be performed are either to delete it or to re-create it by this method. Nonpersistent messages cannot be recovered in this way.
See rcrmqobj for further details of the rcrmqobj command.
The log file containing the media recovery record, and all subsequent log files, must be available in the log file directory when attempting media recovery of an object. If a required file cannot be found, operator message AMQ6767 is issued and the media recovery operation fails. If you do not take regular media images of the objects that you want to re-create, you might have insufficient disk space to hold all the log files required to re-create an object.
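Before attempting media recovery, it can help to confirm that every required log file is present. The sketch below assumes that log files are named S followed by seven digits and .LOG, that they are numbered contiguously, and that you already know the file holding the media recovery record and the current log file. All function names are hypothetical.

```python
import re

# Assumed naming convention for linear log files, for example S0000042.LOG.
_EXTENT = re.compile(r"^S(\d{7})\.LOG$")


def extent_number(name):
    """Return the sequence number encoded in a log file name."""
    match = _EXTENT.match(name)
    if match is None:
        raise ValueError(f"not a log file name: {name!r}")
    return int(match.group(1))


def missing_extents(media_log, current_log, present):
    """List the log files needed for media recovery that are not present.

    media_log   -- file containing the media recovery record
    current_log -- file currently being written
    present     -- names found in the log file directory
    """
    first, last = extent_number(media_log), extent_number(current_log)
    have = set(present)
    needed = (f"S{n:07d}.LOG" for n in range(first, last + 1))
    return [name for name in needed if name not in have]


def rcrmqobj_argv(qmgr, object_type, name):
    """Argument list for re-creating an object from its media image."""
    return ["rcrmqobj", "-m", qmgr, "-t", object_type, name]
```

If missing_extents returns an empty list, the command described by rcrmqobj_argv is more likely to find the files it needs. Otherwise, restore the listed files to the log file directory first.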
Native HA queue managers use replicated logging. Once started, they attempt, by default, automatic asynchronous recovery of eligible objects when damage is detected. Recovery might not be possible immediately if, for example, the object is in use by an application, or the log extents required for media recovery are unavailable. In these situations, asynchronous recovery is retried periodically. After the issue that prevented recovery is resolved, the object is recovered on the next retry; alternatively, you can recover the object manually by using the rcrmqobj command.
What object files exist
The queue manager stores the attributes of objects that are defined in runmqsc in files on disk. These object files are in subdirectories under the data directory of the queue manager.
For example, on AIX® and Linux® platforms, channels are stored in /var/mqm/qmgrs/qmgr/channel.
The data in these object files is the media image of the objects. If an object file is deleted or corrupted, the object stored in that file is damaged. On a queue manager that uses linear logging, you can recover damaged objects from the log by using the rcrmqobj command. Replicated logging (Native HA) queue managers automatically attempt to recover damaged objects when they are detected.
The object catalog lists all objects of all types and is stored in qmanager/QMQMOBJCAT.
The syncfile contains internal state data associated with all channels.
Queue files contain both the messages on that queue and the attributes of that queue.
The catalog and the queue manager can be recorded, but not recovered. If these objects get damaged, the queue manager ends preemptively and the objects are recovered automatically on restart.
Subscriptions are not listed in objects to record or recover, because durable subscriptions are stored on a system queue. To record or recover durable subscriptions, record or recover the SYSTEM.DURABLE.SUBSCRIBER.QUEUE instead.
Recovering damaged objects during startup
If the queue manager discovers a damaged object during startup, the action it takes depends on the type of object and whether the queue manager is configured to support media recovery.
If the queue manager object is damaged, the queue manager cannot start unless it can recover the object. If the queue manager is configured with a linear log, and thus supports media recovery, IBM MQ automatically tries to re-create the queue manager object from its media images. If the log method selected does not support media recovery, you can either restore a backup of the queue manager or delete the queue manager.
If any transactions were active when the queue manager stopped, the local queues containing the persistent, uncommitted messages put or got inside these transactions are also required to start the queue manager successfully. If any of these local queues is found to be damaged, and the queue manager supports media recovery, it automatically tries to re-create them from their media images. If any of the queues cannot be recovered, IBM MQ cannot start.
If any damaged local queues containing uncommitted messages are discovered during startup processing on a queue manager that does not support media recovery, the queues are marked as damaged objects and the uncommitted messages on them are ignored. This is because it is not possible to perform media recovery of damaged objects on such a queue manager, and the only action left is to delete them. Message AMQ7472 is issued to report any damage.
Recovering damaged objects at other times
Media recovery of objects is automatic only during startup (other than for Native HA queue managers, which use automatic recovery by default). At other times, when object damage is detected, operator message AMQ7472 is issued and most operations using the object fail with the return code MQRC_OBJECT_DAMAGED. If the queue manager object is damaged at any time after the queue manager has started, the queue manager performs a preemptive shutdown. When an object has been damaged you can delete it or, if the queue manager is using a linear log, attempt to recover it from its media image using the rcrmqobj command (see rcrmqobj for further details).
If a queue (or other object) gets damaged, MEDIALOG will not move forward. This is because MEDIALOG is the oldest extent required for media recovery. If your workload is continuing, CURRLOG will still be moving forward and so new extents will be written. Depending on your configuration (including your LogManagement setting), this might start filling your log filesystem. If the log filesystem fills completely, transactions get rolled back, and the queue manager might end abruptly. So when a queue gets damaged, you might have only a limited amount of time to act before your queue manager ends. How much time you have depends on the rate at which your workload is causing the queue manager to write new extents, and the amount of free space you have in your log filesystem.
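The time available can be estimated with simple arithmetic. The following sketch assumes a constant rate of extent creation and a fixed extent size. The function name and all the figures are illustrative, not values taken from any real system.

```python
def minutes_until_log_full(free_bytes, extent_bytes, extents_per_hour):
    """Estimate how many minutes remain before the log filesystem fills.

    free_bytes       -- free space in the log filesystem
    extent_bytes     -- size of one log extent
    extents_per_hour -- observed rate at which new extents are written
    """
    if extents_per_hour <= 0:
        # No new extents are being written, so the filesystem is not filling.
        return float("inf")
    # Only whole extents fit into the remaining space.
    extents_left = free_bytes // extent_bytes
    return extents_left / extents_per_hour * 60


# Example: 10 GiB free, 64 MiB extents, 40 new extents per hour.
remaining = minutes_until_log_full(10 * 1024**3, 64 * 1024**2, 40)
print(f"About {remaining:.0f} minutes before the log filesystem is full")
```

With these example figures there is room for 160 more extents, which at 40 extents per hour leaves about four hours to recover the object or quiesce the workload.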
If you are using manual log management, you might be archiving extents not needed for restart recovery, and then deleting them from the log filesystem, even though they are still needed for media recovery. This is acceptable as long as you can restore them from your archive when needed. This policy does not cause your log filesystem to fill when a queue gets damaged and MEDIALOG stops moving forward. However, if you only archive and delete extents that are not needed for either restart or media recovery, your log filesystem starts to fill if a queue gets damaged.
If you are using automatic or archive log management, the queue manager will not reuse extents that are still needed for media recovery, even though you might have archived them and notified the queue manager using SET LOG ARCHIVED. Consequently, if a queue gets damaged, your log filesystem will start filling.
If a queue gets damaged, OBJECT DAMAGED FFDCs are written and MEDIALOG stops moving forward. You can identify the damaged object from the FFDC, or because it is the object with the oldest MEDIALOG when you display its status in runmqsc.
If your log filesystem is filling, and you are concerned that your workload is getting backed out because the log filesystem is becoming full, then recovering the object or quiescing your workload might stop this from happening.
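As an illustration of the second approach, the sketch below picks the queue with the oldest MEDIALOG from values you have already collected, for example from the output of DISPLAY QSTATUS(*) TYPE(QUEUE) MEDIALOG in runmqsc. The function name and the input mapping are hypothetical, and the sketch assumes zero-padded extent names such as S0000003.LOG, so that alphabetical order matches numerical order.

```python
def oldest_medialog(queue_medialogs):
    """Return (queue, extent) for the queue with the oldest MEDIALOG.

    queue_medialogs maps queue names to MEDIALOG values. Queues with an
    empty value are ignored. Returns None if no queue has a MEDIALOG value.
    """
    candidates = {q: ext for q, ext in queue_medialogs.items() if ext}
    if not candidates:
        return None
    # Zero-padded names sort in the same order as their sequence numbers.
    queue = min(candidates, key=lambda q: candidates[q])
    return queue, candidates[queue]
```

The result is only a candidate. A queue can have the oldest MEDIALOG simply because no recent image was recorded for it, so confirm the damage against the FFDC or message AMQ7472 before acting.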
Native HA queue managers, which use replicated logging, attempt automatic recovery of damaged objects. Once started, they attempt, by default, asynchronous recovery when object damage is detected. Recovery might not be possible immediately if, for example, the object is in use by an application, or the log extents required for media recovery are unavailable. In these situations, asynchronous recovery is retried periodically. After the issue that prevented recovery is resolved, the object is recovered on the next retry; alternatively, you can recover the object manually by using the rcrmqobj command.