IBM Support

Recovering from lost or damaged FILE volumes in a deduplicated storage pool

Troubleshooting


Problem

Recovering from a lost or damaged FILE volume in a Tivoli Storage Manager deduplicated storage pool.

Environment

All supported V6/V7 Tivoli Storage Manager server environments using deduplicated file storage pools. This document does not apply to 7.1.3.000+ deduplicated directory-container pools.

Resolving The Problem

Removing a deduplicated volume can potentially affect other deduplicated volumes and introduce problems into the deduplication engine.

If a primary file volume in a deduplicated file storage pool has been lost from the underlying operating system, or is damaged badly enough that it must be removed from the Tivoli Storage Manager inventory, follow the procedure below. The procedure ensures that any data that can be recovered from a copy storage pool or a replication target server is recovered before the volume is removed. It also describes how to determine whether deleting the volume has introduced problems into the deduplication engine that require further assistance from IBM support.

If you are not using copy storage pools or replication to protect your primary deduplicated file storage pool, then contact IBM support to investigate other clean-up options.

The high-level summary of the process is as follows:

Part 1: Recover as much of the data on the damaged/lost volume(s) as possible.
Part 2: Remove any remaining irrecoverable data from the damaged/lost volume(s).
Part 3: Recover any data that was referencing irrecoverable data that was removed.
Part 4: Remove any remaining data that cannot be recovered.

Part 1: Recovering data using a copy storage pool (all supported levels) and/or a node replication target (7.1.1.0+ only):



1. For any file volume that no longer exists and cannot be mounted (for example, the file volume was unexpectedly removed from the underlying filesystem), update the missing volume(s) to DESTROYED access:

    IMPORTANT: You must identify and update all known missing volumes at this step.

    UPDATE VOLUME <volume name> ACCESS=DESTROYED

2. For any file volume that still exists and can be mounted, but may contain damaged objects (for example, some objects on the volume cannot be accessed during read operations), update the volume(s) to READONLY access and audit them:

    IMPORTANT: You must identify, update and audit all known damaged volumes at this step.

    UPDATE VOLUME <volume name> ACCESS=READONLY
    AUDIT VOLUME <volume name>

3. Manually initiate client backups, or wait for all normally scheduled clients to run a complete backup cycle, in an attempt to recover any damaged data that may still exist on the client filesystems.

4. Regardless of whether a copy storage pool exists or not, issue the RESTORE STGPOOL command against the storage pool containing the damaged or lost volumes (and wait for the process to end).

    RESTORE STGPOOL <stgpool name> PREVIEW=NO MAXPROCESS=<n>

    This process also starts a silent background relinker process that attempts to locate a valid copy of each damaged chunk elsewhere in the pool and relink the data to it. This can reduce the overall scope of the damage, which is why the command is recommended regardless of whether a copy storage pool exists.

5. If the data is (or might be) replicated, attempt to use node replication to recover the affected data on the missing or damaged volume(s) by issuing the following command and waiting for the process to end (monitor the process on the source and target servers):

    REPLICATE NODE * RECOVERDAMAGED=ONLY WAIT=YES

6. For any file volume(s) identified in step 2 above (READONLY), attempt to move any existing valid data from those volumes to other new volumes in the same storage pool (and wait for the process to end). Do not move the data to a different storage pool, and do not issue this command for missing volume(s) identified in step 1:

    MOVE DATA <volume name>

7. Issue the following commands to determine if there are any objects or referenced deduplicated base chunks remaining on any of the volumes identified in steps 1 (DESTROYED) or 2 (READONLY) above:

    QUERY CONTENT <volume name> FOLLOWLINKS=NO

    If this command lists objects, you have experienced irrecoverable backup data loss. The list of files returned is a list of irrecoverable objects and their owners (node names). If there are no objects listed or the volume can no longer be found, every object on this volume was recovered successfully using either a copy storage pool or node replication.

    QUERY CONTENT <volume name> FOLLOWLINKS=JUSTLINKS

    If this command lists objects, then objects stored on other volumes need to be recovered due to damaged deduplicated base chunks on this volume. The list of files returned is a list of affected objects on other volumes that will need to be recovered.

    If neither command returned objects, recovery is complete and nothing further is required. If one or more of these QUERY CONTENT commands returned objects, then continue with the remaining steps within this document.
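
The Part 1 command sequence can be sketched as a small script that generates the administrative commands in step order. This is an illustration only: the function names `part1_commands` and `part1_outcome`, their parameters, and the default MAXPROCESS value are hypothetical and not part of any IBM tooling. The sketch only builds command strings. Issuing them, and waiting for each process to end before moving on, remains a manual task.

```python
def part1_commands(destroyed, readonly, stgpool, maxprocess=1):
    """Return the Part 1 administrative commands in the order of steps 1-7.

    destroyed: volumes that no longer exist (step 1)
    readonly:  volumes that still exist but may hold damaged objects (step 2)
    Step 3 (client backups) is not a server command and is omitted here.
    """
    cmds = []
    # Step 1: mark every missing volume as destroyed.
    for vol in destroyed:
        cmds.append(f"UPDATE VOLUME {vol} ACCESS=DESTROYED")
    # Step 2: make damaged volumes read-only, then audit them.
    for vol in readonly:
        cmds.append(f"UPDATE VOLUME {vol} ACCESS=READONLY")
        cmds.append(f"AUDIT VOLUME {vol}")
    # Step 4: restore the pool; issued whether or not a copy pool exists.
    cmds.append(f"RESTORE STGPOOL {stgpool} PREVIEW=NO MAXPROCESS={maxprocess}")
    # Step 5: recover damaged data from the replication target.
    cmds.append("REPLICATE NODE * RECOVERDAMAGED=ONLY WAIT=YES")
    # Step 6: move valid data off read-only volumes only, never destroyed ones.
    for vol in readonly:
        cmds.append(f"MOVE DATA {vol}")
    # Step 7: check both kinds of volume for remaining objects and links.
    for vol in list(destroyed) + list(readonly):
        cmds.append(f"QUERY CONTENT {vol} FOLLOWLINKS=NO")
        cmds.append(f"QUERY CONTENT {vol} FOLLOWLINKS=JUSTLINKS")
    return cmds


def part1_outcome(objects_on_volume, objects_linked):
    """Interpret the object counts from the two step 7 QUERY CONTENT commands."""
    if objects_on_volume == 0 and objects_linked == 0:
        return "recovery complete"
    return "continue with Part 2"
```

Keeping the destroyed and read-only volumes in separate lists mirrors the procedure: MOVE DATA is generated only for the read-only volumes, while both lists are checked with QUERY CONTENT.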

Part 2: Removing data that cannot be recovered from the missing or damaged volumes:


8. For any file volume(s) identified in step 2 (READONLY) above, ensure that all unreadable data remains marked as damaged by initiating an audit (and wait for the process to end):

    AUDIT VOLUME <volume name> FIX=YES

9. For any file volume(s) identified in step 2 (READONLY) above, attempt to move any remaining valid data from those volumes to other volumes in the same storage pool and wait for the process to end (do not move the data to a different storage pool):

    MOVE DATA <volume name>

    IMPORTANT: Ensure that the MOVE DATA processes end with success. If they end with failure, review the activity log to determine why they failed. If the processes failed because of resource contention (for example, a lock conflict), reissue the command at a later time. Otherwise, stop and contact IBM support for further review of the failure.

10. For any file volume(s) identified in either step 1 (DESTROYED) or 2 (READONLY) above, remove the volume(s) and their remaining irrecoverable data from the Tivoli Storage Manager inventory using the following command (and wait for the process to end):

    WARNING:
    Before proceeding, determine whether any ANR4895E (invalid links) errors have occurred recently (QUERY ACTLOG SEARCH=ANR4895E BEGIND=TODAY-45). If so, there is existing damage in this storage pool that must be addressed BEFORE proceeding. To address these errors, first run the dedupAuditTool.pl script for the ANR4895E symptom as outlined in the following TechNote: Auditing and repairing a deduplicated file storage pool. When the script results are available, contact IBM Support for further recovery steps. After the existing damage has been recovered, continue with the remainder of this TechNote.

    IMPORTANT: Be absolutely sure that the previous steps in this TechNote have been followed correctly before deleting the volume. Failure to do so may cause unintended data loss or corruption of the deduplication catalog.

    NOTE: Deleting the volume will remove any of the remaining and irrecoverable objects on that volume. Record the object names and owners (node name) before deleting the volume and attempt to back them up from the owning node again later if possible. This information was previously collected with the first QUERY CONTENT command in step 7 above.

    DELETE VOLUME <volume name> DISCARDD=YES

    This command creates "invalid links" for the objects that reference data on this volume. If the command reports that the volume no longer exists (ANR2401E), recovery completed at an earlier step, but you should still continue with the remaining steps in this document.
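
The ordering constraints in Part 2 can be sketched as a guard in front of the delete step. This is a hypothetical helper, not an IBM-provided tool: the names `part2_audit_and_move`, `part2_delete_commands`, `anr4895e_count`, and `move_data_succeeded` are assumptions made for this illustration. The caller is assumed to have already counted the ANR4895E messages returned by QUERY ACTLOG and to have checked the MOVE DATA completion state in the activity log.

```python
def part2_audit_and_move(readonly):
    """Steps 8 and 9: apply only to the read-only volumes from step 2."""
    # Step 8: audit every read-only volume so unreadable data stays marked damaged.
    cmds = [f"AUDIT VOLUME {vol} FIX=YES" for vol in readonly]
    # Step 9: then move remaining valid data within the same storage pool.
    cmds += [f"MOVE DATA {vol}" for vol in readonly]
    return cmds


def part2_delete_commands(destroyed, readonly, anr4895e_count, move_data_succeeded):
    """Step 10: build DELETE VOLUME commands only when the preconditions hold."""
    if anr4895e_count > 0:
        # Existing invalid links must be addressed before any volume is deleted.
        raise RuntimeError("ANR4895E found: address existing damage first")
    if not move_data_succeeded:
        # A failed MOVE DATA needs review (or a retry) before deleting.
        raise RuntimeError("MOVE DATA did not end with success")
    volumes = list(destroyed) + list(readonly)
    return [f"DELETE VOLUME {vol} DISCARDD=YES" for vol in volumes]
```

Raising an error instead of returning an empty list makes it harder to skip the WARNING and IMPORTANT checks by accident.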

Part 3: Recovering objects referencing damaged data ("invalid links"):


11. Scan and validate the deduplicated storage pool to determine if deleting the volume invalidated any links to base data:

    VALIDATE EXTENTS <deduplicated stgpool> ACTION=MARKDAMAGED PREVIEW=NO

12. Review the activity log to determine the results of the above step. The results will look similar to the following:

    07/09/2015 09:18:12   ****  VALIDATE EXTENTS CURRENT TOTALS FOR dedup ****
    07/09/2015 09:18:12   Validate Extents: Total invalid          :      0
    07/09/2015 09:18:12   Validate Extents: Total deleted          :      0
    07/09/2015 09:18:12   Validate Extents: Total damaged (in pool):      0
    07/09/2015 09:18:12   Validate Extents: Total damaged (not in pool):  0
    07/09/2015 09:18:12   ANR0985I Process 5 for VALIDATE EXTENTS running in the
                          BACKGROUND completed with completion state SUCCESS at
                          09:18:12 AM.

    If the VALIDATE EXTENTS command reported zero invalid and zero damaged objects, then recovery and clean-up are complete. If it reported one or more objects as either invalid or damaged, then continue with the remaining steps within this document.

13. If a copy storage pool exists, attempt to restore any affected data on the missing or damaged volume(s) by issuing the following command (and wait for the process to end):

    RESTORE STGPOOL <stgpool name> PREVIEW=NO MAXPROCESS=<n>

14. If the data is (or might be) replicated, attempt to use node replication to recover the affected data on the missing or damaged volume(s) by issuing the following command and waiting for the process to end (monitor the process on the source and target servers):

    REPLICATE NODE * RECOVERDAMAGED=ONLY WAIT=YES

15. Repeat steps 11 and 12 to verify that no further issues are reported after the RESTORE STGPOOL or REPLICATE NODE RECOVERDAMAGED recovery attempt. If no further objects are invalid or damaged, then recovery is complete. If problems are still reported, then continue with the next step.
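
The check of the VALIDATE EXTENTS totals in steps 12 and 15 can be sketched as a small parser over activity log text. It assumes the log lines have the same "Validate Extents: Total <label>: <count>" shape as the sample output in step 12. The function names are hypothetical, and real output may differ between server levels.

```python
import re

# Matches lines such as "Validate Extents: Total damaged (in pool):      0".
TOTAL_RE = re.compile(r"Validate Extents: Total\s+(.+?)\s*:\s*(\d+)\s*$")


def parse_validate_totals(log_text):
    """Return a dict mapping each total's label to its integer count."""
    totals = {}
    for line in log_text.splitlines():
        match = TOTAL_RE.search(line)
        if match:
            totals[match.group(1)] = int(match.group(2))
    return totals


def recovery_complete(totals):
    """True only when the invalid and damaged totals are present and all zero."""
    keys = ("invalid", "damaged (in pool)", "damaged (not in pool)")
    # A missing total is treated as "not complete" rather than assumed to be zero.
    return all(key in totals and totals[key] == 0 for key in keys)
```

The "deleted" total is parsed but not used in the decision, because the procedure only asks whether objects were reported as invalid or damaged.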

Part 4: Removing any remaining deduplicated data that cannot be recovered:


16. Review the following TechNote to download and start a dedupAuditTool.pl scan using the INVALIDATED_LINKS symptom code: Auditing and repairing a deduplicated file storage pool. Once the script has started, contact IBM support for further assistance in resolving the remaining damage that could not be recovered. The script output, once complete, will be reviewed by the support team and further instructions will be provided to you.

Product Synonym

ITSM ADSM TSM Spectrum protect

Document Information

Modified date:
17 June 2018

UID

swg21883611