Question & Answer
How does IBM Spectrum Protect validate that deduplicated data is correct and unaltered when the data is restored to a client?
IBM Spectrum Protect has many capabilities to validate that deduplicated data is unaltered.
To begin, an understanding of how IBM Spectrum Protect performs deduplication is necessary. This discussion focuses primarily on the current and preferred method of implementing data deduplication, namely, by using directory or cloud-based container storage pools.
General description of the data deduplication scheme
IBM Spectrum Protect uses a software-based data deduplication scheme. The process works as follows:
- Data is analyzed by using a fingerprint algorithm, which identifies the deduplication extents (sometimes referred to as chunks). The fingerprint algorithm is a variable block algorithm. For this reason, the resulting extents can be of varying sizes and are typically in the range 50 KB - 300 KB.
- A given extent is then assigned an identifier by performing a Secure Hash Algorithm 1 (SHA-1) hash of the data in that extent.
- An extent is identified by a number of characteristics. For this discussion, the main characteristics to consider are the size of the extent and its hash value.
- When a file is analyzed and the deduplication extents are identified, an end-to-end MD-5 checksum of the data is also calculated, covering the file from beginning to end.
- Each file that is deduplicated has an assembly map. That map indicates which extents are required to reassemble or "rehydrate" that file. The map includes a list of extents and the assembly order.
The previously described steps are done whether the data is deduplicated on the client or server. The IBM Spectrum Protect database stores the following information about deduplicated extents: the chunk size and hash, which chunks are used to compose a file, and the end-to-end MD-5 checksum.
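The steps above can be illustrated with a minimal Python sketch. It is not IBM's implementation: the boundary rule in `split_extents` is a toy stand-in for the fingerprint algorithm, the extent sizes are far smaller than the 50 KB - 300 KB range, and the names `split_extents`, `store_file`, and `extent_store` are hypothetical. It shows only the general idea of variable-size extents that are keyed by size and SHA-1 hash, together with an assembly map and an end-to-end MD-5 checksum.

```python
import hashlib

def split_extents(data: bytes, min_size: int = 64, mask: int = 0x3F) -> list:
    """Toy variable-block chunking (assumption, not the real fingerprint algorithm).

    A boundary is placed where a simple rolling value matches a bit pattern,
    so extent sizes depend on the content and vary.
    """
    extents, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        if i + 1 - start >= min_size and (rolling & mask) == 0:
            extents.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        extents.append(data[start:])  # trailing extent
    return extents

def store_file(data: bytes, extent_store: dict) -> dict:
    """Deduplicate one file and return its record (assembly map plus MD-5)."""
    assembly_map = []
    for extent in split_extents(data):
        # An extent is identified here by its size and its SHA-1 hash.
        key = (len(extent), hashlib.sha1(extent).hexdigest())
        extent_store.setdefault(key, extent)  # keep each unique extent once
        assembly_map.append(key)              # record the assembly order
    return {"assembly_map": assembly_map, "md5": hashlib.md5(data).hexdigest()}
```

Storing the same file twice with `store_file` adds no new entries to `extent_store`; only a second record with the same assembly map is produced.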
Validation of data integrity
Within the described process, the following capabilities are available to validate the integrity of data when it is retrieved. The type of retrieval request depends on the IBM Spectrum Protect service that is in use: backup data is restored, archive data is retrieved, and HSM migrated data is recalled. All three services can use data deduplication and can read back the data. This discussion generally uses the term retrieval to mean a read of data in support of any of these services (backup, archive, or HSM).
The primary validation occurs when the data is retrieved by a client. An end-to-end MD-5 checksum is generated for the data and it is compared to the original MD-5 checksum that was calculated when the data was originally stored. If the checksums do not match, the IBM Spectrum Protect server reports an error.
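As a rough illustration of this check, the following Python sketch rehydrates a file from its assembly map and compares a freshly computed MD-5 checksum with the stored one. The record layout, the `rehydrate` function, and the `IntegrityError` class are assumptions made for the example, not part of the product.

```python
import hashlib

class IntegrityError(Exception):
    """Raised when rehydrated data does not match the stored checksum."""

def rehydrate(record: dict, extent_store: dict) -> bytes:
    """Reassemble a file from its extents and validate it end to end."""
    data = b"".join(extent_store[key] for key in record["assembly_map"])
    if hashlib.md5(data).hexdigest() != record["md5"]:
        raise IntegrityError("end-to-end MD-5 mismatch")
    return data

# Hypothetical stored state: two extents keyed by (size, SHA-1 hash).
extents = [b"hello ", b"world"]
store = {(len(e), hashlib.sha1(e).hexdigest()): e for e in extents}
record = {
    "assembly_map": list(store),  # keys in assembly order
    "md5": hashlib.md5(b"hello world").hexdigest(),
}
restored = rehydrate(record, store)
```

If any extent in `store` is altered after the record is created, `rehydrate` raises `IntegrityError` instead of returning data.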
In addition to the MD-5 end-to-end validation that is performed when data is retrieved, the following operations can validate the data:
- During server data movement operations (such as running the PROTECT STGPOOL or REPLICATE NODE commands), when data is read, validation is performed on the deduplicated extents. The extents have headers and trailers that are used by IBM Spectrum Protect to manage the data. These headers and trailers are examined and validated when an extent is accessed. If a bit-level error impacts one of these headers or trailers, the error might be detected. In case of a more significant error, such as an overwrite of data, damage to a block or sector on disk, or destruction of a block or sector on disk, the error can also be detected through this validation.
- The server provides a command that is named AUDIT CONTAINER. The behavior of this command differs for directory-container storage pools versus cloud-container storage pools.
For directory-container storage pools, the audit operation reads all the extents from a container and validates their attributes. The operation reanalyzes each extent, recalculates its SHA-1 hash, and compares the result with the value that is stored in the database. If the stored and recalculated hash values do not match, the extent is marked as damaged and the administrator is alerted. Some IBM Spectrum Protect customers schedule the AUDIT CONTAINER command to run periodically, either against the entire pool or against a sample of the data in the pool.
For cloud-container storage pools, the audit operation requests that an entity tag (ETag) be returned from the cloud. The ETag from the cloud is compared to the ETag that was recorded when the data was stored. If the ETag from the object storage system does not match what the server recorded, the container and all the chunks in it are marked as damaged.
If a deduplicated extent is marked as damaged, IBM Spectrum Protect can take steps to repair that extent, for example, by bringing it back from a replicated copy if the PROTECT STGPOOL or REPLICATE NODE commands are used. The considerations and procedures for that are beyond the scope of this discussion.
- Cloud-container storage pools provide an additional level of validation. You can configure cloud-container storage pools to encrypt the data while at rest. This is an attribute of the storage pool on the IBM Spectrum Protect server. For storage pools configured to encrypt data while at rest, the encryption algorithm detects alteration of the data as well. When the data is retrieved by a client, the decryption of the data detects whether or not the data was altered. If the decryption fails because the data was altered, the extent is marked as damaged and the administrator is alerted.
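The directory-container audit described above can be sketched in Python as follows. This is a simplified model, not the AUDIT CONTAINER implementation: the container is represented as a plain dictionary of extent data, the database as a `catalog` dictionary of recorded sizes and SHA-1 hashes, and the name `audit_container` is hypothetical.

```python
import hashlib

def audit_container(container: dict, catalog: dict) -> set:
    """Return the IDs of extents whose size or SHA-1 hash no longer matches."""
    damaged = set()
    for extent_id, data in container.items():
        expected_size, expected_sha1 = catalog[extent_id]
        recalculated = hashlib.sha1(data).hexdigest()
        if len(data) != expected_size or recalculated != expected_sha1:
            damaged.add(extent_id)  # would be marked as damaged
    return damaged

# Hypothetical container and the values recorded when it was written.
container = {"e1": b"extent payload", "e2": b"another extent"}
catalog = {eid: (len(d), hashlib.sha1(d).hexdigest()) for eid, d in container.items()}

container["e2"] = b"anothex extent"  # simulate an altered byte on disk
print(audit_container(container, catalog))  # prints {'e2'}
```

Running such a check over a sample of containers rather than the whole pool trades coverage for shorter run time, which mirrors the scheduling choice that is left to the administrator.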
Data deduplication is an essential technology in today's world. Users want to trust that the data is not only backed up in a space-efficient way, but that the data is usable when retrieved. IBM Spectrum Protect has implemented data deduplication with appropriate checks and balances to validate the data and to ensure that, when data is returned to a client, the data is a byte-for-byte match of what was originally stored. The validation capabilities that IBM Spectrum Protect provides will alert administrators to any detected issues. In addition, administrators have the flexibility to decide what is appropriate for their environment, such as whether to schedule periodic AUDIT CONTAINER operations.
17 June 2018