Scanning agent
All Slicestor nodes scan for missing Slices. To throttle scanning operations, each node limits the number of listing requests that it processes at one time.
Because scanning is cooperative, Slicestor appliances do not list areas that other Slicestor appliances are actively scanning. A full scan of the IBM Cloud Object Storage System™ should take 48 hours, so any issue with a Slice should be detected within 48 hours.
To maintain a high level of availability, the system should allow read and write operations to continue even when one or more Slicestor appliances are unreachable. Because the system is distributed and holds a massive amount of data, there is no centralized index that can tell a Rebuild Agent which Slices were written during an outage.
To ensure that drive failures and further outages do not render data unreadable, the Scanning Agent requests information about which Slice names currently reside on each Slicestor appliance. To allow this process to continue even when certain Slicestor appliances are unable to scan, Scanning Agents coordinate in a distributed fashion. When Slicestor appliances receive requests about certain Slice names, they record that these names have been scanned. They then use this record to scan any Slice names that have not yet been scanned. This ensures that missing Slices are found quickly and that Slicestor appliances do not duplicate work. When the Slicestor appliances find that together they have checked every Slice stored on the system, the process restarts.
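The coordination described above can be illustrated with a minimal sketch. It is not the product's implementation: the `ScanCoordinator` class, its method names, and the idea of dividing Slice names into named ranges are all assumptions made for illustration. It shows only the bookkeeping: an incoming listing request marks a range as scanned, each appliance picks work from the ranges not yet marked, and the pass restarts once every range is covered.

```python
from dataclasses import dataclass, field


@dataclass
class ScanCoordinator:
    """Hypothetical per-appliance record of which Slice name ranges are scanned."""

    all_ranges: frozenset
    scanned: set = field(default_factory=set)

    def record_listing_request(self, name_range):
        # A listing request from a peer implies that the peer is scanning
        # this range, so this appliance does not need to scan it again.
        self.scanned.add(name_range)

    def next_range_to_scan(self):
        remaining = sorted(self.all_ranges - self.scanned)
        if not remaining:
            # Every range has been covered: restart the scanning pass.
            self.scanned.clear()
            remaining = sorted(self.all_ranges)
        choice = remaining[0]
        self.scanned.add(choice)
        return choice


coordinator = ScanCoordinator(all_ranges=frozenset({"a", "b", "c"}))
coordinator.record_listing_request("a")    # a peer is already scanning "a"
first = coordinator.next_range_to_scan()   # picks an unscanned range
coordinator.record_listing_request("c")    # a peer covers "c"
second = coordinator.next_range_to_scan()  # nothing left, so the pass restarts
```

Because each appliance keeps its own record, the sketch needs no central index, which matches the constraint stated earlier in this section.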
The outcome of each check depends on which Slices are found:
- If all Slices are present, nothing needs to happen.
- If a Slice is missing, that Slice is reconstructed from the remaining Slices and uploaded to the Slicestor appliance from which it is missing.
- If Slices cannot be reconstructed (for example, because a client performed a delete operation during an outage), the existing Slices are deprecated from the system to ensure that no storage leaks occur.
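The three outcomes above can be sketched as a small decision function. This is an illustrative sketch under stated assumptions, not the system's actual logic: the `Action` enum, the `decide_action` function, and the `threshold` parameter (the minimum number of Slices assumed necessary to reconstruct the data) are hypothetical names introduced here.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"          # all Slices present
    REBUILD = "rebuild"    # reconstruct and upload the missing Slices
    CLEAN_UP = "clean_up"  # deprecate leftover Slices that cannot be rebuilt


def decide_action(present, expected, threshold):
    """Return (action, appliances the action applies to).

    present:   appliances that currently hold a Slice (hypothetical input)
    expected:  appliances that should hold a Slice
    threshold: assumed minimum number of Slices needed for reconstruction
    """
    held = present & expected
    missing = expected - present
    if not missing:
        return Action.NONE, set()
    if len(held) >= threshold:
        # Enough Slices remain: rebuild onto the appliances that lack one.
        return Action.REBUILD, missing
    # Too few Slices remain: remove the leftovers to avoid a storage leak.
    return Action.CLEAN_UP, held


expected = {"s1", "s2", "s3", "s4"}
all_present = decide_action({"s1", "s2", "s3", "s4"}, expected, threshold=3)
one_missing = decide_action({"s1", "s2", "s4"}, expected, threshold=3)
unrecoverable = decide_action({"s2"}, expected, threshold=3)
```

In the last call only one Slice remains, which is below the assumed threshold, so the sketch returns the appliance that holds the leftover Slice as the cleanup target.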