Deduplication statistics
Use the Ceph Object Gateway deduplication statistics feature to estimate potential storage savings. The feature only reports estimates; it does not perform deduplication or modify data.
The Ceph Object Gateway deduplication statistics utility helps administrators identify redundant data across Ceph Object Gateway buckets. It uses radosgw-admin subcommands to start and manage estimation tasks. The utility reports duplication estimates but does not modify data.
The utility scans bucket index metadata instead of object data. This design reduces I/O load and enables fast, scalable analysis.
Use this utility to assess whether deduplication can reduce storage consumption in your Ceph Object Gateway environment. It supports capacity planning, storage optimization, and analysis of object storage patterns.
How estimation works
The estimation process runs in parallel across multiple Ceph Object Gateway daemons. It reads each bucket index shard once, which allows the workload to scale efficiently. Because the process reads only the bucket index and not the objects themselves, it runs independently of storage media performance, whether SSD or HDD.
Skipped objects
The utility excludes these objects from estimation:
- Objects smaller than 4 MB, unless they are multipart objects.
- Objects stored with different placement rules, pools, or storage classes.
- Objects stored in data pools that use a replicated rule.
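The size rule in the list above can be illustrated with a short Python sketch. This is not the utility's implementation. The `ObjectInfo` type and the `skipped_for_size` function are hypothetical names introduced for illustration, and the sketch assumes that "4 MB" means 4 × 1024 × 1024 bytes.

```python
from dataclasses import dataclass

# Assumption: "4 MB" is interpreted as 4 MiB (4 * 1024 * 1024 bytes).
MIN_SIZE_BYTES = 4 * 1024 * 1024


@dataclass
class ObjectInfo:
    """Hypothetical, simplified view of one bucket index entry."""
    size_bytes: int
    is_multipart: bool


def skipped_for_size(obj: ObjectInfo) -> bool:
    """Return True if the object is excluded by the size rule alone.

    Objects under the threshold are excluded unless they are multipart.
    """
    return obj.size_bytes < MIN_SIZE_BYTES and not obj.is_multipart


if __name__ == "__main__":
    samples = [
        ObjectInfo(size_bytes=1024, is_multipart=False),
        ObjectInfo(size_bytes=1024, is_multipart=True),
        ObjectInfo(size_bytes=8 * 1024 * 1024, is_multipart=False),
    ]
    for sample in samples:
        print(sample, "skipped:", skipped_for_size(sample))
```

The sketch covers only the size rule. The placement, pool, and storage class rules depend on cluster configuration and are not modeled here.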
Memory usage
The utility uses a small, predictable amount of memory based on the object count. The following table lists the reference values.

| Ceph Object Gateway object count | Approximate memory usage |
|---|---|
| 1 million | 8 MB |
| 4 million | 16 MB |
| 16 million | 32 MB |
| 64 million | 64 MB |
| 256 million | 128 MB |
| 1 billion | 256 MB |
| 4 billion | 512 MB |
| 16 billion | 1024 MB (1 GB) |
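
In the reference values, memory usage doubles each time the object count quadruples, which is consistent with a square-root relationship. The following Python sketch uses that observed trend to approximate memory usage for object counts between the listed rows. The formula is inferred from the table and is not a documented sizing rule. The `estimate_memory_mb` function is a hypothetical helper, and the sketch assumes that the table treats 1 billion as 1024 million.

```python
import math

# Reference rows from the table: (object count in millions, memory in MB).
# Assumption: "1 billion" is read as 1024 million so the rows follow one trend.
REFERENCE_ROWS = [
    (1, 8),
    (4, 16),
    (16, 32),
    (64, 64),
    (256, 128),
    (1024, 256),
    (4096, 512),
    (16384, 1024),
]


def estimate_memory_mb(object_count_millions: float) -> float:
    """Approximate memory usage in MB from the trend in the reference rows.

    Inferred relationship: memory_mb = 8 * sqrt(object_count_millions).
    """
    if object_count_millions <= 0:
        raise ValueError("object count must be positive")
    return 8.0 * math.sqrt(object_count_millions)


if __name__ == "__main__":
    # Compare the inferred trend against every reference row.
    for millions, table_mb in REFERENCE_ROWS:
        estimate = estimate_memory_mb(millions)
        print(f"{millions} million objects: table {table_mb} MB, trend {estimate:.0f} MB")
```

Treat the result as a rough planning figure. Actual memory usage should be confirmed on the target cluster.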