Troubleshooting
Problem
Summary
This article discusses the steps required to quickly remove unwanted tombstones if they cause issues when using the DB. Possible solutions to rapid tombstone generation are also discussed.
Attention:
This operation carries some risk. If a node drops mutations before the cleanup finishes, and any of those mutations were tombstones, deleted data can be resurrected (a 'zombie data' situation). Setting gc_grace_seconds reasonably longer than the estimated time needed to process the table helps avoid this.
Applies to
- DataStax Enterprise (All Versions)
- Apache Cassandra (All Versions)
Symptoms
- Queries taking longer than normal to complete
- Frequent Garbage Collection
- Messages in the logs such as:
WARN [CoreThread-0] 2025-05-01 05:10:40,188 NoSpamLogger.java:98 - Scanned over 3671 tombstone rows for query SELECT * FROM acme.order_log WHERE customer_id = 'V2lsZSBFLiBDb3lvdGU=' AND timestamp = 041025 AND order_result = 'Success' LIMIT 5000 - more than the warning threshold 1000
WARN [CoreThread-2] 2025-05-01 05:11:51,386 NoSpamLogger.java:98 - Scanned over 1262 tombstone rows for query SELECT * FROM lpfcam.fms_order_line_route_attr WHERE customer_id = 751 AND timestamp = 040225 LIMIT 5000 - more than the warning threshold 1000
WARN [CoreThread-1] 2025-05-01 05:13:23,967 NoSpamLogger.java:98 - Scanned over 15762 tombstone rows for query SELECT * FROM acme.order_log WHERE customer_id = 'RGFmZnkgRHVjaw==' AND timestamp = 041025 LIMIT 5000 - more than the warning threshold 1000
WARN [CoreThread-2] 2025-05-01 05:13:24,882 NoSpamLogger.java:98 - Scanned over 15762 tombstone rows for query SELECT * FROM acme.order_log WHERE customer_id = 'RGFmZnkgRHVjaw==' AND timestamp = 041025 AND order_result = 'Success' LIMIT 5000 - more than the warning threshold 1000
WARN [CoreThread-0] 2025-05-01 05:13:34,242 NoSpamLogger.java:98 - Scanned over 3654 tombstone rows for query SELECT * FROM acme.order_log WHERE customer_id = 'V2lsZSBFLiBDb3lvdGU=' AND timestamp = 041025 AND order_result = 'Success' LIMIT 5000 - more than the warning threshold 1000
WARN [CoreThread-4] 2025-05-01 05:13:35,493 NoSpamLogger.java:98 - Scanned over 3654 tombstone rows for query SELECT * FROM acme.order_log WHERE customer_id = 'V2lsZSBFLiBDb3lvdGU=' AND timestamp = 041025 LIMIT 5000 - more than the warning threshold 1000
WARN [CoreThread-5] 2025-05-01 05:13:37,730 NoSpamLogger.java:98 - Scanned over 3652 tombstone rows for query SELECT * FROM acme.order_log WHERE customer_id = 'V2lsZSBFLiBDb3lvdGU=' AND timestamp = 041025 LIMIT 5000 - more than the warning threshold 1000
- Queries being aborted or timing out and the following message in the logs:
ERROR [CoreThread-2] 2025-05-01 00:00:12,222 NoSpamLogger.java:101 - Scanned over 100001 tombstone rows during query SELECT * FROM acme.order_log WHERE order_result = 'Order Lost' LIMIT 5000 - more than the maximum allowed 100000; query aborted
WARN [CoreThread-2] 2025-03-01 00:00:12,223 NoSpamLogger.java:98 - Scanned over 100001 tombstone rows during query 'SELECT * FROM acme.order_log WHERE order_result = 'Order Lost' LIMIT 5000' (last scanned row partition key was ('V2lsZSBFLiBDb3lvdGU='); query aborted
Cause
Tombstones are the result of a delete within a table. They are markers with a deletion timestamp that get added to a partition, or to a column within one. Until the timestamp is older than gc_grace_seconds, the tombstone remains in the database so that repairs can propagate the deletion to nodes that missed it.
For more information, please read this article: https://docs.datastax.com/en/dse/6.8/architecture/database-internals/ar…
Having too many tombstones in a table is not a problem unless they’re being actively read to satisfy a client query. Each one is read into the node’s heap, which can cause slow or frequent GCs to occur. It also causes queries to take longer as more data needs to be read.
Workaround
Ideally, one would wait for automatic compactions to take care of the tombstones; however, this is not always possible because gc_grace_seconds defaults to 864000 seconds (10 days), and the issue often needs to be resolved immediately.
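For reference, the two durations discussed in this article expressed in seconds (these are the stock defaults; verify against your own cassandra.yaml):

```shell
# Default gc_grace_seconds: 10 days expressed in seconds
echo $((10 * 24 * 60 * 60))   # 864000
# Default max_hint_window_in_ms is 3 hours; the same window in seconds
echo $((3 * 60 * 60))         # 10800
```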
To clear tombstones manually, there are two options: nodetool garbagecollect or nodetool compact. garbagecollect does not merge SSTables together and is less thorough than a compaction, but it has less impact on performance; it needs to be run twice to be effective. compact, on the other hand, does a better job of optimizing the SSTables and removes tombstones more thoroughly.
The value gc_grace_seconds should be set to will depend on a couple of factors. The aim is to strike a balance between avoiding data resurrection and ensuring that the bulk of the tombstones have been removed from disk:
- If the cluster is otherwise healthy, setting it slightly longer than max_hint_window_in_ms (3 hours by default) is enough.
- If a node fails to replay a hint during this time, and the hint is a tombstone, it will never be replayed because the tombstone has 'expired'. A primary range repair of the node during this hint window will fix the issue. If this cannot be done, the table needs to be rebuilt: truncate the table and run a primary range repair of it.
- https://docs.datastax.com/en/dse/6.8/architecture/database-architecture…
- If dropped mutations are guaranteed not to occur, it is possible to set it as low as 1 hour, allowing more tombstones to be dropped.
- Setting it to 0 will work, but it is not advised: it leaves no grace period at all, so any missed deletion can result in resurrected data.
It may be necessary to adjust compaction settings to control the rate at which the table is processed. Please review this article for indications:
https://docs.datastax.com/en/dse/6.8/managing/operations/configure-compaction.html
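As a sketch, the compaction subproperties most relevant to tombstone purging can be tuned per table. The values below are illustrative, not recommendations; keep your table's existing compaction class and adjust for your workload:

```sql
-- Illustrative values only (these happen to be the defaults for
-- tombstone_threshold and tombstone_compaction_interval).
ALTER TABLE acme.orders WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'tombstone_threshold': '0.2',              -- consider an SSTable for single-SSTable compaction at 20% droppable tombstones
  'tombstone_compaction_interval': '86400',  -- wait at least 1 day before re-checking the same SSTable
  'unchecked_tombstone_compaction': 'true'   -- skip the pre-check that estimates whether tombstones are droppable
};
```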
Steps:
The following steps use the Keyspace ‘acme’ and the Table ‘orders’ as placeholders.
- Disable OpsCenter repairs. If this poses a problem, exclude this table from subrange repair. If NodeSync is enabled on the table, ignore this step.
- Repair the table to make sure all nodes have all necessary tombstones.
- If Racks = RF, run a full repair of all nodes in a single rack, one at a time
- nodetool repair -full -- acme orders
- Otherwise, run a primary range repair of all nodes in the cluster, one node at a time
- nodetool repair -pr -- acme orders
- There are commands available to automate the process:
- This executes repairs in parallel across all DCs, one node after the other:
- nodetool repair -dcpar -pr -- acme orders
- In a single-DC cluster, the following will run a repair one node at a time:
- nodetool repair -seq -pr -- acme orders
- Record the original gc_grace_seconds value of the table
- cqlsh> describe table acme.orders;
- Alter the gc_grace_seconds value of the table to a value you have chosen
- cqlsh> ALTER TABLE acme.orders WITH gc_grace_seconds=10800;
- Alter the table to allow for tombstone checking in sstables that have recently been compacted:
- cqlsh> ALTER TABLE acme.orders WITH unchecked_tombstone_compaction=true;
- If this returns an error because the schema option does not exist, skip this step.
- Run nodetool garbagecollect -g CELL -- keyspace table twice, or nodetool compact -s -- keyspace table once, on every node:
- nodetool garbagecollect -g CELL -- acme orders
- or
- nodetool compact -s -- acme orders
- Unlike a repair, these commands can be run on multiple nodes in parallel. They generate I/O load, so be careful about how many are running in the cluster at once.
- Review your work
- Are there tombstone messages in the logs for the table that has just been cleaned?
- If running garbagecollect twice did not work, use a compaction instead
- If the above does not resolve the issue, check if the application is still generating tombstones and at what rate. If the client is generating tombstones faster than they can be cleared from the server, the application most likely has a problem.
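The cleanup step above can be sketched as a small driver script. The hostnames node1/node2 and the ssh-based invocation are assumptions for illustration; DRY_RUN=1 only prints the commands that would run:

```shell
# Hypothetical wrapper: run the cleanup on each node in turn.
# Set DRY_RUN=0 and adjust the host list/ssh invocation for a real cluster.
DRY_RUN=1
for node in node1 node2; do
  # garbagecollect must run twice to be effective
  for pass in 1 2; do
    cmd="nodetool garbagecollect -g CELL -- acme orders"
    if [ "$DRY_RUN" = "1" ]; then
      echo "$node: $cmd"
    else
      ssh "$node" "$cmd"
    fi
  done
done
```

Limit how many nodes the loop covers at once if the cluster is sensitive to the extra I/O.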
Solution
Tombstones are unavoidable. They are necessary to ensure that the nodes in the cluster keep their data in sync. What can be controlled is the rate at which they are generated and how much they impact queries. Here are a few common causes of rapid tombstone generation:
- Inserting nulls into columns
- Frequently adding and deleting data
- Deleting a large amount of data in 'one-offs'
The database team would need to talk to the development team to find the best way forward for managing how the application creates tombstones.
One extreme solution would be to set the table to a lower gc_grace_seconds than the default. This means the table would need to be repaired more frequently, which can place a heavier burden on the nodes' resources for short durations. OpsCenter repairs take the shortest gc_grace_seconds into account when calculating the cluster's repair speed. This can cause repair errors for tables with smaller values, so it is best to handle repairs for these tables outside of OpsCenter. NodeSync, Reaper, or carefully controlled cron jobs can do the trick here.
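As an example of the cron-job approach, a crontab entry like the following (illustrative schedule and paths; adjust for your environment) runs a weekly primary range repair of just this table, outside of OpsCenter:

```
# Illustrative crontab line: primary range repair of acme.orders every Sunday at 02:00.
# Stagger the schedule per node so repairs do not overlap.
0 2 * * 0  /usr/bin/nodetool repair -pr -- acme orders >> /var/log/cassandra/repair-orders.log 2>&1
```

The repair interval must be shorter than the table's gc_grace_seconds for this to be safe.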
Document Location
Worldwide
Historical Number
ka0Ui0000003W09IAE
Document Information
Modified date:
30 January 2026
UID
ibm17258484