Troubleshooting
Problem
Summary
This article discusses the steps required to quickly remove unwanted tombstones if they cause issues when using the DB. Possible solutions to rapid tombstone generation are also discussed.
Attention:
This operation carries some risk. If a node drops mutations before the cleanup finishes, and any of those mutations were tombstones, deleted data can be resurrected (a 'zombie data' situation). Setting gc_grace_seconds reasonably longer than the estimated time needed to process the table helps avoid this.
Applies to
- DataStax Enterprise (All Versions)
- Apache Cassandra (All Versions)
Symptoms
- Queries taking longer than normal to complete
- Frequent Garbage Collection
- Messages in the logs such as:
WARN [CoreThread-0] 2025-05-01 05:10:40,188 NoSpamLogger.java:98 - Scanned over 3671 tombstone rows for query SELECT * FROM acme.order_log WHERE customer_id = 'V2lsZSBFLiBDb3lvdGU=' AND timestamp = 041025 AND order_result = 'Success' LIMIT 5000 - more than the warning threshold 1000
WARN [CoreThread-2] 2025-05-01 05:11:51,386 NoSpamLogger.java:98 - Scanned over 1262 tombstone rows for query SELECT * FROM lpfcam.fms_order_line_route_attr WHERE customer_id = 751 AND timestamp = 040225 LIMIT 5000 - more than the warning threshold 1000
WARN [CoreThread-1] 2025-05-01 05:13:23,967 NoSpamLogger.java:98 - Scanned over 15762 tombstone rows for query SELECT * FROM acme.order_log WHERE customer_id = 'RGFmZnkgRHVjaw==' AND timestamp = 041025 LIMIT 5000 - more than the warning threshold 1000
WARN [CoreThread-2] 2025-05-01 05:13:24,882 NoSpamLogger.java:98 - Scanned over 15762 tombstone rows for query SELECT * FROM acme.order_log WHERE customer_id = 'RGFmZnkgRHVjaw==' AND timestamp = 041025 AND order_result = 'Success' LIMIT 5000 - more than the warning threshold 1000
WARN [CoreThread-0] 2025-05-01 05:13:34,242 NoSpamLogger.java:98 - Scanned over 3654 tombstone rows for query SELECT * FROM acme.order_log WHERE customer_id = 'V2lsZSBFLiBDb3lvdGU=' AND timestamp = 041025 AND order_result = 'Success' LIMIT 5000 - more than the warning threshold 1000
WARN [CoreThread-4] 2025-05-01 05:13:35,493 NoSpamLogger.java:98 - Scanned over 3654 tombstone rows for query SELECT * FROM acme.order_log WHERE customer_id = 'V2lsZSBFLiBDb3lvdGU=' AND timestamp = 041025 LIMIT 5000 - more than the warning threshold 1000
WARN [CoreThread-5] 2025-05-01 05:13:37,730 NoSpamLogger.java:98 - Scanned over 3652 tombstone rows for query SELECT * FROM acme.order_log WHERE customer_id = 'V2lsZSBFLiBDb3lvdGU=' AND timestamp = 041025 LIMIT 5000 - more than the warning threshold 1000
- Queries being aborted or timing out and the following message in the logs:
ERROR [CoreThread-2] 2025-05-01 00:00:12,222 NoSpamLogger.java:101 - Scanned over 100001 tombstone rows during query SELECT * FROM acme.order_log WHERE order_result = 'Order Lost' LIMIT 5000 - more than the maximum allowed 100000; query aborted
WARN [CoreThread-2] 2025-03-01 00:00:12,223 NoSpamLogger.java:98 - Scanned over 100001 tombstone rows during query 'SELECT * FROM acme.order_log WHERE order_result = 'Order Lost' LIMIT 5000' (last scanned row partition key was ('V2lsZSBFLiBDb3lvdGU='); query aborted
Cause
Tombstones are the result of a delete within a table. They are markers with a deletion timestamp that get added to a partition, or to a column within one. Until the timestamp is older than gc_grace_seconds, the tombstone remains in the database so that repairs can propagate the deletion to nodes that missed it.
For more information, please read this article: https://docs.datastax.com/en/dse/6.8/architecture/database-internals/ar…
Having too many tombstones in a table is not a problem unless they’re being actively read to satisfy a client query. Each one is read into the node’s heap, which can cause slow or frequent GCs to occur. It also causes queries to take longer as more data needs to be read.
Workaround
Ideally, one would wait for automatic compactions to take care of the tombstones; however, this is not always possible because gc_grace_seconds defaults to 864000 seconds (10 days), and the issue often needs to be resolved immediately.
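For reference, the two durations discussed in this article expressed in seconds (these are the stock defaults; verify against your own cassandra.yaml):

```shell
# Default gc_grace_seconds: 10 days expressed in seconds
echo $((10 * 24 * 60 * 60))   # 864000
# Default max_hint_window_in_ms is 3 hours; the same window in seconds
echo $((3 * 60 * 60))         # 10800
```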
To clear tombstones manually, there are two options: nodetool garbagecollect or nodetool compact. garbagecollect does not merge SSTables together and is less thorough than a compaction, but it has less impact on performance; it needs to be run twice to be effective. compact, on the other hand, does a better job of optimizing the SSTables and removes tombstones more thoroughly.
The value gc_grace_seconds should be set to will depend on a couple of factors. The aim is to strike a balance between avoiding data resurrection and ensuring that the bulk of the tombstones have been removed from disk:
- If the cluster is otherwise healthy, setting it slightly longer than max_hint_window_in_ms (3 hours by default) is enough.
- If a node fails to replay a hint during this time, and the hint is a tombstone, it will never be replayed because the tombstone has 'expired'. A primary range repair of the node during this hint window will fix the issue. If this cannot be done, the table needs to be rebuilt: truncate the table and run a primary range repair of it.
- https://docs.datastax.com/en/dse/6.8/architecture/database-architecture…
- If dropped mutations are guaranteed not to occur, it is possible to set it as low as 1 hour, allowing more tombstones to be dropped.
- Setting it to 0 will work, but it is not advised: it leaves no grace period at all, so any missed deletion can result in resurrected data.
It may be necessary to adjust compaction settings to control the rate at which the table is processed. Please review this article for indications:
https://docs.datastax.com/en/dse/6.8/managing/operations/configure-compaction.html
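As a sketch, the compaction subproperties most relevant to tombstone purging can be tuned per table. The values below are illustrative, not recommendations; keep your table's existing compaction class and adjust for your workload:

```sql
-- Illustrative values only (these happen to be the defaults for
-- tombstone_threshold and tombstone_compaction_interval).
ALTER TABLE acme.orders WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'tombstone_threshold': '0.2',              -- consider an SSTable for single-SSTable compaction at 20% droppable tombstones
  'tombstone_compaction_interval': '86400',  -- wait at least 1 day before re-checking the same SSTable
  'unchecked_tombstone_compaction': 'true'   -- skip the pre-check that estimates whether tombstones are droppable
};
```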
Steps:
The following steps use the Keyspace ‘acme’ and the Table ‘orders’ as placeholders.
- Disable OpsCenter repairs. If this poses a problem, exclude this table from subrange repair. If NodeSync is enabled on the table, ignore this step.
- Repair the table to make sure all nodes have all necessary tombstones.
- If Racks = RF, run a full repair of all nodes in a single rack, one at a time
- nodetool repair -full -- acme orders
- Otherwise, run a primary range repair of all nodes in the cluster, one node at a time
- nodetool repair -pr -- acme orders
- There are commands available to automate the process:
- This executes repairs in parallel across all DCs, one node after the other:
- nodetool repair -dcpar -pr -- acme orders
- In a single-DC cluster, the following will run a repair one node at a time:
- nodetool repair -seq -pr -- acme orders
- Record the original gc_grace_seconds value of the table
- cqlsh> describe table acme.orders;
- Alter the gc_grace_seconds value of the table to a value you have chosen
- cqlsh> ALTER TABLE acme.orders WITH gc_grace_seconds=10800;
- Alter the table to allow for tombstone checking in sstables that have recently been compacted:
- cqlsh> ALTER TABLE acme.orders WITH unchecked_tombstone_compaction=true;
- If this returns an error because the schema option does not exist, skip this step.
- Run nodetool garbagecollect -g CELL -- keyspace table twice, or nodetool compact -s -- keyspace table once, on every node:
- nodetool garbagecollect -g CELL -- acme orders
- or
- nodetool compact -s -- acme orders
- Unlike a repair, these commands can be run on multiple nodes in parallel. They generate I/O load, so be careful about how many are running in the cluster at once.
- Review your work
- Are there tombstone messages in the logs for the table that has just been cleaned?
- If running garbagecollect twice did not work, use a compaction instead
- If the above does not resolve the issue, check if the application is still generating tombstones and at what rate. If the client is generating tombstones faster than they can be cleared from the server, the application most likely has a problem.
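The cleanup step above can be sketched as a small driver script. The hostnames node1/node2 and the ssh-based invocation are assumptions for illustration; DRY_RUN=1 only prints the commands that would run:

```shell
# Hypothetical wrapper: run the cleanup on each node in turn.
# Set DRY_RUN=0 and adjust the host list/ssh invocation for a real cluster.
DRY_RUN=1
for node in node1 node2; do
  # garbagecollect must run twice to be effective
  for pass in 1 2; do
    cmd="nodetool garbagecollect -g CELL -- acme orders"
    if [ "$DRY_RUN" = "1" ]; then
      echo "$node: $cmd"
    else
      ssh "$node" "$cmd"
    fi
  done
done
```

Limit how many nodes the loop covers at once if the cluster is sensitive to the extra I/O.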
Solution
Tombstones are unavoidable. They are necessary to ensure that the nodes in the cluster keep their data in sync. What can be controlled is the rate at which they are generated and how much they impact queries. Here are a few common causes of rapid tombstone generation:
- Inserting nulls into columns
- Frequently adding and deleting data
- Deleting a large amount of data in 'one-offs'
The database team would need to talk to the development team to find the best way forward for managing how the application creates tombstones.
One extreme solution would be to set the table to a lower gc_grace_seconds than the default. This means the table would need to be repaired more frequently, which can place a heavier burden on the nodes' resources for short durations. OpsCenter repairs take the shortest gc_grace_seconds into account when calculating the cluster's repair speed. This can cause repair errors for tables with smaller values, so it is best to handle repairs for these tables outside of OpsCenter. NodeSync, Reaper, or carefully controlled cron jobs can do the trick here.
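As an example of the cron-job approach, a crontab entry like the following (illustrative schedule and paths; adjust for your environment) runs a weekly primary range repair of just this table, outside of OpsCenter:

```
# Illustrative crontab line: primary range repair of acme.orders every Sunday at 02:00.
# Stagger the schedule per node so repairs do not overlap.
0 2 * * 0  /usr/bin/nodetool repair -pr -- acme orders >> /var/log/cassandra/repair-orders.log 2>&1
```

The repair interval must be shorter than the table's gc_grace_seconds for this to be safe.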
Document Location
Worldwide
Historical Number
ka0Ui0000003W09IAE
Document Information
Modified date:
30 January 2026
UID
ibm17258484