Troubleshooting
Problem
*Article should supplement https://support.datastax.com/s/article/Validation-failed-when-running-a-nodetool-repair
https://blog.pythian.com/so-you-have-a-broken-cassandra-sstable-file/
Overview
This technical note addresses failing repairs and the CorruptSSTableException errors found in system.log.
Symptom
When a background repair occurs or the nodetool repair command is run, the following error is encountered, indicating that the repair has failed:
|
|
Another error that indicates the same failed repair is:
|
|
Analysis
|
|
There are cases where CorruptSSTableException is not explicitly mentioned in system.log. For example:
|
|
The initial error message stating that a Sync failure occurred can be misleading, but the next few error lines often give more clues. In the last snippet, Failed creating a merkle tree for [repair #25682740-9c11-11e8-8e8f-fbc0ff4d2cb8 on keyspace1/standard1, ... suggests that an SSTable belonging to keyspace1/standard1 is broken, which is likely to be the main cause.
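As a rough illustration of the log triage described above, the following shell sketch pulls the relevant lines out of system.log and extracts the keyspace and table named in a failed Merkle tree message. The function names and the default log path /var/log/cassandra/system.log are assumptions made for illustration. The sed pattern relies only on the message format quoted above and may need adjusting for other versions.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for triaging repair failures in system.log.
# Assumption: the log is at /var/log/cassandra/system.log unless a path is given.

# Print every line (with its line number) that mentions a corrupt SSTable
# or a failed Merkle tree build.
scan_repair_errors() {
  local log_file="${1:-/var/log/cassandra/system.log}"
  grep -n -E 'CorruptSSTableException|Failed creating a merkle tree' "$log_file"
}

# Print "keyspace table" for each distinct table named in a line of the form
# "Failed creating a merkle tree for [repair #<id> on <keyspace>/<table>, ...".
failed_validation_tables() {
  local log_file="${1:-/var/log/cassandra/system.log}"
  sed -n 's|.*Failed creating a merkle tree for \[repair #[^ ]* on \([^/ ]*\)/\([^, ]*\),.*|\1 \2|p' \
    "$log_file" | sort -u
}
```

For the message quoted above, failed_validation_tables prints keyspace1 standard1, which gives the keyspace and table arguments needed by the scrub and repair commands later in this article.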
What is a corrupt SSTable?
SSTables are usually flagged as corrupt because they fail an internal consistency check, such as a column length that is too long, and/or checksum validation. A corrupted SSTable file does not mean that data is lost or that the cluster is unusable, as long as the Replication Factor (RF) is set to the recommended three or more.
How does a corrupt SSTable affect performance?
Corrupt SSTables have relatively little effect on normal reads against the table, except for the request where the failure took place. However, they have a serious effect on compaction and repair, and may prevent these processes from completing.
Repair failure can result in long-term consistency issues between nodes and eventually the application returning incorrect results. Compaction failure may cause the number of SSTables to grow uncontrollably. In the short term, read performance will be adversely affected and in the long term, storage space problems will surface.
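One simple way to watch for the SSTable growth described above is to count the data files in a table directory over time. The sketch below is an illustration only: the function name is hypothetical, and it assumes that each SSTable has exactly one component file whose name ends in -Data.db.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the SSTables in one table directory by counting
# its *-Data.db files. Assumption: one Data.db component per SSTable.
count_sstables() {
  local table_dir="$1"
  # -maxdepth 1 keeps the snapshots/ and backups/ subdirectories out of the count.
  find "$table_dir" -maxdepth 1 -type f -name '*-Data.db' | wc -l | tr -d ' '
}
```

A count that keeps rising between checks on a table with a known corrupt SSTable is a hint that compaction is no longer completing.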
Solution
Once the problem is identified as a corrupt SSTable, there are four ways to fix it, each with different results and risks:
- nodetool scrub - online scrub. This option can be performed while the node is up. It has a relatively low chance of success, and the repair issue may recur.
- rm -f - offline operation. Remove the SSTable while the node is offline, then run nodetool repair immediately after the node is brought up. This is the fastest and easiest option, with some consistency risk while the node comes back up.
- Bootstrap the node - Similar to option 2, with less theoretical impact on consistency.
- sstablescrub - offline scrub operation. The node has to be brought down before running sstablescrub. It has a much higher chance of success than the online scrub, but can require a long time (>48 hours) to complete for a decently sized SSTable (20 MB and above). Offline scrub should be the last resort.
Option 1 - nodetool scrub
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/tools/nodetool/toolsScrub.html
Perform the following steps to complete an online scrub. Easy to perform, but with a relatively low chance of success:
- Find out which SSTable is broken.
- Run nodetool scrub keyspace tablename.
- Run nodetool repair -full.
- Run nodetool listsnapshots.
- Run nodetool clearsnapshot -t <snapshot name> <keyspace name>.
The corrupt SSTable is usually named in the error message found in system.log. Follow the steps in Analysis to identify which SSTable is broken.
|
|
nodetool scrub takes a snapshot and rebuilds the files. Some, and possibly all, rows from the corrupt SSTable will be dropped; nodetool repair restores them from the other replicas. Afterwards there will be fewer SSTables, and file names might have changed.
Perform a repair to recover the data:
|
|
The output should look similar to the following:
|
|
The scrub creates a snapshot called pre-scrub-<timestamp>. After a successful repair, this snapshot can be removed to reduce disk space usage. Start by listing snapshots:
|
|
One of the snapshots will be:
|
|
Clear the snapshot:
|
|
If repair still fails to complete and the validation error persists, one of the other three methods should be attempted.
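The online scrub sequence above can be sketched as two small shell functions. This is a minimal sketch, not a definitive procedure: the function names and the NODETOOL override variable are hypothetical, and the snapshot name has to be read from the listsnapshots output by hand.

```shell
#!/usr/bin/env bash
# Sketch of the online scrub sequence. NODETOOL is a hypothetical override
# (defaults to "nodetool") that makes the functions easy to dry-run.
NODETOOL="${NODETOOL:-nodetool}"

# Scrub one table while the node is up, then run a full repair on it.
# The repair only starts if the scrub exits successfully.
scrub_and_repair() {
  local keyspace="$1" table="$2"
  "$NODETOOL" scrub "$keyspace" "$table" &&
    "$NODETOOL" repair -full "$keyspace" "$table"
}

# List snapshots so the pre-scrub-<timestamp> name can be confirmed, then
# remove the named snapshot from the keyspace.
clear_scrub_snapshot() {
  local keyspace="$1" snapshot="$2"
  "$NODETOOL" listsnapshots &&
    "$NODETOOL" clearsnapshot -t "$snapshot" -- "$keyspace"
}
```

Setting NODETOOL=echo before calling either function prints the commands instead of running them, which is a convenient way to check the arguments first.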
Option 2 - Delete the files and run nodetool repair
Removing just one corrupted SSTable might not allow the down node to fully restart. If there are multiple corrupted SSTables, the node will fail to start at the first corrupt SSTable it encounters. Hence, the log will show a corrupt SSTable exception only for that one SSTable and not for the others. All corrupt SSTables must be removed before a node can be fully started.
Users should weigh the time spent manually deleting corrupted files one by one as they appear against the time needed to bootstrap the entire node. When there are too many corrupted SSTables, it is highly recommended to perform Option 3 - Bootstrapping instead.
Warning: When the Consistency Level (CL) is set to ONE, there is an increased risk of inconsistent reads. However, it is uncommon and not recommended for CL to be set to ONE.
Steps:
- Bring the node down with nodetool drain.
- Navigate to the directory of the corrupted keyspace and table. This is usually under /var/lib/cassandra/data/.
- Delete the specific corrupt SSTable. If the specific files cannot be identified, delete all files in the directory with sudo rm -f *.
- Restart the node.
- Run nodetool repair.
Bring down the node:
|
|
Navigate to the corrupted keyspace and SSTable directory, for example:
|
|
Delete the specific SSTable if it can be identified. In this example, all files are deleted:
|
|
The output will be:
|
|
When checking the remaining files you will see backups and snapshots:
|
|
Restart the node:
|
|
After the node is fully started, run nodetool repair on the affected table immediately:
|
|
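When only one SSTable is to be deleted, all of its component files need to go together. The following sketch lists them for review. It is an illustration under stated assumptions: the function name is hypothetical, and it assumes that the components of one SSTable share the file-name prefix that precedes -Data.db (for example mc-5-big-Data.db and mc-5-big-Index.db, names chosen here purely as an example).

```shell
#!/usr/bin/env bash
# Hypothetical helper: given the path to one SSTable's Data.db file, print every
# component file of that SSTable so they can be reviewed and removed together.
# Assumption: components share the file-name prefix before "-Data.db".
sstable_components() {
  local data_file="$1"
  local prefix="${data_file%-Data.db}"
  if [ "$prefix" = "$data_file" ]; then
    echo "expected a path ending in -Data.db: $data_file" >&2
    return 1
  fi
  local f
  for f in "$prefix"-*; do
    # Skip the unexpanded glob when nothing matches.
    [ -e "$f" ] && printf '%s\n' "$f"
  done
  return 0
}
```

The function only prints file names and deletes nothing, so the list can be checked against the error in system.log before any rm -f is run with the node stopped.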
Option 3 - Bootstrap the node
Bootstrap should be performed when too many corrupted SSTables are present on the node and the file deletion process proves too cumbersome. While bootstrapping, the node with the missing data will not serve reads until all data is restored.
Warning: Bootstrap can operate in parallel, but depending on the amount of data that has to be recovered, this method can take longer than other options.
Steps:
- Bring down the node with $ nodetool drain.
- Remove all files under $CASSANDRA_HOME. This is usually /var/lib/cassandra.
- Modify Apache Cassandra environment in /etc/cassandra/conf/cassandra-env.sh.
- Restart Apache Cassandra. A server that starts with no files connects to one of its seeds and has the other nodes stream data to it to replace the lost data.
- Modify Apache Cassandra environment in /etc/cassandra/conf/cassandra-env.sh file to undo the change introduced in Step 3.
Bring down the node following the steps in Option 2:
|
|
Modify Apache Cassandra environment:
|
|
Add this line at the end of the file:
|
|
192.168.1.88 is the address of the node on which the Apache Cassandra service runs. Upon restarting, the server connects to one of the seeds, recreates the schema, and requests the other nodes to stream the data needed to replace what was lost.
New token ranges will not be selected unless the service is restarted with a different IP address than before. Hence, the edit in the environment file specifies the old address of the node where the repair failure happened, 192.168.1.88.
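Steps 3 and 5 (adding the line and later undoing the change) can be sketched as a pair of shell functions. This is a sketch under a loudly stated assumption: the exact line used in this article is not reproduced here, and the code assumes it sets the cassandra.replace_address JVM system property to the node's old address. The function names are hypothetical. Confirm the property against the documentation for the installed version before using it.

```shell
#!/usr/bin/env bash
# Sketch only. Assumption: the line added to cassandra-env.sh sets the
# cassandra.replace_address JVM system property to the node's old IP address.

# Step 3: append the replace-address option to the environment file.
add_replace_address() {
  local env_file="$1" old_ip="$2"
  printf 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=%s"\n' "$old_ip" >> "$env_file"
}

# Step 5: remove that line again once the node has finished bootstrapping.
remove_replace_address() {
  local env_file="$1" tmp rc
  tmp="$(mktemp)" || return 1
  # Write the filtered copy back with cat so the file keeps its owner and mode.
  sed '/-Dcassandra\.replace_address=/d' "$env_file" > "$tmp" && cat "$tmp" > "$env_file"
  rc=$?
  rm -f "$tmp"
  return "$rc"
}
```

For the example address in this section, the call would be add_replace_address /etc/cassandra/conf/cassandra-env.sh 192.168.1.88, followed by remove_replace_address /etc/cassandra/conf/cassandra-env.sh after the node has rejoined the cluster.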
Restart Apache Cassandra on the node:
|
|
Wait for the node to join the cluster. Bootstrapping has started when the following message appears:
|
|
The output message will eventually be:
|
|
Option 4 - sstablescrub
Offline sstablescrub has a better success rate than its online counterpart, nodetool scrub. Attempt it only when the other methods fail or are impossible to perform; this applies especially when the RF is one. sstablescrub requires many hours, and most of the time days, to complete for even a medium-sized SSTable (20 MB and above).
Warning: Never run the offline scrub, sstablescrub, against an entire node. Scrubbing an entire node offline will take days because this method has to scrub and rebuild ALL tables on the node. sstablescrub should always be targeted at a specific keyspace and table.
Offline sstablescrub is a last resort solution.
Steps:
- Bring the node down with $ nodetool drain.
- Run $ sstablescrub
- Restart the node
- Run $ nodetool repair on the table
- Run $ nodetool clearsnapshot to remove pre-scrub snapshot
Run the following commands to bring the node down:
|
|
Even with the -n option, sstablescrub can take days to scrub a 1 GiB SSTable. It is not realistic for most situations unless the SSTable is very small.
sstablescrub can be run as follows:
|
|
Once sstablescrub completes, restart the node:
|
|
Run the repair command:
|
|
Delete the pre-scrub snapshot:
|
|
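The warning above about always targeting a specific keyspace and table, and the requirement that the node be down, can be enforced with a small wrapper. This is a minimal sketch: the function name and the NODETOOL and SSTABLESCRUB override variables are hypothetical, and it assumes that nodetool info succeeds only while the Cassandra process is running.

```shell
#!/usr/bin/env bash
# Sketch of a guarded sstablescrub call. NODETOOL and SSTABLESCRUB are
# hypothetical overrides (defaulting to the real tools) used for dry runs.
NODETOOL="${NODETOOL:-nodetool}"
SSTABLESCRUB="${SSTABLESCRUB:-sstablescrub}"

offline_scrub() {
  # Require exactly one keyspace and one table so the scrub stays targeted.
  if [ "$#" -ne 2 ] || [ -z "$1" ] || [ -z "$2" ]; then
    echo "usage: offline_scrub <keyspace> <table>" >&2
    return 2
  fi
  # Assumption: "nodetool info" succeeds only while the Cassandra process is up.
  if "$NODETOOL" info >/dev/null 2>&1; then
    echo "node still appears to be running; stop it before scrubbing" >&2
    return 1
  fi
  "$SSTABLESCRUB" "$1" "$2"
}
```

The check matters because nodetool drain alone flushes the node and stops it accepting connections but does not stop the process.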
Flowchart
When presented with a corrupt SSTable, always attempt the online nodetool scrub first, because it is the easiest and safest solution. Attempt the offline sstablescrub last, because the process is extremely slow. Depending on the use case, follow the flowchart below when trying to fix SSTable corruption:
False Corrupt SSTable Exception
Sometimes a CorruptSSTableException occurs because AIO memory issues prevent the SSTable from being deserialized and read properly. In this case the SSTable is not corrupted; it simply was not read successfully. The stack trace will contain Caused by: messages similar to the following:
...
Caused by: java.io.IOException: Error building row with data deserialized from RandomAccessReader: {rebufferer=Prefetching rebufferer: (8/4) buffers read-ahead, 4096 buffer size buffer=java.nio.DirectByteBuffer[pos=4096 lim=4096 cap=4096] bufferHolder=org.apache.cassandra.io.util.WrappingRebufferer$WrappingBufferHolder@2a6640b9}
at org.apache.cassandra.db.rows.UnfilteredSerializer.deserializeRowBody(UnfilteredSerializer.java:641)
at org.apache.cassandra.db.UnfilteredDeserializer.readNext(UnfilteredDeserializer.java:168)
at org.apache.cassandra.io.sstable.format.AbstractReader.readUnfiltered(AbstractReader.java:257)
at org.apache.cassandra.io.sstable.format.trieindex.ForwardReader.nextInSlice(ForwardReader.java:55)
at org.apache.cassandra.io.sstable.format.AbstractReader.next(AbstractReader.java:138)
at org.apache.cassandra.io.sstable.format.AsyncPartitionReader$PartitionSubscription.performRead(AsyncPartitionReader.java:537)
at org.apache.cassandra.io.sstable.format.AsyncPartitionReader.readWithRetry(AsyncPartitionReader.java:251)
... 31 common frames omitted
Caused by: org.apache.cassandra.io.sstable.BufferPoolException: Failed to allocate address nr. 0 of size 4096: buffer pool is probably exhausted, consider setting file_cache_size_in_mb and inflight_data_overhead_in_mb in the yaml
at org.apache.cassandra.utils.memory.buffers.PermanentBufferPool.allocate(PermanentBufferPool.java:144)
at org.apache.cassandra.cache.ChunkCacheImpl.newChunk(ChunkCacheImpl.java:607)
at org.apache.cassandra.cache.ChunkCacheImpl.asyncLoad(ChunkCacheImpl.java:652)
at org.apache.cassandra.cache.ChunkCacheImpl.asyncLoad(ChunkCacheImpl.java:67)
at com.github.benmanes.caffeine.cache.LocalAsyncLoadingCache.lambda$get$2(LocalAsyncLoadingCache.java:129)
at com.github.benmanes.caffeine.cache.LocalCache.lambda$statsAware
In prior DSE 6.x (such as 6.7) releases, DataStax recommended disabling AIO and setting file_cache_size_in_mb to 512 for search workloads, to improve indexing and query performance.
See the following link for details:
https://docs.datastax.com/en/dse/6.8/docs/search/tune-index.html#EnablingAsynchronousI/O(AIO)
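Before treating an SSTable as corrupt, it can help to check the log for the buffer pool messages shown in the stack trace above. The following sketch does that. The function name and the default log path /var/log/cassandra/system.log are assumptions for illustration; the search strings are taken from the stack trace in this section.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed if the log contains the buffer-pool messages
# that point to a failed read rather than real corruption.
# Assumption: the log is at /var/log/cassandra/system.log unless a path is given.
buffer_pool_exhausted() {
  local log_file="${1:-/var/log/cassandra/system.log}"
  grep -q -E 'BufferPoolException|buffer pool is probably exhausted' "$log_file"
}
```

If the function succeeds, review the file_cache_size_in_mb and inflight_data_overhead_in_mb settings named in the exception message before scrubbing or deleting any SSTable.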
Document Location
Worldwide
Historical Number
ka06R000000Hc3SQAS
Document Information
Modified date:
30 January 2026
UID
ibm17258927