Troubleshooting
Problem
Error inserting into system_distributed.parent_repair_history
Sample error:
ERROR [Repair-Task:1] 2019-06-21 06:40:44,895 SystemDistributedKeyspace.java:406 - Error executing query INSERT INTO system_distributed.parent_repair_history (parent_id, keyspace_name, columnfamily_names, requested_ranges, started_at, options) VALUES (11111111-0000-0000-0000-888888888888, 'system_auth', { 'roles','role_permissions','role_members' }, { '(1607483561684771030,1656713833712314075]' }, toTimestamp(now()), { 'trace': 'false','forceRepair': 'false','hosts': '','parallelism': 'parallel','dataCenters': '','previewKind': 'NONE','incremental': 'false','pullRepair': 'false','primaryRange': 'false','jobThreads': '1' })
This error typically produces a stack trace similar to the following (note that the stack trace will vary based on the DSE version):
org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received only 0 responses.
    at org.apache.cassandra.service.AbstractWriteHandler$1.lambda$subscribeActual$0(AbstractWriteHandler.java:158)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
    at org.apache.cassandra.service.AbstractWriteHandler$TimeoutAction.accept(AbstractWriteHandler.java:221)
    at org.apache.cassandra.service.AbstractWriteHandler$TimeoutAction.accept(AbstractWriteHandler.java:216)
    at org.apache.cassandra.concurrent.TPCTimeoutTask.run(TPCTimeoutTask.java:43)
    at org.apache.cassandra.concurrent.TPCHashedWheelTimer.lambda$onTimeout$0(TPCHashedWheelTimer.java:43)
    at org.apache.cassandra.utils.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:498)
    at org.apache.cassandra.utils.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:573)
    at org.apache.cassandra.utils.HashedWheelTimer$Worker.run(HashedWheelTimer.java:329)
    at org.apache.cassandra.concurrent.TPCRunnable.run(TPCRunnable.java:68)
    at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.process(EpollTPCEventLoopGroup.java:920)
    at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processTasks(EpollTPCEventLoopGroup.java:892)
    at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.runScheduledTasks(EpollTPCEventLoopGroup.java:980)
    at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processEvents(EpollTPCEventLoopGroup.java:774)
    at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.run(EpollTPCEventLoopGroup.java:441)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
What does this error message mean?
This error is generated by running repair tasks. Repair tasks keep track of the repair session status in the system_distributed.parent_repair_history and system_distributed.repair_history tables, and write the repair session information to these 2 tables with a consistency level of ONE (CL=ONE).
The error indicates that an insert or update query issued by the repair task against these 2 tables failed because the consistency level (CL) could not be met. In the sample above, "received only 0 responses" means that no replica acknowledged the write before the timeout.
Why does this error occur?
The error typically occurs for one of the following reasons:
- Overloaded nodes
- Communication issues among the nodes due to a network issue
When nodes become unresponsive due to load or communication issues, the update or insert queries against these 2 tables will fail as the consistency level (CL) cannot be met.
How do you fix this error?
This error generally indicates that nodes in the cluster are not responsive. Users may also observe slow or failing queries at the same time.
Overloaded nodes
Examine the system.log for signs that the nodes in the cluster were overloaded around the time the error started to occur. Such signs include dropped messages and long GC pauses.
For example:
INFO [ScheduledTasks:1] 2020-05-23 14:09:20,509 MessagingService.java:1273 - READ messages were dropped in last 5000 ms: 2300 internal and 136 cross node. Mean internal dropped latency: 5430 ms and Mean cross-node dropped latency: 5960 ms
WARN [Service Thread] 2020-05-23 14:09:15,508 GCInspector.java:282 - G1 Young Generation GC in 5170ms. G1 Eden Space: 18035507200 -> 0; G1 Old Gen: 12280584520 -> 26468408336; G1 Survivor Space: 1132462080 -> 662700032;
If the nodes in the cluster are overloaded, it is necessary to throttle the workload, check the access patterns (e.g. if running expensive queries) or add resources/nodes to better suit the cluster's needs.
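The log check described above can be partly automated. The following is a minimal Python sketch, not a supported tool: the function name find_overload_signs and the 1000 ms GC threshold are assumptions chosen for illustration, and the patterns are modelled only on the two sample log lines shown above, so other DSE versions may word these messages differently.

```python
import re

# Patterns modelled on the two sample log lines in this article (assumption:
# other DSE versions may format these messages differently).
DROPPED_RE = re.compile(
    r"(?P<verb>\w+) messages were dropped in last (?P<window>\d+) ms: "
    r"(?P<internal>\d+) internal and (?P<cross>\d+) cross node"
)
GC_RE = re.compile(r"GCInspector\.java:\d+ - (?P<collector>.+?) GC in (?P<ms>\d+)ms")


def find_overload_signs(lines, gc_threshold_ms=1000):
    """Return (kind, detail) tuples for dropped-message and long-GC log lines.

    gc_threshold_ms is an illustrative cut-off, not a documented default.
    """
    findings = []
    for line in lines:
        m = DROPPED_RE.search(line)
        if m:
            total = int(m["internal"]) + int(m["cross"])
            findings.append(("dropped", f"{m['verb']}: {total} messages"))
            continue
        m = GC_RE.search(line)
        if m and int(m["ms"]) >= gc_threshold_ms:
            findings.append(("gc", f"{m['collector']}: {m['ms']} ms"))
    return findings
```

To use it, pass the lines of a node's system.log, for example find_overload_signs(open(path_to_system_log)), where path_to_system_log is whatever location your installation uses. Findings clustered around the timestamp of the repair error suggest the node was overloaded at that point.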
Network issues
Check the output of nodetool status on every node to see whether any node is reported in DN (Down/Normal) status, e.g.
--  Address         Load      Tokens  Owns  Host ID                               Rack
DN  10.100.100.100  4.38 GiB  64      ?     fdfc950d-6381-4c43-9bfc-ec567b06f360  rack1
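When the same check has to be repeated across many nodes, the DN rows can be picked out programmatically. This is a small Python sketch under the assumption that node rows look like the sample above, with the status/state code in the first column and the address in the second; the function name down_nodes is hypothetical.

```python
def down_nodes(status_output):
    """Return the addresses of nodes whose status/state column is DN.

    Assumes the row layout shown in the sample output: status/state code
    first, address second. Header and separator lines are ignored because
    their first field is not "DN".
    """
    down = []
    for line in status_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "DN":
            down.append(fields[1])
    return down
```

The input is the text printed by nodetool status, which could be captured on each node with, for instance, subprocess.run(["nodetool", "status"], capture_output=True, text=True).stdout. Comparing the result from every node shows whether all nodes agree on which peers are down.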
Examine the system.log or debug.log for any gossip issue, for example:
INFO [GossipTasks:1] 2020-01-04 03:55:49,320 Gossiper.java:1205 - InetAddress /10.100.100.101 is now DOWN
DEBUG [InternalResponseStage:13] 2020-07-02 05:24:15,203 Gossiper.java:1213 - Failed to receive echo reply from /10.100.100.101
If a network issue is suspected, run the following tests between the nodes to verify connectivity:
ping
ping <ip-address of the down node>
telnet
telnet <ip-address of the down node> 7000
OR, if node-to-node encryption is enabled:
telnet <ip-address of the down node> 7001
If either of the above commands fails, more investigation at the network layer will be required.
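Where telnet is not installed, the same TCP reachability test can be approximated with a short Python sketch. The function name port_open and the 3-second timeout are assumptions for illustration; the ports are the ones listed above (7000, or 7001 when node-to-node encryption is enabled). A successful connection shows only that the port accepts TCP connections, not that the node is healthy.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, port_open("10.100.100.101", 7000) run from each of the other nodes gives the same signal as the telnet test: a False result points to the network layer or to a node that is not listening.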
Last Modified Date: December 4, 2023
Document Location
Worldwide
Historical Number
ka0Ui0000000H0rIAE
Document Information
Modified date:
30 January 2026
UID
ibm17258830