IBM Support

Error inserting into system_distributed.parent_repair_history

Troubleshooting


Problem

Error inserting into system_distributed.parent_repair_history
 

Sample error:
 
ERROR [Repair-Task:1] 2019-06-21 06:40:44,895 SystemDistributedKeyspace.java:406 - Error executing query INSERT INTO system_distributed.parent_repair_history (parent_id, keyspace_name, columnfamily_names, requested_ranges, started_at, options) VALUES (11111111-0000-0000-0000-888888888888, 'system_auth', { 'roles','role_permissions','role_members' }, { '(1607483561684771030,1656713833712314075]' }, toTimestamp(now()), { 'trace': 'false','forceRepair': 'false','hosts': '','parallelism': 'parallel','dataCenters': '','previewKind': 'NONE','incremental': 'false','pullRepair': 'false','primaryRange': 'false','jobThreads': '1' }) 
 

This error typically produces a stack trace similar to the following (the exact stack trace varies by DSE version):

 
org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.service.AbstractWriteHandler$1.lambda$subscribeActual$0(AbstractWriteHandler.java:158)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.cassandra.service.AbstractWriteHandler$TimeoutAction.accept(AbstractWriteHandler.java:221)
at org.apache.cassandra.service.AbstractWriteHandler$TimeoutAction.accept(AbstractWriteHandler.java:216)
at org.apache.cassandra.concurrent.TPCTimeoutTask.run(TPCTimeoutTask.java:43)
at org.apache.cassandra.concurrent.TPCHashedWheelTimer.lambda$onTimeout$0(TPCHashedWheelTimer.java:43)
at org.apache.cassandra.utils.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:498)
at org.apache.cassandra.utils.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:573)
at org.apache.cassandra.utils.HashedWheelTimer$Worker.run(HashedWheelTimer.java:329)
at org.apache.cassandra.concurrent.TPCRunnable.run(TPCRunnable.java:68)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.process(EpollTPCEventLoopGroup.java:920)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processTasks(EpollTPCEventLoopGroup.java:892)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.runScheduledTasks(EpollTPCEventLoopGroup.java:980)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processEvents(EpollTPCEventLoopGroup.java:774)
at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.run(EpollTPCEventLoopGroup.java:441)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748) 
 

What does this error message mean?

This error is generated by running repair tasks. A repair task tracks the status of its repair session in the system_distributed.parent_repair_history and system_distributed.repair_history tables, and writes the session information to both tables with a consistency level of ONE (CL=ONE).

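The error indicates that an insert or update issued by the repair task against one of these two tables failed because the consistency level (CL) could not be met. In the sample above, the WriteTimeoutException reports 0 responses, meaning no replica acknowledged the write before the timeout.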
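To see when the failures began and which table they hit, the error lines can be pulled out of system.log. The following is a minimal Python sketch rather than an official tool. The function name and the regular expression are illustrative assumptions modelled on the sample error above, and the log format may differ between DSE versions.

```python
import re
from datetime import datetime

# Assumption: the pattern is modelled on the sample error line in this article.
ERROR_RE = re.compile(
    r"^ERROR\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+.*?"
    r"Error executing query (?:INSERT INTO|UPDATE)\s+"
    r"(?P<table>system_distributed\.(?:parent_)?repair_history)"
)


def find_repair_history_errors(lines):
    """Return (timestamp, table) pairs for each repair-history write error."""
    hits = []
    for line in lines:
        m = ERROR_RE.match(line)
        if m:
            # Log timestamps use a comma before the milliseconds.
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            hits.append((ts, m.group("table")))
    return hits
```

Passing an open system.log file object to find_repair_history_errors returns the matches in log order, so the first entry gives the approximate time at which the nodes stopped acknowledging these writes.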

 

Why does this error occur?

The error typically occurs for one of the following reasons:

    • Overloaded nodes
    • Communication issues among the nodes due to a network issue
     

    When nodes become unresponsive due to load or communication issues, inserts and updates against these two tables fail because the consistency level (CL) cannot be met.

     

    How do you fix this error?

    When this error occurs, it generally indicates that nodes in the cluster are not responsive. User queries may also be slow or fail.

     

    Overloaded nodes

    Examine system.log for signs that the nodes in the cluster were overloaded around the time the error started to occur, such as dropped messages or long GC pauses.

    For example:

    INFO  [ScheduledTasks:1] 2020-05-23 14:09:20,509  MessagingService.java:1273 - READ messages were dropped in last 5000 ms: 2300 internal and 136 cross node. Mean internal dropped latency: 5430 ms and Mean cross-node dropped latency: 5960 ms
     
    WARN  [Service Thread] 2020-05-23 14:09:15,508  GCInspector.java:282 - G1 Young Generation GC in 5170ms.  G1 Eden Space: 18035507200 -> 0; G1 Old Gen: 12280584520 -> 26468408336; G1 Survivor Space: 1132462080 -> 662700032;
     

    If the nodes in the cluster are overloaded, throttle the workload, review the access patterns (for example, whether expensive queries are being run), or add resources or nodes to better suit the cluster's needs.
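Scanning system.log for these overload signs can be scripted. The sketch below is a hedged illustration, not a supported utility. The regular expressions are assumptions based on the two sample log lines above, and the 1000 ms GC threshold is an arbitrary illustrative value, not a recommendation from this article.

```python
import re

# Assumption: both patterns are modelled on the sample log lines in this article.
DROPPED_RE = re.compile(
    r"(?P<verb>\w+) messages were dropped in last (?P<window>\d+) ms: "
    r"(?P<internal>\d+) internal and (?P<cross>\d+) cross node"
)
GC_RE = re.compile(
    r"GCInspector\.java:\d+ - (?P<collector>.+?) GC in (?P<ms>\d+)ms"
)


def overload_signs(lines, gc_threshold_ms=1000):
    """Return tuples describing dropped messages and long GC pauses."""
    signs = []
    for line in lines:
        m = DROPPED_RE.search(line)
        if m:
            total = int(m.group("internal")) + int(m.group("cross"))
            signs.append(("dropped", m.group("verb"), total))
            continue
        m = GC_RE.search(line)
        # gc_threshold_ms is a hypothetical cut-off; tune it for the cluster.
        if m and int(m.group("ms")) >= gc_threshold_ms:
            signs.append(("gc_pause", m.group("collector"), int(m.group("ms"))))
    return signs
```

Running overload_signs over the log of each node makes it easier to compare the time of the first dropped messages or long pauses with the time of the first repair-history error.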

     

    Network issues

    Check the output of nodetool status from all the nodes to see whether any node is in DN (Down, Normal) status.

    For example:

    --  Address         Load      Tokens  Owns  Host ID                               Rack
    DN  10.100.100.100  4.38 GiB  64      ?     fdfc950d-6381-4c43-9bfc-ec567b06f360  rack1
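When the check has to be repeated across many nodes, the DN rows can be extracted from captured nodetool status output. This is a minimal sketch under the assumption that the output has the column layout shown above. The function name down_nodes is hypothetical, and how the output is collected from each node is left to the reader.

```python
def down_nodes(status_output):
    """Return the addresses of nodes whose status/state column reads DN."""
    down = []
    for line in status_output.splitlines():
        fields = line.split()
        # Data rows start with a two-letter code such as UN or DN,
        # followed by the node address.
        if len(fields) >= 2 and fields[0] == "DN":
            down.append(fields[1])
    return down
```

Comparing the result from every node shows whether a node is seen as down by the whole cluster or only by some peers, which points more strongly at a network problem between specific nodes.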
    
     

    Examine system.log or debug.log for gossip issues, for example:

    INFO  [GossipTasks:1] 2020-01-04 03:55:49,320  Gossiper.java:1205 - InetAddress /10.100.100.101 is now DOWN
     
    DEBUG [InternalResponseStage:13] 2020-07-02 05:24:15,203  Gossiper.java:1213 - Failed to receive echo reply from /10.100.100.101
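To find which peers gossip reports as unreachable most often, the two kinds of message above can be counted per address. The following sketch is illustrative only. The patterns are assumptions derived from the sample lines, and other DSE versions may word these messages differently.

```python
import re
from collections import Counter

# Assumption: patterns follow the sample gossip messages in this article.
GOSSIP_DOWN_RE = re.compile(r"InetAddress /(?P<ip>\S+) is now DOWN")
ECHO_FAIL_RE = re.compile(r"Failed to receive echo reply from /(?P<ip>\S+)")


def gossip_problem_counts(lines):
    """Count DOWN notifications and failed echo replies per peer address."""
    counts = Counter()
    for line in lines:
        for pattern in (GOSSIP_DOWN_RE, ECHO_FAIL_RE):
            m = pattern.search(line)
            if m:
                counts[m.group("ip")] += 1
    return counts
```

Calling gossip_problem_counts(...).most_common() lists the peers with the most reported problems first, which gives a starting point for the connectivity tests.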
     

    If a network issue is suspected, run the following tests between the nodes to verify connectivity:

     

    ping

    ping <ip-address of the down node>
     

    telnet

    telnet <ip-address of the down node> 7000

    or, if node-to-node encryption is enabled:

    telnet <ip-address of the down node> 7001

     

    If either of the above commands fails, further investigation at the network layer is required.
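Where telnet is not installed, the same TCP check can be approximated with a short script. This is a hedged sketch using only the Python standard library. The function name port_reachable and the 3-second timeout are illustrative choices, and a successful connection only shows that the port accepts TCP connections, not that the node is healthy.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, port_reachable("<ip-address of the down node>", 7000) mirrors the first telnet test above, and port 7001 can be substituted when node-to-node encryption is enabled.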


    Last Modified Date:
    December 4, 2023

    Document Location

    Worldwide

    Historical Number

    ka0Ui0000000H0rIAE

    Document Information

    Modified date:
    30 January 2026

    UID

    ibm17258830