
Why Does Raft Snapshot Restore Take A Long Time

Troubleshooting


Problem

Consul nodes serving as Vault's storage backend became stuck in an infinite Raft snapshot restore loop, preventing cluster synchronization and causing service degradation.

Symptom

  • Three out of six Consul nodes (consul-3, consul-4, consul-6) unable to synchronize with the Raft leader
  • Nodes repeatedly receiving and restoring 23.4 GB snapshots in a continuous loop
  • Each snapshot restore cycle taking approximately 5.3 minutes
  • CPU utilization remaining low (~10%) despite large machine resources (r6i.8xlarge: 32 vCPUs, 256 GB RAM)
  • Nodes unable to catch up to leader's log entries before triggering another snapshot transfer
  • Observed pattern:
    Node receives 23.4 GB snapshot (~54 sec)
    → Node restores snapshot (~5.3 min)
    → Leader writes 18,300–25,900 entries during restore
    → Entries exceed the raft_trailing_logs buffer (10,000)
    → Loop repeats

Cause

Root Cause

The incident was triggered by a SIGTERM signal that terminated the Consul process on consul-3 (and likely consul-4 and consul-6) at 06:56:02 UTC. The shutdown was non-graceful, as shown by the log messages "Graceful shutdown disabled. Exiting" and "Shutdown without a Leave".

Technical Explanation

After restarting, each node spent approximately 5.3 minutes restoring its local 23.4 GB snapshot. During this restoration period, the Raft leader advanced by 18,300-25,900 log entries — well beyond the default raft_trailing_logs buffer of 10,000 entries. This forced the leader to send full snapshots instead of incremental log entries via AppendEntries RPCs, creating an infinite loop.
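
To make the failure condition concrete, here is a minimal back-of-the-envelope sketch in Go using only the figures above (the restore time and write rates are observed values from this incident, not Consul constants):

package main

import "fmt"

func main() {
	const (
		trailingLogs   = 10_000 // default raft_trailing_logs buffer
		restoreSeconds = 316.0  // observed ~5.3 min snapshot restore
	)
	// Leader write rates observed during the incident (entries/sec).
	for _, rate := range []float64{58, 82} {
		accumulated := rate * restoreSeconds
		fmt.Printf("%2.0f entries/s → %6.0f accumulated; exceeds %d-entry buffer: %v\n",
			rate, accumulated, trailingLogs, accumulated > trailingLogs)
	}
}

Both observed rates overflow the buffer, so the leader can never fall back to incremental replication.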

Contributing Factors

  1. Insufficient raft_trailing_logs buffer: Default 10,000 entries could not accommodate the 18,300-25,900 entries accumulated during snapshot restore
  2. Slow disk I/O: Snapshot restore processed data at approximately 78 MB/s (characteristic of default gp3 EBS volumes at 125 MB/s throughput)
  3. Non-graceful shutdown: leave_on_terminate = false (default for servers) caused abrupt departure without cluster notification
  4. Single-threaded Raft operations: Snapshot restore is a sequential, single-core operation bottlenecked by disk I/O, not CPU

Why Low CPU Utilization is Normal

Consul's Raft consensus protocol is single-threaded by design to maintain strict sequential ordering of log entries — a fundamental correctness constraint. The snapshot restore operation uses exactly one CPU core at 100% capacity. On a 32-core machine, this translates to approximately 3.125% CPU utilization, plus overhead from gossip, logging, and the Go runtime, resulting in the observed ~10% total utilization.

This is correct behavior, not a misconfiguration.

Environment

  • Infrastructure: 6-node Consul cluster on AWS EC2
  • Instance Type: r6i.8xlarge (32 vCPUs, 256 GB RAM)
  • Storage: EBS gp3 volumes (likely default 125 MB/s throughput, 3,000 IOPS)
  • Network: 12.5 Gbps guaranteed bandwidth
  • Consul Version: 1.19.3+ent
  • Use Case: Storage backend for HashiCorp Vault
  • Snapshot Size: 23.4 GB compressed
  • Write Rate: 58-82 Raft entries per second
  • Configuration:
    • raft_trailing_logs: 10,000 (default, insufficient)
    • leave_on_terminate: false (default for servers)

Diagnosing The Problem

Key Metrics to Monitor

1. Raft Replication Lag:

consul.raft.leader.lastContact > 1000ms  → Warning
consul.raft.leader.lastContact > 5000ms  → Critical

2. Snapshot Restore Duration:

consul.raft.snapshot.restore > 3 min   → Warning
consul.raft.snapshot.restore > 5 min   → Critical

3. Disk I/O Performance:

  • Monitor throughput (MB/s) during snapshot operations
  • Expected: 78 MB/s on default gp3 (125 MB/s provisioned)
  • Target: 250+ MB/s for optimal performance

4. Log Entry Accumulation:

  • Calculate: write_rate × restore_duration
  • Compare against raft_trailing_logs setting
  • Formula: raft_trailing_logs ≥ (entries_during_restore × safety_factor)

5. Process Restart Events:

  • Check for SIGTERM signals
  • Verify graceful vs. non-graceful shutdowns
  • Monitor for "Shutdown without a Leave" log messages
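
As a starting point for automating the checks above, the following Go sketch polls the agent's HTTP metrics endpoint and applies the lastContact thresholds. It assumes the default API address localhost:8500; the lastContact timer is reported by the current leader, and exact metric names can vary by Consul version.

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Partial shape of the /v1/agent/metrics response; extra fields are ignored.
type metrics struct {
	Samples []struct {
		Name string
		Max  float64 // milliseconds for lastContact
	}
}

func main() {
	resp, err := http.Get("http://localhost:8500/v1/agent/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var m metrics
	if err := json.NewDecoder(resp.Body).Decode(&m); err != nil {
		panic(err)
	}
	for _, s := range m.Samples {
		if s.Name != "consul.raft.leader.lastContact" {
			continue
		}
		switch {
		case s.Max > 5000:
			fmt.Printf("CRITICAL: lastContact %.0f ms\n", s.Max)
		case s.Max > 1000:
			fmt.Printf("WARNING: lastContact %.0f ms\n", s.Max)
		default:
			fmt.Printf("OK: lastContact %.0f ms\n", s.Max)
		}
	}
}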

Log Analysis

From CSV log data, identify the snapshot restore loop pattern:

Cycle | Snapshot Index | Installed At | Entries Since Previous    | Status
------|----------------|--------------|---------------------------|------------
1     | 7,203,055,285  | 07:10:48     | — (first remote snapshot) | Completed
2     | 7,203,055,285  | 07:23:14     | 0 (same index, re-sent)   | Completed
3     | 7,203,120,152  | 07:29:29     | +64,867 from cycle 1      | Completed
4     | 7,203,138,476  | 07:37:12     | +18,324 from cycle 3      | Completed
5     | 7,203,173,951  | ~07:37:31    | +35,475 from cycle 4      | In progress

Performance Bottleneck Analysis

Phase                                | Duration     | Bottleneck
-------------------------------------|--------------|------------------------
Network transfer of 23.4 GB snapshot | ~54 seconds  | Network (not an issue)
Snapshot restore to state machine    | ~5.3 minutes | Disk I/O
Log replication via AppendEntries    | Seconds      | Neither (very fast)

Resolving The Problem

Immediate Resolution (Applied)

Increase raft_trailing_logs from 10,000 to 50,000 to break the loop:

raft_trailing_logs = 50000

Result: With the larger buffer, log entries accumulated during restore (18,300–25,900) no longer exceeded the buffer, allowing the leader to replicate via fast AppendEntries RPCs instead of triggering another full snapshot.

Recommended Long-Term Solutions

1. Upgrade Disk I/O (Highest Impact, Lowest Cost)

Action: Increase gp3 provisioned throughput to at least 250 MB/s and IOPS to 10,000 on all Consul data volumes.

Implementation:

  • Online volume modification (no downtime required)
  • Cost: ~$5/month per volume (~$30/month for 6 nodes)
  • Pricing: $0.040 per provisioned MB/s per month above the 125 MB/s baseline

Expected Results:

gp3 Throughput         | Additional Cost/mo | Est. Restore Time | Entries Accumulated
-----------------------|--------------------|-------------------|--------------------
125 MB/s (current)     | $0                 | ~5.3 min          | 18,300–25,900
250 MB/s (recommended) | ~$5                | ~1.6 min          | ~5,600–7,900
500 MB/s               | ~$15               | ~0.8 min          | ~2,800–3,900
1,000 MB/s (gp3 max)   | ~$35               | ~0.4 min          | ~1,400–2,000

At 250 MB/s, the default raft_trailing_logs = 10,000 would actually be sufficient, but keeping 50,000 provides a 6-9× safety margin.
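
The table's estimates follow from two assumptions: restore is disk-bound (snapshot size divided by effective throughput) and the leader keeps writing at the observed 58–82 entries/sec. This Go sketch approximately reproduces the figures; 78 MB/s is the effective rate the default 125 MB/s volume delivered:

package main

import "fmt"

func main() {
	const snapshotMB = 23_400.0 // 23.4 GB snapshot
	// Effective restore throughput per gp3 tier (MB/s).
	for _, mbps := range []float64{78, 250, 500, 1000} {
		restoreSec := snapshotMB / mbps
		fmt.Printf("%5.0f MB/s → restore ~%.1f min, ~%.0f–%.0f entries accumulated\n",
			mbps, restoreSec/60, 58*restoreSec, 82*restoreSec)
	}
}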

2. Enable Graceful Shutdown (Critical)

Action: Set leave_on_terminate = true in Consul configuration.

leave_on_terminate = true

Impact: When a Consul node receives SIGTERM, it will gracefully leave the cluster, informing peers of its departure. This prevents the cascading failure scenario that occurred in this incident.

3. Right-Size Instances (Cost Savings)

Current: r6i.8xlarge (32 vCPUs, 256 GB RAM), significantly oversized for Consul's single-threaded Raft operations.

Recommended: Downsize to r6i.4xlarge (16 vCPUs, 128 GB RAM) or r6i.2xlarge (8 vCPUs, 64 GB RAM) if Consul-only.

Considerations:

  • Verify deployment topology (Consul-only vs. co-located with Vault)
  • If Vault is co-located, do not downsize below r6i.4xlarge
  • Smaller instances have burstable network bandwidth — verify baseline supports sustained snapshot transfers
  • Savings: ~$730-$1,095/month per node (~$4,400-$6,600/month for 6 nodes)

Network Bandwidth Impact:

Instance              | Baseline BW          | 23.4 GB Transfer Time  | Impact on Loop
----------------------|----------------------|------------------------|------------------------------------
r6i.8xlarge (current) | 12.5 Gbps guaranteed | ~54 sec                | No issue
r6i.4xlarge           | ~5 Gbps baseline     | ~54 sec (within burst) | Minimal
r6i.2xlarge           | ~2.5 Gbps baseline   | ~75 sec at baseline    | +21 sec per cycle if burst depleted
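
The transfer-time column can be sanity-checked the same way: the observed ~54 s transfer only degrades once an instance's baseline bandwidth caps the wire below the rate the leader actually achieved. A sketch of that floor:

package main

import "fmt"

func main() {
	const (
		snapshotGbit = 23.4 * 8 // snapshot size in gigabits
		observedSec  = 54.0     // measured transfer time on r6i.8xlarge
	)
	for _, gbps := range []float64{12.5, 5, 2.5} {
		sec := observedSec
		if wireLimit := snapshotGbit / gbps; wireLimit > sec {
			sec = wireLimit // bandwidth, not the sender, becomes the bottleneck
		}
		fmt.Printf("%4.1f Gbps baseline → ~%.0f s per transfer\n", gbps, sec)
	}
}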

4. Additional Configuration Tuning

# Already applied — keep this
raft_trailing_logs = 50000

# Enable graceful shutdown (CRITICAL)
leave_on_terminate = true

# Faster failure detection
performance {
  raft_multiplier = 1
}

# Adjust snapshot frequency for high-write workloads
raft_snapshot_threshold = 16384   # Default: 8192
raft_snapshot_interval  = "120s"  # Default: 120s

5. Implement Monitoring and Alerting

Deploy monitoring to detect issues before they escalate:

# Critical alerts
consul.raft.leader.lastContact  > 1000ms  → Warning
consul.raft.leader.lastContact  > 5000ms  → Critical
consul.raft.snapshot.restore    > 3 min   → Warning
consul.raft.snapshot.restore    > 5 min   → Critical
consul.serf.member.failed                 → Critical
Process restart detected                  → Critical

# Capacity planning
consul.raft.apply (rate)        → Track trend over time
consul.raft.snapshot.create     → Track snapshot size growth
Disk I/O utilization            → Track via CloudWatch/iostat

6. Investigate and Prevent SIGTERM Source

Action: Determine what sent the SIGTERM signal to 3 of 6 nodes simultaneously.

Possible causes:

  • Container orchestration platform (Kubernetes, ECS) performing rolling updates
  • Auto-scaling group replacement
  • Manual intervention
  • System maintenance or patching

Prevention: Implement proper shutdown procedures and ensure orchestration platforms respect graceful shutdown timeouts.

7. Reduce Snapshot Size (Long-Term)

The 23.4 GB snapshot is what makes each restore cycle slow enough to trigger the loop. Investigate:

  1. Data audit: What data is stored in Consul? Is all of it actively needed?
  2. Stale KV entries: Vault may accumulate expired leases or old versions — implement periodic cleanup
  3. Compression: Verify snapshot compression is enabled (should be in 1.19.3+ent)
  4. Storage backend optimization: Review Vault's storage patterns and cleanup policies

Formula for calculating raft_trailing_logs per HashiCorp guidance:

raft_trailing_logs = entries_accumulated_during_snapshot_install × safety_factor

For this environment:

Write rate:                    58-82 entries/sec
Snapshot restore time:         ~316 seconds (5.3 minutes)
Entries during restore:        58 × 316 = 18,328  (low estimate)
                               82 × 316 = 25,912  (high estimate)

With safety factor of 1.2:    25,912 × 1.2 = 31,094
With safety factor of 2.0:    25,912 × 2.0 = 51,824

Current setting: 50,000       → 1.93× safety factor (adequate)

If disk I/O is upgraded (restore time drops to ~90 seconds):

Entries during restore:        82 × 90 = 7,380
With safety factor of 2.0:    7,380 × 2.0 = 14,760
Default 10,000 would be close, but keep 50,000 for margin.

Important: Do not "set and forget." Monitor consul.raft.apply rates and snapshot restore durations over time. If write rates increase significantly, recalculate and adjust.
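
A small helper makes that recalculation mechanical. This is an illustrative Go sketch following the HashiCorp formula above; the function name is ours, not part of any Consul API:

package main

import "fmt"

// recommendTrailingLogs applies the guidance formula: entries that accumulate
// during a snapshot install, padded by a safety factor.
func recommendTrailingLogs(writeRatePerSec, restoreSeconds, safetyFactor float64) int {
	return int(writeRatePerSec * restoreSeconds * safetyFactor)
}

func main() {
	// Incident figures: 82 entries/sec peak write rate, ~316 s restore.
	fmt.Println(recommendTrailingLogs(82, 316, 2.0)) // 51824
	// After a disk upgrade that cuts restore to ~90 s.
	fmt.Println(recommendTrailingLogs(82, 90, 2.0)) // 14760
}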

Priority Action Matrix

Action                               | Priority | Impact                   | Effort  | Cost Impact
-------------------------------------|----------|--------------------------|---------|------------
Investigate SIGTERM source           | Critical | Prevents recurrence      | Medium  | None
Set leave_on_terminate = true        | Critical | Graceful shutdowns       | Low     | None
Increase gp3 throughput to 250+ MB/s | High     | 3× faster restore        | Trivial | +$30/mo
Downsize to r6i.4xlarge              | Medium   | Cost savings             | Medium  | -$4,400/mo
Implement monitoring/alerting        | High     | Early detection          | Medium  | Varies
Tune raft_multiplier = 1             | Low      | Faster failure detection | Low     | None
Clean up stale snapshot files        | Low      | Reclaim disk space       | Low     | None

Document Location

Worldwide

[{"Type":"MASTER","Line of Business":{"code":"LOB77","label":"Automation Platform"},"Business Unit":{"code":"BU048","label":"IBM Software"},"Product":{"code":"SSSJOV","label":"IBM Consul Self-Managed"},"ARM Category":[{"code":"a8mgJ0000000E7yQAE","label":"Consul-\u003EConsul Operations-\u003EOperational Management"}],"ARM Case Number":"TS022021500","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"1.18.0;1.18.11;1.19.0;1.19.9;1.20.0;1.20.7;1.21.0;1.21.5;1.22.0"}]

Document Information

Modified date:
11 May 2026

UID

ibm17270935