
Why Does Raft Snapshot Restore Take A Long Time

Troubleshooting


Problem

Consul nodes serving as Vault's storage backend became stuck in an infinite Raft snapshot restore loop, preventing cluster synchronization and causing service degradation.

Symptom

  • Three out of six Consul nodes (consul-3, consul-4, consul-6) unable to synchronize with the Raft leader
  • Nodes repeatedly receiving and restoring 23.4 GB snapshots in a continuous loop
  • Each snapshot restore cycle taking approximately 5.3 minutes
  • CPU utilization remaining low (~10%) despite large machine resources (r6i.8xlarge: 32 vCPUs, 256 GB RAM)
  • Nodes unable to catch up to leader's log entries before triggering another snapshot transfer
  • Observed pattern:
    Node receives 23.4 GB snapshot (~54 sec)
    → Node restores snapshot (~5.3 min)
    → Leader writes 18,300–25,900 entries during restore
    → Entries exceed the raft_trailing_logs buffer (10,000)
    → Loop repeats

Cause

Root Cause

The incident was triggered by a SIGTERM signal that terminated the Consul process on consul-3 (and likely consul-4 and consul-6) at 06:56:02 UTC. The shutdown was non-graceful, as shown by the log messages "Graceful shutdown disabled. Exiting" and "Shutdown without a Leave".

Technical Explanation

After restarting, each node spent approximately 5.3 minutes restoring its local 23.4 GB snapshot. During this restoration period, the Raft leader advanced by 18,300-25,900 log entries — well beyond the default raft_trailing_logs buffer of 10,000 entries. This forced the leader to send full snapshots instead of incremental log entries via AppendEntries RPCs, creating an infinite loop.
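
To make the failure condition concrete, here is a minimal back-of-the-envelope sketch in Go using only the figures above (the restore time and write rates are observed values from this incident, not Consul constants):

package main

import "fmt"

func main() {
	const (
		trailingLogs   = 10_000 // default raft_trailing_logs buffer
		restoreSeconds = 316.0  // observed ~5.3 min snapshot restore
	)
	// Leader write rates observed during the incident (entries/sec).
	for _, rate := range []float64{58, 82} {
		accumulated := rate * restoreSeconds
		fmt.Printf("%2.0f entries/s → %6.0f accumulated; exceeds %d-entry buffer: %v\n",
			rate, accumulated, trailingLogs, accumulated > trailingLogs)
	}
}

Both observed rates overflow the buffer, so the leader can never fall back to incremental replication.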

Contributing Factors

  1. Insufficient raft_trailing_logs buffer: Default 10,000 entries could not accommodate the 18,300-25,900 entries accumulated during snapshot restore
  2. Slow disk I/O: Snapshot restore processed data at approximately 78 MB/s (characteristic of default gp3 EBS volumes at 125 MB/s throughput)
  3. Non-graceful shutdown: leave_on_terminate = false (default for servers) caused abrupt departure without cluster notification
  4. Single-threaded Raft operations: Snapshot restore is a sequential, single-core operation bottlenecked by disk I/O, not CPU

Why Low CPU Utilization is Normal

Consul's Raft consensus protocol is single-threaded by design to maintain strict sequential ordering of log entries — a fundamental correctness constraint. The snapshot restore operation uses exactly one CPU core at 100% capacity. On a 32-core machine, this translates to approximately 3.125% CPU utilization, plus overhead from gossip, logging, and the Go runtime, resulting in the observed ~10% total utilization.

This is correct behavior, not a misconfiguration.

Environment

  • Infrastructure: 6-node Consul cluster on AWS EC2
  • Instance Type: r6i.8xlarge (32 vCPUs, 256 GB RAM)
  • Storage: EBS gp3 volumes (likely default 125 MB/s throughput, 3,000 IOPS)
  • Network: 12.5 Gbps guaranteed bandwidth
  • Consul Version: 1.19.3+ent
  • Use Case: Storage backend for HashiCorp Vault
  • Snapshot Size: 23.4 GB compressed
  • Write Rate: 58-82 Raft entries per second
  • Configuration:
    • raft_trailing_logs: 10,000 (default, insufficient)
    • leave_on_terminate: false (default for servers)

Diagnosing The Problem

Key Metrics to Monitor

1. Raft Replication Lag:

consul.raft.leader.lastContact > 1000ms  → Warning
consul.raft.leader.lastContact > 5000ms  → Critical

2. Snapshot Restore Duration:

consul.raft.snapshot.restore > 3 min   → Warning
consul.raft.snapshot.restore > 5 min   → Critical

3. Disk I/O Performance:

  • Monitor throughput (MB/s) during snapshot operations
  • Expected: 78 MB/s on default gp3 (125 MB/s provisioned)
  • Target: 250+ MB/s for optimal performance

4. Log Entry Accumulation:

  • Calculate: write_rate × restore_duration
  • Compare against raft_trailing_logs setting
  • Formula: raft_trailing_logs ≥ (entries_during_restore × safety_factor)

5. Process Restart Events:

  • Check for SIGTERM signals
  • Verify graceful vs. non-graceful shutdowns
  • Monitor for "Shutdown without a Leave" log messages
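
As a starting point for automating the checks above, the following Go sketch polls the agent's HTTP metrics endpoint and applies the lastContact thresholds. It assumes the default API address localhost:8500; the lastContact timer is reported by the current leader, and exact metric names can vary by Consul version.

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Partial shape of the /v1/agent/metrics response; extra fields are ignored.
type metrics struct {
	Samples []struct {
		Name string
		Max  float64 // milliseconds for lastContact
	}
}

func main() {
	resp, err := http.Get("http://localhost:8500/v1/agent/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var m metrics
	if err := json.NewDecoder(resp.Body).Decode(&m); err != nil {
		panic(err)
	}
	for _, s := range m.Samples {
		if s.Name != "consul.raft.leader.lastContact" {
			continue
		}
		switch {
		case s.Max > 5000:
			fmt.Printf("CRITICAL: lastContact %.0f ms\n", s.Max)
		case s.Max > 1000:
			fmt.Printf("WARNING: lastContact %.0f ms\n", s.Max)
		default:
			fmt.Printf("OK: lastContact %.0f ms\n", s.Max)
		}
	}
}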

Log Analysis

From CSV log data, identify the snapshot restore loop pattern:

Cycle | Snapshot Index | Installed At | Entries Since Previous    | Status
------|----------------|--------------|---------------------------|------------
1     | 7,203,055,285  | 07:10:48     | — (first remote snapshot) | Completed
2     | 7,203,055,285  | 07:23:14     | 0 (same index, re-sent)   | Completed
3     | 7,203,120,152  | 07:29:29     | +64,867 from cycle 1      | Completed
4     | 7,203,138,476  | 07:37:12     | +18,324 from cycle 3      | Completed
5     | 7,203,173,951  | ~07:37:31    | +35,475 from cycle 4      | In progress

Performance Bottleneck Analysis

Phase                                | Duration     | Bottleneck
-------------------------------------|--------------|------------------------
Network transfer of 23.4 GB snapshot | ~54 seconds  | Network (not an issue)
Snapshot restore to state machine    | ~5.3 minutes | Disk I/O
Log replication via AppendEntries    | Seconds      | Neither (very fast)

Resolving The Problem

Immediate Resolution (Applied)

Increase raft_trailing_logs from 10,000 to 50,000 to break the loop:

raft_trailing_logs = 50000

Result: With the larger buffer, log entries accumulated during restore (18,300–25,900) no longer exceeded the buffer, allowing the leader to replicate via fast AppendEntries RPCs instead of triggering another full snapshot.

Recommended Long-Term Solutions

1. Upgrade Disk I/O (Highest Impact, Lowest Cost)

Action: Increase gp3 provisioned throughput to at least 250 MB/s and IOPS to 10,000 on all Consul data volumes.

Implementation:

  • Online volume modification (no downtime required)
  • Cost: ~$5/month per volume (~$30/month for 6 nodes)
  • Pricing: $0.040 per provisioned MB/s per month above the 125 MB/s baseline

Expected Results:

gp3 Throughput         | Additional Cost/mo | Est. Restore Time | Entries Accumulated
-----------------------|--------------------|-------------------|--------------------
125 MB/s (current)     | $0                 | ~5.3 min          | 18,300–25,900
250 MB/s (recommended) | ~$5                | ~1.6 min          | ~5,600–7,900
500 MB/s               | ~$15               | ~0.8 min          | ~2,800–3,900
1,000 MB/s (gp3 max)   | ~$35               | ~0.4 min          | ~1,400–2,000

At 250 MB/s, the default raft_trailing_logs = 10,000 would actually be sufficient, but keeping 50,000 provides a 6-9× safety margin.
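
The table's estimates follow from two assumptions: restore is disk-bound (snapshot size divided by effective throughput) and the leader keeps writing at the observed 58–82 entries/sec. This Go sketch approximately reproduces the figures; 78 MB/s is the effective rate the default 125 MB/s volume delivered:

package main

import "fmt"

func main() {
	const snapshotMB = 23_400.0 // 23.4 GB snapshot
	// Effective restore throughput per gp3 tier (MB/s).
	for _, mbps := range []float64{78, 250, 500, 1000} {
		restoreSec := snapshotMB / mbps
		fmt.Printf("%5.0f MB/s → restore ~%.1f min, ~%.0f–%.0f entries accumulated\n",
			mbps, restoreSec/60, 58*restoreSec, 82*restoreSec)
	}
}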

2. Enable Graceful Shutdown (Critical)

Action: Set leave_on_terminate = true in Consul configuration.

leave_on_terminate = true

Impact: When a Consul node receives SIGTERM, it will gracefully leave the cluster, informing peers of its departure. This prevents the cascading failure scenario that occurred in this incident.

3. Right-Size Instances (Cost Savings)

Current: r6i.8xlarge (32 vCPUs, 256 GB RAM), significantly oversized for Consul's single-threaded Raft operations.

Recommended: Downsize to r6i.4xlarge (16 vCPUs, 128 GB RAM) or r6i.2xlarge (8 vCPUs, 64 GB RAM) if Consul-only.

Considerations:

  • Verify deployment topology (Consul-only vs. co-located with Vault)
  • If Vault is co-located, do not downsize below r6i.4xlarge
  • Smaller instances have burstable network bandwidth — verify baseline supports sustained snapshot transfers
  • Savings: ~$730-$1,095/month per node (~$4,400-$6,600/month for 6 nodes)

Network Bandwidth Impact:

Instance              | Baseline BW          | 23.4 GB Transfer Time  | Impact on Loop
----------------------|----------------------|------------------------|------------------------------------
r6i.8xlarge (current) | 12.5 Gbps guaranteed | ~54 sec                | No issue
r6i.4xlarge           | ~5 Gbps baseline     | ~54 sec (within burst) | Minimal
r6i.2xlarge           | ~2.5 Gbps baseline   | ~75 sec at baseline    | +21 sec per cycle if burst depleted
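
The transfer-time column can be sanity-checked the same way: the observed ~54 s transfer only degrades once an instance's baseline bandwidth caps the wire below the rate the leader actually achieved. A sketch of that floor:

package main

import "fmt"

func main() {
	const (
		snapshotGbit = 23.4 * 8 // snapshot size in gigabits
		observedSec  = 54.0     // measured transfer time on r6i.8xlarge
	)
	for _, gbps := range []float64{12.5, 5, 2.5} {
		sec := observedSec
		if wireLimit := snapshotGbit / gbps; wireLimit > sec {
			sec = wireLimit // bandwidth, not the sender, becomes the bottleneck
		}
		fmt.Printf("%4.1f Gbps baseline → ~%.0f s per transfer\n", gbps, sec)
	}
}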

4. Additional Configuration Tuning

# Already applied — keep this
raft_trailing_logs = 50000

# Enable graceful shutdown (CRITICAL)
leave_on_terminate = true

# Faster failure detection
performance {
  raft_multiplier = 1
}

# Adjust snapshot frequency for high-write workloads
raft_snapshot_threshold = 16384   # Default: 8192
raft_snapshot_interval  = "120s"  # Default: 120s

5. Implement Monitoring and Alerting

Deploy monitoring to detect issues before they escalate:

# Critical alerts
consul.raft.leader.lastContact  > 1000ms  → Warning
consul.raft.leader.lastContact  > 5000ms  → Critical
consul.raft.snapshot.restore    > 3 min   → Warning
consul.raft.snapshot.restore    > 5 min   → Critical
consul.serf.member.failed                 → Critical
Process restart detected                  → Critical

# Capacity planning
consul.raft.apply (rate)        → Track trend over time
consul.raft.snapshot.create     → Track snapshot size growth
Disk I/O utilization            → Track via CloudWatch/iostat

6. Investigate and Prevent SIGTERM Source

Action: Determine what sent the SIGTERM signal to 3 of 6 nodes simultaneously.

Possible causes:

  • Container orchestration platform (Kubernetes, ECS) performing rolling updates
  • Auto-scaling group replacement
  • Manual intervention
  • System maintenance or patching

Prevention: Implement proper shutdown procedures and ensure orchestration platforms respect graceful shutdown timeouts.

7. Reduce Snapshot Size (Long-Term)

The 23.4 GB snapshot is what makes each restore cycle slow enough to trigger the loop. Investigate:

  1. Data audit: What data is stored in Consul? Is all of it actively needed?
  2. Stale KV entries: Vault may accumulate expired leases or old versions — implement periodic cleanup
  3. Compression: Verify snapshot compression is enabled (should be in 1.19.3+ent)
  4. Storage backend optimization: Review Vault's storage patterns and cleanup policies

Formula for calculating raft_trailing_logs per HashiCorp guidance:

raft_trailing_logs = entries_accumulated_during_snapshot_install × safety_factor

For this environment:

Write rate:                    58-82 entries/sec
Snapshot restore time:         ~316 seconds (5.3 minutes)
Entries during restore:        58 × 316 = 18,328  (low estimate)
                               82 × 316 = 25,912  (high estimate)

With safety factor of 1.2:    25,912 × 1.2 = 31,094
With safety factor of 2.0:    25,912 × 2.0 = 51,824

Current setting: 50,000       → 1.93× safety factor (adequate)

If disk I/O is upgraded (restore time drops to ~90 seconds):

Entries during restore:        82 × 90 = 7,380
With safety factor of 2.0:    7,380 × 2.0 = 14,760
Default 10,000 would be close, but keep 50,000 for margin.

Important: Do not "set and forget." Monitor consul.raft.apply rates and snapshot restore durations over time. If write rates increase significantly, recalculate and adjust.
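
A small helper makes that recalculation mechanical. This is an illustrative Go sketch following the HashiCorp formula above; the function name is ours, not part of any Consul API:

package main

import "fmt"

// recommendTrailingLogs applies the guidance formula: entries that accumulate
// during a snapshot install, padded by a safety factor.
func recommendTrailingLogs(writeRatePerSec, restoreSeconds, safetyFactor float64) int {
	return int(writeRatePerSec * restoreSeconds * safetyFactor)
}

func main() {
	// Incident figures: 82 entries/sec peak write rate, ~316 s restore.
	fmt.Println(recommendTrailingLogs(82, 316, 2.0)) // 51824
	// After a disk upgrade that cuts restore to ~90 s.
	fmt.Println(recommendTrailingLogs(82, 90, 2.0)) // 14760
}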

Priority Action Matrix

Action                               | Priority | Impact                   | Effort  | Cost Impact
-------------------------------------|----------|--------------------------|---------|------------
Investigate SIGTERM source           | Critical | Prevents recurrence      | Medium  | None
Set leave_on_terminate = true        | Critical | Graceful shutdowns       | Low     | None
Increase gp3 throughput to 250+ MB/s | High     | 3× faster restore        | Trivial | +$30/mo
Downsize to r6i.4xlarge              | Medium   | Cost savings             | Medium  | -$4,400/mo
Implement monitoring/alerting        | High     | Early detection          | Medium  | Varies
Tune raft_multiplier = 1             | Low      | Faster failure detection | Low     | None
Clean up stale snapshot files        | Low      | Reclaim disk space       | Low     | None

Document Location

Worldwide

[{"Type":"MASTER","Line of Business":{"code":"LOB77","label":"Automation Platform"},"Business Unit":{"code":"BU048","label":"IBM Software"},"Product":{"code":"SSSJOV","label":"IBM Consul Self-Managed"},"ARM Category":[{"code":"a8mgJ0000000E7yQAE","label":"Consul-\u003EConsul Operations-\u003EOperational Management"}],"ARM Case Number":"TS022021500","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"1.18.0;1.18.11;1.19.0;1.19.9;1.20.0;1.20.7;1.21.0;1.21.5;1.22.0"}]

Document Information

Modified date:
11 May 2026

UID

ibm17270935