The "DB::Exception: Raft consensus error" in ClickHouse signals a failure in the Raft consensus protocol used by ClickHouse Keeper (the built-in alternative to ZooKeeper). The error code is RAFT_ERROR. This means the Keeper cluster was unable to achieve consensus on an operation, which directly affects any functionality that depends on distributed coordination -- replication, distributed DDL, and leader election among them.
Impact
Raft consensus failures can have wide-reaching effects across the entire ClickHouse cluster:
- Replicated tables may stop synchronizing data between replicas
- INSERT operations to replicated tables can fail or hang waiting for quorum
- Distributed DDL queries (ALTER, CREATE on Replicated databases) will not propagate
- Leader election for ReplicatedMergeTree tables may stall, halting merges on affected tables
- If the Keeper ensemble loses quorum entirely, the cluster effectively becomes read-only for replicated tables
Common Causes
- Loss of quorum -- More than half of the Keeper nodes are down or unreachable, making it impossible to commit any new log entries.
- Network partitions -- Connectivity issues between Keeper nodes prevent them from exchanging heartbeats and log replication messages.
- Disk I/O bottlenecks on Keeper nodes -- The Raft log and snapshots are persisted to disk. Slow storage can cause nodes to fall behind and trigger timeouts.
- Clock skew between Keeper nodes -- Significant time differences can confuse timeout calculations and leader election.
- Keeper node overload -- Excessive request volume or large transactions (e.g., bulk table creation) can overwhelm a Keeper node, causing it to miss heartbeats.
- Misconfigured Keeper ensemble -- Incorrect `server_id` values, wrong peer addresses, or mismatched Raft settings across nodes.
- Corrupted Raft log or snapshots -- Disk corruption or abrupt shutdowns can leave the Raft state inconsistent on one or more nodes.
Troubleshooting and Resolution Steps
Check Keeper ensemble health:
```shell
# Using the four-letter command interface
echo ruok | nc keeper_host 9181
# Expected response: imok
echo mntr | nc keeper_host 9181
```

Repeat for each Keeper node. If fewer than a majority respond, quorum is lost.
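The per-node check above can be swept across the whole ensemble with a small script. This is a sketch, not part of the official tooling: the hypothetical `keeper_health.sh` name, the host list, and the 2-second timeout are assumptions to adjust for your deployment.

```shell
#!/usr/bin/env bash
# Sweep every Keeper node with the "ruok" four-letter command and report
# whether a majority is healthy. Pass hostnames as arguments, e.g.:
#   ./keeper_health.sh keeper1 keeper2 keeper3
KEEPER_PORT="${KEEPER_PORT:-9181}"

check_node() {
  # Returns 0 if the node answers "imok" within 2 seconds.
  [ "$(echo ruok | timeout 2 nc "$1" "$KEEPER_PORT" 2>/dev/null)" = "imok" ]
}

healthy=0
total=0
for host in "$@"; do
  total=$((total + 1))
  if check_node "$host"; then
    echo "$host: imok"
    healthy=$((healthy + 1))
  else
    echo "$host: NO RESPONSE"
  fi
done

# A majority is floor(total/2) + 1 voting members.
majority=$(( total / 2 + 1 ))
if [ "$total" -gt 0 ] && [ "$healthy" -ge "$majority" ]; then
  echo "quorum OK ($healthy/$total healthy)"
elif [ "$total" -gt 0 ]; then
  echo "quorum LOST ($healthy/$total healthy, need $majority)"
fi
```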
Verify Keeper cluster status from ClickHouse:
```sql
SELECT * FROM system.zookeeper WHERE path = '/';
```

If this query times out or errors, the Keeper connection is broken.
Inspect Keeper logs for Raft-specific errors:
```shell
grep -iE "raft|leader|follower|election|quorum|snapshot" \
  /var/log/clickhouse-keeper/clickhouse-keeper.log | tail -50
```

Look for messages about failed elections, log replication gaps, or snapshot application errors.
Check network connectivity between Keeper peers:
```shell
# Test the Raft internal port (default 9234)
nc -zv keeper_peer_host 9234
```

Ensure all Keeper nodes can reach each other on the Raft port.
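A sketch of a multi-peer sweep, under the assumption that peer hostnames are passed as arguments and that the Raft port matches the `<port>` entries in your `<raft_configuration>`. Run it from each Keeper node in turn, since asymmetric firewall rules can break Raft traffic in one direction only.

```shell
#!/usr/bin/env bash
# Probe each Keeper peer on the Raft port from this machine.
RAFT_PORT="${RAFT_PORT:-9234}"

unreachable=0
for peer in "$@"; do
  if timeout 2 nc -z "$peer" "$RAFT_PORT" 2>/dev/null; then
    echo "$peer:$RAFT_PORT reachable"
  else
    echo "$peer:$RAFT_PORT UNREACHABLE"
    unreachable=$((unreachable + 1))
  fi
done
echo "$unreachable peer(s) unreachable from $(hostname)"
```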
Review disk performance on Keeper nodes:
```shell
iostat -x 1 5
```

High `await` or `%util` values on the disk hosting the Keeper data directory indicate I/O bottlenecks. Move Keeper data to faster storage (SSD or NVMe recommended).

Restart a lagging Keeper node if it has fallen too far behind:

```shell
systemctl restart clickhouse-keeper
```

The node will catch up by replaying the Raft log from peers or loading the latest snapshot.
If the Raft log is corrupted on a single node, remove its data directory and let it rejoin as a fresh follower:
```shell
systemctl stop clickhouse-keeper
rm -rf /var/lib/clickhouse-keeper/raft_log /var/lib/clickhouse-keeper/raft_snapshot
systemctl start clickhouse-keeper
```

The node will receive a snapshot from the current leader and resume normal operation.
Best Practices
- Deploy Keeper in an odd-numbered ensemble (3 or 5 nodes); an even-sized ensemble needs a larger majority without tolerating any additional failures. Three nodes tolerate one failure; five tolerate two.
- Place Keeper nodes on separate failure domains (different racks, availability zones) to reduce the risk of simultaneous failures.
- Use fast, low-latency storage (NVMe or local SSD) for the Keeper data directory. Network-attached storage with high latency is a common source of Raft timeouts.
- Monitor Keeper metrics (`zk_followers`, `zk_synced_followers`, `zk_last_proposal_size`) and alert when followers fall out of sync.
- Keep clock synchronization tight across all nodes using NTP or a similar service.
- Avoid overloading Keeper with too many znodes -- tens of millions of nodes can strain the ensemble.
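The follower-sync alert suggested above can be sketched by parsing `mntr` output. The leader host and port in the usage comment are assumptions; `zk_followers` and `zk_synced_followers` are only reported by the leader node.

```shell
#!/usr/bin/env bash
# Flag followers that are out of sync with the leader, based on "mntr"
# four-letter command output read from stdin.
check_sync() {
  awk '
    $1 == "zk_followers"        { f = $2 }
    $1 == "zk_synced_followers" { s = $2 }
    END {
      if (f == "") { print "no follower metrics (not the leader?)"; exit 2 }
      if (s == f)  { print "OK: " s "/" f " followers in sync"; exit 0 }
      print "LAG: only " s " of " f " followers in sync"; exit 1
    }'
}

# Typical use against the current leader:
#   echo mntr | nc keeper_leader_host 9181 | check_sync
```

A non-zero exit status makes this easy to wire into a cron job or monitoring agent.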
Frequently Asked Questions
Q: How many Keeper nodes can fail before I lose quorum?
A: With an ensemble of N nodes, you can tolerate up to (N-1)/2 failures (rounded down). For a 3-node cluster, one node can fail. For a 5-node cluster, two can fail. If more nodes are lost, no writes can be committed.
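The arithmetic can be sanity-checked with a couple of shell one-liners:

```shell
# Quorum arithmetic for an ensemble of N Keeper nodes.
majority()           { echo $(( $1 / 2 + 1 )); }    # votes needed to commit
tolerable_failures() { echo $(( ($1 - 1) / 2 )); }  # nodes that may fail

for n in 3 4 5; do
  echo "N=$n: needs $(majority "$n") votes, tolerates $(tolerable_failures "$n") failure(s)"
done
# N=3: needs 2 votes, tolerates 1 failure(s)
# N=4: needs 3 votes, tolerates 1 failure(s)
# N=5: needs 3 votes, tolerates 2 failure(s)
```

Note that N=4 tolerates no more failures than N=3, which is why even-sized ensembles are discouraged.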
Q: Can I use ZooKeeper instead of ClickHouse Keeper to avoid RAFT_ERROR?
A: Yes. ClickHouse supports both ZooKeeper and its built-in Keeper. ZooKeeper uses the ZAB protocol rather than Raft. The choice does not eliminate consensus-related errors -- ZooKeeper has its own analogous failure modes -- but it is a supported alternative.
Q: Will data be lost if the Keeper cluster loses quorum temporarily?
A: No data is lost. Committed data in ClickHouse table parts remains on disk regardless of Keeper status. However, new inserts to replicated tables and replication of existing data will stall until quorum is restored. Uncommitted operations that were in-flight when quorum was lost may need to be retried.
Q: How do I recover from a completely failed Keeper ensemble?
A: If all Keeper data is lost, you can reinitialize the ensemble and then use SYSTEM RESTORE REPLICA on each replicated table to rebuild the Keeper metadata from the actual data parts on disk. This is a last-resort procedure and should be done carefully.
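A minimal sketch of that last-resort procedure, assuming `clickhouse-client` with default credentials and an already reinitialized (empty) Keeper ensemble; depending on your ClickHouse version you may need to run SYSTEM RESTART REPLICA on a table before SYSTEM RESTORE REPLICA succeeds.

```shell
#!/usr/bin/env bash
# Walk every read-only replicated table and rebuild its Keeper metadata
# from the data parts already on disk.
CLICKHOUSE_CLIENT="${CLICKHOUSE_CLIENT:-clickhouse-client}"

list_readonly_replicas() {
  # Replicas drop to read-only when their Keeper metadata is missing.
  $CLICKHOUSE_CLIENT --query \
    "SELECT database, table FROM system.replicas WHERE is_readonly FORMAT TSV"
}

restore_all() {
  list_readonly_replicas | while IFS=$'\t' read -r db tbl; do
    echo "Restoring ${db}.${tbl}"
    $CLICKHOUSE_CLIENT --query "SYSTEM RESTORE REPLICA \`${db}\`.\`${tbl}\`"
  done
}

# restore_all   # uncomment to run against a live server
```

Review the list printed by `list_readonly_replicas` before uncommenting the final line; restoring a replica that is read-only for an unrelated reason can mask a different problem.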