The "DB::Exception: All replicas lost" error in ClickHouse indicates that connections to every replica of a shard were lost during query execution. The error code is ALL_REPLICAS_LOST. Unlike connection errors at the start of a query, this exception typically means connections were initially established but then dropped, or that replicas became unavailable mid-flight.
Impact
This error causes the distributed query to fail for the affected shard, which can result in:
- Complete query failure if `skip_unavailable_shards` is not enabled
- Interrupted data pipelines and application requests
- Possible partial writes if the failure occurs during a distributed INSERT
- A signal that the cluster may be experiencing instability requiring immediate attention
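For read paths where partial results are acceptable, `skip_unavailable_shards` can be passed per query. Below is a minimal sketch using ClickHouse's HTTP interface (port 8123); the host name and table are hypothetical, and in practice you could equally set the flag per session in `clickhouse-client` or in a settings profile:

```python
from urllib.parse import urlencode

def build_query_url(host: str, query: str, settings: dict) -> str:
    """Build a ClickHouse HTTP-interface URL; settings ride along as URL parameters."""
    params = {"query": query, **settings}
    return f"http://{host}:8123/?{urlencode(params)}"

# Tolerate a lost shard on a non-critical dashboard query
# ("clickhouse.internal" and dist_table are placeholders):
url = build_query_url(
    "clickhouse.internal",
    "SELECT count() FROM dist_table",
    {"skip_unavailable_shards": 1},
)
print(url)
```

Keep this off critical paths: with the setting enabled, a lost shard silently shrinks the result set instead of failing the query.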
Common Causes
- Replica crashes during query execution -- The remote ClickHouse process terminates unexpectedly due to out-of-memory conditions, segfaults, or other fatal errors.
- Network interruptions -- Transient or sustained network failures sever established TCP connections between the initiator and replica nodes.
- Query timeouts on the replica side -- The replica kills the query due to `max_execution_time` or other resource limits, closing the connection.
- Overloaded replicas -- Resource exhaustion (CPU, memory, or threads) on replica servers causes them to stop responding.
- Kernel or OS-level issues -- TCP keepalive mismatches, connection tracking table overflow, or firewall state expiration can silently drop connections.
- Rolling restarts or deployments -- All replicas of a shard are restarted simultaneously or in quick succession, leaving none available.
Troubleshooting and Resolution Steps
Check replica server status immediately:
```
systemctl status clickhouse-server
```

If the process has crashed, examine the core dump and error log:
```
tail -500 /var/log/clickhouse-server/clickhouse-server.err.log
```

Query the system tables for recent errors on the initiator:
```
SELECT name, value, last_error_time, last_error_message
FROM system.errors
WHERE last_error_time > now() - INTERVAL 10 MINUTE
ORDER BY last_error_time DESC;
```

Review replica health and replication state:
```
SELECT database, table, replica_name, is_session_expired, active_replicas
FROM system.replicas;
```

Look for replicas with `is_session_expired = 1` or low `active_replicas` counts.

Verify network stability between nodes:
```
# Long-running connectivity test
ping -c 100 <replica_host>

# Check for packet loss and latency
mtr <replica_host>
```

Check resource utilization on replica nodes -- OOM kills are a common cause of sudden replica loss:
```
dmesg | grep -i "out of memory"
journalctl -u clickhouse-server --since "10 minutes ago"
```

If rolling restarts are the cause, stagger them properly: ensure that at least one replica per shard remains available at all times during maintenance windows, and wait for each replica to fully rejoin the cluster before restarting the next.
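The restart sequencing above can be sketched as a small orchestration loop. Everything here is a hypothetical illustration: `restart_replica` and `replica_is_healthy` stand in for whatever your deployment tooling actually provides (systemctl over SSH, a Kubernetes rollout, an HTTP `/ping` probe, and so on).

```python
import time

# Hypothetical cluster layout: shard -> list of replica hostnames.
CLUSTER = {
    "shard_1": ["ch-1a", "ch-1b"],
    "shard_2": ["ch-2a", "ch-2b"],
}

RESTART_LOG = []  # records restart order, useful for auditing

def restart_replica(host: str) -> None:
    """Stand-in for real deployment tooling."""
    RESTART_LOG.append(host)
    print(f"restarting {host}")

def replica_is_healthy(host: str) -> bool:
    """Stand-in for a real health probe; always healthy in this sketch."""
    return True

def rolling_restart(cluster: dict, poll_seconds: float = 1.0) -> None:
    # Restart one replica at a time within each shard, so every shard
    # always keeps at least one live replica.
    for shard, replicas in cluster.items():
        for host in replicas:
            restart_replica(host)
            # Wait for the replica to fully rejoin before touching the next one.
            while not replica_is_healthy(host):
                time.sleep(poll_seconds)
            print(f"{shard}: {host} healthy, moving on")

rolling_restart(CLUSTER)
```

A real health probe should also confirm replication has caught up (e.g. via `system.replicas`) rather than only checking that the process answers.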
Tune TCP keepalive settings so that dead connections are detected faster and the load balancer can route around them:
```
<!-- In clickhouse-server config -->
<tcp_keep_alive_timeout>60</tcp_keep_alive_timeout>
```
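At the OS level, the analogous knobs are the kernel's TCP keepalive sysctls. The values below are illustrative starting points, not recommendations for every environment:

```
# /etc/sysctl.d/99-tcp-keepalive.conf -- example values, tune per environment
net.ipv4.tcp_keepalive_time = 60     # seconds of idle before the first probe
net.ipv4.tcp_keepalive_intvl = 10    # seconds between probes
net.ipv4.tcp_keepalive_probes = 5    # failed probes before the connection is declared dead
```

Apply with `sysctl --system` (or `sysctl -p` for a single file). Shorter values detect dead peers sooner at the cost of slightly more keepalive traffic.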
Best Practices
- Maintain at least two replicas per shard so that losing one does not make the shard entirely unavailable.
- Configure memory limits carefully on replica nodes to prevent OOM kills -- use `max_memory_usage` at the query level and `max_server_memory_usage` at the server level.
- Implement health checks and automated alerting on replica availability using the `system.replicas` table.
- During maintenance, perform rolling restarts with proper sequencing to guarantee replica availability.
- Use `skip_unavailable_shards` in non-critical query paths where partial results are tolerable.
- Set appropriate TCP keepalive values at the OS and ClickHouse level to handle connection state issues in cloud or containerized environments.
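The `system.replicas`-based alerting practice can be sketched as a small check. The rows below are hard-coded stand-ins for the result of `SELECT replica_name, is_session_expired, active_replicas FROM system.replicas`; in practice you would fetch them with a ClickHouse client on a schedule and feed unhealthy entries into your alerting system. The threshold name is hypothetical.

```python
# Each tuple mimics (replica_name, is_session_expired, active_replicas)
# as returned by system.replicas; hard-coded here for illustration.
rows = [
    ("replica_1", 0, 2),
    ("replica_2", 1, 2),  # ZooKeeper/Keeper session expired
    ("replica_3", 0, 0),  # no active replicas left for this table
]

MIN_ACTIVE_REPLICAS = 1  # hypothetical alert threshold

def unhealthy(rows):
    """Return replica names that should trigger an alert."""
    return [
        name
        for name, session_expired, active in rows
        if session_expired == 1 or active < MIN_ACTIVE_REPLICAS
    ]

print(unhealthy(rows))  # replica_2 and replica_3 need attention
```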
Frequently Asked Questions
Q: What is the difference between ALL_REPLICAS_LOST and ALL_CONNECTION_TRIES_FAILED?
A: ALL_CONNECTION_TRIES_FAILED means ClickHouse could not establish a connection to any replica in the first place. ALL_REPLICAS_LOST means connections were established or available but were subsequently lost during the query lifecycle.
Q: Can this error occur with a single-replica setup?
A: Yes. If your shard has only one replica and it becomes unavailable, you will see this error. This is one of the key reasons to run multiple replicas per shard.
Q: Does ClickHouse automatically retry the query on other replicas when one is lost?
A: ClickHouse does attempt failover to other replicas within the same shard. The ALL_REPLICAS_LOST error means that this failover also failed -- every replica in the shard was lost.
Q: How can I make my application resilient to this error?
A: Implement retry logic in your application with exponential backoff. For read queries, consider enabling skip_unavailable_shards and handling partial results gracefully. For writes, use a buffer or queue to retry failed inserts.
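The retry advice above can be sketched as a small helper. The backoff parameters and the `flaky_query` stand-in are illustrative; a real implementation would call your ClickHouse client and catch its specific exception types rather than bare `Exception`.

```python
import random
import time

def retry_with_backoff(fn, attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry fn() with exponential backoff plus jitter; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries so clients don't stampede recovering replicas.
            time.sleep(delay + random.uniform(0, delay))

# Simulated flaky query: fails twice (as if replicas were lost), then succeeds.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("DB::Exception: All replicas lost")
    return 42

print(retry_with_backoff(flaky_query))
```

Note that retries are only safe for idempotent operations; for distributed INSERTs, pair this with deduplication or an upstream queue as suggested above.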