The "DB::Exception: All replicas lost" error in ClickHouse indicates that connections to every replica of a shard were lost during query execution. The error code is ALL_REPLICAS_LOST. Unlike connection errors at the start of a query, this exception typically means connections were initially established but then dropped, or that replicas became unavailable mid-flight.
Impact
This error causes the distributed query to fail for the affected shard, which can result in:
- Complete query failure if `skip_unavailable_shards` is not enabled
- Interrupted data pipelines and application requests
- Possible partial writes if the failure occurs during a distributed INSERT
- A signal that the cluster may be experiencing instability requiring immediate attention
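For read paths where partial results are acceptable, `skip_unavailable_shards` can be passed per query. Below is a minimal sketch using ClickHouse's HTTP interface (port 8123); the host name and table are hypothetical, and in practice you could equally set the flag per session in `clickhouse-client` or in a settings profile:

```python
from urllib.parse import urlencode

def build_query_url(host: str, query: str, settings: dict) -> str:
    """Build a ClickHouse HTTP-interface URL; settings ride along as URL parameters."""
    params = {"query": query, **settings}
    return f"http://{host}:8123/?{urlencode(params)}"

# Tolerate a lost shard on a non-critical dashboard query
# ("clickhouse.internal" and dist_table are placeholders):
url = build_query_url(
    "clickhouse.internal",
    "SELECT count() FROM dist_table",
    {"skip_unavailable_shards": 1},
)
print(url)
```

Keep this off critical paths: with the setting enabled, a lost shard silently shrinks the result set instead of failing the query.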
Common Causes
- Replica crashes during query execution -- The remote ClickHouse process terminates unexpectedly due to out-of-memory conditions, segfaults, or other fatal errors.
- Network interruptions -- Transient or sustained network failures sever established TCP connections between the initiator and replica nodes.
- Query timeouts on the replica side -- The replica kills the query due to `max_execution_time` or other resource limits, closing the connection.
- Overloaded replicas -- Resource exhaustion (CPU, memory, or threads) on replica servers causes them to stop responding.
- Kernel or OS-level issues -- TCP keepalive mismatches, connection tracking table overflow, or firewall state expiration can silently drop connections.
- Rolling restarts or deployments -- All replicas of a shard are restarted simultaneously or in quick succession, leaving none available.
Troubleshooting and Resolution Steps
Check replica server status immediately:
```
systemctl status clickhouse-server
```

If the process has crashed, examine the core dump and error log:
```
tail -500 /var/log/clickhouse-server/clickhouse-server.err.log
```

Query the system tables for recent errors on the initiator:
```
SELECT name, value, last_error_time, last_error_message
FROM system.errors
WHERE last_error_time > now() - INTERVAL 10 MINUTE
ORDER BY last_error_time DESC;
```

Review replica health and replication state:
```
SELECT database, table, replica_name, is_session_expired, active_replicas
FROM system.replicas;
```

Look for replicas with `is_session_expired = 1` or low `active_replicas` counts.

Verify network stability between nodes:
```
# Long-running connectivity test
ping -c 100 <replica_host>

# Check for packet loss and latency
mtr <replica_host>
```

Check resource utilization on replica nodes -- OOM kills are a common cause of sudden replica loss:
```
dmesg | grep -i "out of memory"
journalctl -u clickhouse-server --since "10 minutes ago"
```

If rolling restarts are the cause, stagger them properly: ensure that at least one replica per shard remains available at all times during maintenance windows, and wait for each replica to fully rejoin the cluster before restarting the next.
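The restart sequencing above can be sketched as a small orchestration loop. Everything here is a hypothetical illustration: `restart_replica` and `replica_is_healthy` stand in for whatever your deployment tooling actually provides (systemctl over SSH, a Kubernetes rollout, an HTTP `/ping` probe, and so on).

```python
import time

# Hypothetical cluster layout: shard -> list of replica hostnames.
CLUSTER = {
    "shard_1": ["ch-1a", "ch-1b"],
    "shard_2": ["ch-2a", "ch-2b"],
}

RESTART_LOG = []  # records restart order, useful for auditing

def restart_replica(host: str) -> None:
    """Stand-in for real deployment tooling."""
    RESTART_LOG.append(host)
    print(f"restarting {host}")

def replica_is_healthy(host: str) -> bool:
    """Stand-in for a real health probe; always healthy in this sketch."""
    return True

def rolling_restart(cluster: dict, poll_seconds: float = 1.0) -> None:
    # Restart one replica at a time within each shard, so every shard
    # always keeps at least one live replica.
    for shard, replicas in cluster.items():
        for host in replicas:
            restart_replica(host)
            # Wait for the replica to fully rejoin before touching the next one.
            while not replica_is_healthy(host):
                time.sleep(poll_seconds)
            print(f"{shard}: {host} healthy, moving on")

rolling_restart(CLUSTER)
```

A real health probe should also confirm replication has caught up (e.g. via `system.replicas`) rather than only checking that the process answers.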
Tune TCP keepalive settings so that dead connections are detected faster and the load balancer can route around them:
```
<!-- In clickhouse-server config -->
<tcp_keep_alive_timeout>60</tcp_keep_alive_timeout>
```
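At the OS level, the analogous knobs are the kernel's TCP keepalive sysctls. The values below are illustrative starting points, not recommendations for every environment:

```
# /etc/sysctl.d/99-tcp-keepalive.conf -- example values, tune per environment
net.ipv4.tcp_keepalive_time = 60     # seconds of idle before the first probe
net.ipv4.tcp_keepalive_intvl = 10    # seconds between probes
net.ipv4.tcp_keepalive_probes = 5    # failed probes before the connection is declared dead
```

Apply with `sysctl --system` (or `sysctl -p` for a single file). Shorter values detect dead peers sooner at the cost of slightly more keepalive traffic.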
Best Practices
- Maintain at least two replicas per shard so that losing one does not make the shard entirely unavailable.
- Configure memory limits carefully on replica nodes to prevent OOM kills -- use `max_memory_usage` at the query level and `max_server_memory_usage` at the server level.
- Implement health checks and automated alerting on replica availability using the `system.replicas` table.
- During maintenance, perform rolling restarts with proper sequencing to guarantee replica availability.
- Use `skip_unavailable_shards` in non-critical query paths where partial results are tolerable.
- Set appropriate TCP keepalive values at the OS and ClickHouse level to handle connection state issues in cloud or containerized environments.
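The `system.replicas`-based alerting practice can be sketched as a small check. The rows below are hard-coded stand-ins for the result of `SELECT replica_name, is_session_expired, active_replicas FROM system.replicas`; in practice you would fetch them with a ClickHouse client on a schedule and feed unhealthy entries into your alerting system. The threshold name is hypothetical.

```python
# Each tuple mimics (replica_name, is_session_expired, active_replicas)
# as returned by system.replicas; hard-coded here for illustration.
rows = [
    ("replica_1", 0, 2),
    ("replica_2", 1, 2),  # ZooKeeper/Keeper session expired
    ("replica_3", 0, 0),  # no active replicas left for this table
]

MIN_ACTIVE_REPLICAS = 1  # hypothetical alert threshold

def unhealthy(rows):
    """Return replica names that should trigger an alert."""
    return [
        name
        for name, session_expired, active in rows
        if session_expired == 1 or active < MIN_ACTIVE_REPLICAS
    ]

print(unhealthy(rows))  # replica_2 and replica_3 need attention
```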
Frequently Asked Questions
Q: What is the difference between ALL_REPLICAS_LOST and ALL_CONNECTION_TRIES_FAILED?
A: ALL_CONNECTION_TRIES_FAILED means ClickHouse could not establish a connection to any replica in the first place. ALL_REPLICAS_LOST means connections were established or available but were subsequently lost during the query lifecycle.
Q: Can this error occur with a single-replica setup?
A: Yes. If your shard has only one replica and it becomes unavailable, you will see this error. This is one of the key reasons to run multiple replicas per shard.
Q: Does ClickHouse automatically retry the query on other replicas when one is lost?
A: ClickHouse does attempt failover to other replicas within the same shard. The ALL_REPLICAS_LOST error means that this failover also failed -- every replica in the shard was lost.
Q: How can I make my application resilient to this error?
A: Implement retry logic in your application with exponential backoff. For read queries, consider enabling skip_unavailable_shards and handling partial results gracefully. For writes, use a buffer or queue to retry failed inserts.
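The retry advice above can be sketched as a small helper. The backoff parameters and the `flaky_query` stand-in are illustrative; a real implementation would call your ClickHouse client and catch its specific exception types rather than bare `Exception`.

```python
import random
import time

def retry_with_backoff(fn, attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry fn() with exponential backoff plus jitter; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries so clients don't stampede recovering replicas.
            time.sleep(delay + random.uniform(0, delay))

# Simulated flaky query: fails twice (as if replicas were lost), then succeeds.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("DB::Exception: All replicas lost")
    return 42

print(retry_with_backoff(flaky_query))
```

Note that retries are only safe for idempotent operations; for distributed INSERTs, pair this with deduplication or an upstream queue as suggested above.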