The "DB::Exception: All replicas are stale" error in ClickHouse is raised when a distributed query determines that every available replica for a shard has a replication delay exceeding the threshold defined by max_replica_delay_for_distributed_queries. The error code for this exception is ALL_REPLICAS_ARE_STALE. Essentially, ClickHouse considers none of the replicas fresh enough to serve reliable data.
Impact
This error prevents the distributed query from executing against the affected shard. The broader consequences include:
- Query failures in applications and dashboards relying on distributed tables
- Users receiving no results rather than stale results, unless fallback settings are configured
- Potential indication of a deeper replication health problem that could worsen if left unaddressed
Common Causes
- Heavy replication lag -- Replicas are far behind the leader due to slow merges, high insert throughput, or resource bottlenecks on replica nodes.
- ZooKeeper or ClickHouse Keeper latency -- Replication metadata updates are delayed because the coordination service is overloaded or experiencing network issues.
- Replica nodes under-resourced -- Insufficient CPU, memory, or disk I/O on replica servers slows down the replication queue processing.
- Large mutations or ALTERs in progress -- Schema changes or ALTER UPDATE/ALTER DELETE operations can temporarily stall replication while they execute.
- max_replica_delay_for_distributed_queries set too low -- The threshold is more aggressive than what the cluster can realistically sustain under normal load.
- Network bandwidth saturation -- Data transfer between replicas is throttled by limited network bandwidth, especially with large parts.
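Several of these causes can be confirmed directly from system tables. For example, to check whether long-running mutations are still in flight, you can query the standard system.mutations table (a sketch; an old create_time combined with a high parts_to_do count points at a mutation that is stalling replication):

```sql
-- Unfinished mutations on this node, oldest first.
SELECT database, table, mutation_id, command, create_time, parts_to_do, latest_fail_reason
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time ASC;
```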
Troubleshooting and Resolution Steps
Check current replication delay across replicas:
```sql
SELECT database, table, replica_name, absolute_delay, queue_size, inserts_in_queue, merges_in_queue
FROM system.replicas
WHERE absolute_delay > 0
ORDER BY absolute_delay DESC;
```

This shows which tables and replicas are lagging and by how much.
Review the current threshold setting:
```sql
SELECT name, value
FROM system.settings
WHERE name = 'max_replica_delay_for_distributed_queries';
```

The default is 300 seconds. If your workload naturally has bursts of lag, consider increasing it.
Temporarily allow stale replicas to serve queries:
```sql
SET fallback_to_stale_replicas_for_distributed_queries = 1;
```

This lets ClickHouse use a stale replica rather than failing the query entirely. The trade-off is potentially reading slightly outdated data.
Investigate ZooKeeper / ClickHouse Keeper health:
```sql
SELECT * FROM system.zookeeper WHERE path = '/clickhouse/tables';
```

Adjust the path to match your cluster's configured ZooKeeper prefix. Also check Keeper logs and metrics for latency spikes, connection drops, or leader elections.
Look at the replication queue for stuck entries:
```sql
SELECT database, table, type, create_time, num_tries, last_exception
FROM system.replication_queue
WHERE num_tries > 1
ORDER BY create_time ASC
LIMIT 20;
```

Entries with many retries or old creation times indicate stuck replication tasks.
Check resource utilization on replica nodes -- monitor CPU, memory, disk I/O, and network throughput. Replicas that are IO-bound cannot keep up with the replication queue.
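You can get a first approximation of host pressure without leaving ClickHouse via system.asynchronous_metrics (metric names vary across versions, so this sketch matches by pattern rather than exact name):

```sql
-- Load average and I/O-wait indicators as seen by the server process.
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric ILIKE '%load%' OR metric ILIKE '%iowait%'
ORDER BY metric;
```

For a full picture, cross-check these readings against your host-level monitoring.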
If a specific replica is permanently stuck, consider resyncing it:
```sql
SYSTEM RESTART REPLICA database.table_name;
```

In severe cases, you may need to drop and re-create the replica from a fresh copy.
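After a restart, it is often useful to wait for the replication queue to drain before routing queries back to the node; db.table_name is a placeholder:

```sql
-- Blocks until the replica has fetched and applied its queued entries
-- (or until the query-level timeout is reached).
SYSTEM SYNC REPLICA db.table_name;
```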
Best Practices
- Set max_replica_delay_for_distributed_queries to a value that reflects realistic lag expectations for your workload -- 300 seconds is a reasonable starting point, but high-throughput clusters may need more headroom.
- Enable fallback_to_stale_replicas_for_distributed_queries in environments where returning slightly outdated data is preferable to query failure.
- Monitor absolute_delay from system.replicas continuously and alert when it exceeds a warning threshold well below the configured maximum.
- Ensure replica nodes have adequate resources, particularly disk I/O and network bandwidth, to keep up with replication.
- Avoid running heavy mutations during peak query hours, as they compete with replication for resources.
- Keep ZooKeeper or ClickHouse Keeper properly sized and monitored, since it is the backbone of replication coordination.
Frequently Asked Questions
Q: What does "stale" mean in the context of ClickHouse replicas?
A: A replica is considered stale when its replication delay, in seconds, exceeds the value of max_replica_delay_for_distributed_queries. This delay is tracked in the absolute_delay column of system.replicas.
Q: Can I disable the staleness check entirely?
A: Yes. Setting max_replica_delay_for_distributed_queries to 0 disables the check, meaning ClickHouse will never reject a replica as stale. However, this means queries could silently return significantly outdated data.
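At session level, the check can be switched off like any other setting:

```sql
-- 0 disables the staleness check for subsequent queries in this session.
SET max_replica_delay_for_distributed_queries = 0;
```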
Q: How is replica delay calculated?
A: ClickHouse computes the delay by comparing the timestamp of the last replication queue entry processed by the replica against the current time. The absolute_delay field in system.replicas reports this value in seconds.
Q: Will fallback_to_stale_replicas_for_distributed_queries always prevent this error?
A: When enabled, it will pick the least-stale replica instead of throwing the error, so yes -- the query will proceed. The result may contain data that is behind by more than your configured threshold, though.
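The two settings are commonly combined at the query level, so that only workloads which tolerate stale reads opt in (distributed_table is a placeholder):

```sql
SELECT count()
FROM distributed_table
SETTINGS
    max_replica_delay_for_distributed_queries = 300,
    fallback_to_stale_replicas_for_distributed_queries = 1;
```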
Q: What should I do if one replica is consistently more delayed than others?
A: Investigate resource constraints on that specific node. Common culprits are slow disks, insufficient CPU for merges, or network issues between the replica and ZooKeeper. If the node cannot keep up, consider scaling its hardware or rebalancing the cluster.