ClickHouse DB::Exception: All replicas are stale

The "DB::Exception: All replicas are stale" error in ClickHouse is raised when a distributed query determines that every available replica for a shard has a replication delay exceeding the threshold defined by max_replica_delay_for_distributed_queries. The error code for this exception is ALL_REPLICAS_ARE_STALE. Essentially, ClickHouse considers none of the replicas fresh enough to serve reliable data.
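A quick way to confirm that the threshold itself is the trigger is to override the setting for a single query. The sketch below assumes a hypothetical Distributed table named events_distributed; substitute your own table name:

    SELECT count()
    FROM events_distributed
    SETTINGS max_replica_delay_for_distributed_queries = 600;

If the same query succeeds with a higher threshold, replication lag is the culprit rather than a connectivity or configuration problem.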

Impact

This error prevents the distributed query from executing against the affected shard. The broader consequences include:

  • Query failures in applications and dashboards relying on distributed tables
  • Users receiving no results rather than stale results, unless fallback settings are configured
  • Potential indication of a deeper replication health problem that could worsen if left unaddressed

Common Causes

  1. Heavy replication lag -- Replicas are far behind the leader due to slow merges, high insert throughput, or resource bottlenecks on replica nodes.
  2. ZooKeeper or ClickHouse Keeper latency -- Replication metadata updates are delayed because the coordination service is overloaded or experiencing network issues.
  3. Replica nodes under-resourced -- Insufficient CPU, memory, or disk I/O on replica servers slows down the replication queue processing.
  4. Large mutations or ALTERs in progress -- Schema changes or ALTER UPDATE/ALTER DELETE operations can temporarily stall replication while they execute.
  5. max_replica_delay_for_distributed_queries set too low -- The threshold is more aggressive than what the cluster can realistically sustain under normal load.
  6. Network bandwidth saturation -- Data transfer between replicas is throttled by limited network bandwidth, especially with large parts.
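Several of these causes can be checked directly from system tables. For example, to see whether long-running mutations (cause 4) are still in flight, a sketch like the following works:

    SELECT database, table, mutation_id, command, create_time, is_done
    FROM system.mutations
    WHERE is_done = 0;

Any rows returned represent mutations that are still executing and may be competing with the replication queue.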

Troubleshooting and Resolution Steps

  1. Check current replication delay across replicas:

    SELECT
        database,
        table,
        replica_name,
        absolute_delay,
        queue_size,
        inserts_in_queue,
        merges_in_queue
    FROM system.replicas
    WHERE absolute_delay > 0
    ORDER BY absolute_delay DESC;
    

    This shows which tables and replicas are lagging and by how much.

  2. Review the current threshold setting:

    SELECT name, value
    FROM system.settings
    WHERE name = 'max_replica_delay_for_distributed_queries';
    

    The default is 300 seconds. If your workload naturally has bursts of lag, consider increasing it.

  3. Temporarily allow stale replicas to serve queries:

    SET fallback_to_stale_replicas_for_distributed_queries = 1;
    

    This lets ClickHouse use a stale replica rather than failing the query entirely. The trade-off is potentially reading slightly outdated data.
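    The setting can also be scoped to a single query rather than the whole session, which is safer when only specific dashboards tolerate stale reads. A sketch, again using a hypothetical events_distributed table:

    SELECT count()
    FROM events_distributed
    SETTINGS fallback_to_stale_replicas_for_distributed_queries = 1;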

  4. Investigate ZooKeeper / ClickHouse Keeper health:

    SELECT * FROM system.zookeeper
    WHERE path = '/clickhouse/tables';
    

    Adjust the path to match the zookeeper_path configured for your replicated tables. Also check Keeper logs and metrics for latency spikes, connection drops, or leader elections.

  5. Look at the replication queue for stuck entries:

    SELECT
        database,
        table,
        type,
        create_time,
        num_tries,
        last_exception
    FROM system.replication_queue
    WHERE num_tries > 1
    ORDER BY create_time ASC
    LIMIT 20;
    

    Entries with many retries or old creation times indicate stuck replication tasks.

  6. Check resource utilization on replica nodes -- monitor CPU, memory, disk I/O, and network throughput. Replicas that are IO-bound cannot keep up with the replication queue.

  7. If a specific replica is permanently stuck, consider resyncing it:

    SYSTEM RESTART REPLICA database.table_name;
    

    In severe cases, you may need to drop and re-create the replica from a fresh copy.
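    For the drop-and-recreate path, recent ClickHouse versions provide dedicated commands. A sketch, assuming a table db.table_name and a replica named replica_2; verify the exact syntax against your server version before running, since these commands remove replication metadata from Keeper:

    -- On a healthy node: remove the stuck replica's Keeper metadata
    SYSTEM DROP REPLICA 'replica_2' FROM TABLE db.table_name;

    -- On the affected node, if local data exists but its Keeper metadata
    -- was removed or lost: re-register the replica and refetch missing parts
    SYSTEM RESTORE REPLICA db.table_name;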

Best Practices

  • Set max_replica_delay_for_distributed_queries to a value that reflects realistic lag expectations for your workload -- 300 seconds is a reasonable starting point, but high-throughput clusters may need more headroom.
  • Enable fallback_to_stale_replicas_for_distributed_queries in environments where returning slightly outdated data is preferable to query failure.
  • Monitor absolute_delay from system.replicas continuously and alert when it exceeds a warning threshold well below the configured maximum.
  • Ensure replica nodes have adequate resources, particularly disk I/O and network bandwidth, to keep up with replication.
  • Avoid running heavy mutations during peak query hours, as they compete with replication for resources.
  • Keep ZooKeeper or ClickHouse Keeper properly sized and monitored, since it is the backbone of replication coordination.
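The monitoring recommendation above can be expressed as a single query suitable for an alerting pipeline. This sketch uses a hypothetical 60-second warning threshold, well below the default 300-second maximum:

    SELECT
        database,
        table,
        replica_name,
        absolute_delay
    FROM system.replicas
    WHERE absolute_delay > 60;

Any non-empty result is a signal to investigate before distributed queries start failing.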

Frequently Asked Questions

Q: What does "stale" mean in the context of ClickHouse replicas?
A: A replica is considered stale when its replication delay (measured in seconds behind the most up-to-date replica) exceeds the value of max_replica_delay_for_distributed_queries. This delay is tracked in the absolute_delay column of system.replicas.

Q: Can I disable the staleness check entirely?
A: Yes. Setting max_replica_delay_for_distributed_queries to 0 disables the check, meaning ClickHouse will never reject a replica as stale. However, this means queries could silently return significantly outdated data.
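For a single session, disabling the check looks like this; use it cautiously, for the reason above:

    SET max_replica_delay_for_distributed_queries = 0;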

Q: How is replica delay calculated?
A: ClickHouse computes the delay by comparing the timestamp of the last replication queue entry processed by the replica against the current time. The absolute_delay field in system.replicas reports this value in seconds.

Q: Will fallback_to_stale_replicas_for_distributed_queries always prevent this error?
A: Yes. When the setting is enabled, ClickHouse picks the least-stale replica instead of throwing the error, so the query proceeds. The trade-off is that the result may be behind the latest data by more than your configured threshold.

Q: What should I do if one replica is consistently more delayed than others?
A: Investigate resource constraints on that specific node. Common culprits are slow disks, insufficient CPU for merges, or network issues between the replica and ZooKeeper. If the node cannot keep up, consider scaling its hardware or rebalancing the cluster.
