The "DB::Exception: All connection tries failed" error in ClickHouse surfaces when the server exhausts every attempt to establish a connection to the shards or replicas involved in a distributed query. The error code associated with this exception is ALL_CONNECTION_TRIES_FAILED. In practice, this means ClickHouse tried each configured endpoint for a given shard and none of them responded successfully.
Impact
When this error fires, the distributed query fails entirely for the affected shard. Depending on your cluster topology and settings, the consequences may include:
- Complete query failure if even one shard is unreachable and skip_unavailable_shards is not enabled
- Degraded availability for dashboards, applications, and data pipelines that depend on distributed queries
- Potential cascading timeouts on upstream services waiting for ClickHouse responses
Common Causes
- Network connectivity problems -- Firewalls, security groups, or routing issues prevent the initiator node from reaching shard or replica endpoints on the configured ports (typically 9000 for native protocol or 9440 for TLS).
- Target ClickHouse instances are down -- One or more replica servers have crashed, been stopped, or are still starting up.
- Incorrect cluster configuration -- Host names, IP addresses, or ports in remote_servers do not match the actual deployment.
- DNS resolution failures -- The initiator cannot resolve hostnames listed in the cluster definition.
- Connection timeouts set too low -- The connect_timeout or connect_timeout_with_failover_ms values are too aggressive for the network latency between nodes.
- TLS/SSL misconfiguration -- If inter-node encryption is enabled, certificate mismatches or expired certificates will cause connection failures.
- Resource exhaustion on target nodes -- The remote server's file descriptor limit or connection backlog is saturated, so it cannot accept new connections.
Troubleshooting and Resolution Steps
Verify basic connectivity from the initiator node:
    # Test native protocol port
    nc -zv <replica_host> 9000
    # If using TLS
    openssl s_client -connect <replica_host>:9440
If these fail, investigate network-level issues such as firewalls, security groups, or VPC peering.
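When a shard has several replicas, the manual port check can be looped over every endpoint. The following is a minimal sketch, assuming bash (for its built-in /dev/tcp) and the coreutils `timeout` command are available on the initiator. `probe_endpoint` and `probe_all` are hypothetical helper names, and the hosts in the usage comment are placeholders.

```shell
# Hypothetical helpers: probe host:port endpoints over plain TCP.
# Assumes bash (for /dev/tcp) and coreutils `timeout` are installed.

# Return 0 if a TCP connection to host:port opens within wait_s seconds.
probe_endpoint() {
    local host="$1" port="$2" wait_s="${3:-3}"
    # The inner bash receives host as $0 and port as $1.
    timeout "$wait_s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Probe every "host:port" argument; return 1 if any endpoint is unreachable.
probe_all() {
    local endpoint rc=0
    for endpoint in "$@"; do
        # Split "host:port" on the last colon.
        if probe_endpoint "${endpoint%:*}" "${endpoint##*:}"; then
            echo "OK   $endpoint"
        else
            echo "FAIL $endpoint"
            rc=1
        fi
    done
    return "$rc"
}

# Usage (placeholder hosts):
#   probe_all ch-replica-1:9000 ch-replica-2:9000
```

This only confirms that the port accepts TCP connections. It does not validate TLS certificates, so the openssl check is still needed for port 9440.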
Check that target ClickHouse processes are running:
    systemctl status clickhouse-server
    # or
    clickhouse-client --host <replica_host> --query "SELECT 1"
Review the cluster configuration on the initiator:
    SELECT cluster, shard_num, replica_num, host_name, port, is_local
    FROM system.clusters
    WHERE cluster = 'your_cluster_name';
Confirm that every listed host and port is accurate.
Inspect DNS resolution:
    dig <replica_host>
    nslookup <replica_host>
Make sure the resolved address matches expectations. In containerized environments, internal DNS can be a frequent source of trouble.
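dig and nslookup query DNS servers directly, so they can disagree with what a process sees through the system resolver (for example when /etc/hosts overrides a name). As a complementary check, the sketch below resolves each host with `getent hosts`. It assumes `getent` is available on the initiator, and `resolve_host` and `check_dns` are hypothetical helper names.

```shell
# Hypothetical helpers: resolve hostnames through the system resolver.
# Assumes `getent` is available (standard on most Linux distributions).

# Print the first address returned for a host; return 1 if none is found.
resolve_host() {
    local addr
    addr=$(getent hosts "$1" | awk 'NR == 1 { print $1 }')
    [ -n "$addr" ] && echo "$addr"
}

# Resolve every argument; return 1 if any host cannot be resolved.
check_dns() {
    local host addr rc=0
    for host in "$@"; do
        if addr=$(resolve_host "$host"); then
            echo "$host -> $addr"
        else
            echo "$host -> UNRESOLVED"
            rc=1
        fi
    done
    return "$rc"
}

# Usage (placeholder hosts):
#   check_dns ch-replica-1 ch-replica-2
```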
Increase connection timeout if latency is high:
    SET connect_timeout_with_failover_ms = 5000; -- milliseconds; the default varies by ClickHouse version
    SET connect_timeout = 10; -- seconds
Then retry the query.
Examine ClickHouse server logs on the target replica for errors around listener binding, TLS handshake failures, or "too many open files" messages:
    tail -200 /var/log/clickhouse-server/clickhouse-server.err.log
Increase connection retry count if transient network blips are suspected:
    SET connections_with_failover_max_tries = 5; -- default is 3
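Raising both the timeout and the retry count multiplies the time the initiator can spend on an unreachable shard before it gives up. A back-of-envelope estimate, assuming the retry count applies to each replica, attempts run one after another, and each attempt waits for the full timeout (actual behaviour depends on version and settings). `worst_case_connect_ms` is a hypothetical helper name and the replica count is an assumed example value.

```shell
# Rough upper bound, in milliseconds, on connection time for one shard:
# replicas x tries per replica x per-attempt timeout.
# Assumption: attempts are sequential and each one waits for the full timeout.
worst_case_connect_ms() {
    local replicas="$1" tries="$2" timeout_ms="$3"
    echo $(( replicas * tries * timeout_ms ))
}

# Assumed example: 2 replicas, 5 tries, 5000 ms timeout.
worst_case_connect_ms 2 5 5000   # prints 50000
```

Under these assumptions a query could wait up to 50 seconds on a dead shard, which is worth weighing against any upstream client timeouts.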
Best Practices
- Use replication (at least two replicas per shard) so that a single node failure does not make an entire shard unreachable.
- Set skip_unavailable_shards = 1 in queries or user profiles where partial results are acceptable, to prevent a single shard outage from blocking all queries.
- Monitor inter-node connectivity continuously with health checks or synthetic probes.
- Keep connect_timeout_with_failover_ms tuned to your network conditions -- too low causes false failures, too high increases query latency during actual outages.
- Maintain consistent cluster configuration files across all nodes using configuration management tools.
- Configure alerting on the system.errors table or ClickHouse logs for early detection of connection issues.
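As one possible starting point for the alerting practice, the sketch below reads the ALL_CONNECTION_TRIES_FAILED counter from system.errors and returns a non-zero status when it exceeds a threshold. It assumes clickhouse-client is on PATH and can connect without extra flags. `CH_CLIENT`, `connection_failure_count`, and `check_connection_failures` are hypothetical names.

```shell
# Hypothetical check: read how often ALL_CONNECTION_TRIES_FAILED has been
# recorded in system.errors since the server started.
# CH_CLIENT can be overridden, e.g. to point at a wrapper script.
CH_CLIENT="${CH_CLIENT:-clickhouse-client}"

connection_failure_count() {
    "$CH_CLIENT" --query \
        "SELECT sum(value) FROM system.errors WHERE name = 'ALL_CONNECTION_TRIES_FAILED'"
}

# Return 1 when the counter exceeds the threshold (default 0),
# 2 when the query itself fails, 0 otherwise.
check_connection_failures() {
    local threshold="${1:-0}" count
    count=$(connection_failure_count) || return 2
    if [ "$count" -gt "$threshold" ]; then
        echo "ALERT: ALL_CONNECTION_TRIES_FAILED seen $count times"
        return 1
    fi
    echo "OK: count=$count"
}
```

The counter is cumulative since server start, so a production check would more likely compare it against the value from the previous run than against a fixed threshold.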
Frequently Asked Questions
Q: Does ClickHouse retry connections automatically before throwing ALL_CONNECTION_TRIES_FAILED?
A: Yes. ClickHouse attempts to connect to each replica of the target shard, cycling through them according to the load-balancing policy. The number of attempts per replica is controlled by connections_with_failover_max_tries (default 3). The error is raised only after all attempts are exhausted.
Q: What is the difference between ALL_CONNECTION_TRIES_FAILED and SHARD_HAS_NO_CONNECTIONS?
A: ALL_CONNECTION_TRIES_FAILED means ClickHouse actively tried to connect and every attempt failed. SHARD_HAS_NO_CONNECTIONS indicates that the shard has no usable connections at all, often because the connection pool is empty or has not been established yet.
Q: Can I make distributed queries succeed even when some shards are unreachable?
A: Yes. Setting skip_unavailable_shards = 1 tells ClickHouse to execute the query on whichever shards are reachable and return partial results. Be aware this means the result set may be incomplete.
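For a one-off query, the setting can be passed as a command-line flag to clickhouse-client. A minimal sketch, assuming clickhouse-client is on PATH. `CH_CLIENT` and `query_partial` are hypothetical names, and `events_distributed` is a placeholder table.

```shell
# Hypothetical wrapper: run one query with skip_unavailable_shards enabled.
# Assumes clickhouse-client is on PATH; CH_CLIENT can point at a wrapper script.
CH_CLIENT="${CH_CLIENT:-clickhouse-client}"

query_partial() {
    "$CH_CLIENT" --skip_unavailable_shards=1 --query "$1"
}

# Usage (events_distributed is a placeholder table name):
#   query_partial "SELECT count() FROM events_distributed"
```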
Q: How do I know which specific replica failed?
A: The full error message usually includes the host and port of each endpoint that was tried. You can also check system.errors and the server log on the initiator node for detailed connection failure reasons.
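The system.clusters table on the initiator also tracks per-replica connection failures in its errors_count column, which can narrow down the failing host. A minimal sketch, assuming clickhouse-client is on PATH. `CH_CLIENT` and `failing_replicas` are hypothetical names, and the cluster name in the usage comment is a placeholder.

```shell
# Hypothetical helper: list replicas of a cluster that the initiator has
# recently failed to reach, based on system.clusters.errors_count.
CH_CLIENT="${CH_CLIENT:-clickhouse-client}"

failing_replicas() {
    # The cluster name is passed as a query parameter rather than
    # interpolated into the SQL text.
    "$CH_CLIENT" --param_cluster="$1" --query "
        SELECT host_name, port, errors_count, estimated_recovery_time
        FROM system.clusters
        WHERE cluster = {cluster:String} AND errors_count > 0
        ORDER BY errors_count DESC
        FORMAT TSVWithNames"
}

# Usage (placeholder cluster name):
#   failing_replicas your_cluster_name
```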