ClickHouse DB::Exception: All connection tries failed

The "DB::Exception: All connection tries failed" error in ClickHouse surfaces when the server exhausts every attempt to establish a connection to the shards or replicas involved in a distributed query. The error code associated with this exception is ALL_CONNECTION_TRIES_FAILED. In practice, this means ClickHouse tried each configured endpoint for a given shard and none of them responded successfully.

Impact

When this error fires, every replica of the affected shard was unreachable, so the query cannot read that shard's data. Depending on your cluster topology and settings, the consequences may include:

Complete query failure if even one shard is unreachable and skip_unavailable_shards is not enabled
Degraded availability for dashboards, applications, and data pipelines that depend on distributed queries
Potential cascading timeouts on upstream services waiting for ClickHouse responses

Common Causes

Network connectivity problems -- Firewalls, security groups, or routing issues prevent the initiator node from reaching shard or replica endpoints on the configured ports (typically 9000 for the native protocol, or 9440 for the native protocol over TLS).
Target ClickHouse instances are down -- One or more replica servers have crashed, been stopped, or are still starting up.
Incorrect cluster configuration -- Host names, IP addresses, or ports in remote_servers do not match the actual deployment.
DNS resolution failures -- The initiator cannot resolve hostnames listed in the cluster definition.
Connection timeouts set too low -- The connect_timeout or connect_timeout_with_failover_ms values are too aggressive for the network latency between nodes.
TLS/SSL misconfiguration -- If inter-node encryption is enabled, certificate mismatches or expired certificates will cause connection failures.
Resource exhaustion on target nodes -- The remote server's file descriptor limit or connection backlog is saturated, so it cannot accept new connections.

Troubleshooting and Resolution Steps

Verify basic connectivity from the initiator node:
```
# Test native protocol port
nc -zv <replica_host> 9000

# If using TLS (</dev/null ends the session once the handshake completes)
openssl s_client -connect <replica_host>:9440 </dev/null
```
If these fail, investigate network-level issues such as firewalls, security groups, or VPC peering.
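
When a shard has several replicas, it can help to probe all of them in one pass. The following is a minimal sketch, assuming bash built with /dev/tcp support and the coreutils timeout command. probe_port and probe_all are hypothetical helper names, not part of ClickHouse.

```shell
# probe_port HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection can be opened, non-zero otherwise.
# Uses bash's built-in /dev/tcp redirection, so nc is not required.
probe_port() {
    local host=$1 port=$2 wait=${3:-3}
    timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# probe_all PORT HOST...
# Prints one "host:port ok" or "host:port FAILED" line per host.
# Returns 1 if at least one host could not be reached.
probe_all() {
    local port=$1 failed=0 host
    shift
    for host in "$@"; do
        if probe_port "$host" "$port"; then
            echo "$host:$port ok"
        else
            echo "$host:$port FAILED"
            failed=1
        fi
    done
    return "$failed"
}
```

For example, `probe_all 9000 <replica_host_1> <replica_host_2>` prints one line per replica. A successful TCP connection only shows that the port is open; it does not prove that the server behind it is healthy.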

Check that target ClickHouse processes are running:

```
# On the replica itself
systemctl status clickhouse-server

# Or from the initiator node
clickhouse-client --host <replica_host> --query "SELECT 1"
```

Review the cluster configuration on the initiator:

```
SELECT cluster, shard_num, replica_num, host_name, port, is_local
FROM system.clusters
WHERE cluster = 'your_cluster_name';
```

Confirm that every listed host and port is accurate.
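
Because each node reads its own copy of the cluster definition, it can also be worth comparing what different nodes report. The sketch below reduces the query output to an order-independent checksum, so that two nodes with identical definitions produce the same value. cluster_fingerprint is a hypothetical helper name. The is_local column is deliberately left out of the query, since each node marks itself as local.

```shell
# cluster_fingerprint: read rows on stdin and print a checksum that does
# not depend on row order (rows are sorted before hashing).
cluster_fingerprint() {
    LC_ALL=C sort | cksum | cut -d' ' -f1
}

# Usage sketch (host names are placeholders):
#   for h in <node_1> <node_2>; do
#       printf '%s ' "$h"
#       clickhouse-client --host "$h" --query "
#           SELECT cluster, shard_num, replica_num, host_name, port
#           FROM system.clusters
#           WHERE cluster = 'your_cluster_name'" | cluster_fingerprint
#   done
```

If the printed checksums differ between nodes, diff the raw query output from those nodes to find the mismatched entry.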

Inspect DNS resolution:
```
dig <replica_host>
nslookup <replica_host>
```
Make sure the resolved address matches expectations. In containerized environments, internal DNS can be a frequent source of trouble.
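
dig and nslookup query DNS servers directly and skip /etc/hosts and the rest of the system's name service configuration, so their answer can differ from what a process on the host typically sees. As a complementary check, this sketch resolves names with getent ahosts, which goes through getaddrinfo. resolve_all is a hypothetical helper name, and the sketch assumes a glibc-based system where getent is available.

```shell
# resolve_all HOST...
# Prints "host -> address" for the first address found, or "host UNRESOLVED".
# Returns 1 if at least one name could not be resolved.
resolve_all() {
    local failed=0 host addr
    for host in "$@"; do
        # getent ahosts prints one line per address; keep the first address.
        addr=$(getent ahosts "$host" | awk 'NR == 1 { print $1 }')
        if [ -n "$addr" ]; then
            echo "$host -> $addr"
        else
            echo "$host UNRESOLVED"
            failed=1
        fi
    done
    return "$failed"
}
```

Running it with every host name from the cluster definition on the initiator node shows at a glance which names fail to resolve there.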

Increase connection timeout if latency is high:

```
SET connect_timeout_with_failover_ms = 5000;  -- milliseconds; default is 1000 in recent releases, 50 in older ones
SET connect_timeout = 20;                      -- seconds; default is 10
```

Then retry the query.

Examine ClickHouse server logs on the target replica for errors around listener binding, TLS handshake failures, or "too many open files" messages:
```
tail -200 /var/log/clickhouse-server/clickhouse-server.err.log
```

Increase connection retry count if transient network blips are suspected:
```
SET connections_with_failover_max_tries = 5;  -- default is 3
```
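
A SET statement only lasts for the session that issued it, so when queries are run non-interactively with clickhouse-client --query, the settings have to travel with each invocation. One option, sketched here, is to pass them as command-line options. query_with_lenient_failover is a hypothetical wrapper name, and the CLICKHOUSE_CLIENT override exists only to allow a dry run.

```shell
# query_with_lenient_failover QUERY
# Runs one query with a longer failover connect timeout and more attempts.
# Set CLICKHOUSE_CLIENT=echo to print the command line instead of running it.
query_with_lenient_failover() {
    "${CLICKHOUSE_CLIENT:-clickhouse-client}" \
        --connect_timeout_with_failover_ms=5000 \
        --connections_with_failover_max_tries=5 \
        --query "$1"
}
```

The values mirror the ones used in the steps above and are starting points, not recommendations. For a lasting change, the same settings can be placed in a settings profile instead.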

Best Practices

Use replication (at least two replicas per shard) so that a single node failure does not make an entire shard unreachable.
Set skip_unavailable_shards = 1 in queries or user profiles where partial results are acceptable, to prevent a single shard outage from blocking all queries.
Monitor inter-node connectivity continuously with health checks or synthetic probes.
Keep connect_timeout_with_failover_ms tuned to your network conditions -- too low causes false failures, too high increases query latency during actual outages.
Maintain consistent cluster configuration files across all nodes using configuration management tools.
Configure alerting on the system.errors table or ClickHouse logs for early detection of connection issues.
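
As an illustration of the alerting idea, the sketch below polls system.errors and reports when the counter for this error has grown since the previous run. check_connection_errors and the state file are hypothetical; a real deployment would more likely feed the same query into an existing monitoring system.

```shell
# check_connection_errors STATE_FILE
# Returns 0 when the counter is unchanged, 1 when it grew since the last
# run, and 2 when the query itself failed.
check_connection_errors() {
    local state_file=$1 current previous
    # sum() yields 0 when the error has never been recorded.
    current=$("${CLICKHOUSE_CLIENT:-clickhouse-client}" --query \
        "SELECT sum(value) FROM system.errors WHERE name = 'ALL_CONNECTION_TRIES_FAILED'") || return 2
    previous=$(cat "$state_file" 2>/dev/null || true)
    echo "$current" > "$state_file"
    # On the first run there is no baseline, so the value is only recorded.
    if [ -n "$previous" ] && [ "$current" -gt "$previous" ]; then
        echo "ALERT: ALL_CONNECTION_TRIES_FAILED rose from $previous to $current"
        return 1
    fi
    echo "OK: ALL_CONNECTION_TRIES_FAILED at $current"
}
```

The counters in system.errors start from zero when the server restarts. In that case the check stays quiet for one run and records a new baseline.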

Frequently Asked Questions

Q: Does ClickHouse retry connections automatically before throwing ALL_CONNECTION_TRIES_FAILED?
A: Yes. ClickHouse will attempt to connect to each replica of the target shard, cycling through them according to the load-balancing policy. The number of attempts per replica is controlled by connections_with_failover_max_tries (default 3). The error is raised only after all attempts are exhausted.

Q: What is the difference between ALL_CONNECTION_TRIES_FAILED and SHARD_HAS_NO_CONNECTIONS?
A: ALL_CONNECTION_TRIES_FAILED means ClickHouse actively tried to connect and every attempt failed. SHARD_HAS_NO_CONNECTIONS indicates that the shard has no usable connections at all, often because the connection pool is empty or has not been established yet.

Q: Can I make distributed queries succeed even when some shards are unreachable?
A: Yes. Setting skip_unavailable_shards = 1 tells ClickHouse to execute the query on whichever shards are reachable and return partial results. Be aware this means the result set may be incomplete.

Q: How do I know which specific replica failed?
A: The full error message usually includes the host and port of each endpoint that was tried. You can also check system.errors and the server log on the initiator node for detailed connection failure reasons.