ClickHouse DB::Exception: All connection tries failed

The "DB::Exception: All connection tries failed" error in ClickHouse is raised when the initiator node exhausts every attempt to connect to the replicas of a shard involved in a distributed query. The error code name for this exception is ALL_CONNECTION_TRIES_FAILED (numeric code 279). In practice, this means ClickHouse tried each configured replica for a given shard and none of them accepted the connection.

Impact

When this error fires, the initiator cannot read from the affected shard, and by default the whole distributed query fails. Depending on your cluster topology and settings, the consequences may include:

  • Complete query failure if even one shard is unreachable and skip_unavailable_shards is not enabled
  • Degraded availability for dashboards, applications, and data pipelines that depend on distributed queries
  • Potential cascading timeouts on upstream services waiting for ClickHouse responses

Common Causes

  1. Network connectivity problems -- Firewalls, security groups, or routing issues prevent the initiator node from reaching shard or replica endpoints on the configured ports (typically 9000 for native protocol or 9440 for TLS).
  2. Target ClickHouse instances are down -- One or more replica servers have crashed, been stopped, or are still starting up.
  3. Incorrect cluster configuration -- Host names, IP addresses, or ports in remote_servers do not match the actual deployment.
  4. DNS resolution failures -- The initiator cannot resolve hostnames listed in the cluster definition.
  5. Connection timeouts set too low -- The connect_timeout or connect_timeout_with_failover_ms values are too aggressive for the network latency between nodes.
  6. TLS/SSL misconfiguration -- If inter-node encryption is enabled, certificate mismatches or expired certificates will cause connection failures.
  7. Resource exhaustion on target nodes -- The remote server's file descriptor limit or connection backlog is saturated, so it cannot accept new connections.
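
For cause 6, one way to tell a certificate problem apart from a plain network failure is to attempt a verified TLS handshake from the initiator. The following is a minimal Python sketch using only the standard library. The function names are illustrative, port 9440 is the TLS port mentioned above, and the sketch assumes the replica's certificate chain is trusted by the system CA store unless a CA bundle is passed in cafile.

```python
import socket
import ssl
import time


def days_remaining(not_after, now=None):
    """Days until a certificate's notAfter timestamp (negative if already expired)."""
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


def tls_handshake_report(host, port=9440, timeout=5.0, cafile=None):
    """Attempt a verified TLS handshake and describe the outcome in one line."""
    ctx = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
    except ssl.SSLCertVerificationError as exc:
        # Covers expired certificates, hostname mismatches, and untrusted issuers.
        return f"certificate rejected: {exc.verify_message}"
    except ssl.SSLError as exc:
        return f"TLS handshake failed: {exc}"
    except OSError as exc:
        return f"TCP connection failed: {exc}"
    return f"ok, certificate expires in {days_remaining(cert['notAfter']):.0f} days"
```

A "certificate rejected" result points at the TLS configuration, while "TCP connection failed" points back at causes 1 through 4.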

Troubleshooting and Resolution Steps

  1. Verify basic connectivity from the initiator node:

    # Test native protocol port
    nc -zv <replica_host> 9000
    
    # If using TLS (closing stdin makes s_client exit after the handshake)
    openssl s_client -connect <replica_host>:9440 </dev/null
    

    If these fail, investigate network-level issues such as firewalls, security groups, or VPC peering.

  2. Check that target ClickHouse processes are running:

    # On the replica host itself
    systemctl status clickhouse-server
    # or, from the initiator node
    clickhouse-client --host <replica_host> --query "SELECT 1"
    
  3. Review the cluster configuration on the initiator:

    SELECT cluster, shard_num, replica_num, host_name, port, is_local
    FROM system.clusters
    WHERE cluster = 'your_cluster_name';
    

    Confirm that every listed host and port is accurate.

  4. Inspect DNS resolution:

    dig <replica_host>
    nslookup <replica_host>
    

    Make sure the resolved address matches expectations. In containerized environments, internal DNS can be a frequent source of trouble.

  5. Increase connection timeout if latency is high:

    SET connect_timeout_with_failover_ms = 5000;  -- default is 1000 in recent releases; check system.settings for your version
    SET connect_timeout = 10;                      -- seconds (10 is the default)
    

    Then retry the query.

  6. Examine ClickHouse server logs on the target replica for errors around listener binding, TLS handshake failures, or "too many open files" messages:

    tail -200 /var/log/clickhouse-server/clickhouse-server.err.log
    
  7. Increase connection retry count if transient network blips are suspected:

    SET connections_with_failover_max_tries = 5;  -- attempts per replica, default is 3
    
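The DNS and connectivity checks from steps 1 and 4 can be run against every endpoint returned by the query in step 3. The sketch below is one possible way to do that in Python with the standard library only. It assumes the step 3 result was exported as tab-separated text with the columns in the same order as that query; the function names and the outcome labels are illustrative, not part of ClickHouse.

```python
import socket


def diagnose_endpoint(host, port, timeout=2.0):
    """Classify one endpoint as 'ok', 'dns_failure', 'timeout', 'refused', or 'unreachable'."""
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        return "unreachable"


def endpoints_from_tsv(text):
    """Extract (host_name, port) pairs from TSV rows of the system.clusters query."""
    endpoints = []
    for line in text.splitlines():
        if line.strip():
            fields = line.split("\t")
            # Column order: cluster, shard_num, replica_num, host_name, port, is_local
            endpoints.append((fields[3], int(fields[4])))
    return endpoints


def diagnose_cluster(tsv_text, timeout=2.0):
    """Map every (host, port) in the cluster definition to its diagnosis."""
    return {ep: diagnose_endpoint(*ep, timeout=timeout) for ep in endpoints_from_tsv(tsv_text)}
```

As a rough guide, "dns_failure" corresponds to cause 4, "refused" usually means nothing is listening on that port (causes 2 and 3), and "timeout" often indicates dropped packets or high latency (causes 1 and 5). These mappings are heuristics rather than definitive diagnoses.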

Best Practices

  • Use replication (at least two replicas per shard) so that a single node failure does not make an entire shard unreachable.
  • Set skip_unavailable_shards = 1 in queries or user profiles where partial results are acceptable, to prevent a single shard outage from blocking all queries.
  • Monitor inter-node connectivity continuously with health checks or synthetic probes.
  • Keep connect_timeout_with_failover_ms tuned to your network conditions -- too low causes false failures, too high increases query latency during actual outages.
  • Maintain consistent cluster configuration files across all nodes using configuration management tools.
  • Configure alerting on the system.errors table or ClickHouse logs for early detection of connection issues.
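
For the alerting practice above, note that the value column in system.errors is a cumulative counter, so an alert is more useful when it fires on the increase between two polls than on the absolute number. The following Python sketch shows only that comparison logic. The function names and the threshold are illustrative, and the sketch assumes the counter starts again from zero after a server restart.

```python
# Query to run on each poll; sum() covers the case where the table
# reports more than one row for the same error name.
ERRORS_QUERY = (
    "SELECT sum(value) FROM system.errors "
    "WHERE name = 'ALL_CONNECTION_TRIES_FAILED'"
)


def new_errors(previous, current):
    """Errors observed since the last poll, tolerating a counter reset."""
    if previous is None:
        return 0  # first poll: no baseline to compare against
    if current < previous:
        return current  # counter went backwards, assume a restart
    return current - previous


def should_alert(previous, current, threshold=1):
    """True when at least `threshold` new errors appeared since the last poll."""
    return new_errors(previous, current) >= threshold
```

With a previous reading of 7 and a current reading of 10, new_errors returns 3 and should_alert returns True at the default threshold.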

Frequently Asked Questions

Q: Does ClickHouse retry connections automatically before throwing ALL_CONNECTION_TRIES_FAILED?
A: Yes. ClickHouse attempts to connect to each replica of the target shard, cycling through them according to the load-balancing policy. The number of attempts per replica is controlled by connections_with_failover_max_tries (default 3). The error is raised only after all attempts are exhausted.
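
Because the retry count and the connection timeout multiply, it can help to estimate how long a query may wait before this error is raised. The sketch below is a rough upper bound, assuming every replica is tried up to connections_with_failover_max_tries times and every failed attempt waits for the full connect_timeout_with_failover_ms. Real behavior can be faster, for example when a connection is refused immediately, so treat this as a planning estimate rather than a model of the server.

```python
def worst_case_connect_wait_ms(replicas_per_shard, max_tries, timeout_ms):
    """Upper bound on time spent connecting to one shard before giving up."""
    return replicas_per_shard * max_tries * timeout_ms


# Two replicas, 3 tries each, 1000 ms timeout
print(worst_case_connect_wait_ms(2, 3, 1000))   # 6000
# Same shard after raising tries to 5 and the timeout to 5000 ms
print(worst_case_connect_wait_ms(2, 5, 5000))   # 50000
```

Raising both settings together increases the worst-case wait from about 6 seconds to about 50 seconds in this example, which is worth weighing against upstream service timeouts.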

Q: What is the difference between ALL_CONNECTION_TRIES_FAILED and SHARD_HAS_NO_CONNECTIONS?
A: ALL_CONNECTION_TRIES_FAILED means ClickHouse actively tried to connect and every attempt failed. SHARD_HAS_NO_CONNECTIONS indicates that the shard has no usable connections at all, often because the connection pool is empty or has not been established yet.

Q: Can I make distributed queries succeed even when some shards are unreachable?
A: Yes. Setting skip_unavailable_shards = 1 tells ClickHouse to execute the query on whichever shards are reachable and return partial results. Be aware this means the result set may be incomplete.
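
When queries are sent through the ClickHouse HTTP interface, a setting such as skip_unavailable_shards can be passed per query as a URL parameter. The Python sketch below only builds such a URL. The helper name, the host, and the table name in the usage note are hypothetical, and port 8123 is assumed to be the HTTP port of your deployment.

```python
from urllib.parse import quote, urlencode


def build_query_url(host, query, settings=None, port=8123):
    """Build an HTTP-interface URL with per-query settings as URL parameters."""
    params = {"query": query}
    params.update(settings or {})
    # quote_via=quote encodes spaces as %20 instead of '+'
    return f"http://{host}:{port}/?{urlencode(params, quote_via=quote)}"
```

For example, build_query_url("ch.example.internal", "SELECT count() FROM events_dist", {"skip_unavailable_shards": 1}) produces a URL ending in &skip_unavailable_shards=1, so only that query accepts partial results.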

Q: How do I know which specific replica failed?
A: The full error message usually includes the host and port of each endpoint that was tried. You can also check system.errors and the server log on the initiator node for detailed connection failure reasons.
