Elasticsearch Error: No alive nodes found in your cluster - Common Causes & Fixes

No alive nodes found in your cluster is raised when an Elasticsearch client cannot reach any of the hosts in its configured node list. This exact message comes from the PHP client's NoNodesAvailableException; the legacy Java transport client reports the same condition as NoNodeAvailableException: None of the configured nodes are available, and other clients have their own equivalents. The client gives up after exhausting retries to every known endpoint. The cluster itself may be healthy - the failure is between the client and the cluster, not inside it.

What This Error Means

Elasticsearch clients maintain a list of candidate nodes (either statically configured or discovered via sniffing) and round-robin requests across them. When every node returns a connection error, times out, or fails authentication, the client marks them all dead and raises this exception. The root cause is one of: nodes not running, network unreachability, wrong port/protocol, TLS mismatch, authentication failure, or a stale sniffed list pointing at decommissioned hosts.
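
The dead-marking behavior described above can be sketched in a few lines of Java. This is a simplified illustration, not the code of any real client: the class name NodePoolSketch and its methods are hypothetical, and real clients also bring dead nodes back after a timeout, which is left out here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified node pool; real clients also resurrect dead nodes after a delay. */
class NodePoolSketch {
    private final List<String> nodes;
    private final Set<String> dead = new HashSet<>();
    private int next = 0;

    NodePoolSketch(List<String> nodes) {
        this.nodes = new ArrayList<>(nodes);
    }

    /** Returns the next live node in round-robin order, or throws when none remain. */
    String nextAlive() {
        for (int i = 0; i < nodes.size(); i++) {
            String candidate = nodes.get(next);
            next = (next + 1) % nodes.size();
            if (!dead.contains(candidate)) {
                return candidate;
            }
        }
        throw new IllegalStateException("No alive nodes found in your cluster");
    }

    /** Called after a connection error, timeout, or auth failure on that node. */
    void markDead(String node) {
        dead.add(node);
    }
}
```

Once every entry has been marked dead, nextAlive() has nothing left to return. That is the state in which a client raises this error, whatever the health of the cluster itself.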

The cluster may be fully operational - check from another host before assuming the cluster is down.

Common Causes

  1. Elasticsearch process not running on the target hosts. How to confirm: systemctl status elasticsearch or curl http://<host>:9200/ directly from the client host.
  2. Wrong scheme (HTTP vs HTTPS) after enabling security. How to confirm: try both http://<host>:9200/ and https://<host>:9200/ from the client - one will return JSON, the other a TLS handshake error or empty response.
  3. TLS certificate not trusted by the client. How to confirm: openssl s_client -connect <host>:9200 -showcerts and verify the certificate chain.
  4. Firewall, security group, or network policy blocking port 9200 (REST) or 9300 (transport). How to confirm: nc -vz <host> 9200 from the client host.
  5. Sniffing returned 9300-style transport addresses but the client connects via REST. How to confirm: the client log will list the unreachable addresses; if they are 9300, sniffing is misconfigured.
  6. Authentication failure (wrong username/password/API key) returning 401 on every node. How to confirm: try a manual curl -u <user>:<pass> https://<host>:9200/.
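
When the probe runs from Java rather than curl, the exception type usually points at one of the causes above. The sketch below uses only JDK exception classes; FailureClassifierSketch and its Diagnosis values are hypothetical names for illustration, and the mapping is a heuristic rather than a rule.

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.net.http.HttpTimeoutException;
import javax.net.ssl.SSLException;

/** Hypothetical helper: maps a failed probe to a likely cause. Not part of any Elasticsearch client. */
class FailureClassifierSketch {
    enum Diagnosis { CONNECTION_REFUSED, TIMEOUT, TLS_PROBLEM, AUTH_FAILURE, REACHABLE, UNKNOWN }

    /** Classifies an exception thrown while connecting to a node. */
    static Diagnosis classify(Throwable error) {
        // Walk the cause chain, since HTTP clients often wrap the low-level exception.
        for (Throwable t = error; t != null; t = t.getCause()) {
            // Untrusted certificate, or HTTPS spoken to a plain-HTTP port.
            if (t instanceof SSLException) return Diagnosis.TLS_PROBLEM;
            // Packets silently dropped: typical of a firewall or security group.
            if (t instanceof SocketTimeoutException || t instanceof HttpTimeoutException) return Diagnosis.TIMEOUT;
            // Usually nothing is listening on the port: process down or wrong port.
            if (t instanceof ConnectException) return Diagnosis.CONNECTION_REFUSED;
        }
        return Diagnosis.UNKNOWN;
    }

    /** Classifies the HTTP status of a probe that did get a response. */
    static Diagnosis classifyStatus(int status) {
        if (status == 401) return Diagnosis.AUTH_FAILURE;
        if (status >= 200 && status < 300) return Diagnosis.REACHABLE;
        return Diagnosis.UNKNOWN;
    }
}
```

Plain HTTP sent to an HTTPS port tends to surface as a generic I/O error, which this sketch reports as UNKNOWN, so the manual check in cause 2 is still needed.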

How to Fix No Alive Nodes Found

  1. From the client host, confirm at least one node responds:

    curl -k -u elastic:<password> https://<host>:9200/
    

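    A 200 with the cluster banner means the node and credentials are fine. Because -k skips certificate verification, it does not prove that the client trusts the certificate; repeat the request with --cacert <ca.crt> instead of -k to check that as well.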

  2. Check the running Elasticsearch process on each node:

    sudo systemctl status elasticsearch
    sudo journalctl -u elasticsearch -n 200
    
  3. Verify cluster health from one of the nodes:

    curl -k -u elastic:<password> https://localhost:9200/_cluster/health
    
  4. Reconcile client config: scheme, port, credentials, and CA cert path must match what the cluster runs. For the Java REST client:

    // Assumes sslContext and credentialsProvider are built elsewhere.
    RestClient client = RestClient.builder(new HttpHost("es-1", 9200, "https"))
        .setHttpClientConfigCallback(b -> b.setSSLContext(sslContext)
            .setDefaultCredentialsProvider(credentialsProvider))
        .build();
    
  5. Disable sniffing temporarily to rule out stale topology data. In the Java client, remove the Sniffer until basic connectivity works.

  6. Open firewall/security groups for port 9200 from the client subnet:

    sudo ufw allow from <client_subnet> to any port 9200 proto tcp
    
  7. Restart Elasticsearch only after configuration checks pass:

    sudo systemctl restart elasticsearch
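
Several of the steps above come down to one question: can the client host open a TCP connection to port 9200 at all? The same check that nc -vz performs can be run from inside the application's JVM, which is useful when the application runs in a container or network namespace with different rules than the shell. A minimal sketch using only the JDK follows; PortCheckSketch is a hypothetical name and the 2-second timeout is an arbitrary choice.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: a JVM-side equivalent of `nc -vz <host> <port>`. */
class PortCheckSketch {
    /** Returns true if a TCP connection to host:port opens within timeoutMillis. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable host: all mean "not reachable from here".
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println(host + ":9200 reachable = " + canConnect(host, 9200, 2000));
    }
}
```

A true result only shows that the port accepts connections. Scheme, certificate trust, and credentials still need the curl checks from steps 1 and 3.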
    

Resolve No Alive Nodes Found Automatically with Pulse

Pulse is an AI DBA for Elasticsearch and OpenSearch. When NoNodeAvailableException: No alive nodes found in your cluster fires from a client, Pulse:

  • Probes both REST (9200) and transport (9300) endpoints continuously from outside the cluster, distinguishing connection refused, TLS handshake failure, 401 Unauthorized, and timeout, and correlates each with systemctl status elasticsearch, journalctl -u elasticsearch, and _cluster/health from a known-good vantage point
  • Identifies which of the six causes applies: node process down, HTTP-vs-HTTPS scheme mismatch after enabling security, untrusted TLS chain, blocked firewall/security group on 9200, sniffing returning unreachable 9300 transport addresses, or auth failure across every endpoint
  • Generates the exact remediation: the correct RestClient.builder configuration with HttpHost scheme/port/CA, the ufw/security-group rule to open 9200 from the client subnet, the Sniffer disable, or the certificate trust update
  • Applies network policy and dynamic cluster settings changes with operator approval; leaves client library configuration as a one-click PR

Pulse separates "the cluster is down" from "this client cannot reach the cluster" by running probes from outside the cluster network, which catches partial outages where some clients still work and others do not.

Start a free trial to connect your cluster.

Frequently Asked Questions

Q: Does "No alive nodes found" mean my cluster is down?
A: Not necessarily. The error means the client cannot reach any node. The cluster may be fully operational - test with curl from a known-good host before troubleshooting cluster state.

Q: How can I tell if it is a network or authentication issue?
A: A network error in the client log typically reads "connection refused" or "connect timed out"; an auth failure reads "401 Unauthorized" or "security_exception". Test with curl -v to see the HTTP status code directly.

Q: Why does sniffing make this error worse?
A: Sniffing replaces the configured node list with addresses returned by _nodes/_all/http. If those addresses are not reachable from the client network (private IPs, decommissioned hosts, internal DNS names), the client marks every sniffed node dead and raises this error even when a publicly reachable load balancer is configured.
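
One rough way to spot this is to look at the sniffed publish addresses and flag the ones that are private. The sketch below assumes the addresses have already been extracted as plain IPv4 "ip:port" strings; SniffedAddressSketch is a hypothetical helper, and a flagged address is only a hint, since private IPs are fine when the client sits in the same network.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: flags sniffed addresses that are unlikely to be reachable from outside. */
class SniffedAddressSketch {
    /** Returns the "ip:port" entries whose IP is site-local (RFC 1918), loopback, or link-local. */
    static List<String> privateAddresses(List<String> publishAddresses) {
        List<String> flagged = new ArrayList<>();
        for (String address : publishAddresses) {
            String ip = address.substring(0, address.lastIndexOf(':'));
            InetAddress parsed;
            try {
                // Literal IPs are parsed without a DNS lookup.
                parsed = InetAddress.getByName(ip);
            } catch (UnknownHostException e) {
                throw new IllegalArgumentException("Not an IP literal: " + address, e);
            }
            if (parsed.isSiteLocalAddress() || parsed.isLoopbackAddress() || parsed.isLinkLocalAddress()) {
                flagged.add(address);
            }
        }
        return flagged;
    }
}
```

If every sniffed address is flagged and the client connects from another network, disabling sniffing and pointing the client at the load balancer is the likely fix.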

Q: Can I lose data when this error occurs?
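A: The client error itself does not affect cluster data. Write operations that never reached a node simply fail and should be retried. Operations that returned a 2xx response (200 or 201) were persisted - check the client logs for the per-request status. A request that timed out after it was sent may still have been applied, so retry writes with explicit document IDs to keep them idempotent.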

Q: How long should I wait before retrying after this error?
A: Use exponential backoff starting at 1-2 seconds, capped at ~30 seconds, with jitter. Aggressive retries during a network blip make recovery harder for the cluster.
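
The backoff schedule can be sketched as follows. The constants follow the numbers in the answer above; the class name BackoffSketch and the choice of jitter (half of each delay fixed, half random) are assumptions for illustration, not a recommendation from any client library.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper: exponential backoff with jitter, 1-2 s at first, capped at 30 s. */
class BackoffSketch {
    static final long BASE_MILLIS = 2_000;  // ceiling of the first delay
    static final long CAP_MILLIS = 30_000;  // no delay ever exceeds this

    /** Upper bound of the delay for a zero-based attempt: base * 2^attempt, capped. */
    static long ceilingMillis(int attempt) {
        // Clamp the exponent so the shift cannot overflow a long.
        int shift = Math.max(0, Math.min(attempt, 20));
        return Math.min(CAP_MILLIS, BASE_MILLIS << shift);
    }

    /** Delay with jitter: half of the ceiling is fixed, the other half is random. */
    static long delayMillis(int attempt) {
        long half = ceilingMillis(attempt) / 2;
        return half + ThreadLocalRandom.current().nextLong(half + 1);
    }
}
```

With these constants the first retry waits between 1 and 2 seconds, and retries from the fifth onward wait between 15 and 30 seconds. The random half keeps many clients from retrying at the same instant after a shared network blip.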

Q: Will mismatched cluster names cause this error?
A: Not with the REST client. It does not enforce cluster name matching, so a mismatched name cannot cause "no alive nodes". The transport client (deprecated in 7.0, removed in 8.0) did enforce it and is no longer supported. If you are still using it, migrate to the REST client.

Q: What's the fastest way to diagnose "No alive nodes found" in production?
A: Pulse, the AI DBA for Elasticsearch and OpenSearch, probes the cluster from outside the network on both REST and transport, distinguishes connection refused, TLS handshake failure, and 401 in the same view, and names whether the issue is node state, firewall, scheme mismatch, sniffing, or credentials. That isolates client-network problems from actual cluster outages without a manual curl-from-three-hosts walk.
