Troubleshooting Elasticsearch Nodes That Keep Disconnecting

Frequent node disconnections destabilize Elasticsearch clusters, causing shard reallocation, increased recovery time, and potential data unavailability. This guide helps you identify and fix the root causes of node disconnections.

Symptoms of Node Disconnection Issues

  • Nodes appearing and disappearing from /_cat/nodes
  • Frequent master elections
  • Shard relocations and recoveries
  • Client connection errors
  • Log messages: "node disconnected", "failed to ping", "transport error"

Diagnosing Disconnection Causes

Check Node Status

GET /_cat/nodes?v&h=name,ip,heap.percent,cpu,load_1m,node.role,master
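To scan this output programmatically, the same endpoint can return JSON when `format=json` is added to the query string. The following is a minimal Python sketch that assumes the rows have already been fetched and parsed; the function name `flag_stressed_nodes` and the 85% threshold are illustrative choices, not Elasticsearch defaults.

```python
def flag_stressed_nodes(rows, heap_threshold=85):
    """Return names of nodes whose heap usage is at or above the threshold.

    `rows` is the parsed JSON list from
    GET /_cat/nodes?format=json&h=name,heap.percent
    The _cat API reports numeric columns as strings, so convert first.
    """
    stressed = []
    for row in rows:
        heap = row.get("heap.percent")
        if heap is not None and int(heap) >= heap_threshold:
            stressed.append(row["name"])
    return stressed


# Hypothetical sample rows, for illustration only
sample = [
    {"name": "es-node-1", "heap.percent": "42"},
    {"name": "es-node-2", "heap.percent": "91"},
]
print(flag_stressed_nodes(sample))
```

Nodes flagged here are good candidates to check first when disconnections line up with load spikes.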

Review Cluster Logs

Search for disconnection events:

grep -i "disconnect\|failed to ping\|master.*changed\|transport.*error" /var/log/elasticsearch/*.log
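Counting matches per hour often shows whether disconnections cluster around a particular time, such as a nightly batch job. A rough Python sketch of that idea follows; it reuses the grep patterns above and assumes each log line starts with a bracketed ISO-style timestamp such as `[2024-01-15T10:23:45,123]`, which may not match a customized log layout.

```python
import re
from collections import Counter

# Same patterns as the grep command above, matched case-insensitively
EVENT = re.compile(
    r"disconnect|failed to ping|master.*changed|transport.*error",
    re.IGNORECASE,
)
# Assumption: lines begin with a timestamp like [2024-01-15T10:23:45,123]
HOUR = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2})")


def events_per_hour(lines):
    """Count disconnection-related log lines, grouped by date and hour."""
    counts = Counter()
    for line in lines:
        stamp = HOUR.match(line)
        if stamp and EVENT.search(line):
            counts[stamp.group(1)] += 1
    return counts


# Usage sketch (the path is an example):
# with open("/var/log/elasticsearch/my-cluster.log") as f:
#     for hour, n in sorted(events_per_hour(f).items()):
#         print(hour, n)
```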

Check Discovery Configuration

GET /_cluster/settings?include_defaults=true&filter_path=*.discovery.*,*.cluster.initial_master_nodes

Common Causes and Solutions

Cause 1: Network Issues

Symptoms:

  • Random disconnections across multiple nodes
  • Timeout errors in logs
  • Packet loss or high latency

Diagnosis:

# Test connectivity between nodes
ping -c 100 <other-node-ip>

# Check for packet loss
mtr --report <other-node-ip>

# Verify port accessibility
nc -zv <other-node-ip> 9300
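Where `nc` is not installed, the port check above can be approximated with the Python standard library. This is only a sketch: the helper name `transport_port_open` is made up for illustration, and a successful TCP connect does not prove that the Elasticsearch transport handshake itself works.

```python
import socket


def transport_port_open(host, port=9300, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts
        return False


# Usage sketch (replace with the addresses of your own nodes):
# for ip in ["192.168.1.10", "192.168.1.11", "192.168.1.12"]:
#     print(ip, transport_port_open(ip))
```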

Solutions:

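  1. Increase transport timeouts:
# elasticsearch.yml
# On 7.x and later use transport.connect_timeout; the older transport.tcp.connect_timeout name is deprecated there and removed in 8.0
transport.connect_timeout: 30s
transport.ping_schedule: 5s
  2. On Elasticsearch 6.x and earlier, adjust the Zen discovery fault detection timeouts:
# elasticsearch.yml
discovery.zen.ping_timeout: 10s
discovery.zen.fd.ping_interval: 2s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5
  3. On Elasticsearch 7.x+, where the discovery.zen.* settings no longer apply, adjust cluster fault detection instead: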
# elasticsearch.yml
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.interval: 2s
cluster.fault_detection.leader_check.retry_count: 5
cluster.fault_detection.follower_check.timeout: 30s
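Raising these values makes the cluster more tolerant of slow networks, but it also delays the removal of a node that has genuinely hung. As a rough worked example, the sketch below estimates the worst-case time to declare an unresponsive node faulty. It assumes every failed check waits for the full timeout and that the next check starts one interval later; this is a simplification, and a node whose connection is closed outright is normally detected much sooner.

```python
def worst_case_detection_seconds(interval_s, timeout_s, retry_count):
    """Approximate upper bound on the time to mark an unresponsive node faulty.

    Simplifying assumption: each of the `retry_count` checks waits `timeout_s`
    before failing, and checks are spaced `interval_s` apart.
    """
    return retry_count * (interval_s + timeout_s)


# Elasticsearch 7.x defaults: interval 1s, timeout 10s, retry_count 3
print(worst_case_detection_seconds(1, 10, 3))   # 33
# leader_check values from the snippet above: interval 2s, timeout 30s, retry_count 5
print(worst_case_detection_seconds(2, 30, 5))   # 160
```

Under these assumptions the tuned values stretch detection from roughly half a minute to well over two minutes, which is worth weighing against the instability being fixed.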

Cause 2: Resource Exhaustion

Symptoms:

  • Disconnections correlate with high CPU or memory usage
  • GC pauses before disconnections
  • Timeouts during high load periods

Diagnosis:

GET /_nodes/stats/jvm,os

Solutions:

  1. Ensure adequate heap:
# jvm.options.d/heap.options
-Xms16g
-Xmx16g
  2. Monitor and reduce GC pauses:
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc
  3. Scale the cluster to handle the load.
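The GC counters in the node stats are cumulative, so a single response says little on its own; the difference between two samples is more useful. A minimal Python sketch, assuming two parsed responses from `GET /_nodes/stats/jvm` taken a known number of milliseconds apart (the function names are illustrative):

```python
def _total_gc_ms(node_stats):
    """Sum collection time across all collectors (for example young and old)."""
    collectors = node_stats["jvm"]["gc"]["collectors"]
    return sum(c["collection_time_in_millis"] for c in collectors.values())


def gc_time_fraction(before, after, elapsed_ms):
    """Fraction of wall-clock time each node spent in GC between two samples."""
    fractions = {}
    for node_id, node in after["nodes"].items():
        prev = before["nodes"].get(node_id)
        if prev is None:
            continue  # node joined between the two samples
        fractions[node_id] = (_total_gc_ms(node) - _total_gc_ms(prev)) / elapsed_ms
    return fractions
```

A node that spends a large share of each interval collecting garbage is likely to miss fault detection checks under load.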

Cause 3: Long GC Pauses

Symptoms:

  • Log messages about GC taking too long
  • Node appears to freeze periodically
  • WARN-level [gc] entries from the JVM GC monitor in logs

Diagnosis:

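# GC monitor lines look like "[gc][young][...] duration [1.1s]" or "[gc][...] overhead, spent [...]"
grep "\[gc\]" /var/log/elasticsearch/*.log | grep -E "duration \[[0-9.]+(s|m)\]|overhead"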
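To rank GC pauses by length rather than just list them, the durations need to be converted to a common unit. The sketch below assumes the GC monitor writes durations in a human-readable form inside `duration [...]`, such as `750ms`, `1.1s`, or `2.3m`; both that format and the helper names are assumptions to adjust against your own logs.

```python
import re

UNITS_MS = {"ms": 1, "s": 1000, "m": 60_000}
# Assumed format: "... duration [1.1s], collections ..."
DURATION = re.compile(r"duration \[(\d+(?:\.\d+)?)(ms|s|m)\]")


def gc_duration_ms(line):
    """Return the GC duration in whole milliseconds, or None if absent."""
    match = DURATION.search(line)
    if match is None:
        return None
    return round(float(match.group(1)) * UNITS_MS[match.group(2)])


def slow_gc_lines(lines, min_ms=1000):
    """Keep only lines whose GC duration is at least `min_ms`."""
    return [line for line in lines if (gc_duration_ms(line) or 0) >= min_ms]
```

Pauses approaching the fault detection timeout are the ones most likely to explain a node being dropped from the cluster.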

Solutions:

  1. Reduce heap pressure (see JVM heap pressure guides)
  2. Tune GC settings:
# jvm.options.d/gc.options
-XX:+UseG1GC
-XX:G1HeapRegionSize=32m
-XX:+ParallelRefProcEnabled

Cause 4: Incorrect Discovery Configuration

Symptoms:

  • Nodes not finding each other after restart
  • Split brain scenarios
  • Multiple masters elected

Diagnosis:

# Check current configuration
grep -E "discovery|cluster.initial_master" /etc/elasticsearch/elasticsearch.yml

Solutions:

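For Elasticsearch 7.x+ (set cluster.initial_master_nodes only when bootstrapping a brand-new cluster, use the same list on every master-eligible node, and remove it once the cluster has formed):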

# elasticsearch.yml
discovery.seed_hosts:
  - 192.168.1.10:9300
  - 192.168.1.11:9300
  - 192.168.1.12:9300
cluster.initial_master_nodes:
  - master-node-1
  - master-node-2
  - master-node-3
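The reason three master-eligible nodes are listed is that master election needs a majority of them. A short Python sketch of that arithmetic follows; the function names are illustrative, and the calculation is the plain majority rule rather than Elasticsearch's exact voting configuration logic.

```python
def masters_needed(master_eligible):
    """Smallest majority of the master-eligible nodes."""
    return master_eligible // 2 + 1


def tolerated_master_failures(master_eligible):
    """How many master-eligible nodes can be lost while keeping a majority."""
    return master_eligible - masters_needed(master_eligible)


for n in (1, 2, 3, 5):
    print(n, masters_needed(n), tolerated_master_failures(n))
```

With three master-eligible nodes the cluster can lose one and still elect a master; with two it cannot lose any, which is why an odd count of at least three is the usual recommendation.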

Cause 5: File Descriptor Limits

Symptoms:

  • "Too many open files" errors
  • Connections failing to establish

Diagnosis:

# Check current limits (run as the user that starts Elasticsearch; this reflects the shell, not the running process)
ulimit -n

# Check Elasticsearch's view
GET /_nodes/stats/process?filter_path=nodes.*.process.open_file_descriptors,nodes.*.process.max_file_descriptors
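The two fields returned by this request can be turned into a usage ratio per node. A minimal Python sketch, assuming the response has already been parsed into a dict; the 0.8 warning ratio and the function name are arbitrary illustrative choices.

```python
def nodes_near_fd_limit(stats, warn_ratio=0.8):
    """Map node id -> usage ratio for nodes at or above `warn_ratio`.

    `stats` is the parsed response from GET /_nodes/stats/process.
    """
    risky = {}
    for node_id, node in stats["nodes"].items():
        proc = node["process"]
        open_fds = proc["open_file_descriptors"]
        max_fds = proc["max_file_descriptors"]
        if max_fds > 0 and open_fds / max_fds >= warn_ratio:
            risky[node_id] = open_fds / max_fds
    return risky
```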

Solutions:

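# /etc/security/limits.conf
elasticsearch  -  nofile  65535
elasticsearch  -  nproc   4096
# Note: limits.conf is not applied to systemd services; for systemd-managed installs set LimitNOFILE in a unit override (systemctl edit elasticsearch)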

Cause 6: DNS Resolution Issues

Symptoms:

  • Intermittent connectivity
  • Resolution timeouts
  • Inconsistent behavior across restarts

Solutions:

  1. Use IP addresses instead of hostnames:
# elasticsearch.yml
discovery.seed_hosts:
  - 192.168.1.10
  - 192.168.1.11
  - 192.168.1.12
  2. Add entries to /etc/hosts:
192.168.1.10 es-node-1
192.168.1.11 es-node-2
192.168.1.12 es-node-3
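Before changing the configuration, it can help to confirm that name resolution is actually slow or inconsistent. The sketch below times a lookup with the Python standard library; the helper name is illustrative, and it measures the resolver of the machine it runs on, so it should be run on the Elasticsearch hosts themselves.

```python
import socket
import time


def time_dns_lookup(hostname, port=9300):
    """Resolve `hostname` and return (sorted addresses, seconds taken).

    Returns an empty address list if resolution fails.
    """
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        addresses = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        addresses = []
    return addresses, time.monotonic() - start


# Usage sketch, using the example hostnames from the /etc/hosts entries above:
# for name in ["es-node-1", "es-node-2", "es-node-3"]:
#     print(name, *time_dns_lookup(name))
```

Lookups that take seconds, or that return different addresses on repeated runs, point toward DNS as the cause.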

Cause 7: JVM Crashes

Symptoms:

  • Abrupt disconnections without warning
  • Core dumps or hs_err files
  • OOM killer invocations

Diagnosis:

# Check for OOM killer
grep -i "killed process" /var/log/syslog
dmesg | grep -i "out of memory"

# Look for JVM crash files
ls -la /var/log/elasticsearch/hs_err*

Solutions:

  • Ensure heap doesn't exceed 50% of RAM
  • Check for system memory pressure
  • Disable swap or set bootstrap.memory_lock: true
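The 50% guideline can be expressed as a small calculation. This Python sketch also caps the result at 31 GB, a commonly cited ceiling for keeping compressed object pointers enabled; that cap is an assumption added here, and the exact threshold depends on the JVM.

```python
def suggested_heap_gb(total_ram_gb, cap_gb=31):
    """Half of physical RAM, capped (assumption: ~31 GB keeps compressed oops)."""
    return min(total_ram_gb // 2, cap_gb)


for ram in (8, 32, 128):
    heap = suggested_heap_gb(ram)
    # Set minimum and maximum heap to the same value, as in the heap example earlier
    print(f"{ram} GB RAM -> -Xms{heap}g -Xmx{heap}g")
```

The remaining memory is left for the operating system's filesystem cache, which reduces the chance of the OOM killer targeting the Elasticsearch process.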

Stability Configuration

Recommended Settings

# elasticsearch.yml

# Discovery settings
discovery.seed_hosts:
  - 192.168.1.10
  - 192.168.1.11
  - 192.168.1.12

# Fault detection (ES 7.x+)
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.interval: 2s
cluster.fault_detection.follower_check.timeout: 30s
cluster.fault_detection.follower_check.interval: 2s

# Transport settings (the older name transport.tcp.connect_timeout is deprecated in 7.x and removed in 8.0)
transport.connect_timeout: 30s

# Memory lock
bootstrap.memory_lock: true

Monitoring for Disconnections

Set Up Alerts

Monitor for:

  • Node count drops
  • Master elections
  • Shard recovery events
  • Transport layer errors

Key Metrics

GET /_cluster/health
GET /_cat/nodes?v
GET /_cluster/stats
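
A node count alert can be built from periodic `GET /_cluster/health` samples by comparing the `number_of_nodes` field between consecutive polls. A minimal Python sketch, assuming the samples are collected elsewhere as (timestamp, parsed response) pairs; the function name is illustrative.

```python
def node_count_drops(samples):
    """Return (timestamp, previous_count, current_count) for each drop.

    `samples` is a chronological list of (timestamp, health) tuples, where
    `health` is a parsed GET /_cluster/health response.
    """
    drops = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if cur["number_of_nodes"] < prev["number_of_nodes"]:
            drops.append((ts, prev["number_of_nodes"], cur["number_of_nodes"]))
    return drops
```

Polling more often than the fault detection timeout helps catch short-lived drops that would otherwise fall between samples.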