Elasticsearch Nodes Keep Disconnecting Troubleshooting

Frequent node disconnections destabilize Elasticsearch clusters, causing shard reallocation, increased recovery time, and potential data unavailability. This guide helps you identify and fix the root causes of node disconnections.

Symptoms of Node Disconnection Issues

Nodes appearing and disappearing from /_cat/nodes
Frequent master elections
Shard relocations and recoveries
Client connection errors
Log messages: "node disconnected", "failed to ping", "transport error"

Diagnosing Disconnection Causes

Check Node Status

GET /_cat/nodes?v&h=name,ip,heap.percent,cpu,load_1m,node.role,master

Review Cluster Logs

Search for disconnection events:

grep -i "disconnect\|failed to ping\|master.*changed\|transport.*error" /var/log/elasticsearch/*.log
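
To see which kind of event dominates, the same search can be expressed as a small Python tally. This is a minimal sketch: the pattern names and the `count_disconnect_events` helper are illustrative, and the patterns simply mirror the grep expression above rather than an exhaustive list of Elasticsearch log messages.

```python
import re
from collections import Counter

# Substring heuristics copied from the grep expression; names are hypothetical.
PATTERNS = {
    "disconnect": re.compile(r"disconnect", re.IGNORECASE),
    "failed_to_ping": re.compile(r"failed to ping", re.IGNORECASE),
    "master_changed": re.compile(r"master.*changed", re.IGNORECASE),
    "transport_error": re.compile(r"transport.*error", re.IGNORECASE),
}


def count_disconnect_events(lines):
    """Count how many log lines match each disconnection-related pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Passing an open log file (for example `count_disconnect_events(open(path))`) yields one count per pattern, which makes it easier to tell a ping-timeout problem from repeated master changes.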

Check Discovery Configuration

GET /_cluster/settings?include_defaults=true&filter_path=*.discovery.*,*.cluster.initial_master_nodes

Common Causes and Solutions

Cause 1: Network Issues

Symptoms:

Random disconnections across multiple nodes
Timeout errors in logs
Packet loss or high latency

Diagnosis:

# Test connectivity between nodes
ping -c 100 <other-node-ip>

# Check for packet loss
mtr --report <other-node-ip>

# Verify port accessibility
nc -zv <other-node-ip> 9300
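
Where nc is not installed, the port check can be approximated from Python's standard library. This is a sketch under the assumption that the transport port is the default 9300; `port_is_reachable` is an illustrative helper name.

```python
import socket


def port_is_reachable(host, port=9300, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake; closing it right away
        # is enough to prove the port accepts connections.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Running this from every node against every other node gives a quick reachability matrix. A successful handshake only shows the port is open, not that the link is free of packet loss.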

Solutions:

Increase transport timeouts:

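# elasticsearch.yml
# On 6.x and 7.0 the first setting is named transport.tcp.connect_timeout;
# that name was deprecated during 7.x and removed in 8.0.
transport.connect_timeout: 30s
transport.ping_schedule: 5s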

For Elasticsearch 6.x and earlier (Zen discovery), adjust the fault detection timeouts:

# elasticsearch.yml
discovery.zen.ping_timeout: 10s
discovery.zen.fd.ping_interval: 2s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5

For Elasticsearch 7.x+:

# elasticsearch.yml
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.interval: 2s
cluster.fault_detection.leader_check.retry_count: 5
cluster.fault_detection.follower_check.timeout: 30s
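
Raising these values also delays detection of a node that is really unresponsive. The arithmetic below is a rough sketch of that trade-off. It assumes a simplified model in which every check waits for the full timeout and consecutive checks are separated by the interval. The function name is illustrative, and a node whose connection is closed outright is detected much sooner than this bound.

```python
def worst_case_detection_seconds(timeout_s, interval_s, retry_count):
    """Approximate upper bound on the time to declare a hung node faulty.

    Simplified model (an assumption, not the exact scheduler): retry_count
    checks each time out, with interval_s of waiting between them.
    """
    return retry_count * timeout_s + (retry_count - 1) * interval_s


# Values from the leader_check example above: 30s timeout, 2s interval, 5 retries.
tuned = worst_case_detection_seconds(30, 2, 5)
# Defaults documented by Elasticsearch: 10s timeout, 1s interval, 3 retries.
default = worst_case_detection_seconds(10, 1, 3)
```

Under this model the tuned settings tolerate a hang of roughly two and a half minutes before the node is removed, compared with about half a minute on defaults, so generous timeouts mask brief network blips at the cost of slower failover.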

Cause 2: Resource Exhaustion

Symptoms:

Disconnections correlate with high CPU or memory usage
GC pauses before disconnections
Timeouts during high load periods

Diagnosis:

GET /_nodes/stats/jvm,os

Solutions:

Ensure adequate heap:

# jvm.options.d/heap.options
-Xms16g
-Xmx16g

Monitor and reduce GC pauses:

GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc
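
The GC section of the response reports cumulative counts and times per collector, so an average pause has to be derived. A minimal sketch, assuming the response has already been parsed into a dict and uses the `collection_count` and `collection_time_in_millis` fields; `gc_average_pause_ms` is an illustrative helper.

```python
def gc_average_pause_ms(node_stats):
    """Map node id -> {collector name: average milliseconds per collection}."""
    result = {}
    for node_id, node in node_stats.get("nodes", {}).items():
        collectors = node.get("jvm", {}).get("gc", {}).get("collectors", {})
        averages = {}
        for name, stats in collectors.items():
            count = stats.get("collection_count", 0)
            total_ms = stats.get("collection_time_in_millis", 0)
            # Avoid division by zero for collectors that have not run yet.
            averages[name] = total_ms / count if count else 0.0
        result[node_id] = averages
    return result
```

Because the counters are cumulative since node start, comparing two samples taken a few minutes apart gives a better picture of current behavior than a single reading.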

Scale the cluster (add nodes or use larger instances) if sustained load exceeds its capacity.

Cause 3: Long GC Pauses

Symptoms:

Log messages about GC taking too long
Node appears to freeze periodically
WARN-level [gc] entries from JvmGcMonitorService in logs

Diagnosis:

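# Elasticsearch prints GC durations as [700ms], [1.2s], or [1.5m],
# so match entries whose duration is expressed in seconds or minutes
grep -E "\[gc\]" /var/log/elasticsearch/*.log | grep -E "duration \[[0-9.]+(s|m)\]"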
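
For a configurable threshold, the durations can be parsed rather than pattern-matched. The sketch below assumes GC monitor lines contain `[gc]` followed by `duration [<number><unit>]`, which is how these entries are usually formatted. `long_gc_pauses` and the unit table are illustrative names.

```python
import re

# Assumed line shape: "... [gc][young][386][15] duration [1.1s], collections ..."
DURATION_RE = re.compile(r"\[gc\].*duration \[([0-9.]+)(ms|s|m|h)\]")
UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def long_gc_pauses(lines, threshold_s=1.0):
    """Return (seconds, line) pairs for GC entries at or above threshold_s."""
    hits = []
    for line in lines:
        match = DURATION_RE.search(line)
        if not match:
            continue
        seconds = float(match.group(1)) * UNIT_SECONDS[match.group(2)]
        if seconds >= threshold_s:
            hits.append((seconds, line.rstrip()))
    return hits
```

Pauses that approach the fault detection timeout are the ones most likely to get a node removed from the cluster, so a threshold of a few seconds is a reasonable starting point.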

Solutions:

Reduce heap pressure (see JVM heap pressure guides)
Tune GC settings:

# jvm.options.d/gc.options
-XX:+UseG1GC
-XX:G1HeapRegionSize=32m
-XX:+ParallelRefProcEnabled

Cause 4: Incorrect Discovery Configuration

Symptoms:

Nodes not finding each other after restart
Split brain scenarios
Multiple masters elected

Diagnosis:

# Check current configuration
grep -E "discovery|cluster\.initial_master" /etc/elasticsearch/elasticsearch.yml

Solutions:

For Elasticsearch 7.x+:

# elasticsearch.yml
discovery.seed_hosts:
  - 192.168.1.10:9300
  - 192.168.1.11:9300
  - 192.168.1.12:9300
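# Set cluster.initial_master_nodes only when bootstrapping a brand-new cluster
# and remove it once the cluster has formed. Entries must match each node's node.name.
cluster.initial_master_nodes: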
  - master-node-1
  - master-node-2
  - master-node-3
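
The reason three master-eligible nodes are listed comes down to majority arithmetic: an election needs more than half of the voting nodes. A small sketch of that calculation follows. The function names are illustrative, and it ignores the automatic voting-configuration adjustments that 7.x and later perform.

```python
def quorum_size(master_eligible):
    """Smallest number of nodes that forms a strict majority."""
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible):
    """How many master-eligible nodes can be lost while keeping a majority."""
    return master_eligible - quorum_size(master_eligible)


for n in (1, 2, 3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

With two master-eligible nodes the quorum is still two, so losing either one halts elections. Three nodes is the smallest layout that survives a single failure.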

Cause 5: File Descriptor Limits

Symptoms:

"Too many open files" errors
Connections failing to establish

Diagnosis:

# Check the limit for the current shell (run as the elasticsearch user);
# the running service may have a different limit
ulimit -n

# Check Elasticsearch's view
GET /_nodes/stats/process?filter_path=nodes.*.process.open_file_descriptors,nodes.*.process.max_file_descriptors
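
The two fields returned above are most useful as a ratio. A minimal sketch, assuming the response is already parsed into a dict. `fd_usage` and the 80% warning ratio are illustrative choices, not Elasticsearch defaults.

```python
def fd_usage(node_stats, warn_ratio=0.8):
    """Return {node id: usage ratio} for nodes at or above warn_ratio."""
    flagged = {}
    for node_id, node in node_stats.get("nodes", {}).items():
        process = node.get("process", {})
        used = process.get("open_file_descriptors")
        limit = process.get("max_file_descriptors")
        # Skip nodes that did not report usable values.
        if not used or not limit or limit <= 0:
            continue
        ratio = used / limit
        if ratio >= warn_ratio:
            flagged[node_id] = round(ratio, 3)
    return flagged
```

A node that shows up here is close to "Too many open files" errors even if it has not logged one yet.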

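Solutions:

Raise the limits for the elasticsearch user. On systemd-managed installations limits.conf is not applied to the service, so set LimitNOFILE in a systemd unit override instead.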

# /etc/security/limits.conf
elasticsearch - nofile  65535
elasticsearch - nproc   4096

Cause 6: DNS Resolution Issues

Symptoms:

Intermittent connectivity
Resolution timeouts
Inconsistent behavior across restarts

Solutions:

Use IP addresses instead of hostnames:

# elasticsearch.yml
discovery.seed_hosts:
  - 192.168.1.10
  - 192.168.1.11
  - 192.168.1.12

Add entries to /etc/hosts:

192.168.1.10 es-node-1
192.168.1.11 es-node-2
192.168.1.12 es-node-3
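
Before switching to IP addresses, it can help to confirm that name resolution is actually slow or failing. The sketch below times lookups with the standard library. `time_resolution` is an illustrative helper, and the `resolver` parameter exists only so the lookup can be swapped out, for example in tests.

```python
import socket
import time


def time_resolution(hostnames, resolver=socket.gethostbyname):
    """Map hostname -> (resolved address or None, elapsed seconds)."""
    results = {}
    for name in hostnames:
        start = time.monotonic()
        try:
            address = resolver(name)
        except OSError:
            # socket.gaierror is a subclass of OSError.
            address = None
        results[name] = (address, time.monotonic() - start)
    return results
```

Running `time_resolution(["es-node-1", "es-node-2", "es-node-3"])` on each node several times shows whether failures or multi-second lookups line up with the disconnections.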

Cause 7: JVM Crashes

Symptoms:

Abrupt disconnections without warning
Core dumps or hs_err files
OOM killer invocations

Diagnosis:

# Check for OOM killer (the log is /var/log/messages on RHEL-based systems)
grep -i "killed process" /var/log/syslog
dmesg | grep -i "out of memory"

# Look for JVM crash files
ls -la /var/log/elasticsearch/hs_err*

Solutions:

Ensure heap doesn't exceed 50% of RAM, leaving the rest for the filesystem cache
Check for system memory pressure
Disable swap or set bootstrap.memory_lock: true
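
The 50% rule can be turned into a quick sizing calculation. This is a sketch with one added assumption: a cap of 31 GB, chosen here to stay under the roughly 32 GB point where the JVM stops using compressed object pointers. The function name and the exact cap are illustrative.

```python
def recommended_heap_gb(total_ram_gb, max_heap_gb=31):
    """Half of physical RAM, capped at max_heap_gb (an assumed ceiling)."""
    return min(total_ram_gb // 2, max_heap_gb)


for ram in (8, 32, 64, 128):
    print(f"{ram} GB RAM -> -Xms{recommended_heap_gb(ram)}g -Xmx{recommended_heap_gb(ram)}g")
```

A host with 32 GB of RAM lands on the 16g heap used earlier in this guide, while larger hosts stop at the cap and leave the remainder to the operating system.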

Stability Configuration

Recommended Settings

# elasticsearch.yml

# Discovery settings
discovery.seed_hosts:
  - 192.168.1.10
  - 192.168.1.11
  - 192.168.1.12

# Fault detection (ES 7.x+)
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.interval: 2s
cluster.fault_detection.follower_check.timeout: 30s
cluster.fault_detection.follower_check.interval: 2s

# Transport settings (named transport.tcp.connect_timeout before 7.x; old name removed in 8.0)
transport.connect_timeout: 30s

# Memory lock
bootstrap.memory_lock: true

Monitoring for Disconnections

Set Up Alerts

Monitor for:

Node count drops
Master elections
Shard recovery events
Transport layer errors

Key Metrics

GET /_cluster/health
GET /_cat/nodes?v
GET /_cluster/stats
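
The alert conditions listed earlier can be checked against the /_cluster/health response. A minimal sketch, assuming the response is parsed into a dict and that the expected node count is known in advance. `health_alerts` and its message strings are illustrative.

```python
def health_alerts(health, expected_nodes):
    """Return a list of human-readable alerts derived from cluster health."""
    alerts = []
    nodes = health.get("number_of_nodes", 0)
    if nodes < expected_nodes:
        alerts.append(f"node count dropped: {nodes}/{expected_nodes}")
    status = health.get("status")
    if status != "green":
        alerts.append(f"cluster status is {status}")
    moving = health.get("relocating_shards", 0) + health.get("initializing_shards", 0)
    if moving > 0:
        alerts.append(f"{moving} shards relocating or initializing")
    return alerts
```

Polling this on a short interval and alerting when the list is non-empty catches node drops and the shard recoveries that follow them.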
