Frequent node disconnections destabilize Elasticsearch clusters, causing shard reallocation, increased recovery time, and potential data unavailability. This guide helps you identify and fix the root causes of node disconnections.
Symptoms of Node Disconnection Issues
- Nodes appearing and disappearing from /_cat/nodes
- Frequent master elections
- Shard relocations and recoveries
- Client connection errors
- Log messages: "node disconnected", "failed to ping", "transport error"
Diagnosing Disconnection Causes
Check Node Status
GET /_cat/nodes?v&h=name,ip,heap.percent,cpu,load_1m,node.role,master
Review Cluster Logs
Search for disconnection events:
grep -i "disconnect\|failed to ping\|master.*changed\|transport.*error" /var/log/elasticsearch/*.log
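To see whether disconnections cluster around particular times, the matching log lines can be grouped by hour. The function below is a minimal sketch: the name count_disconnects_by_hour is hypothetical, and it assumes each log line starts with a bracketed timestamp such as [2024-01-15T10:23:45,123]. Adjust the pattern if your logging layout differs.

```shell
# Count disconnection-related log lines per hour.
# Assumption: lines begin with a timestamp like [2024-01-15T10:23:45,123].
count_disconnects_by_hour() {
  grep -ihE 'disconnect|failed to ping|transport.*error' "$@" |
    awk '/^\[[0-9][0-9][0-9][0-9]-/ { hours[substr($0, 2, 13)]++ }
         END { for (h in hours) print h, hours[h] }' |
    sort
}
```

Example usage: count_disconnects_by_hour /var/log/elasticsearch/*.log. Spikes that line up with backups, bulk indexing, or cron jobs often point to resource exhaustion rather than a network fault.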
Check Discovery Configuration
GET /_cluster/settings?include_defaults=true&filter_path=*.discovery.*,*.cluster.initial_master_nodes
Common Causes and Solutions
Cause 1: Network Issues
Symptoms:
- Random disconnections across multiple nodes
- Timeout errors in logs
- Packet loss or high latency
Diagnosis:
# Test connectivity between nodes
ping -c 100 <other-node-ip>
# Check for packet loss
mtr --report <other-node-ip>
# Verify port accessibility
nc -zv <other-node-ip> 9300
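Running the ping test by hand against every node gets tedious in larger clusters. The sketch below loops over a list of hosts and prints the packet loss reported by ping for each one. The function names are hypothetical, and the parser assumes a summary line containing "N% packet loss", which is the usual wording on Linux and macOS.

```shell
# Extract the packet-loss percentage from ping output read on stdin.
packet_loss_pct() {
  sed -n 's/.* \([0-9.][0-9.]*\)% packet loss.*/\1/p'
}

# Ping each host given as an argument and print "<host> <loss>".
# Prints "unreachable" when no summary line could be parsed.
check_packet_loss() {
  for host in "$@"; do
    loss=$(ping -c 20 "$host" 2>/dev/null | packet_loss_pct)
    echo "$host ${loss:-unreachable}"
  done
}
```

Example usage: check_packet_loss 192.168.1.10 192.168.1.11 192.168.1.12. Any nonzero loss between nodes on the same network is worth investigating before touching timeouts.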
Solutions:
- Increase transport timeouts:
# elasticsearch.yml
# Older releases use transport.tcp.connect_timeout, which was removed in 8.0
transport.connect_timeout: 30s
transport.ping_schedule: 5s
- Adjust discovery timeouts (Elasticsearch 6.x and earlier):
# elasticsearch.yml
discovery.zen.ping_timeout: 10s
discovery.zen.fd.ping_interval: 2s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5
- For Elasticsearch 7.x+:
# elasticsearch.yml
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.interval: 2s
cluster.fault_detection.leader_check.retry_count: 5
cluster.fault_detection.follower_check.timeout: 30s
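Raising these values trades false disconnections for slower detection of nodes that have really failed. As a rough estimate, assuming every check runs to its full timeout and the next check starts one interval later, the worst-case detection time is retry_count x (interval + timeout). The helper below is a hypothetical sketch of that arithmetic, not a statement of how Elasticsearch schedules checks internally. A connection that is closed outright is detected immediately, so this bound only applies to unresponsive nodes.

```shell
# Approximate worst-case seconds before an unresponsive node is declared failed.
# Arguments: retry_count, interval (seconds), timeout (seconds).
worst_case_detection_seconds() {
  retries=$1
  interval=$2
  timeout=$3
  echo $(( retries * (interval + timeout) ))
}

worst_case_detection_seconds 5 2 30   # leader check values from this guide
worst_case_detection_seconds 3 1 10   # 7.x defaults
```

With the leader check values above this comes to 160 seconds, against 33 seconds with the defaults, so expect a hung master to go unnoticed for noticeably longer.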
Cause 2: Resource Exhaustion
Symptoms:
- Disconnections correlate with high CPU or memory usage
- GC pauses before disconnections
- Timeouts during high load periods
Diagnosis:
GET /_nodes/stats/jvm,os
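For a quick per-node view without parsing JSON, the _cat/nodes API can return heap and CPU percentages as plain columns. The filter below is a minimal sketch: flag_hot_nodes is a hypothetical name, the 85 percent default threshold is an arbitrary starting point, and the usage line assumes an unauthenticated cluster reachable at localhost:9200.

```shell
# Print nodes whose heap or CPU percentage meets or exceeds a threshold.
# Reads "name heap.percent cpu" rows on stdin, as returned by
#   GET /_cat/nodes?h=name,heap.percent,cpu
# Argument: threshold in percent (default 85, an arbitrary choice).
flag_hot_nodes() {
  awk -v limit="${1:-85}" '
    ($2 + 0 >= limit + 0) || ($3 + 0 >= limit + 0) {
      print $1, "heap=" $2 "%", "cpu=" $3 "%"
    }'
}
```

Example usage: curl -s "localhost:9200/_cat/nodes?h=name,heap.percent,cpu" | flag_hot_nodes 85. If the same nodes show up here shortly before each disconnection, resource exhaustion is the likely cause.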
Solutions:
- Ensure adequate heap:
# jvm.options.d/heap.options
-Xms16g
-Xmx16g
- Monitor and reduce GC pauses:
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc
- Scale the cluster to handle the load.
Cause 3: Long GC Pauses
Symptoms:
- Log messages about GC taking too long
- Node appears to freeze periodically
- [gc][warning] entries in logs
Diagnosis:
grep "\[gc\]" /var/log/elasticsearch/*.log | grep -E "[0-9]{4,}ms|[0-9.]+s\]"
Solutions:
- Reduce heap pressure (see JVM heap pressure guides)
- Tune GC settings:
# jvm.options.d/gc.options
-XX:+UseG1GC
-XX:G1HeapRegionSize=32m
-XX:+ParallelRefProcEnabled
Cause 4: Incorrect Discovery Configuration
Symptoms:
- Nodes not finding each other after restart
- Split brain scenarios
- Multiple masters elected
Diagnosis:
# Check current configuration
grep -E "discovery|cluster\.initial_master" /etc/elasticsearch/elasticsearch.yml
Solutions:
For Elasticsearch 7.x+ (remove cluster.initial_master_nodes once the cluster has formed for the first time):
# elasticsearch.yml
discovery.seed_hosts:
- 192.168.1.10:9300
- 192.168.1.11:9300
- 192.168.1.12:9300
cluster.initial_master_nodes:
- master-node-1
- master-node-2
- master-node-3
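After changing the discovery configuration, it helps to confirm that the running cluster actually sees the expected number of master-eligible nodes. The sketch below counts them from _cat/nodes output, relying on the letter m in the node.role column to mark master eligibility. The function name is hypothetical, and the usage line assumes an unauthenticated cluster at localhost:9200.

```shell
# Count master-eligible nodes.
# Reads "name node.role" rows on stdin, as returned by
#   GET /_cat/nodes?h=name,node.role
count_master_eligible() {
  awk '$2 ~ /m/ { n++ } END { print n + 0 }'
}
```

Example usage: curl -s "localhost:9200/_cat/nodes?h=name,node.role" | count_master_eligible. For the three-master configuration shown above, the result should be 3; a lower number means a master-eligible node has not joined.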
Cause 5: File Descriptor Limits
Symptoms:
- "Too many open files" errors
- Connections failing to establish
Diagnosis:
# Check the limit in the current shell (run as the user that starts Elasticsearch)
ulimit -n
# Check Elasticsearch's view
GET /_nodes/stats/process?filter_path=nodes.*.process.open_file_descriptors,nodes.*.process.max_file_descriptors
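The shell limit can differ from the limit the running process actually received, especially when Elasticsearch is started by a service manager. On Linux, the effective value can be read from /proc/<pid>/limits. The helper below is a minimal sketch with a hypothetical name; how you find the PID is up to you, and the pgrep pattern in the comment is an assumption that may match more than one process on some versions.

```shell
# Print the soft "Max open files" limit from a /proc/<pid>/limits file.
# Argument: path to the limits file.
soft_nofile_limit() {
  awk '/^Max open files/ { print $4 }' "$1"
}

# Possible usage (assumes a single Elasticsearch JVM on this host):
#   pid=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch | head -n 1)
#   soft_nofile_limit "/proc/$pid/limits"
```

If this prints a value lower than the one configured, the limit is not being applied to the service and needs to be set where the process is launched.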
Solutions:
# /etc/security/limits.conf (not applied to systemd services; set LimitNOFILE in the unit instead)
elasticsearch - nofile 65535
elasticsearch - nproc 4096
Cause 6: DNS Resolution Issues
Symptoms:
- Intermittent connectivity
- Resolution timeouts
- Inconsistent behavior across restarts
Solutions:
- Use IP addresses instead of hostnames:
# elasticsearch.yml
discovery.seed_hosts:
- 192.168.1.10
- 192.168.1.11
- 192.168.1.12
- Add entries to /etc/hosts:
192.168.1.10 es-node-1
192.168.1.11 es-node-2
192.168.1.12 es-node-3
Cause 7: JVM Crashes
Symptoms:
- Abrupt disconnections without warning
- Core dumps or hs_err files
- OOM killer invocations
Diagnosis:
# Check for OOM killer
grep -i "killed process" /var/log/syslog
dmesg | grep -i "out of memory"
# Look for JVM crash files
ls -la /var/log/elasticsearch/hs_err*
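The OOM killer logs every process it terminates, so it is useful to count only the entries that hit the JVM. The sketch below assumes the Elasticsearch process appears as "java" in kernel messages, which is typical but not guaranteed; the function name is hypothetical.

```shell
# Count kernel log lines that record the OOM killer terminating a java process.
# Arguments: one or more log files (for example /var/log/syslog).
count_java_oom_kills() {
  cat "$@" | grep -ciE 'killed process [0-9]+ \(java\)' || true
}
```

Example usage: count_java_oom_kills /var/log/syslog. A nonzero count alongside abrupt node departures suggests the host is short on memory outside the heap, not that the JVM itself crashed.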
Solutions:
- Ensure heap doesn't exceed 50% of RAM
- Check for system memory pressure
- Disable swap or set bootstrap.memory_lock: true
Stability Configuration
Recommended Settings
# elasticsearch.yml
# Discovery settings
discovery.seed_hosts:
- 192.168.1.10
- 192.168.1.11
- 192.168.1.12
# Fault detection (ES 7.x+)
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.interval: 2s
cluster.fault_detection.follower_check.timeout: 30s
cluster.fault_detection.follower_check.interval: 2s
# Transport settings
# Older releases use transport.tcp.connect_timeout, which was removed in 8.0
transport.connect_timeout: 30s
# Memory lock
bootstrap.memory_lock: true
Monitoring for Disconnections
Set Up Alerts
Monitor for:
- Node count drops
- Master elections
- Shard recovery events
- Transport layer errors
Key Metrics
GET /_cluster/health
GET /_cat/nodes?v
GET /_cluster/stats
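A node count alert can be as simple as comparing the count reported by the cluster with the number you expect. The check below is a minimal sketch, not a replacement for a monitoring system: check_node_count is a hypothetical name, and it reads the two columns returned by GET /_cat/health?h=status,node.total on stdin.

```shell
# Compare the reported node count with the expected count.
# Reads "status node.total" on stdin; argument: expected node count.
# Returns 0 when all nodes are present, 1 when nodes are missing,
# and 2 when the input could not be parsed.
check_node_count() {
  expected=$1
  read -r status total
  case "$total" in
    ''|*[!0-9]*)
      echo "ALERT: could not read node count"
      return 2
      ;;
  esac
  if [ "$total" -lt "$expected" ]; then
    echo "ALERT: $total of $expected nodes present (status: $status)"
    return 1
  fi
  echo "OK: $total nodes (status: $status)"
}
```

Example usage, assuming an unauthenticated cluster at localhost:9200 with three nodes: curl -s "localhost:9200/_cat/health?h=status,node.total" | check_node_count 3. Run it from cron or a loop and forward any ALERT line to your alerting channel.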