Rolling restarts allow cluster maintenance without downtime, but issues can arise that cause delays or failures. This guide helps troubleshoot common rolling restart problems.
Proper Rolling Restart Procedure
Before Restarting Any Node
- Check cluster health:
GET /_cluster/health
GET /_cat/nodes?v
- Disable shard allocation:
PUT /_cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": "primaries"
}
}
- Optional: Flush for faster recovery (synced flush is deprecated in 7.6 and removed in 8.0; use a plain POST /_flush on 8.x). A combined pre-flight sketch follows this list:
POST /_flush/synced
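The pre-restart checks can be wrapped into one small pre-flight script. A minimal sketch, assuming the cluster answers plain HTTP on localhost:9200 and that curl and jq are available; add authentication and TLS options for secured clusters:
#!/bin/bash
ES_HOST="localhost:9200"
# Abort unless the cluster is currently green
STATUS=$(curl -s "$ES_HOST/_cluster/health" | jq -r '.status')
if [ "$STATUS" != "green" ]; then
  echo "Cluster is $STATUS, not green - aborting" >&2
  exit 1
fi
# Allow only primary shard allocation while the node is down
curl -s -X PUT "$ES_HOST/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": { "cluster.routing.allocation.enable": "primaries" }
}'
# Flush so post-restart recovery can reuse local data
# (POST /_flush/synced on 7.x; plain /_flush on 8.x)
curl -s -X POST "$ES_HOST/_flush"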
During Node Restart
- Stop the node:
systemctl stop elasticsearch
- Perform maintenance (config changes, updates, etc.)
- Start the node:
systemctl start elasticsearch
- Wait for node to join:
GET /_cat/nodes?v
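Instead of re-running _cat/nodes by hand, the rejoin can be polled in a loop. A small sketch; ES_NODE_NAME is a placeholder for the restarted node's name, and the coordinating host is assumed to stay reachable on localhost:9200:
ES_NODE_NAME="es-data-01"   # placeholder - replace with the node you restarted
until curl -s "localhost:9200/_cat/nodes?h=name" | grep -qw "$ES_NODE_NAME"; do
  echo "Waiting for $ES_NODE_NAME to rejoin..."
  sleep 5
done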
After Node Rejoins
- Re-enable allocation:
PUT /_cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": "all"
}
}
- Wait for green status:
GET /_cluster/health?wait_for_status=green&timeout=10m
- Proceed to next node
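The re-enable-and-wait steps can also be chained in one short snippet. A sketch under the same assumptions as above (plain HTTP on localhost:9200):
# Re-enable full shard allocation
curl -s -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'
# Block until the cluster is green again (or the timeout expires)
curl -s "localhost:9200/_cluster/health?wait_for_status=green&timeout=10m&pretty"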
Common Problems and Solutions
Problem 1: Node Won't Rejoin Cluster
Symptoms:
- Node starts but doesn't appear in _cat/nodes
- "master not discovered" messages in the logs
Diagnosis:
grep -i "master\|discovery\|join" /var/log/elasticsearch/*.log
Solutions:
- Verify discovery configuration:
# elasticsearch.yml
discovery.seed_hosts: ["host1:9300", "host2:9300", "host3:9300"]
- Check network connectivity:
nc -zv other_node_ip 9300
- Verify the cluster name matches on every node (cluster.name in elasticsearch.yml); the root endpoint shows the name the node is actually using:
GET /
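To rule out network problems quickly, the transport-port check can be looped over every seed host. A sketch; the host names are placeholders that should mirror your discovery.seed_hosts, and nc (netcat) is assumed to be installed:
# Placeholder host list - replace with your discovery.seed_hosts entries
for HOST in host1 host2 host3; do
  if nc -zv -w 3 "$HOST" 9300; then
    echo "$HOST:9300 reachable"
  else
    echo "$HOST:9300 NOT reachable" >&2
  fi
done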
Problem 2: Slow Recovery After Restart
Symptoms:
- Cluster stays yellow for extended period
- Recovery progress very slow
Diagnosis:
GET /_cat/recovery?v&active_only=true
GET /_cluster/health?level=indices
Solutions:
- Increase recovery bandwidth:
PUT /_cluster/settings
{
"transient": {
"indices.recovery.max_bytes_per_sec": "500mb"
}
}
- Flush before restart so replicas can recover from local data (synced flush applies to 7.x only; it was removed in 8.0):
POST /_flush/synced
- Check for relocation throttling:
GET /_cluster/settings?filter_path=*.indices.recovery*
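Recovery bandwidth and recovery concurrency can be raised in a single settings call. A sketch only; both settings are dynamic, the values shown are examples to be sized to your disks and network, and either can be reverted by setting it to null:
curl -s -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "500mb",
    "cluster.routing.allocation.node_concurrent_recoveries": 4
  }
}'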
Problem 3: Shard Allocation Disabled Stuck
Symptoms:
- Forgot to re-enable allocation
- Cluster stuck with primaries only
Solution:
PUT /_cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": "all"
}
}
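Setting the value back to all works; alternatively, the transient override can be removed entirely by setting it to null, which lets the cluster fall back to its default (all). A curl sketch:
curl -s -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": { "cluster.routing.allocation.enable": null }
}'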
Problem 4: Node Crashes During Restart
Symptoms:
- Node fails to start
- Errors during startup
Diagnosis:
journalctl -u elasticsearch --since "10 minutes ago"
tail -n 200 /var/log/elasticsearch/*.log
Common causes and solutions:
Configuration error:
- Validate elasticsearch.yml
- Check for syntax errors
Permission issues:
chown -R elasticsearch:elasticsearch /var/lib/elasticsearch
chown -R elasticsearch:elasticsearch /var/log/elasticsearch
Heap configuration:
- Verify heap settings in jvm.options.d/
- Ensure the heap does not exceed available memory
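A quick sanity check before starting the service can catch the ownership and heap problems above. A minimal sketch, assuming a deb/rpm install with the default paths and the heap set via a file under /etc/elasticsearch/jvm.options.d/:
#!/bin/bash
# Check that the data and log directories belong to the elasticsearch user
for DIR in /var/lib/elasticsearch /var/log/elasticsearch; do
  OWNER=$(stat -c '%U' "$DIR")
  [ "$OWNER" = "elasticsearch" ] || echo "WARNING: $DIR is owned by $OWNER" >&2
done
# Compare the configured heap (-Xmx) with physical memory
XMX=$(grep -hE '^-Xmx' /etc/elasticsearch/jvm.options.d/*.options 2>/dev/null | tail -1)
MEM_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
echo "Configured heap: ${XMX:-<not set>} ; physical memory: $((MEM_KB / 1024)) MB"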
Problem 5: Split Brain After Restart
Symptoms:
- Multiple masters elected
- Data inconsistency
Prevention:
- Never restart multiple master-eligible nodes simultaneously
- Use dedicated master nodes (minimum 3)
Recovery:
- Stop all but one master node
- Let single master stabilize
- Restart other masters one at a time
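To confirm whether nodes really disagree about the elected master, ask each one directly. A sketch; the host names are placeholders, and HTTP on port 9200 is assumed to be reachable on every host:
for HOST in es01 es02 es03; do
  echo -n "$HOST sees master: "
  curl -s "$HOST:9200/_cat/master?h=node" || echo "unreachable"
done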
Problem 6: Timeout Waiting for Green
Symptoms:
- Cluster stays yellow despite waiting
Diagnosis:
GET /_cluster/allocation/explain
Solutions:
- Check for allocation issues:
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason
- Retry allocations that previously failed too many times:
POST /_cluster/reroute?retry_failed=true
- Check disk watermarks:
GET /_cat/allocation?v
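The allocation explain output can be long; filter_path trims it to the fields that usually matter for an unassigned shard. A sketch (without a request body the API picks an arbitrary unassigned shard, and it returns an error if nothing is unassigned):
curl -s "localhost:9200/_cluster/allocation/explain?filter_path=index,shard,primary,unassigned_info.reason,can_allocate,allocate_explanation&pretty"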
Automation and Scripting
Automated Rolling Restart Script
#!/bin/bash
ES_HOST="localhost:9200"
# NOTE: alphabetical order; in practice restart data nodes first and the elected master last
NODES=$(curl -s "$ES_HOST/_cat/nodes?h=name" | sort)
for NODE in $NODES; do
echo "Processing node: $NODE"
# Disable allocation
curl -X PUT "$ES_HOST/_cluster/settings" -H 'Content-Type: application/json' -d'
{
"transient": {
"cluster.routing.allocation.enable": "primaries"
}
}'
# Flush (synced flush on 7.x only - deprecated in 7.6, removed in 8.0; use plain /_flush there)
curl -s -X POST "$ES_HOST/_flush/synced"
# Stop node (assumes the node name resolves as an SSH host; adjust command as needed)
ssh "$NODE" "systemctl stop elasticsearch"
# Wait for node to leave
while curl -s "$ES_HOST/_cat/nodes?h=name" | grep -qw "$NODE"; do
sleep 5
done
# Perform maintenance on node
# ssh $NODE "..."
# Start node
ssh $NODE "systemctl start elasticsearch"
# Wait for node to rejoin
until curl -s "$ES_HOST/_cat/nodes?h=name" | grep -qw "$NODE"; do
sleep 5
done
# Re-enable allocation
curl -X PUT "$ES_HOST/_cluster/settings" -H 'Content-Type: application/json' -d'
{
"transient": {
"cluster.routing.allocation.enable": "all"
}
}'
# Wait for green
curl -s "$ES_HOST/_cluster/health?wait_for_status=green&timeout=30m"
echo "Node $NODE completed"
done
Best Practices
Order of Restart
- Data-only nodes first
- Coordinating nodes
- Ingest nodes
- Master-eligible nodes last (one at a time, current master last)
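Node roles and the currently elected master can be read from the cat API, which makes it easy to build the restart order above. A quick sketch; the master column marks the elected master with *:
# node.role abbreviations: d = data, m = master-eligible, i = ingest, etc.
curl -s "localhost:9200/_cat/nodes?v&h=name,node.role,master"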
Timing Considerations
- Schedule during low-traffic periods
- Allow sufficient time between nodes
- Monitor throughout the process
Monitoring During Restart
# Watch cluster health
watch -n 5 'curl -s localhost:9200/_cluster/health?pretty'
# Watch node status
watch -n 5 'curl -s localhost:9200/_cat/nodes?v'
# Watch recovery progress
watch -n 5 'curl -s localhost:9200/_cat/recovery?active_only=true&v'
Checklist
- Snapshot taken before restart
- Cluster health green before starting
- Allocation disabled before stopping node
- Sync flush performed (if applicable)
- Node rejoined cluster after restart
- Allocation re-enabled
- Cluster returned to green before next node
- All nodes processed
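A final verification can be scripted as well. A small sketch that checks both health status and node count; EXPECTED_NODES is a placeholder and jq is assumed to be installed:
EXPECTED_NODES=3   # placeholder - set to your cluster size
HEALTH=$(curl -s "localhost:9200/_cluster/health")
STATUS=$(echo "$HEALTH" | jq -r '.status')
NODES=$(echo "$HEALTH" | jq -r '.number_of_nodes')
if [ "$STATUS" = "green" ] && [ "$NODES" -eq "$EXPECTED_NODES" ]; then
  echo "Rolling restart complete: $NODES nodes, status green"
else
  echo "Check the cluster: status=$STATUS, nodes=$NODES" >&2
fi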