Rolling restarts allow cluster maintenance without downtime, but issues can arise that cause delays or failures. This guide helps troubleshoot common rolling restart problems.
Proper Rolling Restart Procedure
Before Restarting Any Node
- Check cluster health:
GET /_cluster/health
GET /_cat/nodes?v
- Disable shard allocation:
PUT /_cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": "primaries"
}
}
- Optional: Flush for faster recovery (synced flush is deprecated in 7.6 and removed in 8.0; use a plain POST /_flush on 8.x). A combined pre-flight sketch follows this list:
POST /_flush/synced
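The pre-restart checks can be wrapped into one small pre-flight script. A minimal sketch, assuming the cluster answers plain HTTP on localhost:9200 and that curl and jq are available; add authentication and TLS options for secured clusters:
#!/bin/bash
ES_HOST="localhost:9200"
# Abort unless the cluster is currently green
STATUS=$(curl -s "$ES_HOST/_cluster/health" | jq -r '.status')
if [ "$STATUS" != "green" ]; then
  echo "Cluster is $STATUS, not green - aborting" >&2
  exit 1
fi
# Allow only primary shard allocation while the node is down
curl -s -X PUT "$ES_HOST/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": { "cluster.routing.allocation.enable": "primaries" }
}'
# Flush so post-restart recovery can reuse local data
# (POST /_flush/synced on 7.x; plain /_flush on 8.x)
curl -s -X POST "$ES_HOST/_flush"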
During Node Restart
- Stop the node:
systemctl stop elasticsearch
- Perform maintenance (config changes, updates, etc.)
- Start the node:
systemctl start elasticsearch
- Wait for node to join:
GET /_cat/nodes?v
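Instead of re-running _cat/nodes by hand, the rejoin can be polled in a loop. A small sketch; ES_NODE_NAME is a placeholder for the restarted node's name, and the coordinating host is assumed to stay reachable on localhost:9200:
ES_NODE_NAME="es-data-01"   # placeholder - replace with the node you restarted
until curl -s "localhost:9200/_cat/nodes?h=name" | grep -qw "$ES_NODE_NAME"; do
  echo "Waiting for $ES_NODE_NAME to rejoin..."
  sleep 5
done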
After Node Rejoins
- Re-enable allocation:
PUT /_cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": "all"
}
}
- Wait for green status:
GET /_cluster/health?wait_for_status=green&timeout=10m
- Proceed to next node
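The re-enable-and-wait steps can also be chained in one short snippet. A sketch under the same assumptions as above (plain HTTP on localhost:9200):
# Re-enable full shard allocation
curl -s -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'
# Block until the cluster is green again (or the timeout expires)
curl -s "localhost:9200/_cluster/health?wait_for_status=green&timeout=10m&pretty"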
Common Problems and Solutions
Problem 1: Node Won't Rejoin Cluster
Symptoms:
- Node starts but doesn't appear in _cat/nodes
- "master not discovered" messages in the logs
Diagnosis:
grep -i "master\|discovery\|join" /var/log/elasticsearch/*.log
Solutions:
- Verify discovery configuration:
# elasticsearch.yml
discovery.seed_hosts: ["host1:9300", "host2:9300", "host3:9300"]
- Check network connectivity:
nc -zv other_node_ip 9300
- Verify the cluster name matches on every node (cluster.name in elasticsearch.yml); the root endpoint shows the name the node is actually using:
GET /
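To rule out network problems quickly, the transport-port check can be looped over every seed host. A sketch; the host names are placeholders that should mirror your discovery.seed_hosts, and nc (netcat) is assumed to be installed:
# Placeholder host list - replace with your discovery.seed_hosts entries
for HOST in host1 host2 host3; do
  if nc -zv -w 3 "$HOST" 9300; then
    echo "$HOST:9300 reachable"
  else
    echo "$HOST:9300 NOT reachable" >&2
  fi
done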
Problem 2: Slow Recovery After Restart
Symptoms:
- Cluster stays yellow for extended period
- Recovery progress very slow
Diagnosis:
GET /_cat/recovery?v&active_only=true
GET /_cluster/health?level=indices
Solutions:
- Increase recovery bandwidth:
PUT /_cluster/settings
{
"transient": {
"indices.recovery.max_bytes_per_sec": "500mb"
}
}
- Flush before restart so replicas can recover from local data (synced flush applies to 7.x only; it was removed in 8.0):
POST /_flush/synced
- Check for relocation throttling:
GET /_cluster/settings?filter_path=*.indices.recovery*
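Recovery bandwidth and recovery concurrency can be raised in a single settings call. A sketch only; both settings are dynamic, the values shown are examples to be sized to your disks and network, and either can be reverted by setting it to null:
curl -s -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "500mb",
    "cluster.routing.allocation.node_concurrent_recoveries": 4
  }
}'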
Problem 3: Shard Allocation Disabled Stuck
Symptoms:
- Forgot to re-enable allocation
- Cluster stuck with primaries only
Solution:
PUT /_cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": "all"
}
}
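Setting the value back to all works; alternatively, the transient override can be removed entirely by setting it to null, which lets the cluster fall back to its default (all). A curl sketch:
curl -s -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": { "cluster.routing.allocation.enable": null }
}'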
Problem 4: Node Crashes During Restart
Symptoms:
- Node fails to start
- Errors during startup
Diagnosis:
journalctl -u elasticsearch --since "10 minutes ago"
tail -n 200 /var/log/elasticsearch/*.log
Common causes and solutions:
Configuration error:
- Validate elasticsearch.yml
- Check for syntax errors
Permission issues:
chown -R elasticsearch:elasticsearch /var/lib/elasticsearch
chown -R elasticsearch:elasticsearch /var/log/elasticsearch
Heap configuration:
- Verify heap settings in jvm.options.d/
- Ensure the heap does not exceed available memory
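A quick sanity check before starting the service can catch the ownership and heap problems above. A minimal sketch, assuming a deb/rpm install with the default paths and the heap set via a file under /etc/elasticsearch/jvm.options.d/:
#!/bin/bash
# Check that the data and log directories belong to the elasticsearch user
for DIR in /var/lib/elasticsearch /var/log/elasticsearch; do
  OWNER=$(stat -c '%U' "$DIR")
  [ "$OWNER" = "elasticsearch" ] || echo "WARNING: $DIR is owned by $OWNER" >&2
done
# Compare the configured heap (-Xmx) with physical memory
XMX=$(grep -hE '^-Xmx' /etc/elasticsearch/jvm.options.d/*.options 2>/dev/null | tail -1)
MEM_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
echo "Configured heap: ${XMX:-<not set>} ; physical memory: $((MEM_KB / 1024)) MB"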
Problem 5: Split Brain After Restart
Symptoms:
- Multiple masters elected
- Data inconsistency
Prevention:
- Never restart multiple master-eligible nodes simultaneously
- Use dedicated master nodes (minimum 3)
Recovery:
- Stop all but one master node
- Let single master stabilize
- Restart other masters one at a time
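To confirm whether nodes really disagree about the elected master, ask each one directly. A sketch; the host names are placeholders, and HTTP on port 9200 is assumed to be reachable on every host:
for HOST in es01 es02 es03; do
  echo -n "$HOST sees master: "
  curl -s "$HOST:9200/_cat/master?h=node" || echo "unreachable"
done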
Problem 6: Timeout Waiting for Green
Symptoms:
- Cluster stays yellow despite waiting
Diagnosis:
GET /_cluster/allocation/explain
Solutions:
- Check for allocation issues:
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason
- Retry allocations that previously failed too many times:
POST /_cluster/reroute?retry_failed=true
- Check disk watermarks:
GET /_cat/allocation?v
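The allocation explain output can be long; filter_path trims it to the fields that usually matter for an unassigned shard. A sketch (without a request body the API picks an arbitrary unassigned shard, and it returns an error if nothing is unassigned):
curl -s "localhost:9200/_cluster/allocation/explain?filter_path=index,shard,primary,unassigned_info.reason,can_allocate,allocate_explanation&pretty"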
Automation and Scripting
Automated Rolling Restart Script
#!/bin/bash
ES_HOST="localhost:9200"
# NOTE: alphabetical order; in practice restart data nodes first and the elected master last
NODES=$(curl -s "$ES_HOST/_cat/nodes?h=name" | sort)
for NODE in $NODES; do
echo "Processing node: $NODE"
# Disable allocation
curl -X PUT "$ES_HOST/_cluster/settings" -H 'Content-Type: application/json' -d'
{
"transient": {
"cluster.routing.allocation.enable": "primaries"
}
}'
# Flush (synced flush on 7.x only - deprecated in 7.6, removed in 8.0; use plain /_flush there)
curl -s -X POST "$ES_HOST/_flush/synced"
# Stop node (assumes the node name resolves as an SSH host; adjust command as needed)
ssh "$NODE" "systemctl stop elasticsearch"
# Wait for node to leave
while curl -s "$ES_HOST/_cat/nodes?h=name" | grep -qw "$NODE"; do
sleep 5
done
# Perform maintenance on node
# ssh $NODE "..."
# Start node
ssh $NODE "systemctl start elasticsearch"
# Wait for node to rejoin
until curl -s "$ES_HOST/_cat/nodes?h=name" | grep -qw "$NODE"; do
sleep 5
done
# Re-enable allocation
curl -X PUT "$ES_HOST/_cluster/settings" -H 'Content-Type: application/json' -d'
{
"transient": {
"cluster.routing.allocation.enable": "all"
}
}'
# Wait for green
curl -s "$ES_HOST/_cluster/health?wait_for_status=green&timeout=30m"
echo "Node $NODE completed"
done
Best Practices
Order of Restart
- Data-only nodes first
- Coordinating nodes
- Ingest nodes
- Master-eligible nodes last (one at a time, current master last)
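Node roles and the currently elected master can be read from the cat API, which makes it easy to build the restart order above. A quick sketch; the master column marks the elected master with *:
# node.role abbreviations: d = data, m = master-eligible, i = ingest, etc.
curl -s "localhost:9200/_cat/nodes?v&h=name,node.role,master"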
Timing Considerations
- Schedule during low-traffic periods
- Allow sufficient time between nodes
- Monitor throughout the process
Monitoring During Restart
# Watch cluster health
watch -n 5 'curl -s localhost:9200/_cluster/health?pretty'
# Watch node status
watch -n 5 'curl -s localhost:9200/_cat/nodes?v'
# Watch recovery progress
watch -n 5 'curl -s localhost:9200/_cat/recovery?active_only=true&v'
Checklist
- Snapshot taken before restart
- Cluster health green before starting
- Allocation disabled before stopping node
- Sync flush performed (if applicable)
- Node rejoined cluster after restart
- Allocation re-enabled
- Cluster returned to green before next node
- All nodes processed
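A final verification can be scripted as well. A small sketch that checks both health status and node count; EXPECTED_NODES is a placeholder and jq is assumed to be installed:
EXPECTED_NODES=3   # placeholder - set to your cluster size
HEALTH=$(curl -s "localhost:9200/_cluster/health")
STATUS=$(echo "$HEALTH" | jq -r '.status')
NODES=$(echo "$HEALTH" | jq -r '.number_of_nodes')
if [ "$STATUS" = "green" ] && [ "$NODES" -eq "$EXPECTED_NODES" ]; then
  echo "Rolling restart complete: $NODES nodes, status green"
else
  echo "Check the cluster: status=$STATUS, nodes=$NODES" >&2
fi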