Elasticsearch Rolling Restart Problems

Rolling restarts allow cluster maintenance without downtime, but issues can arise that cause delays or failures. This guide helps troubleshoot common rolling restart problems.

Proper Rolling Restart Procedure

Before Restarting Any Node

  1. Check cluster health:
GET /_cluster/health
GET /_cat/nodes?v
  2. Disable shard allocation:
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "primaries"
  }
}
  3. Optional: sync flush for faster recovery (ES 7.x only; synced flush was removed in 8.x, where a plain flush serves the same purpose, see the version-aware sketch below):
POST /_flush/synced
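
Because synced flush no longer exists on 8.x, a restart script that has to work across versions can branch on the major version. A minimal sketch, assuming curl access to the cluster and the usual JSON layout of the root endpoint:

#!/bin/bash
# Flush before restart: synced flush exists only on 7.x; on 8.x a plain
# flush achieves the same goal of minimizing recovery work.
ES_HOST="localhost:9200"
MAJOR=$(curl -s "$ES_HOST" | grep -o '"number" *: *"[0-9]*' | grep -o '[0-9]*$')
if [ "$MAJOR" -ge 8 ]; then
    curl -X POST "$ES_HOST/_flush"
else
    curl -X POST "$ES_HOST/_flush/synced"
fi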

During Node Restart

  1. Stop the node:
systemctl stop elasticsearch
  2. Perform maintenance (config changes, updates, etc.)
  3. Start the node:
systemctl start elasticsearch
  4. Wait for the node to rejoin (a simple wait loop is sketched below):
GET /_cat/nodes?v
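
A simple way to script the wait is to poll the node list until the restarted node's name reappears. A sketch; the node name is a placeholder for whichever node you just restarted:

#!/bin/bash
# Poll _cat/nodes until the restarted node is listed again.
ES_HOST="localhost:9200"
NODE_NAME="es-data-1"   # placeholder: the node you restarted
until curl -s "$ES_HOST/_cat/nodes?h=name" | grep -qw "$NODE_NAME"; do
    sleep 5
done
echo "$NODE_NAME has rejoined the cluster"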

After Node Rejoins

  1. Re-enable allocation:
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}
  2. Wait for green status:
GET /_cluster/health?wait_for_status=green&timeout=10m
  3. Proceed to the next node

Common Problems and Solutions

Problem 1: Node Won't Rejoin Cluster

Symptoms:

  • Node starts but doesn't appear in _cat/nodes
  • "master not discovered" in logs

Diagnosis:

grep -i "master\|discovery\|join" /var/log/elasticsearch/*.log

Solutions:

  1. Verify discovery configuration:
# elasticsearch.yml
discovery.seed_hosts: ["host1:9300", "host2:9300", "host3:9300"]
  2. Check network connectivity:
nc -zv other_node_ip 9300
  3. Verify the cluster name matches (cluster.name is a static setting from elasticsearch.yml, so compare it against the cluster_name reported by the root endpoint; see the check below):
GET /
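
To compare the two quickly from the node itself (paths assume a package install; adjust for your layout):

# cluster_name as reported by the running cluster
curl -s "localhost:9200/" | grep cluster_name
# cluster.name as configured on the restarted node
# (no output means the default "elasticsearch" is in use)
grep '^cluster.name' /etc/elasticsearch/elasticsearch.yml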

Problem 2: Slow Recovery After Restart

Symptoms:

  • Cluster stays yellow for extended period
  • Recovery progress very slow

Diagnosis:

GET /_cat/recovery?v&active_only=true
GET /_cluster/health?level=indices

Solutions:

  1. Increase recovery bandwidth:
PUT /_cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "500mb"
  }
}
  2. Use sync flush before restart (ES 7.x only; removed in 8.x, where a plain POST /_flush suffices):
POST /_flush/synced
  3. Check for recovery throttling:
GET /_cluster/settings?filter_path=*.indices.recovery*
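
If bandwidth is not the bottleneck, the number of concurrent recoveries per node can also be raised. A hedged example; the value 4 is illustrative, not a recommendation for every cluster:

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 4
  }
}'

As with the bandwidth override, remember to reset transient settings like this once recovery finishes.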

Problem 3: Shard Allocation Left Disabled

Symptoms:

  • Allocation was never re-enabled after a restart
  • Cluster stuck yellow with only primary shards assigned

Solution:

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}
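
Before applying the fix, you can confirm that allocation is in fact still restricted. A quick grep-based check, assuming curl access to the cluster:

# Shows the effective value of the allocation setting across transient,
# persistent, and default settings.
curl -s "localhost:9200/_cluster/settings?flat_settings=true&include_defaults=true" | grep 'allocation.enable'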

Problem 4: Node Crashes During Restart

Symptoms:

  • Node fails to start
  • Errors during startup

Diagnosis:

journalctl -u elasticsearch --since "10 minutes ago"
cat /var/log/elasticsearch/*.log | tail -200

Common causes and solutions:

  1. Configuration error:
    • Validate elasticsearch.yml
    • Check for syntax errors (a single YAML typo prevents startup)
  2. Permission issues:
chown -R elasticsearch:elasticsearch /var/lib/elasticsearch
chown -R elasticsearch:elasticsearch /var/log/elasticsearch
  3. Heap configuration:
    • Verify heap settings in jvm.options.d/
    • Ensure the heap does not exceed available memory (an example override follows this list)
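
As an illustration, a heap override under jvm.options.d might look like the following. The file name and the 8g value are arbitrary, and the path assumes a package install:

# Create a dedicated heap override; -Xms and -Xmx should match and stay
# well below the host's physical RAM so the OS page cache keeps headroom.
cat <<'EOF' | sudo tee /etc/elasticsearch/jvm.options.d/heap.options
-Xms8g
-Xmx8g
EOF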

Problem 5: Split Brain After Restart

Symptoms:

  • Multiple masters elected
  • Data inconsistency

Prevention:

  • Never restart master nodes simultaneously
  • Use dedicated master nodes (minimum 3)

Recovery:

  1. Stop all but one master node
  2. Let single master stabilize
  3. Restart other masters one at a time
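
Throughout the recovery, verify that exactly one master is elected, for example:

# The elected master; during a healthy rolling restart this should always
# report a single, stable node.
curl -s "localhost:9200/_cat/master?v"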

Problem 6: Timeout Waiting for Green

Symptoms:

  • Cluster stays yellow despite waiting

Diagnosis:

GET /_cluster/allocation/explain

Solutions:

  1. Check for allocation issues:
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason
  2. Retry failed shard allocations:
POST /_cluster/reroute?retry_failed=true
  3. Check disk watermarks:
GET /_cat/allocation?v
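
When several shards are unassigned, the explain API can be pointed at a specific one. A sketch; the index name and shard number are placeholders, take them from the _cat/shards output above:

curl -X POST "localhost:9200/_cluster/allocation/explain" -H 'Content-Type: application/json' -d'
{
  "index": "my-index",
  "shard": 0,
  "primary": false
}'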

Automation and Scripting

Automated Rolling Restart Script

#!/bin/bash
set -euo pipefail

ES_HOST="localhost:9200"
NODES=$(curl -s "$ES_HOST/_cat/nodes?h=name" | sort)

for NODE in $NODES; do
    echo "Processing node: $NODE"

    # Disable replica allocation
    curl -X PUT "$ES_HOST/_cluster/settings" -H 'Content-Type: application/json' -d'
    {
        "transient": {
            "cluster.routing.allocation.enable": "primaries"
        }
    }'

    # Sync flush (ES 7.x only; on 8.x use POST /_flush instead)
    curl -X POST "$ES_HOST/_flush/synced"

    # Stop node (adjust command as needed)
    ssh "$NODE" "systemctl stop elasticsearch"

    # Wait for the node to leave the cluster
    while curl -s "$ES_HOST/_cat/nodes?h=name" | grep -qw "$NODE"; do
        sleep 5
    done

    # Perform maintenance on node
    # ssh "$NODE" "..."

    # Start node
    ssh "$NODE" "systemctl start elasticsearch"

    # Wait for the node to rejoin
    until curl -s "$ES_HOST/_cat/nodes?h=name" | grep -qw "$NODE"; do
        sleep 5
    done

    # Re-enable allocation
    curl -X PUT "$ES_HOST/_cluster/settings" -H 'Content-Type: application/json' -d'
    {
        "transient": {
            "cluster.routing.allocation.enable": "all"
        }
    }'

    # Wait for green before moving on
    curl -s "$ES_HOST/_cluster/health?wait_for_status=green&timeout=30m"

    echo "Node $NODE completed"
done

Best Practices

Order of Restart

  1. Data-only nodes first
  2. Coordinating nodes
  3. Ingest nodes
  4. Master-eligible nodes last (one at a time, with the current master last; the command below lists each node's roles)
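
To see which restart group each node belongs to, list the node roles; the master column marks the currently elected master with an asterisk:

curl -s "localhost:9200/_cat/nodes?v&h=name,node.role,master"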

Timing Considerations

  • Schedule during low-traffic periods
  • Allow sufficient time between nodes
  • Monitor throughout the process

Monitoring During Restart

# Watch cluster health
watch -n 5 'curl -s localhost:9200/_cluster/health?pretty'

# Watch node status
watch -n 5 'curl -s localhost:9200/_cat/nodes?v'

# Watch recovery progress
watch -n 5 'curl -s localhost:9200/_cat/recovery?active_only=true&v'

Checklist

  • Snapshot taken before restart (an example command appears after this checklist)
  • Cluster health green before starting
  • Allocation disabled before stopping node
  • Sync flush performed (if applicable)
  • Node rejoined cluster after restart
  • Allocation re-enabled
  • Cluster returned to green before next node
  • All nodes processed
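
For the snapshot item, a hedged example; "my_backup_repo" is a placeholder and must already be registered as a snapshot repository in your cluster:

# Take a snapshot of all indices before starting the rolling restart.
curl -X PUT "localhost:9200/_snapshot/my_backup_repo/pre-restart-$(date +%Y%m%d)?wait_for_completion=true"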