Elasticsearch Master Not Discovered Diagnosis

The "master not discovered" error occurs when Elasticsearch nodes cannot elect a master node or connect to an existing one. Without a master, the cluster cannot form and nodes cannot join it. This guide helps diagnose and resolve master discovery issues.

Understanding Master Discovery

How Master Election Works

Nodes use seed hosts to find other nodes
Master-eligible nodes vote to elect a master
A quorum (majority) must agree on the master
Once elected, the master coordinates the cluster

Common Error Messages

MasterNotDiscoveredException: master node is not discovered yet...
master not discovered or elected yet, an election requires...
master not discovered yet, this node has not previously joined...
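
When scanning many log lines, it can help to map each of the messages above to a likely area to investigate. The following is a minimal sketch: the marker strings are taken from the messages listed above, while the hint texts and the `classify` helper are illustrative assumptions, not Elasticsearch output.

```python
# Marker substrings come from the error messages listed in this guide.
# The hint texts are rough summaries for triage, not official wording.
KNOWN_MESSAGES = {
    "MasterNotDiscoveredException": "a request needed a master, but none is elected",
    "an election requires": "master-eligible nodes cannot reach a quorum",
    "this node has not previously joined": "the node has never joined or bootstrapped a cluster",
}


def classify(log_line):
    """Return a short hint for a known discovery error, or None if no marker matches."""
    for marker, hint in KNOWN_MESSAGES.items():
        if marker in log_line:
            return hint
    return None


print(classify("master not discovered or elected yet, an election requires at least 2 nodes"))
```

A line that matches none of the markers returns `None`, so unrelated log noise is skipped.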

Diagnostic Steps

Step 1: Check Node Status

GET /_cat/nodes?v&h=name,ip,node.role,master
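
The output of this request is a whitespace-separated table in which the elected master is marked with `*` in the `master` column. As a rough sketch, the table can be parsed to find the master and the master-eligible nodes. The sample text below is hypothetical, and the parser assumes node names contain no spaces.

```python
def parse_cat_nodes(text):
    """Parse whitespace-separated _cat output with a header row into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def elected_master(rows):
    """Return the name of the node marked '*' in the master column, or None."""
    for row in rows:
        if row.get("master") == "*":
            return row["name"]
    return None


# Hypothetical sample output, for illustration only.
SAMPLE = """\
name   ip           node.role master
node-1 192.168.1.10 dm        *
node-2 192.168.1.11 dm        -
node-3 192.168.1.12 d         -
"""

rows = parse_cat_nodes(SAMPLE)
print("elected master:", elected_master(rows))
print("master-eligible:", [r["name"] for r in rows if "m" in r["node.role"]])
```

If `elected_master` returns `None`, no node in the response is currently acting as master.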

Step 2: Check Master Status

GET /_cat/master?v

Step 3: Review Discovery Configuration

grep -E "discovery|cluster.initial_master" /etc/elasticsearch/elasticsearch.yml

Step 4: Check Logs

grep -i "master\|discovery\|election" /var/log/elasticsearch/*.log | tail -50

Common Causes and Solutions

Cause 1: Incorrect Discovery Configuration

Problem: Nodes can't find each other during bootstrap

Diagnosis:

# Check elasticsearch.yml
# Are seed hosts correct?
# Is cluster.initial_master_nodes set?

Solution for Elasticsearch 7.x+:

# elasticsearch.yml
discovery.seed_hosts:
  - 192.168.1.10:9300
  - 192.168.1.11:9300
  - 192.168.1.12:9300

# Only needed for initial cluster bootstrap
cluster.initial_master_nodes:
  - node-1
  - node-2
  - node-3

Important: Remove cluster.initial_master_nodes once the cluster has formed. Leaving it in place risks accidentally bootstrapping a second, separate cluster when nodes restart.

Cause 2: Network Connectivity Issues

Problem: Nodes can't communicate on transport port

Diagnosis:

# From each node, test connectivity
nc -zv <other_node_ip> 9300
ping <other_node_ip>
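
Where `nc` is not installed, the same TCP check can be approximated with the Python standard library. This is a minimal sketch; `can_connect` is a helper name chosen for illustration, and 9300 is the transport port used throughout this guide.

```python
import socket


def can_connect(host, port=9300, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `can_connect("192.168.1.11")` run from the first node reports whether the second node's transport port is reachable. A `False` result does not distinguish a firewall drop from a stopped node, so check both.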

Solutions:

Open the transport port range (9300-9400 by default) between all nodes in the firewall
Check security groups (cloud environments)
Verify network routing

Cause 3: Insufficient Master-Eligible Nodes

Problem: Not enough nodes for quorum

Quorum calculation: floor(master_eligible_nodes / 2) + 1

For example, 3 master-eligible nodes need 2 for quorum, so the cluster tolerates the loss of one of them.
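
The quorum formula can be worked through for a few cluster sizes. This sketch uses integer division to express the majority rule; the `quorum` function name is illustrative.

```python
def quorum(master_eligible_nodes):
    """Smallest majority of the master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


for n in (1, 2, 3, 4, 5):
    tolerated = n - quorum(n)
    print(f"{n} master-eligible -> quorum {quorum(n)}, tolerates {tolerated} failure(s)")
```

Note that 4 master-eligible nodes tolerate no more failures than 3 do, which is why odd counts are the usual choice.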

Diagnosis:

GET /_cat/nodes?v&h=name,node.role
# Look for 'm' in node.role

Solution: Ensure enough master-eligible nodes are running:

# On master-eligible nodes that also hold data
node.roles: [master, data]
# Or, for dedicated master nodes, use this line instead
node.roles: [master]

Cause 4: Split Brain Recovery

Problem: Cluster previously split, nodes have conflicting state

Diagnosis:

# Check cluster UUID in logs
grep "cluster.uuid" /var/log/elasticsearch/*.log
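
To compare UUIDs across several nodes, the matching log lines can be grouped per source. The sketch below assumes JSON-formatted log lines that carry a `"cluster.uuid": "<value>"` field; the sample lines and UUID values are hypothetical.

```python
import re
from collections import defaultdict

# Assumes a JSON field such as "cluster.uuid": "abc123" in each log line.
UUID_FIELD = re.compile(r'cluster\.uuid"\s*:\s*"([^"]+)"')


def uuids_by_source(lines_by_source):
    """Map each source (a node or log file name) to the set of cluster UUIDs it logged."""
    found = defaultdict(set)
    for source, lines in lines_by_source.items():
        for line in lines:
            found[source].update(UUID_FIELD.findall(line))
    return dict(found)


def has_conflict(found):
    """True if more than one distinct cluster UUID appears across all sources."""
    return len(set().union(*found.values())) > 1


# Hypothetical log lines, for illustration only.
sample = {
    "node-1": ['{"node.name": "node-1", "cluster.uuid": "aaa111"}'],
    "node-2": ['{"node.name": "node-2", "cluster.uuid": "bbb222"}'],
}
found = uuids_by_source(sample)
print(found, "conflict:", has_conflict(found))
```

More than one distinct UUID across nodes that should belong to the same cluster points to conflicting cluster state.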

Solution:

Stop all nodes
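Clear the cluster state on minority nodes only if needed (WARNING: this discards the node's cluster metadata and can make the data on that node unusable):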

rm -rf /var/lib/elasticsearch/nodes/0/_state/*

Restart master-eligible nodes first
Then restart data nodes

Cause 5: DNS Resolution Failures

Problem: Hostname resolution is slow or failing

Diagnosis:

nslookup <node_hostname>
time nslookup <node_hostname>
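
The same measurement can be scripted so that it uses the system resolver, which is closer to what a running process sees than `nslookup`. A minimal sketch using only the standard library; `time_resolution` is an illustrative helper name.

```python
import socket
import time


def time_resolution(hostname):
    """Return (addresses, seconds) for one lookup; addresses is empty on failure."""
    start = time.perf_counter()
    try:
        infos = socket.getaddrinfo(hostname, None)
        # Each entry's sockaddr starts with the resolved address.
        addresses = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        addresses = []
    return addresses, time.perf_counter() - start
```

Calling `time_resolution("es-node-1")` on each node shows both what the name resolves to and how long the lookup took. An empty address list or a lookup that takes seconds suggests DNS is the problem.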

Solution: Use IP addresses:

discovery.seed_hosts:
  - 192.168.1.10
  - 192.168.1.11
  - 192.168.1.12

Or add to /etc/hosts:

192.168.1.10 es-node-1
192.168.1.11 es-node-2
192.168.1.12 es-node-3

Cause 6: Long GC Pauses

Problem: GC pauses cause nodes to be considered dead

Diagnosis:

# GC lines whose duration is reported in seconds or minutes (1s or longer)
grep "\[gc\]" /var/log/elasticsearch/*.log | grep -E "(duration|spent) \[[0-9.]+(s|m)\]"
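
For a configurable threshold, the durations can be parsed rather than pattern-matched. This sketch assumes GC log lines contain `[gc]` followed by a bracketed duration such as `[850ms]` or `[1.2s]`; the sample lines are hypothetical and the exact log format may differ between versions.

```python
import re

# Matches a bracketed duration with a unit, e.g. [850ms], [1.2s], [2.5m].
DURATION = re.compile(r"\[(\d+(?:\.\d+)?)(ms|s|m|h)\]")
TO_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000}


def first_duration_ms(line):
    """Return the first bracketed duration in the line in milliseconds, or None."""
    match = DURATION.search(line)
    if match is None:
        return None
    value, unit = match.groups()
    return float(value) * TO_MS[unit]


def long_gc_lines(lines, threshold_ms=1000):
    """Return GC log lines whose first duration is at or above the threshold."""
    result = []
    for line in lines:
        if "[gc]" not in line:
            continue
        duration = first_duration_ms(line)
        if duration is not None and duration >= threshold_ms:
            result.append(line)
    return result


# Hypothetical log lines, for illustration only.
sample_gc = [
    "[gc][young][120][14] duration [300ms], collections [1]/[1s]",
    "[gc][old][121][2] duration [12.5s], collections [1]/[13s]",
    "[node-1] started",
]
for line in long_gc_lines(sample_gc):
    print(line)
```

Pauses that approach the fault detection timeout are the ones likely to make a node appear dead to the rest of the cluster.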

Solutions:

Reduce heap pressure
Increase the leader check timeout used by fault detection (default 10s):

cluster.fault_detection.leader_check.timeout: 30s

Cause 7: Leftover Cluster State

Problem: Node has state from different cluster

Diagnosis:

# Check for cluster UUID mismatch in logs
grep "cluster.uuid" /var/log/elasticsearch/*.log

Solution:

# Stop Elasticsearch
systemctl stop elasticsearch

# Clear cluster state (WARNING: data loss for this node)
rm -rf /var/lib/elasticsearch/nodes/0/_state

# Restart
systemctl start elasticsearch

Bootstrap a New Cluster

For Initial Setup

# elasticsearch.yml on ALL master-eligible nodes

cluster.name: my-cluster

node.name: node-1  # Unique per node

discovery.seed_hosts:
  - 192.168.1.10
  - 192.168.1.11
  - 192.168.1.12

# Use node names that match node.name
cluster.initial_master_nodes:
  - node-1
  - node-2
  - node-3
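
Because these settings must line up across nodes, a quick consistency check before the first start can catch typos. The following is a sketch under assumptions: each node's settings are represented as a flat Python dict keyed by setting name (loading them from elasticsearch.yml is left out), and `check_bootstrap_configs` is an illustrative helper.

```python
def check_bootstrap_configs(configs):
    """Return a list of problems found across per-node settings dicts."""
    problems = []
    names = [cfg.get("node.name") for cfg in configs]
    if None in names or len(set(names)) != len(names):
        problems.append("node.name must be set and unique on every node")
    if len({cfg.get("cluster.name") for cfg in configs}) > 1:
        problems.append("cluster.name differs between nodes")
    initial_lists = {
        tuple(sorted(cfg.get("cluster.initial_master_nodes", []))) for cfg in configs
    }
    if len(initial_lists) > 1:
        problems.append("cluster.initial_master_nodes differs between nodes")
    for initial in initial_lists:
        missing = [name for name in initial if name not in names]
        if missing:
            problems.append(f"cluster.initial_master_nodes lists unknown node names: {missing}")
    return problems


# Settings mirroring the example above, one dict per node.
nodes = [
    {
        "cluster.name": "my-cluster",
        "node.name": f"node-{i}",
        "cluster.initial_master_nodes": ["node-1", "node-2", "node-3"],
    }
    for i in (1, 2, 3)
]
print(check_bootstrap_configs(nodes))
```

An empty list means the names are unique, the cluster name matches, and every entry in cluster.initial_master_nodes corresponds to a configured node.name.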

Start Sequence

Configure all master-eligible nodes
Start all master-eligible nodes (roughly simultaneously)
Wait for master election
Start data-only nodes
Remove cluster.initial_master_nodes from config

Recovery Procedures

Single Node Won't Join

Check logs for specific errors
Verify network connectivity
Compare configuration with working nodes
Clear state if necessary

Entire Cluster Down

Identify which node was last master:

grep "elected-as-master" /var/log/elasticsearch/*.log

Start that node first
Start other master-eligible nodes
Start data nodes

After Network Partition

# Check for unassigned shards
GET /_cluster/allocation/explain

# Retry shard allocations that previously failed, if needed
POST /_cluster/reroute?retry_failed=true

Configuration Best Practices

Production Configuration

# elasticsearch.yml

cluster.name: production-cluster
node.name: ${HOSTNAME}

# Network
network.host: 0.0.0.0
transport.port: 9300

# Discovery
discovery.seed_hosts:
  - master1.internal:9300
  - master2.internal:9300
  - master3.internal:9300

# Fault detection
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.interval: 2s
cluster.fault_detection.follower_check.timeout: 30s

Minimum Master Nodes (ES 6.x and earlier)

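For older versions, set the quorum size explicitly:

discovery.zen.minimum_master_nodes: 2  # For a cluster with 3 master-eligible nodes

In 7.x and later this setting is no longer needed; the cluster manages its voting quorum automatically.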

Monitoring

Track Master Elections

# Count master elections in logs
grep "elected-as-master" /var/log/elasticsearch/*.log | wc -l

Alert Conditions

More than 1 master election per hour
Master not elected within 5 minutes of node startup
Node unable to join cluster
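
The first alert condition can be evaluated from the same log lines used to count elections. This sketch assumes each log line carries an ISO-style timestamp such as 2024-05-01T12:03:10; the sample lines are hypothetical and the function names are illustrative.

```python
import re
from collections import Counter

# Captures the date and hour, e.g. "2024-05-01T12", from an ISO-style timestamp.
HOUR = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}):\d{2}:\d{2}")


def elections_per_hour(lines):
    """Count 'elected-as-master' log lines per hour bucket."""
    counts = Counter()
    for line in lines:
        if "elected-as-master" not in line:
            continue
        match = HOUR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def hours_to_alert(counts, max_per_hour=1):
    """Return the hour buckets with more elections than the allowed maximum."""
    return sorted(hour for hour, n in counts.items() if n > max_per_hour)


# Hypothetical log lines, for illustration only.
sample_log = [
    "[2024-05-01T12:03:10,120][INFO ] [node-1] elected-as-master",
    "[2024-05-01T12:41:55,007][INFO ] [node-2] elected-as-master",
    "[2024-05-01T14:10:00,500][INFO ] [node-1] elected-as-master",
    "[2024-05-01T14:11:00,000][INFO ] [node-1] started",
]
print(hours_to_alert(elections_per_hour(sample_log)))
```

Buckets are aligned to clock hours rather than a sliding window, so two elections just either side of an hour boundary are not flagged; a sliding window would be stricter.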
