Elasticsearch Cluster Status Yellow (Unassigned Replicas): Causes and Fixes

A yellow cluster health status in Elasticsearch means every primary shard is allocated and serving traffic, but at least one replica shard is unassigned. The cluster is fully operational (reads and writes succeed), but the affected indices have no redundancy. If the node hosting an unreplicated primary fails, that index goes red and writes to it stop until the primary is recovered. Yellow is the cluster's polite way of saying "fix this before something else breaks". Common causes are too few data nodes for the configured replicas, disk-watermark blocks on the only candidate nodes, allocation filters that exclude every possible target, and cluster.routing.allocation.enable set to a value other than all.

What Yellow Status Means

The cluster health API returns status: yellow when:

  • All primary shards are assigned and active.
  • One or more replica shards (sometimes every replica of an index) are unassigned.
  • No primary is missing.

If a primary is unassigned, status is red, not yellow. The distinction matters: yellow means "no data loss yet, but you have lost redundancy"; red means "some data is not currently readable".

GET /_cluster/health

Response includes active_shards_percent_as_number, unassigned_shards, and initializing_shards. If unassigned_shards > 0 and status: "yellow", you have unassigned replicas.
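The check described above can be sketched in Python. This is a minimal illustration that works on an already-parsed response body; the function name summarize_health and the sample values are made up for the example, and only the field names come from the health API.

```python
def summarize_health(health):
    """Turn a parsed _cluster/health response body into a one-line summary."""
    status = health["status"]
    unassigned = health.get("unassigned_shards", 0)
    initializing = health.get("initializing_shards", 0)
    if status == "red":
        return "red: at least one primary shard is unassigned"
    if status == "yellow" and unassigned > 0:
        # All primaries are active in yellow, so unassigned shards are replicas.
        return f"yellow: {unassigned} replica shard(s) unassigned"
    if status == "yellow":
        return f"yellow: {initializing} shard(s) still initializing"
    return "green: all primaries and replicas are assigned"


# Hand-written sample values, not captured from a real cluster.
sample = {
    "status": "yellow",
    "unassigned_shards": 3,
    "initializing_shards": 0,
    "active_shards_percent_as_number": 90.0,
}
print(summarize_health(sample))
```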

Common Causes

  1. Not enough data nodes for the replica count. An index with number_of_replicas: 1 needs at least 2 data nodes; replicas of 2 need 3 nodes. Single-node dev clusters are permanently yellow for this reason.
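  2. Disk watermarks blocking allocation. When a data node's disk usage exceeds the low watermark (cluster.routing.allocation.disk.watermark.low, default 85%), Elasticsearch stops allocating new shards to it, replicas included. Above the high watermark (cluster.routing.allocation.disk.watermark.high, default 90%), it also tries to relocate shards away from the node.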
  3. cluster.routing.allocation.enable set to none or primaries. Replicas cannot allocate until the setting is restored to all.
  4. Allocation filters or awareness rules excluding all candidates. index.routing.allocation.exclude.*, awareness attributes, or cluster.routing.allocation.same_shard.host: true can leave a shard with no legal home.
  5. Stuck recovery on a returning node. A node that left and rejoined may still be replaying translog or rebuilding shards.
  6. A failed node not yet replaced. A primary stays allocated, but its replica's previous home is gone.
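The arithmetic behind cause 1 (an index needs at least replicas + 1 data nodes) can be checked with a few lines of Python. This is a sketch only: the function name underprovisioned and the index names are hypothetical, and the replica counts would in practice come from your index settings.

```python
def underprovisioned(index_replicas, data_nodes):
    """Return {index: extra_nodes_needed} for indices that cannot place all copies.

    index_replicas maps an index name to its number_of_replicas.
    Each index needs replicas + 1 data nodes, because a primary and its
    replica are never placed on the same node.
    """
    return {
        name: replicas + 1 - data_nodes
        for name, replicas in index_replicas.items()
        if replicas + 1 > data_nodes
    }


# Hypothetical indices on a two-data-node cluster.
settings = {"logs": 1, "metrics": 2, "scratch": 0}
print(underprovisioned(settings, data_nodes=2))
```

On a single-node cluster the same call flags every index with one or more replicas, which is why such clusters stay yellow.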

How to Diagnose

Step 1 - list shards by index, with their state and unassigned reason:

GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state

The unassigned.reason column gives an immediate hint: NODE_LEFT, REPLICA_ADDED, ALLOCATION_FAILED, INDEX_CREATED, and so on.
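To turn that listing into per-index counts, one option is to request the same columns with format=json and tally the rows. The sketch below assumes the rows are already parsed into dictionaries keyed by column name; the function name count_unassigned and the sample rows are illustrative, not real cluster output.

```python
from collections import Counter


def count_unassigned(rows):
    """Count UNASSIGNED shards per (index, unassigned.reason) pair."""
    counts = Counter()
    for row in rows:
        if row["state"] == "UNASSIGNED":
            counts[(row["index"], row["unassigned.reason"])] += 1
    return counts


# Hand-written rows shaped like the JSON form of the _cat/shards request above.
rows = [
    {"index": "logs-1", "shard": "0", "prirep": "p", "state": "STARTED", "unassigned.reason": None},
    {"index": "logs-1", "shard": "0", "prirep": "r", "state": "UNASSIGNED", "unassigned.reason": "NODE_LEFT"},
    {"index": "logs-1", "shard": "1", "prirep": "r", "state": "UNASSIGNED", "unassigned.reason": "NODE_LEFT"},
    {"index": "metrics", "shard": "0", "prirep": "r", "state": "UNASSIGNED", "unassigned.reason": "REPLICA_ADDED"},
]
for (index, reason), n in count_unassigned(rows).most_common():
    print(index, reason, n)
```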

Step 2 - get the authoritative explanation for one specific shard:

GET /_cluster/allocation/explain
{
  "index": "my-index-2026-05-13",
  "shard": 0,
  "primary": false
}

The response includes a node_allocation_decisions array. Each candidate node returns a verdict (YES/NO/THROTTLE) per decider, with the reason. This is the most efficient way to find out exactly why a replica is not being placed.
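Reading the verdicts by eye gets tedious on a large cluster, so a small filter can help. The following sketch keeps only the deciders that did not answer YES for each candidate node. It assumes the response has been parsed into a dictionary; the function name blocking_deciders is hypothetical, and the sample (including the explanation strings) is hand-written rather than real output.

```python
def blocking_deciders(explain):
    """Return {node_name: [(decider, explanation), ...]} for non-YES verdicts."""
    result = {}
    for node in explain.get("node_allocation_decisions", []):
        blocked = [
            (d["decider"], d["explanation"])
            for d in node.get("deciders", [])
            if d["decision"] != "YES"
        ]
        if blocked:
            result[node["node_name"]] = blocked
    return result


# Hand-written sample; explanation texts are shortened placeholders.
sample_explain = {
    "index": "my-index-2026-05-13",
    "shard": 0,
    "primary": False,
    "node_allocation_decisions": [
        {
            "node_name": "data-1",
            "node_decision": "no",
            "deciders": [
                {"decider": "same_shard", "decision": "NO",
                 "explanation": "a copy of this shard is already on this node"},
            ],
        },
        {
            "node_name": "data-2",
            "node_decision": "no",
            "deciders": [
                {"decider": "disk_threshold", "decision": "NO",
                 "explanation": "node is above the low watermark"},
            ],
        },
    ],
}
for node, reasons in blocking_deciders(sample_explain).items():
    for decider, why in reasons:
        print(f"{node}: {decider} -> {why}")
```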

Step 3 - check node disk usage if disk is suspected:

GET /_cat/allocation?v

The disk.percent column tells you which nodes are above watermarks.
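A rough version of that comparison in Python, assuming the output was requested with format=json and parsed into a list of rows. The thresholds are the default 85% and 90% watermarks mentioned earlier and should be replaced with your cluster's values; disk.percent is a rounded figure, so treat the result as approximate. The function name nodes_over_watermark and the node names are made up.

```python
def nodes_over_watermark(rows, low=85, high=90):
    """Map node name -> 'low' or 'high' for nodes at or above a watermark."""
    flagged = {}
    for row in rows:
        pct = row.get("disk.percent")
        if pct is None:
            # Rows without disk figures (such as the UNASSIGNED summary row) are skipped.
            continue
        pct = int(pct)
        if pct >= high:
            flagged[row["node"]] = "high"
        elif pct >= low:
            flagged[row["node"]] = "low"
    return flagged


# Hand-written rows; the JSON form of _cat output carries numbers as strings.
allocation_rows = [
    {"node": "data-1", "disk.percent": "72"},
    {"node": "data-2", "disk.percent": "87"},
    {"node": "data-3", "disk.percent": "93"},
    {"node": "UNASSIGNED", "disk.percent": None},
]
print(nodes_over_watermark(allocation_rows))
```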

How to Fix Unassigned Replicas

The fix depends on the cause from allocation/explain:

  • Not enough data nodes: add nodes, or temporarily lower replicas on the affected index:

PUT /<index>/_settings
{ "index.number_of_replicas": 0 }

Set this to 0 only if you accept temporary loss of redundancy; revert once nodes are added.

  • Disk watermark exceeded: free disk on the over-watermark nodes (delete or roll over old indices), add nodes, or temporarily raise the low/high watermark. See disk.watermark.high for the details.

  • cluster.routing.allocation.enable is not all: reset it to its default (all) by setting it to null. If the value was set under transient rather than persistent, clear it there instead:

PUT /_cluster/settings
{ "persistent": { "cluster.routing.allocation.enable": null } }

  • Allocation filter excludes all candidates: review index.routing.allocation.* and cluster.routing.allocation.awareness.* settings. Loosen filters or fix awareness attribute values on nodes.

  • Stuck recovery: check GET /_cat/recovery?v&active_only=true. If a recovery has been running for a long time, raise the per-node concurrency via cluster.routing.allocation.node_concurrent_recoveries or raise the recovery bandwidth limit (indices.recovery.max_bytes_per_sec).

  • Failed node not replaced: restore the node, or let the cluster rebuild the replica on another node. After a node leaves, Elasticsearch waits for index.unassigned.node_left.delayed_timeout (default 1 minute) before it starts rebuilding; if the node returns within that window, the replica is recovered on it instead.
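The cause-to-fix mapping above can be written down as a lookup keyed by the blocking decider. This is a sketch under assumptions: the keys are decider names as they are believed to appear in allocation/explain output (verify them against your Elasticsearch version), and the hint texts simply restate the fixes listed above. FIX_HINTS and suggest_fixes are hypothetical names.

```python
# Assumed decider names; check them against real allocation/explain output.
FIX_HINTS = {
    "same_shard": "add data nodes or lower index.number_of_replicas",
    "disk_threshold": "free disk, add nodes, or temporarily raise the watermarks",
    "enable": "reset cluster.routing.allocation.enable to all",
    "filter": "loosen index.routing.allocation.* filters",
    "awareness": "fix awareness attribute values on nodes",
    "throttling": "wait for running recoveries or raise recovery limits",
}

DEFAULT_HINT = "read the decider explanation in allocation/explain"


def suggest_fixes(decider_names):
    """Return de-duplicated fix hints, in the order the deciders were seen."""
    hints = []
    for name in decider_names:
        hint = FIX_HINTS.get(name, DEFAULT_HINT)
        if hint not in hints:
            hints.append(hint)
    return hints


for hint in suggest_fixes(["enable", "enable", "disk_threshold"]):
    print("-", hint)
```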

Preventive Measures

  • Run a quorum-sized cluster. Three master-eligible nodes minimum, and at least replicas + 1 data nodes for every index.
  • Alert on yellow that persists beyond a known maintenance window. Yellow during a rolling restart is expected; yellow lasting an hour is a problem.
  • Set disk-usage SLOs lower than the watermarks. Page at 75% so you have time to act before 85%.
  • Forbid cluster.routing.allocation.enable: none outside maintenance. Build the revert step into the runbook.
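The "yellow that persists" alert can be approximated from periodic health samples. The sketch below is one possible rule, not a prescribed one: it fires when the most recent unbroken run of non-green samples spans at least a threshold, with the one-hour default taken from the example above. The function name yellow_too_long and the sample timestamps are made up.

```python
from datetime import datetime, timedelta


def yellow_too_long(samples, max_duration=timedelta(hours=1)):
    """samples: list of (datetime, status) tuples, oldest first.

    Returns True when the trailing run of non-green samples spans at least
    max_duration. A green sample resets the run.
    """
    run_start = None
    for ts, status in samples:
        if status == "green":
            run_start = None
        elif run_start is None:
            run_start = ts
    if run_start is None:
        return False
    return samples[-1][0] - run_start >= max_duration


t0 = datetime(2025, 1, 1, 10, 0)
persistent = [(t0 + timedelta(minutes=m), "yellow") for m in range(0, 70, 5)]
restart = [
    (t0, "yellow"),
    (t0 + timedelta(minutes=30), "green"),
    (t0 + timedelta(minutes=40), "yellow"),
    (t0 + timedelta(minutes=65), "yellow"),
]
print(yellow_too_long(persistent), yellow_too_long(restart))
```

Suppressing the alert during a declared maintenance window would be an extra condition on top of this check.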

Resolve Yellow Cluster Status Automatically with Pulse

Pulse is an AI DBA for Elasticsearch and OpenSearch. When a cluster goes yellow and replica shards stay unassigned, Pulse:

  • Continuously tracks _cluster/health status, unassigned_shards, and unassigned.reason across every index
  • Correlates the unassigned replicas with node disk telemetry (_cat/allocation), recent cluster.routing.allocation.enable changes, allocation filter and awareness settings, ILM state, and index.unassigned.node_left.delayed_timeout countdowns
  • Identifies which of the six causes above applies by running _cluster/allocation/explain on the affected shard and reading the YES/NO/THROTTLE decider verdicts
  • Recommends the precise fix - revert cluster.routing.allocation.enable to null, free disk on the over-watermark node, add data nodes, lower number_of_replicas, or relax an allocation filter
  • Applies low-risk fixes automatically with your approval (for example, clearing a stale allocation.enable: none left over from an upgrade), or generates a one-click config PR

Pulse turns the manual triage above into an agentic SRE workflow. Start a free trial.

Frequently Asked Questions

Q: What is the fastest way to diagnose unassigned replica shards in production?
A: Run GET /_cluster/allocation/explain against the affected shard - the node_allocation_decisions array names the exact decider that is blocking allocation. To skip the manual triage across _cat/shards, allocation explain, disk telemetry, and cluster settings history, Pulse acts as an AI DBA for Elasticsearch and OpenSearch that correlates all four signals in real time and surfaces the specific cause - underprovisioned replicas, watermark block, allocation.enable left at a non-default, or a stale filter.

Q: Is yellow cluster status in Elasticsearch a problem?
A: Yellow means all primaries are healthy but at least one replica is unassigned. The cluster works, but it has lost redundancy for affected indices. If the node hosting one of those unreplicated primaries fails, status goes red and writes stop. Treat yellow as urgent, not critical.

Q: How do I find out why a shard is unassigned in Elasticsearch?
A: Run GET /_cluster/allocation/explain with the index, shard, and primary fields. The response lists every candidate node with a YES/NO/THROTTLE verdict and the reason for each, which pinpoints the blocking decider.

Q: Can a yellow cluster turn red?
A: Yes. If a primary shard becomes unassigned (because the node holding it fails and there is no replica to promote), status drops from yellow to red and writes to that index stop until the primary is recovered.

Q: Why are my replicas unassigned on a single-node Elasticsearch cluster?
A: A single-node cluster cannot host both a primary and its replica on the same node (that would defeat the purpose of replication). With number_of_replicas >= 1, the replica is unassigned and cluster status is yellow. Either add a second data node or set replicas to 0 for dev clusters.

Q: How do I fix yellow status caused by disk watermarks?
A: Free disk on the over-watermark node (delete old indices, roll over time-series indices) or temporarily raise the watermark thresholds. Long-term, add capacity or implement ILM with a delete phase so indices are not allowed to accumulate indefinitely.

Q: Does yellow status affect search performance?
A: Slightly. With fewer replicas active, search load concentrates on primaries, which can increase tail latency. The bigger risk is loss of redundancy, not performance. Resolve promptly.
