A yellow cluster health status in Elasticsearch means every primary shard is allocated and serving traffic, but at least one replica shard is unassigned. The cluster is fully operational - reads and writes succeed - but it has no redundancy for the affected indices. If the node hosting an unreplicated primary fails, those shards become red and writes to that index stop until the primary is recovered. Yellow is the cluster's polite way of saying "fix this before something else breaks". Common causes are too few data nodes for the configured replicas, disk-watermark blocks on the only candidate nodes, allocation filters that exclude every possible target, and cluster.routing.allocation.enable set to a value other than all.
What Yellow Status Means
The cluster health API returns status: yellow when:
- All primary shards are assigned and active.
- One or more replica shards (sometimes every replica of an index) are unassigned.
- No primary is missing.
If a primary is unassigned, status is red, not yellow. The distinction matters: yellow means "no data loss yet, but you have lost redundancy"; red means "some data is not currently readable".
GET /_cluster/health
The response includes active_shards_percent_as_number, unassigned_shards, and initializing_shards. If unassigned_shards > 0 and status is "yellow", you have unassigned replicas.
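The rule above can be sketched as a small classifier. This is a minimal illustration, assuming the health response has already been fetched and parsed into a dict by whatever HTTP client you use; the function name `summarize_health` and the sample values are hypothetical, not part of any Elasticsearch client.

```python
def summarize_health(health: dict) -> str:
    """Classify a parsed GET /_cluster/health response body."""
    status = health.get("status")
    unassigned = health.get("unassigned_shards", 0)
    initializing = health.get("initializing_shards", 0)
    if status == "red":
        return "red: at least one primary shard is unassigned"
    if status == "yellow" and unassigned > 0:
        return f"yellow: {unassigned} replica shard(s) unassigned"
    if status == "yellow" and initializing > 0:
        return f"yellow: {initializing} shard(s) still initializing"
    return f"{status}: no unassigned shards reported"


# Illustrative response fragment, not captured from a real cluster.
sample = {
    "status": "yellow",
    "unassigned_shards": 3,
    "initializing_shards": 0,
    "active_shards_percent_as_number": 88.0,
}
print(summarize_health(sample))
```

Only the three fields named above are read, so the sketch should tolerate extra keys in the response.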
Common Causes
- Not enough data nodes for the replica count. An index with `number_of_replicas: 1` needs at least 2 data nodes; `number_of_replicas: 2` needs 3. Single-node dev clusters are permanently yellow for this reason.
- Disk watermarks blocking allocation. When a data node exceeds `disk.watermark.low` (85% by default), Elasticsearch stops allocating replica shards to it; above `disk.watermark.high` (90% by default) it also starts relocating shards away from the node.
- `cluster.routing.allocation.enable` set to `none`, `primaries`, or `new_primaries`. Replicas cannot allocate until the setting is restored to `all`.
- Allocation filters or awareness rules excluding all candidates. `index.routing.allocation.exclude.*`, awareness attributes, or `cluster.routing.allocation.same_shard.host: true` can leave a shard with no legal home.
- Stuck recovery on a returning node. A node that left and rejoined may still be replaying translog or rebuilding shards.
- A failed node not yet replaced. A primary stays allocated, but its replica's previous home is gone.
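The node-count arithmetic in the first cause can be written out explicitly. The sketch below assumes the only constraint is that copies of the same shard must sit on different nodes; it ignores disk, filters, and awareness. The function name is hypothetical.

```python
def unassignable_replicas_per_shard(number_of_replicas: int, data_nodes: int) -> int:
    """Return how many replicas of a single shard cannot be placed.

    One node holds the primary, so at most data_nodes - 1 replicas fit.
    """
    if data_nodes < 1:
        raise ValueError("need at least one data node to hold the primary")
    placeable = min(number_of_replicas, data_nodes - 1)
    return number_of_replicas - placeable


for replicas, nodes in [(1, 1), (1, 2), (2, 2), (2, 3)]:
    left = unassignable_replicas_per_shard(replicas, nodes)
    print(f"replicas={replicas} nodes={nodes} -> {left} unassigned per shard")
```

Any non-zero result means the index stays yellow until nodes are added or the replica count is lowered.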
How to Diagnose
Step 1 - list every shard with its state and unassigned reason, sorted by state:
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state
The unassigned.reason column gives an immediate hint: NODE_LEFT, REPLICA_ADDED, ALLOCATION_FAILED, INDEX_CREATED, and so on.
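To turn that listing into counts per index and reason, the rows can be tallied client-side. The sketch assumes the same request was made with `format=json` appended, so each row arrives as a dict keyed by the requested column names; the sample rows are illustrative only.

```python
from collections import Counter


def count_unassigned(rows):
    """Count UNASSIGNED shards per (index, unassigned.reason) pair."""
    counts = Counter()
    for row in rows:
        if row.get("state") == "UNASSIGNED":
            counts[(row["index"], row.get("unassigned.reason"))] += 1
    return counts


# Illustrative rows, shaped like _cat/shards output with format=json.
rows = [
    {"index": "logs-a", "shard": "0", "prirep": "p", "state": "STARTED", "unassigned.reason": None},
    {"index": "logs-a", "shard": "0", "prirep": "r", "state": "UNASSIGNED", "unassigned.reason": "NODE_LEFT"},
    {"index": "logs-a", "shard": "1", "prirep": "r", "state": "UNASSIGNED", "unassigned.reason": "NODE_LEFT"},
    {"index": "logs-b", "shard": "0", "prirep": "r", "state": "UNASSIGNED", "unassigned.reason": "REPLICA_ADDED"},
]
for (index, reason), n in sorted(count_unassigned(rows).items()):
    print(index, reason, n)
```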
Step 2 - get the authoritative explanation for one specific shard:
GET /_cluster/allocation/explain
{
"index": "my-index-2026-05-13",
"shard": 0,
"primary": false
}
The response includes a node_allocation_decisions array. Each candidate node returns a verdict (YES/NO/THROTTLE) per decider, with the reason. This is the most efficient way to find out exactly why a replica is not being placed.
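Reading the verdicts by eye gets tedious on clusters with many nodes. The following sketch pulls out every non-YES decider per node. It assumes the parsed response exposes `node_name` and a `deciders` list (each with `decider`, `decision`, `explanation`) under `node_allocation_decisions`; check the field names against your version. The sample data and explanation strings are made up for illustration.

```python
def blocking_deciders(explain: dict) -> dict:
    """Map node name -> list of (decider, explanation) for each non-YES verdict."""
    blocked = {}
    for node in explain.get("node_allocation_decisions", []):
        reasons = [
            (d.get("decider"), d.get("explanation"))
            for d in node.get("deciders", [])
            if d.get("decision") != "YES"
        ]
        if reasons:
            blocked[node.get("node_name")] = reasons
    return blocked


# Illustrative response fragment; explanation text is a placeholder.
explain = {
    "node_allocation_decisions": [
        {"node_name": "node-1", "node_decision": "no",
         "deciders": [{"decider": "same_shard", "decision": "NO",
                       "explanation": "(illustrative) a copy already lives here"}]},
        {"node_name": "node-2", "node_decision": "no",
         "deciders": [{"decider": "disk_threshold", "decision": "NO",
                       "explanation": "(illustrative) node is above the low watermark"}]},
    ]
}
for name, reasons in blocking_deciders(explain).items():
    print(name, [decider for decider, _ in reasons])
```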
Step 3 - check node disk usage if disk is suspected:
GET /_cat/allocation?v
The disk.percent column tells you which nodes are above watermarks.
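A quick filter over that output might look like the sketch below. It assumes the request was made with `format=json`, that `disk.percent` arrives as a string, and that the summary row for unassigned shards carries no disk figure. The 85 default mirrors the low watermark mentioned earlier; adjust it if your cluster overrides the watermarks.

```python
def nodes_over(rows, threshold_percent=85):
    """Return (node, disk_percent) pairs at or above the threshold, fullest first."""
    over = []
    for row in rows:
        raw = row.get("disk.percent")
        if raw is None:
            continue  # rows without disk figures (e.g. UNASSIGNED) are skipped
        percent = int(raw)
        if percent >= threshold_percent:
            over.append((row["node"], percent))
    return sorted(over, key=lambda item: item[1], reverse=True)


# Illustrative rows, shaped like _cat/allocation output with format=json.
rows = [
    {"node": "data-1", "disk.percent": "91"},
    {"node": "data-2", "disk.percent": "62"},
    {"node": "data-3", "disk.percent": "86"},
    {"node": "UNASSIGNED", "disk.percent": None},
]
print(nodes_over(rows))
```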
How to Fix Unassigned Replicas
The fix depends on the cause from allocation/explain:
- "Not enough data nodes": add nodes, or temporarily lower replicas on the affected index:
PUT /<index>/_settings
{ "index.number_of_replicas": 0 }
Set this to 0 only if you accept temporary loss of redundancy; revert once nodes are added.
- Disk watermark exceeded: free disk on the over-watermark nodes (delete or roll over old indices), add nodes, or temporarily raise the low/high watermark. See disk.watermark.high for the details.
- `cluster.routing.allocation.enable` not `all`: revert it:
PUT /_cluster/settings
{ "persistent": { "cluster.routing.allocation.enable": null } }
- Allocation filter excludes all candidates: review `index.routing.allocation.*` and `cluster.routing.allocation.awareness.*` settings. Loosen filters or fix awareness attribute values on nodes.
- Stuck recovery: check `GET /_cat/recovery?v&active_only=true`. If a recovery has been running for a long time, increase concurrency via `cluster.routing.allocation.node_concurrent_recoveries` or raise the recovery bandwidth limit (`indices.recovery.max_bytes_per_sec`).
- Failed node not replaced: restore the node, or let the cluster rebuild the replica elsewhere. Elasticsearch waits for `index.unassigned.node_left.delayed_timeout` (default 1 minute) before rebuilding, so a node that returns within that window can reuse its existing shard copy.
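The cause-to-fix mapping above can be kept as a lookup table keyed by the decider that returned NO. The decider names used here (`same_shard`, `disk_threshold`, `enable`, `filter`, `awareness`, `throttling`) are assumptions about what allocation explain reports; verify them against the output of your own cluster before relying on this sketch.

```python
# Assumed decider names mapped to the fixes described in the list above.
FIX_BY_DECIDER = {
    "same_shard": "add data nodes or lower index.number_of_replicas",
    "disk_threshold": "free disk, add nodes, or temporarily raise the watermarks",
    "enable": "reset cluster.routing.allocation.enable to all",
    "filter": "loosen index.routing.allocation.* filters",
    "awareness": "fix awareness attribute values on nodes",
    "throttling": "wait for recoveries or raise node_concurrent_recoveries",
}


def suggest_fixes(decider_names):
    """Return de-duplicated fix suggestions, preserving first-seen order."""
    seen = []
    for name in decider_names:
        fix = FIX_BY_DECIDER.get(
            name, f"no mapping for decider '{name}'; read its explanation"
        )
        if fix not in seen:
            seen.append(fix)
    return seen


for fix in suggest_fixes(["disk_threshold", "disk_threshold", "enable"]):
    print("-", fix)
```

Unknown deciders fall through to a reminder to read the explanation text rather than guessing at a fix.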
Preventive Measures
- Run a quorum-sized cluster. Three master-eligible nodes minimum, and at least `replicas + 1` data nodes for every index.
- Alert on yellow that persists beyond a known maintenance window. Yellow during a rolling restart is expected; yellow lasting an hour is a problem.
- Set disk-usage SLOs lower than the watermarks. Page at 75% so you have time to act before 85%.
- Forbid `cluster.routing.allocation.enable: none` outside maintenance. Build the revert step into the runbook.
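The "alert only on persistent yellow" rule can be expressed as a check over periodic health samples. This is a sketch under stated assumptions: samples are `(unix_timestamp, status)` tuples in ascending order collected by your own monitoring, and the one-hour default comes from the guideline above. Both function names are hypothetical.

```python
def yellow_duration_seconds(samples):
    """Length of the current unbroken yellow streak, measured at the last sample."""
    if not samples:
        return 0
    start = None
    for ts, status in samples:
        if status == "yellow":
            if start is None:
                start = ts  # streak begins
        else:
            start = None  # any non-yellow sample resets the streak
    return 0 if start is None else samples[-1][0] - start


def should_page(samples, max_yellow_seconds=3600):
    """True once the cluster has been continuously yellow for the allowed time."""
    return yellow_duration_seconds(samples) >= max_yellow_seconds


history = [(0, "green"), (600, "yellow"), (1800, "yellow"), (4200, "yellow")]
print(yellow_duration_seconds(history), should_page(history))
```

Red status is deliberately outside this check; it would normally page through a separate, immediate rule.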
Resolve Yellow Cluster Status Automatically with Pulse
Pulse is an AI DBA for Elasticsearch and OpenSearch. When a cluster goes yellow and replica shards stay unassigned, Pulse:
- Continuously tracks `_cluster/health` status, `unassigned_shards`, and `unassigned.reason` across every index
- Correlates the unassigned replicas with node disk telemetry (`_cat/allocation`), recent `cluster.routing.allocation.enable` changes, allocation filter and awareness settings, ILM state, and `index.unassigned.node_left.delayed_timeout` countdowns
- Identifies which of the six causes above applies by running `_cluster/allocation/explain` on the affected shard and reading the `YES`/`NO`/`THROTTLE` decider verdicts
- Recommends the precise fix - revert `cluster.routing.allocation.enable` to `null`, free disk on the over-watermark node, add data nodes, lower `number_of_replicas`, or relax an allocation filter
- Applies low-risk fixes automatically with your approval (for example, clearing a stale `allocation.enable: none` left over from an upgrade), or generates a one-click config PR
Pulse turns the manual triage above into an agentic SRE workflow.
Frequently Asked Questions
Q: What is the fastest way to diagnose unassigned replica shards in production?
A: Run GET /_cluster/allocation/explain against the affected shard - the node_allocation_decisions array names the exact decider that is blocking allocation. To skip the manual triage across _cat/shards, allocation explain, disk telemetry, and cluster settings history, Pulse acts as an AI DBA for Elasticsearch and OpenSearch that correlates all four signals in real time and surfaces the specific cause - underprovisioned replicas, a watermark block, allocation.enable left at a non-default value, or a stale filter.
Q: Is yellow cluster status in Elasticsearch a problem?
A: Yellow means all primaries are healthy but at least one replica is unassigned. The cluster works, but it has lost redundancy for affected indices. If the node hosting one of those unreplicated primaries fails, status goes red and writes stop. Treat yellow as urgent, not critical.
Q: How do I find out why a shard is unassigned in Elasticsearch?
A: Run GET /_cluster/allocation/explain with the index, shard, and primary fields. The response lists every candidate node with a YES/NO/THROTTLE verdict and the reason for each, which pinpoints the blocking decider.
Q: Can a yellow cluster turn red?
A: Yes. If a primary shard becomes unassigned (because the node holding it fails and there is no replica to promote), status drops from yellow to red and writes to that index stop until the primary is recovered.
Q: Why are my replicas unassigned on a single-node Elasticsearch cluster?
A: A single-node cluster cannot host both a primary and its replica on the same node (that would defeat the purpose of replication). With number_of_replicas >= 1, the replica is unassigned and cluster status is yellow. Either add a second data node or set replicas to 0 for dev clusters.
Q: How do I fix yellow status caused by disk watermarks?
A: Free disk on the over-watermark node (delete old indices, roll over time-series indices) or temporarily raise the watermark thresholds. Long-term, add capacity or implement ILM with a delete phase so indices are not allowed to accumulate indefinitely.
Q: Does yellow status affect search performance?
A: Slightly. With fewer replicas active, search load concentrates on primaries, which can increase tail latency. The bigger risk is loss of redundancy, not performance. Resolve promptly.
Related Reading
- Elasticsearch cluster.routing.allocation.disk.watermark.high
- Elasticsearch cluster.routing.allocation.enable Setting
- Elasticsearch cluster.routing.rebalance.enable Setting
- Elasticsearch cluster.routing.allocation.node_concurrent_recoveries
- Elasticsearch index.number_of_replicas Setting
- Elasticsearch Cluster Health Check
- Having too many Elasticsearch shards