When Elasticsearch executes an Index Lifecycle Management (ILM) policy and a step fails, the index transitions to a special ERROR step. ILM halts all further progression for that index - no phase transitions, no rollovers, no deletions. The index stays frozen in place until someone intervenes or the underlying issue resolves itself through automatic retries.
Left unattended, stuck ILM policies lead to indices that never roll over (growing unbounded), data that never migrates to warm or cold tiers, and indices that never get deleted. Disk pressure builds, shard counts creep up, and cluster performance degrades. Catching and resolving these errors quickly matters.
Diagnosing a Stuck ILM Policy
The primary diagnostic tool is the ILM Explain API. Run it against any index you suspect is stuck:
GET /my-index-000001/_ilm/explain
When an index is in the ERROR step, the response looks like this:
{
  "indices": {
    "my-index-000001": {
      "index": "my-index-000001",
      "managed": true,
      "policy": "my_policy",
      "phase": "hot",
      "action": "rollover",
      "step": "ERROR",
      "failed_step": "check-rollover-ready",
      "is_auto_retryable_error": true,
      "failed_step_retry_count": 3,
      "step_info": {
        "type": "illegal_argument_exception",
        "reason": "index.lifecycle.rollover_alias [my-alias] does not point to index [my-index-000001]"
      },
      "phase_time_millis": 1698765432000,
      "action_time_millis": 1698765432000,
      "step_time_millis": 1698765499000
    }
  }
}
The fields that matter most: failed_step tells you exactly which step broke, step_info contains the exception type and reason, and is_auto_retryable_error indicates whether Elasticsearch will keep retrying on its own. The failed_step_retry_count shows how many automatic retries have already been attempted.
To scan all managed indices for errors at once, use a wildcard: GET /*/_ilm/explain?only_errors=true. This filters the response to only show indices currently sitting in the ERROR step.
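That triage loop is easy to script. The sketch below pulls the fields that matter out of an explain response; the response body here is a hand-built sample shaped like the output above, not data from a real cluster:

```python
# Sample payload shaped like GET /*/_ilm/explain?only_errors=true output.
explain_response = {
    "indices": {
        "my-index-000001": {
            "index": "my-index-000001",
            "managed": True,
            "policy": "my_policy",
            "step": "ERROR",
            "failed_step": "check-rollover-ready",
            "is_auto_retryable_error": True,
            "step_info": {
                "type": "illegal_argument_exception",
                "reason": "index.lifecycle.rollover_alias [my-alias] does not "
                          "point to index [my-index-000001]",
            },
        }
    }
}

def errored_indices(explain):
    """Return (index, failed_step, reason) for every index in the ERROR step."""
    return [
        (name, info.get("failed_step"), info.get("step_info", {}).get("reason"))
        for name, info in explain["indices"].items()
        if info.get("step") == "ERROR"
    ]

for name, step, reason in errored_indices(explain_response):
    print(f"{name}: failed at {step}: {reason}")
```

In practice you would feed the JSON body of the explain call straight into `errored_indices` and route the output to your alerting channel.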
Common Root Causes by Phase
Different ILM phases fail for different reasons. Knowing the typical failure modes per phase saves diagnostic time.
Hot phase - rollover failures. The most frequent culprit is a misconfigured or missing rollover alias. The index name must match the pattern ^.*-\d+$ (ending in a numeric suffix like -000001), and the alias must point to the index with is_write_index: true. If another index already claims the write alias, or if the alias was manually deleted, rollover fails with an illegal_argument_exception. Flood-stage disk watermark breaches and the read-only blocks they apply (FORBIDDEN/12/index read-only / allow delete (api)) also stall rollover.
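Both rollover preconditions can be checked mechanically before deploying a policy or after an alias-related failure. A minimal sketch; rollover_preconditions is a hypothetical helper, and the aliases dict mirrors the inner aliases map of a GET /<index>/_alias response:

```python
import re

def rollover_preconditions(index_name, aliases):
    """Check the two rollover requirements before ILM's check-rollover-ready.
    aliases: {alias_name: {"is_write_index": bool}} for this index.
    Returns a list of human-readable problems; empty means rollover can proceed."""
    problems = []
    # The index name must end in a numeric suffix such as -000001.
    if not re.match(r"^.*-\d+$", index_name):
        problems.append(f"index name {index_name!r} lacks a -<number> suffix")
    # At least one alias must mark this index as the write index.
    write_aliases = [a for a, meta in aliases.items() if meta.get("is_write_index")]
    if not write_aliases:
        problems.append("no alias marks this index as the write index")
    return problems

# Hypothetical failure case: the alias exists but lost its write-index flag.
print(rollover_preconditions("my-index-000001", {"my-alias": {"is_write_index": False}}))
```

Running the same check against every index behind a rollover alias is a cheap pre-deployment gate.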
Warm phase - shrink and allocation failures. The shrink action requires all primary shards to relocate to a single node before the operation can proceed. If no node has enough disk space to hold the entire index, or if allocation filtering rules (like index.routing.allocation.require) prevent shard movement, the SetSingleNodeAllocateStep fails. Attempting to shrink to a shard count that is not a factor of the source's shard count also triggers an error. These failures often appear intermittent in clusters with fluctuating disk usage.
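Both shrink preconditions can be sanity-checked up front. A sketch under simplifying assumptions (per-node free disk is taken as a plain list of byte counts, and shrink_checks is a hypothetical helper, not part of any Elasticsearch API):

```python
def shrink_checks(source_shards, target_shards, index_size_bytes, node_free_bytes):
    """Pre-flight a shrink action before ILM reaches SetSingleNodeAllocateStep.
    node_free_bytes: free disk per eligible data node, in bytes.
    Returns a list of problems; empty means the shrink looks feasible."""
    problems = []
    # The target shard count must be a factor of the source shard count.
    if target_shards > source_shards or source_shards % target_shards != 0:
        problems.append(
            f"target shard count {target_shards} must be a factor of {source_shards}")
    # One node must hold a copy of every primary, so it needs room for the whole index.
    if max(node_free_bytes) < index_size_bytes:
        problems.append("no single node has enough free disk for all primaries")
    return problems

# Hypothetical cluster: 6 -> 4 shards does not divide evenly, and no node fits the index.
print(shrink_checks(6, 4, 500 * 2**30, [200 * 2**30, 300 * 2**30]))
```

This ignores allocation filtering rules, which also need to permit the co-location, but it catches the two cheapest-to-verify failure modes.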
Warm phase - forcemerge timeouts. Forcemerge on large shards can exceed internal timeouts, especially when the node is under heavy I/O load. The operation may also conflict with ongoing snapshot operations. When forcemerge stalls, ILM moves to the ERROR step, but the merge itself might still be running in the background on the data node.
Delete phase - snapshot references. If a wait_for_snapshot action is configured, ILM waits for a specific SLM policy to complete a snapshot before allowing deletion. When the SLM policy is misconfigured, paused, or the snapshot fails, the delete phase hangs indefinitely. Separately, attempting to delete an index while a snapshot of that index is in progress throws a snapshot_in_progress_exception.
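To confirm that an in-progress snapshot is what is blocking a delete, you can cross-check the index against snapshot status output. A sketch assuming the general response shape of GET /_snapshot/<repo>/_status; the repository contents and snapshot names below are illustrative:

```python
def blocking_snapshots(index, snapshot_status):
    """Return the names of still-running snapshots that include `index`.
    snapshot_status mirrors the "snapshots" list of a snapshot status response."""
    return [
        s["snapshot"]
        for s in snapshot_status.get("snapshots", [])
        if s.get("state") == "STARTED" and index in s.get("indices", {})
    ]

# Hand-built sample: one running snapshot, one already finished.
status = {
    "snapshots": [
        {"snapshot": "nightly-2023.10.31", "state": "STARTED",
         "indices": {"my-index-000001": {}}},
        {"snapshot": "nightly-2023.10.30", "state": "SUCCESS",
         "indices": {"my-index-000001": {}}},
    ]
}
print(blocking_snapshots("my-index-000001", status))  # only the STARTED snapshot
```

If this returns anything, waiting for (or cancelling) that snapshot is the unblock path rather than retrying the delete.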
Fixing a Stuck Policy
Once you have identified and resolved the root cause, you have two options to resume ILM execution.
The simplest approach is the Retry API, which re-executes the failed step:
POST /my-index-000001/_ilm/retry
This resets the index back to the failed_step and runs it again. If the underlying problem has been fixed - the alias is corrected, disk space is freed, the snapshot has completed - the step succeeds and ILM resumes normal progression.
For cases where you need to skip a step entirely or jump to a different phase, use the Move to Step API:
POST /_ilm/move/my-index-000001
{
  "current_step": {
    "phase": "hot",
    "action": "rollover",
    "name": "ERROR"
  },
  "next_step": {
    "phase": "warm"
  }
}
The current_step must match the index's actual current position (use the explain API to confirm). The next_step can specify just a phase to jump to its first action, or include action and name for precise control. Use this with caution - skipping steps like rollover means the current index will continue receiving writes without a new index being created.
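Because the request is rejected if current_step is stale, it is safest to copy it mechanically from the explain output rather than typing it by hand. A minimal sketch; build_move_request and the sample entry are hypothetical helpers, not an Elasticsearch API:

```python
def build_move_request(index, explain_entry, next_step):
    """Build a Move to Step request body whose current_step is copied verbatim
    from the index's explain output, so the request cannot silently target the
    wrong position. explain_entry is one value from the "indices" map of
    GET /<index>/_ilm/explain; refuses to act unless the index is in ERROR."""
    if explain_entry.get("step") != "ERROR":
        raise ValueError(f"{index} is in step {explain_entry.get('step')!r}, not ERROR")
    return {
        "current_step": {
            "phase": explain_entry["phase"],
            "action": explain_entry["action"],
            "name": explain_entry["step"],
        },
        "next_step": next_step,
    }

# Sample explain entry matching the response shown earlier in this article.
entry = {"phase": "hot", "action": "rollover", "step": "ERROR"}
body = build_move_request("my-index-000001", entry, {"phase": "warm"})
```

The resulting body is what you would POST to /_ilm/move/my-index-000001.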
Elasticsearch also retries failed steps automatically based on the indices.lifecycle.poll_interval setting, which defaults to 10 minutes. For transient errors flagged with is_auto_retryable_error: true, waiting may be sufficient.
Bulk Recovery for Multiple Stuck Indices
When many indices are stuck on the same error, retrying them one at a time is tedious. Use a wildcard pattern with the retry API:
POST /my-index-*/_ilm/retry
For more targeted recovery, combine the explain API output with scripting. Pull all errored indices, filter by failed_step or step_info.type, then issue retry calls programmatically. This is particularly useful after cluster-wide events like disk watermark breaches or network partitions that put dozens of indices into ERROR simultaneously.
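A sketch of that filtering step: group every errored index by its exception type, then emit one retry call per index. The index names and the shape of the sample data are illustrative, mirroring the explain response shown earlier:

```python
from collections import defaultdict

def retry_plan(explain):
    """Group indices in the ERROR step by step_info.type and emit one
    _ilm/retry path per index, so related failures can be retried together."""
    groups = defaultdict(list)
    for name, info in explain["indices"].items():
        if info.get("step") == "ERROR":
            exc = info.get("step_info", {}).get("type", "unknown")
            groups[exc].append(f"POST /{name}/_ilm/retry")
    return dict(groups)

# Hand-built sample: two alias failures, one snapshot conflict.
explain = {
    "indices": {
        "logs-000007": {"step": "ERROR",
                        "step_info": {"type": "illegal_argument_exception"}},
        "logs-000009": {"step": "ERROR",
                        "step_info": {"type": "illegal_argument_exception"}},
        "metrics-000002": {"step": "ERROR",
                           "step_info": {"type": "snapshot_in_progress_exception"}},
    }
}
plan = retry_plan(explain)
```

Fix the shared root cause for one group, issue its retry calls, and leave the other groups untouched until their causes are addressed.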
If the root cause is a policy misconfiguration rather than a transient cluster issue, update the policy first with PUT _ilm/policy/my_policy. New policy versions take effect at the next phase transition for already-managed indices, but indices stuck in ERROR still need an explicit retry after the policy update.
Prevention Strategies
Proactive monitoring is the best defense. Set up a watch or alerting rule that queries GET /*/_ilm/explain?only_errors=true on a schedule and fires when the response contains any results. Catching errors within minutes instead of days prevents cascading failures.
Before deploying a new ILM policy, validate it against your actual index structure. Confirm that rollover aliases exist and are correctly assigned, that target node attributes for allocation match real nodes, and that shrink target shard counts divide evenly into the source. Test policies in a staging environment with representative data volumes. Keep the indices.lifecycle.poll_interval at its default unless you have a specific reason to change it - lowering it adds overhead, and raising it delays error detection.
For the delete phase, pair wait_for_snapshot with a reliable SLM policy and monitor SLM execution separately. If snapshots fail silently, the delete phase will never proceed. Grant the user or role that last modified the ILM policy the manage_ilm cluster privilege and manage index privilege on all targeted indices - ILM executes actions under that user's permissions, and missing privileges are a subtle cause of failures that only surface at runtime.