What Is a Partial Snapshot
A partial snapshot occurs when Elasticsearch completes the snapshot process but fails to capture one or more shard copies. The global cluster state and most index data get stored successfully, but at least one shard was not snapshotted. Elasticsearch marks the snapshot with state: PARTIAL rather than SUCCESS.
Partial snapshots are still usable for restore operations. Any index whose shards were fully captured can be restored normally. Indices with failed shards will be missing data from those shards, making them incomplete. This is a critical distinction from a fully FAILED snapshot, where the entire operation was aborted.
Identifying Partial Snapshots
Query the snapshot API to inspect a specific snapshot's status:
GET _snapshot/my_repo/my_snapshot
In the response, look for the state field and the shards summary:
{
  "snapshots": [{
    "snapshot": "my_snapshot",
    "state": "PARTIAL",
    "shards": {
      "total": 50,
      "failed": 2,
      "successful": 48
    },
    "failures": [
      {
        "index": "logs-2025.01",
        "shard_id": 3,
        "reason": "primary shard is not allocated",
        "node_id": null,
        "status": "INTERNAL_SERVER_ERROR"
      }
    ]
  }]
}
The failures array contains one entry per failed shard. Each entry includes the index name, shard ID, a reason string, the node ID where the shard was assigned (or null if unassigned), and an HTTP-style status code. When diagnosing partial snapshots, always start here - the reason field usually points directly to the root cause.
You can also use GET _snapshot/_status to check currently running snapshots for in-flight shard failures before they complete.
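The triage described above can be sketched as a small helper. This is a minimal sketch, assuming the response shape shown in the example JSON; the function name and output format are illustrative, not part of any Elasticsearch client.

```python
# Sketch: classify a snapshot API response and summarize shard failures.
# Assumes the JSON shape returned by GET _snapshot/<repo>/<snapshot>,
# with the field names shown in the example above.

def summarize_snapshot(response: dict) -> list[str]:
    """Return one human-readable line per failed shard in PARTIAL snapshots."""
    lines = []
    for snap in response.get("snapshots", []):
        if snap.get("state") != "PARTIAL":
            continue
        for failure in snap.get("failures", []):
            lines.append(
                f"{snap['snapshot']}: {failure['index']}[{failure['shard_id']}] "
                f"on node {failure.get('node_id') or 'unassigned'}: {failure['reason']}"
            )
    return lines
```

Feeding the example response above through this yields a single line identifying logs-2025.01 shard 3 as unassigned, which is exactly the starting point for the root-cause checks in the next section.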
Common Causes of Partial Snapshots
Unassigned primary shards. The most frequent cause. If a primary shard has no allocated copy on any node at snapshot time, Elasticsearch cannot read its data. This happens when cluster health is yellow or red. Check GET _cluster/health and GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason to find unassigned shards.
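A quick way to act on that _cat/shards output is to filter it programmatically. This sketch assumes the whitespace-separated plain-text format and the exact column order requested by the h= parameter above; the function name is illustrative.

```python
# Sketch: pick out unassigned shards from GET _cat/shards output.
# Assumes whitespace-separated columns in the order requested above:
# index, shard, prirep, state, unassigned.reason.

def unassigned_shards(cat_output: str) -> list[dict]:
    rows = []
    for line in cat_output.strip().splitlines():
        parts = line.split()
        if len(parts) < 4 or parts[0] == "index":  # skip the ?v header row
            continue
        index, shard, prirep, state = parts[:4]
        if state == "UNASSIGNED":
            rows.append({
                "index": index,
                "shard": int(shard),
                "primary": prirep == "p",  # "p" = primary, "r" = replica
                "reason": parts[4] if len(parts) > 4 else None,
            })
    return rows
```

Unassigned primaries returned by this filter are the shards that will fail in the next snapshot; unassigned replicas are a resilience concern but are not read during snapshotting.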
Node departure during snapshot. If a data node leaves the cluster while a snapshot is running, any shards that were being read from that node will fail. The snapshot continues for remaining shards rather than aborting entirely. Node departures can result from network partitions, JVM crashes, or deliberate rolling restarts timed poorly against snapshot schedules.
Shard relocation during snapshot. Elasticsearch has a SnapshotInProgressAllocationDecider that blocks shard moves while a snapshot is active. However, if a relocation was already in progress when the snapshot started, the shard may be in a transitional state. In older versions (pre-7.x), this protection was less strict and relocations could interfere with active snapshots more easily.
Index closed during snapshot. If an index is closed after the snapshot begins but before its shards are captured, those shards will fail. ILM policies that include a close action can trigger this if the timing overlaps with a snapshot window.
Force-merge running concurrently. A long-running force-merge operation on a shard can conflict with snapshot reads. While not guaranteed to cause a failure, force-merge changes the underlying segment files. Running force-merge operations during snapshot windows increases the risk of shard-level failures.
Concurrent Snapshot Conflicts and ILM/SLM Interactions
Elasticsearch traditionally does not allow two snapshots of the same repository to run at the same time: if a second snapshot is triggered while the first is still in progress, it is rejected with a snapshot_in_progress_exception. (Recent versions add support for concurrent snapshot operations, but overlapping snapshots still queue and contend for repository resources.) This applies regardless of whether the snapshots were triggered manually, by SLM, or by ILM's wait_for_snapshot action.
{
"type": "snapshot_in_progress_exception",
"reason": "[my_repo:my_snapshot_2] a snapshot is already running"
}
This becomes a practical problem when SLM schedules overlap with snapshot duration. If an SLM policy triggers snapshots every 30 minutes but each snapshot takes 45 minutes, every other invocation will fail. SLM logs these failures and increments its failure counter. After enough consecutive failures, the cluster health API reports a warning via the slm indicator.
ILM adds another layer of complexity. The wait_for_snapshot action in ILM's delete phase holds the index lifecycle until a snapshot from a specified SLM policy completes. If that SLM policy is consistently failing due to concurrent snapshot exceptions, ILM will stall. Indices pile up waiting for a snapshot that never succeeds, consuming disk space and creating a cascading problem.
When ILM and SLM both manipulate indices - ILM performing rollovers, shrinks, or closes while SLM takes snapshots - the timing matters. An ILM close or delete action running against an index whose shards are currently being snapshotted can produce partial results. Coordinate SLM snapshot schedules with ILM phase timing to minimize overlap.
Prevention and Retry Strategies
Start with cluster stability. A green cluster health status before a snapshot begins is the single most effective way to prevent partial snapshots. Monitor GET _cluster/health and gate snapshot operations on status: green when possible. Note that snapshots read from primary shards, so a cluster that is yellow only because of unassigned replicas can still snapshot successfully; the risk is that without an assigned replica there is nothing to promote if a primary is lost, leaving that shard unassigned at snapshot time.
Avoid scheduling force-merge operations during snapshot windows. If you use ILM's force-merge action, check that the warm phase force-merge timing does not overlap with your SLM snapshot schedule. Both operate on cron schedules, so a timing audit across policies is worth the effort.
Space SLM schedules conservatively. Measure how long your snapshots actually take under peak load using the duration_in_millis field from GET _snapshot/my_repo/_all, then set your SLM schedule interval to at least 1.5x that duration. This avoids concurrent snapshot exceptions.
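The 1.5x rule above can be expressed as a small helper. This is a sketch under the assumption that the inputs are the duration_in_millis values collected from GET _snapshot/my_repo/_all; the function name and the whole-minute rounding are illustrative choices, and mapping the result onto an SLM cron expression is left to the operator.

```python
import math

# Sketch: derive a minimum SLM schedule interval from observed snapshot
# durations (the 1.5x rule described above). Inputs are assumed to be
# duration_in_millis values from GET _snapshot/<repo>/_all.

def min_slm_interval_minutes(durations_ms: list[int], factor: float = 1.5) -> int:
    """Return the smallest whole-minute interval >= factor * peak duration."""
    peak_ms = max(durations_ms)
    return math.ceil(factor * peak_ms / 60_000)
```

For example, if the slowest observed snapshot took 45 minutes, this recommends a schedule interval of at least 68 minutes, comfortably clearing the overlap that a 30-minute schedule would produce.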
For retrying after a partial snapshot, you can take a new snapshot that includes the same indices. Elasticsearch uses incremental snapshots, so successfully captured shards from the partial snapshot are reused. Fix the underlying issue first - reassign the missing shards, wait for a relocating node to stabilize, or reopen the closed index - then trigger a new snapshot. The incremental nature means the retry is fast for everything except the previously failed shards.
To monitor ongoing snapshot health, set up watches or alerts that poll GET _snapshot/my_repo/_current for in-flight snapshots and GET _snapshot/my_repo/_all for completed ones, flagging any snapshot with shards.failed > 0. Catching partial snapshots early prevents situations where your only backup is incomplete.
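An alert predicate over the completed-snapshot listing can be sketched as follows. This assumes the same response shape as the earlier example (a top-level snapshots array with state and shards fields); the function name and message format are illustrative, not a built-in alerting feature.

```python
# Sketch: an alert predicate over GET _snapshot/<repo>/_all output.
# Flags any snapshot with failed shards or a non-SUCCESS state.

def snapshots_needing_attention(listing: dict) -> list[str]:
    flagged = []
    for snap in listing.get("snapshots", []):
        failed = snap.get("shards", {}).get("failed", 0)
        if snap.get("state") != "SUCCESS" or failed > 0:
            flagged.append(
                f"{snap['snapshot']} state={snap.get('state')} failed_shards={failed}"
            )
    return flagged
```

Wiring this into a scheduled check (Watcher, an external cron job, or any monitoring agent that can call the snapshot API) gives early warning before a partial snapshot becomes your most recent backup.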