How SLM Works
Snapshot Lifecycle Management automates snapshot creation and cleanup in Elasticsearch. Each SLM policy defines four things: a cron schedule for when snapshots are taken, a name pattern for generated snapshot names (supporting date math like <nightly-{now/d}>), a target repository where snapshots are stored, and optional retention rules.
Retention rules control automatic deletion of old snapshots. Three parameters govern retention: expire_after sets the age at which snapshots become eligible for deletion, min_count defines the minimum number of snapshots to keep regardless of age, and max_count caps the total number of snapshots retained. Retention runs as a separate cluster-level task on its own schedule, independent of the snapshot creation schedule.
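Putting the pieces together, a policy with all four parts might look like the following (the policy name, repository name, and schedule are placeholders; the optional config block narrows which indices the snapshot includes):

PUT _slm/policy/nightly-backup
{
  "schedule": "0 0 2 * * ?",
  "name": "<nightly-{now/d}>",
  "repository": "my_repo",
  "config": {
    "indices": "*",
    "include_global_state": true
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}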
SLM policies are managed through the _slm/policy API. The elected master node handles both snapshot creation and retention execution. If a master node failover occurs mid-snapshot, the new master detects the in-progress snapshot and lets it complete rather than starting a new one.
Common Failure Modes
Repository not found or inaccessible. If the target repository is deleted, misconfigured, or its backing storage becomes unreachable (S3 bucket permissions changed, NFS mount dropped, Azure container deleted), every snapshot attempt fails immediately. The error shows up as a repository_missing_exception or an I/O-level error in the failure details. Verify the repository with GET _snapshot/my_repo and test it with POST _snapshot/my_repo/_verify.
Concurrent snapshot exception. Elasticsearch allows only one snapshot per repository at a time. If a previous snapshot is still running when SLM triggers the next one, it fails with concurrent_snapshot_execution_exception. This happens when snapshots take longer than the interval between scheduled runs. It also occurs when manual snapshots overlap with SLM-triggered ones, or when two SLM policies target the same repository on overlapping schedules.
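To see whether a snapshot is already running in the repository before the next scheduled run, query the snapshot status API; called without a snapshot name, it reports only in-progress snapshots (my_repo is a placeholder repository name):

GET _snapshot/my_repo/_status

An empty "snapshots" array means nothing is currently running in that repository.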
Retention failing to delete old snapshots. Retention can silently fail to delete snapshots when a snapshot is currently in progress (deletion waits for completion), when the repository is unreachable, or when min_count prevents deletion even though expire_after has been exceeded. Retention failures are less visible than creation failures because they do not block new snapshot creation.
Cluster health or resource issues. SLM snapshots that complete but produce a PARTIAL state are recorded as successes by SLM's own tracking, even though some shards failed. This can mask data protection gaps. Disk pressure on the master node or high JVM memory usage can also cause SLM execution to be delayed or skipped entirely.
Diagnosing SLM Failures
The primary diagnostic endpoint is GET _slm/policy/<policy_name>. The response includes execution metadata:
{
  "my_policy": {
    "policy": { ... },
    "last_success": {
      "snapshot_name": "nightly-2025.01.15",
      "time_string": "2025-01-15T02:00:05.123Z"
    },
    "last_failure": {
      "snapshot_name": "nightly-2025.01.16",
      "time_string": "2025-01-16T02:00:01.456Z",
      "details": "{\"type\":\"concurrent_snapshot_execution_exception\",\"reason\":\"[my_repo:manual-backup] a snapshot is already running\"}"
    },
    "next_execution_millis": 1737079200000
  }
}
The last_failure field contains the snapshot name that failed, the timestamp, and a details string with the exception type and reason. Compare the last_success and last_failure timestamps: if last_failure is more recent, the policy is currently in a failing state.
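This timestamp comparison is easy to automate in a monitoring check. The sketch below is a hypothetical helper (not part of any Elasticsearch client library) that classifies a policy from the GET _slm/policy response shape shown above:

```python
from datetime import datetime

def policy_state(policy_response: dict, name: str) -> str:
    """Return 'failing' if last_failure is more recent than last_success,
    otherwise 'healthy'."""
    entry = policy_response[name]
    success = entry.get("last_success", {}).get("time_string")
    failure = entry.get("last_failure", {}).get("time_string")
    if failure is None:
        return "healthy"        # never failed
    if success is None:
        return "failing"        # never succeeded

    def parse(ts: str) -> datetime:
        # time_string is ISO 8601 with a trailing Z
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))

    return "failing" if parse(failure) > parse(success) else "healthy"

# Using the response shape from the example above:
response = {
    "my_policy": {
        "last_success": {"snapshot_name": "nightly-2025.01.15",
                         "time_string": "2025-01-15T02:00:05.123Z"},
        "last_failure": {"snapshot_name": "nightly-2025.01.16",
                         "time_string": "2025-01-16T02:00:01.456Z"},
    }
}
print(policy_state(response, "my_policy"))  # failing
```

Wiring this into an alerting pipeline catches a policy that has flipped into a failing state even when individual failures go unnoticed.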
For cluster-wide SLM health, use GET _slm/stats:
{
  "retention_runs": 145,
  "retention_failed": 3,
  "retention_timed_out": 0,
  "retention_deletion_time_millis": 58923,
  "total_snapshots_taken": 412,
  "total_snapshots_failed": 18,
  "total_snapshots_deleted": 267,
  "total_snapshot_deletion_failures": 5,
  "policy_stats": [
    {
      "policy": "nightly-backup",
      "snapshots_taken": 350,
      "snapshots_failed": 12,
      "snapshots_deleted": 230,
      "snapshot_deletion_failures": 3
    }
  ]
}
This endpoint gives you aggregate failure counts across all policies and per-policy breakdowns. A high snapshots_failed count relative to snapshots_taken signals a systematic problem. Rising snapshot_deletion_failures means retention is struggling.
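The per-policy ratio check can be scripted. This is an illustrative sketch, not an official client API; it assumes snapshots_taken and snapshots_failed are disjoint counts (successes vs. failures), and the 3% threshold is an arbitrary example:

```python
def failing_policies(stats: dict, max_ratio: float = 0.03) -> list:
    """Return names of policies whose failure ratio exceeds max_ratio,
    given a GET _slm/stats response body."""
    flagged = []
    for p in stats.get("policy_stats", []):
        attempts = p["snapshots_taken"] + p["snapshots_failed"]
        if attempts and p["snapshots_failed"] / attempts > max_ratio:
            flagged.append(p["policy"])
    return flagged

# Using the _slm/stats example above: 12 failures out of 362 attempts (~3.3%)
stats = {
    "policy_stats": [
        {"policy": "nightly-backup", "snapshots_taken": 350,
         "snapshots_failed": 12, "snapshots_deleted": 230,
         "snapshot_deletion_failures": 3}
    ]
}
print(failing_policies(stats))  # ['nightly-backup']
```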
Elasticsearch also writes SLM execution history to the .slm-history-* indices. These indices use ILM for their own lifecycle. You can search them directly for historical failure patterns:
GET .slm-history-*/_search
{
  "query": { "match": { "success": false } },
  "sort": [{ "@timestamp": "desc" }],
  "size": 10
}
Retention Configuration Pitfalls
Retention has two schedules that operators frequently confuse. The schedule field inside an SLM policy controls when snapshots are created. The slm.retention_schedule cluster setting controls when the retention cleanup task runs across all policies. These are independent.
PUT _cluster/settings
{
  "persistent": {
    "slm.retention_schedule": "0 30 1 * * ?"
  }
}
By default, slm.retention_schedule runs daily at 1:30 AM (the value shown above is that default). If you need more frequent cleanup, for example on clusters generating many snapshots per day, adjust this setting. The retention task evaluates all SLM policies in a single pass.
The min_count parameter overrides expire_after. If min_count is set to 5 and only 5 snapshots exist, none will be deleted even if all are older than expire_after. This is by design to prevent a cluster from having zero backups, but it surprises operators who expect expired snapshots to always be removed.
The max_count parameter is only enforced when the retention task runs successfully. If retention consistently fails (due to concurrent snapshots blocking deletion, or repository issues), snapshot count grows unbounded. Monitor repository disk usage separately from SLM retention; do not assume max_count acts as a hard cap under all conditions.
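The interaction of the three parameters can be modeled to build intuition. The following is a simplified, illustrative model of retention for a single policy, not the actual Elasticsearch implementation; it assumes min_count always takes precedence over both expire_after and max_count:

```python
def retention_deletions(ages_days, expire_after, min_count, max_count):
    """Given snapshot ages in days, return the ages that this simplified
    retention model would delete, oldest first."""
    ages = sorted(ages_days, reverse=True)  # oldest first
    n = len(ages)
    deleted = set()
    # expire_after: delete expired snapshots, but never drop below min_count
    for i, age in enumerate(ages):
        if age > expire_after and n - len(deleted) > min_count:
            deleted.add(i)
    # max_count: trim oldest snapshots down to the cap (min_count still wins)
    for i in range(n):
        if n - len(deleted) <= max_count:
            break
        if i not in deleted and n - len(deleted) > min_count:
            deleted.add(i)
    return [ages[i] for i in sorted(deleted)]

# Two snapshots past expire_after, and min_count=2 leaves room to delete them:
print(retention_deletions([40, 35, 10, 5, 2], expire_after=30,
                          min_count=2, max_count=10))  # [40, 35]

# With min_count=5 and only 5 snapshots, nothing is deleted despite expiry:
print(retention_deletions([40, 35, 10, 5, 2], expire_after=30,
                          min_count=5, max_count=10))  # []
```

The second call demonstrates the min_count pitfall described above: expired snapshots survive because deleting them would drop the policy below its floor.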
Retention rules apply only to snapshots created by that specific SLM policy. Manual snapshots and snapshots from other policies are invisible to retention. An operator taking manual snapshots to the same repository should understand those will accumulate unless manually deleted.
SLM vs Manual Snapshot Management
SLM eliminates the need for external schedulers (cron jobs, orchestration tools) to manage snapshot creation. It handles naming, scheduling, and cleanup in a single configuration. For most production clusters, SLM is the better choice because it runs within the cluster and has access to internal state about ongoing operations.
Manual snapshot management still has a role. Operators who need pre-upgrade snapshots, snapshots before schema migrations, or snapshots coordinated with external systems may prefer explicit API calls. Manual snapshots also give full control over the partial parameter, which lets you choose whether a snapshot should proceed when some shards are unavailable.
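A pre-upgrade snapshot taken manually might look like this (the repository and snapshot names are placeholders; "partial": false makes the request fail outright if any primary shard is unavailable, rather than recording a PARTIAL snapshot):

PUT _snapshot/my_repo/pre-upgrade-snapshot?wait_for_completion=true
{
  "indices": "*",
  "partial": false
}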
The trade-off: SLM reduces operational burden at the cost of flexibility. If SLM policies fail silently and nobody monitors _slm/stats or the .slm-history-* indices, the cluster may go without valid backups. Pair SLM with alerting on total_snapshots_failed and retention_failed counters to close this gap. The Elasticsearch health API includes an slm indicator that turns yellow after repeated consecutive failures, controlled by the slm.health.failed_snapshot_warn_threshold setting (default: 5).