Elasticsearch Cross-Cluster Replication Errors

Cross-cluster replication in Elasticsearch works by replaying write operations from a leader index to a follower index on a remote cluster. When it works, it is transparent. When it breaks, the error messages are often indirect - you see a paused follower shard or a RetentionLeaseNotFoundException without clear context on what went wrong or how to recover. This guide covers the CCR failures that show up most often in production and the specific steps to fix each one.

CCR requires a Platinum or Enterprise license on both the leader and follower clusters. It relies on soft deletes to retain operation history on the leader, and it uses retention leases to tell the leader which operations the follower still needs. Most CCR failures trace back to one of these mechanisms breaking down.

Shard History Retention Lease Expired

This is the most common CCR failure in long-running deployments. The follower takes out a retention lease on the leader to preserve the operation history it has not yet replicated. If the follower goes offline or falls far enough behind, that lease expires (default: 12 hours), and the leader merges away the soft-deleted operations. When the follower comes back, it finds a gap in the history it cannot bridge.

The error surfaces as a fatal exception on the follower shard task, typically RetentionLeaseNotFoundException or a message about missing operations in the translog. You can confirm the state with the follower stats API:

GET /my-follower-index/_ccr/stats
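In an automated health check, the same lookup can be scripted. A minimal Python sketch - the sample payload below is illustrative, with field names following the follow stats response shape (verify against your cluster):

```python
# Scan a follower stats response (GET /<follower>/_ccr/stats) for
# shards whose fatal exception points at retention leases.
# Sample payload is illustrative, not captured from a live cluster.

def find_lease_failures(stats):
    """Return (index, shard_id, exception_type) tuples for shards
    that failed with a retention-lease-related fatal exception."""
    failures = []
    for index in stats.get("indices", []):
        for shard in index.get("shards", []):
            exc = shard.get("fatal_exception")
            if exc and "retention_lease" in exc.get("type", ""):
                failures.append((index["index"], shard["shard_id"], exc["type"]))
    return failures

sample = {
    "indices": [{
        "index": "my-follower-index",
        "shards": [{
            "shard_id": 0,
            "fatal_exception": {
                "type": "retention_lease_not_found_exception",
                "reason": "retention lease not found",
            },
        }],
    }]
}

print(find_lease_failures(sample))
# -> [('my-follower-index', 0, 'retention_lease_not_found_exception')]
```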

Look for fatal_exception in the response. If it references retention leases, the only recovery path is to recreate the follower index from scratch. Pause the follower, close it, then re-establish the follow relationship:

POST /my-follower-index/_ccr/pause_follow
POST /my-follower-index/_close
PUT /my-follower-index/_ccr/follow?wait_for_active_shards=1
{
  "remote_cluster": "leader-cluster",
  "leader_index": "my-leader-index"
}

This triggers a full remote recovery - the follower copies all Lucene segments from the leader. For large indices, this takes time and network bandwidth. To prevent recurrence, increase the retention lease period on the leader if your follower cluster has planned maintenance windows exceeding 12 hours:

PUT /my-leader-index/_settings
{
  "index.soft_deletes.retention_lease.period": "24h"
}
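To budget for the full remote recovery, a rough estimate is index size divided by the throughput the link actually sustains. A back-of-envelope sketch (the 70% efficiency factor is an assumption, not a measured value):

```python
# Rough remote-recovery duration: segment bytes over usable bandwidth.
# All numbers here are illustrative assumptions.

def recovery_hours(index_size_gb, link_mb_per_s, efficiency=0.7):
    usable = link_mb_per_s * efficiency          # MB/s actually achieved
    seconds = (index_size_gb * 1024) / usable    # GB -> MB, then divide
    return seconds / 3600

# 500 GB index over a 1 Gbps (~125 MB/s) link at 70% efficiency
print(round(recovery_hours(500, 125), 1))  # -> 1.6 (hours)
```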

Soft Deletes Not Enabled on Leader

CCR depends on soft deletes to track operation history. Indices created on Elasticsearch 7.0 or later have soft deletes enabled by default. Indices created on 6.x and upgraded to 7.x do not - the setting is fixed at index creation time and cannot be changed afterward.

If you try to follow an index without soft deletes, the follower will fail with: leader index [my-index] does not have soft deletes enabled. The auto-follow coordinator used to silently skip these indices, which made the problem invisible. Later versions surface this as an error in auto-follow stats. Either way, the fix is the same: reindex the data on the leader into a new index (which will have soft deletes enabled by default), then point your follower at the new index.

POST _reindex
{
  "source": { "index": "old-leader-index" },
  "dest": { "index": "new-leader-index" }
}

You cannot retroactively enable soft deletes. This is not a setting you can toggle - it is baked into the index at creation.
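When auditing many leader indices before setting up CCR, the soft-deletes check can be scripted against the settings API. A sketch, assuming the nested shape returned by GET /<index>/_settings?include_defaults=true (the sample response is illustrative):

```python
# Check whether an index has soft deletes enabled, looking in both the
# explicit settings and the defaults section of the settings response.

def soft_deletes_enabled(settings_response, index_name):
    body = settings_response[index_name]
    for section in ("settings", "defaults"):
        idx = body.get(section, {}).get("index", {})
        enabled = idx.get("soft_deletes", {}).get("enabled")
        if enabled is not None:
            return enabled == "true"
    return False  # setting absent: treat as not enabled and investigate

sample = {
    "old-leader-index": {
        "settings": {"index": {"soft_deletes": {"enabled": "false"}}},
        "defaults": {},
    }
}
print(soft_deletes_enabled(sample, "old-leader-index"))  # -> False
```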

Auto-Follow Pattern Errors

Auto-follow patterns match new indices on the leader by name and automatically create corresponding follower indices. They fail silently in several scenarios: the pattern matches an index without soft deletes, the follower cluster lacks the required license, or an index with the follower's target name already exists on the follower cluster.

Check the CCR stats API to see what the auto-follow coordinator is actually doing:

GET /_ccr/stats

The auto_follow_stats section of the response includes recent_auto_follow_errors with the pattern name, leader index, and the exception. A common pitfall is pattern overlap - two auto-follow patterns matching the same leader index. Elasticsearch does not deduplicate these; it tries to create two follower indices for the same leader, and one will fail with a ResourceAlreadyExistsException.
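Those errors can be pulled out programmatically. A sketch, assuming the auto_follow_stats shape described above (the payload below is an illustrative sample):

```python
# Summarize recent auto-follow errors from a GET /_ccr/stats response.

def auto_follow_errors(ccr_stats):
    stats = ccr_stats.get("auto_follow_stats", {})
    errors = []
    for entry in stats.get("recent_auto_follow_errors", []):
        exc = entry.get("auto_follow_exception", {})
        errors.append((entry.get("leader_index"), exc.get("type"), exc.get("reason")))
    return errors

sample = {"auto_follow_stats": {"recent_auto_follow_errors": [{
    "leader_index": "logs-2024.06.01",
    "timestamp": 1717200000000,
    "auto_follow_exception": {
        "type": "resource_already_exists_exception",
        "reason": "index [logs-2024.06.01] already exists",
    },
}]}}

print(auto_follow_errors(sample))
```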

Another frequent problem: auto-follow patterns do not apply retroactively. They only match indices created after the pattern was registered. If you create the pattern and expect it to pick up existing indices, nothing happens. You need to manually create follower indices for anything that already exists on the leader.

CCR Paused Due to License

CCR stops working the moment either cluster drops below a Platinum-level license. The follower shard tasks pause with an ElasticsearchSecurityException stating current license is non-compliant for [ccr]. This happens when a license expires, a trial ends, or a cluster restart picks up a Basic license instead of the Platinum one.

Restoring the license does not automatically resume paused followers. After the license is active again, you must explicitly resume each follower:

POST /my-follower-index/_ccr/resume_follow
{
  "max_read_request_operation_count": 5120,
  "max_retry_delay": "500ms"
}

If many follower indices are paused, use the follower info API to list them all and script the resume calls. Check GET /_all/_ccr/info for any follower with "status": "paused".
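A sketch of that scripting step, assuming the follower info response shape (GET /_all/_ccr/info) - the resume call itself would go through whatever HTTP client you already use:

```python
# List paused followers from a follower info response and emit the
# resume endpoints to call. Sample payload is illustrative.

def paused_followers(info_response):
    return [f["follower_index"]
            for f in info_response.get("follower_indices", [])
            if f.get("status") == "paused"]

sample = {"follower_indices": [
    {"follower_index": "follower-a", "status": "paused"},
    {"follower_index": "follower-b", "status": "active"},
]}

for name in paused_followers(sample):
    print(f"POST /{name}/_ccr/resume_follow")  # -> POST /follower-a/_ccr/resume_follow
```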

Leader Index Deleted While CCR Active

Deleting a leader index while a follower is actively replicating does not immediately error on the follower side. The follower shard tasks will encounter IndexNotFoundException on their next read poll and eventually pause with a fatal exception. The follower index itself remains intact with whatever data it had replicated up to that point.

If the leader deletion was intentional (index lifecycle rotation, for example), convert the follower into a regular index. The unfollow API requires the index to be closed first, so the full sequence is pause, close, unfollow, reopen:

POST /my-follower-index/_ccr/pause_follow
POST /my-follower-index/_close
POST /my-follower-index/_ccr/unfollow
POST /my-follower-index/_open

After unfollowing, the index becomes a standard Elasticsearch index - writable, no longer linked to any leader. If the deletion was accidental and you recreate the leader, you cannot simply resume the old follower. The index UUIDs will not match. Delete the follower, recreate it, and let it do a full recovery from the new leader.

For clusters using ILM with CCR, watch for race conditions where ILM deletes a leader index before the follower has finished replicating the final operations. Set index.lifecycle.origination_date or extend the delete phase min_age so CCR has enough headroom to catch up before the leader is removed.

Pulse - Elasticsearch Operations Done Right
