Elasticsearch node_concurrent_incoming_recoveries: Default 2, Tuning Inbound Recovery

The cluster.routing.allocation.node_concurrent_incoming_recoveries setting limits the number of concurrent shard recoveries a single Elasticsearch data node can receive at one time. It defaults to 2 and, when set, overrides the more general `cluster.routing.allocation.node_concurrent_recoveries` for inbound (destination-side) recoveries only. Tune this when you are scaling out (new nodes need to pull shards in fast) or recovering from a node loss, and the node has the disk and network headroom to absorb more parallel work.

Definition

cluster.routing.allocation.node_concurrent_incoming_recoveries is a dynamic cluster-level setting that caps inbound peer recoveries per node. Inbound here means the node is the destination: a replica is being built on this node from a primary elsewhere, or a shard is being moved here as part of rebalance. The complementary setting, cluster.routing.allocation.node_concurrent_outgoing_recoveries, controls the source side.

Default and Allowed Values

Property Value
Default 2
Type Positive integer, dynamic cluster setting
Scope Cluster, applied per node
Direction Inbound only (this node as destination)
Override of cluster.routing.allocation.node_concurrent_recoveries

When unset, Elasticsearch uses the base node_concurrent_recoveries value (also default 2) for the inbound direction.

How to Change It

# Allow 4 concurrent inbound recoveries per node
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": 4
  }
}

Reset to default with null. Inspect with:

GET /_cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.node_concurrent_*

If you raise this, consider also raising node_concurrent_outgoing_recoveries so source nodes are not the bottleneck, and the cluster-wide cluster.routing.allocation.cluster_concurrent_rebalance if you need more parallelism for rebalance specifically.

When to Tune It

Raise the inbound limit when:

  • Adding new data nodes to an existing cluster. The new nodes start empty and need to pull many shards in.
  • Recovering after a node restart where many replicas need to be rebuilt on the returning node.
  • The cluster has NVMe storage and fast network and recovery is currently disk- or network-idle but slow.

Keep at the default 2 when:

  • The cluster is steady-state and recovery is a rare event.
  • Spinning disks or constrained network where 2 already saturates I/O.
  • Inbound nodes are also serving heavy search or indexing load.

The setting is purely a parallelism cap. Per-recovery bandwidth is governed separately by indices.recovery.max_bytes_per_sec (default 40 MB/s; in 8.x there is auto-tuning based on hardware class for certain managed services).

Operational Impact

Inbound recovery on a node consumes:

  • Write disk bandwidth on the destination.
  • Receive network bandwidth.
  • A small amount of heap for transport buffers and translog replay.

Each concurrent recovery is bounded by indices.recovery.max_bytes_per_sec, so raising concurrency from 2 to 4 does not double bandwidth - it just lets four shards progress at half the throughput each, which is usually faster overall because shards finish at different points and the work pipelines.

Monitor with:

GET /_cat/recovery?v&active_only=true

The target_node column shows the destination; sort to see how many inbound recoveries any one node is running.

Common Mistakes

  1. Raising only the inbound limit. The outbound side (node_concurrent_outgoing_recoveries) is the bottleneck for source nodes. Raise both together when scaling out aggressively.
  2. Setting incoming and outgoing without checking the base value. Directional values override the base, but if you later remove the directional value, the base default of 2 kicks back in. Be explicit.
  3. Forgetting indices.recovery.max_bytes_per_sec. Concurrency without bandwidth headroom does nothing.
  4. Tuning during incident response. Test in staging first; sudden recovery storms can compound an outage.

Prevent Inbound-Recovery Misconfiguration with Pulse

Pulse is an AI DBA for Elasticsearch and OpenSearch that tracks cluster.routing.allocation.node_concurrent_incoming_recoveries (default 2) and the outbound counterpart together with indices.recovery.max_bytes_per_sec, flagging:

  • Drift between intended and actual values, especially after directional overrides are removed and the base default of 2 silently kicks back in
  • Settings that are unsafe for your workload (e.g. inbound raised to 8 on a new node while outbound stays at 2 on the source, leaving the bottleneck unchanged; tuning during incident response without staging validation)
  • The downstream operational impact: per-node inbound recovery count from _cat/recovery, destination disk write bandwidth, and the search latency cost on the receiving node

When a scale-out adds a new node and it cannot pull shards in fast enough, Pulse names whether the inbound cap, outbound cap, or per-recovery bandwidth is the binding constraint.

Connect your cluster.

Frequently Asked Questions

Q: What is the default for cluster.routing.allocation.node_concurrent_incoming_recoveries?
A: The default is 2. This caps the number of shard recoveries the node can be the destination of at one time. When unset, the same value is inherited from cluster.routing.allocation.node_concurrent_recoveries.

Q: How is this different from node_concurrent_recoveries?
A: node_concurrent_recoveries is the fallback for both inbound and outbound. node_concurrent_incoming_recoveries is specific to recoveries this node is receiving. If both are set, the directional value wins for its direction.

Q: Will raising this setting speed up adding a new node?
A: Often yes, on hardware that can absorb the extra parallelism. New nodes start empty and pull many shards in, so the inbound limit is the binding constraint. Raise to 4-8 on NVMe and fast networks; keep at 2 on slower storage.

Q: Does this setting affect snapshot restore?
A: Yes. Snapshot restore performs shard recovery on the destination, and the inbound limit applies. Restore parallelism is also bounded by repository throughput, which is usually the dominant constraint.

Q: Can I change cluster.routing.allocation.node_concurrent_incoming_recoveries dynamically?
A: Yes. Use PUT /_cluster/settings and the change applies at the next allocator pass. Active recoveries are not interrupted; the new limit governs the next batch.

Q: What happens if I raise this too high?
A: Disk and network saturate, and search and indexing latency spike. Cluster recovery time may not improve - or may actually get worse - if the destination is I/O bound. Raise in small steps and measure.

Q: What's the best tool to tune inbound recovery parallelism when scaling out?
A: Pulse is built for this. It is an AI DBA for Elasticsearch and OpenSearch that tracks node_concurrent_incoming_recoveries against actual destination disk and network throughput, identifies whether the bottleneck is the inbound cap, the outbound cap, or indices.recovery.max_bytes_per_sec, and recommends a change that is safe for the cluster's current hardware.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.