Elasticsearch Coordinating Node Overload: Causes, Symptoms, and Fixes

What Coordinating Nodes Do

Every Elasticsearch node can act as a coordinating node. When a search request arrives, the receiving node becomes the coordinator for that request. It executes a scatter-gather pattern: during the scatter phase, it routes a sub-request to one copy of each shard that holds relevant data. During the gather phase, it collects partial results from all shards, runs a merge/reduce to produce a single global result set, serializes the response, and sends it back to the client.

The reduce phase is where coordinating nodes consume the most memory. For a simple top-10 query across 100 shards, the coordinator receives 100 sets of 10 hits, sorts them, and returns the global top 10. That is cheap. But for a terms aggregation with 50,000 unique buckets across 100 shards, the coordinator must hold and merge up to 5 million bucket entries in heap before producing the final output. The serialization step also has cost - large JSON response payloads consume memory proportional to their size during the encoding process.
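The arithmetic behind these two cases can be sketched directly (the shard and entry counts come from the example above; treating each hit or bucket as one in-heap entry is a simplification):

```python
# Rough comparison of coordinator-side merge work for the two queries
# described above: a top-10 query vs. a 50,000-bucket terms aggregation,
# both fanned out across 100 shards.

def merge_entries(shards: int, entries_per_shard: int) -> int:
    """Total partial-result entries the coordinator holds during reduce."""
    return shards * entries_per_shard

top_hits = merge_entries(shards=100, entries_per_shard=10)
agg_buckets = merge_entries(shards=100, entries_per_shard=50_000)

print(top_hits)     # 1,000 hit entries: trivial to sort and merge
print(agg_buckets)  # 5,000,000 bucket entries held in heap at once
```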

Dedicated coordinating-only nodes (configured with node.roles: [] in Elasticsearch 7.9+, or with every legacy node.* role setting such as node.master and node.data set to false in earlier versions) exist to isolate this coordination work from data node responsibilities like indexing, merging, and shard-level search execution.
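A minimal elasticsearch.yml sketch for such a node (the node name is a placeholder):

```yaml
# elasticsearch.yml for a dedicated coordinating-only node (7.9+).
# An empty roles list strips master, data, ingest, and ml duties,
# leaving only request coordination.
node.name: coord-1   # placeholder name
node.roles: []
```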

Memory Pressure from Aggregations and High Cardinality

Large aggregations are the primary cause of coordinating node memory pressure. A terms aggregation on a high-cardinality field (millions of unique values) forces each shard to build and return a large bucket list. The coordinating node must hold all partial aggregation results in memory simultaneously during the reduce phase. Nested sub-aggregations multiply this cost - a two-level terms aggregation with 10,000 outer buckets and 1,000 inner buckets can produce tens of millions of intermediate results.

The memory accounting works roughly like this: before each partial or final reduce, Elasticsearch estimates the memory needed as approximately 1.5x the serialized size of the aggregations being reduced. If this estimate exceeds the circuit breaker limit, a CircuitBreakingException is thrown before the reduce even starts. If the reduce succeeds, the breaker is updated to subtract the source aggregations and add the actual serialized size of the reduced result.
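The sequence described above can be modeled in a few lines (a toy model of the accounting, not Elasticsearch's actual implementation; using the largest source as the reduced-result size is a stand-in):

```python
class CircuitBreakingException(Exception):
    pass

def check_and_reduce(limit: int, used: int, source_sizes: list[int]) -> int:
    """Return the breaker's new 'used' figure after a successful reduce.

    Mirrors the description above: estimate ~1.5x the serialized size of
    the source aggregations, trip before reducing if over the limit, then
    swap the sources out of the accounting for the reduced result's size.
    """
    estimate = int(1.5 * sum(source_sizes))
    if used + estimate > limit:
        # Tripped before the reduce even starts.
        raise CircuitBreakingException(
            f"estimated {used + estimate} bytes > limit {limit}")
    reduced_size = max(source_sizes)  # hypothetical size of merged output
    # Subtract the sources, add the actual size of the reduced result.
    return used - sum(source_sizes) + reduced_size
```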

Practical mitigations: use composite aggregations to paginate through high-cardinality results instead of requesting all buckets at once. Set shard_size explicitly on terms aggregations rather than relying on the default (which is size * 1.5 + 10). Reduce the number of levels in nested aggregations. If you need exact counts on high-cardinality fields, consider pre-aggregating at index time using a pipeline or transform.
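For example, a composite aggregation pages through buckets so the coordinator only merges one page at a time. A sketch of the request bodies, built as Python dicts (the field name user_id and the page size are illustrative):

```python
def composite_page(field: str, page_size: int, after_key: dict = None) -> dict:
    """Build one page of a composite aggregation request body."""
    body = {
        "size": 0,  # no hits needed, only buckets
        "aggs": {
            "ids": {
                "composite": {
                    "size": page_size,
                    "sources": [{"id": {"terms": {"field": field}}}],
                }
            }
        },
    }
    if after_key is not None:
        # Resume from the last bucket returned by the previous page.
        body["aggs"]["ids"]["composite"]["after"] = after_key
    return body

first_page = composite_page("user_id", 1000)
next_page = composite_page("user_id", 1000, after_key={"id": "user_999"})
```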

Circuit Breaker Trips: in_flight_requests and request Breaker

Two circuit breakers are directly relevant to coordinating node overload. The in_flight_requests breaker limits memory consumed by all active transport-level and HTTP-level requests on a node. It defaults to 100% of JVM heap (effectively bounded by the parent breaker at 95%). Its overhead multiplier defaults to 2, meaning a 50MB incoming request body is accounted as 100MB. When this breaker trips, the node rejects new incoming requests entirely.

The request breaker tracks memory allocated during query execution, including aggregation construction and reduce operations. It defaults to 60% of JVM heap. Since Elasticsearch 7.x, the coordinating node checks this breaker strictly when receiving each shard response and during partial/final reduce. If the accumulated memory exceeds the limit, the search is cancelled early and the client receives a circuit breaker exception.

When you see CircuitBreakingException errors in logs, check which breaker tripped:

GET /_nodes/stats/breaker

Look at tripped counts and estimated_size vs limit_size for both in_flight_requests and request. A high trip rate on the request breaker points to aggregation-heavy queries. A high trip rate on in_flight_requests suggests too many concurrent large requests hitting the same node.
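Such a check can be automated against the response of that endpoint (the sketch below assumes the standard breaker stats JSON shape and an illustrative 80% utilization threshold, operating on an already-fetched response dict):

```python
def breaker_report(stats: dict) -> list[str]:
    """Flag breakers with non-zero trip counts or high utilization."""
    findings = []
    for node_id, node in stats["nodes"].items():
        for name in ("in_flight_requests", "request"):
            b = node["breakers"][name]
            used_pct = 100.0 * b["estimated_size_in_bytes"] / b["limit_size_in_bytes"]
            if b["tripped"] > 0 or used_pct > 80:
                findings.append(
                    f"{node_id}: {name} tripped={b['tripped']} used={used_pct:.0f}%")
    return findings

# Hypothetical response fragment shaped like GET /_nodes/stats/breaker.
sample = {
    "nodes": {
        "coord-1": {
            "breakers": {
                "request": {"tripped": 12, "estimated_size_in_bytes": 900,
                            "limit_size_in_bytes": 1000},
                "in_flight_requests": {"tripped": 0, "estimated_size_in_bytes": 10,
                                       "limit_size_in_bytes": 1000},
            }
        }
    }
}
print(breaker_report(sample))  # flags the request breaker on coord-1
```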

Raising breaker limits is a short-term workaround, not a fix. The real solution is reducing per-query memory consumption or spreading load across more coordinating nodes.

Monitoring Coordinating Node Health

Effective monitoring of coordinating nodes requires watching several metrics in combination. JVM heap usage is the top priority - track jvm.mem.heap_used_percent and watch for sustained usage above 75%. At that level, GC pauses become more frequent and longer, which stalls all coordination work on the node.

Thread pool stats reveal queuing and rejection. Check GET /_nodes/stats/thread_pool and focus on the search, generic, and write pools. On a coordinating-only node, the search pool handles the coordination work. If queue is persistently non-zero or rejected is climbing, the node is saturated. The generic pool handles internal cluster operations; contention here can delay shard-level request routing.
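Queue and rejection checks can likewise be automated from the nodes stats response (the JSON shape below mirrors the standard thread_pool stats; a minimal sketch on an already-fetched dict):

```python
def saturated_pools(stats: dict, pools=("search", "generic", "write")) -> list[str]:
    """Report pools with queued or rejected work on any node."""
    alerts = []
    for node_id, node in stats["nodes"].items():
        for pool in pools:
            p = node["thread_pool"][pool]
            if p["queue"] > 0 or p["rejected"] > 0:
                alerts.append(
                    f"{node_id}/{pool}: queue={p['queue']} rejected={p['rejected']}")
    return alerts

# Hypothetical fragment shaped like GET /_nodes/stats/thread_pool.
sample = {"nodes": {"coord-1": {"thread_pool": {
    "search":  {"queue": 40, "rejected": 7},
    "generic": {"queue": 0,  "rejected": 0},
    "write":   {"queue": 0,  "rejected": 0},
}}}}
print(saturated_pools(sample))  # the search pool is backing up
```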

HTTP connection metrics matter too. Each open HTTP connection consumes a file descriptor and a small amount of memory. A sudden spike in http.current_open connections can indicate a client that is not properly closing connections or a sudden burst of search traffic. Monitor via GET /_nodes/stats/http.

The _nodes/hot_threads API is useful during incidents. On a coordinating node, hot threads during overload typically show time spent in aggregation reduce code, response serialization, or GC. Run it 3-4 times over 30 seconds to get a representative sample.

Sizing, Timeouts, and Protective Settings

Coordinating-only nodes need heap but not much local storage. Allocate JVM heap proportional to the complexity of your heaviest queries. A starting point for most clusters is 4-8GB heap per coordinating node. Clusters with heavy aggregation workloads may need 16-31GB. Do not exceed 31GB per JVM to stay within compressed ordinary object pointer (compressed oops) range. If a single coordinating node at 31GB is still tripping breakers, add more coordinating nodes behind a load balancer rather than pushing heap past the compressed-oops threshold. CPU matters for the reduce phase - allocate at least 4 cores per dedicated coordinating node. As a starting ratio, plan 1 coordinating node for every 5-10 data nodes and adjust based on observed queue depth and rejection rates.

When a shard times out during search execution, Elasticsearch can return partial results instead of failing the entire request. This behavior is controlled by allow_partial_search_results, which defaults to true. When enabled, timed-out shards contribute only whatever hits they collected before the deadline (possibly none), and the response includes "timed_out": true; shards that fail outright are counted in _shards.failed. The coordinating node merges whatever it received and returns that to the client. This can mask overload - if your application does not check for timed_out or _shards.failed in responses, you may silently serve incomplete results during high load. Set allow_partial_search_results: false in environments where partial data is unacceptable.
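A client-side guard for this is straightforward (a minimal sketch; the sample responses mirror the standard search response shape):

```python
def is_complete(response: dict) -> bool:
    """Return True only if the search neither timed out nor lost shards."""
    return (not response.get("timed_out", False)
            and response.get("_shards", {}).get("failed", 0) == 0)

# Hypothetical search responses.
ok = {"timed_out": False, "_shards": {"total": 5, "successful": 5, "failed": 0}}
degraded = {"timed_out": True, "_shards": {"total": 5, "successful": 3, "failed": 2}}

assert is_complete(ok)
assert not is_complete(degraded)  # would otherwise be served silently
```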

The timeout parameter on search requests (e.g., "timeout": "5s") is applied at the shard level during the scatter phase. The coordinating node itself does not enforce an independent timeout on the overall reduce/gather phase. If the reduce is slow due to massive aggregation merges, there is no built-in cap on that duration. To protect against runaway coordination, use the search.default_search_timeout cluster setting and deploy coordinating nodes with enough headroom to handle your worst-case reduce scenarios.
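search.default_search_timeout is a dynamic cluster setting; a sketch of the PUT /_cluster/settings request body, built as a Python dict (the 30s value is an illustrative choice, not a recommendation):

```python
# Body for PUT /_cluster/settings. Applies a default shard-level timeout
# to every search request that does not set its own "timeout" parameter.
default_timeout_body = {
    "persistent": {
        "search.default_search_timeout": "30s"  # illustrative value
    }
}
```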
