Identifying Latency Spikes with Hot Threads and Slow Logs
When P99 search latency spikes, the first tool to reach for is the _nodes/hot_threads API. Call GET /_nodes/hot_threads during the incident to capture the busiest JVM threads across your cluster. By default it reports CPU-bound threads. If CPU looks normal but latency is still elevated, switch to type=wait or type=block to catch threads stalled on I/O or locks. Run it several times over 30-60 seconds to spot repeating patterns - a single snapshot is rarely enough.
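Because a single snapshot is rarely conclusive, it helps to diff several captures. This is a minimal sketch of that workflow: it takes raw hot_threads responses you captured a few seconds apart and counts which thread names recur. The helper name and the abbreviated snapshot strings are illustrative, though their shape follows real hot_threads output.

```python
import re
from collections import Counter

# Extract thread names from lines like:
#   98.0% (490ms out of 500ms) cpu usage by thread 'elasticsearch[node-1][search][T#3]'
THREAD_RE = re.compile(r"cpu usage by thread '([^']+)'")

def recurring_hot_threads(snapshots):
    counts = Counter()
    for snapshot in snapshots:
        # Count each thread at most once per snapshot, so recurrence
        # across snapshots (not within one) is what ranks highest.
        counts.update(set(THREAD_RE.findall(snapshot)))
    return counts.most_common()

# Two abbreviated captures, taken ~10s apart (values made up):
s1 = "98.0% (490ms out of 500ms) cpu usage by thread 'elasticsearch[node-1][search][T#3]'"
s2 = ("97.2% (486ms out of 500ms) cpu usage by thread 'elasticsearch[node-1][search][T#3]'\n"
      "12.0% (60ms out of 500ms) cpu usage by thread 'elasticsearch[node-1][write][T#1]'")

print(recurring_hot_threads([s1, s2]))
# The search thread shows up in both snapshots; the write thread in only one.
```

A thread that tops every snapshot is a far stronger signal than one that appears once.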
Slow logs give you the query-level view. Elasticsearch emits slow logs per shard, separately for the query phase and the fetch phase. Configure thresholds at the index level:
PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.query.debug": "500ms",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.search.slowlog.threshold.fetch.info": "500ms"
}
All thresholds default to -1 (disabled), so you must set them explicitly. During an active investigation, temporarily drop the debug threshold to 100ms to capture a broader picture of what the cluster is executing. The logged output includes the query source, shard ID, and execution time.
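Once slow logs are flowing, a quick ranking of entries by execution time points you at the worst offenders. A minimal sketch, assuming the plaintext slow log line format with took_millis[...] and source[...] fields (field layout varies slightly across versions; the sample lines are made up):

```python
import re

# Pull the execution time and query source out of each slow log line.
ENTRY_RE = re.compile(r"took_millis\[(\d+)\].*?source\[(.*)\]")

def slowest_queries(log_lines, top_n=3):
    entries = []
    for line in log_lines:
        m = ENTRY_RE.search(line)
        if m:
            entries.append((int(m.group(1)), m.group(2)))
    # Sort by took_millis descending to surface the worst queries first.
    return sorted(entries, reverse=True)[:top_n]

lines = [
    'took[2.1s], took_millis[2100], source[{"query":{"wildcard":{"name":"*foo"}}}]',
    'took[600ms], took_millis[600], source[{"query":{"match":{"title":"bar"}}}]',
]
print(slowest_queries(lines))
# The leading-wildcard query ranks first at 2100ms.
```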
Common Causes: GC Pauses and Expensive Queries
Garbage collection pauses are one of the most frequent root causes of P99 spikes. A single old-generation GC pause can stall all search threads on a node for hundreds of milliseconds or more. Check GC logs and monitor jvm.gc.collectors.old.collection_time_in_millis via the node stats API. If old GC collections regularly exceed 500ms, the node is under memory pressure. Common fixes: reduce heap usage by lowering the fielddata cache size, reduce the number of shards per node, or increase heap while staying below the ~32GB compressed-oops threshold.
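Since collection_time_in_millis is a cumulative counter, the useful signal is the delta between two samples. A small sketch of that calculation; the dict shape mirrors GET /_nodes/stats/jvm, but the values and sampling interval are made up:

```python
# Cumulative old-GC time, as reported in node stats.
def old_gc_millis(node_stats):
    return node_stats["jvm"]["gc"]["collectors"]["old"]["collection_time_in_millis"]

sample_t0 = {"jvm": {"gc": {"collectors": {"old": {"collection_time_in_millis": 41_000}}}}}
sample_t1 = {"jvm": {"gc": {"collectors": {"old": {"collection_time_in_millis": 43_500}}}}}

delta_ms = old_gc_millis(sample_t1) - old_gc_millis(sample_t0)
interval_ms = 60_000  # samples taken 60s apart

# Fraction of wall time spent in old-generation GC between the samples.
print(f"old GC consumed {delta_ms / interval_ms:.1%} of wall time")  # 4.2%
```

A node spending several percent of wall time in old GC is almost certainly contributing to tail latency.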
Certain query types are inherently expensive. Leading wildcard queries (*foo) force Elasticsearch to scan every term in the inverted index; regex queries carry similar costs. script_score queries execute a Painless script against every candidate document on every shard - at scale, that adds up fast. Large aggregations with high bucket counts or deeply nested sub-aggregations consume both CPU and heap during the reduce phase. When you spot these in slow logs, the fix is structural: replace leading wildcards with n-gram tokenizers, precompute script_score values into indexed fields, and use composite aggregations for high-cardinality bucketing instead of unbounded terms aggregations.
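To make the composite-aggregation swap concrete, here is a sketch of the request body and its after_key pagination, which pages through high-cardinality buckets in fixed-size chunks instead of materializing them all at once. The field name user_id and the sample after_key value are illustrative:

```python
import json

def composite_body(size=1000, after_key=None):
    comp = {
        "size": size,  # buckets per page, instead of one unbounded response
        "sources": [{"user": {"terms": {"field": "user_id"}}}],
    }
    if after_key is not None:
        comp["after"] = after_key  # resume from where the previous page ended
    return {"size": 0, "aggs": {"by_user": {"composite": comp}}}

# First page, then a follow-up page using the after_key returned
# in the previous response:
first = composite_body()
second = composite_body(after_key={"user": "u_01000"})
print(json.dumps(second, indent=2))
```

Each page bounds the heap used on the coordinating node, which is exactly what an unbounded terms aggregation fails to do.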
High concurrent search volume amplifies all of these problems. When too many searches execute simultaneously, thread pools saturate, queries queue, and tail latency grows nonlinearly.
Thread Pool Queues and Coordinating Node Bottlenecks
Elasticsearch uses fixed-size thread pools for search execution. The search thread pool defaults to int((# of allocated processors * 3) / 2) + 1 threads with a queue of 1000. The search_throttled pool handles frozen/throttled indices with just 1 thread and a queue of 100. Monitor both via GET /_nodes/stats/thread_pool/search,search_throttled.
Watch three metrics: active (threads currently executing), queue (requests waiting), and rejected (requests dropped because the queue was full). A sustained high queue count means your nodes cannot keep up with search load. Rejections mean you are actively losing requests. If rejected is climbing, you have a capacity problem - not a tuning problem. Adding nodes or reducing query volume are the real solutions. Bumping queue_size just delays the rejection and increases memory pressure.
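The triage logic above can be sketched as a small classifier over thread pool stats. The stat names match GET /_nodes/stats/thread_pool; the thresholds and sample values are illustrative assumptions:

```python
def classify_search_pool(stats, pool_size):
    if stats["rejected"] > 0:
        # Rejections mean requests were dropped: capacity, not tuning.
        return "capacity problem: add nodes or reduce query volume"
    if stats["queue"] > pool_size:
        # Sustained queueing well beyond pool size drives tail latency.
        return "saturated: queries are waiting, tail latency will grow"
    return "healthy"

# Default search pool sizing for 8 allocated processors:
pool_size = int((8 * 3) / 2) + 1
print(pool_size)  # 13

print(classify_search_pool({"active": 13, "queue": 400, "rejected": 0}, pool_size))
print(classify_search_pool({"active": 13, "queue": 1000, "rejected": 52}, pool_size))
```

In practice you would track rejected as a rate (it is a cumulative counter), but the decision logic is the same.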
The coordinating node is a commonly overlooked bottleneck. It performs the scatter-gather: fanning out shard-level requests, collecting partial results, and running the final merge/reduce. If your coordinating node is also a data node under heavy indexing or merge load, search coordination suffers. Dedicated coordinating nodes isolate this workload. Monitor HTTP connection counts and JVM heap on these nodes separately.
Cache Misses and Adaptive Replica Selection
Elasticsearch maintains three caches that affect search latency. The shard-level request cache stores full aggregation and suggestion results - it is invalidated on any segment change (refresh, merge, flush). The node-level query cache stores filter clause results at the segment level, surviving refreshes as long as the segment is unchanged. The fielddata cache holds uninverted field data for text fields used in sorting or aggregations.
Cache misses spike after index refreshes, segment merges, or node restarts. If your indices refresh every second (the default) and your queries hit the request cache, every refresh invalidates those cached results. For read-heavy, low-update indices, increase index.refresh_interval to 30s or higher. For fielddata, check indices.fielddata.cache.size and watch for evictions - frequent evictions mean the cache is undersized or you have too many unique field values loaded.
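A quick way to decide whether cache behavior is the problem is to compute hit rates and eviction pressure from index stats. A minimal sketch; the field names follow the cache sections of the stats APIs, but the values and the 50% threshold are made-up assumptions:

```python
def hit_rate(cache_stats):
    hits, misses = cache_stats["hit_count"], cache_stats["miss_count"]
    total = hits + misses
    return hits / total if total else 0.0

# Sample stats (invented values):
request_cache = {"hit_count": 1_200, "miss_count": 4_800}
fielddata = {"evictions": 310}

print(f"request cache hit rate: {hit_rate(request_cache):.0%}")  # 20%
if hit_rate(request_cache) < 0.5:
    print("low hit rate: frequent refreshes may be invalidating entries; "
          "consider a longer index.refresh_interval")
if fielddata["evictions"] > 0:
    print("fielddata evictions seen: cache may be undersized")
```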
Adaptive Replica Selection (ARS), enabled by default since Elasticsearch 7.0, routes shard requests to the replica most likely to respond fastest. It uses an exponentially weighted moving average of response time, queue depth, and service time. If one node is slow due to GC or merges, ARS routes around it automatically. Verify it is active with GET /_cluster/settings, checking cluster.routing.use_adaptive_replica_selection. If you suspect uneven shard-level latency, use GET /my-index/_search_shards to see which nodes hold which shards, then correlate with per-node latency metrics.
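To build intuition for why ARS reacts quickly to a degrading node, here is a toy EWMA over per-node response times. This is not Elasticsearch's actual rank formula (which also folds in queue depth and service time); the smoothing factor and latency samples are invented:

```python
ALPHA = 0.3  # weight given to the newest observation

def update_ewma(ewma, observed_ms):
    # Seed with the first observation, then blend new samples in.
    return observed_ms if ewma is None else ALPHA * observed_ms + (1 - ALPHA) * ewma

nodes = {"node-a": None, "node-b": None}

# node-b starts fast, then hits a GC pause in the last two samples.
for a_ms, b_ms in [(20, 18), (22, 19), (21, 250), (20, 240)]:
    nodes["node-a"] = update_ewma(nodes["node-a"], a_ms)
    nodes["node-b"] = update_ewma(nodes["node-b"], b_ms)

best = min(nodes, key=nodes.get)
print(best)  # node-a: the moving average has already steered away from node-b
```

The EWMA responds within a couple of samples, which is why ARS shifts traffic off a GC-bound node before its pause is even over.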
Merge Operations and Their Impact on Search
Lucene segment merges run in the background and compete with search threads for CPU, I/O, and memory. During heavy merge activity, search latency increases measurably. This is especially visible after bulk indexing, force merges, or during ILM rollover operations when newly created indices undergo rapid segment creation and merging.
Monitor merge activity via GET /_nodes/stats/indices/merges. Key fields are current (active merges) and total_time_in_millis. If merge time is consistently high, limit concurrent merges with index.merge.scheduler.max_thread_count (especially on spinning disks) or reduce merge pressure by raising index.merge.policy.segments_per_tier, accepting more segments per shard in exchange for less merge work. For indices that no longer receive writes, a one-time force merge to a single segment eliminates future merge overhead and allows search to use simpler data structures. Run force merges during off-peak hours - they are I/O intensive and will temporarily worsen latency.
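Like the GC counters, total_time_in_millis is cumulative, so merge pressure is best read as a delta between samples. A sketch with the same caveats as before - the dict shape mirrors the merges section of node stats, and the values and interval are made up:

```python
def merge_millis(stats):
    return stats["indices"]["merges"]["total_time_in_millis"]

t0 = {"indices": {"merges": {"current": 2, "total_time_in_millis": 9_000_000}}}
t1 = {"indices": {"merges": {"current": 3, "total_time_in_millis": 9_180_000}}}

delta_ms = merge_millis(t1) - merge_millis(t0)
interval_ms = 300_000  # samples taken 5 minutes apart

# Fraction of the interval spent merging (summed across merge threads,
# so this can exceed 100% when merges run concurrently).
print(f"merges busy {delta_ms / interval_ms:.0%} of the interval")  # 60%
```

Correlating this ratio with your P99 latency graph quickly confirms or rules out merges as the culprit.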
Fewer segments also means fewer cache invalidation events, which keeps the request cache and query cache warm longer.