Elasticsearch Timeout Exception - Common Causes & Fixes

ElasticsearchTimeoutException (Java client) and related timeout exceptions are raised when an Elasticsearch operation does not complete within the time the client or server allows. The related classes - SocketTimeoutException and ConnectTimeoutException (client side), ReceiveTimeoutTransportException (inter-node transport), and TaskCancelledException (task cancellation) - are separate classes rather than subclasses of ElasticsearchTimeoutException, and each one identifies the layer where the timeout fired. The cluster keeps running; only the timed-out request fails.

What This Error Means

Timeouts in Elasticsearch happen at multiple layers and the fix depends on which one fired:

Layer                | Symptom                                 | Typical fix
Client socket        | SocketTimeoutException: Read timed out  | Raise client socket timeout, optimize query, use async search
Client connect       | ConnectTimeoutException                 | Network / firewall / DNS / TLS
Search per-query     | Partial results with timed_out: true    | Raise timeout parameter; query may return what it had
Inter-node transport | ReceiveTimeoutTransportException        | Network between nodes, GC pauses, or slow shard
Task cancellation    | TaskCancelledException                  | Coordinator killed task (client disconnect / explicit cancel)

Read the exception class first - that tells you which layer to fix.
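
Because client libraries often wrap the original exception, the class that names the layer may sit several levels down the cause chain. The following is a minimal sketch, not part of any Elasticsearch client, that walks the chain and maps the simple class names from the table above to a layer. The class name TimeoutLayerClassifier and the layer labels are illustrative assumptions; matching on simple names avoids needing the Elasticsearch or Apache HttpClient jars on the classpath.

```java
import java.net.SocketTimeoutException;
import java.util.LinkedHashMap;
import java.util.Map;

class TimeoutLayerClassifier {
    // Simple class names from the table above, mapped to the layer they indicate.
    private static final Map<String, String> LAYERS = new LinkedHashMap<>();
    static {
        LAYERS.put("SocketTimeoutException", "client socket");
        LAYERS.put("ConnectTimeoutException", "client connect");
        LAYERS.put("ReceiveTimeoutTransportException", "inter-node transport");
        LAYERS.put("TaskCancelledException", "task cancellation");
    }

    /** Walks the cause chain and returns the layer of the deepest recognized class. */
    static String classify(Throwable error) {
        String layer = "unknown";
        int depth = 0;
        // The depth limit guards against a cyclic cause chain.
        for (Throwable t = error; t != null && depth < 32; t = t.getCause(), depth++) {
            String match = LAYERS.get(t.getClass().getSimpleName());
            if (match != null) {
                layer = match;
            }
        }
        return layer;
    }

    public static void main(String[] args) {
        Throwable wrapped = new RuntimeException("search failed",
                new SocketTimeoutException("Read timed out"));
        System.out.println(classify(wrapped)); // prints "client socket"
    }
}
```

A search that returns timed_out: true raises no exception at all, so that layer has to be detected from the response body instead.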

Common Causes

  1. Slow query exceeding client socket timeout (most common). How to confirm: enable slow log; match the timestamp to the client failure.
  2. Per-query timeout parameter set lower than query latency. How to confirm: the search response has "timed_out": true and partial results.
  3. Inter-node transport delays from GC pauses or network issues. How to confirm: GET _nodes/stats/jvm shows long or frequent GC collections (under jvm.gc.collectors); transport ping logs show delays.
  4. Coordinator search thread pool saturated. How to confirm: GET _nodes/stats/thread_pool shows a nonzero rejected count or sustained queue length for the search pool.
  5. Large bulk requests timing out at the ingest side. How to confirm: per-request size in client logs is many MB, and the timeouts stop when you reduce the batch size.
  6. Connection-level timeout (TCP handshake or TLS handshake too slow). How to confirm: error class is ConnectTimeoutException; client log shows handshake-stage failure.
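
For cause 5, one way to keep bulk requests small is to cap each batch by byte size rather than by document count. The sketch below is an illustration under stated assumptions: BulkBatcher is a hypothetical helper, the documents are already serialized as single-line JSON strings, the byte budget is a value you choose, and the bulk action lines are not counted.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class BulkBatcher {
    /**
     * Splits serialized documents into batches whose UTF-8 size stays within
     * maxBytes. A single document larger than maxBytes gets a batch of its own.
     */
    static List<List<String>> split(List<String> docs, int maxBytes) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int currentBytes = 0;
        for (String doc : docs) {
            int size = doc.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the newline
            if (!current.isEmpty() && currentBytes + size > maxBytes) {
                batches.add(current); // budget would be exceeded: close this batch
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(doc);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("{\"a\":1}", "{\"b\":2}", "{\"c\":3}");
        // Each document is 7 bytes plus a newline, so a 16-byte budget fits two per batch.
        System.out.println(split(docs, 16).size() + " batches");
    }
}
```

Each returned batch would then be sent as its own bulk request, so no single request grows large enough to hit the client timeout.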

How to Fix Timeout Exception

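  1. Identify the exception class. The simple name (SocketTimeoutException, ReceiveTimeoutTransportException, etc.) tells you which layer fired. Client-side classes show up in the application's own logs; server-side classes show up in the Elasticsearch logs:

    tail -f /var/log/elasticsearch/*.log | grep -E 'TimeoutException|TaskCancelledException'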
    
  2. For client socket timeouts, raise the timeout deliberately and consider async search:

    // Low-level REST client: wait up to 60 s for response data before failing
    RestClient.builder(host).setRequestConfigCallback(
      rc -> rc.setSocketTimeout(60_000));
    
  3. For per-query timeout, decide whether you want partial results or a full retry:

    GET /my-index/_search?timeout=30s
    

    With timed_out: true and partial data, the cluster returned what it had so far.

  4. For long-running queries, use async search:

    POST <index>/_async_search?wait_for_completion_timeout=2s&keep_alive=1h
    
  5. Optimize slow queries. Set "profile": true in the search request body to see where time is spent:

    { "profile": true, "query": {...} }
    

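    Common gains: replace wildcard queries with term queries on keyword fields, move script-computed values into indexed fields where possible (runtime fields are still evaluated at query time), and leave track_total_hits at its default unless you need an exact hit count.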

  6. For inter-node ReceiveTimeoutTransportException, check GC pauses and network. Tune heap, fix slow shards, or move shards off overloaded nodes via cluster.routing.allocation.* filters.

  7. Scale or reshard if queue rejections are persistent. Add nodes, increase replicas (to spread search load), or roll over oversized indices.
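
The async search flow from step 4 has two parts: submit the search, then poll by ID if it was still running when the submit call returned. The sketch below only builds the two HTTP requests with the JDK's java.net.http API (Java 11+). The class name AsyncSearchRequests, the base URL, and the 10-second client timeout are assumptions for illustration; sending the requests and parsing the id and is_running fields from the response are left out.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

class AsyncSearchRequests {
    /** Builds the submit request from step 4; searchBody is an ordinary search body. */
    static HttpRequest submit(String baseUrl, String index, String searchBody) {
        URI uri = URI.create(baseUrl + "/" + index
                + "/_async_search?wait_for_completion_timeout=2s&keep_alive=1h");
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .timeout(Duration.ofSeconds(10)) // client-side cap, well above the 2s server wait
                .POST(HttpRequest.BodyPublishers.ofString(searchBody))
                .build();
    }

    /** Builds the poll request for a search that was still running at submit time. */
    static HttpRequest poll(String baseUrl, String searchId) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_async_search/" + searchId))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = submit("http://localhost:9200", "my-index",
                "{\"query\":{\"match_all\":{}}}");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Because the submit call returns after at most the wait_for_completion_timeout, the client socket timeout no longer has to cover the full query latency.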

Resolve Timeout Exceptions Automatically with Pulse

Pulse is an AI DBA for Elasticsearch and OpenSearch. When ElasticsearchTimeoutException or a related exception (SocketTimeoutException, ConnectTimeoutException, ReceiveTimeoutTransportException, TaskCancelledException) fires, Pulse:

  • Classifies the timeout by exception class and layer (client socket, client connect, search per-query with timed_out: true, inter-node transport, task cancellation), then correlates client latency with the slow log, the search pool's rejected count in _nodes/stats/thread_pool, GC pause durations in _nodes/stats/jvm, and transport ping logs at the same timestamp
  • Identifies which of the six causes applies: slow query exceeding client socketTimeout, per-query ?timeout= set below latency, inter-node transport delay from GC or network, coordinator search thread pool saturation, oversized bulk batches, or TCP/TLS connect timeout
  • Generates the exact remediation: the RestClient.setSocketTimeout(60000) adjustment, the ?timeout=30s change with explicit "partial results vs full retry" guidance, the _async_search?wait_for_completion_timeout=2s&keep_alive=1h migration, the ?profile=true plan for query optimization, the heap or G1GC tuning, or the cluster.routing.allocation.* move for an overloaded shard
  • Applies dynamic cluster settings with operator approval; leaves client timeout updates, async-search migrations, and query rewrites as one-click PRs targeting the consuming service

Pulse runs predictive alerts on rising p95/p99 latency before timeouts spike, so the question "should we add nodes or rewrite the query" has an answer before users notice the slowdown.

Start a free trial to connect your cluster.

Frequently Asked Questions

Q: What is the difference between SocketTimeoutException and ElasticsearchTimeoutException?
A: SocketTimeoutException is a JDK class raised when bytes do not arrive within the client read timeout. ElasticsearchTimeoutException is a higher-level Elasticsearch client wrapper that may wrap any of several timeout classes. The fix depends on which underlying class is wrapped - inspect the cause chain.

Q: Does "timed_out": true in a search response mean my query failed?
A: Not entirely. The server's per-query timeout is a soft limit - shards that hit it return partial results and the response is marked timed_out: true. You get whatever was already collected. To force a complete result, raise the timeout or remove it.
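
Since a timed-out search still returns HTTP 200, the client has to check the flag itself before treating the hits as complete. The sketch below shows the idea with a regular expression so that it stays JDK-only. That shortcut is an assumption made for brevity: it would also match a document field named timed_out inside the hits, so production code should read the top-level field with a real JSON parser. PartialResultCheck is a hypothetical name.

```java
import java.util.regex.Pattern;

class PartialResultCheck {
    // Matches the flag as it appears in a search response: "timed_out":true
    private static final Pattern TIMED_OUT =
            Pattern.compile("\"timed_out\"\\s*:\\s*true");

    /** Returns true when the response body reports that the per-query timeout fired. */
    static boolean isPartial(String responseJson) {
        return TIMED_OUT.matcher(responseJson).find();
    }

    public static void main(String[] args) {
        String body = "{\"took\":30000,\"timed_out\":true,\"hits\":{\"hits\":[]}}";
        if (isPartial(body)) {
            System.out.println("partial results: retry with a larger timeout or accept them");
        }
    }
}
```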

Q: How do I find which queries are causing timeout exceptions?
A: Enable the search slow log (index.search.slowlog.threshold.query.warn: 5s) and match timestamps to client failures. The _tasks?detailed=true API shows currently-running tasks if you can catch one in flight.

Q: Can timeout exceptions cause data loss?
A: Reads do not affect data. Writes that timed out on the client may have succeeded server-side - the cluster does not roll back partial writes. For bulk indexing, retry idempotent writes with the same _id (and op_type: create or external versioning if needed) to avoid duplicates.
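
A retry wrapper along these lines is one way to apply that advice. It is a sketch under assumptions: IdempotentRetry is a hypothetical helper, the wrapped call must itself be idempotent (for example, an index request with an explicit _id), and only SocketTimeoutException triggers a retry. Other exceptions propagate immediately.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

class IdempotentRetry {
    /** Retries the write on read timeouts, doubling the delay after each attempt. */
    static <T> T withRetry(Callable<T> write, int maxAttempts, long baseDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return write.call();
            } catch (SocketTimeoutException e) {
                last = e; // the write may or may not have been applied server-side
                if (attempt < maxAttempts) {
                    Thread.sleep(baseDelayMillis << (attempt - 1)); // base, 2x, 4x, ...
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated write that times out twice and then succeeds.
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new SocketTimeoutException("Read timed out");
            }
            return "indexed doc-1";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

With a fixed _id, a retry of a write that had already succeeded overwrites the same document instead of creating a duplicate.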

Q: Why does the same query sometimes time out and sometimes succeed?
A: Latency varies with cache warmth, concurrent load, segment merges, and GC pauses. A cold-cache query can be 10x slower. Set timeouts with this variance in mind, or pre-warm caches with periodic background queries.
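
One way to account for that variance is to derive the timeout from observed latencies instead of picking a round number. The sketch below computes a nearest-rank percentile and multiplies p99 by a headroom factor. The class name, the sample latencies, and the 1.5 factor are assumptions for illustration, not a recommendation from Elasticsearch.

```java
import java.util.Arrays;

class TimeoutFromLatency {
    /** Nearest-rank percentile of the observed latencies, in milliseconds. */
    static long percentile(long[] latenciesMillis, double pct) {
        if (latenciesMillis.length == 0) {
            throw new IllegalArgumentException("need at least one sample");
        }
        long[] sorted = latenciesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Suggests a timeout as p99 multiplied by a headroom factor. */
    static long suggestTimeoutMillis(long[] latenciesMillis, double headroom) {
        return Math.round(percentile(latenciesMillis, 99.0) * headroom);
    }

    public static void main(String[] args) {
        long[] samples = {120, 150, 180, 200, 250, 300, 400, 900, 1500, 2000};
        System.out.println("p50 = " + percentile(samples, 50.0) + " ms");                   // 250
        System.out.println("timeout = " + suggestTimeoutMillis(samples, 1.5) + " ms");      // 3000
    }
}
```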

Q: Should I raise search.default_search_timeout to fix timeouts?
A: That setting defaults to -1 (no timeout) and sets the default per-query timeout for searches that do not specify their own; a timeout parameter on the request overrides it. Changing it does not help slow queries finish - only optimization or async search does. Use it to bound runaway queries, not to fix slowness.

Q: What's the fastest way to diagnose timeout exceptions in production?
A: Pulse, the AI DBA for Elasticsearch and OpenSearch, classifies the timeout by exception class and layer, correlates with slow logs, thread-pool pressure, and GC pauses, then names whether the fix is client-side (raise timeout, async-search migration) or server-side (scale, query rewrite, heap tuning). It applies dynamic settings with approval and routes code changes to the right service.
