TaskCancelledException: task cancelled is logged when Elasticsearch terminates a running task (search, reindex, scroll, async search, bulk update) before completion. Cancellation is initiated either by an explicit POST _tasks/<id>/_cancel call, by client disconnection (for cancellable APIs since 7.4), or by the search-shard-level timeout. The task's partial work is discarded; the rest of the cluster keeps running.
What This Error Means
Elasticsearch tracks long-running operations in its task framework. Tasks that implement CancellableTask (search, reindex, scroll, async search, delete-by-query, update-by-query) can be terminated cooperatively - the task checks an isCancelled flag at safe points and throws TaskCancelledException when set. Cancellation is the intended behavior in many cases (client disconnect, explicit cancel) and is not always an error worth investigating.
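The cooperative model described above can be illustrated with a small sketch. This is not Elasticsearch's implementation: `TaskCancelledError`, `CancellableTask`, and `process_batches` are hypothetical Python stand-ins that only mirror the idea of checking a cancellation flag at safe points.

```python
import threading


class TaskCancelledError(Exception):
    """Illustrative stand-in for Elasticsearch's TaskCancelledException."""


class CancellableTask:
    """Hypothetical task holding a cancellation flag that another thread may set."""

    def __init__(self):
        self._cancelled = threading.Event()
        self.reason = None

    def cancel(self, reason):
        # Cancellation only sets a flag; the worker notices it at its next safe point.
        self.reason = reason
        self._cancelled.set()

    def ensure_not_cancelled(self):
        if self._cancelled.is_set():
            raise TaskCancelledError(f"task cancelled [{self.reason}]")


def process_batches(task, batches):
    """Process batches one at a time, checking the flag before each batch."""
    results = []
    for batch in batches:
        task.ensure_not_cancelled()  # safe point: no batch is half-processed here
        results.append(sum(batch))
    return results
```

Because the exception propagates out of `process_batches`, the partial `results` list is never returned, which mirrors how a cancelled task's partial work is discarded.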
The exception becomes a problem when it indicates one of the following: clients giving up on slow queries before they finish, automation cancelling tasks unnecessarily, or a coordinator node killing tasks under resource pressure.
Common Causes
- Client disconnected before the search completed - the HTTP connection closed and Elasticsearch cancelled the running task. How to confirm: cluster.log shows task cancelled by user; check application logs for client timeouts at the same timestamp.
- Explicit POST _tasks/<id>/_cancel from an operator or automation. How to confirm: the audit log (if enabled) records the cancel request. GET _tasks?detailed=true lists only tasks that are still running, so it helps only while the cancellation is in flight.
- Search shard-level timeout hit (?timeout=... parameter). How to confirm: the failing request includes timeout in the search body or URL.
- Coordinator node search queue rejected new tasks under pressure. How to confirm: GET _nodes/stats/thread_pool/search shows nonzero rejected.
- Async search retention exceeded. How to confirm: async search submitted with keep_alive shorter than the actual execution time.
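To check the thread-pool cause in the list above, one option is to compare two samples of the search pool's rejected counter, since that counter is cumulative from node start. The sketch below assumes you have already fetched and JSON-decoded two GET _nodes/stats/thread_pool/search responses; the function names are hypothetical.

```python
def search_rejections(stats):
    """Map node name -> cumulative search thread-pool rejections."""
    counts = {}
    for node_id, node in stats.get("nodes", {}).items():
        pool = node.get("thread_pool", {}).get("search", {})
        counts[node.get("name", node_id)] = pool.get("rejected", 0)
    return counts


def rejection_deltas(before, after):
    """Return only the nodes whose rejected counter grew between two samples."""
    previous = search_rejections(before)
    deltas = {}
    for name, count in search_rejections(after).items():
        growth = count - previous.get(name, 0)
        if growth > 0:  # a restarted node resets its counter; skip negative growth
            deltas[name] = growth
    return deltas
```

A node that appears in `rejection_deltas` during the window in which cancellations were logged is a candidate for the coordinator-pressure cause; an empty result points toward one of the other causes.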
How to Fix TaskCancelledException
Inspect the running tasks at the time of the exception:

    GET _tasks?detailed=true&actions=*search*

Cancel a runaway task explicitly if needed:

    POST _tasks/<task_id>/_cancel

Increase the client-side timeout so the client does not disconnect before the search completes. For the Java REST client:

    // 120000 ms = 120 s socket timeout
    RestClient client = RestClient.builder(host)
        .setRequestConfigCallback(rc -> rc.setSocketTimeout(120000))
        .build();

Use async search for queries that may run longer than client timeouts:

    POST <index>/_async_search?wait_for_completion_timeout=2s&keep_alive=1h

If the search has not finished within wait_for_completion_timeout, the client gets an ID back right away; results are retrieved later via GET _async_search/<id>.

Optimize the query. Set "profile": true in the search body to see where time is spent:

    POST <index>/_search
    { "profile": true, "query": {...} }

Scale search capacity if thread_pool.search.rejected is consistently nonzero - add nodes or increase thread_pool.search.queue_size cautiously.

For long-running ingest jobs (reindex, update-by-query), use ?wait_for_completion=false and let the task finish in the background:

    POST _reindex?wait_for_completion=false
    { ... }

    GET _tasks/<id>
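The async search step above can be sketched as a submit path plus a polling loop. This is a minimal illustration, not a client library: `fetch` is a hypothetical caller-supplied function that performs the HTTP GET and returns the decoded JSON body, and the polling defaults are arbitrary assumptions.

```python
import time
from urllib.parse import quote, urlencode


def async_search_submit_path(index, wait="2s", keep_alive="1h"):
    """Build the POST path for submitting an async search."""
    query = urlencode({"wait_for_completion_timeout": wait, "keep_alive": keep_alive})
    return f"/{quote(index, safe='*,')}/_async_search?{query}"


def wait_for_async_search(fetch, search_id, poll_interval=1.0, max_polls=60,
                          sleep=time.sleep):
    """Poll GET _async_search/<id> until the search stops running.

    `fetch(path)` is supplied by the caller and must return the decoded JSON body.
    """
    for _ in range(max_polls):
        body = fetch(f"/_async_search/{search_id}")
        if not body.get("is_running", False):
            return body  # finished; results are under body["response"]
        sleep(poll_interval)
    raise TimeoutError(f"async search {search_id} still running after {max_polls} polls")
```

Keeping the total polling budget (`poll_interval * max_polls`) below the `keep_alive` value avoids waiting on a search whose stored results have already expired.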
Resolve TaskCancelledException Automatically with Pulse
Pulse is an AI DBA for Elasticsearch and OpenSearch. When TaskCancelledException: task cancelled shows up across cluster logs, Pulse:
- Snapshots _tasks?detailed=true&actions=*search* while the cancellation pattern is active, captures X-Opaque-Id headers, correlates with _nodes/stats/thread_pool/search rejected counters and the audit log's _cancel source, and matches against client disconnect timestamps from proxy/load-balancer access logs
- Identifies which of the five causes applies: client disconnect (since 7.4 cancellable APIs cancel cooperatively), explicit POST _tasks/<id>/_cancel from automation, search shard-level ?timeout= hit, coordinator thread-pool rejection under pressure, or async search keep_alive exceeded
- Generates the exact remediation: the increased client socketTimeout value, the _async_search?wait_for_completion_timeout=2s&keep_alive=1h migration, the _search { "profile": true, ... } plan for query optimization, or the ?wait_for_completion=false pattern for reindex and update-by-query
- Applies dynamic thread_pool.search.queue_size and similar cluster setting changes with operator approval; leaves client timeout and async search migrations as one-click PRs targeting the consuming service
Pulse identifies which clients consistently abandon long-running searches (by X-Opaque-Id), turning a generic spike in cancellations into a list of specific call sites to refactor.
Start a free trial to connect your cluster.
Frequently Asked Questions
Q: Is TaskCancelledException always an error?
A: No. Cancellation is the intended outcome for client disconnects and explicit _cancel calls. It becomes a problem only when clients are timing out on queries they actually need, or when automation is cancelling tasks unnecessarily.
Q: Why does my reindex task show as cancelled in the _tasks API?
A: Either you called _cancel, the client disconnected (if ?wait_for_completion=true), or the task framework rejected the task on startup. Reindex with ?wait_for_completion=false is the right pattern for long-running jobs - it survives client disconnect.
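As a rough sketch of following a background reindex, the helper below interprets a decoded GET _tasks/<id> response. It assumes the documented `completed` flag, an `error` object on failure, and the reindex status counters (`total`, `created`, `updated`, `deleted`); the function name is hypothetical.

```python
def reindex_task_summary(task_doc):
    """Summarize a decoded GET _tasks/<id> response for a reindex task."""
    status = task_doc.get("task", {}).get("status", {})
    total = status.get("total", 0)
    # Progress is estimated as created + updated + deleted out of total.
    processed = sum(status.get(key, 0) for key in ("created", "updated", "deleted"))

    if not task_doc.get("completed", False):
        state = "running"
    elif "error" in task_doc:
        state = "failed"
    else:
        state = "completed"

    return {"state": state, "processed": processed, "total": total}
```

Polling this from a scheduler or a script, rather than holding one HTTP connection open, means a client disconnect no longer affects the reindex itself.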
Q: How do I find the originating client for a cancelled task?
A: GET _tasks?detailed=true reports the headers and X-Opaque-Id for each task; if you propagate X-Opaque-Id from your app, it appears here. Audit logging (xpack.security.audit.enabled: true) captures who called _cancel.
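One way to use that output is to group running tasks by their X-Opaque-Id header. The sketch below assumes a decoded GET _tasks?detailed=true response and a hypothetical `min_running_seconds` threshold for what counts as long-running.

```python
from collections import defaultdict


def tasks_by_opaque_id(tasks_response, min_running_seconds=0.0):
    """Group running tasks by the X-Opaque-Id header their client sent."""
    grouped = defaultdict(list)
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            seconds = task.get("running_time_in_nanos", 0) / 1e9
            if seconds < min_running_seconds:
                continue
            # Tasks from clients that do not propagate the header land under "<none>".
            opaque_id = task.get("headers", {}).get("X-Opaque-Id", "<none>")
            grouped[opaque_id].append((task_id, task.get("action"), seconds))
    return dict(grouped)
```

A large "<none>" bucket is itself a finding: those callers are not propagating X-Opaque-Id, so their cancellations cannot be traced back to a call site.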
Q: Can I increase a cluster-wide search timeout to avoid cancellations?
A: search.default_search_timeout defaults to no timeout (-1, off) and can be set cluster-wide. But a high cluster default hides client-side issues. Set timeouts per-query and use async search instead.
Q: Does TaskCancelledException cause data loss in reindex or update-by-query?
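A: A cancelled reindex stops mid-execution. Documents already copied remain in the destination index; the rest are not copied. Set version_type: external in the dest section of the reindex request (it is a request option, not a mapping setting) together with conflicts: proceed, so a retry updates documents that are older in the destination, creates missing ones, and skips the rest.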
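A retry-friendly reindex request body might be built as follows. This is a sketch under the assumption that `version_type` is set inside the `dest` object of the request and that `conflicts: proceed` is wanted so version conflicts do not abort the retry; the index names and the helper name are placeholders.

```python
import json


def retry_safe_reindex_body(source_index, dest_index):
    """Build a _reindex body that can be re-run after a cancellation."""
    return {
        # Count version conflicts instead of aborting on the first one.
        "conflicts": "proceed",
        "source": {"index": source_index},
        "dest": {
            "index": dest_index,
            # Keep the source version: create missing docs and update only
            # docs that are older in the destination than in the source.
            "version_type": "external",
        },
    }


if __name__ == "__main__":
    print(json.dumps(retry_safe_reindex_body("orders-v1", "orders-v2"), indent=2))
```

The resulting JSON is what would be sent as the body of POST _reindex?wait_for_completion=false.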
Q: Is there a difference between client-cancelled and server-cancelled tasks?
A: Both surface the same TaskCancelledException. Audit log and X-Opaque-Id distinguish the source. Client cancellations correlate with HTTP connection closure timestamps in the proxy/load-balancer logs.
Q: What's the fastest way to diagnose TaskCancelledException in production?
A: Pulse, the AI DBA for Elasticsearch and OpenSearch, snapshots _tasks?detailed=true during the failure window, correlates with thread-pool rejection counters and proxy disconnect timestamps, and identifies whether the cause is a client timeout, automation _cancel, or coordinator pressure. It points at the specific X-Opaque-Id clients that need timeout or async-search migration.
Related Reading
- Elasticsearch SocketTimeoutException: client-side socket timeouts that cause cancellations.
- Elasticsearch Timeout exception: broader timeout discussion.
- Elasticsearch cluster health check: general cluster diagnostics.
- Elasticsearch monitoring: proactive task/queue monitoring.
- Elasticsearch thread pool bulk queue size: related queue tuning.