Elasticsearch update_by_query: How to Update Documents by Query

The Elasticsearch _update_by_query API modifies every document in an index that matches a query, optionally applying a Painless script. It scrolls the matching set and issues bulk updates internally, supports parallelism via slices=auto, and can run asynchronously with wait_for_completion=false. Use it for bulk field changes, backfilling new fields, or re-running ingest-time logic on existing documents - all without a full reindex.

When to Use update_by_query (vs Alternatives)

Goal | Better choice
Re-index documents with a new mapping or analyzer | Reindex API - changing an existing field's mapping or analyzer needs a fresh index
Modify a small known set of documents by ID | _bulk with update actions - cheaper than a query
Mass field update or backfill across matching documents | _update_by_query (this API)
Re-run ingest pipeline on indexed data | _update_by_query?pipeline=<name> with match_all
Change the type of an existing field | Reindex into a new index - update_by_query cannot change types

Prerequisites

  • Elasticsearch 6.x or later (API has been stable for several major versions).
  • The user or API key needs read and write index privileges on the target index.
  • Painless scripting enabled (default on most distributions).
  • Headroom for the scroll plus bulk updates running concurrently; throttle with requests_per_second if the cluster is hot.

Step-by-Step: Update Documents by Query

  1. Verify the query first with a count. Always run the same query against _count or _search?size=0 to confirm the document set before making changes.

    GET /my-index/_count
    {
      "query": { "term": { "status.keyword": "draft" } }
    }
    
  2. Run a simple update with a Painless script.

    POST /my-index/_update_by_query
    {
      "query": { "term": { "status.keyword": "draft" } },
      "script": {
        "source": "ctx._source.status = 'published'",
        "lang": "painless"
      }
    }
    

    The response includes updated, version_conflicts, batches, failures, and took.

  3. Use parameters for safer scripts. Inlining values makes every distinct value a new script that has to be compiled, which defeats the script cache and risks injection-style bugs.

    POST /my-index/_update_by_query
    {
      "query": { "term": { "category": "books" } },
      "script": {
        "source": "ctx._source.price = ctx._source.price * params.factor",
        "lang": "painless",
        "params": { "factor": 1.1 }
      }
    }
    
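  4. Add conflicts=proceed for indices under concurrent writes. Without it, the job aborts on the first version conflict; with it, conflicted documents are skipped and counted under version_conflicts.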

    POST /my-index/_update_by_query?conflicts=proceed
    { "query": { "match_all": {} }, "script": { "source": "ctx._source.updated_at = params.now", "params": { "now": "2026-05-17T00:00:00Z" } } }
    
  5. Parallelize with slices=auto. This is the usual choice for large jobs: Elasticsearch picks the slice count itself, typically one slice per shard up to a limit.

    POST /my-index/_update_by_query?slices=auto&conflicts=proceed
    { "query": { ... }, "script": { ... } }
    
  6. Run async with the Task API for long jobs.

    POST /my-index/_update_by_query?wait_for_completion=false&slices=auto
    { "query": { ... }, "script": { ... } }
    

    Response: { "task": "<task-id>" }. Monitor with GET /_tasks/<task-id>, cancel with POST /_tasks/<task-id>/_cancel.

  7. Re-run an ingest pipeline on existing data.

    POST /my-index/_update_by_query?pipeline=my-pipeline
    { "query": { "match_all": {} } }
    
  8. Throttle the rate with requests_per_second. Set it to -1 (the default) to disable throttling.

    POST /my-index/_update_by_query?requests_per_second=1000&slices=auto
    { "query": { ... }, "script": { ... } }
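
The URL parameters from the steps above can be combined in a single request. The following is a minimal Python sketch, not an official client: build_update_by_query is a hypothetical helper that only assembles the path and JSON body, and the index name, query, and script values are the illustrative ones used earlier. Sending the request is left to whatever HTTP client you already use.

```python
import json
from urllib.parse import urlencode


def build_update_by_query(index, query, script=None, **url_params):
    """Return (path, body) for an _update_by_query call; sends nothing."""
    path = f"/{index}/_update_by_query"
    # Drop unset options and lowercase booleans for use in the URL.
    cleaned = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in url_params.items()
        if value is not None
    }
    if cleaned:
        path += "?" + urlencode(cleaned)
    body = {"query": query}
    if script is not None:
        body["script"] = script
    return path, json.dumps(body)


path, body = build_update_by_query(
    "my-index",
    query={"term": {"status.keyword": "draft"}},
    script={
        "source": "ctx._source.status = params.status",
        "lang": "painless",
        "params": {"status": "published"},
    },
    conflicts="proceed",
    slices="auto",
    wait_for_completion=False,
    requests_per_second=1000,
)
print(path)
print(body)
```

Keeping the request construction separate from the HTTP call makes it easy to log or review the exact request before it touches a production index.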
    

update_by_query in Production: What to Watch For

_update_by_query is heavier than _delete_by_query because it has to fetch each matching document, apply the script, and write a new version - and the new version may have a different size, triggering segment merges. On indices with a write-heavy workload running underneath, the version conflict rate can climb fast. Always combine conflicts=proceed with a follow-up query that re-checks the documents that were skipped.
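
One way to handle the skipped documents is to re-run the same request until it reports no conflicts. The sketch below assumes a hypothetical run_job callable that submits the request with conflicts=proceed and returns the parsed JSON response. It is only safe when the query stops matching documents once they are updated (for example draft to published) or when the script is idempotent; a script like price * params.factor would be applied twice to documents that were already updated.

```python
def rerun_until_clean(run_job, max_attempts=5):
    """Call run_job() until the response reports no version conflicts.

    run_job is a hypothetical callable: it submits the same
    _update_by_query request and returns the response as a dict.
    """
    history = []
    for _ in range(max_attempts):
        response = run_job()
        history.append(response)
        if response.get("failures"):
            raise RuntimeError(f"update failed: {response['failures']}")
        if response.get("version_conflicts", 0) == 0:
            return history
    raise RuntimeError(f"conflicts remain after {max_attempts} attempts")


# Simulated responses standing in for a real cluster.
fake_responses = iter([
    {"updated": 98, "version_conflicts": 2, "failures": []},
    {"updated": 2, "version_conflicts": 0, "failures": []},
])
history = rerun_until_clean(lambda: next(fake_responses))
print(sum(r["updated"] for r in history))
```

The attempt cap matters on write-heavy indices, where conflicts may never reach zero and a bounded retry is preferable to an endless loop.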

Resource pressure during large updates is a separate concern. The scroll keeps a search context open while the bulk writes run, so the job competes with production traffic for heap and for the search and write thread pools. If you see rejected tasks in the write thread pool (for example via GET /_cat/thread_pool/write) while a job is in flight, lower slices or requests_per_second.
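
To tell whether rejections are happening during the job rather than earlier, compare two samples of the write thread pool. This sketch assumes the rows come from GET /_cat/thread_pool/write?format=json, that each row carries node_name and rejected as strings, and that rejected is a running counter; the sample rows are made up for illustration.

```python
def write_rejections_since(before, after):
    """Per-node growth in rejected write tasks between two samples."""
    baseline = {row["node_name"]: int(row["rejected"]) for row in before}
    growth = {}
    for row in after:
        delta = int(row["rejected"]) - baseline.get(row["node_name"], 0)
        if delta > 0:  # ignore unchanged nodes and counter resets
            growth[row["node_name"]] = delta
    return growth


# Illustrative samples taken before and during an update job.
before = [
    {"node_name": "node-1", "name": "write", "rejected": "10"},
    {"node_name": "node-2", "name": "write", "rejected": "0"},
]
after = [
    {"node_name": "node-1", "name": "write", "rejected": "10"},
    {"node_name": "node-2", "name": "write", "rejected": "7"},
]
print(write_rejections_since(before, after))
```

A non-empty result while the job is in flight is the signal to lower slices or requests_per_second.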

Run update_by_query Safely with Pulse

Pulse is an AI DBA for Elasticsearch and OpenSearch. Before and during _update_by_query, Pulse:

  • Verifies cluster capacity for the operation: heap for the scroll, write thread pool headroom for the bulk updates, disk for the new document versions
  • Surfaces concurrent operations that could collide - active reindex, ILM rollover, another long-running _update_by_query on the same index
  • Tracks the operation's progress and impact on production traffic in real time: version_conflicts rate, write rejections in thread_pool/write, merge IO, search latency p95
  • Recommends throttling with requests_per_second or lowering slices if production search latency starts climbing

Start a free trial before your next bulk update.

Common Mistakes

  1. Modifying a field's type with update_by_query. Scripts cannot change a field's mapped type. You need a reindex into a new index with the updated mapping.
  2. Inlining values into the script source. Use params. It is safer and lets Elasticsearch cache the compiled script.
  3. Omitting conflicts=proceed on actively written indices. The job aborts on the first version conflict, and the documents it already updated are not rolled back.
  4. Setting slices higher than the shard count. Excess slices just add coordination overhead. slices=auto is usually the right choice.
  5. Forgetting to refresh. Updated documents are not visible to search until the next refresh. Set ?refresh=true on small updates if you need immediate visibility.
  6. No snapshot before destructive scripts. A script that overwrites a field cannot be undone short of a snapshot restore.

Frequently Asked Questions

Q: Can update_by_query change a field's type or mapping?
A: No. update_by_query operates on document contents only. To change a field's type, you have to create a new index with the desired mapping and reindex. For purely additive mapping changes, the put mapping API is enough.

Q: How do I track the progress of an update_by_query in Elasticsearch?
A: Submit the request with wait_for_completion=false, take the returned task ID, and poll GET /_tasks/<task-id>. The response shows updated count, batches, version conflicts, and elapsed time. Cancel a runaway job with POST /_tasks/<task-id>/_cancel.
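
As a rough illustration, the polled task response can be reduced to a progress figure. This sketch assumes the response has the usual shape of GET /_tasks/<task-id> for a by-query task, with counters under task.status; task_progress and the sample payload are hypothetical. Skipped and conflicted documents are not counted here, and with sliced requests the parent task may reflect only completed slices, so treat the fraction as a lower bound.

```python
def task_progress(task_response):
    """Summarize a by-query task response as a small progress dict."""
    status = task_response["task"]["status"]
    total = status.get("total", 0)
    done = (
        status.get("updated", 0)
        + status.get("created", 0)
        + status.get("deleted", 0)
    )
    return {
        "completed": task_response.get("completed", False),
        "done": done,
        "total": total,
        "fraction": done / total if total else 0.0,
        "version_conflicts": status.get("version_conflicts", 0),
    }


# Illustrative payload, trimmed to the fields used above.
sample = {
    "completed": False,
    "task": {
        "status": {
            "total": 4000,
            "updated": 1000,
            "created": 0,
            "deleted": 0,
            "batches": 1,
            "version_conflicts": 3,
        }
    },
}
print(task_progress(sample))
```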

Q: What does conflicts=proceed do in update_by_query?
A: Without it, update_by_query aborts on the first version conflict (a document updated between the scroll snapshot and the update). With conflicts=proceed, conflicted documents are skipped and the operation continues. The response still records them under version_conflicts.

Q: How do I use update_by_query to delete documents?
A: Set ctx.op = 'delete' inside the script for documents that should be removed. This is occasionally useful when the delete criteria depend on per-document logic, but for straight deletes, _delete_by_query is simpler and faster.
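
A request body for this pattern might look like the sketch below. The retries field, the status.keyword query, and the max_retries threshold are hypothetical; the script sets ctx.op to 'delete' for documents over the threshold and to 'noop' for the rest, so nothing else is rewritten.

```python
import json


def conditional_delete_body(query, max_retries):
    """Build an _update_by_query body that deletes documents per a script."""
    script = (
        "if (ctx._source.retries != null"
        " && ctx._source.retries > params.max_retries)"
        " { ctx.op = 'delete' } else { ctx.op = 'noop' }"
    )
    return {
        "query": query,
        "script": {
            "source": script,
            "lang": "painless",
            "params": {"max_retries": max_retries},
        },
    }


body = conditional_delete_body({"term": {"status.keyword": "failed"}}, 3)
print(json.dumps(body, indent=2))
```

Running the same query against _count first, as in step 1, shows how many documents the script will inspect.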

Q: How do I limit how many documents update_by_query processes?
A: Use the max_docs parameter (or size on older versions) in the request body. For example "max_docs": 1000 processes only the first 1000 matching documents. This is useful for staged rollouts of risky updates.

Q: Can update_by_query run across multiple indices or data streams?
A: Yes. Pass a comma list (POST /index-a,index-b/_update_by_query) or a pattern (POST /logs-2025-*/_update_by_query). For data streams, the API rewrites the matching backing indices in place.

Q: How do I re-run an ingest pipeline on existing documents?
A: POST /my-index/_update_by_query?pipeline=<pipeline-name> with a match_all query. Every matching document is read, passed back through the pipeline, and re-indexed - useful after fixing a buggy enrich processor.

Q: What's the best tool to run update_by_query safely on a production cluster?
A: Pulse is purpose-built for this. It is an AI DBA for Elasticsearch and OpenSearch that pre-checks cluster capacity, surfaces conflicting operations, tracks version_conflicts, write thread pool rejections, and merge IO in real time, and recommends throttling via requests_per_second or lowering slices when update_by_query starts impacting production latency.
