The fielddata circuit breaker error is one of the most common Elasticsearch errors in clusters where text fields are used in aggregations, sorting, or scripting. The error message reads "Data too large, data for [field] would be [size], which is larger than the limit of [limit]". This is not a bug - it is Elasticsearch protecting itself from loading an unbounded amount of data into JVM heap. Understanding what fielddata is, why it exists, and why you almost certainly should not be using it will save you from repeatedly hitting this wall.
What Fielddata Is
Fielddata is an in-memory data structure that Elasticsearch builds at query time to support operations that need access to every value of a field across all documents in a shard. Sorting, aggregations, and scripting on text fields require this structure because the inverted index (which maps terms to documents) cannot efficiently answer questions like "what are the top 10 values" or "sort results by this field's value." The inverted index is optimized for search - finding documents that contain a term - not for iterating over per-document values.
To support these operations on a text field, Elasticsearch builds an uninverted data structure in heap memory. It loads every unique term for that field across the entire shard, maps each term back to the documents it appears in, and holds this structure in JVM heap for the duration it is needed. For a text field that has been analyzed (tokenized, lowercased, stemmed), this means loading every analyzed token - not the original field values. A single document's description field might produce dozens of tokens, and across millions of documents, the resulting fielddata structure can consume gigabytes of heap.
This is fundamentally different from doc values, which are the columnar on-disk data structure that keyword, numeric, date, boolean, and geo_point fields use for sorting and aggregations. Doc values are built at index time, stored on disk, and memory-mapped - they do not consume JVM heap. Text fields do not have doc values by default, which is why fielddata exists as the runtime alternative.
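You can see the split for yourself with the analyze index disk usage API (available since 7.15; `my-index` is an illustrative name, and the exact response shape varies by version). Keyword and numeric fields report substantial `doc_values` storage on disk; text fields report none, which is exactly why they fall back to heap-resident fielddata:

```
POST /my-index/_disk_usage?run_expensive_tasks=true
```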
The Fielddata Circuit Breaker
Elasticsearch uses circuit breakers to prevent operations from consuming too much JVM heap. The fielddata circuit breaker specifically limits how much memory fielddata can occupy. It defaults to 40% of the JVM heap (configured via indices.breaker.fielddata.limit). Before loading fielddata for a field, Elasticsearch estimates how much memory it will require. If the estimate exceeds the remaining budget under the circuit breaker limit, the request fails with a CircuitBreakingException rather than attempting the load and risking an OutOfMemoryError.
The parent circuit breaker (defaulting to 95% of heap) provides a second layer of protection across all memory-consuming operations - fielddata, request data, in-flight requests, and more. Even if the fielddata-specific breaker has room, the parent breaker can reject the operation if total heap usage is too high.
Common Triggers
The most common trigger is running aggregations or sort operations on a text field. This happens frequently when engineers write queries like:
{
  "aggs": {
    "top_values": {
      "terms": { "field": "status_message" }
    }
  }
}
If status_message is mapped as text, Elasticsearch must load fielddata to execute this aggregation. On a field with high cardinality - millions of unique analyzed tokens across the index - the memory required can easily exceed the circuit breaker limit. Even on low-cardinality fields, the problem compounds when multiple shards load fielddata simultaneously, or when several text fields are aggregated in the same request.
Sorting on text fields triggers the same problem. So does using text fields in script contexts that access field values (e.g., doc['field_name'].value in Painless scripts). Any operation that requires per-document access to the field's values forces fielddata loading.
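For example, both of the following requests force fielddata loading when status_message is a text field (the index name and script are illustrative; the script accesses the field's per-document value, which is what triggers the load):

```
GET /my-index/_search
{
  "sort": [ { "status_message": "asc" } ]
}

GET /my-index/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": { "source": "doc['status_message'].value.length()" }
    }
  }
}
```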
The Fix: Use Keyword Multi-Fields
The correct fix in almost all cases is to stop using text fields for aggregations and sorting. Instead, use a keyword sub-field. Define your mapping with a multi-field:
{
  "mappings": {
    "properties": {
      "status_message": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
Then aggregate or sort on status_message.keyword instead of status_message. The keyword field uses doc values - a disk-based columnar structure that does not consume heap. This is the default mapping that Elasticsearch's dynamic mapping produces for string fields: a text field with a .keyword sub-field.
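With that mapping in place, the earlier aggregation only needs to point at the sub-field; no fielddata is loaded and no heap is consumed:

```
GET /my-index/_search
{
  "size": 0,
  "aggs": {
    "top_values": {
      "terms": { "field": "status_message.keyword" }
    }
  }
}
```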
If your index already exists without the keyword sub-field, you can add the multi-field to the mapping in place - adding a new sub-field to an existing field does not require creating a new index - but existing documents will not have the sub-field populated until they are reindexed or rewritten in place. For existing indices where reindexing is impractical, runtime fields offer an alternative: define a keyword-type runtime field derived from the _source, at a query-time performance cost.
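A sketch of both paths, using the field names from the example above. The first adds the sub-field and then rewrites every document in place with _update_by_query so the sub-field gets populated (expensive on large indices); the second defines a per-request keyword runtime field, where the runtime field name and the params._source script are illustrative:

```
PUT /my-index/_mapping
{
  "properties": {
    "status_message": {
      "type": "text",
      "fields": {
        "keyword": { "type": "keyword" }
      }
    }
  }
}

POST /my-index/_update_by_query?conflicts=proceed
```

```
GET /my-index/_search
{
  "runtime_mappings": {
    "status_message_kw": {
      "type": "keyword",
      "script": "emit(params._source['status_message'])"
    }
  },
  "size": 0,
  "aggs": {
    "top_values": {
      "terms": { "field": "status_message_kw" }
    }
  }
}
```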
Why fielddata=true Is Usually Wrong
Elasticsearch lets you enable fielddata on a text field by updating the mapping:
PUT /my-index/_mapping
{
  "properties": {
    "status_message": {
      "type": "text",
      "fielddata": true
    }
  }
}
This removes the circuit breaker protection for that field and allows the uninverted structure to be loaded into heap. It does not solve the memory problem - it just defers the failure from a circuit breaker exception to potential OOM pressure or node instability. Setting fielddata: true is appropriate only when you genuinely need to aggregate on analyzed tokens (e.g., finding the most common stemmed terms in a corpus for text analysis purposes). For the vast majority of use cases - filtering by exact values, sorting alphabetically, bucketing by category - a keyword field is the right tool.
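If you genuinely have that niche need, be aware of what the result looks like: with fielddata enabled, a terms aggregation on the text field buckets by analyzed tokens, not whole values - a document whose status_message is "Connection timed out" contributes separate buckets for connection, timed, and out (assuming a standard analyzer):

```
GET /my-index/_search
{
  "size": 0,
  "aggs": {
    "common_tokens": {
      "terms": { "field": "status_message", "size": 10 }
    }
  }
}
```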
Monitor fielddata usage through the nodes stats API to identify which fields are consuming heap:
GET _nodes/stats/indices/fielddata
GET _nodes/stats/indices/fielddata?fields=*
This returns per-node and optionally per-field memory consumption for fielddata. If a specific field dominates fielddata memory, that field is your target for mapping correction.
When you need to free memory immediately, clear the fielddata cache:
POST /my-index/_cache/clear?fielddata=true
POST /_cache/clear?fielddata=true
The first clears fielddata for a specific index; the second clears it cluster-wide. This will disrupt any in-flight queries that depend on the cached fielddata, but it immediately releases heap. The fielddata will be rebuilt on the next query that requires it, so clearing the cache is a stopgap, not a solution.
The indices.fielddata.cache.size setting controls the maximum size of the fielddata cache (unbounded by default). Setting it to a specific value like 20% causes Elasticsearch to evict least-recently-used entries when the cache reaches that size. This prevents unbounded growth but also means fielddata is constantly evicted and reloaded for active fields, adding latency to queries. Configuring this setting is a reasonable interim measure while you fix the underlying mappings, but it is not a long-term fix. The long-term fix is always to stop loading fielddata in the first place by using the right field types.
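Unlike the breaker limit, indices.fielddata.cache.size is a static node setting: it must go in elasticsearch.yml on each data node and takes effect after a restart. For example, with the 20% figure from above:

```
# elasticsearch.yml - cap the fielddata cache at 20% of heap
# (static setting; requires a node restart to apply)
indices.fielddata.cache.size: 20%
```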