What is Elasticsearch Fielddata? When to Use It and Why It Hurts

Fielddata is the in-memory data structure Elasticsearch uses to access field values column-wise for sorting, aggregations, and scripting. For keyword, numeric, date, and geo fields, this role is filled by doc values on disk, which are fast and cheap. Text fields have no doc values, so the only way to get column-wise access is to load the field into heap memory as fielddata - which is why fielddata is disabled by default on text fields.

Fielddata vs Doc Values

Doc values are the default columnar store for nearly every field type. They're written to disk at index time alongside the inverted index and memory-mapped at search time, so they avoid heap pressure and scale with index size. Fielddata is different: it's built on-heap, lazily, on first use, and held on the JVM heap until the segment is merged away, or until the cache entry is evicted or cleared.

| Property | Doc values | Fielddata |
| --- | --- | --- |
| Storage | Disk (memory-mapped) | JVM heap |
| Default for | keyword, numeric, date, geo, ip, boolean | text (off by default) |
| Built | At index time | Lazily on first use |
| Cost | Cheap, scales with disk | Expensive, scales with heap |
| Granularity | Per-segment | Per-segment |
| Lifecycle | Lives with the segment | Held in heap until segment merge |

Doc values cover the vast majority of sorting and aggregation needs. Fielddata is only for the narrow case where you must run an aggregation or sort against a text field directly.

Why Fielddata is Disabled by Default on Text Fields

Text fields are analyzed: a single document becomes many tokens. Building fielddata loads every unique term and every per-document term list into heap. On a multi-million-document index over a long text field, this can consume tens of gigabytes of heap and trigger garbage collection storms - or trip the fielddata circuit breaker outright. Elasticsearch's documented guidance: "Fielddata can consume a lot of heap space, especially when loading high cardinality text fields."
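To make the "one document becomes many tokens" point concrete, here is a minimal sketch that counts what an uninverted text field would have to hold. The function name fielddata_footprint and the regex tokenizer are illustrative assumptions: the tokenizer is only a crude stand-in for a real Elasticsearch analyzer, and the counts say nothing about actual byte sizes.

```python
import re


def fielddata_footprint(docs):
    """Return (unique_terms, doc_term_entries) for a list of text values.

    Hypothetical helper: it mimics an analyzer by lowercasing and
    splitting on non-word characters, which is NOT what Elasticsearch
    does internally, only a rough approximation for illustration.
    """
    per_doc_terms = []
    for text in docs:
        terms = {t for t in re.split(r"\W+", text.lower()) if t}
        per_doc_terms.append(terms)

    unique_terms = set()
    for terms in per_doc_terms:
        unique_terms |= terms

    # Each document contributes one entry per distinct term it contains.
    doc_term_entries = sum(len(terms) for terms in per_doc_terms)
    return len(unique_terms), doc_term_entries


docs = [
    "Fielddata hurts the heap",
    "Doc values live on disk",
    "The heap is not the disk",
]
unique, entries = fielddata_footprint(docs)
print(f"{len(docs)} documents -> {unique} unique terms, {entries} doc/term entries")
```

Three short titles already produce more terms and per-document entries than there are documents, whereas a keyword sub-field would store exactly one value per document. Long bodies and comments widen that gap considerably.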

The fielddata circuit breaker (indices.breaker.fielddata.limit, default 40% of JVM heap) rejects fielddata loads that would exceed the threshold, returning a CircuitBreakingException instead of OOMing the node.
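The breaker arithmetic can be sketched as follows. This is a simplified model, not Elasticsearch's implementation: it assumes the 40% default limit and the default estimation overhead of 1.03 (indices.breaker.fielddata.overhead), and it ignores the parent breaker, which can reject a request earlier. The function name would_trip and the example heap sizes are hypothetical.

```python
GIB = 1024 ** 3


def would_trip(heap_bytes, current_bytes, estimated_load_bytes,
               limit_percent=40, overhead=1.03):
    """Simplified fielddata breaker check (assumption: parent breaker ignored).

    Returns (trips, limit_bytes). The estimate for the new load is
    multiplied by the overhead factor before being compared to the limit.
    """
    limit_bytes = heap_bytes * limit_percent // 100
    projected = current_bytes + int(estimated_load_bytes * overhead)
    return projected > limit_bytes, limit_bytes


# Hypothetical node: 8 GiB heap with 2 GiB of fielddata already loaded.
trips_small, limit = would_trip(8 * GIB, 2 * GIB, 1 * GIB)
trips_large, _ = would_trip(8 * GIB, 2 * GIB, 3 * GIB // 2)

print(f"limit: {limit / GIB:.2f} GiB")
print(f"loading 1.0 GiB trips the breaker: {trips_small}")
print(f"loading 1.5 GiB trips the breaker: {trips_large}")
```

On this model an 8 GiB heap leaves roughly 3.2 GiB for fielddata, so a node that already holds 2 GiB can absorb about one more gigabyte before loads start being rejected.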

When You Actually Need Fielddata

In practice, almost never. The right pattern is a multi-field mapping with a keyword sub-field for sorting and aggregations:

PUT /articles
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword", "ignore_above": 256 }
        }
      }
    }
  }
}

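You search against title (analyzed, scored) and aggregate or sort against title.raw (doc values, cheap). Use ignore_above to skip keyword indexing for extremely long values: with ignore_above: 256, strings longer than 256 characters are left out of title.raw, so they will not appear in aggregations or sorts on that sub-field, although they remain searchable through title.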
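A request that follows this split might look like the sketch below. It only builds the JSON body, which you would send to POST /articles/_search with whatever client you use; the function name build_title_search and the aggregation name top_titles are hypothetical.

```python
import json


def build_title_search(query_text, size=10):
    """Build a search body: full-text match on title, doc-values work on title.raw."""
    return {
        "size": size,
        # Analyzed, scored search runs against the text field.
        "query": {"match": {"title": query_text}},
        # Sorting reads doc values from the keyword sub-field.
        "sort": [{"title.raw": {"order": "asc"}}],
        # The terms aggregation also targets the keyword sub-field.
        "aggs": {
            "top_titles": {"terms": {"field": "title.raw", "size": 10}}
        },
    }


body = build_title_search("fielddata")
print(json.dumps(body, indent=2))
```

Nothing in this request touches fielddata, so it works with the default mapping above and adds no heap pressure.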

Reach for "fielddata": true only when:

  • You explicitly need to aggregate or sort on the analyzed tokens (not the raw text).
  • The cardinality of analyzed terms is bounded and small.
  • You've measured the heap impact and sized circuit breakers accordingly.

Enable it like this:

PUT /my-index/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "fielddata": true
    }
  }
}

Monitoring Fielddata Usage

The _cat/fielddata API shows current fielddata heap usage per field per node:

GET /_cat/fielddata?v&fields=*

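For programmatic monitoring, GET /_nodes/stats/indices/fielddata returns fielddata memory and eviction counts per node (add ?fields=* for a per-field breakdown). Circuit breaker trips are reported separately, in the fielddata section of GET /_nodes/stats/breaker.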

Signals of trouble:

  • Rising fielddata memory on a per-node basis without explanation.
  • CircuitBreakingException errors in the search response.
  • JVM old generation pressure correlated with query patterns that aggregate on text fields.
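The first two signals can be checked from node stats with a few lines of code. The sketch below works on a hand-written, heavily trimmed payload shaped like the indices.fielddata and breakers.fielddata sections of a node stats response; the node names, numbers, the summarize_fielddata helper, and the 75% warning ratio are all assumptions made for illustration.

```python
def summarize_fielddata(stats, warn_ratio=0.75):
    """Flag nodes whose fielddata breaker has tripped or is close to its limit."""
    rows = []
    for node in stats["nodes"].values():
        fielddata = node["indices"]["fielddata"]
        breaker = node["breakers"]["fielddata"]
        ratio = breaker["estimated_size_in_bytes"] / breaker["limit_size_in_bytes"]
        rows.append({
            "node": node["name"],
            "fielddata_mb": round(fielddata["memory_size_in_bytes"] / 1024 ** 2, 1),
            "evictions": fielddata["evictions"],
            "tripped": breaker["tripped"],
            "at_risk": breaker["tripped"] > 0 or ratio >= warn_ratio,
        })
    return rows


# Illustrative sample only; real responses contain many more fields.
sample_stats = {
    "nodes": {
        "id-a": {
            "name": "node-a",
            "indices": {"fielddata": {"memory_size_in_bytes": 734003200, "evictions": 0}},
            "breakers": {"fielddata": {"limit_size_in_bytes": 3435973836,
                                       "estimated_size_in_bytes": 734003200,
                                       "tripped": 0}},
        },
        "id-b": {
            "name": "node-b",
            "indices": {"fielddata": {"memory_size_in_bytes": 3145728000, "evictions": 0}},
            "breakers": {"fielddata": {"limit_size_in_bytes": 3435973836,
                                       "estimated_size_in_bytes": 3145728000,
                                       "tripped": 2}},
        },
    }
}

rows = summarize_fielddata(sample_stats)
for row in rows:
    print(row)
```

Run on a schedule and stored over time, a summary like this shows the trend rather than a single snapshot, which is what matters when fielddata grows slowly.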

Pulse tracks fielddata heap usage, circuit breaker trips, and the queries that load fielddata, and ties them back to the index mappings responsible. When a node starts trending toward an OOM because someone enabled fielddata: true on a high-cardinality text field, Pulse's automated root-cause analysis flags the offending mapping rather than just the symptom.

Common Mistakes with Fielddata

  1. Enabling fielddata because the error returned when you aggregate or sort on a text field suggests setting fielddata=true. The fix is usually to add a keyword sub-field, not to enable fielddata.
  2. Enabling fielddata on a high-cardinality text field (descriptions, comments, bodies) - this is the most common cause of OOMs traced to fielddata.
  3. Not setting eager_global_ordinals for fields that legitimately need fielddata for terms aggregations - the first query after a refresh pays the loading cost.
  4. Forgetting to disable fielddata after migrating to a keyword sub-field - the heap stays committed.
  5. Tuning the circuit breaker limit upward to avoid errors instead of fixing the mapping.
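Several of these mistakes come down to a text field that still has fielddata enabled somewhere in a mapping. A small audit can be sketched as a recursive walk over the properties section returned by GET /my-index/_mapping. The helper name find_fielddata_fields and the sample mapping are hypothetical.

```python
def find_fielddata_fields(properties, prefix=""):
    """Return dotted paths of text fields that have fielddata enabled."""
    hits = []
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if spec.get("type") == "text" and spec.get("fielddata") is True:
            hits.append(path)
        # Descend into object/nested properties and into multi-fields.
        hits.extend(find_fielddata_fields(spec.get("properties", {}), path + "."))
        hits.extend(find_fielddata_fields(spec.get("fields", {}), path + "."))
    return hits


# Illustrative mapping, shaped like the "properties" object of a mapping response.
sample_properties = {
    "title": {
        "type": "text",
        "fielddata": True,
        "fields": {"raw": {"type": "keyword", "ignore_above": 256}},
    },
    "author": {
        "properties": {
            "bio": {"type": "text", "fielddata": True},
            "name": {"type": "keyword"},
        }
    },
    "body": {"type": "text"},
}

flagged = find_fielddata_fields(sample_properties)
print(flagged)
```

Here title already has a keyword sub-field, so its fielddata flag is a likely leftover from a migration, while author.bio would need a sub-field added before the flag can be removed.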

Frequently Asked Questions

Q: What is the difference between fielddata and doc values in Elasticsearch?
A: Doc values are an on-disk columnar store built at index time, used by default for keyword, numeric, date, and geo fields. Fielddata is an on-heap structure built lazily at search time, used only for text fields because text fields don't have doc values.

Q: Why is fielddata disabled by default on text fields?
A: Loading fielddata for text fields requires holding every unique analyzed token and per-document term list in JVM heap. On a large index this can consume tens of GB and trigger GC storms or trip the circuit breaker. Elasticsearch disables it by default to prevent accidental cluster outages.

Q: How do I sort on a text field in Elasticsearch without enabling fielddata?
A: Add a keyword sub-field via multi-fields and sort against the sub-field. For example, title.raw of type keyword with ignore_above: 256. Sorting and aggregations then use doc values, which are fast and add no fielddata to the heap.

Q: What is the fielddata circuit breaker?
A: The fielddata circuit breaker (indices.breaker.fielddata.limit, default 40% of JVM heap) estimates the memory needed to load fielddata and rejects loads that would exceed the limit with a CircuitBreakingException. It exists to prevent fielddata from OOMing the node.

Q: How do I check fielddata memory usage in Elasticsearch?
A: Use GET /_cat/fielddata?v for a per-field summary or GET /_nodes/stats/indices/fielddata for full stats including eviction counts. Watch trends over time, not just instantaneous values.

Q: Was fielddata deprecated in Elasticsearch?
A: Fielddata itself is not deprecated, but its use on text fields is strongly discouraged. Since Elasticsearch 5.0, which split the old string type into text and keyword, fielddata has been disabled by default on text fields and multi-field keyword sub-fields are the recommended pattern. It's still supported because some text-analysis use cases (aggregating on analyzed tokens) genuinely require it.
