Elasticsearch Significant Terms Aggregation - Statistically Unusual Terms - Syntax, Example, and Tips

The Elasticsearch significant_terms aggregation is a multi-bucket aggregation that returns terms whose frequency in a foreground set (matched by the query) is statistically higher than expected from the background set (typically the whole index). The default scoring heuristic is JLH, which balances the absolute increase in frequency against the relative ratio. Use it for "what is different about this subset" questions - root-cause analysis, surfacing co-occurring entities, or anomaly detection on textual fields.

Syntax

GET /ecommerce/_search
{
  "query": { "range": { "total_amount": { "gte": 1000 } } },
  "size": 0,
  "aggs": {
    "significant_products": {
      "significant_terms": {
        "field": "product_name.keyword",
        "size": 10
      }
    }
  }
}

Each returned bucket includes doc_count (foreground documents containing the term), bg_count (background documents containing the term), and a score. The query defines the foreground; the index (or a background_filter) defines the background.
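
To make the response fields concrete, here is a minimal Python sketch that turns a response into per-term foreground and background rates. The response body is hypothetical: the key names follow the shape described above (including the aggregation-level doc_count and bg_count that give the sizes of the two sets), but every count and score is invented for illustration, and the helper name summarize is an assumption, not part of any client library.

```python
# Hypothetical response body: key names follow the response shape described
# above, but all counts and scores are invented for illustration.
sample_response = {
    "aggregations": {
        "significant_products": {
            "doc_count": 2000,    # documents matched by the query (foreground size)
            "bg_count": 100000,   # documents in the background set
            "buckets": [
                {"key": "Espresso Machine", "doc_count": 400, "score": 3.8, "bg_count": 1000},
                {"key": "Gift Card", "doc_count": 100, "score": 0.05, "bg_count": 2500},
            ],
        }
    }
}


def summarize(response, agg_name):
    """Return (term, foreground rate, background rate, score) per bucket."""
    agg = response["aggregations"][agg_name]
    fg_size, bg_size = agg["doc_count"], agg["bg_count"]
    return [
        (b["key"], b["doc_count"] / fg_size, b["bg_count"] / bg_size, b["score"])
        for b in agg["buckets"]
    ]


for term, fg_rate, bg_rate, score in summarize(sample_response, "significant_products"):
    print(f"{term}: {fg_rate:.1%} of foreground vs {bg_rate:.1%} of background (score {score})")
```

Comparing the two rates side by side is usually easier to reason about than the raw score, which is only meaningful as a ranking.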

Parameters

Parameter             Default             Description
field                 required            Field to bucket on: keyword or integer-like numeric. Floating-point fields are not supported.
size                  10                  Number of top-scoring buckets returned.
shard_size            -1 (automatic)      Per-shard candidate set forwarded to the coordinator. By default it is estimated from size and the number of shards.
min_doc_count         3                   Foreground doc count threshold. Lower = noisier.
shard_min_doc_count   0                   Per-shard equivalent.
background_filter     none (uses index)   Restrict the background corpus.
include / exclude     -                   Regex or value list to filter candidate terms.

Scoring heuristic: set one of mutual_information, chi_square, gnd, percentage, or script_heuristic to replace the default JLH.

min_doc_count: 3 (default) is set to avoid surfacing single-document false positives.

Examples

Significant products in high-value orders, with a background filter restricting the comparison corpus:

"aggs": {
  "products": {
    "significant_terms": {
      "field": "product.keyword",
      "background_filter": { "term": { "store_country": "US" } }
    }
  }
}

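Use mutual information instead of JLH (it tends to favor high-frequency terms, even ones that are also common in the background). background_is_superset is true by default; set it to false when a background_filter excludes the foreground documents: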

"significant_terms": {
  "field": "category.keyword",
  "mutual_information": { "background_is_superset": true }
}

Significant tags inside each top product (used as a sub-aggregation, so each product bucket becomes its own foreground set):

"aggs": {
  "by_product": {
    "terms": { "field": "product.keyword", "size": 10 },
    "aggs": {
      "tags": {
        "significant_terms": { "field": "tag.keyword", "size": 5 }
      }
    }
  }
}

Scoring and Performance Notes

The default JLH score rewards terms that are both far more frequent in the foreground than in the background in relative terms and meaningfully frequent in absolute terms - it deliberately avoids returning extremely rare terms that happen to appear once. Alternative heuristics suit different priorities:

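Heuristic             Bias
JLH (default)         Balances absolute and relative frequency increase.
mutual_information    Favors high-frequency terms, even ones also common in the background; unlikely to surface very rare terms.
chi_square            Measures statistical significance of the association; says little about effect size.
gnd                   Google Normalized Distance. Favors strong co-occurrence, but can surface very rare terms.
percentage            Simple ratio of a term's foreground count to its background count; favors rare terms.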
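
The difference between heuristics is easier to see with numbers. The sketch below is an approximation under stated assumptions: it models JLH as the absolute change in term rate multiplied by the relative change, and percentage as foreground count divided by background count. The function names and the counts are illustrative, and the real implementation may differ in edge-case handling between versions.

```python
def jlh_score(fg_count, fg_size, bg_count, bg_size):
    """Assumed JLH form: (fg_rate - bg_rate) * (fg_rate / bg_rate)."""
    if fg_size == 0 or bg_size == 0 or bg_count == 0:
        return 0.0
    fg_rate = fg_count / fg_size
    bg_rate = bg_count / bg_size
    absolute_change = fg_rate - bg_rate
    if absolute_change <= 0:
        return 0.0  # terms that are not more frequent in the foreground score zero
    relative_change = fg_rate / bg_rate
    return absolute_change * relative_change


def percentage_score(fg_count, bg_count):
    """Share of the term's background documents that fall in the foreground."""
    return fg_count / bg_count if bg_count else 0.0


# Invented counts: (fg_count, fg_size, bg_count, bg_size)
frequent_term = (400, 2000, 1000, 100000)  # common in the foreground
rare_term = (3, 2000, 3, 100000)           # appears only three times, all in the foreground

for label, (fg, fg_size, bg, bg_size) in [("frequent", frequent_term), ("rare", rare_term)]:
    print(label, jlh_score(fg, fg_size, bg, bg_size), percentage_score(fg, bg))
```

With these invented counts, JLH ranks the frequent term well above the rare one, while percentage ranks the rare term first because all of its occurrences fall in the foreground. That is the practical meaning of "favors rare terms" in the table.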

background_filter is the lever to control what "normal" looks like. Without it, the background is the whole index, which is often too broad when the index is heterogeneous - filter to comparable documents (same product line, same time window) so the foreground is contrasted against a meaningful baseline.

The aggregation scans foreground and background frequencies per shard, so its cost is similar to a terms aggregation plus a background lookup. On very high-cardinality fields the candidate set forwarded to the coordinator can grow large; raise min_doc_count or shard_min_doc_count to keep memory bounded. A background_filter adds cost of its own, because each candidate term's background frequency has to be computed by filtering its postings rather than read from index statistics. Reading slow logs to find which significant_terms queries are dominating cluster cost - and which would return more useful results with a scoped background_filter - is exactly the loop Pulse runs continuously.

Common Mistakes

  1. Running significant_terms without a background_filter when the index spans heterogeneous tenants or product lines - results are dominated by background noise.
  2. Setting min_doc_count: 1 to "see more" and letting spurious one-off matches dominate the result.
  3. Using significant_terms on near-unique fields (IDs). Nothing is statistically significant if every value occurs once.
  4. Confusing JLH score with a confidence level. It is a ranking, not a p-value.
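  5. Pointing at a text field instead of its .keyword sub-field. The request is rejected unless fielddata is enabled on that field, and enabling it loads fielddata onto the heap.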
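
Several of the mistakes above can be caught by inspecting the request body before it is sent. The following is a heuristic sketch, not an official tool: lint_significant_terms is a hypothetical helper, and the .keyword check relies only on a naming convention, so a warning there means "verify the mapping" rather than "this is wrong". Near-unique fields (mistake 3) cannot be detected from the request alone.

```python
def lint_significant_terms(aggs, path=""):
    """Walk an aggs dict and collect warnings for each significant_terms block."""
    warnings = []
    for name, body in aggs.items():
        here = f"{path}/{name}"
        sig = body.get("significant_terms")
        if sig is not None:
            if "background_filter" not in sig:
                warnings.append(f"{here}: no background_filter, background is the whole index")
            if sig.get("min_doc_count", 3) < 3:
                warnings.append(f"{here}: min_doc_count below the default of 3")
            if not sig.get("field", "").endswith(".keyword"):
                warnings.append(f"{here}: field lacks a .keyword suffix, check the mapping")
        # Recurse into sub-aggregations under either accepted key.
        sub_aggs = body.get("aggs") or body.get("aggregations")
        if sub_aggs:
            warnings.extend(lint_significant_terms(sub_aggs, here))
    return warnings


# Example request with a nested significant_terms that hits three of the checks.
request = {
    "size": 0,
    "aggs": {
        "by_product": {
            "terms": {"field": "product.keyword", "size": 10},
            "aggs": {
                "tags": {"significant_terms": {"field": "tag", "min_doc_count": 1}}
            },
        }
    },
}

for warning in lint_significant_terms(request["aggs"]):
    print(warning)
```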

Find Slow significant_terms Queries with Pulse

Pulse is an AI DBA for Elasticsearch and OpenSearch that continuously profiles production query traffic. For significant_terms aggregations specifically, Pulse:

  • Identifies significant_terms queries running without a background_filter on indices that span heterogeneous tenants or product lines, where results are dominated by background noise and the foreground scan is wasted
  • Flags significant_terms with min_doc_count: 1 or low shard_min_doc_count, where the candidate set forwarded to the coordinator grows large enough to trip the request circuit breaker
  • Detects significant_terms running on near-unique fields (IDs, session UUIDs), where no term can be statistically significant and the aggregation is pure overhead
  • Spots fielddata loading triggered by significant_terms on analyzed text fields
  • Traces each slow significant_terms back to the calling service via slow-log and APM correlation
  • Recommends concrete fixes: add a meaningful background_filter, raise min_doc_count, switch from JLH to mutual_information or chi_square when ranking criteria need to change, move to a .keyword sub-field, or replace the aggregation with significant_text on natural-language fields
  • Tracks coordinator memory, latency, and result quality after the change ships

This converts the manual significance-tuning loop into a continuous optimization workflow.

Try Pulse on your cluster.

Frequently Asked Questions

Q: How does significant_terms differ from terms aggregation?
A: The terms aggregation returns the most frequent values. The significant_terms aggregation returns values that are more frequent in the foreground subset than expected from the background corpus, ranking by statistical surprise rather than raw count.

Q: What is the default scoring algorithm?
A: JLH, which combines absolute and relative frequency change. Because both components count toward the score, it does not over-reward extremely rare terms.

Q: What does background_filter do?
A: It restricts the background corpus the foreground is compared against. Without it the background is the whole index, which often produces uninteresting results when the index is heterogeneous.

Q: Can significant_terms be used on numeric fields?
A: Yes for integer-like fields with bounded cardinality. Floating-point fields are not supported; for continuous numerics, bucket the values into ranges first. For analyzed natural-language text, use the purpose-built significant_text aggregation instead.

Q: How are multi-valued fields handled?
A: Each distinct value in a multi-valued field is treated as a separate observation. A document with three different tags adds one to the document count of each of those three tags.
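
A small Python sketch of that counting rule, using invented documents and a hypothetical helper name. It assumes that a value repeated within one document still counts that document only once, since the counts are document counts:

```python
from collections import Counter


def term_doc_counts(docs, field):
    """Count, per term, how many documents contain it at least once."""
    counts = Counter()
    for doc in docs:
        values = doc.get(field, [])
        if not isinstance(values, list):
            values = [values]  # treat a single value as a one-element list
        counts.update(set(values))  # set(): one increment per document per term
    return counts


# Invented documents for illustration.
docs = [
    {"tag": ["gift", "kitchen", "sale"]},  # adds 1 to each of three tags
    {"tag": "gift"},                       # single-valued document
    {"tag": ["gift", "gift"]},             # duplicate value counts once
]

print(term_doc_counts(docs, "tag"))
```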

Q: Why does my significant_terms return common stopwords?
A: You are likely running it on an analyzed text field, or on a field where stopwords dominate both foreground and background. Use a normalized keyword field, or pre-filter with exclude.

Q: How do I find significant_terms queries that are dominating cluster cost?
A: Pulse profiles Elasticsearch and OpenSearch slow logs, isolates significant_terms queries without a background_filter or with low min_doc_count flooding the coordinator, attributes each to the calling service, and recommends background_filter scoping, threshold raises, or .keyword sub-field migration.
