Elasticsearch Rare Terms Aggregation - Finding Infrequent Values - Syntax, Example, and Tips

The Elasticsearch rare_terms aggregation is a multi-bucket aggregation that returns terms appearing in at most max_doc_count of the matched documents. It answers "find values that are rare", which the terms aggregation cannot answer reliably: sorting terms by ascending count is discouraged in the Elasticsearch documentation because per-shard top-N truncation makes the error on document counts unbounded. rare_terms instead uses a probabilistic approach (a CuckooFilter under the hood) to identify rare values across all shards.

Syntax

GET /web_logs/_search
{
  "size": 0,
  "aggs": {
    "rare_user_agents": {
      "rare_terms": {
        "field": "user_agent.keyword",
        "max_doc_count": 1,
        "precision": 0.001
      }
    }
  }
}

The result is a bucket list of terms whose global document count is at most max_doc_count.
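As a rough illustration of consuming that bucket list, the sketch below pulls (key, doc_count) pairs out of a search response that has already been parsed into a Python dict. The helper name rare_buckets and the sample values are hypothetical; only the aggregations / buckets / key / doc_count layout is taken from the standard bucket-aggregation response shape.

```python
from typing import Any


def rare_buckets(response: dict[str, Any], agg_name: str) -> list[tuple[Any, int]]:
    """Return (key, doc_count) pairs for one rare_terms aggregation.

    Missing sections yield an empty list instead of raising KeyError.
    """
    buckets = response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    return [(b["key"], b["doc_count"]) for b in buckets]


# Hypothetical response body, trimmed to the fields used here.
sample_response = {
    "aggregations": {
        "rare_user_agents": {
            "buckets": [
                {"key": "ExampleCrawler/0.1", "doc_count": 1},
                {"key": "curl/7.0-test", "doc_count": 1},
            ]
        }
    }
}

for key, count in rare_buckets(sample_response, "rare_user_agents"):
    print(f"{key}: {count}")
```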

Parameters

Parameter | Default | Description
field | required | Field to bucket on. Must be keyword, numeric, or ip.
max_doc_count | 1 | Upper bound (inclusive) on the document count of returned terms. Hard limit is 100.
precision | 0.001 | CuckooFilter precision. Lower = fewer rare terms wrongly omitted, more memory. Cannot be smaller than 0.00001.
missing | - | Value to bucket documents under when the field is absent.
include / exclude | - | Regex or exact-value list to filter terms before applying the rarity filter.

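The hard limit of max_doc_count <= 100 is an Elasticsearch safeguard. rare_terms has no size parameter, so every term at or under the threshold is returned; a high threshold could return an enormous number of buckets, and "rare" loses meaning past that point.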
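A client can catch out-of-range values before the request is sent. The following is a minimal sketch, not an official client API: build_rare_terms is a hypothetical helper, and treating max_doc_count below 1 as invalid is an assumption of this sketch, while the upper limit of 100 and the precision floor of 0.00001 are the ones described above.

```python
MAX_MAX_DOC_COUNT = 100   # hard limit described above
MIN_PRECISION = 0.00001   # smallest allowed precision value


def build_rare_terms(field: str, max_doc_count: int = 1, precision: float = 0.001) -> dict:
    """Build the body of a rare_terms aggregation, validating limits client-side."""
    # Assumption of this sketch: values below 1 are rejected as meaningless.
    if not 1 <= max_doc_count <= MAX_MAX_DOC_COUNT:
        raise ValueError(f"max_doc_count must be between 1 and {MAX_MAX_DOC_COUNT}")
    if precision < MIN_PRECISION:
        raise ValueError(f"precision must not be smaller than {MIN_PRECISION}")
    return {
        "rare_terms": {
            "field": field,
            "max_doc_count": max_doc_count,
            "precision": precision,
        }
    }


print(build_rare_terms("user_agent.keyword", max_doc_count=5))
```

Failing fast on the client keeps a typo such as max_doc_count: 1000 from turning into a rejected search request.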

Examples

User agents containing "Bot" that appear in 5 or fewer documents:

"rare_terms": {
  "field": "user_agent.keyword",
  "max_doc_count": 5,
  "include": ".*Bot.*"
}

Rare error codes with a metric sub-aggregation:

"aggs": {
  "rare_errors": {
    "rare_terms": { "field": "error.code", "max_doc_count": 3 },
    "aggs": {
      "first_seen": { "min": { "field": "@timestamp" } }
    }
  }
}

Tighter precision when missing a rare term is unacceptable (more memory):

"rare_terms": {
  "field": "session_id",
  "max_doc_count": 1,
  "precision": 0.00001
}

Precision and Memory Notes

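rare_terms uses a CuckooFilter sized by precision to remember terms that have already exceeded max_doc_count. On each shard, terms are counted in a map; once a term's count passes the threshold, it is removed from the map and added to the filter, and later occurrences are skipped. The per-shard maps and filters are then merged, and any term that exceeds the threshold after merging, or that appears in another shard's filter, is dropped. Because the filter can report a term as present when it is not, a rare term can be mistaken for a common one and omitted. The aggregation's errors are therefore false negatives (rare terms missing from the result), not common terms reported as rare. A lower precision value reduces that false-negative rate at the cost of more memory per shard. The default of 0.001 keeps memory modest while omitting only a small fraction of rare terms.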

In the documented algorithm, the filter only decides which terms are excluded as too common; the doc_count of each returned bucket comes from the counting maps, so the approximation affects which terms are returned rather than how they are counted. Memory cost scales with the number of unique values in the field, not with max_doc_count. For very high-cardinality fields (millions of unique values), rare_terms can be memory-heavy even at default precision.
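The map-plus-filter idea can be sketched in a few lines. This is a toy model under stated assumptions, not Elasticsearch's implementation: a plain Python set stands in for the CuckooFilter (so it never produces the false positives a real filter can), each list element represents one document's value, and the function names are hypothetical.

```python
from collections import Counter


def collect_shard(terms, max_doc_count):
    """Count terms on one shard; move over-threshold terms into a 'filter'."""
    counts = {}
    too_common = set()  # exact stand-in for the probabilistic CuckooFilter
    for term in terms:
        if term in too_common:
            continue  # already known to be above the threshold
        counts[term] = counts.get(term, 0) + 1
        if counts[term] > max_doc_count:
            del counts[term]
            too_common.add(term)
    return counts, too_common


def merge_shards(shard_results, max_doc_count):
    """Merge per-shard maps and filters, keeping only globally rare terms."""
    merged = Counter()
    common = set()
    for counts, too_common in shard_results:
        merged.update(counts)
        common |= too_common
    return {
        term: count
        for term, count in merged.items()
        if count <= max_doc_count and term not in common
    }


shard_a = ["a", "b", "b", "c"]
shard_b = ["a", "d", "b"]
results = [collect_shard(shard, 1) for shard in (shard_a, shard_b)]
print(merge_shards(results, 1))
```

With max_doc_count set to 1, "a" is dropped because its merged count is 2, and "b" is dropped because shard_a already placed it in its filter, even though shard_b saw it only once. Only "c" and "d" survive. A real CuckooFilter would occasionally also drop a term like "c" by mistake, which is the false-negative behavior controlled by precision.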

rare_terms aggregations on fields like session_id, request_id, or other near-unique identifiers are operational hazards - nearly every value is rare, so the result is huge. Pulse monitors Elasticsearch aggregation costs and flags clusters where heap pressure correlates with high-cardinality probabilistic aggregations.

Common Mistakes

  1. Running rare_terms on near-unique fields (request IDs, sessions). Nearly all values qualify and the result blows up.
  2. Trying to sort a regular terms aggregation by ascending count to find rare values. The result is not reliable across shards - use rare_terms.
  3. Setting max_doc_count above 100. The request fails - the parameter is capped.
  4. Tightening precision aggressively without checking memory impact.
  5. Ignoring include/exclude. Pre-filtering values cuts memory and makes results actionable.
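One way to avoid the first mistake is to probe the field's cardinality before running rare_terms. The sketch below is an illustration only: it builds a request body using the cardinality aggregation (which is itself approximate) and compares the distinct-value estimate with the total hit count. The helper names, the aggregation name distinct_values, and the 0.5 ratio threshold are all hypothetical choices, not recommendations from Elasticsearch.

```python
def cardinality_probe(field: str) -> dict:
    """Request body that estimates how many distinct values a field has."""
    return {
        "size": 0,
        "track_total_hits": True,  # ask for an exact total instead of a capped one
        "aggs": {"distinct_values": {"cardinality": {"field": field}}},
    }


def looks_near_unique(response: dict, max_ratio: float = 0.5) -> bool:
    """True when distinct values make up more than max_ratio of matched docs."""
    total = response["hits"]["total"]["value"]
    distinct = response["aggregations"]["distinct_values"]["value"]
    return total > 0 and distinct / total > max_ratio


# Hypothetical responses, trimmed to the fields used above.
request_id_probe = {
    "hits": {"total": {"value": 1000}},
    "aggregations": {"distinct_values": {"value": 990}},
}
error_code_probe = {
    "hits": {"total": {"value": 1000}},
    "aggregations": {"distinct_values": {"value": 40}},
}

print(looks_near_unique(request_id_probe))
print(looks_near_unique(error_code_probe))
```

A field that trips this check is a poor fit for rare_terms without include/exclude filtering or a narrower query.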

Frequently Asked Questions

Q: How does rare_terms differ from terms aggregation?
A: The terms aggregation returns the most frequent terms; sorting it by ascending count is discouraged because the error on document counts becomes unbounded. The rare_terms aggregation is purpose-built for finding terms that appear in at most max_doc_count documents, using a CuckooFilter to track terms that are too common across shards.

Q: What is the maximum max_doc_count?
A: 100. Above that, the request is rejected. The aggregation is designed for genuinely rare values, not arbitrary low-frequency thresholds.

Q: Are the returned counts exact?
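A: The doc_count values of returned buckets come from per-shard counting maps that are merged before results are returned, not from the filter. The approximation shows up elsewhere: the CuckooFilter can cause a genuinely rare term to be omitted (a false negative), so the list of buckets may be incomplete.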

Q: Can rare_terms be used on numeric fields?
A: Yes, on integer and date fields. For floats, exact equality is rarely meaningful, so it is usually better to bucket numerically first.

Q: How do I reduce memory usage on a high-cardinality field?
A: Apply include/exclude patterns to narrow the candidate set, scope the query to a smaller time window, or raise the precision value (a larger value means a smaller filter) if omitting some rare terms is acceptable.

Q: Can I use rare_terms for anomaly detection?
A: Yes, especially combined with a sub-aggregation like min/max on @timestamp to flag newly-appearing rare values. Pair with alerting on the count of rare buckets returned.
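As a sketch of that pattern, the code below filters rare buckets down to those whose first_seen sub-aggregation falls inside a recent window. It assumes the sub-aggregation is named first_seen as in the earlier example and that the min aggregation on a date field reports its value in epoch milliseconds; the helper name, window size, and sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_epoch_ms(moment: datetime) -> float:
    """Convert an aware datetime to epoch milliseconds."""
    return moment.timestamp() * 1000


def newly_seen(buckets: list, now: datetime, window: timedelta) -> list:
    """Return keys of rare buckets first seen within `window` before `now`."""
    cutoff_ms = to_epoch_ms(now - window)
    return [
        bucket["key"]
        for bucket in buckets
        # Skip buckets where the min aggregation returned no value.
        if bucket["first_seen"]["value"] is not None
        and bucket["first_seen"]["value"] >= cutoff_ms
    ]


now = datetime(2024, 1, 2, tzinfo=timezone.utc)
# Hypothetical buckets from a rare_terms aggregation with a first_seen sub-aggregation.
sample_buckets = [
    {"key": "E_NEW", "doc_count": 1,
     "first_seen": {"value": to_epoch_ms(now - timedelta(hours=1))}},
    {"key": "E_OLD", "doc_count": 2,
     "first_seen": {"value": to_epoch_ms(now - timedelta(days=3))}},
]

print(newly_seen(sample_buckets, now, timedelta(hours=24)))
```

An alerting rule could then fire on the length of the returned list rather than on the raw bucket count, so long-standing rare values do not trigger repeated alerts.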
