The Elasticsearch rare_terms aggregation is a multi-bucket aggregation that returns terms appearing at most max_doc_count times across the matched documents. It answers the question "which values are rare?", which the terms aggregation cannot answer reliably: the Elasticsearch documentation discourages sorting terms by ascending count because each shard returns only its own top-N terms, so the error on the merged counts is unbounded. rare_terms instead uses a probabilistic approach (a cuckoo filter under the hood) to identify rare values across all shards.
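To make the cross-shard problem concrete, here is a small simulation in plain Python. It is a toy model and not Elasticsearch code: the shard contents, the `shard_size` value, and both function names are made up for illustration. Each "shard" reports only its lowest-count terms, and the coordinator sums what it receives.

```python
from collections import Counter

def rarest_via_shard_truncation(shards, shard_size):
    """Mimic a terms aggregation ordered by ascending count.

    Each shard returns only its `shard_size` lowest-count terms, and the
    coordinator sums whatever it receives.
    """
    merged = Counter()
    for docs in shards:
        local = Counter(docs)
        # Lowest local counts first; ties are broken by term for determinism.
        lowest = sorted(local.items(), key=lambda kv: (kv[1], kv[0]))[:shard_size]
        for term, count in lowest:
            merged[term] += count
    return dict(merged)

def true_counts(shards):
    """Exact global counts, for comparison."""
    total = Counter()
    for docs in shards:
        total.update(docs)
    return dict(total)

# Hypothetical data: "curl" looks rare on shard 0 but is common on shard 1.
shards = [
    ["curl"] + ["chrome"] * 5,
    ["curl"] * 50 + ["chrome"] * 60 + ["oddbot"],
]

approx = rarest_via_shard_truncation(shards, shard_size=1)
exact = true_counts(shards)
print(approx)         # {'curl': 1, 'oddbot': 1}
print(exact["curl"])  # 51
```

The truncated view reports "curl" with a count of 1 even though it appears 51 times overall, because shard 1 never included it in its own lowest-count list.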
Syntax
GET /web_logs/_search
{
  "size": 0,
  "aggs": {
    "rare_user_agents": {
      "rare_terms": {
        "field": "user_agent.keyword",
        "max_doc_count": 1,
        "precision": 0.001
      }
    }
  }
}
The result is a bucket list of terms whose global document count is at most max_doc_count.
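As a rough illustration of consuming that bucket list, the sketch below pulls (term, doc_count) pairs out of a search response that has already been parsed into a Python dict. The helper name `rare_buckets` and the sample values are hypothetical; the `aggregations` / `buckets` / `key` / `doc_count` layout is the standard shape of a bucket aggregation response.

```python
def rare_buckets(response, agg_name):
    """Return (term, doc_count) pairs from a rare_terms aggregation response."""
    buckets = (
        response.get("aggregations", {})
        .get(agg_name, {})
        .get("buckets", [])
    )
    return [(b["key"], b["doc_count"]) for b in buckets]

# Hypothetical response for the request above (values are invented).
sample_response = {
    "hits": {"total": {"value": 1000, "relation": "eq"}, "hits": []},
    "aggregations": {
        "rare_user_agents": {
            "buckets": [
                {"key": "ExampleCrawler/0.1", "doc_count": 1},
                {"key": "legacy-client/2.3", "doc_count": 1},
            ]
        }
    },
}

pairs = rare_buckets(sample_response, "rare_user_agents")
for term, count in pairs:
    print(f"{term}: {count}")
```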
Parameters
| Parameter | Default | Description |
|---|---|---|
| field | required | Field to bucket on. Must be keyword, numeric, or ip. |
| max_doc_count | 1 | Upper bound (inclusive) on the document count of returned terms. Hard limit is 100. |
| precision | 0.001 | Precision of the internal cuckoo filter. Lower = fewer filter false positives (fewer rare terms missed), more memory. Range 0.00001 to 0.1. |
| missing | - | Bucket value used when the field is absent. |
| include / exclude | - | Regex or exact-value list to filter terms before applying the rarity filter. |
The hard limit of max_doc_count <= 100 is an Elasticsearch safeguard - "rare" loses meaning past that, and memory cost grows.
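A client-side guard can catch out-of-range values before a request is sent. The following is a minimal sketch: the helper name `build_rare_terms` and the constants are hypothetical, and the bounds are copied from the parameter table above rather than read from Elasticsearch itself.

```python
MAX_DOC_COUNT_LIMIT = 100          # hard limit stated in the parameter table
PRECISION_RANGE = (0.00001, 0.1)   # range stated in the parameter table

def build_rare_terms(field, max_doc_count=1, precision=0.001,
                     include=None, exclude=None):
    """Return a {"rare_terms": {...}} clause, rejecting bad values early."""
    if not isinstance(max_doc_count, int) or not 1 <= max_doc_count <= MAX_DOC_COUNT_LIMIT:
        raise ValueError(
            f"max_doc_count must be an integer from 1 to {MAX_DOC_COUNT_LIMIT}"
        )
    low, high = PRECISION_RANGE
    if not low <= precision <= high:
        raise ValueError(f"precision must be between {low} and {high}")

    clause = {
        "field": field,
        "max_doc_count": max_doc_count,
        "precision": precision,
    }
    # Optional pre-filters are only added when the caller supplies them.
    if include is not None:
        clause["include"] = include
    if exclude is not None:
        clause["exclude"] = exclude
    return {"rare_terms": clause}

agg = build_rare_terms("user_agent.keyword", max_doc_count=5, include=".*Bot.*")
print(agg)
```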
Examples
User agents containing "Bot" that appear in 5 or fewer documents:
"rare_terms": {
  "field": "user_agent.keyword",
  "max_doc_count": 5,
  "include": ".*Bot.*"
}
Rare error codes with a metric sub-aggregation:
"aggs": {
  "rare_errors": {
    "rare_terms": { "field": "error.code", "max_doc_count": 3 },
    "aggs": {
      "first_seen": { "min": { "field": "@timestamp" } }
    }
  }
}
Tighter precision when false positives are unacceptable (more memory):
"rare_terms": {
  "field": "session_id",
  "max_doc_count": 1,
  "precision": 0.00001
}
Precision and Memory Notes
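rare_terms uses a cuckoo filter sized by precision. Each shard keeps exact counts for candidate terms and moves any term that exceeds max_doc_count into the filter, so later occurrences of that term are skipped. Because the filter can return false positives, a genuinely rare term is occasionally mistaken for a common one and left out of the results. A lower precision value reduces that error at the cost of more memory per shard. The default 0.001 keeps memory modest while missing only a small fraction of rare terms. Shard results are merged on the coordinating node, which discards any term whose combined count exceeds max_doc_count.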
Despite the probabilistic plumbing, the returned counts in each bucket are exact - the filter is used only to decide which terms are candidates. Memory cost scales with the number of unique values in the field, not with max_doc_count. For very high-cardinality fields (millions of unique values), rare_terms can be memory-heavy even at default precision.
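The separation between exact counting and the probabilistic filter can be sketched in a few lines. This is a simplified model loosely based on the approach described in the Elasticsearch documentation, not the actual implementation: a plain Python set stands in for the probabilistic filter, and the function names and sample values are invented.

```python
from collections import Counter

def shard_pass(values, max_doc_count):
    """One shard: exact counts for candidates, a set for terms over the limit."""
    counts = {}
    too_common = set()  # stand-in for the probabilistic filter (assumption)
    for value in values:
        if value in too_common:
            continue  # already known to exceed the threshold on this shard
        counts[value] = counts.get(value, 0) + 1
        if counts[value] > max_doc_count:
            # Stop tracking the exact count; remember only "too common".
            del counts[value]
            too_common.add(value)
    return counts, too_common

def merge(shard_results, max_doc_count):
    """Coordinator: sum candidate counts, drop anything any shard saw as common."""
    all_common = set()
    merged = Counter()
    for counts, too_common in shard_results:
        all_common |= too_common
        merged.update(counts)
    return {
        term: count
        for term, count in merged.items()
        if term not in all_common and count <= max_doc_count
    }

shard_values = [["a", "a", "b", "c"], ["b", "d", "a"]]
results = [shard_pass(values, max_doc_count=1) for values in shard_values]
rare = merge(results, max_doc_count=1)
print(rare)  # {'c': 1, 'd': 1}
```

The counts that survive come from the exact per-shard maps, which is why they are not approximations. A real probabilistic filter differs from the set used here in one important way: it can report membership for a term that was never added, and in this scheme that would drop a term from the output rather than distort its count.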
rare_terms aggregations on fields like session_id, request_id, or other near-unique identifiers are operational hazards - nearly every value is rare, so the result is huge. Pulse monitors Elasticsearch aggregation costs and flags clusters where heap pressure correlates with high-cardinality probabilistic aggregations.
Common Mistakes
- Running rare_terms on near-unique fields (request IDs, sessions). Nearly all values qualify and the result blows up.
- Trying to sort a regular terms aggregation by ascending count to find rare values. The result is not reliable across shards - use rare_terms.
- Setting max_doc_count above 100. The request fails - the parameter is capped.
- Tightening precision aggressively without checking memory impact.
- Ignoring include/exclude. Pre-filtering values cuts memory and makes results actionable.
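One way to avoid the first mistake is a pre-flight check with the cardinality aggregation before running rare_terms. The sketch below only builds the request body and applies a ratio test; the helper names, the aggregation name `distinct_values`, and the 0.5 threshold are assumptions chosen for illustration, not recommended values.

```python
def cardinality_request(field):
    """Search body that estimates the number of distinct values in `field`."""
    return {
        "size": 0,
        "track_total_hits": True,  # ask for an exact total hit count
        "aggs": {"distinct_values": {"cardinality": {"field": field}}},
    }

def looks_near_unique(distinct_estimate, total_docs, ratio_threshold=0.5):
    """True when most documents carry their own value for the field.

    The threshold is an arbitrary illustrative default; tune it per dataset.
    """
    if total_docs <= 0:
        return False
    return distinct_estimate / total_docs >= ratio_threshold

body = cardinality_request("request_id")
print(body)

# Hypothetical numbers, as they might be read from the response.
print(looks_near_unique(950_000, 1_000_000))  # True
print(looks_near_unique(1_200, 1_000_000))    # False
```

If the check returns True, nearly every value would qualify as rare, so narrowing with include/exclude or picking a different field is likely the better move.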
Frequently Asked Questions
Q: How does rare_terms differ from terms aggregation?
A: The terms aggregation returns the most frequent terms; sorting it by ascending count is discouraged because the error on the merged counts is unbounded. The rare_terms aggregation is purpose-built for finding terms that appear at most max_doc_count times, using a cuckoo filter to handle cross-shard rarity.
Q: What is the maximum max_doc_count?
A: 100. Above that, the request is rejected. The aggregation is designed for genuinely rare values, not arbitrary low-frequency thresholds.
Q: Are the returned counts exact?
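A: Yes. The cuckoo filter is only used to decide which terms are too common to keep tracking; the doc_count values of returned buckets are exact once shard results are merged on the coordinating node. The approximation shows up as rare terms that are occasionally missing, not as wrong counts.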
Q: Can rare_terms be used on numeric fields?
A: Yes, on integer and date fields. For floats, exact equality is rarely meaningful, so it is usually better to bucket numerically first.
Q: How do I reduce memory usage on a high-cardinality field?
A: Apply include/exclude patterns to narrow the candidate set, scope the query to a smaller time window, or raise precision toward 0.1 if missing some rare terms is acceptable.
Q: Can I use rare_terms for anomaly detection?
A: Yes, especially combined with a sub-aggregation like min/max on @timestamp to flag newly-appearing rare values. Pair with alerting on the count of rare buckets returned.
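As a sketch of that anomaly-detection pattern, the snippet below filters rare buckets down to those first seen after a cutoff. It assumes each bucket carries a min sub-aggregation named `first_seen` (as in the rare error codes example) whose `value` is in epoch milliseconds, which is how date values are returned by the min aggregation; the helper name and sample buckets are hypothetical.

```python
def newly_seen(buckets, since_epoch_ms):
    """Return keys of rare buckets whose first occurrence is at or after the cutoff."""
    fresh = []
    for bucket in buckets:
        first_seen = bucket.get("first_seen", {}).get("value")
        # Skip buckets without a usable timestamp instead of failing.
        if first_seen is not None and first_seen >= since_epoch_ms:
            fresh.append(bucket["key"])
    return fresh

# Invented buckets shaped like a rare_terms response with a min sub-aggregation.
sample_buckets = [
    {"key": "E1042", "doc_count": 2, "first_seen": {"value": 1_700_000_500_000.0}},
    {"key": "E0007", "doc_count": 1, "first_seen": {"value": 1_600_000_000_000.0}},
    {"key": "E9999", "doc_count": 1, "first_seen": {"value": None}},
]

cutoff_ms = 1_700_000_000_000
alerts = newly_seen(sample_buckets, cutoff_ms)
print(alerts)  # ['E1042']
```

An alert rule could then fire when the returned list is non-empty, or when its length exceeds a baseline.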
Related Reading
- Terms Aggregation: the high-frequency counterpart.
- Significant Terms Aggregation: terms over-represented in a subset, complementary signal.
- Cardinality Aggregation: estimate how many distinct values exist before running rare_terms.
- Top Hits Aggregation: fetch example documents for each rare term.
- Elasticsearch Query Language: the DSL rare_terms runs inside.