Elasticsearch Cardinality Aggregation - Distinct Count with HyperLogLog++ - Syntax, Example, and Tips

The Elasticsearch cardinality aggregation is a single-value metric aggregation that returns the approximate count of distinct values in a field. It is built on the HyperLogLog++ (HLL++) algorithm, which trades a small error for fixed memory usage that does not grow with input size. Use it for "how many unique users / IPs / sessions" questions where exact counts are unaffordable and an error on the order of a percent or two is acceptable.

Syntax

GET /events/_search
{
  "size": 0,
  "aggs": {
    "unique_users": {
      "cardinality": {
        "field": "user_id",
        "precision_threshold": 3000
      }
    }
  }
}

The result is a single number under aggregations.unique_users.value.
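
If only that number is needed, the response can be trimmed with the standard filter_path query parameter. The following is a minimal sketch that reuses the events index and the unique_users aggregation name from the request above:

```console
GET /events/_search?filter_path=aggregations.unique_users.value
{
  "size": 0,
  "aggs": {
    "unique_users": {
      "cardinality": { "field": "user_id" }
    }
  }
}
```

The response should then contain only aggregations.unique_users.value, without the took, _shards, and hits sections.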

Parameters

Parameter            Default               Description
field                required (or script)  Field to count distinct values from.
precision_threshold  3000                  Counts below this are expected to be near-exact; above it, accuracy degrades. The maximum effective value is 40000; higher values behave like 40000.
missing              -                     Value to substitute for documents that lack the field. The substitute is counted as a distinct value.
script               -                     Use a script instead of a field. Slower; disables global ordinals.
execution_hint       chosen automatically  direct, global_ordinals, segment_ordinals, save_time_heuristic, or save_memory_heuristic.
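
As an illustration of the missing parameter, the sketch below counts all documents without a user_id under one placeholder value. The placeholder string "N/A" is an assumption chosen for this example; use any value compatible with the field's type.

```console
GET /events/_search
{
  "size": 0,
  "aggs": {
    "unique_users": {
      "cardinality": {
        "field": "user_id",
        "missing": "N/A"
      }
    }
  }
}
```

If any document lacks user_id, the result includes one extra distinct value for the placeholder.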

Memory per cardinality aggregation is roughly precision_threshold * 8 bytes per shard. At the maximum (40000) that is about 320 KB per shard per cardinality aggregation - small on its own, but it adds up when multiplied across nested buckets.

Examples

Unique users in the last 24 hours:

GET /events/_search
{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-24h" } } },
  "aggs": {
    "uniques": { "cardinality": { "field": "user_id" } }
  }
}

Unique users per country (nested under terms):

GET /events/_search
{
  "size": 0,
  "aggs": {
    "by_country": {
      "terms": { "field": "country.keyword", "size": 20 },
      "aggs": {
        "uniques": { "cardinality": { "field": "user_id", "precision_threshold": 10000 } }
      }
    }
  }
}

Tighter precision for analytics dashboards:

"cardinality": { "field": "session_id", "precision_threshold": 40000 }

Precision, Memory, and Performance Notes

HyperLogLog++ is expected to return exact or near-exact counts below precision_threshold, though this is an expectation rather than a hard guarantee. Above the threshold, counts become approximate, and the error depends on the sketch size that the threshold selects and on the data. With the default precision_threshold of 3000, expected relative error is around 1-2% at 100,000 distinct values. The maximum precision_threshold is 40000, which gives roughly 0.4% error at one million distinct values - and, by the precision_threshold * 8 bytes estimate, costs about 13x the memory of the default (320 KB versus 24 KB).

The aggregation hashes each value (Murmur3) and tracks the leading-zero distribution. Pre-hashing your data at index time with the mapper-murmur3 plugin lets the aggregation skip the hash step at query time - useful for very hot dashboards on keyword fields. Numeric fields are hashed automatically with no plugin needed.
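
A minimal sketch of the pre-hashing setup follows. It assumes the mapper-murmur3 plugin is installed on every node, and it uses a hypothetical new index named events-v2 so that the hash sub-field exists before any documents are indexed.

```console
# Hypothetical index; requires the mapper-murmur3 plugin on every node
PUT /events-v2
{
  "mappings": {
    "properties": {
      "user_id": {
        "type": "keyword",
        "fields": {
          "hash": { "type": "murmur3" }
        }
      }
    }
  }
}

# Aggregate on the pre-computed hash sub-field instead of the raw keyword
GET /events-v2/_search
{
  "size": 0,
  "aggs": {
    "unique_users": {
      "cardinality": { "field": "user_id.hash" }
    }
  }
}
```

The trade-off is extra index size and indexing work in exchange for less hashing at query time, so it is worth measuring on real traffic before adopting.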

Cardinality aggregations under high-cardinality terms buckets are a frequent operational hotspot - each bucket holds its own HLL sketch. A terms aggregation returning 10000 buckets, each with a cardinality sub-agg at precision 40000, can by the precision_threshold * 8 bytes estimate need up to about 3.2 GB of sketch memory per shard per request (10000 * 320 KB) in the worst case. Walking circuit-breaker stats by hand and tracing each trip back to the responsible query is exactly the loop Pulse runs continuously.
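
For reference, the manual check mentioned above can start from the node stats API. This sketch uses filter_path to keep only the request and fielddata breakers; which breakers matter for a given cluster is an assumption to verify.

```console
# Request and fielddata breaker stats for every node
GET /_nodes/stats/breaker?filter_path=nodes.*.name,nodes.*.breakers.request,nodes.*.breakers.fielddata
```

Each breaker entry reports its limit, its current estimated size, and a tripped counter. A tripped value that rises between two calls suggests requests were rejected in that interval.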

Common Mistakes

  1. Treating precision_threshold as a hard accuracy guarantee above the threshold. It is the boundary below which counts are near-exact.
  2. Setting precision_threshold to 40000 reflexively when the dataset has only a few thousand distinct values - it wastes memory with no accuracy gain.
  3. Using missing and forgetting that the substitute counts as a distinct value, which inflates uniques by one unless the substitute matches a value already present.
  4. Running cardinality on a text field - it triggers fielddata loading. Use a .keyword sub-field.
  5. Expecting per-bucket cardinality results to sum to the overall total - Elasticsearch merges HLL sketches correctly across shards, but summing the per-bucket integers client-side double-counts any value that appears in more than one bucket.
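
To avoid the last mistake, the overall distinct count can be requested from Elasticsearch alongside the per-bucket counts. The sketch below reuses the field names from the earlier examples; the aggregation name uniques_overall is illustrative.

```console
GET /events/_search
{
  "size": 0,
  "aggs": {
    "by_country": {
      "terms": { "field": "country.keyword", "size": 20 },
      "aggs": {
        "uniques": { "cardinality": { "field": "user_id" } }
      }
    },
    "uniques_overall": { "cardinality": { "field": "user_id" } }
  }
}
```

A user seen in two countries appears in two buckets but contributes once to uniques_overall, so the bucket values can add up to more than the overall figure.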

Optimize Cardinality Aggregations for Memory with Pulse

Pulse is an AI DBA for Elasticsearch and OpenSearch that continuously profiles production query traffic. For cardinality aggregations specifically, Pulse:

  • Identifies cardinality sub-aggregations nested inside high-cardinality terms aggregations, where each bucket holds its own HLL++ sketch at precision_threshold * 8 bytes per shard - the pattern that drives heap pressure during dashboard refreshes
  • Flags precision_threshold: 40000 set reflexively on fields with only a few thousand distinct values, where about 13x the default memory is being spent for no accuracy gain
  • Spots cardinality aggregations running on text fields without a .keyword sub-field, triggering fielddata loading
  • Traces each slow or memory-heavy cardinality query back to the calling service via slow-log and APM correlation
  • Recommends concrete fixes: right-size precision_threshold against actual cardinality, add a Murmur3 hash sub-field via the mapper-murmur3 plugin to skip per-query hashing on hot dashboards, move expensive cardinality sub-aggs out of high-cardinality parents, and wrap cardinality on nested fields in a nested aggregation
  • Tracks heap, circuit breaker, and latency impact after the change ships

This converts the manual sketch-sizing and circuit-breaker loop into a continuous optimization workflow.

Try Pulse on your cluster.

Frequently Asked Questions

Q: How accurate is the cardinality aggregation?
A: The cardinality aggregation is near-exact for distinct counts below precision_threshold (default 3000). Above that, counts are approximate: error is typically 1-2% at 100k uniques with default settings, dropping to about 0.4% at the maximum threshold of 40000.

Q: What is the maximum precision_threshold?
A: 40000. Higher values are accepted by the API but capped at 40000 internally. Memory cost grows linearly with this setting.

Q: How does cardinality differ from value_count?
A: The cardinality aggregation counts distinct values approximately; the value_count aggregation counts all non-null values including duplicates exactly. They answer different questions.
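
A short sketch that runs both aggregations on the same field makes the difference visible. The aggregation names are illustrative.

```console
GET /events/_search
{
  "size": 0,
  "aggs": {
    "distinct_users": { "cardinality": { "field": "user_id" } },
    "user_id_values": { "value_count": { "field": "user_id" } }
  }
}
```

If each user generates many events, user_id_values will be much larger than distinct_users.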

Q: Can I get an exact distinct count?
A: Not from this aggregation. For exact counts, use a composite aggregation to enumerate every unique value and count the returned buckets client-side across all pages, accepting the much higher cost.
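
A minimal sketch of that paging loop follows. The aggregation name all_users, the source name user, and the page size of 1000 are assumptions for illustration, and the after value is a placeholder to be filled from the previous response.

```console
# First page: up to 1000 distinct user_id values, in sorted order
GET /events/_search
{
  "size": 0,
  "aggs": {
    "all_users": {
      "composite": {
        "size": 1000,
        "sources": [
          { "user": { "terms": { "field": "user_id" } } }
        ]
      }
    }
  }
}

# Next page: pass the previous response's after_key as "after"
GET /events/_search
{
  "size": 0,
  "aggs": {
    "all_users": {
      "composite": {
        "size": 1000,
        "after": { "user": "<after_key.user from the previous page>" },
        "sources": [
          { "user": { "terms": { "field": "user_id" } } }
        ]
      }
    }
  }
}
```

Repeat until a page returns no buckets. The exact distinct count is the total number of buckets seen across all pages.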

Q: Does cardinality work on nested fields?
A: Yes, but wrap it in a nested aggregation so it operates on the inner documents rather than parent values.
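
As a sketch, assume a hypothetical orders index whose items field is mapped as nested and contains a keyword field items.sku:

```console
GET /orders/_search
{
  "size": 0,
  "aggs": {
    "items": {
      "nested": { "path": "items" },
      "aggs": {
        "unique_skus": {
          "cardinality": { "field": "items.sku" }
        }
      }
    }
  }
}
```

The result appears under aggregations.items.unique_skus.value.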

Q: Should I pre-hash my field for cardinality queries?
A: For very high-frequency dashboards on string fields, install the mapper-murmur3 plugin and add a murmur3 sub-field. The aggregation reads the pre-computed hash and skips per-query hashing. Numeric fields are already efficient.

Q: How is null handled?
A: Documents missing the field are not counted unless missing is set, in which case the substitute value contributes one distinct value to the result.

Q: How do I monitor cardinality aggregations for memory pressure in production?
A: Pulse tracks request and fielddata circuit breakers, identifies cardinality sub-aggregations nested under high-cardinality terms aggregations and oversized precision_threshold values, attributes each to the calling service, and recommends right-sizing the threshold or pre-hashing the field with the mapper-murmur3 plugin.
