ClickHouse Bitmap Functions: groupBitmap, bitmapAnd, bitmapOr for Cardinality and Retention

ClickHouse exposes a family of bitmap functions built on Roaring bitmaps — a compressed bitset format that stores large sets of unsigned integers compactly and supports very fast set operations. They are the right tool when you need exact distinct counts plus intersections, unions, and differences over sets of IDs, which is exactly the shape of retention and cohort-overlap analysis.

The entry point is the groupBitmap aggregate function and its -State variant, which produce a bitmap you can intersect (bitmapAnd), union (bitmapOr), and measure (bitmapCardinality). This guide covers how the implementation works, the core function set, and a complete retention example.

When Bitmap Functions Help

A normal distinct count uses uniqExact or count(DISTINCT ...). That answers "how many unique users," but it discards the set itself — so answering "how many of yesterday's users came back today" forces an expensive JOIN or IN subquery between two large result sets.

Bitmap functions keep the whole set as a single, compressed value. Once you have one bitmap per day (or per cohort, segment, or feature flag), questions like "users in both A and B," "users in A but not B," or "users in any of A, B, C" become single, cheap bitmap operations instead of repeated scans and joins. For large ID populations this is frequently an order of magnitude faster than set-based subqueries.

Bitmaps require IDs to be unsigned integers (UInt8/16/32/64). If your identifiers are strings or UUIDs, map them to a dense integer space first — see Common Issues below.

Roaring Bitmaps Under the Hood

ClickHouse does not use a single fixed representation. Internally a bitmap starts as a small Set object and switches to a Roaring bitmap once cardinality exceeds 32 elements. This hybrid keeps tiny sets cheap (no Roaring overhead for a handful of IDs) while large sets get Roaring's compression and fast boolean operations.

Roaring bitmaps partition the 32-bit (or chunked 64-bit) key space into containers and pick the most efficient container type per chunk — array, bitset, or run-length — so both sparse and dense ID ranges stay compact. The practical consequence: storage scales with the structure of your IDs, not naively with the maximum ID value, and AND/OR/XOR run in time proportional to the data rather than the universe size.

Core Bitmap Functions

Bitmap objects are produced by groupBitmapState(...) and have type AggregateFunction(groupBitmap, UInt*). The functions below operate on those objects.

Function Purpose
groupBitmap(expr) Aggregate: cardinality (distinct count) of UInt* values, built via a bitmap
groupBitmapState(expr) Aggregate: returns the bitmap object itself (for storage or further ops)
bitmapBuild(array) Build a bitmap from an array of unsigned integers
bitmapToArray(bitmap) Convert a bitmap back to a sorted array of integers
bitmapCardinality(bitmap) Number of set bits (distinct elements)
bitmapAnd(a, b) Intersection — elements in both a and b
bitmapOr(a, b) Union — elements in either a or b
bitmapXor(a, b) Symmetric difference — elements in exactly one
bitmapAndnot(a, b) Difference — elements in a but not b
bitmapAndCardinality(a, b) Cardinality of the intersection, without materializing it
bitmapOrCardinality(a, b) Cardinality of the union
bitmapXorCardinality(a, b) Cardinality of the symmetric difference
bitmapAndnotCardinality(a, b) Cardinality of the difference
bitmapContains(bitmap, x) Whether a single value is present
bitmapHasAny(a, b) Whether the two bitmaps overlap at all
bitmapHasAll(a, b) Whether a contains every element of b
bitmapSubsetInRange(bitmap, start, end) Elements in [start, end)
bitmapMin(bitmap) / bitmapMax(bitmap) Smallest / largest element

The *Cardinality variants are the ones you want for retention: they return just the count of an intersection or union without allocating the intermediate bitmap.

Aggregate operations over many bitmaps

When you have a column of bitmap states (for example one row per day, each holding that day's active users), you combine them with the group-level aggregates:

  • groupBitmapAnd(expr) — intersect all bitmaps in the group and return the cardinality.
  • groupBitmapOr(expr) — union all bitmaps in the group and return the cardinality.
  • groupBitmapXor(expr) — symmetric difference across the group.

Each has a -State variant (groupBitmapAndState, etc.) that returns the combined bitmap instead of the count. The argument must be of type AggregateFunction(groupBitmap, UInt*).

Building Bitmaps: groupBitmap and groupBitmapState

A quick distinct count looks like an ordinary aggregate:

CREATE TABLE events (user_id UInt32, event_date Date) ENGINE = Memory;
INSERT INTO events VALUES (1, '2026-01-01'), (1, '2026-01-01'), (2, '2026-01-01'), (3, '2026-01-02');

SELECT groupBitmap(user_id) AS distinct_users FROM events;
-- 3

To keep the set for later set math, use groupBitmapState and store it in an AggregateFunction column. Pre-aggregating one bitmap per day with AggregatingMergeTree means each day's set is computed once and reused across every retention query:

CREATE TABLE daily_active_users
(
    event_date Date,
    users AggregateFunction(groupBitmap, UInt32)
)
ENGINE = AggregatingMergeTree
ORDER BY event_date;

INSERT INTO daily_active_users
SELECT event_date, groupBitmapState(user_id)
FROM events
GROUP BY event_date;

You can also build a bitmap on the fly from an array with bitmapBuild([...]), which is handy for tests and for feeding precomputed ID lists into bitmap operations.

Retention Calculation with bitmapAnd

Retention is "what fraction of a cohort came back later." With one bitmap per day, day-N retention is the intersection of the day-0 cohort and the day-N cohort, divided by the day-0 cohort size.

Materialize each day's bitmap once, then intersect:

WITH
    (SELECT groupBitmapMergeState(users) FROM daily_active_users WHERE event_date = '2026-01-01') AS day0,
    (SELECT groupBitmapMergeState(users) FROM daily_active_users WHERE event_date = '2026-01-08') AS day7
SELECT
    bitmapCardinality(day0)                       AS cohort_size,
    bitmapAndCardinality(day0, day7)              AS retained,
    round(bitmapAndCardinality(day0, day7) / bitmapCardinality(day0), 4) AS day7_retention;

bitmapAndCardinality computes the size of the intersection directly — there is no join between the two day-cohorts and no intermediate result set to materialize.

To compute a whole retention curve in one pass, combine each day's bitmap against a fixed cohort. The pattern below intersects every later day with the day-0 bitmap:

WITH (SELECT groupBitmapMergeState(users)
      FROM daily_active_users
      WHERE event_date = '2026-01-01') AS cohort
SELECT
    event_date,
    dateDiff('day', toDate('2026-01-01'), event_date)        AS day_offset,
    bitmapAndCardinality(cohort, groupBitmapMergeState(users)) AS retained
FROM daily_active_users
WHERE event_date >= '2026-01-01'
GROUP BY event_date
ORDER BY event_date;

For multi-day cohort overlap (for example, users active on every day of a range), groupBitmapAnd intersects the whole group at once:

SELECT groupBitmapAnd(users) AS active_all_days
FROM daily_active_users
WHERE event_date BETWEEN '2026-01-01' AND '2026-01-07';

Swap in groupBitmapOr for "active on any day in the range" — that union is the denominator for weekly or monthly active-user metrics.

Beyond Retention: Segmentation and Funnels

The same primitives answer adjacent questions:

  • Segment overlap: store one bitmap per segment or feature flag, then bitmapAndCardinality(segment_a, segment_b) for the overlap and bitmapAndnotCardinality(segment_a, segment_b) for the exclusive portion.
  • Reach / dedup union: bitmapOrCardinality (or groupBitmapOr) gives the deduplicated size of combined audiences without re-scanning raw events.
  • Membership checks: bitmapContains(segment, user_id) and bitmapHasAny(a, b) for fast "is this user in the set" and "do these sets touch" tests.

For ordered step-by-step funnels where event sequence matters, bitmaps are the wrong tool — reach for windowFunnel instead. Bitmaps answer set membership, not sequence.

Bitmap Functions vs. Approximate Methods

Bitmaps are exact and keep the set; sketch-based methods are approximate and keep only an estimate. Choose based on whether you need set operations and exact answers.

Aspect Bitmap functions (groupBitmap) Approximate (uniq, uniqCombined)
Accuracy Exact Approximate (small relative error)
Set operations Yes — AND/OR/XOR/ANDNOT No (HLL states only merge for union counts)
Input type UInt* only Any type
Memory on high cardinality Grows with the actual ID set Fixed-size sketch
Best for Retention, cohort overlap, segments Pure distinct counts at massive cardinality

If you only need a distinct count and not the set itself, uniq / uniqCombined and the broader approximate methods guide are lighter on memory. For exact counts without set math, see count(DISTINCT). Reach for bitmaps specifically when you need to combine sets after counting them.

Common Issues and Gotchas

  1. IDs must be unsigned integers. Bitmap functions reject strings, UUIDs, and signed integers. Map non-integer keys to a dense UInt32/UInt64 space first — a ClickHouse dictionary is a convenient way to maintain a stable string-to-int mapping, or use a hash if occasional collisions are acceptable for your accuracy needs.

  2. Sparse, very large IDs cost memory. Roaring compresses well, but a set of mostly-empty high integers is still larger than a dense one. Densifying your ID space (assign sequential ints) keeps bitmaps small and operations fast.

  3. groupBitmap returns a count, not a bitmap. To do set math you need groupBitmapState / groupBitmapMergeState. Storing the plain count and trying to intersect it later is a common mistake.

  4. Mixing states and finals. A column of AggregateFunction(groupBitmap, ...) must be merged with groupBitmapMergeState(...) (to get a bitmap) or groupBitmapMerge(...) (to get the count) before use in scalar bitmap functions. Passing a raw state column to bitmapAnd will not type-check.

  5. Type width must match. When combining bitmaps, the underlying UInt* type should be consistent. Mixing UInt32 and UInt64 bitmaps can fail or force conversions.

Best Practices

  1. Pre-aggregate one bitmap per grain. Maintain daily (or hourly) bitmaps in an AggregatingMergeTree or a materialized view so each cohort's set is computed once and every retention query reuses it.

  2. Use the *Cardinality functions when you only need a count. bitmapAndCardinality(a, b) avoids allocating the intersection that bitmapCardinality(bitmapAnd(a, b)) would materialize.

  3. Keep the ID space dense. Sequential UInt32 IDs give the smallest, fastest bitmaps. Resolve strings to ints at ingest, not at query time.

  4. Prefer group aggregates for many-bitmap combines. groupBitmapAnd / groupBitmapOr over a column beat folding bitmaps together one pair at a time.

  5. Match the tool to the question. Bitmaps for set membership and overlap; windowFunnel for ordered funnels; uniqCombined for pure high-cardinality distinct counts.

How Pulse Helps

Bitmap-based retention pipelines tend to fail quietly: a state column built with the wrong type width, an AggregatingMergeTree whose merges fall behind so bitmaps stay fragmented, or memory spikes when a query unions thousands of large bitmaps at once. Pulse monitors ClickHouse clusters for exactly these conditions — stalled merges, growing unmerged parts on aggregating tables, and memory-heavy aggregation queries — and surfaces the offending query and table so you can fix the root cause instead of guessing. It is built and run by ClickHouse and search-infrastructure engineers who have operated these workloads at scale.

Frequently Asked Questions

Q: Why does ClickHouse use Roaring bitmaps instead of plain bitsets?

A plain bitset would allocate space proportional to the largest ID, wasting memory on sparse sets. Roaring partitions the key space into containers and picks the most compact representation (array, bitset, or run) per chunk, so both sparse and dense ID sets stay small while set operations remain fast. ClickHouse also starts with a lightweight Set object and only switches to Roaring once a bitmap exceeds 32 elements.

Q: Do my user IDs have to be integers?

Yes. Bitmap functions accept only UInt8/16/32/64. For string or UUID identifiers, maintain a string-to-integer mapping (a dictionary or hash) and store the integer ID in the bitmap.

Q: What is the difference between groupBitmap and groupBitmapState?

groupBitmap returns the distinct count (a UInt64). groupBitmapState returns the bitmap object itself, which you can store in an AggregateFunction column and later intersect, union, or measure. Use the -State form whenever you need set operations later.

Q: How is bitmap retention faster than a JOIN-based approach?

A JOIN or IN-subquery retention query re-scans and re-matches two large user sets per question. With pre-aggregated bitmaps, each day's set already exists as one compressed value, and bitmapAndCardinality computes the overlap directly — no join, no intermediate result set — which is often 10x or more faster on large ID populations.

Q: When should I use approximate methods instead of bitmaps?

If you only need a distinct count (not the set) and cardinality is very high, uniq or uniqCombined use a fixed-size sketch and far less memory. Use bitmaps when you need exact answers and set operations like intersection and union. See the approximate methods guide.

Q: Can I compute multi-day "active every day" cohorts in one query?

Yes. With one bitmap state per day, groupBitmapAnd(users) over a date range returns the count of users present in every day's bitmap; groupBitmapOr(users) returns the count present on any day. Both run as single aggregates over the bitmap column.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.