ClickHouse exposes a family of bitmap functions built on Roaring bitmaps — a compressed bitset format that stores large sets of unsigned integers compactly and supports very fast set operations. They are the right tool when you need exact distinct counts plus intersections, unions, and differences over sets of IDs, which is exactly the shape of retention and cohort-overlap analysis.
The entry point is the groupBitmap aggregate function and its -State variant, which produce a bitmap you can intersect (bitmapAnd), union (bitmapOr), and measure (bitmapCardinality). This guide covers how the implementation works, the core function set, and a complete retention example.
When Bitmap Functions Help
A normal distinct count uses uniqExact or count(DISTINCT ...). That answers "how many unique users," but it discards the set itself — so answering "how many of yesterday's users came back today" forces an expensive JOIN or IN subquery between two large result sets.
Bitmap functions keep the whole set as a single, compressed value. Once you have one bitmap per day (or per cohort, segment, or feature flag), questions like "users in both A and B," "users in A but not B," or "users in any of A, B, C" become single, cheap bitmap operations instead of repeated scans and joins. For large ID populations this is frequently an order of magnitude faster than set-based subqueries.
Bitmaps require IDs to be unsigned integers (UInt8/16/32/64). If your identifiers are strings or UUIDs, map them to a dense integer space first — see Common Issues below.
Roaring Bitmaps Under the Hood
ClickHouse does not use a single fixed representation. Internally a bitmap starts as a small Set object and switches to a Roaring bitmap once cardinality exceeds 32 elements. This hybrid keeps tiny sets cheap (no Roaring overhead for a handful of IDs) while large sets get Roaring's compression and fast boolean operations.
Roaring bitmaps partition the 32-bit (or chunked 64-bit) key space into containers and pick the most efficient container type per chunk — array, bitset, or run-length — so both sparse and dense ID ranges stay compact. The practical consequence: storage scales with the structure of your IDs, not naively with the maximum ID value, and AND/OR/XOR run in time proportional to the data rather than the universe size.
Core Bitmap Functions
Bitmap objects are produced by groupBitmapState(...) and have type AggregateFunction(groupBitmap, UInt*). The functions below operate on those objects.
| Function | Purpose |
|---|---|
groupBitmap(expr) |
Aggregate: cardinality (distinct count) of UInt* values, built via a bitmap |
groupBitmapState(expr) |
Aggregate: returns the bitmap object itself (for storage or further ops) |
bitmapBuild(array) |
Build a bitmap from an array of unsigned integers |
bitmapToArray(bitmap) |
Convert a bitmap back to a sorted array of integers |
bitmapCardinality(bitmap) |
Number of set bits (distinct elements) |
bitmapAnd(a, b) |
Intersection — elements in both a and b |
bitmapOr(a, b) |
Union — elements in either a or b |
bitmapXor(a, b) |
Symmetric difference — elements in exactly one |
bitmapAndnot(a, b) |
Difference — elements in a but not b |
bitmapAndCardinality(a, b) |
Cardinality of the intersection, without materializing it |
bitmapOrCardinality(a, b) |
Cardinality of the union |
bitmapXorCardinality(a, b) |
Cardinality of the symmetric difference |
bitmapAndnotCardinality(a, b) |
Cardinality of the difference |
bitmapContains(bitmap, x) |
Whether a single value is present |
bitmapHasAny(a, b) |
Whether the two bitmaps overlap at all |
bitmapHasAll(a, b) |
Whether a contains every element of b |
bitmapSubsetInRange(bitmap, start, end) |
Elements in [start, end) |
bitmapMin(bitmap) / bitmapMax(bitmap) |
Smallest / largest element |
The *Cardinality variants are the ones you want for retention: they return just the count of an intersection or union without allocating the intermediate bitmap.
Aggregate operations over many bitmaps
When you have a column of bitmap states (for example one row per day, each holding that day's active users), you combine them with the group-level aggregates:
groupBitmapAnd(expr)— intersect all bitmaps in the group and return the cardinality.groupBitmapOr(expr)— union all bitmaps in the group and return the cardinality.groupBitmapXor(expr)— symmetric difference across the group.
Each has a -State variant (groupBitmapAndState, etc.) that returns the combined bitmap instead of the count. The argument must be of type AggregateFunction(groupBitmap, UInt*).
Building Bitmaps: groupBitmap and groupBitmapState
A quick distinct count looks like an ordinary aggregate:
CREATE TABLE events (user_id UInt32, event_date Date) ENGINE = Memory;
INSERT INTO events VALUES (1, '2026-01-01'), (1, '2026-01-01'), (2, '2026-01-01'), (3, '2026-01-02');
SELECT groupBitmap(user_id) AS distinct_users FROM events;
-- 3
To keep the set for later set math, use groupBitmapState and store it in an AggregateFunction column. Pre-aggregating one bitmap per day with AggregatingMergeTree means each day's set is computed once and reused across every retention query:
CREATE TABLE daily_active_users
(
event_date Date,
users AggregateFunction(groupBitmap, UInt32)
)
ENGINE = AggregatingMergeTree
ORDER BY event_date;
INSERT INTO daily_active_users
SELECT event_date, groupBitmapState(user_id)
FROM events
GROUP BY event_date;
You can also build a bitmap on the fly from an array with bitmapBuild([...]), which is handy for tests and for feeding precomputed ID lists into bitmap operations.
Retention Calculation with bitmapAnd
Retention is "what fraction of a cohort came back later." With one bitmap per day, day-N retention is the intersection of the day-0 cohort and the day-N cohort, divided by the day-0 cohort size.
Materialize each day's bitmap once, then intersect:
WITH
(SELECT groupBitmapMergeState(users) FROM daily_active_users WHERE event_date = '2026-01-01') AS day0,
(SELECT groupBitmapMergeState(users) FROM daily_active_users WHERE event_date = '2026-01-08') AS day7
SELECT
bitmapCardinality(day0) AS cohort_size,
bitmapAndCardinality(day0, day7) AS retained,
round(bitmapAndCardinality(day0, day7) / bitmapCardinality(day0), 4) AS day7_retention;
bitmapAndCardinality computes the size of the intersection directly — there is no join between the two day-cohorts and no intermediate result set to materialize.
To compute a whole retention curve in one pass, combine each day's bitmap against a fixed cohort. The pattern below intersects every later day with the day-0 bitmap:
WITH (SELECT groupBitmapMergeState(users)
FROM daily_active_users
WHERE event_date = '2026-01-01') AS cohort
SELECT
event_date,
dateDiff('day', toDate('2026-01-01'), event_date) AS day_offset,
bitmapAndCardinality(cohort, groupBitmapMergeState(users)) AS retained
FROM daily_active_users
WHERE event_date >= '2026-01-01'
GROUP BY event_date
ORDER BY event_date;
For multi-day cohort overlap (for example, users active on every day of a range), groupBitmapAnd intersects the whole group at once:
SELECT groupBitmapAnd(users) AS active_all_days
FROM daily_active_users
WHERE event_date BETWEEN '2026-01-01' AND '2026-01-07';
Swap in groupBitmapOr for "active on any day in the range" — that union is the denominator for weekly or monthly active-user metrics.
Beyond Retention: Segmentation and Funnels
The same primitives answer adjacent questions:
- Segment overlap: store one bitmap per segment or feature flag, then
bitmapAndCardinality(segment_a, segment_b)for the overlap andbitmapAndnotCardinality(segment_a, segment_b)for the exclusive portion. - Reach / dedup union:
bitmapOrCardinality(orgroupBitmapOr) gives the deduplicated size of combined audiences without re-scanning raw events. - Membership checks:
bitmapContains(segment, user_id)andbitmapHasAny(a, b)for fast "is this user in the set" and "do these sets touch" tests.
For ordered step-by-step funnels where event sequence matters, bitmaps are the wrong tool — reach for windowFunnel instead. Bitmaps answer set membership, not sequence.
Bitmap Functions vs. Approximate Methods
Bitmaps are exact and keep the set; sketch-based methods are approximate and keep only an estimate. Choose based on whether you need set operations and exact answers.
| Aspect | Bitmap functions (groupBitmap) |
Approximate (uniq, uniqCombined) |
|---|---|---|
| Accuracy | Exact | Approximate (small relative error) |
| Set operations | Yes — AND/OR/XOR/ANDNOT | No (HLL states only merge for union counts) |
| Input type | UInt* only |
Any type |
| Memory on high cardinality | Grows with the actual ID set | Fixed-size sketch |
| Best for | Retention, cohort overlap, segments | Pure distinct counts at massive cardinality |
If you only need a distinct count and not the set itself, uniq / uniqCombined and the broader approximate methods guide are lighter on memory. For exact counts without set math, see count(DISTINCT). Reach for bitmaps specifically when you need to combine sets after counting them.
Common Issues and Gotchas
IDs must be unsigned integers. Bitmap functions reject strings, UUIDs, and signed integers. Map non-integer keys to a dense
UInt32/UInt64space first — a ClickHouse dictionary is a convenient way to maintain a stable string-to-int mapping, or use a hash if occasional collisions are acceptable for your accuracy needs.Sparse, very large IDs cost memory. Roaring compresses well, but a set of mostly-empty high integers is still larger than a dense one. Densifying your ID space (assign sequential ints) keeps bitmaps small and operations fast.
groupBitmapreturns a count, not a bitmap. To do set math you needgroupBitmapState/groupBitmapMergeState. Storing the plain count and trying to intersect it later is a common mistake.Mixing states and finals. A column of
AggregateFunction(groupBitmap, ...)must be merged withgroupBitmapMergeState(...)(to get a bitmap) orgroupBitmapMerge(...)(to get the count) before use in scalar bitmap functions. Passing a raw state column tobitmapAndwill not type-check.Type width must match. When combining bitmaps, the underlying
UInt*type should be consistent. MixingUInt32andUInt64bitmaps can fail or force conversions.
Best Practices
Pre-aggregate one bitmap per grain. Maintain daily (or hourly) bitmaps in an AggregatingMergeTree or a materialized view so each cohort's set is computed once and every retention query reuses it.
Use the
*Cardinalityfunctions when you only need a count.bitmapAndCardinality(a, b)avoids allocating the intersection thatbitmapCardinality(bitmapAnd(a, b))would materialize.Keep the ID space dense. Sequential
UInt32IDs give the smallest, fastest bitmaps. Resolve strings to ints at ingest, not at query time.Prefer group aggregates for many-bitmap combines.
groupBitmapAnd/groupBitmapOrover a column beat folding bitmaps together one pair at a time.Match the tool to the question. Bitmaps for set membership and overlap;
windowFunnelfor ordered funnels;uniqCombinedfor pure high-cardinality distinct counts.
How Pulse Helps
Bitmap-based retention pipelines tend to fail quietly: a state column built with the wrong type width, an AggregatingMergeTree whose merges fall behind so bitmaps stay fragmented, or memory spikes when a query unions thousands of large bitmaps at once. Pulse monitors ClickHouse clusters for exactly these conditions — stalled merges, growing unmerged parts on aggregating tables, and memory-heavy aggregation queries — and surfaces the offending query and table so you can fix the root cause instead of guessing. It is built and run by ClickHouse and search-infrastructure engineers who have operated these workloads at scale.
Frequently Asked Questions
Q: Why does ClickHouse use Roaring bitmaps instead of plain bitsets?
A plain bitset would allocate space proportional to the largest ID, wasting memory on sparse sets. Roaring partitions the key space into containers and picks the most compact representation (array, bitset, or run) per chunk, so both sparse and dense ID sets stay small while set operations remain fast. ClickHouse also starts with a lightweight Set object and only switches to Roaring once a bitmap exceeds 32 elements.
Q: Do my user IDs have to be integers?
Yes. Bitmap functions accept only UInt8/16/32/64. For string or UUID identifiers, maintain a string-to-integer mapping (a dictionary or hash) and store the integer ID in the bitmap.
Q: What is the difference between groupBitmap and groupBitmapState?
groupBitmap returns the distinct count (a UInt64). groupBitmapState returns the bitmap object itself, which you can store in an AggregateFunction column and later intersect, union, or measure. Use the -State form whenever you need set operations later.
Q: How is bitmap retention faster than a JOIN-based approach?
A JOIN or IN-subquery retention query re-scans and re-matches two large user sets per question. With pre-aggregated bitmaps, each day's set already exists as one compressed value, and bitmapAndCardinality computes the overlap directly — no join, no intermediate result set — which is often 10x or more faster on large ID populations.
Q: When should I use approximate methods instead of bitmaps?
If you only need a distinct count (not the set) and cardinality is very high, uniq or uniqCombined use a fixed-size sketch and far less memory. Use bitmaps when you need exact answers and set operations like intersection and union. See the approximate methods guide.
Q: Can I compute multi-day "active every day" cohorts in one query?
Yes. With one bitmap state per day, groupBitmapAnd(users) over a date range returns the count of users present in every day's bitmap; groupBitmapOr(users) returns the count present on any day. Both run as single aggregates over the bitmap column.