uniqExact gives exact unique counts but stores every distinct value, so its aggregate states grow linearly with cardinality. On a large AggregatingMergeTree this becomes the dominant storage cost. uniqCombined and uniq are approximate, but their state size is bounded by the algorithm rather than the cardinality, so the savings can be dramatic. This article shows how to convert existing uniqExactState columns into uniqCombinedState or uniqState columns without going back to raw data.
The trick is that uniqExactState's on-disk format is straightforward enough to reinterpret with arrayReduce.
How uniqExactState is Stored
Numeric values
For numeric input types, uniqExactState stores a count followed by the raw values. The count is LEB128-encoded, the values are concatenated without a delimiter:
SELECT hex(uniqExactState(arrayJoin([1, 3])));
-- 020103
-- 02 = count, LEB128
-- 01 = first value
-- 03 = second value
This format is identical to the serialization of AggregateFunction(groupArray, T) for the same inner type. That coincidence is what makes the conversion possible: the bytes a uniqExactState produces for (1, 3) are interpretable as a groupArray over (1, 3).
String values
For strings, ClickHouse hashes each value with sipHash128 and stores the resulting 16-byte hash. The format becomes a count followed by FixedString(16) values:
SELECT hex(uniqExactState(toString(arrayJoin([1, 2]))));
-- 024809CB4528E00621CF626BE9FA14E2BF...
-- 02 = count
-- followed by two 128-bit hashes
So uniqExactState(String) is byte-identical to groupArrayState(FixedString(16)) over the SipHash128 of those strings. This is the lever for the conversion.
Conversion Approach
To migrate the column:
- Read the bytes of
uniqExactStateasAggregateFunction(groupArray, UInt128). - Finalize the groupArray to get an
Array(UInt128)of the underlying hashes. - Reinterpret each element as
FixedString(16). - Pass the resulting array into
arrayReduce('uniqCombinedState', ...)orarrayReduce('uniqState', ...)to produce the new state.
arrayReduce accepts an aggregate function name and an array, executing the aggregation as if those array elements had been fed in row by row.
Migration SQL
Add the new columns:
ALTER TABLE aggregates
ADD COLUMN uniq AggregateFunction(uniq, FixedString(16)),
ADD COLUMN uniqCombined AggregateFunction(uniqCombined, FixedString(16));
Populate them from the existing uniqExact column:
ALTER TABLE aggregates
UPDATE
uniqCombined = arrayReduce('uniqCombinedState',
arrayMap(x -> reinterpretAsFixedString(x),
finalizeAggregation(
unhex(hex(uniqExact))::AggregateFunction(groupArray, UInt128)
)
)
),
uniq = arrayReduce('uniqState',
arrayMap(x -> reinterpretAsFixedString(x),
finalizeAggregation(
unhex(hex(uniqExact))::AggregateFunction(groupArray, UInt128)
)
)
)
WHERE 1
SETTINGS mutations_sync = 2;
Walking through the expression:
unhex(hex(uniqExact))rebinds the bytes fromAggregateFunction(uniqExact, String)to a raw byte sequence.::AggregateFunction(groupArray, UInt128)reinterprets those bytes under the groupArray signature.finalizeAggregation(...)returnsArray(UInt128).arrayMap(x -> reinterpretAsFixedString(x), ...)converts each UInt128 hash into theFixedString(16)form thatuniqStateanduniqCombinedStateexpect.arrayReduce('uniqCombinedState', ...)runs the target aggregator over those values.
mutations_sync = 2 waits for all replicas. Use 1 if you only need to confirm on the current replica.
After the mutation, materialized views that feed this table should switch from:
uniqExactState(my_string_col)
to:
uniqCombinedState(sipHash128(my_string_col))
(or the equivalent for uniqState). The MV now produces states in the new format directly, and old/new states can be merged because both are over FixedString(16).
Storage Results
Typical compressed sizes for the same set of strings, observed in the original example:
| Column type | Compressed size |
|---|---|
uniqExact |
153.21 MiB |
uniqCombined |
76.62 MiB |
uniq |
38.33 MiB |
Roughly half the size for uniqCombined, quarter for uniq. The savings come from the bounded state size of the HLL-family functions: regardless of how many unique strings are aggregated, the state size has a ceiling.
The tradeoff is accuracy. uniqCombined has well-known error bounds (around 0.5 to 1% relative error for typical settings). uniq is HyperLogLog with similar bounds. If your dashboard tolerates a few percent error, the storage win is large.
Numeric Columns
For numeric input types the conversion is similar, but the intermediate groupArray uses the original numeric type rather than UInt128:
-- uniqExactState(UInt64) bytes reinterpret as groupArrayState(UInt64)
finalizeAggregation(unhex(hex(uniqExact_uint64))::AggregateFunction(groupArray, UInt64))
And uniq and uniqCombined accept the numeric type directly, so the reinterpretAsFixedString step is unnecessary:
arrayReduce('uniqCombinedState',
finalizeAggregation(unhex(hex(uniqExact))::AggregateFunction(groupArray, UInt64))
)
Common Pitfalls
- Wrong inner type in the cast. If your
uniqExactStatewas overString, the inner type isUInt128(the SipHash128 output). For numeric inputs, it is the original numeric type. Get this wrong and the bytes are misinterpreted with garbage results. - Forgetting to update materialized views. The schema is migrated but the MV keeps writing
uniqExactState. Update the MV definition before or alongside the column change, then optionallyDETACHandATTACHto materialize the new schema. - Running the UPDATE without
mutations_sync. Mutations are asynchronous. Watchsystem.mutationsto confirm completion before dropping the old column. - Dropping the old column too early. Keep both columns until you have verified the new ones produce expected counts at a few sample group-by levels.
- Assuming exact accuracy after migration.
uniqCombinedanduniqare approximate. If a downstream consumer requires exact counts, the conversion is not appropriate; consider a different strategy like a periodic snapshot.
How Pulse Helps With Aggregate Migrations
Aggregate state migrations are easy to get wrong and hard to roll back. Pulse inspects AggregatingMergeTree tables and surfaces columns whose serialization can be converted to cheaper representations, monitors mutation progress on large updates, and validates that materialized view definitions match the column types they write into. Connect your ClickHouse cluster to Pulse and migrate aggregate states with a safety net.
Frequently Asked Questions
Q: Why does this conversion work without reading raw data?
Because uniqExactState and groupArrayState share the same on-disk format for their inner type. Reinterpreting the bytes lets you pull the original values (or hashes, for strings) back out, then feed them into a different aggregate function.
Q: Do I lose accuracy with uniqCombined?
Yes, but bounded. uniqCombined has an error of roughly 0.5 to 1% for typical cardinalities. uniq is similar. If you need exact counts on a specific dashboard, keep uniqExact for that one.
Q: Can I convert in the other direction (uniq to uniqExact)?
No. uniq and uniqCombined are lossy: they discard the original values and keep only a probabilistic summary. There is no path back to exact counts without re-aggregating from the raw data.
Q: What about uniqHLL12?
uniqHLL12 is also HLL-family and would accept the same conversion path. Use arrayReduce('uniqHLL12State', ...) with the same expression.
Q: Does this work on Replicated tables?
Yes. Run the ALTER TABLE ... UPDATE with mutations_sync = 2 to wait for all replicas. The mutation propagates through the replication queue like any other.
Q: Will the new states merge correctly with existing data?
Yes, as long as the materialized view is updated to emit states in the new format. After the migration, all states in the new column are uniqCombinedState(FixedString(16)) and merge cleanly.