After upgrading ClickHouse from a version before 21.6 to a newer build, AggregateFunction(uniq, UUID) columns in AggregatingMergeTree tables can return unique counts that look almost doubled. The cause is a change in the internal hash function used for UUID values inside uniq: PR #23631 replaced SipHash64 with intHash64 for big-integer types, UUID included. States created before the upgrade and states created after the upgrade hash the same UUID to different values, so merging them across the version boundary inflates the cardinality estimate.
This article walks through the symptom, the root cause, and three migration paths to bring the column back to a single consistent hash space.
Symptom
You have a table similar to this:
CREATE TABLE uniq_state
(
key UInt64,
value AggregateFunction(uniq, UUID)
)
ENGINE = AggregatingMergeTree
ORDER BY key;
Before the upgrade, SELECT key, uniqMerge(value) FROM uniq_state GROUP BY key returned the expected unique counts. After the upgrade, the same query returns nearly 2x the previous result for any key that has states written both before and after the upgrade. Newly written keys are fine. Old-only keys are fine. The doubling comes from merging mixed-version states.
Root Cause
uniq internally hashes each input value to UInt64, then uses that hash inside its HyperLogLog-style estimator. For UUID, that hash function changed in 21.6:
- Pre-21.6:
SipHash64(uuid) - 21.6 and later:
intHash64(uuid)
The same UUID produces a completely different 64-bit hash in the two versions. From uniq's perspective the pre-upgrade and post-upgrade states represent different sets of items, so uniqMerge adds the cardinalities together instead of correctly deduplicating.
The fix is to stop relying on the implicit UUID hash inside uniqState and feed in a stable, explicit hash yourself. sipHash64(uuid) is the canonical choice: the underlying behavior is identical to the old internal hashing.
The new column type becomes AggregateFunction(uniq, UInt64) because sipHash64 returns UInt64.
Option 1: Add a Parallel Column
Keep the existing UUID column for compatibility while populating a new UInt64-based column going forward.
ALTER TABLE uniq_state_3
ADD COLUMN value_2 AggregateFunction(uniq, UInt64)
DEFAULT unhex(hex(value));
ALTER TABLE uniq_state_3
UPDATE value_2 = value_2 WHERE 1;
The DEFAULT unhex(hex(value)) reinterprets the binary representation of the old state under the new aggregate function type. Then a no-op mutation materializes the default into actual storage.
New writes have to use the hashed form:
INSERT INTO uniq_state_3 (key, value_2)
SELECT
number % 10000 AS key,
uniqState(sipHash64(reinterpretAsUUID(number)))
FROM numbers(1000000)
GROUP BY key;
Update materialized views that feed the table to compute uniqState(sipHash64(uuid_column)) instead of uniqState(uuid_column).
Option 2: Modify the Column In Place
If you want to keep the column name value and not maintain two columns, change the type through String as an intermediate:
ALTER TABLE uniq_state_3
MODIFY COLUMN value String;
ALTER TABLE uniq_state_3
MODIFY COLUMN value AggregateFunction(uniq, UInt64);
ClickHouse rejects a direct AggregateFunction(uniq, UUID) to AggregateFunction(uniq, UInt64) conversion because the inner types differ, but routing through String preserves the bytes on disk while letting you swap the declared aggregate signature.
After the type change, populate new data with the explicit hash:
INSERT INTO uniq_state_3 (key, value)
SELECT
number % 10000 AS key,
uniqState(sipHash64(reinterpretAsUUID(number)))
FROM numbers(1000000)
GROUP BY key;
Option 3: Cast at Insert Time
If you cannot afford the schema change or want to keep the column declared as AggregateFunction(uniq, UUID), cast the state back to String at insert time:
INSERT INTO uniq_state_4 (key, value)
SELECT
number % 10000 AS key,
CAST(uniqState(sipHash64(reinterpretAsUUID(number))), 'String')
FROM numbers(1000000)
GROUP BY key;
The trick: ClickHouse stores aggregate states as opaque byte sequences. Casting to String puts the new-hash bytes into the same physical column, and uniqMerge produces correct results as long as every state in the merge was produced the same way.
This is the most fragile option because nothing in the schema documents the change. Use it only as a short-term measure.
Backfilling Old States
Old states cannot be retroactively re-hashed: the original UUIDs are gone, only the hashed cardinality estimator survives. Two practical paths:
- Replay raw data: if you still have the source UUIDs (in a raw events table, in S3, in Kafka logs), drop the affected aggregate rows and re-aggregate with
uniqState(sipHash64(uuid)). - Live with mixed history: rotate the aggregate table by partition and treat the pre-upgrade partitions as approximate. Read-side dashboards that group by month after the cutover stay accurate.
Common Pitfalls
- Updating only the table, not the materialized view. The view continues to write
uniqState(uuid)and the problem perpetuates. UpdateSELECTclauses in every MV that feeds the table. - Forgetting
reinterpretAsUUID. When testing withnumbers(), you have to convertUInt64toUUIDfirst, otherwise you're hashing the wrong type and the demo does not match real data. - Trying to convert
AggregateFunction(uniq, UUID)directly toAggregateFunction(uniq, UInt64). The directMODIFY COLUMNis rejected. Use the String intermediate step. - Assuming
uniqExactis affected. This issue is specific touniq(HLL-based).uniqExactstores hashed values too, but its serialization format and merge semantics are different and not subject to the UUID hash change in PR #23631.
How Pulse Helps With Cross-Version Compatibility Issues
Hash-function changes that quietly corrupt aggregate counts are exactly the kind of post-upgrade regression that is invisible in error logs. Pulse tracks every ClickHouse upgrade against the version notes and surfaces tables that store AggregateFunction states which changed serialization or hashing between the from- and to-version. Connect your cluster to Pulse and have a checklist for every upgrade.
Frequently Asked Questions
Q: Which versions are affected?
Tables created on ClickHouse before 21.6 with AggregateFunction(uniq, UUID) columns, then upgraded to 21.6 or later. The hash change is in PR #23631.
Q: Are other big-integer types affected?
The PR changed the internal hash for big integers in general. Other types under uniq (Int128, UInt128, Int256, UInt256, UUID) saw the same kind of change. UUID is the most commonly reported because it is the most common big-int aggregation key in user-facing tables.
Q: Why does sipHash64 work as the migration path?
Because the old internal hash was effectively SipHash64. Explicitly calling sipHash64(uuid) reproduces the pre-21.6 behavior, but now it is part of your query rather than an internal detail that can change between versions.
Q: Will uniqMerge give the exact pre-upgrade number after migrating?
For new data, yes. For data merged across the cutover, no, because the old states are still hashed with the old internal function and cannot be re-derived without raw inputs. The merged result remains inflated for affected rows; new aggregations from raw events on the migrated schema are accurate.
Q: Does this affect uniqHLL12 or uniqCombined?
uniqCombined and uniqHLL12 have their own serialization formats. The PR specifically targets uniq on big-int types. Check the changelog for the exact version you upgraded through if you use those variants.