NEW

Pulse 2025 Product Roundup: From Monitoring to AI-Native Control Plane

ClickHouse AggregateFunction(uniq, UUID) Doubled After Upgrade

After upgrading ClickHouse from a version before 21.6 to a newer build, AggregateFunction(uniq, UUID) columns in AggregatingMergeTree tables can return unique counts that look almost doubled. The cause is a change in the internal hash function used for UUID values inside uniq: PR #23631 replaced SipHash64 with intHash64 for big-integer types, UUID included. States created before the upgrade and states created after the upgrade hash the same UUID to different values, so merging them across the version boundary inflates the cardinality estimate.

This article walks through the symptom, the root cause, and three migration paths to bring the column back to a single consistent hash space.

Symptom

You have a table similar to this:

CREATE TABLE uniq_state
(
    key UInt64,
    value AggregateFunction(uniq, UUID)
)
ENGINE = AggregatingMergeTree
ORDER BY key;

Before the upgrade, SELECT key, uniqMerge(value) FROM uniq_state GROUP BY key returned the expected unique counts. After the upgrade, the same query returns nearly 2x the previous result for any key that has states written both before and after the upgrade. Newly written keys are fine. Old-only keys are fine. The doubling comes from merging mixed-version states.

Root Cause

uniq internally hashes each input value to UInt64, then uses that hash inside its HyperLogLog-style estimator. For UUID, that hash function changed in 21.6:

  • Pre-21.6: SipHash64(uuid)
  • 21.6 and later: intHash64(uuid)

The same UUID produces a completely different 64-bit hash in the two versions. From uniq's perspective the pre-upgrade and post-upgrade states represent different sets of items, so uniqMerge adds the cardinalities together instead of correctly deduplicating.

The fix is to stop relying on the implicit UUID hash inside uniqState and feed in a stable, explicit hash yourself. sipHash64(uuid) is the canonical choice: the underlying behavior is identical to the old internal hashing.

The new column type becomes AggregateFunction(uniq, UInt64) because sipHash64 returns UInt64.

Option 1: Add a Parallel Column

Keep the existing UUID column for compatibility while populating a new UInt64-based column going forward.

ALTER TABLE uniq_state_3
    ADD COLUMN value_2 AggregateFunction(uniq, UInt64)
    DEFAULT unhex(hex(value));

ALTER TABLE uniq_state_3
    UPDATE value_2 = value_2 WHERE 1;

The DEFAULT unhex(hex(value)) reinterprets the binary representation of the old state under the new aggregate function type. Then a no-op mutation materializes the default into actual storage.

New writes have to use the hashed form:

INSERT INTO uniq_state_3 (key, value_2)
SELECT
    number % 10000 AS key,
    uniqState(sipHash64(reinterpretAsUUID(number)))
FROM numbers(1000000)
GROUP BY key;

Update materialized views that feed the table to compute uniqState(sipHash64(uuid_column)) instead of uniqState(uuid_column).

Option 2: Modify the Column In Place

If you want to keep the column name value and not maintain two columns, change the type through String as an intermediate:

ALTER TABLE uniq_state_3
    MODIFY COLUMN value String;

ALTER TABLE uniq_state_3
    MODIFY COLUMN value AggregateFunction(uniq, UInt64);

ClickHouse rejects a direct AggregateFunction(uniq, UUID) to AggregateFunction(uniq, UInt64) conversion because the inner types differ, but routing through String preserves the bytes on disk while letting you swap the declared aggregate signature.

After the type change, populate new data with the explicit hash:

INSERT INTO uniq_state_3 (key, value)
SELECT
    number % 10000 AS key,
    uniqState(sipHash64(reinterpretAsUUID(number)))
FROM numbers(1000000)
GROUP BY key;

Option 3: Cast at Insert Time

If you cannot afford the schema change or want to keep the column declared as AggregateFunction(uniq, UUID), cast the state back to String at insert time:

INSERT INTO uniq_state_4 (key, value)
SELECT
    number % 10000 AS key,
    CAST(uniqState(sipHash64(reinterpretAsUUID(number))), 'String')
FROM numbers(1000000)
GROUP BY key;

The trick: ClickHouse stores aggregate states as opaque byte sequences. Casting to String puts the new-hash bytes into the same physical column, and uniqMerge produces correct results as long as every state in the merge was produced the same way.

This is the most fragile option because nothing in the schema documents the change. Use it only as a short-term measure.

Backfilling Old States

Old states cannot be retroactively re-hashed: the original UUIDs are gone, only the hashed cardinality estimator survives. Two practical paths:

  1. Replay raw data: if you still have the source UUIDs (in a raw events table, in S3, in Kafka logs), drop the affected aggregate rows and re-aggregate with uniqState(sipHash64(uuid)).
  2. Live with mixed history: rotate the aggregate table by partition and treat the pre-upgrade partitions as approximate. Read-side dashboards that group by month after the cutover stay accurate.

Common Pitfalls

  • Updating only the table, not the materialized view. The view continues to write uniqState(uuid) and the problem perpetuates. Update SELECT clauses in every MV that feeds the table.
  • Forgetting reinterpretAsUUID. When testing with numbers(), you have to convert UInt64 to UUID first, otherwise you're hashing the wrong type and the demo does not match real data.
  • Trying to convert AggregateFunction(uniq, UUID) directly to AggregateFunction(uniq, UInt64). The direct MODIFY COLUMN is rejected. Use the String intermediate step.
  • Assuming uniqExact is affected. This issue is specific to uniq (HLL-based). uniqExact stores hashed values too, but its serialization format and merge semantics are different and not subject to the UUID hash change in PR #23631.

How Pulse Helps With Cross-Version Compatibility Issues

Hash-function changes that quietly corrupt aggregate counts are exactly the kind of post-upgrade regression that is invisible in error logs. Pulse tracks every ClickHouse upgrade against the version notes and surfaces tables that store AggregateFunction states which changed serialization or hashing between the from- and to-version. Connect your cluster to Pulse and have a checklist for every upgrade.

Frequently Asked Questions

Q: Which versions are affected?

Tables created on ClickHouse before 21.6 with AggregateFunction(uniq, UUID) columns, then upgraded to 21.6 or later. The hash change is in PR #23631.

Q: Are other big-integer types affected?

The PR changed the internal hash for big integers in general. Other types under uniq (Int128, UInt128, Int256, UInt256, UUID) saw the same kind of change. UUID is the most commonly reported because it is the most common big-int aggregation key in user-facing tables.

Q: Why does sipHash64 work as the migration path?

Because the old internal hash was effectively SipHash64. Explicitly calling sipHash64(uuid) reproduces the pre-21.6 behavior, but now it is part of your query rather than an internal detail that can change between versions.

Q: Will uniqMerge give the exact pre-upgrade number after migrating?

For new data, yes. For data merged across the cutover, no, because the old states are still hashed with the old internal function and cannot be re-derived without raw inputs. The merged result remains inflated for affected rows; new aggregations from raw events on the migrated schema are accurate.

Q: Does this affect uniqHLL12 or uniqCombined?

uniqCombined and uniqHLL12 have their own serialization formats. The PR specifically targets uniq on big-int types. Check the changelog for the exact version you upgraded through if you use those variants.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.