ClickHouse Approximate Methods: Trading Accuracy for Speed at Scale

Analytics at scale often doesn't require exact answers. When you're counting unique users across billions of events, computing percentiles for latency monitoring, or finding the most frequent values in a log stream, an answer that's 98% accurate and arrives in milliseconds is far more useful than an exact answer that takes minutes and exhausts your memory.

ClickHouse provides a rich set of approximate aggregation functions purpose-built for this trade-off. This guide covers what's available, how each method works, and when to use them.

Approximate Distinct Counting

Counting unique values is the most common — and, when computed exactly, the most expensive — operation to approximate. ClickHouse provides the standard `count(DISTINCT)` syntax, which by default computes an exact count. But for high-cardinality columns, the approximate alternatives are dramatically faster and lighter on memory.

uniq()

SELECT uniq(user_id) FROM events;

Uses an adaptive sampling algorithm. ClickHouse automatically selects the best internal algorithm based on the data:

  • For small sets: exact counting
  • For larger sets: approximate algorithm with typically less than 2% error

Memory usage: Significantly lower than exact counting for high-cardinality columns.

Best for: General-purpose approximate counting. This is the recommended default when exact counts aren't required.

uniqCombined()

SELECT uniqCombined(user_id) FROM events;

Uses a hybrid approach combining three algorithms:

  1. Array for very small cardinalities
  2. Hash table for medium cardinalities
  3. HyperLogLog for high cardinalities

Provides approximately 2–3% error for high-cardinality data and can optionally accept a precision parameter:

-- Higher precision than the default (more memory, less error)
SELECT uniqCombined(20)(user_id) FROM events;

The precision parameter (default 17) sets the number of HyperLogLog cells to 2^precision; higher values give more accuracy at the cost of more memory.

A 64-bit variant, uniqCombined64(), uses 64-bit hashes for better accuracy at extremely high cardinalities (billions of unique values) where 32-bit hash collisions become significant.
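It's a drop-in replacement:

SELECT uniqCombined64(user_id) FROM events;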

uniqHLL12()

SELECT uniqHLL12(user_id) FROM events;

Pure HyperLogLog implementation with 2^12 cells. Fixed memory usage regardless of cardinality (~2.5 KB per aggregation state). Error rate is approximately 1.6%.

Best for: Scenarios where you need a fixed, predictable memory footprint — especially useful in materialized views or AggregatingMergeTree tables where states are stored long-term.

uniqTheta()

SELECT uniqTheta(user_id) FROM events;

Based on the Theta Sketch algorithm from Apache DataSketches. Supports set operations (union, intersection, difference) on sketch states, which the HyperLogLog-based functions do not.

Best for: Advanced use cases requiring set operations on distinct count estimates — e.g., computing the overlap between two user segments without re-querying raw data.
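A minimal sketch of the set-operation workflow, following the documented pattern of building sketch states with arrayReduce (the literal values are illustrative; in practice a and b would be stored uniqThetaState(user_id) states for two segments):

-- Estimate the overlap of two sets without touching raw rows
SELECT finalizeAggregation(uniqThetaIntersect(a, b)) AS overlap_estimate
FROM
(
    SELECT
        arrayReduce('uniqThetaState', [1, 2, 3]) AS a,
        arrayReduce('uniqThetaState', [2, 3, 4]) AS b
);

uniqThetaUnion and uniqThetaNot work the same way for union and difference estimates.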

Distinct Count Performance Comparison

For a column with ~100 million unique values:

| Function | Accuracy | Memory per State | Relative Speed |
|---|---|---|---|
| `count(DISTINCT)` / `uniqExact` | Exact | High (proportional to cardinality) | 1x (baseline) |
| `uniq` | ~2% error | Low | 3–5x faster |
| `uniqCombined` | ~2–3% error | Low–Medium | 3–5x faster |
| `uniqCombined64` | ~2–3% error | Medium | 3–4x faster |
| `uniqHLL12` | ~1.6% error | Fixed (~2.5 KB) | 3–5x faster |
| `uniqTheta` | ~2–3% error | Medium | 2–3x faster |

The speed advantage grows with cardinality. At 10 billion unique values, approximate methods can be 10–50x faster than exact counting.

Configuring count(DISTINCT) to Use Approximate Methods

You can change which function `count(DISTINCT)` uses under the hood via the count_distinct_implementation setting:

-- Use approximate counting globally
SET count_distinct_implementation = 'uniq';
SELECT count(DISTINCT user_id) FROM events;

-- Or per-query
SELECT count(DISTINCT user_id) FROM events
SETTINGS count_distinct_implementation = 'uniqCombined';

This is useful when you want standard SQL syntax but don't need exact results, or when migrating from other databases where count(DISTINCT) was already approximate.

Approximate Quantiles and Percentiles

Exact percentile computation requires sorting the entire dataset — expensive for billions of rows. ClickHouse offers approximate alternatives.

quantileTDigest()

SELECT quantileTDigest(0.99)(response_time_ms) FROM requests;

Uses the t-digest algorithm, which maintains a compact sketch of the data distribution. Accuracy is highest at the tails (p99, p999) where it matters most for latency monitoring.

-- Multiple percentiles in one pass
SELECT
    quantilesTDigest(0.5, 0.9, 0.95, 0.99)(response_time_ms)
FROM requests;

Best for: Latency percentiles in monitoring dashboards, SLA reporting.

quantileTiming()

SELECT quantileTiming(0.95)(duration_ms) FROM events;

Optimized specifically for timing data (millisecond values). Uses a fixed-size histogram, making it very fast and memory-efficient but limited to values under ~30 seconds (30,000 ms). Values above the range are clamped.

Best for: Request latency and page load time monitoring where values are bounded.

quantileBFloat16()

SELECT quantileBFloat16(0.5)(value) FROM metrics;

Uses brain floating point format for reduced precision but very compact state. Suitable for rough estimates across massive datasets where memory is a constraint.

Top-K: Finding the Most Frequent Values

topK()

-- Find the 10 most common error messages
SELECT topK(10)(error_message) FROM logs;

Returns an approximate list of the most frequent values using the Space-Saving algorithm. Results are approximate — the exact order and inclusion of borderline values may vary.

-- With counts
SELECT topKWeighted(10)(url, hit_count) FROM page_views;

topKWeighted supports weighted counting, where each value contributes a specified weight rather than 1.

Best for: Finding hot paths, frequent errors, popular pages, or common query patterns.

Exact vs. Approximate Top-K

For exact results, you'd write:

SELECT error_message, count() AS c
FROM logs
GROUP BY error_message
ORDER BY c DESC
LIMIT 10;

This requires a full GROUP BY across all distinct values. topK() avoids materializing the full group-by result, using constant memory regardless of cardinality.

Sampling: Approximate Everything

ClickHouse supports SAMPLE clauses that process only a fraction of the data:

-- Process ~10% of data
SELECT
    count() * 10 AS estimated_total,
    uniq(user_id) AS approx_users
FROM events
SAMPLE 0.1;

-- Sample a fixed number of rows
SELECT avg(response_time) FROM requests SAMPLE 10000;

Sampling works at the granule level and requires a SAMPLE BY expression in the table definition:

CREATE TABLE events (
    ...
) ENGINE = MergeTree()
ORDER BY (event_date, cityHash64(user_id))
SAMPLE BY cityHash64(user_id);

Best for: Exploratory analysis on very large tables, or any workload whose aggregations can tolerate sampling error. Sampling reduces I/O across all aggregation functions, not just specific ones.
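Rather than hardcoding the scaling multiplier (the * 10 above), you can read the actual coefficient from the _sample_factor virtual column, which ClickHouse exposes on sampled reads (the table and column names here are illustrative):

-- _sample_factor reflects the real sampling coefficient
SELECT sum(revenue * _sample_factor) AS estimated_revenue
FROM purchases
SAMPLE 0.1;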

Storing Approximate States in Materialized Views

One of the most powerful patterns is storing approximate aggregation states in AggregatingMergeTree tables for incremental computation:

CREATE TABLE daily_metrics
(
    date Date,
    unique_users AggregateFunction(uniq, UInt64),
    p99_latency AggregateFunction(quantileTDigest(0.99), Float64),
    top_urls AggregateFunction(topK(20), String)
)
ENGINE = AggregatingMergeTree()
ORDER BY date;

CREATE MATERIALIZED VIEW daily_metrics_mv
TO daily_metrics
AS
SELECT
    toDate(event_time) AS date,
    uniqState(user_id) AS unique_users,
    quantileTDigestState(0.99)(response_time) AS p99_latency,
    topKState(20)(url) AS top_urls
FROM events
GROUP BY date;

-- Query with merge
SELECT
    date,
    uniqMerge(unique_users) AS unique_users,
    quantileTDigestMerge(0.99)(p99_latency) AS p99_latency,
    topKMerge(20)(top_urls) AS top_urls
FROM daily_metrics
GROUP BY date;

The -State suffix captures the intermediate sketch, and -Merge combines sketches at query time. This lets you compute daily states during ingestion and roll them up into weekly or monthly views without re-scanning raw data:

-- Weekly unique users from daily pre-aggregated states
SELECT
    toStartOfWeek(date) AS week,
    uniqMerge(unique_users) AS weekly_uniques
FROM daily_metrics
GROUP BY week;

This correctly handles users who appear on multiple days — they're deduplicated during the merge. A simple SUM of daily distinct counts would overcount.
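For contrast, here is the naive rollup to avoid: finalizing the daily counts first and then summing them double-counts users who are active on multiple days.

-- Anti-pattern: summing per-day distinct counts overcounts repeat users
SELECT
    toStartOfWeek(date) AS week,
    sum(daily_uniques) AS inflated_weekly_uniques
FROM
(
    SELECT date, uniqMerge(unique_users) AS daily_uniques
    FROM daily_metrics
    GROUP BY date
)
GROUP BY week;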

When to Use Approximate Methods

| Situation | Approach |
|---|---|
| Exact results required (billing, compliance) | `count(DISTINCT)` / `uniqExact` / `quantileExact` |
| General analytics dashboards | `uniq`, `quantileTDigest`, `topK` |
| Storing pre-aggregated states | `uniqCombined` or `uniqHLL12` with -State/-Merge |
| Billions of unique values | `uniqCombined64` |
| Set operations on cardinality sketches | `uniqTheta` |
| Fixed memory budget | `uniqHLL12` |
| Exploratory analysis on huge tables | SAMPLE clause |
| Latency monitoring (bounded values) | `quantileTiming` |
| Finding hot paths or frequent values | `topK` |

Common Pitfalls

NULL Handling

All approximate functions skip NULL values. If you need NULLs counted as a distinct value, map them to a sentinel first (pick one that cannot collide with a real ID):

SELECT uniq(ifNull(user_id, 0)) FROM events;

Multi-Column Distinct Counts

You can count distinct combinations of multiple columns:

SELECT uniq(user_id, session_id) FROM events;

Results Vary Slightly Between Runs

Approximate algorithms use hashing, and the order of data processing across parallel threads can affect results. Variations are within the stated error bounds and are normal. If you need deterministic results, use exact methods.

Mixing Approximate and Exact

You can freely mix methods in the same query:

SELECT
    uniqExact(order_id) AS exact_orders,
    uniq(user_id) AS approx_users,
    quantileTDigest(0.99)(latency) AS approx_p99
FROM orders;

Frequently Asked Questions

Q: Is a 2% error acceptable for business metrics?

For most analytics — dashboards, trend analysis, audience sizing — yes. A 2% error on 1 million unique users means the count is between 980,000 and 1,020,000. For billing or compliance where exact counts are contractual, use exact methods.

Q: Do approximate functions work on distributed tables?

Yes, and this is where they excel. Each shard computes a compact sketch state that's sent to the coordinator for merging — far less network overhead than transferring millions of unique values for exact counting.

Q: Can I combine approximate states across different time granularities?

Yes. Store states at the finest granularity (e.g., hourly) and merge up to coarser granularities (daily, weekly, monthly) at query time using -Merge combinators. The sketches correctly handle deduplication across time periods.

Q: How do I know if approximate methods would help my queries?

Check system.query_log for queries with high memory_usage and read_bytes relative to the result size. High-cardinality GROUP BY or distinct count queries that use GBs of memory are prime candidates for approximate alternatives.
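As a starting point, something like the following surfaces the heaviest repeated query shapes (the 1 GB threshold is illustrative; adjust to your workload):

SELECT
    normalized_query_hash,
    any(query) AS sample_query,
    max(memory_usage) AS max_memory,
    max(read_bytes) AS max_read
FROM system.query_log
WHERE type = 'QueryFinish'
  AND memory_usage > 1000000000 -- queries using over ~1 GB
GROUP BY normalized_query_hash
ORDER BY max_memory DESC
LIMIT 10;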

Q: Are there approximate methods for COUNT, SUM, or AVG?

COUNT, SUM, and AVG are already efficient and don't benefit from approximation in the same way. For these, use SAMPLE to reduce the data scanned instead. The approximate function family targets operations that are inherently expensive due to cardinality (distinct counts) or sorting (quantiles).
