ClickHouse Top-N with 'Other' / Remain Pattern: Bucketing Low-Cardinality Tail Data

Q: Is `topK` exact?

No. topK uses the Filtered Space-Saving algorithm and does not guarantee the result. It becomes approximate once uniq(column) > N * load_factor (load_factor defaults to 3). Use it for "good enough" top-N widgets, not for billing or anything requiring exact ranks.

Q: Does `LIMIT` affect the `WITH TOTALS` row?

No. LIMIT is applied to the visible rows before totals are computed, so the totals always reflect every grouped row that passed WHERE , not just the returned page. That is exactly why WITH TOTALS works for the remainder: subtract the returned top-N sum from the total to get "Other."

Q: Which method is cheapest on a very large table?

A single-pass method. WITH TOTALS is cheapest if your client can subtract; the window + WITH CUBE method is the cheapest that returns a labeled "Other" row directly. Avoid the UNION ALL method on large tables, since it aggregates the data two to three times.

Q: My grouping key has millions of distinct values — what should I use?

The topK hybrid (Method 5) or the window-function method. Both avoid materializing one row per distinct group, which the row_number() approach does not. If exact ranks matter, use the window method; if approximate ranks are acceptable, the topK hybrid scales best.

A common analytics requirement is to show the top N categories — countries, URLs, error codes, products — and collapse everything else into a single "Other" (or "Remain") row so the long tail does not clutter a chart or a table. ClickHouse offers several ways to do this in one query, each with different trade-offs around exactness, cardinality, and number of passes over the data.

This guide covers the practical patterns: a two-pass UNION ALL, single-pass window functions, an array-based approach, and the approximate topK route. It also explains when an exact "Other" total matters and when an approximate one is good enough.

The Problem in One Picture

Given a table of events keyed by a category column k, you want output like:

key	total
us	4,210,000
de	1,880,000
fr	990,000
... (top N)	...
Other	7,640,000

The "Other" row must equal the sum over every group not in the top N, not just the rows you happened to return. That single constraint is what makes the pattern non-trivial: a plain GROUP BY ... ORDER BY ... LIMIT N throws the tail away instead of summing it.

For the examples below, assume a table:

CREATE TABLE top_with_rest
(
    k     String,
    number UInt64
)
ENGINE = MergeTree
ORDER BY k;

Method 1: UNION ALL (Two Passes, Exact, Works Everywhere)

The most portable approach computes the top N in one branch and the remainder in another, joined with UNION ALL:

SELECT k, sum(number) AS res
FROM top_with_rest
GROUP BY k
ORDER BY res DESC
LIMIT 10

UNION ALL

SELECT 'Other' AS k, sum(number) AS res
FROM top_with_rest
WHERE k NOT IN (
    SELECT k
    FROM top_with_rest
    GROUP BY k
    ORDER BY sum(number) DESC
    LIMIT 10
);

The result is exact. The cost is that the data is scanned more than once — the top-N subquery runs in both branches (once to produce the top rows, once inside NOT IN), so the table is effectively aggregated two to three times. For small or well-partitioned tables this is fine; for large tables it is the most expensive option. It works on every ClickHouse version, with no window-function or experimental settings.

Method 2: Window Function (Single Pass, Exact)

Window functions (stable since ClickHouse 21.9, no setting required on current 24.x/25.x) let you aggregate once and compute the remainder from a running total. The idea: aggregate by k with WITH CUBE so you also get the grand total as a NULL key, then subtract the visible top-N sum from it.

SELECT
    if(isNotNull(k), k, 'Other') AS key,
    if(isNotNull(k), sum, toUInt64(sum - top_sum)) AS res
FROM
(
    SELECT
        k,
        sum,
        sumIf(sum, isNotNull(k)) OVER () AS top_sum
    FROM
    (
        SELECT
            toNullable(k) AS k,
            sum(number) AS sum
        FROM top_with_rest
        GROUP BY k WITH CUBE
        ORDER BY sum DESC
        LIMIT 11
    )
);

WITH CUBE over a single key adds one extra row where k is NULL — that row holds the grand total across all groups. LIMIT 11 keeps the top 10 real groups plus that grand-total row. The window sumIf(sum, isNotNull(k)) OVER () sums only the visible real groups; subtracting it from the grand total yields the exact remainder. This scans and aggregates the data once, which is the main advantage over UNION ALL.

Method 3: row_number() to Relabel the Tail

If you prefer to keep the tail rows and relabel them rather than subtract totals, rank the groups and rewrite anything past rank N to a constant key, then re-aggregate:

SELECT
    k,
    sum(sum) AS res
FROM
(
    SELECT
        if(rn > 10, 'Other', k) AS k,
        sum
    FROM
    (
        SELECT
            k,
            sum,
            row_number() OVER (ORDER BY sum DESC) AS rn
        FROM
        (
            SELECT k, sum(number) AS sum
            FROM top_with_rest
            GROUP BY k
        )
    )
)
GROUP BY k
ORDER BY res DESC;

This is the most readable variant and is exact, but it materializes one row per distinct group before collapsing the tail. If k has millions of distinct values, that intermediate set is large. Prefer Method 2 when cardinality is high.

Method 4: WITH TOTALS (Top N Now, Remainder Client-Side)

If your client can do one subtraction, WITH TOTALS is the cheapest single-pass option. It appends one extra summary row (or a separate totals field in JSON* formats) containing the aggregate over all rows that passed WHERE — and crucially, LIMIT is applied before totals are computed, so the totals always reflect the full data set, not just the returned page:

SELECT
    k,
    sum(number) AS res
FROM top_with_rest
GROUP BY k WITH TOTALS
ORDER BY res DESC
LIMIT 10;

You get the top 10 rows plus a totals row holding the grand total. The client computes Other = grand_total - sum(returned top-N values). This is ideal for dashboards that already parse the totals field — no second pass, no window function. The only catch is that the "Other" value is derived on the client rather than returned as a labeled row. (Note that totals_mode only changes behavior when a HAVING clause is present; without HAVING, totals always cover every grouped row.)

Method 5: Approximate Top N with topK

When you only need approximately the most frequent values — for example a "top 10 user agents" widget where exact ranks do not matter — topK finds them in a single pass with bounded memory, using the Filtered Space-Saving algorithm:

SELECT topK(10)(k) AS top_keys
FROM top_with_rest;

topK(N)(column) returns an array of up to N approximately-most-frequent values in descending frequency order. N defaults to 10 and can go up to 65536. An optional load_factor (default 3) controls accuracy: the result is approximate once uniq(column) > N * load_factor. A third 'counts' argument makes the result include approximate counts and error values. For frequency-weighted inputs use `topKWeighted`.

topK gives you the keys but not a matching "Other" total. A practical hybrid is to get the candidate keys with topK first, then compute exact sums for those keys and the remainder:

WITH (SELECT topK(10)(k) FROM top_with_rest) AS keys
SELECT
    if(has(keys, k), k, 'Other') AS bucket,
    sum(number) AS res
FROM top_with_rest
GROUP BY bucket
ORDER BY res DESC;

This is single-pass on the main aggregation (the topK subquery is a separate, cheap pass) and scales to very high cardinality because the tail is never materialized as individual groups. The trade-off is that the top-N membership is approximate — a borderline group could land in "Other" or vice versa. For background on accuracy trade-offs of topK, uniq, and friends, see Approximate methods in ClickHouse and the topK function reference.

Choosing a Method

Method	Passes	Exact?	Returns 'Other' row	Best for
`UNION ALL`	2–3	Yes	Yes	Small tables, max portability
Window + `WITH CUBE`	1	Yes	Yes	High cardinality, single round trip
`row_number()` relabel	1	Yes	Yes	Readability, low/medium cardinality
`WITH TOTALS`	1	Yes	No (client computes)	Dashboards that already parse totals
`topK` hybrid	~2	Approx. ranks	Yes	Very high cardinality, ranks not critical

Rule of thumb: for an exact "Other" in a single pass, use the window-function method. For the cheapest query when your client can subtract, use WITH TOTALS. Reach for topK only when the grouping key is so high-cardinality that exact ranking is wasteful and approximate ranks are acceptable.

Best Practices

Push the work into the database. Returning every group to the client and bucketing in application code defeats the purpose — for a high-cardinality key that can mean shipping millions of rows. Collapse the tail in SQL.
Apply WHERE and time filters before grouping. The "Other" total is only meaningful relative to the filtered set. Make sure every branch (especially in the UNION ALL method) uses the same filter, or the remainder will not reconcile with the visible rows.
Keep the top-N membership consistent across branches. In Method 1, the LIMIT and ORDER BY in the main query and in the NOT IN subquery must match exactly, otherwise a group can be both shown and counted in "Other," or dropped entirely.
Pre-aggregate hot dashboards. If the same top-N-plus-remainder query runs constantly, feed it from a materialized view or a projection that keeps per-key sums up to date, then apply the bucketing on the much smaller pre-aggregated table.
Mind ties at the boundary. When several groups have identical sums around rank N, the cut-off is arbitrary unless you add a tie-breaker to ORDER BY (for example ORDER BY res DESC, k). Without it, exact methods stay numerically correct but the labels in "Other" can shift between runs.

Common Issues

The remainder does not add up. Almost always a filter mismatch between the top-N selection and the remainder calculation, or counting NULL keys inconsistently. Verify that sum(top N) + Other = sum(everything) with a quick WITH TOTALS check.
topK returns slightly different keys each run. Expected — it is approximate. Increase load_factor (e.g. topK(10, 10)(k)) to widen the tracked set, or switch to an exact method if stability matters.
High memory on the row_number() method. That approach materializes one row per distinct group before collapsing. On millions of distinct keys, prefer the window + WITH CUBE method or the topK hybrid.
WITH TOTALS row appears where you do not expect it. In TabSeparated/CSV it follows the data after a blank line; in JSON* it is a separate totals field; in Pretty* it is a second table. Parse it according to your output format.

How Pulse Helps

Choosing between an exact two-pass query and an approximate single-pass one is the kind of decision that depends on your real cardinality, data volume, and how often the query runs. Pulse monitors ClickHouse query patterns and resource usage in production, so you can see which dashboard queries re-scan large tables, where a materialized view or projection would remove a repeated top-N aggregation, and whether topK's approximation is acceptable for a given workload. Pulse is run by ClickHouse experts who can advise on the exactness-versus-cost trade-offs behind these patterns rather than guessing.

Frequently Asked Questions

Q: How do I get the top N categories plus an "Other" bucket in one query?

Use the window-function method (Method 2): aggregate by key with WITH CUBE to obtain the grand total as a NULL-keyed row, then subtract the sum of the visible top-N rows from it using sumIf(...) OVER (). It is exact and scans the data once.

Q: Is topK exact?

No. topK uses the Filtered Space-Saving algorithm and does not guarantee the result. It becomes approximate once uniq(column) > N * load_factor (load_factor defaults to 3). Use it for "good enough" top-N widgets, not for billing or anything requiring exact ranks.

Q: Does LIMIT affect the WITH TOTALS row?

No. LIMIT is applied to the visible rows before totals are computed, so the totals always reflect every grouped row that passed WHERE, not just the returned page. That is exactly why WITH TOTALS works for the remainder: subtract the returned top-N sum from the total to get "Other."

Q: Which method is cheapest on a very large table?

A single-pass method. WITH TOTALS is cheapest if your client can subtract; the window + WITH CUBE method is the cheapest that returns a labeled "Other" row directly. Avoid the UNION ALL method on large tables, since it aggregates the data two to three times.

Q: My grouping key has millions of distinct values — what should I use?

The topK hybrid (Method 5) or the window-function method. Both avoid materializing one row per distinct group, which the row_number() approach does not. If exact ranks matter, use the window method; if approximate ranks are acceptable, the topK hybrid scales best.

Q: Can I pre-aggregate this so it is fast on every dashboard load?

Yes. Keep per-key sums in a materialized view or a projection, then run the top-N-plus-"Other" bucketing against that smaller table. The bucketing logic is identical; it just runs over pre-aggregated rows instead of raw events. See also Approximate methods and countDistinct for related cardinality-aware techniques.