ClickHouse offers three clauses that look interchangeable for "give me unique rows" but behave very differently: DISTINCT, GROUP BY, and LIMIT BY. Picking the wrong one leads to surprising memory usage, non-deterministic results, or simply the wrong answer.
This guide explains the semantics of each, when one is clearly the right tool, and how DISTINCT ON and LIMIT 1 BY relate. It covers query-time deduplication only — for collapsing duplicate rows at storage time, see ReplacingMergeTree.
The Short Answer
| Clause | What it returns | Threading | Result delivery | Use when |
|---|---|---|---|---|
DISTINCT |
Unique combinations of the selected columns | Final dedup step is single-threaded | Streams blocks as they are processed | You want a list of unique values/tuples and project only the dedup columns |
GROUP BY |
One row per group, plus aggregates | Multi-threaded | Returns after aggregation completes | You need aggregates (count, sum, argMax) alongside the unique keys |
LIMIT n BY expr |
The first n rows per distinct value of expr |
Single-threaded | Streams the result set | You need full rows (all columns) per key, or a specific representative row |
Rule of thumb: use DISTINCT to dedup the exact columns you select, GROUP BY when you also need aggregates, and LIMIT 1 BY when you want one complete row per key while keeping columns the key doesn't determine.
DISTINCT
SELECT DISTINCT keeps only unique rows in the result. Uniqueness is evaluated over every column in the SELECT list, not over the underlying table.
SELECT DISTINCT country, device_type
FROM events;
Key behaviors:
- NULL handling:
DISTINCTtreatsNULLas a normal value, whereNULL == NULL. Two rows that are bothNULLcollapse into one. - Executed before ORDER BY: deduplication happens first, then sorting is applied to the surviving rows.
DISTINCTandORDER BYmay reference different columns. - Streaming + early exit:
DISTINCTstreams data blocks as they are processed, so when combined withLIMITit can stop as soon as enough unique rows are found rather than scanning the whole table.
That early-exit property makes DISTINCT ... LIMIT dramatically cheaper than a bare DISTINCT. Asking for a bounded number of distinct values touches only a fraction of the data and uses a tiny fraction of the memory:
-- Stops early; touches far less data than a full DISTINCT
SELECT DISTINCT user_id
FROM events
LIMIT 1000;
The catch: the final deduplication step is single-threaded. On wide result sets with many distinct combinations, DISTINCT can use significant memory holding the set of seen values, and it does not parallelize the final merge the way GROUP BY does.
For counting (rather than listing) distinct values, do not wrap DISTINCT in count() — use `count(DISTINCT ...)` / `uniqExact` / `uniq` instead, and consider approximate methods when exactness is not required.
GROUP BY
GROUP BY collapses rows that share the grouping key into one row per group. With no aggregate function, GROUP BY k produces the same set of keys as SELECT DISTINCT k — but it is implemented differently.
-- Equivalent key set to SELECT DISTINCT country
SELECT country
FROM events
GROUP BY country;
-- The real reason to use GROUP BY: aggregates per key
SELECT
country,
count() AS events,
uniqExact(user_id) AS users,
max(event_time) AS last_seen
FROM events
GROUP BY country;
Key behaviors:
- Multi-threaded aggregation: ClickHouse parallelizes
GROUP BYacross threads, which is why a plainGROUP BY kis typically faster thanSELECT DISTINCT kon the same data. - Blocking, not streaming: results are returned only after aggregation finishes, so
GROUP BY ... LIMITcannot exit as early asDISTINCT ... LIMIT(the full grouping still completes, though projections can pre-aggregate it). - Aggregates are the point: if you need anything beyond the unique keys — counts, sums, or "the value at the max of another column" via `argMax` —
GROUP BYis the only one of the three that produces them.
Use GROUP BY whenever you need aggregates, or when you want the fastest path to a unique key list on large data and do not benefit from DISTINCT's early-exit.
LIMIT BY
LIMIT n BY expressions returns the first n rows for each distinct value of expressions. It is unrelated to the plain LIMIT clause (which caps the total result), and the two can be used together.
-- One row per user_id
SELECT *
FROM events
ORDER BY user_id, event_time DESC
LIMIT 1 BY user_id;
Syntax variants:
LIMIT n BY expressions
LIMIT offset, n BY expressions
LIMIT n OFFSET offset BY expressions
What makes LIMIT BY different from the other two: it returns whole rows, including columns that are not part of the key. DISTINCT and GROUP BY force you to either include a column in the dedup/group key or aggregate it; LIMIT 1 BY lets you keep arbitrary columns from a chosen representative row.
The most important rule is determinism: LIMIT BY returns the first rows per group in the current stream order. Without ORDER BY, that order is non-deterministic because of multi-threading. Always pair LIMIT 1 BY with an ORDER BY that places the row you want first:
-- Deterministic: the most recent row per device
SELECT device_id, status, event_time
FROM device_events
ORDER BY device_id, event_time DESC
LIMIT 1 BY device_id;
DISTINCT ON (columns) is supported as syntactic sugar and is equivalent to LIMIT 1 BY columns:
-- These two return one row per (a, b)
SELECT DISTINCT ON (a, b) a, b, c FROM t;
SELECT a, b, c FROM t LIMIT 1 BY a, b;
Note DISTINCT ON is not the same as DISTINCT, and the two keywords cannot be combined in one query. LIMIT BY is single-threaded and streams its result; like DISTINCT, combining it with a top-level LIMIT lets it stop early.
Picking the Right One
Work top to bottom; the first match is usually correct:
- You only need the count of distinct values → use `count(DISTINCT col)` / `uniqExact` / `uniq`, not a
DISTINCTsubquery. - You need aggregates per key (count, sum, min/max, `argMax`) →
GROUP BY. - You need exactly one full row per key, keeping columns outside the key →
LIMIT 1 BY keywith an explicitORDER BY(or theDISTINCT ONalias). - You need a list of unique values/tuples and select only those columns →
DISTINCT, ideally bounded withLIMIT.
"Latest row per key" specifically
This is the most common case where people reach for the wrong clause. Two good options:
-- Option A: full latest row per key, all columns preserved
SELECT *
FROM events
ORDER BY user_id, event_time DESC
LIMIT 1 BY user_id;
-- Option B: pull specific columns at the max timestamp, no sort needed
SELECT
user_id,
argMax(status, event_time) AS latest_status,
max(event_time) AS latest_time
FROM events
GROUP BY user_id;
Option A keeps the entire row and is intuitive but requires a sort. Option B avoids a global sort and is often cheaper, but you must enumerate each column you want via `argMax`. For genuinely append-and-supersede workloads, deduplicate at storage time with ReplacingMergeTree instead of doing it on every query.
Best Practices
- Never use
LIMIT 1 BYwithoutORDER BY. The "first" row per key is arbitrary otherwise, and your results will change between runs. - Bound
DISTINCTwithLIMITwhen you only need a sample of distinct values — it turns a full scan into an early-exit query and slashes memory. - Prefer
GROUP BYoverSELECT DISTINCTfor large unique-key lists; the multi-threaded aggregation path is usually faster. - Don't dedup at query time if it should be a storage concern. Recurring "latest row" needs belong in ReplacingMergeTree/AggregatingMergeTree, not in every query.
- Match the dedup columns to the SELECT list.
SELECT DISTINCT a, bdedups on(a, b); adding a column you forgot about silently changes the result.
Common Issues
DISTINCTuses too much memory. The final dedup step is single-threaded and holds all seen combinations. Reduce the number of selected columns, add aLIMIT, or switch toGROUP BY. For counting, useuniqExact/uniq— see count(DISTINCT).LIMIT 1 BYreturns a different row each run. You omittedORDER BY, or it doesn't fully order the rows within each key. Add the tiebreaker columns.GROUP BY ... LIMITis not faster than expected.GROUP BYis blocking; the grouping completes beforeLIMITapplies. UseDISTINCT ... LIMITfor early-exit on unique values, or a projection for repeated aggregates.- Confusing
DISTINCTandDISTINCT ON.DISTINCT ON (cols)isLIMIT 1 BY cols; plainDISTINCTdedups the whole row. They are different features and can't be mixed in one query.
How Pulse Helps
Choosing between DISTINCT, GROUP BY, and LIMIT 1 BY is rarely a one-time decision — query patterns drift, data volumes grow, and a clause that was cheap at a million rows becomes a memory hog at a billion. Pulse monitors ClickHouse query performance and memory usage in production, surfacing the specific queries where a single-threaded DISTINCT or an unordered LIMIT BY is causing spikes or non-deterministic results. The team behind Pulse runs ClickHouse, Elasticsearch, and OpenSearch clusters at scale, so the guidance you get is grounded in how these clauses actually behave under load rather than in benchmarks on synthetic data.
Frequently Asked Questions
Q: Is SELECT DISTINCT a the same as SELECT a GROUP BY a?
They return the same set of values, but execute differently. GROUP BY is multi-threaded and usually faster on large data; DISTINCT streams blocks and can exit early when combined with LIMIT. Use GROUP BY when you also need aggregates.
Q: Why does my LIMIT 1 BY query return different rows each time?
Because there is no ORDER BY, or the ORDER BY doesn't fully determine which row comes first within each key. LIMIT BY returns the first rows in stream order, which is non-deterministic across threads. Add an ORDER BY covering the key plus a tiebreaker.
Q: Does ClickHouse support DISTINCT ON?
Yes. DISTINCT ON (columns) is supported and is equivalent to LIMIT 1 BY columns — it returns one row per distinct combination of those columns. It is distinct from the plain DISTINCT keyword and the two cannot be combined in a single query.
Q: What's the cheapest way to get one full row per key?
LIMIT 1 BY key with an ORDER BY that places the desired row first keeps the whole row. If you only need specific columns, GROUP BY key with `argMax` avoids a global sort and is often cheaper.
Q: How do I just count distinct values without listing them?
Use count(DISTINCT col), uniqExact(col) for an exact count, or uniq(col) for a fast approximate count — not SELECT count() FROM (SELECT DISTINCT col ...). See count(DISTINCT) in ClickHouse and approximate methods.
Q: Should I deduplicate in the query or in the table?
If duplicates come from updates/upserts that you always want collapsed, handle it at storage time with ReplacingMergeTree. Use query-time DISTINCT/LIMIT 1 BY for ad-hoc deduplication or when different queries need different dedup keys.