DISTINCT vs GROUP BY vs LIMIT 1 BY: Choosing the Right Deduplication Approach

ClickHouse offers three clauses that look interchangeable for "give me unique rows" but behave very differently: DISTINCT, GROUP BY, and LIMIT BY. Picking the wrong one leads to surprising memory usage, non-deterministic results, or simply the wrong answer.

This guide explains the semantics of each, when one is clearly the right tool, and how DISTINCT ON and LIMIT 1 BY relate. It covers query-time deduplication only — for collapsing duplicate rows at storage time, see ReplacingMergeTree.

The Short Answer

Clause What it returns Threading Result delivery Use when
DISTINCT Unique combinations of the selected columns Final dedup step is single-threaded Streams blocks as they are processed You want a list of unique values/tuples and project only the dedup columns
GROUP BY One row per group, plus aggregates Multi-threaded Returns after aggregation completes You need aggregates (count, sum, argMax) alongside the unique keys
LIMIT n BY expr The first n rows per distinct value of expr Single-threaded Streams the result set You need full rows (all columns) per key, or a specific representative row

Rule of thumb: use DISTINCT to dedup the exact columns you select, GROUP BY when you also need aggregates, and LIMIT 1 BY when you want one complete row per key while keeping columns the key doesn't determine.

DISTINCT

SELECT DISTINCT keeps only unique rows in the result. Uniqueness is evaluated over every column in the SELECT list, not over the underlying table.

SELECT DISTINCT country, device_type
FROM events;

Key behaviors:

  • NULL handling: DISTINCT treats NULL as a normal value, where NULL == NULL. Two rows that are both NULL collapse into one.
  • Executed before ORDER BY: deduplication happens first, then sorting is applied to the surviving rows. DISTINCT and ORDER BY may reference different columns.
  • Streaming + early exit: DISTINCT streams data blocks as they are processed, so when combined with LIMIT it can stop as soon as enough unique rows are found rather than scanning the whole table.

That early-exit property makes DISTINCT ... LIMIT dramatically cheaper than a bare DISTINCT. Asking for a bounded number of distinct values touches only a fraction of the data and uses a tiny fraction of the memory:

-- Stops early; touches far less data than a full DISTINCT
SELECT DISTINCT user_id
FROM events
LIMIT 1000;

The catch: the final deduplication step is single-threaded. On wide result sets with many distinct combinations, DISTINCT can use significant memory holding the set of seen values, and it does not parallelize the final merge the way GROUP BY does.

For counting (rather than listing) distinct values, do not wrap DISTINCT in count() — use `count(DISTINCT ...)` / `uniqExact` / `uniq` instead, and consider approximate methods when exactness is not required.

GROUP BY

GROUP BY collapses rows that share the grouping key into one row per group. With no aggregate function, GROUP BY k produces the same set of keys as SELECT DISTINCT k — but it is implemented differently.

-- Equivalent key set to SELECT DISTINCT country
SELECT country
FROM events
GROUP BY country;

-- The real reason to use GROUP BY: aggregates per key
SELECT
    country,
    count() AS events,
    uniqExact(user_id) AS users,
    max(event_time) AS last_seen
FROM events
GROUP BY country;

Key behaviors:

  • Multi-threaded aggregation: ClickHouse parallelizes GROUP BY across threads, which is why a plain GROUP BY k is typically faster than SELECT DISTINCT k on the same data.
  • Blocking, not streaming: results are returned only after aggregation finishes, so GROUP BY ... LIMIT cannot exit as early as DISTINCT ... LIMIT (the full grouping still completes, though projections can pre-aggregate it).
  • Aggregates are the point: if you need anything beyond the unique keys — counts, sums, or "the value at the max of another column" via `argMax`GROUP BY is the only one of the three that produces them.

Use GROUP BY whenever you need aggregates, or when you want the fastest path to a unique key list on large data and do not benefit from DISTINCT's early-exit.

LIMIT BY

LIMIT n BY expressions returns the first n rows for each distinct value of expressions. It is unrelated to the plain LIMIT clause (which caps the total result), and the two can be used together.

-- One row per user_id
SELECT *
FROM events
ORDER BY user_id, event_time DESC
LIMIT 1 BY user_id;

Syntax variants:

LIMIT n BY expressions
LIMIT offset, n BY expressions
LIMIT n OFFSET offset BY expressions

What makes LIMIT BY different from the other two: it returns whole rows, including columns that are not part of the key. DISTINCT and GROUP BY force you to either include a column in the dedup/group key or aggregate it; LIMIT 1 BY lets you keep arbitrary columns from a chosen representative row.

The most important rule is determinism: LIMIT BY returns the first rows per group in the current stream order. Without ORDER BY, that order is non-deterministic because of multi-threading. Always pair LIMIT 1 BY with an ORDER BY that places the row you want first:

-- Deterministic: the most recent row per device
SELECT device_id, status, event_time
FROM device_events
ORDER BY device_id, event_time DESC
LIMIT 1 BY device_id;

DISTINCT ON (columns) is supported as syntactic sugar and is equivalent to LIMIT 1 BY columns:

-- These two return one row per (a, b)
SELECT DISTINCT ON (a, b) a, b, c FROM t;
SELECT a, b, c FROM t LIMIT 1 BY a, b;

Note DISTINCT ON is not the same as DISTINCT, and the two keywords cannot be combined in one query. LIMIT BY is single-threaded and streams its result; like DISTINCT, combining it with a top-level LIMIT lets it stop early.

Picking the Right One

Work top to bottom; the first match is usually correct:

  1. You only need the count of distinct values → use `count(DISTINCT col)` / `uniqExact` / `uniq`, not a DISTINCT subquery.
  2. You need aggregates per key (count, sum, min/max, `argMax`) → GROUP BY.
  3. You need exactly one full row per key, keeping columns outside the key → LIMIT 1 BY key with an explicit ORDER BY (or the DISTINCT ON alias).
  4. You need a list of unique values/tuples and select only those columns → DISTINCT, ideally bounded with LIMIT.

"Latest row per key" specifically

This is the most common case where people reach for the wrong clause. Two good options:

-- Option A: full latest row per key, all columns preserved
SELECT *
FROM events
ORDER BY user_id, event_time DESC
LIMIT 1 BY user_id;

-- Option B: pull specific columns at the max timestamp, no sort needed
SELECT
    user_id,
    argMax(status, event_time) AS latest_status,
    max(event_time)            AS latest_time
FROM events
GROUP BY user_id;

Option A keeps the entire row and is intuitive but requires a sort. Option B avoids a global sort and is often cheaper, but you must enumerate each column you want via `argMax`. For genuinely append-and-supersede workloads, deduplicate at storage time with ReplacingMergeTree instead of doing it on every query.

Best Practices

  1. Never use LIMIT 1 BY without ORDER BY. The "first" row per key is arbitrary otherwise, and your results will change between runs.
  2. Bound DISTINCT with LIMIT when you only need a sample of distinct values — it turns a full scan into an early-exit query and slashes memory.
  3. Prefer GROUP BY over SELECT DISTINCT for large unique-key lists; the multi-threaded aggregation path is usually faster.
  4. Don't dedup at query time if it should be a storage concern. Recurring "latest row" needs belong in ReplacingMergeTree/AggregatingMergeTree, not in every query.
  5. Match the dedup columns to the SELECT list. SELECT DISTINCT a, b dedups on (a, b); adding a column you forgot about silently changes the result.

Common Issues

  • DISTINCT uses too much memory. The final dedup step is single-threaded and holds all seen combinations. Reduce the number of selected columns, add a LIMIT, or switch to GROUP BY. For counting, use uniqExact/uniq — see count(DISTINCT).
  • LIMIT 1 BY returns a different row each run. You omitted ORDER BY, or it doesn't fully order the rows within each key. Add the tiebreaker columns.
  • GROUP BY ... LIMIT is not faster than expected. GROUP BY is blocking; the grouping completes before LIMIT applies. Use DISTINCT ... LIMIT for early-exit on unique values, or a projection for repeated aggregates.
  • Confusing DISTINCT and DISTINCT ON. DISTINCT ON (cols) is LIMIT 1 BY cols; plain DISTINCT dedups the whole row. They are different features and can't be mixed in one query.

How Pulse Helps

Choosing between DISTINCT, GROUP BY, and LIMIT 1 BY is rarely a one-time decision — query patterns drift, data volumes grow, and a clause that was cheap at a million rows becomes a memory hog at a billion. Pulse monitors ClickHouse query performance and memory usage in production, surfacing the specific queries where a single-threaded DISTINCT or an unordered LIMIT BY is causing spikes or non-deterministic results. The team behind Pulse runs ClickHouse, Elasticsearch, and OpenSearch clusters at scale, so the guidance you get is grounded in how these clauses actually behave under load rather than in benchmarks on synthetic data.

Frequently Asked Questions

Q: Is SELECT DISTINCT a the same as SELECT a GROUP BY a?

They return the same set of values, but execute differently. GROUP BY is multi-threaded and usually faster on large data; DISTINCT streams blocks and can exit early when combined with LIMIT. Use GROUP BY when you also need aggregates.

Q: Why does my LIMIT 1 BY query return different rows each time?

Because there is no ORDER BY, or the ORDER BY doesn't fully determine which row comes first within each key. LIMIT BY returns the first rows in stream order, which is non-deterministic across threads. Add an ORDER BY covering the key plus a tiebreaker.

Q: Does ClickHouse support DISTINCT ON?

Yes. DISTINCT ON (columns) is supported and is equivalent to LIMIT 1 BY columns — it returns one row per distinct combination of those columns. It is distinct from the plain DISTINCT keyword and the two cannot be combined in a single query.

Q: What's the cheapest way to get one full row per key?

LIMIT 1 BY key with an ORDER BY that places the desired row first keeps the whole row. If you only need specific columns, GROUP BY key with `argMax` avoids a global sort and is often cheaper.

Q: How do I just count distinct values without listing them?

Use count(DISTINCT col), uniqExact(col) for an exact count, or uniq(col) for a fast approximate count — not SELECT count() FROM (SELECT DISTINCT col ...). See count(DISTINCT) in ClickHouse and approximate methods.

Q: Should I deduplicate in the query or in the table?

If duplicates come from updates/upserts that you always want collapsed, handle it at storage time with ReplacingMergeTree. Use query-time DISTINCT/LIMIT 1 BY for ad-hoc deduplication or when different queries need different dedup keys.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.