ReplacingMergeTree Duplicate Deduplication: The Async Merge Gotcha

Q: How do I always get deduplicated results?

Add the FINAL keyword to your SELECT , or write an explicit GROUP BY key with argMax(col, version) for each column. Both apply the replacement logic at query time regardless of merge state.

Q: Is FINAL or argMax faster?

It depends. FINAL is convenient and fast when you filter on the primary key, but slow for wide unfiltered scans (and PREWHERE optimization is off by default under FINAL , though it can be enabled via optimize_move_to_prewhere_if_final ). An explicit GROUP BY ... argMax often wins on large, unfiltered queries. Benchmark both for your workload.

Q: Does the version column make deduplication happen sooner?

No. The version column only determines which row survives once a merge or FINAL runs. It has no effect on when deduplication occurs.

The single most common surprise with ReplacingMergeTree is that it does not deduplicate on insert. Rows with the same sorting key happily coexist in a table, and a plain SELECT returns every one of them until a background merge eventually collapses them. This page explains exactly why that happens, when (or whether) it gets resolved, and the query-time and design-time patterns that give you correct results immediately.

If you are new to the engine itself, start with ReplacingMergeTree basics and the MergeTree family guide. This page focuses specifically on the duplicate-visibility gotcha.

Why Duplicates Stay Visible

ReplacingMergeTree removes duplicates only during a background merge, never at insert time. Per the official ClickHouse documentation, "data deduplication occurs only during a merge. Merging occurs in the background at an unknown time, so you can't plan for it."

Two facts follow from this:

Every insert creates a new data part. Deduplication happens between parts when ClickHouse merges them. Until that merge runs, the old and new versions of a row live in different parts and both are returned.
Merges are asynchronous and not guaranteed. ClickHouse decides when to merge based on its own heuristics. A small table that never accumulates enough parts may sit with duplicates indefinitely. As the docs put it, the engine "doesn't guarantee rows will be deduplicated, and you shouldn't rely on it."

The "duplicate key" is defined by the table's ORDER BY clause — not by a PRIMARY KEY constraint in the relational sense. Two rows with identical ORDER BY values are candidates for replacement.

CREATE TABLE users
(
    user_id   UInt64,
    name      String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

INSERT INTO users VALUES (1, 'Alice', '2026-06-01 10:00:00');
INSERT INTO users VALUES (1, 'Alice Smith', '2026-06-02 10:00:00');

-- Returns TWO rows, not one — no merge has run yet
SELECT * FROM users;

The optional version column (updated_at above) only decides which row wins once a merge happens — it does nothing to make deduplication immediate.

The Partition Trap: Duplicates That Never Collapse

There is a sharper version of this gotcha. ClickHouse only merges parts within a single partition. If two rows with the same sorting key land in different partitions, they will never be merged into one — no amount of waiting or OPTIMIZE helps.

-- BAD: partitioning by a column that changes between versions of a row
CREATE TABLE events
(
    id      UInt64,
    status  String,
    day     Date
)
ENGINE = ReplacingMergeTree
PARTITION BY day      -- different days => different partitions
ORDER BY id;

If the same id is inserted on two different days, the rows go to different partitions and are never deduplicated against each other. The best practice is to choose a partition key such that the same sorting key never appears in more than one partition. Deduplication is scoped per-partition, and so is the FINAL optimization described below.

Workaround 1: The FINAL Keyword

FINAL forces ClickHouse to apply the merge/replacement logic at query time, guaranteeing correct results regardless of whether a background merge has run.

-- Correct, deduplicated result — every time
SELECT * FROM users FINAL;

SELECT count() FROM users FINAL;

FINAL keeps only the highest-version row for each sorting key (and, if you use the delete column, removes rows whose latest version is marked deleted). The trade-off is performance: ClickHouse must reconcile rows at read time. The overhead is most noticeable when your query does not filter on primary-key columns, because more data must be read and deduplicated. The PREWHERE optimization is not applied under FINAL by default. It can be enabled with the optimize_move_to_prewhere_if_final setting (off by default), though with a restriction: only conditions on the sorting key are moved to PREWHERE, to avoid filtering rows prematurely before FINAL merges them. Starting in v25.11, the apply_prewhere_after_final setting offers an alternative approach that applies PREWHERE after FINAL runs.

A few practical notes:

Filter on the primary key when you can. SELECT * FROM users FINAL WHERE user_id = 1 is far cheaper than a full-table FINAL, because deduplication scope shrinks to the relevant ranges.
Parallelize across partitions. Setting do_not_merge_across_partitions_select_final = 1 lets FINAL deduplicate each partition independently and in parallel, which speeds up partitioned tables. This is only correct when the same sorting key never appears in more than one partition — exactly the design discussed above.
Don't use FINAL on the wrong engine. Applying it where it isn't supported throws an error; see ILLEGAL_FINAL.

SELECT *
FROM users FINAL
WHERE user_id = 1
SETTINGS do_not_merge_across_partitions_select_final = 1;

Workaround 2: GROUP BY with argMax

When FINAL is too slow — typically wide SELECT * scans without a primary-key filter — an explicit GROUP BY with argMax() often outperforms it and gives you full control over the deduplication logic.

-- Keep the latest value of each column per key, using the version column
SELECT
    user_id,
    argMax(name, updated_at)       AS name,
    max(updated_at)                AS updated_at
FROM users
GROUP BY user_id;

argMax(col, version) returns the value of col from the row with the maximum version. List every non-key column you need with its own argMax. An equivalent pattern uses a subquery to select the latest version per key:

SELECT *
FROM users
WHERE (user_id, updated_at) IN
(
    SELECT user_id, max(updated_at)
    FROM users
    GROUP BY user_id
);

The argMax approach is verbose, but it sidesteps FINAL's read-time reconciliation and can be much faster on large, unfiltered scans.

Workaround 3: Materialized Views for Pre-Deduplicated State

If you repeatedly query the deduplicated state, push the cost off the read path with an incremental aggregation. An AggregatingMergeTree target fed by a materialized view can maintain a single "latest row per key" using argMaxState/argMaxMerge, so reads never pay deduplication overhead. This is more work to set up than FINAL, but it is the right answer for hot dashboards and frequent lookups. See the materialized view guide for the full pattern.

FINAL vs. OPTIMIZE FINAL vs. argMax

Approach	When it runs	Guarantees correct result	Cost
Plain `SELECT`	—	No (duplicates visible)	Cheapest, but wrong
`SELECT ... FINAL`	Per query, at read time	Yes	Read-time overhead; worse without a PK filter
`GROUP BY` + `argMax`	Per query, at read time	Yes (you define the logic)	Often faster than `FINAL` on wide scans
`OPTIMIZE TABLE ... FINAL`	Manual, one-off	Yes, on disk after it completes	Heavy: rewrites parts, locks; not for routine use
Materialized view → AggregatingMergeTree	Incrementally on insert	Yes (in the target table)	Setup cost; cheapest reads

Should You Use OPTIMIZE TABLE ... FINAL?

OPTIMIZE TABLE users FINAL forces a merge of all parts in each partition right now, physically removing duplicates on disk. It is useful for one-off cleanup or testing, but it is not a strategy for production reads:

It rewrites entire partitions, which is expensive and I/O-heavy.
It can time out on large tables; see OPTIMIZE FINAL timeout.
New inserts immediately reintroduce duplicates, so you'd be running it forever.

Treat background merges plus query-time FINAL (or argMax) as the normal path, and reserve OPTIMIZE ... FINAL for maintenance.

Best Practices

Never assume insert-time deduplication. Always read with FINAL or argMax if you need correct results immediately.
Design partitions so a key lives in exactly one partition. Otherwise duplicates can never collapse, and do_not_merge_across_partitions_select_final becomes unsafe.
Always define a version column (UInt*, Date, DateTime, or DateTime64) so the "latest" row is deterministic. Without it, the most recently inserted row wins, which is ambiguous under concurrent inserts.
Filter on the primary key under FINAL to keep deduplication scope small.
Benchmark FINAL vs. argMax for your queries — the winner depends on filters and column width.
Use a materialized view when the same deduplicated result is read frequently.

Common Issues

"My counts are too high." A plain count() counts duplicate parts. Use SELECT count() FROM t FINAL.
Duplicates never disappear even after OPTIMIZE. The duplicate keys are split across partitions — fix the PARTITION BY clause.
FINAL is slow. The query likely doesn't filter on the primary key, or runs a wide SELECT *. Add a PK filter, try argMax, or precompute with a materialized view.
Deleted rows still appear. The delete (is_deleted) column marks rows logically; they are only physically removed under cleanup. By default ClickHouse keeps delete markers to preserve version ordering, so query with FINAL to exclude them.
Cross-shard duplicates. Deduplication is local to each shard. A Distributed table over multiple shards can still return duplicates for the same key if it exists on more than one shard.

How Pulse Helps

Diagnosing duplicate-visibility problems usually means correlating part counts, merge activity, and query patterns — which is tedious to do by hand across a busy cluster. Pulse continuously monitors ClickHouse merge behavior and part accumulation, surfacing ReplacingMergeTree tables where duplicates are piling up because merges are falling behind or because of partitioning mistakes. It flags slow FINAL queries and excessive unmerged parts before they become correctness or performance incidents, with concrete, ClickHouse-specific recommendations rather than generic alerts.

Frequently Asked Questions

Q: Why does ReplacingMergeTree still show duplicate rows after I insert?

Because deduplication happens only during background merges, which run asynchronously at an unpredictable time. Until a merge collapses the parts, every inserted version is returned. Use FINAL or a GROUP BY ... argMax query to get the deduplicated result immediately.

Q: How do I always get deduplicated results?

Add the FINAL keyword to your SELECT, or write an explicit GROUP BY key with argMax(col, version) for each column. Both apply the replacement logic at query time regardless of merge state.

Q: Will OPTIMIZE TABLE ... FINAL permanently fix duplicates?

It removes existing duplicates on disk, but new inserts immediately create new parts that may contain duplicates again. It also rewrites whole partitions and can be very expensive, so it is for one-off maintenance, not routine querying.

Q: Why do some duplicates never get removed no matter what?

ClickHouse only merges parts within the same partition. If two rows with the same ORDER BY key are in different partitions, they will never be merged together. Choose a partition key so that any given sorting key lands in only one partition.

Q: Is FINAL or argMax faster?

It depends. FINAL is convenient and fast when you filter on the primary key, but slow for wide unfiltered scans (and PREWHERE optimization is off by default under FINAL, though it can be enabled via optimize_move_to_prewhere_if_final). An explicit GROUP BY ... argMax often wins on large, unfiltered queries. Benchmark both for your workload.

Q: Does the version column make deduplication happen sooner?

No. The version column only determines which row survives once a merge or FINAL runs. It has no effect on when deduplication occurs.