ClickHouse MergeTree: The Engine Family Explained

Q: What is the difference between `ORDER BY` and `PRIMARY KEY`?

ORDER BY defines how data is sorted on disk. PRIMARY KEY defines the sparse index. If you omit PRIMARY KEY , ClickHouse uses ORDER BY . If you specify both, the primary key must be a prefix of the sort key. The most common reason to specify them separately is to have a short index (fast to scan) and a longer sort order (better compression and faster filters).

Q: Should I use `FINAL` on every Replacing/Summing query?

No. FINAL is a "force merge at query time" instruction and is expensive on large tables. Use it for occasional point lookups; for analytics, rely on GROUP BY + sum() (for SummingMergeTree) or argMax(value, version) (for ReplacingMergeTree) instead. The official docs cover the patterns under each engine.

Q: Should I use ReplacingMergeTree or CollapsingMergeTree for upserts?

For most upsert and CDC workloads, ReplacingMergeTree is the better default: you only insert the new row, you never need to know the prior state, and out-of-order inserts are resolved by the version column. Reach for CollapsingMergeTree only when you need exact sum() / count() aggregates without FINAL (via the sign trick) or native deletes, and your writer can emit the cancel-and-replace pair in order. See the side-by-side comparison above.

Q: What's the relationship between MergeTree and the Distributed engine?

Distributed is not a storage engine; it is a query router. A Distributed table sits in front of a set of MergeTree (or Replicated*MergeTree) tables on shards and fans queries out. Storage always happens in a MergeTree-family table.

The MergeTree family is the answer to almost every ClickHouse table design question. MergeTree is the foundational engine; the variants (ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, CollapsingMergeTree, VersionedCollapsingMergeTree, GraphiteMergeTree) add specific behaviors at merge time, and the Replicated prefix layers Keeper-based replication on top of any of them. This guide covers the shared MergeTree fundamentals (parts, merges, ORDER BY, PARTITION BY, indexes), then walks through each variant with examples and the workloads it fits.

What MergeTree Actually Does

A MergeTree table stores data in immutable parts on disk. Each INSERT creates a new part containing the rows from that batch, sorted by ORDER BY and indexed by a sparse primary key. In the background, ClickHouse merges parts together, combining their data and applying engine-specific logic (deduplication, summation, collapsing, etc.) as part of the merge.

This design has three important consequences:

Inserts are very fast because they just write a new part. There is no global index to update.
Reads are fast because data is pre-sorted and the sparse index lets the engine skip whole granules (8192 rows by default).
Updates and deletes are expensive because they require rewriting whole parts. MergeTree variants that bake update semantics into the merge avoid this.

Basic MergeTree

CREATE TABLE events
(
    event_time DateTime,
    user_id    UInt64,
    properties String
)
ENGINE = MergeTree
ORDER BY (user_id, event_time)
PARTITION BY toYYYYMM(event_time);

Use plain MergeTree for raw event data, log data, append-only fact tables, and any workload where you do not need deduplication or pre-aggregation at merge time. Pair with TTL for retention, and pick ORDER BY and PARTITION BY carefully (see the CREATE TABLE guide).

Shared Clauses Across the Family

Every MergeTree-family engine supports the same set of clauses:

ENGINE = <engine_family>(<engine-specific args>)
ORDER BY expr        -- mandatory; sparse primary index by default
PARTITION BY expr    -- optional, but usually wanted for time-series
PRIMARY KEY expr     -- optional; defaults to ORDER BY; must be a prefix of it
SAMPLE BY expr       -- optional; enables SAMPLE-based queries
TTL expr             -- optional; per-table or per-column retention
SETTINGS ...         -- engine settings (index_granularity, storage_policy, etc.)

Pick ORDER BY from the columns you actually filter on. The default index_granularity of 8192 is correct for nearly every workload. PARTITION BY toYYYYMM(<date_column>) is the standard choice for time-series. Avoid partitioning on high-cardinality keys; the official docs are explicit about this.

How to Pick ORDER BY / PRIMARY KEY / PARTITION BY

These three clauses do most of the work in deciding how a MergeTree table performs. They are also the hardest to change later, so it pays to get them right at CREATE TABLE time.

ORDER BY: filter columns, low cardinality first. Put the columns you filter on in the WHERE clause into ORDER BY, ordered from lowest cardinality (and most important for filtering) on the left to highest cardinality on the right. A good ORDER BY is usually 3–5 columns. The classic shape is something like (tenant_id, event_type, event_time): the left-most column is a low-cardinality value you almost always filter on, and time goes last. Leading with the low-cardinality column lets the sparse index skip the most granules for the widest range of queries, and it compresses better because adjacent rows share more values.

ENGINE = MergeTree
ORDER BY (tenant_id, event_type, event_time)
PARTITION BY toYYYYMM(event_time);

The one important exception is compression locality. For data with many repeated attributes per entity (clickstream, sessionized events), ordering by the entity key (session_id) first can compress dramatically better and cut disk reads, even though session_id is high cardinality. Putting a tiny low-cardinality column (like event_type) first in that case makes both compression and query time worse. Measure compressed size with system.parts if you are unsure.

PRIMARY KEY: a prefix of ORDER BY, only when you need it. By default PRIMARY KEY equals ORDER BY, and for most tables that is correct — leave it out. Specify a separate PRIMARY KEY only when ORDER BY has many columns (high-dimensionality tables, or collapsing/replacing tables whose sort key includes a row identifier used only for deduplication). A shorter primary key keeps the sparse index small and fast to scan while the longer sort order still gives you compression and fine-grained filtering. The primary key must be a prefix of the sort key.

ENGINE = ReplacingMergeTree(updated_at)
ORDER BY (tenant_id, user_id, device_id)  -- full sort/dedup key
PRIMARY KEY (tenant_id, user_id);          -- lean index, prefix of ORDER BY

PARTITION BY: coarse pruning and data lifecycle, not query speed. Partitioning is for dropping old data cheaply (DROP PARTITION, TTL) and for coarse pruning, not as a substitute for a good ORDER BY. The sparse index prunes at the granule level (fine-grained); partitions prune at the partition level (coarse). Aim for a table with dozens to low hundreds of partitions, never thousands, and queries that touch only a few dozen partitions. toYYYYMM(date_column) (monthly) is the safe default; use daily only for very high volume. Tables under a few GB often need no partitioning at all. Never partition on a high-cardinality column like user_id — that produces millions of tiny partitions, breaks background merges, slows inserts, and triggers too-many-parts errors.

For the mechanics of the sparse primary index and how granule skipping works, see ClickHouse indexes. For partitioning, TTL, and full CREATE TABLE syntax, see the CREATE TABLE guide. If a single ORDER BY cannot serve two very different query shapes, projections let you maintain an alternate sort order of the same data.

ReplicatedMergeTree

CREATE TABLE events ON CLUSTER my_cluster
(
    event_time DateTime,
    user_id    UInt64,
    properties String
)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/events',
    '{replica}'
)
ORDER BY (user_id, event_time)
PARTITION BY toYYYYMM(event_time);

ReplicatedMergeTree adds Keeper-based replication. Every other family member has a Replicated cousin: ReplicatedReplacingMergeTree, ReplicatedSummingMergeTree, and so on. The two parameters are the Keeper path (where coordination state lives) and the replica name; both typically use {shard} and {replica} macros so the same DDL works on every node.

On ClickHouse Cloud, the platform handles replication transparently and you usually just write ENGINE = MergeTree; the cloud substitutes the replicated form. The full set of replicated variants is documented in the docs hub.

ReplacingMergeTree: Deduplication at Merge

CREATE TABLE users
(
    user_id      UInt64,
    name         String,
    updated_at   DateTime,
    is_deleted   UInt8 DEFAULT 0
)
ENGINE = ReplacingMergeTree(updated_at, is_deleted)
ORDER BY user_id;

When two rows share the same ORDER BY key, ReplacingMergeTree keeps only the one with the highest version (the first engine argument). The optional second argument is a UInt8 deletion flag; rows where it is 1 mark the row as deleted, but by default the marker is retained — the row is not physically removed. Physical removal requires enabling allow_experimental_replacing_merge_with_cleanup on the table and then either running OPTIMIZE TABLE ... FINAL CLEANUP manually or configuring age-based automatic cleanup. The catch: deduplication happens only when ClickHouse merges parts together, which is asynchronous. Until the merge happens, both rows are visible. Use FINAL on queries to force deduplication at query time, or design your application to read the latest version explicitly.

FINAL is expensive on large tables. For frequent point lookups, an alternative pattern is to maintain a small "latest snapshot" table updated by a materialized view.

SummingMergeTree: Pre-Aggregating Numeric Columns

CREATE TABLE sales_daily
(
    date    Date,
    sku     UInt64,
    units   UInt32,
    revenue Float64
)
ENGINE = SummingMergeTree((units, revenue))
ORDER BY (date, sku);

SummingMergeTree sums the named numeric columns when rows share the same ORDER BY key. The result of merging two (2026-05-13, 42, 5, 99.95) and (2026-05-13, 42, 3, 59.97) rows is (2026-05-13, 42, 8, 159.92). Useful for cumulative counters and OLAP cubes where you only ever need the sum per dimension.

Same caveat as Replacing: merges are asynchronous. Query with FINAL if you need exact sums right now, or use sum(units), sum(revenue) GROUP BY date, sku and let the GROUP BY collapse any unmerged rows.

AggregatingMergeTree: Pre-Aggregating Anything

CREATE TABLE user_metrics
(
    user_id    UInt64,
    sessions   AggregateFunction(uniq, UInt64),
    last_seen  AggregateFunction(max, DateTime),
    total_rev  AggregateFunction(sum, Float64)
)
ENGINE = AggregatingMergeTree
ORDER BY user_id;

AggregatingMergeTree stores aggregate states (the partial state of uniq, max, sum, etc.) and merges them together at merge time. To read, use the matching Merge combinator: uniqMerge(sessions), maxMerge(last_seen), sumMerge(total_rev).

The pattern of choice is to feed this engine from a materialized view over a raw MergeTree table. The MV emits uniqState(...), maxState(...), sumState(...) per insert; the AggregatingMergeTree merges them. Query results are exact (no missing data), and the rolling aggregate state grows orders of magnitude slower than the raw events.

CollapsingMergeTree: State Changes Without Rewrites

CREATE TABLE order_state
(
    order_id  UInt64,
    status    String,
    updated_at DateTime,
    sign      Int8
)
ENGINE = CollapsingMergeTree(sign)
ORDER BY order_id;

CollapsingMergeTree tracks state changes by appending pairs of rows: one with sign = -1 to "cancel" the old state and one with sign = +1 for the new. At merge time, matching pairs cancel out, leaving only the current state.

The pattern is well suited to mutable lookup tables where you control the writer and can emit the cancel-and-replace pair atomically. The downside is the writer complexity; if you cannot emit both rows, the table accumulates uncancelled state.

For a less brittle alternative, consider ReplacingMergeTree (simpler) or VersionedCollapsingMergeTree (which handles out-of-order cancels via a version column).

VersionedCollapsingMergeTree

CREATE TABLE order_state
(
    order_id    UInt64,
    status      String,
    version     UInt32,
    sign        Int8
)
ENGINE = VersionedCollapsingMergeTree(sign, version)
ORDER BY order_id;

Like CollapsingMergeTree, but cancels are matched using a version number instead of strict insert order. This is the right tool when your producer pipeline cannot guarantee ordered delivery (Kafka with multiple partitions per key, replicated logs).

CollapsingMergeTree vs ReplacingMergeTree

Both engines deduplicate rows that share an ORDER BY key, but they solve different problems and trade off differently. ReplacingMergeTree is an upsert engine: write the new version of a row and the old one is discarded at merge time. CollapsingMergeTree is a change-accounting engine: write a sign = -1 row to cancel the old state and a sign = +1 row for the new, and matching pairs collapse at merge.

	ReplacingMergeTree	CollapsingMergeTree
Model	Keep latest version per key	Cancel old row, append new row (sign ±1)
Write complexity	Low — insert the new row only	Higher — must emit a cancel row carrying the old values
Need previous row state?	No	Yes — the `-1` row must repeat the columns being cancelled
Deletes	Only via optional `is_deleted` flag column	Native — write a `-1` row with no matching `+1`
Correct query without `FINAL`	No — duplicates visible until merge; use `argMax(col, version)` or `FINAL`	Yes for aggregates — `sum(metric * sign)` and `count()` via `sum(sign)` are correct regardless of merge state
Aggregates in materialized views	Limited (essentially `uniq`-style)	Counts and sums work directly via the sign trick
Out-of-order ingestion	Tolerant (version column decides the winner)	Fragile — a cancel arriving before its state collapses incorrectly; use `VersionedCollapsingMergeTree`

When to pick ReplacingMergeTree: straightforward upserts and slowly-changing dimensions where you only ever need the current row, you do not want to track the prior state, and you can tolerate FINAL/argMax on reads. This is the right default for change-data-capture (CDC) pipelines — lowest write complexity for eventually-correct results.

When to pick CollapsingMergeTree: you need exact aggregates (sums, counts) without paying for FINAL, you need real deletes, and your writer already knows the prior row state and can emit the cancel/replace pair in order. The price is writer complexity and brittleness under out-of-order delivery.

A practical migration note: you cannot convert an engine in place. To move from one to the other, CREATE a new table with the target engine and INSERT SELECT (deriving the sign/version columns as needed), then swap with EXCHANGE TABLES. Because both engines deduplicate only at merge time, a rebuild also benefits from OPTIMIZE TABLE ... FINAL once data is loaded, though that is expensive on large tables — see the impact of forced merges and mutations in UPDATE/DELETE performance.

For deeper standalone references, see ReplacingMergeTree, CollapsingMergeTree, and VersionedCollapsingMergeTree.

GraphiteMergeTree

CREATE TABLE metrics
(
    path      String,
    value     Float64,
    `time`    DateTime,
    `date`    Date,
    version   UInt32
)
ENGINE = GraphiteMergeTree('graphite_rollup_config')
ORDER BY (path, `time`);

GraphiteMergeTree is tailored to Graphite-style metrics: it rolls up older data points into coarser aggregates according to a retention policy defined in the configuration. Useful if you are running a Graphite-compatible monitoring stack on ClickHouse. Otherwise, plain MergeTree with TTL is simpler.

Picking an Engine

A pragmatic decision tree:

Raw append-only data (events, logs, requests): MergeTree (or ReplicatedMergeTree). Add TTL for retention.
Slowly-changing dimensions, mutable lookups: ReplacingMergeTree with an updated_at column. Use FINAL selectively.
Numeric pre-aggregates per dimension (daily sales, hourly counters): SummingMergeTree. Always read via sum() GROUP BY to be safe.
Complex pre-aggregates (uniq, percentiles, custom): AggregatingMergeTree, fed by a materialized view.
Mutable state with strict ordering: CollapsingMergeTree.
Mutable state without strict ordering: VersionedCollapsingMergeTree.
Graphite-compatible metrics with rollups: GraphiteMergeTree.

Always pick the simpler engine if you have a choice. Most production tables are plain MergeTree.

How Pulse Helps With MergeTree Health

The MergeTree family is forgiving in development and unforgiving at scale. The same table that ran fine on 10 GB starts producing too-many-parts errors at 1 TB if the partition key is wrong, the merge backlog snowballs if inserts are too small, and Replacing/Summing/Aggregating variants quietly accumulate unmerged data if the merge throughput cannot keep up. Pulse continuously tracks every MergeTree table on a cluster and surfaces partition explosions, merge backlog, Replicated tables stuck on a replica, AggregatingMergeTree views falling behind their source, and inadequate ORDER BY choices that produce expensive query plans. Connect your ClickHouse cluster to Pulse and catch MergeTree drift before query latency does.

Frequently Asked Questions

Q: What is the difference between ORDER BY and PRIMARY KEY?

ORDER BY defines how data is sorted on disk. PRIMARY KEY defines the sparse index. If you omit PRIMARY KEY, ClickHouse uses ORDER BY. If you specify both, the primary key must be a prefix of the sort key. The most common reason to specify them separately is to have a short index (fast to scan) and a longer sort order (better compression and faster filters).

Q: Should I use FINAL on every Replacing/Summing query?

No. FINAL is a "force merge at query time" instruction and is expensive on large tables. Use it for occasional point lookups; for analytics, rely on GROUP BY + sum() (for SummingMergeTree) or argMax(value, version) (for ReplacingMergeTree) instead. The official docs cover the patterns under each engine.

Q: How often does ClickHouse merge parts?

Continuously, in the background, when there are eligible parts. The merge frequency depends on part count and size. If you see merge backlog growing, the most common causes are too-small inserts and CPU starvation on the merge thread pool.

Q: Can I switch a table from MergeTree to a variant later?

Not in place. You create a new table with the desired engine and INSERT SELECT from the old one. Plan for the rebuild as part of the design phase if you anticipate needing the variant later.

Q: Are Replicated* engines required for high availability?

Yes, on self-managed clusters. They synchronize parts across replicas through ClickHouse Keeper. ClickHouse Cloud handles replication automatically; you write MergeTree and the platform substitutes the replicated equivalent.

Q: Should I use ReplacingMergeTree or CollapsingMergeTree for upserts?

For most upsert and CDC workloads, ReplacingMergeTree is the better default: you only insert the new row, you never need to know the prior state, and out-of-order inserts are resolved by the version column. Reach for CollapsingMergeTree only when you need exact sum()/count() aggregates without FINAL (via the sign trick) or native deletes, and your writer can emit the cancel-and-replace pair in order. See the side-by-side comparison above.

Q: What's the relationship between MergeTree and the Distributed engine?

Distributed is not a storage engine; it is a query router. A Distributed table sits in front of a set of MergeTree (or Replicated*MergeTree) tables on shards and fans queries out. Storage always happens in a MergeTree-family table.

Q: How is data physically organized inside a part?

Each part is a directory of column files, each compressed with the configured codecs, plus a sparse index file. ClickHouse reads only the column files needed by a query, which is what makes columnar analytics fast. The unit of skip is a granule (index_granularity rows, 8192 by default).