The MergeTree family is the answer to almost every ClickHouse table design question. MergeTree is the foundational engine; the variants (ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, CollapsingMergeTree, VersionedCollapsingMergeTree, GraphiteMergeTree) add specific behaviors at merge time, and the Replicated prefix layers Keeper-based replication on top of any of them. This guide covers the shared MergeTree fundamentals (parts, merges, ORDER BY, PARTITION BY, indexes), then walks through each variant with examples and the workloads it fits.
What MergeTree Actually Does
A MergeTree table stores data in immutable parts on disk. Each INSERT creates a new part containing the rows from that batch, sorted by ORDER BY and indexed by a sparse primary key. In the background, ClickHouse merges parts together, combining their data and applying engine-specific logic (deduplication, summation, collapsing, etc.) as part of the merge.
This design has three important consequences:
- Inserts are very fast because they just write a new part. There is no global index to update.
- Reads are fast because data is pre-sorted and the sparse index lets the engine skip whole granules (8192 rows by default).
- Updates and deletes are expensive because they require rewriting whole parts. MergeTree variants that bake update semantics into the merge avoid this.
Basic MergeTree
CREATE TABLE events
(
event_time DateTime,
user_id UInt64,
properties String
)
ENGINE = MergeTree
ORDER BY (user_id, event_time)
PARTITION BY toYYYYMM(event_time);
Use plain MergeTree for raw event data, log data, append-only fact tables, and any workload where you do not need deduplication or pre-aggregation at merge time. Pair with TTL for retention, and pick ORDER BY and PARTITION BY carefully (see the CREATE TABLE guide).
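You can watch the part structure directly in the `system.parts` table; a quick sketch against the `events` table above:

```sql
-- Inspect the active parts backing the events table:
-- one row per part, grouped by the monthly partition.
SELECT
    partition,
    name,
    rows,
    formatReadableSize(bytes_on_disk) AS size
FROM system.parts
WHERE table = 'events' AND active
ORDER BY partition, name;
```

Watching this output while inserting is a good way to build intuition: each INSERT adds a row here, and background merges periodically replace several rows with one larger part.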
Shared Clauses Across the Family
Every MergeTree-family engine supports the same set of clauses:
ENGINE = <engine_family>(<engine-specific args>)
ORDER BY expr -- mandatory; sparse primary index by default
PARTITION BY expr -- optional, but usually wanted for time-series
PRIMARY KEY expr -- optional; defaults to ORDER BY; must be a prefix of it
SAMPLE BY expr -- optional; enables SAMPLE-based queries
TTL expr -- optional; per-table or per-column retention
SETTINGS ... -- engine settings (index_granularity, storage_policy, etc.)
Pick ORDER BY from the columns you actually filter on. The default index_granularity of 8192 is correct for nearly every workload. PARTITION BY toYYYYMM(<date_column>) is the standard choice for time-series. Avoid partitioning on high-cardinality keys; the official docs are explicit about this.
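To illustrate the PRIMARY KEY / ORDER BY split, here is a variant of the earlier table (the split itself is optional; omitting PRIMARY KEY is the common case):

```sql
-- Short sparse index (user_id only), longer on-disk sort order.
-- PRIMARY KEY must be a prefix of ORDER BY.
CREATE TABLE events_pk
(
    event_time DateTime,
    user_id    UInt64,
    properties String
)
ENGINE = MergeTree
ORDER BY (user_id, event_time)
PRIMARY KEY user_id
PARTITION BY toYYYYMM(event_time);
```

The index stays small and cheap to scan, while the extra sort column still improves compression and range filters on event_time.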
ReplicatedMergeTree
CREATE TABLE events ON CLUSTER my_cluster
(
event_time DateTime,
user_id UInt64,
properties String
)
ENGINE = ReplicatedMergeTree(
'/clickhouse/tables/{shard}/events',
'{replica}'
)
ORDER BY (user_id, event_time)
PARTITION BY toYYYYMM(event_time);
ReplicatedMergeTree adds Keeper-based replication. Every other family member has a Replicated cousin: ReplicatedReplacingMergeTree, ReplicatedSummingMergeTree, and so on. The two parameters are the Keeper path (where coordination state lives) and the replica name; both typically use {shard} and {replica} macros so the same DDL works on every node.
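The {shard} and {replica} macros are resolved from each server's configuration; a minimal sketch of the relevant config section (values are illustrative):

```xml
<!-- In each server's config (e.g. under /etc/clickhouse-server/config.d/) -->
<clickhouse>
    <macros>
        <shard>01</shard>
        <replica>replica-01</replica>
    </macros>
</clickhouse>
```

Because each node supplies its own values, the identical CREATE TABLE ... ON CLUSTER statement yields a unique Keeper path per shard and a unique replica name per node.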
On ClickHouse Cloud, the platform handles replication transparently and you usually just write ENGINE = MergeTree; the cloud substitutes the replicated form. The full set of replicated variants is documented in the docs hub.
ReplacingMergeTree: Deduplication at Merge
CREATE TABLE users
(
user_id UInt64,
name String,
updated_at DateTime,
is_deleted UInt8 DEFAULT 0
)
ENGINE = ReplacingMergeTree(updated_at, is_deleted)
ORDER BY user_id;
When two rows share the same ORDER BY key, ReplacingMergeTree keeps only the one with the highest version (the first engine argument). The optional second argument is a UInt8 deletion flag; rows where it is 1 are removed at merge time. The catch: deduplication happens only when ClickHouse merges parts together, which is asynchronous. Until the merge happens, both rows are visible. Use FINAL on queries to force deduplication at query time, or design your application to read the latest version explicitly.
FINAL is expensive on large tables. For frequent point lookups, an alternative pattern is to maintain a small "latest snapshot" table updated by a materialized view.
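Both read patterns can be written against the users table above:

```sql
-- Query-time dedup with FINAL: exact, but costly on large tables.
SELECT * FROM users FINAL WHERE user_id = 42;

-- Cheaper analytics pattern: resolve the latest version per key
-- with argMax instead of relying on merges.
SELECT
    user_id,
    argMax(name, updated_at) AS name
FROM users
GROUP BY user_id
HAVING argMax(is_deleted, updated_at) = 0;
```

The HAVING clause drops keys whose most recent version carries the deletion flag, mirroring what the merge would eventually do.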
SummingMergeTree: Pre-Aggregating Numeric Columns
CREATE TABLE sales_daily
(
date Date,
sku UInt64,
units UInt32,
revenue Float64
)
ENGINE = SummingMergeTree((units, revenue))
ORDER BY (date, sku);
SummingMergeTree sums the named numeric columns when rows share the same ORDER BY key. Merging the rows (2026-05-13, 42, 5, 99.95) and (2026-05-13, 42, 3, 59.97) yields a single row (2026-05-13, 42, 8, 159.92). Useful for cumulative counters and OLAP cubes where you only ever need the sum per dimension.
Same caveat as Replacing: merges are asynchronous. Query with FINAL if you need exact sums right now, or use sum(units), sum(revenue) GROUP BY date, sku and let the GROUP BY collapse any unmerged rows.
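The safe read pattern for the sales_daily table above looks like this:

```sql
-- Exact sums regardless of merge state: GROUP BY collapses any
-- rows the background merge has not combined yet.
SELECT
    date,
    sku,
    sum(units)   AS units,
    sum(revenue) AS revenue
FROM sales_daily
GROUP BY date, sku;
```

This is cheap because each group usually contains only a handful of unmerged rows, so the aggregation does little extra work compared to reading fully merged data.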
AggregatingMergeTree: Pre-Aggregating Anything
CREATE TABLE user_metrics
(
user_id UInt64,
sessions AggregateFunction(uniq, UInt64),
last_seen AggregateFunction(max, DateTime),
total_rev AggregateFunction(sum, Float64)
)
ENGINE = AggregatingMergeTree
ORDER BY user_id;
AggregatingMergeTree stores aggregate states (the partial state of uniq, max, sum, etc.) and merges them together at merge time. To read, use the matching Merge combinator: uniqMerge(sessions), maxMerge(last_seen), sumMerge(total_rev).
The pattern of choice is to feed this engine from a materialized view over a raw MergeTree table. The MV emits uniqState(...), maxState(...), sumState(...) per insert; the AggregatingMergeTree merges them. Query results are exact (no missing data), and the rolling aggregate state grows orders of magnitude slower than the raw events.
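A sketch of that pipeline, assuming a hypothetical raw table user_events (the name and columns are illustrative, chosen to match the user_metrics target above):

```sql
-- Hypothetical raw table feeding the aggregate.
CREATE TABLE user_events
(
    user_id    UInt64,
    session_id UInt64,
    event_time DateTime,
    revenue    Float64
)
ENGINE = MergeTree
ORDER BY (user_id, event_time);

-- The MV converts each inserted batch into aggregate states.
CREATE MATERIALIZED VIEW user_metrics_mv TO user_metrics AS
SELECT
    user_id,
    uniqState(session_id) AS sessions,
    maxState(event_time)  AS last_seen,
    sumState(revenue)     AS total_rev
FROM user_events
GROUP BY user_id;

-- Reading back: the -Merge combinators finalize the states.
SELECT
    user_id,
    uniqMerge(sessions) AS sessions,
    maxMerge(last_seen) AS last_seen,
    sumMerge(total_rev) AS total_rev
FROM user_metrics
GROUP BY user_id;
```

Note that reads always GROUP BY and use the -Merge combinators; selecting the state columns directly returns opaque binary states, not values.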
CollapsingMergeTree: State Changes Without Rewrites
CREATE TABLE order_state
(
order_id UInt64,
status String,
updated_at DateTime,
sign Int8
)
ENGINE = CollapsingMergeTree(sign)
ORDER BY order_id;
CollapsingMergeTree tracks state changes by appending pairs of rows: one with sign = -1 to "cancel" the old state and one with sign = +1 for the new. At merge time, matching pairs cancel out, leaving only the current state.
The pattern is well suited to mutable lookup tables where you control the writer and can emit the cancel-and-replace pair atomically. The downside is the writer complexity; if you cannot emit both rows, the table accumulates uncancelled state.
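A sketch of the write and read sides for the order_state table above (timestamps are illustrative):

```sql
-- State change for order 1001: the sign = -1 row must be an exact
-- copy of the previously inserted +1 row; the +1 row carries the
-- new state. Emit both in one batch.
INSERT INTO order_state VALUES
    (1001, 'pending', '2026-05-13 10:00:00', -1),
    (1001, 'shipped', '2026-05-13 10:05:00',  1);

-- Read pattern that tolerates unmerged pairs: sum(sign) > 0 keeps
-- only keys whose latest state has not been cancelled.
SELECT
    order_id,
    argMax(status, updated_at) AS status
FROM order_state
GROUP BY order_id
HAVING sum(sign) > 0;
```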
For a less brittle alternative, consider ReplacingMergeTree (simpler) or VersionedCollapsingMergeTree (which handles out-of-order cancels via a version column).
VersionedCollapsingMergeTree
CREATE TABLE order_state
(
order_id UInt64,
status String,
version UInt32,
sign Int8
)
ENGINE = VersionedCollapsingMergeTree(sign, version)
ORDER BY order_id;
Like CollapsingMergeTree, but cancels are matched using a version number instead of strict insert order. This is the right tool when your producer pipeline cannot guarantee ordered delivery (Kafka with multiple partitions per key, replicated logs).
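With the version column in play, delivery order stops mattering; for the order_state table above, these two inserts can arrive in either order and still collapse correctly:

```sql
-- The cancel for version 1 may land after the new state
-- for version 2; matching is by (ORDER BY key, version), not
-- by insert order.
INSERT INTO order_state VALUES (1001, 'shipped', 2,  1);
INSERT INTO order_state VALUES (1001, 'pending', 1, -1);
```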
GraphiteMergeTree
CREATE TABLE metrics
(
path String,
value Float64,
`time` DateTime,
`date` Date,
version UInt32
)
ENGINE = GraphiteMergeTree('graphite_rollup_config')
ORDER BY (path, `time`);
GraphiteMergeTree is tailored to Graphite-style metrics: it rolls up older data points into coarser aggregates according to a retention policy defined in the configuration. Useful if you are running a Graphite-compatible monitoring stack on ClickHouse. Otherwise, plain MergeTree with TTL is simpler.
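The engine argument names a rollup section in the server configuration; a minimal sketch of its shape (retention ages and precisions here are illustrative, not recommendations):

```xml
<!-- Server config section referenced by
     ENGINE = GraphiteMergeTree('graphite_rollup_config') -->
<graphite_rollup_config>
    <path_column_name>path</path_column_name>
    <time_column_name>time</time_column_name>
    <value_column_name>value</value_column_name>
    <version_column_name>version</version_column_name>
    <default>
        <function>avg</function>
        <retention>
            <age>0</age>            <!-- fresh data: 60 s resolution -->
            <precision>60</precision>
        </retention>
        <retention>
            <age>86400</age>        <!-- older than 1 day: 1 h resolution -->
            <precision>3600</precision>
        </retention>
    </default>
</graphite_rollup_config>
```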
Picking an Engine
A pragmatic decision tree:
- Raw append-only data (events, logs, requests): MergeTree (or ReplicatedMergeTree). Add TTL for retention.
- Slowly-changing dimensions, mutable lookups: ReplacingMergeTree with an updated_at column. Use FINAL selectively.
- Numeric pre-aggregates per dimension (daily sales, hourly counters): SummingMergeTree. Always read via sum() GROUP BY to be safe.
- Complex pre-aggregates (uniq, percentiles, custom): AggregatingMergeTree, fed by a materialized view.
- Mutable state with strict ordering: CollapsingMergeTree.
- Mutable state without strict ordering: VersionedCollapsingMergeTree.
- Graphite-compatible metrics with rollups: GraphiteMergeTree.
Always pick the simpler engine if you have a choice. Most production tables are plain MergeTree.
How Pulse Helps With MergeTree Health
The MergeTree family is forgiving in development and unforgiving at scale. The same table that ran fine on 10 GB starts producing too-many-parts errors at 1 TB if the partition key is wrong, the merge backlog snowballs if inserts are too small, and Replacing/Summing/Aggregating variants quietly accumulate unmerged data if the merge throughput cannot keep up. Pulse continuously tracks every MergeTree table on a cluster and surfaces partition explosions, merge backlog, Replicated tables stuck on a replica, AggregatingMergeTree views falling behind their source, and inadequate ORDER BY choices that produce expensive query plans. Connect your ClickHouse cluster to Pulse and catch MergeTree drift before query latency does.
Frequently Asked Questions
Q: What is the difference between ORDER BY and PRIMARY KEY?
ORDER BY defines how data is sorted on disk. PRIMARY KEY defines the sparse index. If you omit PRIMARY KEY, ClickHouse uses ORDER BY. If you specify both, the primary key must be a prefix of the sort key. The most common reason to specify them separately is to have a short index (fast to scan) and a longer sort order (better compression and faster filters).
Q: Should I use FINAL on every Replacing/Summing query?
No. FINAL is a "force merge at query time" instruction and is expensive on large tables. Use it for occasional point lookups; for analytics, rely on GROUP BY + sum() (for SummingMergeTree) or argMax(value, version) (for ReplacingMergeTree) instead. The official docs cover the patterns under each engine.
Q: How often does ClickHouse merge parts?
Continuously, in the background, whenever there are eligible parts. Scheduling depends on part counts and sizes and on the background merge pool settings; merge_with_ttl_timeout additionally governs how often TTL-driven merges run. If you see merge backlog growing, the most common causes are too-small inserts and CPU starvation in the merge thread pool.
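A quick way to spot a backlog is counting active parts per table:

```sql
-- Tables with the most active parts; a steadily growing count
-- means merges are not keeping up with inserts.
SELECT
    database,
    table,
    count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 10;
```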
Q: Can I switch a table from MergeTree to a variant later?
Not in place. You create a new table with the desired engine and INSERT SELECT from the old one. Plan for the rebuild as part of the design phase if you anticipate needing the variant later.
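A sketch of the rebuild, reusing the names from the ReplacingMergeTree example earlier (pause writers, or replay writes made during the copy, before swapping):

```sql
-- New table with the desired engine.
CREATE TABLE users_new
(
    user_id    UInt64,
    name       String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

-- Copy the data, then swap names in one statement.
INSERT INTO users_new SELECT user_id, name, updated_at FROM users;
RENAME TABLE users TO users_old, users_new TO users;
```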
Q: Are Replicated* engines required for high availability?
Yes, on self-managed clusters. They synchronize parts across replicas through ClickHouse Keeper. ClickHouse Cloud handles replication automatically; you write MergeTree and the platform substitutes the replicated equivalent.
Q: What's the relationship between MergeTree and the Distributed engine?
Distributed is not a storage engine; it is a query router. A Distributed table sits in front of a set of MergeTree (or Replicated*MergeTree) tables on shards and fans queries out. Storage always happens in a MergeTree-family table.
Q: How is data physically organized inside a part?
Each part is a directory of column files, each compressed with the configured codecs, plus a sparse index file. ClickHouse reads only the column files needed by a query, which is what makes columnar analytics fast. The unit of skip is a granule (index_granularity rows, 8192 by default).