ClickHouse MergeTree Engine: How It Stores and Merges Data

What is MergeTree?

MergeTree is the core table engine in ClickHouse and the base for the entire MergeTree family (ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, CollapsingMergeTree, and the Replicated* variants). It writes data in immutable, columnar parts sorted by the ORDER BY expression, builds a sparse primary-key index, and continuously merges small parts into larger ones in the background. The engine is optimized for high ingest rates (millions of rows per second per node), large analytical scans, and time-series workloads where data is appended and rarely updated.

How MergeTree Works

Each INSERT produces one or more new "parts" - directories on disk containing column files, an index_granularity-aligned primary key index file (default index_granularity = 8192 rows per granule), and checksums.txt. Parts are immutable once written. A background scheduler merges small parts into larger ones according to size and age policies, which is what gives the engine its name. Reads use the primary index to skip granules that cannot contain matching rows, then decompress and scan only the column files referenced by the query.

CREATE TABLE events
(
    event_time DateTime,
    event_type LowCardinality(String),
    user_id    UInt64,
    payload    String,
    INDEX idx_user user_id TYPE minmax GRANULARITY 4
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time, user_id)
PRIMARY KEY (event_time)           -- prefix of ORDER BY; defaults to ORDER BY
TTL event_time + INTERVAL 90 DAY DELETE
SETTINGS index_granularity = 8192;

Three concepts to internalize:

  • Sorting key (ORDER BY): the order in which rows are physically stored within a part. Drives compression efficiency and which queries skip data well.
  • Primary key (PRIMARY KEY, optional): a prefix of the sorting key that is held in memory as the sparse index. If you omit it, the primary key equals the sorting key. Specifying a shorter primary key reduces RAM usage on tables with very long sorting keys.
  • Partition (PARTITION BY): a logical grouping (typically by month) that lets you drop, attach, and ALTER data per slice. Parts from different partitions are never merged together.

MergeTree vs Other MergeTree Variants

Engine Special behavior on merge When to use
MergeTree None - rows kept as-is Raw events, append-only analytics
ReplacingMergeTree Keeps the latest row per sorting key Upserts / CDC ingest
SummingMergeTree Sums numeric columns per sorting key Pure-sum rollups
AggregatingMergeTree Merges aggregate function states Arbitrary aggregate rollups (uniq, quantile, etc.)
CollapsingMergeTree Cancels sign=1/-1 row pairs Incremental update workflows
VersionedCollapsingMergeTree Like Collapsing, but version-aware Out-of-order CDC
GraphiteMergeTree Rolls up time-series with retention rules Graphite metric storage
ReplicatedMergeTree Adds Keeper-coordinated replication to any of the above HA / multi-replica deployments

Replicated variants share the same merge semantics; they only add a coordination layer.

Key MergeTree Settings

Setting Default Effect
index_granularity 8192 Rows per granule in the primary index
min_bytes_for_wide_part 10 MiB Above this, parts use wide format (one file per column)
min_rows_for_wide_part 0 (disabled) Row-count threshold for wide format
merge_max_block_size 8192 Rows per block during a merge
parts_to_delay_insert 150 Inserts are slowed when a partition has this many active parts
parts_to_throw_insert 3000 Inserts fail with Too Many Parts at this threshold
max_parts_in_total 100000 Hard ceiling on active parts across all partitions
merge_with_ttl_timeout 14400 (seconds) Minimum gap between TTL merges
enable_mixed_granularity_parts 1 Allow adaptive granularity per part

These thresholds are the most common ones to tune in production: parts_to_delay_insert and parts_to_throw_insert for high-velocity ingest, and index_granularity for very narrow tables.

Common Pitfalls

  1. Choosing an ORDER BY that does not match query filters. The primary index is useless if your WHERE clauses don't reference a prefix of the sorting key.
  2. Over-partitioning. Per-day partitions on a five-year retention is 1,800 partitions - merges and system.parts queries get expensive. Per-month is the standard.
  3. Many small inserts. Each INSERT creates at least one part; thousands of small inserts overwhelm the merge scheduler. Batch on the client or use the Buffer engine / async inserts.
  4. Treating ALTER TABLE ... UPDATE as cheap. Mutations rewrite affected parts in full. Prefer ReplacingMergeTree or AggregatingMergeTree for update-heavy workloads.
  5. Forgetting to set TTL. Tables grow until the disk fills up; the engine has no built-in expiry without a TTL clause.

Operating MergeTree in Production

The signal that matters most is the active part count - too many parts means merges are falling behind, ingest will start to throttle, and eventually INSERT fails:

-- Active parts per table/partition; sort descending to spot trouble
SELECT database, table, partition, count() AS parts, sum(rows) AS rows,
       formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition
ORDER BY parts DESC LIMIT 20;

-- Active merges
SELECT table, elapsed, progress, num_parts, formatReadableSize(memory_usage)
FROM system.merges;

Pulse tracks part counts, merge throughput, and TTL backlog per MergeTree table in real time. Its agentic SRE engine flags when ingest patterns are about to trip parts_to_delay_insert, identifies the root cause (small-batch inserts, slow disk, missing partition pruning), and in some cases auto-tunes merge settings or temporarily relaxes the threshold while it resolves the upstream batching problem.

Frequently Asked Questions

Q: What is the difference between the primary key and the sorting key in MergeTree?
A: The sorting key (ORDER BY) defines the physical row order within each part - it drives compression and data-skipping. The primary key (PRIMARY KEY, optional) is the in-memory sparse index, and it must be a prefix of the sorting key. If you do not specify PRIMARY KEY, it defaults to the full sorting key.

Q: What is the default index_granularity in MergeTree?
A: The default is 8192 rows per granule. Smaller values give finer-grained data skipping at the cost of a larger primary-key index in memory; larger values are useful for very wide tables where most reads are full-column scans.

Q: How often does ClickHouse merge MergeTree parts?
A: There is no fixed schedule. A background pool continuously evaluates which small parts to merge based on size and age, governed by the merge_tree settings group. New parts are typically merged within seconds to minutes; large parts (hundreds of GB) are merged less frequently.

Q: Why does my MergeTree table have so many active parts?
A: The most common cause is many small INSERT operations - each insert creates at least one part. Other causes are inserts spread across many partitions, a high max_partitions_per_insert_block, or merges throttled by I/O contention. Watch system.parts and system.merges to confirm and batch inserts client-side.

Q: Can I update or delete rows in a MergeTree table?
A: Yes, via ALTER TABLE ... UPDATE and ALTER TABLE ... DELETE mutations, but they rewrite entire parts and are not cheap. For update-heavy workloads use ReplacingMergeTree or CollapsingMergeTree. Lightweight DELETE (which marks rows as deleted without rewriting) is available in recent versions but still affects merges.

Q: How do I add replication to a MergeTree table?
A: Use ReplicatedMergeTree instead - it adds a Keeper path and a replica name to the engine arguments. See the ReplicatedMergeTree page for the full pattern, including macros for {shard} and {replica}.

Q: Should I use PARTITION BY toYYYYMMDD instead of toYYYYMM?
A: Almost never. Daily partitions on a multi-year retention create thousands of small partitions and bloat system.parts. Monthly (toYYYYMM) is the standard; use daily only if you specifically need partition-level operations (DROP PARTITION) on a daily granularity for a short retention.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.