What is AggregatingMergeTree?
AggregatingMergeTree is a specialized ClickHouse MergeTree-family engine that replaces rows sharing the same sorting key with a single row holding aggregate function states. Instead of storing raw values, columns are typed as AggregateFunction(name, type) or SimpleAggregateFunction(name, type), and the engine combines states during background merges. The pattern is used to maintain pre-aggregated rollups - hourly counters, distinct-user sketches, quantile digests - that are cheap to read because most of the aggregation work has already happened at write time. It is most effective when the rollup reduces row counts by orders of magnitude.
How AggregatingMergeTree Works
Each aggregate column stores a partial state (not the final value). You write states with -State combinator functions (sumState, uniqState, quantileState) and read them by combining stored states with -Merge combinator functions (sumMerge, uniqMerge, quantileMerge). During background merges, ClickHouse calls the aggregate function's merge() method to combine same-key states, so the on-disk size shrinks over time.
-- Destination rollup table: hourly per-tenant counters and distinct users
CREATE TABLE events_hourly_agg
(
tenant_id UInt32,
event_hour DateTime,
events_count AggregateFunction(sum, UInt64),
unique_users AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(event_hour)
ORDER BY (tenant_id, event_hour);
-- Materialized view that feeds the rollup as rows arrive in events_raw
CREATE MATERIALIZED VIEW events_hourly_mv TO events_hourly_agg AS
SELECT
tenant_id,
toStartOfHour(event_time) AS event_hour,
sumState(1::UInt64) AS events_count,
uniqState(user_id) AS unique_users
FROM events_raw
GROUP BY tenant_id, event_hour;
-- Reader: combine remaining partial states at query time
SELECT
tenant_id,
event_hour,
sumMerge(events_count) AS events,
uniqMerge(unique_users) AS uniques
FROM events_hourly_agg
WHERE event_hour >= now() - INTERVAL 7 DAY
GROUP BY tenant_id, event_hour
ORDER BY event_hour;
The GROUP BY on the read side is required because merges are asynchronous - multiple parts may still hold partial states for the same key. Using FINAL forces a synchronous merge at read time but is significantly slower.
AggregatingMergeTree vs Related Engines
| Engine | What it stores | Read-time work | Best for |
|---|---|---|---|
| MergeTree | Raw rows | Full aggregation | Wide-column raw event storage |
| SummingMergeTree | Summed numeric columns per key | Sum the per-part residues | Simple sum() rollups only |
| AggregatingMergeTree | Partial aggregate states per key | -Merge combinator |
Arbitrary aggregate functions (uniq, quantile, argMax, etc.) |
| ReplacingMergeTree | Latest row per key by version | FINAL or argMax |
Upsert / CDC patterns |
| CollapsingMergeTree | Sign-encoded rows | SUM(sign * col) |
Incremental updates with insert/delete pairs |
Choose AggregatingMergeTree when SummingMergeTree is not expressive enough (you need uniq, quantile, argMax, groupArray, etc.) and you can tolerate the operational cost of explicit -State / -Merge plumbing.
Common Pitfalls with AggregatingMergeTree
- Mixing raw values and aggregate states in the same column - the engine only merges
AggregateFunction/SimpleAggregateFunctioncolumns. PlainUInt64columns are silently kept "as is" only withSimpleAggregateFunction(any, ...)semantics; otherwise they are left to the engine's default behavior. - Forgetting
GROUP BYon the read side. Without it you get one row per part per key, not a final aggregate. - Using
sum()instead ofsumMerge()on anAggregateFunction(sum, ...)column - returns binary state bytes interpreted as a number, producing garbage values. - Aggregating on columns not in
ORDER BY. Only columns in the sorting key act as the merge key; everything else accumulates raw. - Over-using
FINALin production queries. It materializes the merge at read time and disables several optimizations; preferGROUP BYwith-Mergecombinators.
Monitoring AggregatingMergeTree Tables
Watch the part count and merge depth to confirm the engine is doing its job:
-- Active parts per partition; high counts suggest merges are falling behind
SELECT partition, count() AS parts, sum(rows) AS rows
FROM system.parts
WHERE table = 'events_hourly_agg' AND active = 1
GROUP BY partition ORDER BY parts DESC;
-- Ongoing merges
SELECT table, elapsed, progress, num_parts, source_part_names
FROM system.merges WHERE table = 'events_hourly_agg';
If parts pile up faster than merges can collapse them, you'll eventually hit the parts_to_throw_insert threshold (default 3000) and see a Too Many Parts error. Pulse tracks AggregatingMergeTree merge backlogs across replicas and surfaces the root cause when rollups stop converging - typically slow disk on one replica, an unbalanced partition key, or a runaway materialized view producing too many small inserts.
Frequently Asked Questions
Q: What is the difference between AggregatingMergeTree and SummingMergeTree?
A: SummingMergeTree only sums numeric columns per primary key during merges. AggregatingMergeTree stores arbitrary aggregate function states (uniq, quantile, argMax, groupArray, sum, etc.) and merges them with the function's own merge method. AggregatingMergeTree is strictly more general; SummingMergeTree is simpler when you only need sums.
Q: Do I need to use FINAL when querying AggregatingMergeTree?
A: No - the recommended pattern is to add an explicit GROUP BY with -Merge combinator functions on the read side, which combines the partial states ClickHouse has not yet merged in the background. FINAL works but materializes the merge at read time and is typically slower than the GROUP BY approach.
Q: Which aggregate functions work with AggregatingMergeTree?
A: Any aggregate function that has a -State and -Merge combinator works - this covers nearly every ClickHouse aggregate including sum, count, avg, uniq, uniqExact, quantile, argMax, argMin, groupArray, groupUniqArray, topK, and min/max. Use SimpleAggregateFunction(sum, UInt64) for sums when you want to read the column as a plain number without -Merge.
Q: How do I feed an AggregatingMergeTree table?
A: The canonical pattern is a materialized view: CREATE MATERIALIZED VIEW mv TO agg_table AS SELECT key, sumState(value) FROM raw GROUP BY key. The MV runs the -State aggregation per insert into raw and routes the partial states into the aggregating table.
Q: When should I avoid AggregatingMergeTree?
A: Avoid it when the aggregation does not reduce row count meaningfully (the engine's overhead outweighs the savings), when you need raw rows for ad-hoc analysis, or when only simple sums are needed (SummingMergeTree is operationally simpler). Also avoid it for upsert/CDC workloads, where ReplacingMergeTree fits better.
Q: Can I replicate an AggregatingMergeTree table?
A: Yes - use ReplicatedAggregatingMergeTree('/clickhouse/tables/{shard}/agg_table', '{replica}'). It combines the aggregation semantics with ZooKeeper / ClickHouse Keeper-coordinated replication described in ReplicatedMergeTree.
Related Reading
- MergeTree Engine - the base engine this one extends
- ReplicatedMergeTree - how to add replication on top
- SummingMergeTree - simpler engine for pure-sum rollups
- ReplacingMergeTree - upsert-style alternative
- ClickHouse Materialized View - the typical feeder for AggregatingMergeTree
- Too Many Parts Error - common failure when MV inserts outpace merges
- ClickHouse Documentation Hub - index of all ClickHouse KB pages