Backfilling a column means populating it for rows that already exist — typically after adding a new column, fixing a data-quality issue, or denormalizing a value that used to live in another table. When the backfill values come from a reference, dimension, or mapping table, the idiomatic ClickHouse approach is to wrap that table in a dictionary and call dictGet inside an ALTER TABLE ... UPDATE mutation.
This guide focuses on the operational side of running such a backfill at scale: building the right kind of lookup dictionary, executing the mutation safely (often partition by partition), choosing between ALTER UPDATE and MATERIALIZE COLUMN, and monitoring a long-running rewrite. For the conceptual basics of dictionary-driven updates — including why you can't JOIN inside a mutation — see ClickHouse dictionary-driven updates and what is a ClickHouse dictionary.
Why a Dictionary for Backfill
A mutation expression is evaluated row by row over the data parts being rewritten, and ClickHouse has no UPDATE ... FROM or JOIN inside ALTER TABLE ... UPDATE. A dictionary sidesteps this: it exposes a second dataset as pure functions (dictGet, dictHas, dictGetOrDefault) keyed by value, and those functions are valid inside a mutation.
For a one-time backfill specifically, the dictionary has two extra advantages over a correlated subquery like WHERE key IN (SELECT ...):
- Bounded memory. A subquery against a large source table can blow up mutation memory. A dictionary is loaded once into a predictable footprint and reused for every row.
- Fast point lookups. Cache and hashed layouts answer per-key lookups in roughly constant time, which matters when you are calling it billions of times.
The Core Backfill Pattern
1. Create the lookup dictionary
The dictionary's source is the table that holds the values you want to write. For a single-column key, a FLAT or HASHED layout is simplest:
CREATE DICTIONARY update_dict
(
key UInt64,
value String
)
PRIMARY KEY key
SOURCE(CLICKHOUSE(TABLE 'test_update_source'))
LIFETIME(MIN 0 MAX 10)
LAYOUT(FLAT);
2. Run the backfill mutation
Use dictGet to read the attribute and dictHas to restrict the rewrite to rows that actually have a matching key — this avoids overwriting rows with a default/empty value when no match exists:
ALTER TABLE test_update
UPDATE value = dictGet('default.update_dict', 'value', toUInt64(key))
WHERE dictHas('default.update_dict', toUInt64(key));
Casting the key to the dictionary's key type (toUInt64(key) for a UInt64/cache-keyed dictionary) avoids type-mismatch errors. The ALTER TABLE ... UPDATE syntax is:
ALTER TABLE [db.]table [ON CLUSTER cluster]
UPDATE col = expr [, ...]
[IN PARTITION partition_id]
WHERE filter_expr
The filter_expr must be UInt8. You cannot update columns used in the primary key or partition key.
Backfilling on Composite (Multi-Column) Keys
When the join between fact and reference data is on several columns, use a dictionary with a complex key and pass the keys as a tuple. For very large reference sets the complex_key_cache layout keeps only hot keys in memory:
CREATE DICTIONARY item_dict
(
key1 UInt64,
key2 String,
key3 String,
S String
)
PRIMARY KEY key1, key2, key3
SOURCE(CLICKHOUSE(TABLE dict_table DB 'test' USER 'default'))
LAYOUT(complex_key_cache(size_in_cells 50000000))
LIFETIME(60000);
ALTER TABLE test.fact
UPDATE S = dictGetString('test.item_dict', 'S', tuple(key1, key2, key3))
WHERE 1;
Pass complex keys as tuple(...). dictGetString is the typed variant of dictGet for a String attribute; plain dictGet('dict', 'attr', key) works for any attribute type. Use dictGetOrDefault('dict', 'attr', key, fallback) when you want an explicit value for missing keys instead of filtering them out with dictHas.
Required Settings
allow_nondeterministic_mutations
A dictGet call is considered non-deterministic (the dictionary can change between part rewrites), so ClickHouse refuses it inside a mutation on ReplicatedMergeTree tables by default — different replicas could compute different values and diverge. Enable it for the session or user running the backfill:
SET allow_nondeterministic_mutations = 1;
For ON CLUSTER backfills the setting must be present in the profile of the user the operation runs as on every node — setting it only in your interactive session is not enough.
Throttling concurrent mutations
A backfill is heavy I/O. To stop it from saturating the merge/mutation pool, raise number_of_free_entries_in_pool_to_execute_mutation so fewer mutations run at once. Setting it to background_pool_size - 1 means mutation parts only execute when nearly all pool slots are free, leaving minimal room for mutations to run alongside other tasks:
ALTER TABLE test.fact
MODIFY SETTING number_of_free_entries_in_pool_to_execute_mutation = 15;
Execute Partition by Partition
A single WHERE 1 mutation rewrites every affected part in one queued operation, which on a large table can run for many hours and can't be paused. Altinity reports a backfill over 12 billion rows taking ~26 hours with ~8 GB of dictionary memory — so plan for the rewrite to be long and resource-intensive.
Scoping the mutation with IN PARTITION (or a WHERE on the partition column) lets you backfill in controlled batches, observe the effect on one partition before committing to all of them, and recover more gracefully if something goes wrong:
ALTER TABLE test.fact
UPDATE S = dictGetString('test.item_dict', 'S', tuple(key1, key2, key3))
IN PARTITION '202401'
WHERE 1;
Run partitions sequentially (or a few at a time), checking system.mutations between batches.
Monitoring the Backfill
Mutations are asynchronous by default; the mutations_sync setting controls whether your client waits. Track progress and failures in system.mutations:
SELECT
database, table, mutation_id, command,
parts_to_do, is_done, latest_fail_reason
FROM system.mutations
WHERE table = 'fact' AND is_done = 0
ORDER BY create_time;
parts_to_do counts remaining parts; latest_fail_reason surfaces errors (a common one is the missing allow_nondeterministic_mutations setting). To cancel a backfill that is misbehaving:
KILL MUTATION WHERE table = 'fact' AND mutation_id = '<id>';
For the broader cost model of mutations — write amplification, queueing, and impact on concurrent queries — see mutation performance impact.
ALTER UPDATE vs. MATERIALIZE COLUMN
If the new column already has a DEFAULT/MATERIALIZED expression that contains the dictGet call, you don't need an UPDATE at all — you can let the column definition do the work and MATERIALIZE it for existing parts.
ALTER TABLE test.fact
ADD COLUMN S String MATERIALIZED dictGetString('test.item_dict', 'S', tuple(key1, key2, key3));
ALTER TABLE test.fact
MATERIALIZE COLUMN S;
| Aspect | ALTER ... UPDATE with dictGet |
MATERIALIZE COLUMN |
|---|---|---|
| Column definition | Plain column, value set explicitly by mutation | Column carries a DEFAULT/MATERIALIZED dictGet expression |
| New inserts | Not auto-populated unless app supplies the value | Auto-computed from the expression on every insert |
| One-off correction | Good — arbitrary expression, no schema change to the column | Requires the expression to live in the schema |
| Mechanism | Mutation (rewrites affected parts) | Mutation (rewrites the column for existing parts) |
| Best for | Backfilling a value that won't be recomputed going forward | A derived column you want populated for old and future rows |
Both are mutations under the hood, so the same settings and monitoring apply. MATERIALIZE COLUMN is covered alongside other operations in the ALTER TABLE guide.
Replicated and Sharded Clusters
The dictionary and its source table must exist on every node, with exactly the same data, or different replicas/shards will write different values for the same key. In practice:
- Make the dictionary's source table
ReplicatedMergeTreeso all nodes see identical reference data. - Create the dictionary on every node (or via
ON CLUSTER). - Ensure the dictionary is fully loaded before starting the backfill, especially with cache layouts that load lazily.
Common Issues
- Mutation fails with a non-determinism error. Set
allow_nondeterministic_mutations = 1for the executing user/profile, not just your session, forON CLUSTERruns. - Rows overwritten with empty/default values. You filtered with
WHERE 1but some keys aren't in the dictionary. Gate the update withdictHas(...)or usedictGetOrDefaultto control the fallback. - Key type mismatch. Cast the key to the dictionary's key type (e.g.
toUInt64(key)); pass composite keys astuple(...). - Backfill never finishes / piles up. Too many concurrent mutations or merges. Throttle with
number_of_free_entries_in_pool_to_execute_mutationand batch by partition. - Inconsistent values across replicas. The dictionary source isn't identical on all nodes — replicate it.
Best Practices
- Snapshot the reference data. Point the dictionary at a stable source (or a frozen copy) so the lookup doesn't shift mid-backfill as the dictionary reloads on its
LIFETIME. - Filter with
dictHas. Only rewrite rows you actually have a value for; leave the rest untouched. - Batch by partition. Use
IN PARTITIONto make a multi-hour rewrite observable and recoverable. - Throttle the pool. Limit concurrent mutations so the backfill doesn't starve queries and merges.
- Right-size the layout. Use
flat/hashedfor dense integer keys,complex_key_hashedfor multi-column keys that fit in RAM, andcomplex_key_cachewhen the reference set is too large to hold fully. - Watch
system.mutations. Confirmis_done = 1and checklatest_fail_reasonbefore moving to the next batch.
How Pulse Helps
A dictionary-driven backfill is easy to start and hard to babysit: a mutation that rewrites billions of rows can run for a day, quietly stall the merge pool, or diverge across replicas if the dictionary isn't identical everywhere. Pulse monitors system.mutations for stuck or failing backfills, tracks the impact of long-running mutations on merge throughput and query latency, and flags replica-level inconsistencies in dictionary state before they corrupt your data. If you're planning a large backfill, Pulse can help you size the dictionary, choose the layout, and stage the rewrite so it lands without taking the cluster down with it.
Frequently Asked Questions
Q: Why do I need allow_nondeterministic_mutations for a dictionary backfill?
dictGet is treated as non-deterministic because the dictionary's contents can change (on reload) between part rewrites. ClickHouse blocks non-deterministic functions in mutations by default to avoid inconsistent results, so you must explicitly opt in for the user or profile running the backfill.
Q: Should I use ALTER UPDATE or MATERIALIZE COLUMN?
Use ALTER ... UPDATE for a one-off backfill of a plain column whose value won't be recomputed later. Use a MATERIALIZED dictGet column plus MATERIALIZE COLUMN when you want the value derived automatically for both existing and future rows. Both are mutations and carry the same cost.
Q: How do I backfill on a multi-column key?
Define a dictionary with a complex PRIMARY KEY (e.g. key1, key2, key3) and a complex_key_* layout, then pass the keys as a tuple: dictGet('dict', 'attr', tuple(key1, key2, key3)).
Q: How long will a large backfill take?
It depends on rows affected, columns rewritten, and disk throughput, but it is a heavy operation — a real-world 12-billion-row backfill ran ~26 hours. Treat it as a planned maintenance task, batch by partition, and monitor system.mutations.
Q: Will the backfill auto-populate new inserts going forward?
No. An ALTER ... UPDATE mutation only touches existing data. For continuous population, either supply the value at insert time, use a MATERIALIZED dictGet column, or enrich at query time. See dictionary-driven updates for those patterns.
Q: Can I avoid overwriting rows that have no matching key?
Yes. Add WHERE dictHas('dict', key) so only rows with a match are rewritten, or use dictGetOrDefault('dict', 'attr', key, fallback) to set an explicit fallback instead of the column default.