Backfill Columns Using ALTER/UPDATE with Dictionary Lookups

Backfilling a column means populating it for rows that already exist — typically after adding a new column, fixing a data-quality issue, or denormalizing a value that used to live in another table. When the backfill values come from a reference, dimension, or mapping table, the idiomatic ClickHouse approach is to wrap that table in a dictionary and call dictGet inside an ALTER TABLE ... UPDATE mutation.

This guide focuses on the operational side of running such a backfill at scale: building the right kind of lookup dictionary, executing the mutation safely (often partition by partition), choosing between ALTER UPDATE and MATERIALIZE COLUMN, and monitoring a long-running rewrite. For the conceptual basics of dictionary-driven updates — including why you can't JOIN inside a mutation — see ClickHouse dictionary-driven updates and what is a ClickHouse dictionary.

Why a Dictionary for Backfill

A mutation expression is evaluated row by row over the data parts being rewritten, and ClickHouse has no UPDATE ... FROM or JOIN inside ALTER TABLE ... UPDATE. A dictionary sidesteps this: it exposes a second dataset as pure functions (dictGet, dictHas, dictGetOrDefault) keyed by value, and those functions are valid inside a mutation.

For a one-time backfill specifically, the dictionary has two extra advantages over a correlated subquery like WHERE key IN (SELECT ...):

  • Bounded memory. A subquery against a large source table can blow up mutation memory. A dictionary is loaded once into a predictable footprint and reused for every row.
  • Fast point lookups. Cache and hashed layouts answer per-key lookups in roughly constant time, which matters when you are calling it billions of times.

The Core Backfill Pattern

1. Create the lookup dictionary

The dictionary's source is the table that holds the values you want to write. For a single-column key, a FLAT or HASHED layout is simplest:

CREATE DICTIONARY update_dict
(
    key   UInt64,
    value String
)
PRIMARY KEY key
SOURCE(CLICKHOUSE(TABLE 'test_update_source'))
LIFETIME(MIN 0 MAX 10)
LAYOUT(FLAT);

2. Run the backfill mutation

Use dictGet to read the attribute and dictHas to restrict the rewrite to rows that actually have a matching key — this avoids overwriting rows with a default/empty value when no match exists:

ALTER TABLE test_update
    UPDATE value = dictGet('default.update_dict', 'value', toUInt64(key))
    WHERE dictHas('default.update_dict', toUInt64(key));

Casting the key to the dictionary's key type (toUInt64(key) for a UInt64/cache-keyed dictionary) avoids type-mismatch errors. The ALTER TABLE ... UPDATE syntax is:

ALTER TABLE [db.]table [ON CLUSTER cluster]
    UPDATE col = expr [, ...]
    [IN PARTITION partition_id]
    WHERE filter_expr

The filter_expr must be UInt8. You cannot update columns used in the primary key or partition key.

Backfilling on Composite (Multi-Column) Keys

When the join between fact and reference data is on several columns, use a dictionary with a complex key and pass the keys as a tuple. For very large reference sets the complex_key_cache layout keeps only hot keys in memory:

CREATE DICTIONARY item_dict
(
    key1 UInt64,
    key2 String,
    key3 String,
    S    String
)
PRIMARY KEY key1, key2, key3
SOURCE(CLICKHOUSE(TABLE dict_table DB 'test' USER 'default'))
LAYOUT(complex_key_cache(size_in_cells 50000000))
LIFETIME(60000);
ALTER TABLE test.fact
    UPDATE S = dictGetString('test.item_dict', 'S', tuple(key1, key2, key3))
    WHERE 1;

Pass complex keys as tuple(...). dictGetString is the typed variant of dictGet for a String attribute; plain dictGet('dict', 'attr', key) works for any attribute type. Use dictGetOrDefault('dict', 'attr', key, fallback) when you want an explicit value for missing keys instead of filtering them out with dictHas.

Required Settings

allow_nondeterministic_mutations

A dictGet call is considered non-deterministic (the dictionary can change between part rewrites), so ClickHouse refuses it inside a mutation on ReplicatedMergeTree tables by default — different replicas could compute different values and diverge. Enable it for the session or user running the backfill:

SET allow_nondeterministic_mutations = 1;

For ON CLUSTER backfills the setting must be present in the profile of the user the operation runs as on every node — setting it only in your interactive session is not enough.

Throttling concurrent mutations

A backfill is heavy I/O. To stop it from saturating the merge/mutation pool, raise number_of_free_entries_in_pool_to_execute_mutation so fewer mutations run at once. Setting it to background_pool_size - 1 means mutation parts only execute when nearly all pool slots are free, leaving minimal room for mutations to run alongside other tasks:

ALTER TABLE test.fact
    MODIFY SETTING number_of_free_entries_in_pool_to_execute_mutation = 15;

Execute Partition by Partition

A single WHERE 1 mutation rewrites every affected part in one queued operation, which on a large table can run for many hours and can't be paused. Altinity reports a backfill over 12 billion rows taking ~26 hours with ~8 GB of dictionary memory — so plan for the rewrite to be long and resource-intensive.

Scoping the mutation with IN PARTITION (or a WHERE on the partition column) lets you backfill in controlled batches, observe the effect on one partition before committing to all of them, and recover more gracefully if something goes wrong:

ALTER TABLE test.fact
    UPDATE S = dictGetString('test.item_dict', 'S', tuple(key1, key2, key3))
    IN PARTITION '202401'
    WHERE 1;

Run partitions sequentially (or a few at a time), checking system.mutations between batches.

Monitoring the Backfill

Mutations are asynchronous by default; the mutations_sync setting controls whether your client waits. Track progress and failures in system.mutations:

SELECT
    database, table, mutation_id, command,
    parts_to_do, is_done, latest_fail_reason
FROM system.mutations
WHERE table = 'fact' AND is_done = 0
ORDER BY create_time;

parts_to_do counts remaining parts; latest_fail_reason surfaces errors (a common one is the missing allow_nondeterministic_mutations setting). To cancel a backfill that is misbehaving:

KILL MUTATION WHERE table = 'fact' AND mutation_id = '<id>';

For the broader cost model of mutations — write amplification, queueing, and impact on concurrent queries — see mutation performance impact.

ALTER UPDATE vs. MATERIALIZE COLUMN

If the new column already has a DEFAULT/MATERIALIZED expression that contains the dictGet call, you don't need an UPDATE at all — you can let the column definition do the work and MATERIALIZE it for existing parts.

ALTER TABLE test.fact
    ADD COLUMN S String MATERIALIZED dictGetString('test.item_dict', 'S', tuple(key1, key2, key3));

ALTER TABLE test.fact
    MATERIALIZE COLUMN S;
Aspect ALTER ... UPDATE with dictGet MATERIALIZE COLUMN
Column definition Plain column, value set explicitly by mutation Column carries a DEFAULT/MATERIALIZED dictGet expression
New inserts Not auto-populated unless app supplies the value Auto-computed from the expression on every insert
One-off correction Good — arbitrary expression, no schema change to the column Requires the expression to live in the schema
Mechanism Mutation (rewrites affected parts) Mutation (rewrites the column for existing parts)
Best for Backfilling a value that won't be recomputed going forward A derived column you want populated for old and future rows

Both are mutations under the hood, so the same settings and monitoring apply. MATERIALIZE COLUMN is covered alongside other operations in the ALTER TABLE guide.

Replicated and Sharded Clusters

The dictionary and its source table must exist on every node, with exactly the same data, or different replicas/shards will write different values for the same key. In practice:

  • Make the dictionary's source table ReplicatedMergeTree so all nodes see identical reference data.
  • Create the dictionary on every node (or via ON CLUSTER).
  • Ensure the dictionary is fully loaded before starting the backfill, especially with cache layouts that load lazily.

Common Issues

  • Mutation fails with a non-determinism error. Set allow_nondeterministic_mutations = 1 for the executing user/profile, not just your session, for ON CLUSTER runs.
  • Rows overwritten with empty/default values. You filtered with WHERE 1 but some keys aren't in the dictionary. Gate the update with dictHas(...) or use dictGetOrDefault to control the fallback.
  • Key type mismatch. Cast the key to the dictionary's key type (e.g. toUInt64(key)); pass composite keys as tuple(...).
  • Backfill never finishes / piles up. Too many concurrent mutations or merges. Throttle with number_of_free_entries_in_pool_to_execute_mutation and batch by partition.
  • Inconsistent values across replicas. The dictionary source isn't identical on all nodes — replicate it.

Best Practices

  1. Snapshot the reference data. Point the dictionary at a stable source (or a frozen copy) so the lookup doesn't shift mid-backfill as the dictionary reloads on its LIFETIME.
  2. Filter with dictHas. Only rewrite rows you actually have a value for; leave the rest untouched.
  3. Batch by partition. Use IN PARTITION to make a multi-hour rewrite observable and recoverable.
  4. Throttle the pool. Limit concurrent mutations so the backfill doesn't starve queries and merges.
  5. Right-size the layout. Use flat/hashed for dense integer keys, complex_key_hashed for multi-column keys that fit in RAM, and complex_key_cache when the reference set is too large to hold fully.
  6. Watch system.mutations. Confirm is_done = 1 and check latest_fail_reason before moving to the next batch.

How Pulse Helps

A dictionary-driven backfill is easy to start and hard to babysit: a mutation that rewrites billions of rows can run for a day, quietly stall the merge pool, or diverge across replicas if the dictionary isn't identical everywhere. Pulse monitors system.mutations for stuck or failing backfills, tracks the impact of long-running mutations on merge throughput and query latency, and flags replica-level inconsistencies in dictionary state before they corrupt your data. If you're planning a large backfill, Pulse can help you size the dictionary, choose the layout, and stage the rewrite so it lands without taking the cluster down with it.

Frequently Asked Questions

Q: Why do I need allow_nondeterministic_mutations for a dictionary backfill?

dictGet is treated as non-deterministic because the dictionary's contents can change (on reload) between part rewrites. ClickHouse blocks non-deterministic functions in mutations by default to avoid inconsistent results, so you must explicitly opt in for the user or profile running the backfill.

Q: Should I use ALTER UPDATE or MATERIALIZE COLUMN?

Use ALTER ... UPDATE for a one-off backfill of a plain column whose value won't be recomputed later. Use a MATERIALIZED dictGet column plus MATERIALIZE COLUMN when you want the value derived automatically for both existing and future rows. Both are mutations and carry the same cost.

Q: How do I backfill on a multi-column key?

Define a dictionary with a complex PRIMARY KEY (e.g. key1, key2, key3) and a complex_key_* layout, then pass the keys as a tuple: dictGet('dict', 'attr', tuple(key1, key2, key3)).

Q: How long will a large backfill take?

It depends on rows affected, columns rewritten, and disk throughput, but it is a heavy operation — a real-world 12-billion-row backfill ran ~26 hours. Treat it as a planned maintenance task, batch by partition, and monitor system.mutations.

Q: Will the backfill auto-populate new inserts going forward?

No. An ALTER ... UPDATE mutation only touches existing data. For continuous population, either supply the value at insert time, use a MATERIALIZED dictGet column, or enrich at query time. See dictionary-driven updates for those patterns.

Q: Can I avoid overwriting rows that have no matching key?

Yes. Add WHERE dictHas('dict', key) so only rows with a match are rewritten, or use dictGetOrDefault('dict', 'attr', key, fallback) to set an explicit fallback instead of the column default.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.