ClickHouse Dictionary-Driven Updates: Enrichment, Deduplication, and Dimension Lookups

Q: Why can't I just write `ALTER TABLE t UPDATE col = other.val FROM other JOIN ...`?

ClickHouse mutations don't support JOIN or FROM in the update expression. The expression is evaluated per row against the current row plus functions only. Wrapping the second table in a dictionary and calling dictGet is the supported way to reference external data in a mutation.

Q: Do I need `allow_nondeterministic_mutations`?

Only if your table uses a Replicated*MergeTree engine. Since ClickHouse 20.4, mutations that call dictGet on replicated tables are rejected by default, because replicas could resolve the dictionary to different values and produce checksum mismatches. Pass SETTINGS allow_nondeterministic_mutations = 1 to opt in. For plain (non-replicated) MergeTree tables the setting is not required.

Q: How do I avoid overwriting rows whose key isn't in the dictionary?

Guard the update with WHERE dictHas('dict', key) , or use dictGetOrDefault('dict', 'attr', key, fallback) so missing keys get an explicit value instead of the dictionary's null value.

Q: Why is my dictionary-based update so slow?

Mutations rewrite whole parts and the dictGet lookups add read amplification proportional to the granules scanned, not the rows changed. Tighten the WHERE clause to prune partitions, or move the enrichment to insert time / query time to avoid rewriting parts entirely.

ClickHouse does not support UPDATE ... FROM or JOIN inside an ALTER TABLE ... UPDATE mutation. When you need to populate or correct a column using values from another table — a dimension lookup, a deduplication map, or an enrichment table — the idiomatic solution is to wrap that other table in a dictionary and call dictGet inside the update expression.

This guide covers the dictionary-driven update pattern end to end: building the lookup dictionary, writing the ALTER TABLE UPDATE with dictGet/dictHas, the settings the server requires, the alternative of enriching at insert time with a MATERIALIZED column, and the performance and replication traps that bite people in production. For dictionary fundamentals (sources, layouts, lifetimes) start with what is a ClickHouse dictionary and external dictionaries.

Why a Dictionary Instead of a JOIN

A mutation expression is evaluated row by row over the data parts being rewritten. ClickHouse's mutation engine has no facility to join the table being mutated against an arbitrary second table, so the expression can only reference columns of the current row plus functions. A dictionary bridges that gap: it presents a second dataset as a set of pure functions (dictGet, dictHas, dictGetOrDefault) keyed by value, which are valid inside a mutation expression.

The same property makes dictionaries the standard way to avoid expensive JOINs at query time — load the dimension data into memory once, then look it up inline with dictGet. The difference here is that we use it to physically rewrite a stored column.

The Core Pattern: ALTER TABLE UPDATE with dictGet

1. Build the lookup dictionary

The dictionary's source is usually a ClickHouse table holding the reference/dimension data. Define the key and the attribute columns you want to read:

CREATE DICTIONARY update_dict
(
    key   UInt64,
    value String
)
PRIMARY KEY key
SOURCE(CLICKHOUSE(TABLE 'update_source'))
LIFETIME(MIN 0 MAX 10)
LAYOUT(FLAT());

LAYOUT(FLAT()) is fine for dense integer keys up to ~500K (it backs the dictionary with an array). For sparse or large key spaces use LAYOUT(HASHED()); for non-integer or multi-column keys use LAYOUT(COMPLEX_KEY_HASHED()).
LIFETIME controls reload frequency from the source. LIFETIME(0) disables background reloads — useful when you want the dictionary frozen for the duration of a one-time backfill.

2. Run the mutation

ALTER TABLE target_table
    UPDATE value = dictGet('update_dict', 'value', toUInt64(key))
    WHERE dictHas('update_dict', toUInt64(key));

The dictHas(...) guard in the WHERE clause is important: it restricts the update to rows whose key actually exists in the dictionary, so rows with no matching dimension entry are left untouched rather than overwritten with the dictionary's null/default value.

If you do want a fallback for missing keys, drop the guard and use dictGetOrDefault instead:

ALTER TABLE target_table
    UPDATE region = dictGetOrDefault('geo_dict', 'region', toUInt64(country_id), 'Unknown')
    WHERE 1;

3. Function signatures you will use

Function	Signature	Returns
`dictGet`	`dictGet('dict', attr_names, id_expr)`	Attribute value, or the dictionary's configured null value if the key is missing
`dictGetOrDefault`	`dictGetOrDefault('dict', attr_names, id_expr, default)`	Attribute value, or `default` if the key is missing
`dictGetOrNull`	`dictGetOrNull('dict', 'attr', id_expr)`	Attribute value, or `NULL` if the key is missing
`dictHas`	`dictHas('dict', id_expr)`	`1` if the key exists, else `0`

attr_names may be a tuple to fetch several columns at once, e.g. dictGet('dict', ('city', 'region'), id_expr). For composite keys, pass a tuple as id_expr: dictGet('dict', 'value', (key1, key2)).

Required Setting: allow_nondeterministic_mutations

dictGet is treated as non-deterministic because the dictionary can be reloaded mid-mutation. On Replicated*MergeTree tables ClickHouse rejects such mutations by default (since 20.4), because replicas could resolve the dictionary to different values and produce checksum mismatches. You must opt in explicitly for replicated tables:

ALTER TABLE target_table
    UPDATE value = dictGet('update_dict', 'value', toUInt64(key))
    WHERE dictHas('update_dict', toUInt64(key))
SETTINGS allow_nondeterministic_mutations = 1;

For plain (non-replicated) MergeTree tables this restriction does not apply and the setting is not required.

Because the dictionary really can change between when the mutation starts and when it finishes rewriting all parts, freeze the source during a backfill — set LIFETIME(0) (or a window long enough to outlast the mutation) so every part is enriched against the same snapshot.

Enriching at Insert Time Instead

If the enrichment value is stable per key and the dimension already exists in a dictionary before the data lands, prefer enriching at insert time over running mutations later. A MATERIALIZED column computes the value once on insert and stores it physically:

CREATE TABLE events
(
    user_id  UInt64,
    event    String,
    location String MATERIALIZED dictGet('users_dict', 'Location', user_id)
)
ENGINE = MergeTree
ORDER BY user_id;

Every insert resolves location through the dictionary and writes it to disk — no mutation, no read amplification, and the column behaves like any other stored column afterward. This is almost always cheaper than a later ALTER UPDATE and is the recommended approach when the enrichment is known at write time.

The trade-off: a MATERIALIZED column is computed only at insert. If the dictionary later changes, already-stored rows are not refreshed automatically — you would still need a mutation (or a re-insert) to update historical rows. See mutations and their performance impact for the cost of rewriting existing parts.

Query-Time Enrichment (No Storage Change)

When you don't need to persist the enriched value at all, skip both the mutation and the materialized column and just call dictGet in the SELECT:

SELECT
    e.event_time,
    e.user_id,
    dictGet('users_dict', 'Location', e.user_id) AS location
FROM events AS e;

This keeps the base table small and avoids write amplification entirely, at the cost of a dictionary lookup per output row. It is the right choice when the dimension changes frequently or when only a fraction of queries need the enrichment. For continuously enriching a separate destination table as data arrives, a materialized view that calls dictGet in its SELECT is often the cleanest design.

Deduplication and Dimension-Lookup Variations

The same mechanism handles a few related jobs:

Canonicalization / dedup map. Build a dictionary mapping each raw key to its canonical key (for example, merging duplicate customer IDs), then UPDATE canonical_id = dictGet('dedup_dict', 'canonical', toUInt64(raw_id)). Downstream aggregations group by the canonical column.
Backfilling a foreign key's attributes. Denormalize a dimension into the fact table once with a single mutation, so later queries don't pay for the lookup at all.
Correcting bad values. Ship a small correction table as a dictionary keyed by the rows to fix, guard with dictHas, and update only the affected rows.

Performance and Read Amplification

The dictionary-driven UPDATE is convenient but not free. A mutation rewrites every affected data part in full, and the dictGet call is evaluated for the rows in scope. In practice, dictionary-based updates can be slow on large tables because the mutation effectively probes the dictionary across the granules it scans, producing high read amplification relative to the number of rows that actually change.

Practical guidance:

Narrow the WHERE clause as much as possible so the mutation rewrites the fewest parts. A predicate on the table's primary key / partition key lets ClickHouse skip whole parts; dictHas alone does not prune by partition.
Prefer insert-time enrichment (MATERIALIZED column) or query-time dictGet over a mutation whenever the use case allows — both avoid rewriting existing parts.
Size the dictionary to fit in RAM. FLAT/HASHED layouts hold the whole dataset in memory; an oversized dictionary causes memory pressure. DIRECT layout avoids caching by querying the source per lookup, which is the wrong choice inside a mutation.
Watch mutation progress in system.mutations (the is_done and latest_fail_reason columns) — a failed or stuck mutation blocks subsequent ones on the same table.

joinGet as an alternative

For lookups against frequently-changing data you can also use the Join table engine with joinGet('join_table', 'value', key), which works much like dictGet inside a mutation. Note that Join-engine tables are not distributed — each node holds its own copy — and the engine is not designed for high-frequency updates, so a dictionary backed by a ReplicatedMergeTree source is usually the more operationally robust choice on a cluster.

Common Issues

Mutation ... non-deterministic error. You forgot SETTINGS allow_nondeterministic_mutations = 1. Add it to the ALTER.
Rows overwritten with empty/zero values. You used dictGet without a dictHas guard, so missing keys returned the dictionary's null value. Add the guard or switch to dictGetOrDefault/dictGetOrNull.
Key type mismatch. dictGet keys for single-key dictionaries are UInt64; cast explicitly with toUInt64(...). A String or signed key passed directly can fail or silently miss.
Stale results mid-mutation. The dictionary reloaded while the mutation ran. Freeze it with LIFETIME(0) during backfills.
Works on one node, fails on another (replicated cluster). The dictionary must be created on every node, and its source table should be ReplicatedMergeTree so all replicas resolve the same values. Mutations replicate, but the dictionary definition does not travel with them automatically.

Best Practices

Reach for insert-time MATERIALIZED enrichment or query-time dictGet first; use an ALTER UPDATE mutation only when you must physically rewrite existing rows.
Always pair dictGet with a dictHas guard (or dictGetOrDefault) so non-matching rows are handled deliberately.
Freeze the dictionary (LIFETIME(0)) for the duration of a backfill to keep the update deterministic in practice.
Choose the layout for your key shape: FLAT for dense small integer keys, HASHED for large/sparse keys, COMPLEX_KEY_HASHED for composite or string keys.
Keep the mutation's WHERE clause selective on partition/primary-key columns to minimize part rewrites.
On replicated setups, create the dictionary everywhere and back it with a replicated source table.

How Pulse Helps

Dictionary-driven updates fail quietly in ways that are hard to spot after the fact: a mutation that overwrote good values because a dictHas guard was missing, a backfill that ran against a dictionary mid-reload, or a stuck mutation in system.mutations blocking everything behind it. Pulse continuously monitors ClickHouse mutation queues, dictionary load status and memory footprint, and replication consistency across nodes, and surfaces the specific misconfigurations — non-deterministic mutation settings, oversized in-memory dictionaries, dictionaries missing on a replica — before they corrupt data or stall ingestion. It is built and run by engineers who operate ClickHouse in production, so the guidance is concrete rather than generic.

Frequently Asked Questions

Q: Why can't I just write ALTER TABLE t UPDATE col = other.val FROM other JOIN ...?

ClickHouse mutations don't support JOIN or FROM in the update expression. The expression is evaluated per row against the current row plus functions only. Wrapping the second table in a dictionary and calling dictGet is the supported way to reference external data in a mutation.

Q: Do I need allow_nondeterministic_mutations?

Only if your table uses a Replicated*MergeTree engine. Since ClickHouse 20.4, mutations that call dictGet on replicated tables are rejected by default, because replicas could resolve the dictionary to different values and produce checksum mismatches. Pass SETTINGS allow_nondeterministic_mutations = 1 to opt in. For plain (non-replicated) MergeTree tables the setting is not required.

Q: How do I avoid overwriting rows whose key isn't in the dictionary?

Guard the update with WHERE dictHas('dict', key), or use dictGetOrDefault('dict', 'attr', key, fallback) so missing keys get an explicit value instead of the dictionary's null value.

Q: Should I enrich during the mutation or at insert time?

If the enrichment is known when data is written and is stable per key, prefer a MATERIALIZED dictGet(...) column — it computes once on insert with no part rewrite. Use a mutation only to fix or backfill rows that already exist. Note that materialized columns are not refreshed when the dictionary later changes.

Q: Is this safe on a replicated cluster?

Only if the dictionary is created on every node and its source is a ReplicatedMergeTree table. The mutation replicates, but each replica resolves dictGet against its own local dictionary, so inconsistent dictionaries produce inconsistent data.

Q: Why is my dictionary-based update so slow?

Mutations rewrite whole parts and the dictGet lookups add read amplification proportional to the granules scanned, not the rows changed. Tighten the WHERE clause to prune partitions, or move the enrichment to insert time / query time to avoid rewriting parts entirely.