Partial Dictionary Refreshes and Incremental Updates

Q: What is the difference between `update_field` and `update_lag`?

update_field names the timestamp column used to detect changes. update_lag is an optional number of seconds subtracted from the previous update time to widen the overlap window, protecting against clock skew and late-committing transactions at the cost of re-fetching slightly more rows.

Q: Which dictionary layouts support incremental updates?

The in-memory layouts flat , hashed , hashed_array , and complex_key_hashed support update_field . CACHE and SSD_CACHE dictionaries use on-demand per-key loading instead and do not use update_field .

Q: Can I combine `update_field` with `invalidate_query`?

Yes. invalidate_query decides whether a refresh is needed at all (by detecting a changed sentinel value), and update_field minimizes how much data is fetched when a refresh does run. Together they cut both refresh frequency and refresh size.

Q: Does `SYSTEM RELOAD DICTIONARY` force a full reload?

No. If update_field is configured, a manual reload still applies the incremental logic and only fetches changed rows. To force a complete rebuild, drop and recreate the dictionary.

By default, a ClickHouse dictionary re-reads its entire source every time its LIFETIME elapses. For a large reference table that changes only a little between refreshes, that is wasteful: it hammers the source database, burns memory and CPU, and slows every refresh. ClickHouse supports partial (incremental) updates through the update_field source parameter, which makes the dictionary fetch only the rows that changed since the last successful refresh.

This guide covers how update_field and update_lag work, which dictionary layouts support them, how the behavior differs across source types, and the pitfalls that cause stale or missing data. For background on dictionaries themselves, see What is a ClickHouse Dictionary and ClickHouse External Dictionary.

Full Reload vs. Incremental Update

Without update_field, every refresh is a full reload. With update_field, ClickHouse remembers the time of its last successful update and asks the source only for rows modified at or after that time, merging them into the in-memory dictionary.

Aspect	Full reload (default `LIFETIME`)	Incremental update (`update_field`)
Rows fetched per refresh	The entire source dataset	Only rows changed since the last refresh
Source load	High — full scan every cycle	Low — filtered by a timestamp range
Requires source schema change	No	Yes — a last-modified timestamp column
Handles deletes	Yes (rows simply disappear on reload)	No — deleted source rows stay in the dictionary
Best for	Small or rarely-refreshed dictionaries	Large, append/update-heavy reference data

The key trade-off: incremental updates are dramatically cheaper, but they only add and overwrite keys. They never remove keys, because a deleted source row no longer appears in the changed-rows query. See Handling Deletes below.

How `update_field` Works

You add update_field to the dictionary's SOURCE definition, naming a column on the source that holds the row's last-modified time (typically a DateTime like updated_at).

On each refresh, the dictionary:

Saves the wall-clock time (now()) of the start of a successful update.
On the next refresh, queries the source for rows whose update_field value is greater than or equal to the previous update time, minus a 1-second shift.
Merges the returned rows into the existing in-memory data, overwriting matching keys and inserting new ones.

The 1-second shift exists so that rows written during the brief window of the previous fetch are not missed. The very first load after the dictionary is created (or reloaded) fetches everything, since there is no previous update time yet.

CREATE DICTIONARY customers_dict
(
    id          UInt64,
    name        String,
    tier        String,
    updated_at  DateTime
)
PRIMARY KEY id
SOURCE(CLICKHOUSE(
    TABLE 'customers'
    DB 'reference'
    UPDATE_FIELD 'updated_at'
))
LAYOUT(HASHED())
LIFETIME(MIN 30 MAX 60);

With this definition, ClickHouse refreshes roughly every 30–60 seconds, but each refresh only pulls customer rows whose updated_at advanced since the last cycle. The LIFETIME(MIN ... MAX ...) range randomizes the interval so that multiple servers do not all hit the source at the same instant.

The same idea expressed in XML configuration:

<source>
    <clickhouse>
        <db>reference</db>
        <table>customers</table>
        <update_field>updated_at</update_field>
    </clickhouse>
</source>
<lifetime>
    <min>30</min>
    <max>60</max>
</lifetime>

Adding Overlap with `update_lag`

update_field alone uses a fixed 1-second shift. If your source can commit rows with a timestamp slightly in the past (for example, the timestamp is set in application code before a slow transaction commits, or clocks are skewed), one second may not be enough overlap and you can miss rows.

The optional update_lag parameter widens that window: its value is subtracted from the previous update time before building the query. With UPDATE_LAG 60, each refresh re-reads everything from the last 60 seconds plus anything newer.

SOURCE(CLICKHOUSE(
    TABLE 'customers'
    DB 'reference'
    UPDATE_FIELD 'updated_at'
    UPDATE_LAG 60
))

A larger update_lag is safer against clock skew and late-committing transactions, but it re-fetches more rows each cycle, which slightly increases source load. Pick the smallest lag that reliably covers your worst-case write-to-commit delay.

Behavior by Source Type

update_field is implemented differently depending on the source, but the intent is identical — pass the previous update time to the source so it returns only newer rows.

Source	How `update_field` is applied
ClickHouse, MySQL, PostgreSQL, ODBC	An additional `WHERE update_field >= <last_update_time>` clause is appended to the source query
HTTP / HTTPS	The value is sent as a query parameter (e.g. `&updated_at=2026-01-01%2000:00:01`) for your endpoint to filter on
Executable	The value is passed as an argument to the script, which must filter accordingly

For HTTP and executable sources, ClickHouse only sends the timestamp — your endpoint or script is responsible for actually returning just the changed rows. If it ignores the parameter and returns everything, you lose the benefit (and may even get incorrect merges).

Supported Layouts

Incremental updates apply to dictionary layouts that load and hold their data in memory and can merge in new rows. According to the ClickHouse documentation, update_field is supported for the flat, hashed, hashed_array, and complex_key_hashed layouts.

CACHE and SSD_CACHE dictionaries do not use update_field: they already load data on demand per key and expire individual entries, so they have their own freshness model rather than a periodic merge. If a cache dictionary cannot reach its source during a lookup, you get a different failure mode entirely — see Cache dictionary update failed.

Alternative: Skip Reloads with `invalidate_query`

update_field reduces the size of each refresh. A complementary mechanism, invalidate_query, reduces the frequency of refreshes: it runs a cheap query (for example, SELECT max(updated_at) FROM customers) on each LIFETIME cycle and only triggers a reload if the returned value changed since last time.

SOURCE(CLICKHOUSE(
    TABLE 'customers'
    DB 'reference'
    INVALIDATE_QUERY 'SELECT max(updated_at) FROM reference.customers'
))
LAYOUT(HASHED())
LIFETIME(MIN 30 MAX 60);

The two can be combined: invalidate_query decides whether to refresh, and update_field decides how little to fetch when a refresh does happen.

Manual and On-Demand Refreshes

Regardless of LIFETIME, you can force a reload at any time:

-- Reload a single dictionary
SYSTEM RELOAD DICTIONARY reference.customers_dict;

-- Reload every dictionary on the server
SYSTEM RELOAD DICTIONARIES;

A manual SYSTEM RELOAD DICTIONARY re-applies the same incremental logic if update_field is configured — it does not force a full reload of all rows. To force a complete rebuild, drop and recreate the dictionary, or use LIFETIME(MIN 0 MAX 0) semantics. A plain dictGet against an unloaded dictionary will also trigger its initial load as a side effect.

Monitoring Refreshes

Use system.dictionaries to confirm refreshes are happening and succeeding:

SELECT
    name,
    status,
    element_count,
    last_successful_update_time,
    loading_duration,
    last_exception
FROM system.dictionaries
WHERE name = 'customers_dict';

Watch last_successful_update_time advance over time and loading_duration stay small — a large loading_duration after enabling update_field is a sign the source is not actually filtering on the timestamp. A populated last_exception points to a source or query problem.

Handling Deletes

This is the single biggest gotcha with incremental dictionaries: update_field never removes keys. Because the refresh query only returns rows that still exist and were recently modified, a row deleted in the source simply stops appearing — the dictionary keeps serving its last-known value forever.

Common workarounds:

Soft deletes: add an is_deleted flag (and bump updated_at) instead of hard-deleting. The deleted row is then fetched as a changed row, and your queries filter on the flag. The key still occupies memory, but lookups return the correct state.
Periodic full reload: schedule an occasional full rebuild (drop/recreate the dictionary, or a separate dictionary without update_field) to purge tombstoned keys. Run it far less frequently than the incremental cycle.
Tombstone-aware design: keep deletions out of the dictionary entirely and resolve them at query time against another source.

Common Issues

Stale rows after deletes. Expected behavior — see Handling Deletes. Use soft deletes or periodic full reloads.
Missing recently-changed rows. Usually clock skew or late-committing transactions. Increase update_lag to widen the overlap window.
Refresh still scans the whole table. The update_field column is not indexed on the source, so the WHERE update_field >= ... filter forces a full scan anyway. Index that column on the source database.
HTTP/executable source returns everything. ClickHouse passes the timestamp, but the endpoint ignores it. Fix the endpoint to honor the parameter.
Timezone or type mismatch on the timestamp. The update_field column and the comparison value must agree on timezone and type, or the filter can silently exclude or include the wrong rows. Prefer UTC DateTime.
First load is huge. The initial load always fetches everything; only subsequent refreshes are incremental. Size memory for the full dataset.

Best Practices

Use a dedicated, indexed updated_at column on the source, set in the same transaction that writes the row, and stored in UTC.
Start update_lag conservatively (e.g. a few seconds to a minute) and reduce it only after confirming no rows are missed.
Combine with invalidate_query so quiet periods skip refreshes entirely instead of issuing pointless changed-row queries.
Plan for deletes from day one — pick soft deletes or a periodic full reload before you ship, not after you notice stale keys.
Monitor system.dictionaries for last_successful_update_time, loading_duration, and last_exception.
Randomize LIFETIME with a MIN/MAX range so a fleet of servers does not synchronize their refreshes against the source.

How Pulse Helps

Pulse monitors your ClickHouse dictionaries alongside the rest of your cluster, surfacing failed or stalled refreshes from system.dictionaries before they turn into stale lookups or query errors. It flags dictionaries whose loading_duration is creeping up (a classic sign that an update_field filter is silently doing a full scan), spots growing memory from tombstoned keys after deletes, and ties dictionary refresh load back to pressure on the underlying source database. If you are designing an incremental refresh strategy or debugging missing rows, the team behind Pulse can review your dictionary configuration and source schema. Learn more at pulse.support.

Frequently Asked Questions

Q: Does update_field reload the whole dictionary or just changed rows?

Just the changed rows. ClickHouse queries the source for rows whose update_field value is at or after the previous successful update time (minus a 1-second shift), then merges them into the in-memory data. The exception is the very first load, which always fetches everything.

Q: Why are deleted rows still showing up in my dictionary?

Because incremental updates only add or overwrite keys — a deleted source row no longer appears in the changed-rows query, so the dictionary keeps its last value. Use soft deletes (an is_deleted flag that bumps the timestamp) or schedule a periodic full reload to purge them.

Q: What is the difference between update_field and update_lag?

update_field names the timestamp column used to detect changes. update_lag is an optional number of seconds subtracted from the previous update time to widen the overlap window, protecting against clock skew and late-committing transactions at the cost of re-fetching slightly more rows.

Q: Which dictionary layouts support incremental updates?

The in-memory layouts flat, hashed, hashed_array, and complex_key_hashed support update_field. CACHE and SSD_CACHE dictionaries use on-demand per-key loading instead and do not use update_field.

Q: Can I combine update_field with invalidate_query?

Yes. invalidate_query decides whether a refresh is needed at all (by detecting a changed sentinel value), and update_field minimizes how much data is fetched when a refresh does run. Together they cut both refresh frequency and refresh size.

Q: Does SYSTEM RELOAD DICTIONARY force a full reload?

No. If update_field is configured, a manual reload still applies the incremental logic and only fetches changed rows. To force a complete rebuild, drop and recreate the dictionary.