A dimension table holds the descriptive attributes you join against facts — a customer's tier, a product's category, a store's region. The hard part is what happens when those attributes change: do you overwrite the old value, or keep a history so you can still answer "what tier was this customer in when the order was placed?" That is the slowly-changing dimension (SCD) problem.
ClickHouse has no built-in SCD construct, but it gives you two strong primitives for building one: ReplacingMergeTree for storing versioned dimension rows, and the range_hashed dictionary layout for fast point-in-time lookups by an effective date range. This guide maps the classic SCD types onto those primitives.
The SCD Types in One Table
The data-warehousing literature defines several SCD strategies. These three cover almost every ClickHouse use case:
| Type | What it does | When to use | ClickHouse primitive |
|---|---|---|---|
| Type 1 | Overwrite the old value; no history kept | You only ever need the current attribute | ReplacingMergeTree keyed by the dimension key |
| Type 2 | Add a new row per change, with effective-from/to dates | You need point-in-time / historical accuracy | Versioned rows + range_hashed dictionary |
| Type 3 | Keep a limited number of prior values as extra columns (e.g. previous_region) |
You only care about the immediately preceding value | Plain columns on a Type 1 table |
Type 2 is the one most people mean when they say "SCD," because it is what lets a fact joined to a dimension reflect the attribute as it was at event time, not as it is today.
SCD Type 1: Overwrite With ReplacingMergeTree
Type 1 keeps only the latest version of each dimension row. ReplacingMergeTree is the natural fit: it deduplicates rows that share the same ORDER BY key during background merges, keeping the row with the highest version.
CREATE TABLE customer_dim
(
customer_id UInt64,
name String,
tier LowCardinality(String),
region LowCardinality(String),
updated_at DateTime -- used as the version column
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY customer_id;
Each time a customer changes, you insert a new row with a newer updated_at. After a merge, only the latest row per customer_id survives. Because merges happen in the background at unpredictable times, you must read with FINAL (or aggregate explicitly) to guarantee deduplicated results:
SELECT customer_id, tier, region
FROM customer_dim FINAL
WHERE customer_id = 42;
The version column matters: without it, ReplacingMergeTree keeps the last inserted row, which is order-dependent and unsafe with out-of-order or replicated inserts. Passing a version column (updated_at here) makes the choice deterministic — the maximum version wins. See ReplacingMergeTree for the full semantics, and the ReplacingMergeTree duplicate gotcha for the most common mistake (forgetting FINAL and seeing duplicates).
If you would rather avoid FINAL, you can emulate it at query time with argMax, which picks the attribute value associated with the largest version:
SELECT
customer_id,
argMax(tier, updated_at) AS tier,
argMax(region, updated_at) AS region
FROM customer_dim
GROUP BY customer_id;
See the argMax function for details on this pattern.
SCD Type 3: Keep the Previous Value as a Column
Type 3 is the lightest pattern: instead of full history, you keep one or two prior values inline. It is just extra columns on a Type 1 table.
CREATE TABLE customer_dim_t3
(
customer_id UInt64,
region LowCardinality(String),
previous_region LowCardinality(String),
region_changed_at DateTime,
updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY customer_id;
Type 3 answers "what was the region before the current one?" but nothing deeper. Reach for it only when a single step of history is genuinely all you need; otherwise Type 2 is more honest about your requirements.
SCD Type 2: Versioned Rows With Effective Date Ranges
Type 2 is where ClickHouse shines. You store one row per version of a dimension key, each stamped with the date range during which it was valid:
CREATE TABLE customer_dim_history
(
customer_id UInt64,
tier LowCardinality(String),
region LowCardinality(String),
valid_from Date,
valid_to Nullable(Date) -- NULL = still current
)
ENGINE = MergeTree
ORDER BY (customer_id, valid_from);
A customer who moved regions produces two rows:
INSERT INTO customer_dim_history VALUES
(42, 'gold', 'EU', '2023-01-01', '2024-05-31'),
(42, 'gold', 'US', '2024-06-01', NULL);
You can answer point-in-time questions with a range join, but ClickHouse joins are not built for this, and a fact-by-fact range join over millions of events is slow. The idiomatic, fast solution is to expose the history table as a range_hashed dictionary.
The range_hashed Dictionary
The range_hashed layout stores, for each key, an ordered set of ranges with their values, and resolves a lookup to the range that contains a supplied date — in memory, at O(1)-ish cost:
CREATE DICTIONARY customer_scd_dict
(
customer_id UInt64,
tier String,
region String,
valid_from Date,
valid_to Date
)
PRIMARY KEY customer_id
SOURCE(CLICKHOUSE(TABLE 'customer_dim_history'))
LIFETIME(MIN 300 MAX 600)
LAYOUT(RANGE_HASHED())
RANGE(MIN valid_from MAX valid_to);
Now a point-in-time lookup takes the key and a date, and returns the value valid on that date:
SELECT dictGet('customer_scd_dict', 'region', toUInt64(42), toDate('2024-03-15'));
-- → 'EU' (the row valid from 2023-01-01 to 2024-05-31)
SELECT dictGet('customer_scd_dict', 'region', toUInt64(42), today());
-- → 'US' (the current row)
To enrich a fact table with the attribute as it was at event time, pass the event date as the range argument:
SELECT
o.order_id,
o.event_date,
dictGet('customer_scd_dict', 'region', o.customer_id, o.event_date) AS region_at_order
FROM orders AS o;
This is the payoff of SCD Type 2 in ClickHouse: a single dictionary lookup per fact row, served from memory, instead of a range join.
Open-Ended and Current Ranges
The "still current" version has no upper bound. In the source table, a NULL range_max is treated as the maximum possible value, so the open-ended row matches any future date. Likewise an epoch 1970-01-01 (or 0) range_min leaves the lower bound open. Note that a range_hashed dictionary's range_min/range_max values must fit in Int64. With Nullable source columns:
-- valid_to NULL in the source ⇒ the range stays open on the upper end
INSERT INTO customer_dim_history VALUES
(43, 'silver', 'APAC', '2024-06-01', NULL);
Overlapping Ranges: range_lookup_strategy
If a key can have overlapping ranges, control which one wins with range_lookup_strategy:
LAYOUT(RANGE_HASHED(range_lookup_strategy 'max'))
'min'(default): among matching ranges, returns the one with the smallestrange_min(then smallestrange_max).'max': returns the one with the largestrange_min(then largestrange_max) — i.e. the most recent applicable version, which is usually what you want when versions overlap.
For composite dimension keys, use the COMPLEX_KEY_RANGE_HASHED() layout with multiple PRIMARY KEY columns.
Refreshing the Dimension
The dictionary is a cached, in-memory copy of the history table, so it must be refreshed when the source changes. LIFETIME(MIN 300 MAX 600) reloads it on an interval randomized between 300 and 600 seconds (the randomization spreads reload load across replicas). You can also reload on demand:
SYSTEM RELOAD DICTIONARY customer_scd_dict;
How you write new versions into the source table matters more than the reload:
- Append-only is ideal. Inserting a new versioned row (and, for Type 2, closing the previous row's
valid_to) avoids mutations entirely. - Avoid
ALTER TABLE ... UPDATEto close ranges. Mutations rewrite whole parts and are expensive at scale. Prefer modeling the close-out as anINSERTof a corrected row into aReplacingMergeTree, or batch the close-outs. See mutation performance impact for why per-changeUPDATEs do not scale.
For enrichment-at-insert-time patterns and keeping dictionary-backed columns current, see the dictionary enrichment update pattern.
Best Practices
- Pick the lowest SCD type that meets the requirement. Type 2 carries real storage and operational cost; do not pay it if you only ever query the current value (Type 1) or one prior value (Type 3).
- Always use a version column with
ReplacingMergeTree, and read withFINALorargMax. Without it, deduplication is insertion-order-dependent. - Keep dimensions small enough to fit in memory if you want to serve them as dictionaries.
range_hashedloads the full dimension into RAM; size it deliberately and useLowCardinality/narrow types. See ClickHouse dictionaries and external dictionaries. - Model version close-outs as inserts, not mutations. Append a new row and let a
ReplacingMergeTreeor the range strategy resolve the latest version. - Use
range_lookup_strategy 'max'when overlapping ranges are possible and you want the most recent applicable version. - Refresh deliberately. Tune
LIFETIMEto your change frequency, and callSYSTEM RELOAD DICTIONARYafter bulk dimension loads rather than waiting for the interval.
Common Issues
- Forgetting
FINALon a Type 1 table and seeing duplicate dimension rows because background merges have not run yet. UseFINALorargMax. - Wrong point-in-time answers from an unbounded current row. If
valid_tois notNULL(or epoch) for the current version, future-dated lookups miss it. Make the current row explicitly open-ended. range_min/range_maxoverflow. Range boundary values must fit inInt64; out-of-range or invalid dates produce surprising results.- Stale lookups after a dimension load because the dictionary has not reloaded. Reload it explicitly after batch updates.
- Per-change
ALTER ... UPDATEmutations to close out ranges, which rewrite parts and pile up. Switch to append-only versioning.
How Pulse Helps
Slowly-changing dimensions are easy to get subtly wrong: a missing FINAL, an unbounded current range, or a dictionary that quietly serves stale data after a load. Pulse monitors ClickHouse production deployments and surfaces these issues before they corrupt analytics — flagging dictionaries that fail to reload, tables accumulating unmerged ReplacingMergeTree parts, and mutation backlogs from per-change dimension updates. Pulse is run by ClickHouse experts who can review your dimension modeling and point-in-time query patterns, so your SCD design stays correct and fast as data grows.
Frequently Asked Questions
Q: Which SCD type should I use in ClickHouse?
Use Type 1 (ReplacingMergeTree, overwrite) if you only need the current value. Use Type 2 (versioned rows + range_hashed dictionary) if you need point-in-time history — for example, the customer's region at the time of each order. Use Type 3 (extra "previous value" columns) only when a single step of history suffices.
Q: How do I look up a dimension value as of a specific past date?
Store effective-from/to dates in the dimension table, expose it as a range_hashed (or complex_key_range_hashed) dictionary with RANGE(MIN valid_from MAX valid_to), and call dictGet('dict', 'attr', key, date). ClickHouse returns the value whose range contains that date.
Q: Do I need FINAL when reading a ReplacingMergeTree dimension?
Yes, unless you aggregate with argMax instead. Deduplication only happens during background merges, which run at unpredictable times, so without FINAL you may see multiple versions of the same key.
Q: How do I represent the "current" version with no end date?
Leave valid_to as NULL in the source table. In a range_hashed dictionary a NULL upper bound is treated as the maximum possible value, so the open-ended row matches any current or future date.
Q: What happens when two date ranges for the same key overlap?
The range_lookup_strategy setting decides: 'min' (the default) returns the range with the smallest range_min, 'max' returns the one with the largest range_min. Use 'max' to get the most recent applicable version.
Q: Should I close out old versions with ALTER TABLE ... UPDATE?
Avoid it at scale. Mutations rewrite whole parts and accumulate quickly. Model version changes as appended rows in a ReplacingMergeTree (or rely on the range strategy), which keeps writes cheap. See mutation performance impact.