Slowly-Changing Dimensions (SCD) Patterns in ClickHouse Dictionaries

Q: Which SCD type should I use in ClickHouse?

Use Type 1 ( ReplacingMergeTree , overwrite) if you only need the current value. Use Type 2 (versioned rows + range_hashed dictionary) if you need point-in-time history — for example, the customer's region at the time of each order . Use Type 3 (extra "previous value" columns) only when a single step of history suffices.

Q: Do I need `FINAL` when reading a ReplacingMergeTree dimension?

Yes, unless you aggregate with argMax instead. Deduplication only happens during background merges, which run at unpredictable times, so without FINAL you may see multiple versions of the same key.

Q: How do I represent the "current" version with no end date?

Leave valid_to as NULL in the source table. In a range_hashed dictionary a NULL upper bound is treated as the maximum possible value, so the open-ended row matches any current or future date.

Q: What happens when two date ranges for the same key overlap?

The range_lookup_strategy setting decides: 'min' (the default) returns the range with the smallest range_min , 'max' returns the one with the largest range_min . Use 'max' to get the most recent applicable version.

A dimension table holds the descriptive attributes you join against facts — a customer's tier, a product's category, a store's region. The hard part is what happens when those attributes change: do you overwrite the old value, or keep a history so you can still answer "what tier was this customer in when the order was placed?" That is the slowly-changing dimension (SCD) problem.

ClickHouse has no built-in SCD construct, but it gives you two strong primitives for building one: ReplacingMergeTree for storing versioned dimension rows, and the range_hashed dictionary layout for fast point-in-time lookups by an effective date range. This guide maps the classic SCD types onto those primitives.

The SCD Types in One Table

The data-warehousing literature defines several SCD strategies. These three cover almost every ClickHouse use case:

Type	What it does	When to use	ClickHouse primitive
Type 1	Overwrite the old value; no history kept	You only ever need the current attribute	`ReplacingMergeTree` keyed by the dimension key
Type 2	Add a new row per change, with effective-from/to dates	You need point-in-time / historical accuracy	Versioned rows + `range_hashed` dictionary
Type 3	Keep a limited number of prior values as extra columns (e.g. `previous_region`)	You only care about the immediately preceding value	Plain columns on a Type 1 table

Type 2 is the one most people mean when they say "SCD," because it is what lets a fact joined to a dimension reflect the attribute as it was at event time, not as it is today.

SCD Type 1: Overwrite With ReplacingMergeTree

Type 1 keeps only the latest version of each dimension row. ReplacingMergeTree is the natural fit: it deduplicates rows that share the same ORDER BY key during background merges, keeping the row with the highest version.

CREATE TABLE customer_dim
(
    customer_id   UInt64,
    name          String,
    tier          LowCardinality(String),
    region        LowCardinality(String),
    updated_at    DateTime  -- used as the version column
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY customer_id;

Each time a customer changes, you insert a new row with a newer updated_at. After a merge, only the latest row per customer_id survives. Because merges happen in the background at unpredictable times, you must read with FINAL (or aggregate explicitly) to guarantee deduplicated results:

SELECT customer_id, tier, region
FROM customer_dim FINAL
WHERE customer_id = 42;

The version column matters: without it, ReplacingMergeTree keeps the last inserted row, which is order-dependent and unsafe with out-of-order or replicated inserts. Passing a version column (updated_at here) makes the choice deterministic — the maximum version wins. See ReplacingMergeTree for the full semantics, and the ReplacingMergeTree duplicate gotcha for the most common mistake (forgetting FINAL and seeing duplicates).

If you would rather avoid FINAL, you can emulate it at query time with argMax, which picks the attribute value associated with the largest version:

SELECT
    customer_id,
    argMax(tier, updated_at)   AS tier,
    argMax(region, updated_at) AS region
FROM customer_dim
GROUP BY customer_id;

See the argMax function for details on this pattern.

SCD Type 3: Keep the Previous Value as a Column

Type 3 is the lightest pattern: instead of full history, you keep one or two prior values inline. It is just extra columns on a Type 1 table.

CREATE TABLE customer_dim_t3
(
    customer_id      UInt64,
    region           LowCardinality(String),
    previous_region  LowCardinality(String),
    region_changed_at DateTime,
    updated_at       DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY customer_id;

Type 3 answers "what was the region before the current one?" but nothing deeper. Reach for it only when a single step of history is genuinely all you need; otherwise Type 2 is more honest about your requirements.

SCD Type 2: Versioned Rows With Effective Date Ranges

Type 2 is where ClickHouse shines. You store one row per version of a dimension key, each stamped with the date range during which it was valid:

CREATE TABLE customer_dim_history
(
    customer_id   UInt64,
    tier          LowCardinality(String),
    region        LowCardinality(String),
    valid_from    Date,
    valid_to      Nullable(Date)   -- NULL = still current
)
ENGINE = MergeTree
ORDER BY (customer_id, valid_from);

A customer who moved regions produces two rows:

INSERT INTO customer_dim_history VALUES
    (42, 'gold', 'EU', '2023-01-01', '2024-05-31'),
    (42, 'gold', 'US', '2024-06-01', NULL);

You can answer point-in-time questions with a range join, but ClickHouse joins are not built for this, and a fact-by-fact range join over millions of events is slow. The idiomatic, fast solution is to expose the history table as a range_hashed dictionary.

The range_hashed Dictionary

The range_hashed layout stores, for each key, an ordered set of ranges with their values, and resolves a lookup to the range that contains a supplied date — in memory, at O(1)-ish cost:

CREATE DICTIONARY customer_scd_dict
(
    customer_id   UInt64,
    tier          String,
    region        String,
    valid_from    Date,
    valid_to      Date
)
PRIMARY KEY customer_id
SOURCE(CLICKHOUSE(TABLE 'customer_dim_history'))
LIFETIME(MIN 300 MAX 600)
LAYOUT(RANGE_HASHED())
RANGE(MIN valid_from MAX valid_to);

Now a point-in-time lookup takes the key and a date, and returns the value valid on that date:

SELECT dictGet('customer_scd_dict', 'region', toUInt64(42), toDate('2024-03-15'));
-- → 'EU'   (the row valid from 2023-01-01 to 2024-05-31)

SELECT dictGet('customer_scd_dict', 'region', toUInt64(42), today());
-- → 'US'   (the current row)

To enrich a fact table with the attribute as it was at event time, pass the event date as the range argument:

SELECT
    o.order_id,
    o.event_date,
    dictGet('customer_scd_dict', 'region', o.customer_id, o.event_date) AS region_at_order
FROM orders AS o;

This is the payoff of SCD Type 2 in ClickHouse: a single dictionary lookup per fact row, served from memory, instead of a range join.

Open-Ended and Current Ranges

The "still current" version has no upper bound. In the source table, a NULL range_max is treated as the maximum possible value, so the open-ended row matches any future date. Likewise an epoch 1970-01-01 (or 0) range_min leaves the lower bound open. Note that a range_hashed dictionary's range_min/range_max values must fit in Int64. With Nullable source columns:

-- valid_to NULL in the source ⇒ the range stays open on the upper end
INSERT INTO customer_dim_history VALUES
    (43, 'silver', 'APAC', '2024-06-01', NULL);

Overlapping Ranges: range_lookup_strategy

If a key can have overlapping ranges, control which one wins with range_lookup_strategy:

LAYOUT(RANGE_HASHED(range_lookup_strategy 'max'))

'min' (default): among matching ranges, returns the one with the smallest range_min (then smallest range_max).
'max': returns the one with the largest range_min (then largest range_max) — i.e. the most recent applicable version, which is usually what you want when versions overlap.

For composite dimension keys, use the COMPLEX_KEY_RANGE_HASHED() layout with multiple PRIMARY KEY columns.

Refreshing the Dimension

The dictionary is a cached, in-memory copy of the history table, so it must be refreshed when the source changes. LIFETIME(MIN 300 MAX 600) reloads it on an interval randomized between 300 and 600 seconds (the randomization spreads reload load across replicas). You can also reload on demand:

SYSTEM RELOAD DICTIONARY customer_scd_dict;

How you write new versions into the source table matters more than the reload:

Append-only is ideal. Inserting a new versioned row (and, for Type 2, closing the previous row's valid_to) avoids mutations entirely.
Avoid ALTER TABLE ... UPDATE to close ranges. Mutations rewrite whole parts and are expensive at scale. Prefer modeling the close-out as an INSERT of a corrected row into a ReplacingMergeTree, or batch the close-outs. See mutation performance impact for why per-change UPDATEs do not scale.

For enrichment-at-insert-time patterns and keeping dictionary-backed columns current, see the dictionary enrichment update pattern.

Best Practices

Pick the lowest SCD type that meets the requirement. Type 2 carries real storage and operational cost; do not pay it if you only ever query the current value (Type 1) or one prior value (Type 3).
Always use a version column with ReplacingMergeTree, and read with FINAL or argMax. Without it, deduplication is insertion-order-dependent.
Keep dimensions small enough to fit in memory if you want to serve them as dictionaries. range_hashed loads the full dimension into RAM; size it deliberately and use LowCardinality/narrow types. See ClickHouse dictionaries and external dictionaries.
Model version close-outs as inserts, not mutations. Append a new row and let a ReplacingMergeTree or the range strategy resolve the latest version.
Use range_lookup_strategy 'max' when overlapping ranges are possible and you want the most recent applicable version.
Refresh deliberately. Tune LIFETIME to your change frequency, and call SYSTEM RELOAD DICTIONARY after bulk dimension loads rather than waiting for the interval.

Common Issues

Forgetting FINAL on a Type 1 table and seeing duplicate dimension rows because background merges have not run yet. Use FINAL or argMax.
Wrong point-in-time answers from an unbounded current row. If valid_to is not NULL (or epoch) for the current version, future-dated lookups miss it. Make the current row explicitly open-ended.
range_min/range_max overflow. Range boundary values must fit in Int64; out-of-range or invalid dates produce surprising results.
Stale lookups after a dimension load because the dictionary has not reloaded. Reload it explicitly after batch updates.
Per-change ALTER ... UPDATE mutations to close out ranges, which rewrite parts and pile up. Switch to append-only versioning.

How Pulse Helps

Slowly-changing dimensions are easy to get subtly wrong: a missing FINAL, an unbounded current range, or a dictionary that quietly serves stale data after a load. Pulse monitors ClickHouse production deployments and surfaces these issues before they corrupt analytics — flagging dictionaries that fail to reload, tables accumulating unmerged ReplacingMergeTree parts, and mutation backlogs from per-change dimension updates. Pulse is run by ClickHouse experts who can review your dimension modeling and point-in-time query patterns, so your SCD design stays correct and fast as data grows.

Frequently Asked Questions

Q: Which SCD type should I use in ClickHouse?

Use Type 1 (ReplacingMergeTree, overwrite) if you only need the current value. Use Type 2 (versioned rows + range_hashed dictionary) if you need point-in-time history — for example, the customer's region at the time of each order. Use Type 3 (extra "previous value" columns) only when a single step of history suffices.

Q: How do I look up a dimension value as of a specific past date?

Store effective-from/to dates in the dimension table, expose it as a range_hashed (or complex_key_range_hashed) dictionary with RANGE(MIN valid_from MAX valid_to), and call dictGet('dict', 'attr', key, date). ClickHouse returns the value whose range contains that date.

Q: Do I need FINAL when reading a ReplacingMergeTree dimension?

Yes, unless you aggregate with argMax instead. Deduplication only happens during background merges, which run at unpredictable times, so without FINAL you may see multiple versions of the same key.

Q: How do I represent the "current" version with no end date?

Leave valid_to as NULL in the source table. In a range_hashed dictionary a NULL upper bound is treated as the maximum possible value, so the open-ended row matches any current or future date.

Q: What happens when two date ranges for the same key overlap?

The range_lookup_strategy setting decides: 'min' (the default) returns the range with the smallest range_min, 'max' returns the one with the largest range_min. Use 'max' to get the most recent applicable version.

Q: Should I close out old versions with ALTER TABLE ... UPDATE?

Avoid it at scale. Mutations rewrite whole parts and accumulate quickly. Model version changes as appended rows in a ReplacingMergeTree (or rely on the range strategy), which keeps writes cheap. See mutation performance impact.