Flattened Table Design Pattern: When and How to Denormalize

Q: Can I flatten a one-to-many relationship without multiplying rows?

Yes. Use groupArray() (or Nested / Array(Tuple) columns) to fold the "many" side into arrays on the "one" row, then read them back with ARRAY JOIN . Keep the per-row array size small — thousands of child entries per parent row is an anti-pattern.

The flattened table design pattern in ClickHouse means storing data as a single wide table with dimension attributes copied inline, instead of keeping normalized fact and dimension tables that you JOIN at query time. The goal is to move the cost of the JOIN from every read to a single write, so analytical queries scan one table with no joins at all.

This is one of the oldest and most effective optimizations in column-oriented analytics, but it is not free: flattening trades cheap, flexible writes for fast reads, and it makes updates expensive. This guide covers when the pattern pays off, when it backfires, and the concrete techniques for building and maintaining a flattened table.

Tall (Normalized) vs. Wide (Flattened) Schemas

A normalized ("tall") schema keeps each entity in its own table and references it by key — a classic star schema with a fact table and several dimension tables. A flattened ("wide") schema folds the dimension columns directly into the fact rows, duplicating them on every event.

Aspect	Tall / normalized (star schema)	Wide / flattened (denormalized)
Query shape	Fact table JOIN dimension tables	Single table, no JOIN
Read latency	Pays JOIN cost on every query	Fastest — plain scan + filter
Storage	Compact; each dimension stored once	Larger; dimension values repeated per row
Write path	Insert fact rows only	Must resolve dimensions at insert time
Updating a dimension	Update one row in the dimension table	Rewrite every fact row that copied it
Best for	Frequently changing dimensions, flexible ad-hoc queries	Stable dimensions, known query patterns, ultra-low latency

ClickHouse compresses repeated dimension values well (especially with LowCardinality), so the storage cost of flattening is often smaller than it looks. The real cost is on the write and update path, not on disk.

A common misconception is that real-time analytics on ClickHouse requires aggressive denormalization. That is no longer true: ClickHouse's vectorized execution, modern join algorithms, and dictionaries mean a star schema is frequently fast enough. Flatten because your access pattern justifies it — not by reflex. Benchmark native JOIN performance first.

When to Flatten

Denormalization shifts the JOIN from query time to ingestion time. That is a good trade when:

The dimensions change infrequently, or a delay before data becomes queryable is acceptable (you can reload in batch).
The query patterns are well known and read-heavy — the same dimension columns are filtered and grouped repeatedly across many queries.
You need single-digit-millisecond latency for customer-facing dashboards where even a fast JOIN is too much overhead.
The JOIN is expensive to run repeatedly — for example, a large dimension that forces a costly hash build, or a distributed shuffle join you cannot avoid.

When NOT to Flatten

Avoid the flattened pattern when:

The relationship is many-to-many. A single source-row change can require updating a huge number of fact rows. The write amplification rarely pays off.
Dimensions are updated frequently and must be current in near real time. Keeping a wide table in sync with a fast-changing dimension means constant, expensive rewrites.
The "dimension" is high-cardinality nested data. Folding more than ~1,000 tuples per row into arrays is not recommended; the row becomes unwieldy and merges suffer.
Queries are ad-hoc and exploratory. A flattened table is optimized for known patterns; a normalized schema keeps you flexible.

If the dimension is small and slowly changing, a dictionary is usually a better answer than a physically flattened table — it gives you JOIN-free dictGet lookups without the update-rewrite problem (see below).

Building a Flattened Table

There are three practical ways to flatten, depending on how the data arrives and how fresh it must be.

1. Pre-join at insert time with `INSERT ... SELECT`

If you control the ingestion pipeline, resolve the dimensions in the insert itself. This is the simplest mental model: do the JOIN once, write the result.

CREATE TABLE orders_flat
(
    order_id     UInt64,
    order_date   Date,
    amount       Decimal(10, 2),
    customer_id  UInt64,
    -- dimension columns copied inline
    customer_name    String,
    customer_country LowCardinality(String),
    customer_tier    LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (order_date, customer_id);

INSERT INTO orders_flat
SELECT
    o.order_id,
    o.order_date,
    o.amount,
    o.customer_id,
    c.name,
    c.country,
    c.tier
FROM orders_raw AS o
LEFT JOIN customers AS c ON c.customer_id = o.customer_id;

See the CREATE TABLE reference for column and engine options. Use LowCardinality(String) for repeated dimension values — it is the single biggest win for storage and filter speed in a flattened table.

2. Flatten continuously with a materialized view

To flatten as rows arrive rather than in a batch job, attach an incremental materialized view to the fact-table inserts. The MV's SELECT enriches each new block with dimension columns and writes the wide rows to a target table:

CREATE TABLE orders_flat
(
    order_id     UInt64,
    order_date   Date,
    amount       Decimal(10, 2),
    customer_id  UInt64,
    customer_name    String,
    customer_country LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (order_date, customer_id);

CREATE MATERIALIZED VIEW orders_flat_mv TO orders_flat AS
SELECT
    o.order_id,
    o.order_date,
    o.amount,
    o.customer_id,
    c.name    AS customer_name,
    c.country AS customer_country
FROM orders_raw AS o
LEFT JOIN customers AS c ON c.customer_id = o.customer_id;

A key gotcha: an incremental MV only sees the rows from the insert that triggered it (the left side). The dimension values are resolved using whatever is in customers at insert time and are frozen into the wide row. If a customer's country later changes, already-flattened rows keep the old value. For most event/log data that "value as of the event" semantics is exactly what you want; if it is not, prefer a refreshable MV or dictionary approach.

3. Batch reload with a refreshable materialized view

When the whole flattened object should be periodically rebuilt — and small staleness is acceptable — a refreshable materialized view runs the denormalizing query on a schedule and replaces the target:

CREATE MATERIALIZED VIEW orders_flat_rmv
REFRESH EVERY 10 MINUTE
ENGINE = MergeTree
ORDER BY (order_date, customer_id)
AS
SELECT
    o.order_id,
    o.order_date,
    o.amount,
    o.customer_id,
    c.name    AS customer_name,
    c.country AS customer_country
FROM orders_raw AS o
LEFT JOIN customers AS c ON c.customer_id = o.customer_id;

Refreshable MVs are the cleanest fit for the canonical denormalization use case: dimensions that change slowly and a flattened table that can tolerate a few minutes of lag. External orchestration (dbt, Airflow) populating a new table and swapping it in atomically with EXCHANGE TABLES achieves the same effect with more control.

Flattening with Dictionaries Instead of Copying Columns

A lighter-weight variant keeps the fact table narrow but exposes dimension attributes through a dictionary, so queries still avoid a JOIN. You can either call dictGet at query time or materialize the lookup into a column whose DEFAULT is a dictionary call:

CREATE DICTIONARY customer_dict
(
    customer_id UInt64,
    country     String,
    tier        String
)
PRIMARY KEY customer_id
SOURCE(CLICKHOUSE(TABLE 'customers'))
LIFETIME(MIN 300 MAX 600)
LAYOUT(HASHED());

ALTER TABLE orders_raw
    ADD COLUMN customer_country String
    DEFAULT dictGetString('customer_dict', 'country', customer_id);

This gives JOIN-free reads while storing the dimension once (in memory, in the dictionary) rather than copying it onto every row. The catch is updates: dictGet-as-default is only evaluated on insert, so refreshing already-stored rows after a dimension change requires SYSTEM RELOAD DICTIONARY followed by an ALTER TABLE ... UPDATE, which rewrites the whole column and should not be run frequently. The full mechanics and a safer update workflow are covered in the dictionary enrichment update pattern.

One-to-Many: Flattening with Arrays

For one-to-many relationships you do not have to explode rows. ClickHouse can fold the "many" side into array (or Array(Tuple) / Nested) columns on the "one" row, using groupArray() during the insert:

INSERT INTO customers_flat
SELECT
    c.customer_id,
    c.name,
    groupArray(o.order_id)   AS order_ids,
    groupArray(o.amount)     AS order_amounts
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.name;

Keep array sizes bounded — folding thousands of child rows into a single parent row hurts compression and merges. For querying these structures back out, see ARRAY JOIN.

Common Issues

Update write-amplification. The headline risk of any flattened table: one dimension change can force a rewrite of every fact row (or whole column) that copied it. Design for batch reloads or "value as of event" semantics so you rarely have to update in place.
Stale dimension values. Insert-time flattening freezes the dimension value. That is correct for historical/event data but wrong if you expect the wide table to reflect the current dimension state. Choose refreshable MVs or dictionaries when you need freshness.
Skipping LowCardinality. A flattened table repeats dimension values on every row. Without LowCardinality(String) on those columns, storage and filter cost are far higher than necessary.
Deduplication on reload. If you flatten with ReplacingMergeTree to absorb re-inserts, remember that deduplication is asynchronous and queries need FINAL (or aggregation) to see a consistent result. See the ReplacingMergeTree duplicate gotcha.

Best Practices

Try the alternatives first. Benchmark native JOINs and dictionaries before physically flattening. Modern ClickHouse joins are fast, and a star schema is more maintainable.
Flatten only stable dimensions. The pattern shines for slowly-changing or append-mostly data; it fights you on fast-moving, frequently-updated dimensions.
Reload in batch, don't update in place. Periodically rebuilding the whole flattened object (refreshable MV or atomic table swap) is almost always simpler and cheaper than chasing individual updates.
Use LowCardinality and a query-aligned ORDER BY. Sort the wide table by the columns your queries filter on most; the primary key matters more once there is no JOIN to hide behind. See the MergeTree guide.
Avoid flattening many-to-many and high-cardinality nested data. Keep those normalized or use arrays only when the child count per row is small.

How Pulse Helps

Choosing between a star schema, a flattened table, dictionaries, and materialized views is one of the highest-leverage data-modeling decisions in a ClickHouse deployment — and the wrong choice surfaces later as slow dashboards or runaway mutation load. Pulse is a managed ClickHouse service whose engineers review your schemas and query patterns, identify where denormalization will actually pay off (and where it will hurt), and help implement the right flattening approach — insert-time pre-joins, refreshable MVs, or dictionary enrichment — with monitoring to catch update write-amplification before it becomes a production problem.

Frequently Asked Questions

Q: Does flattening always make queries faster?

For read-heavy queries against stable dimensions, yes — eliminating the JOIN removes a real cost. But it shifts that cost to writes and makes updates expensive, and modern ClickHouse JOINs are fast enough that a normalized star schema is often within your latency budget. Measure both before committing.

Q: Flattened table or dictionary — which should I use?

If the dimension is small and slowly changing, prefer a dictionary: it gives JOIN-free dictGet lookups while storing each dimension value once, and it is far easier to update than a physically flattened column. Reach for a fully flattened table when you need every column inline for the lowest possible scan latency or when the dimension is too large to hold in memory as a dictionary.

Q: How do I keep a flattened table in sync when a dimension changes?

The cleanest answer is to not update in place. Periodically rebuild the flattened object with a refreshable materialized view or an atomic table swap. If you must update, an ALTER TABLE ... UPDATE mutation rewrites whole data parts — every row in any part that contains a matching row is rewritten — correct, but heavy, so run it rarely.

Q: Won't repeating dimension values everywhere blow up my storage?

Less than you would expect. ClickHouse's columnar compression handles repeated values very well, and wrapping repeated dimension columns in LowCardinality(String) stores them as a dictionary-encoded column. The dominant cost of flattening is usually the write/update path, not disk.

Q: Can I flatten a one-to-many relationship without multiplying rows?

Yes. Use groupArray() (or Nested / Array(Tuple) columns) to fold the "many" side into arrays on the "one" row, then read them back with ARRAY JOIN. Keep the per-row array size small — thousands of child entries per parent row is an anti-pattern.

Q: Is a flattened wide table the same as a ClickHouse projection?

No. A projection stores an alternative sort order or aggregation of the same table's columns and is chosen automatically by the optimizer. A flattened table physically combines columns from multiple tables to avoid JOINs. They solve different problems and can be used together.