Using Projections with ClickHouse: Pre-Aggregation and Alternative Sort Orders

Projections in ClickHouse let you store the same data in multiple sort orders or pre-aggregated forms within a single table. When the query optimizer detects that a projection can serve a query more efficiently than the base table, it uses the projection automatically — without any changes to your SQL.

Think of projections as hidden, automatically maintained alternative views of your table data. They're one of ClickHouse's most powerful optimization features, but they come with trade-offs you need to understand.

When Projections Help

ClickHouse tables have a single primary key that determines sort order. Queries that filter or sort by columns not in the primary key must scan significantly more data. Projections solve two specific problems:

1. Queries That Filter on Non-Primary-Key Columns

If your table is sorted by (event_date, user_id) but you frequently query by country, those queries scan the entire table. A projection sorted by country enables efficient range scans for country-based filters.

2. Queries That Aggregate Data Repeatedly

If your dashboard runs SELECT date, SUM(amount) FROM sales GROUP BY date thousands of times per day, each execution re-scans and re-aggregates the raw data. A pre-aggregation projection stores the grouped result and serves it directly.

Creating Projections

In a Table Definition

You can define projections when creating a table:

CREATE TABLE events
(
    event_date Date,
    user_id UInt64,
    country String,
    event_type String,
    amount Decimal(10, 2),
    PROJECTION events_by_country
    (
        SELECT * ORDER BY country
    ),
    PROJECTION daily_totals
    (
        SELECT
            event_date,
            event_type,
            sum(amount) AS total_amount,
            count() AS event_count
        GROUP BY event_date, event_type
    )
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);

Projections defined at table creation time automatically apply to all inserted data.

Adding to an Existing Table

You can add projections to tables that already contain data:

-- Add the projection definition
ALTER TABLE events
    ADD PROJECTION events_by_country
    (
        SELECT * ORDER BY country
    );

-- Materialize for existing data
ALTER TABLE events
    MATERIALIZE PROJECTION events_by_country;

The ADD PROJECTION statement only adds the metadata — new inserts will include the projection, but existing data won't have it. You must run MATERIALIZE PROJECTION to build the projection for existing parts. This is a background operation and can take time on large tables.

Removing Projections

-- Remove projection and delete files
ALTER TABLE events
    DROP PROJECTION events_by_country;

-- Delete projection files but keep metadata
ALTER TABLE events
    CLEAR PROJECTION events_by_country;

How Projections Work Internally

Projections are stored within each data part. When ClickHouse writes a new part during an insert, it simultaneously builds the projection data for that part. Internally, a projection is essentially a hidden table nested inside the part directory.

During query execution:

The query optimizer analyzes the query
For each table part, it evaluates whether a projection can serve the query with less data scanning
If a projection wins, ClickHouse reads from the projection's data instead of the base table
Different parts can use different projections (or no projection) — the optimizer decides per-part

This per-part decision is important: if you add a projection to an existing table but don't materialize it, only newly inserted parts will have the projection. Old parts fall back to the base table data.

Projection vs. Materialized View

Both projections and materialized views can pre-aggregate data or provide alternative access paths. Here's how to choose:

Aspect	Projection	Materialized View
Storage	Embedded within the source table's parts	Separate target table
Query routing	Automatic — optimizer selects transparently	Manual — you query the MV's target table directly
Schema flexibility	Must use columns from the source table	Can JOIN, transform, and write to any target schema
Maintenance	Automatically kept in sync during inserts and merges	Triggered on insert; can diverge if source data is mutated
Backfill	`MATERIALIZE PROJECTION` for existing data	Must re-insert or rebuild manually
Best for	Alternative sort orders, simple aggregations	Complex transformations, cross-table aggregations, feeding separate tables

Rule of thumb: Use projections when you need a different sort order or a simple GROUP BY on the same table. Use materialized views when you need transformations, JOINs, or a fully separate destination table.

Practical Examples

Alternative Sort Order for Filtering

Your table is sorted by (timestamp, service_name) but you frequently search by trace_id:

ALTER TABLE traces
    ADD PROJECTION traces_by_trace_id
    (
        SELECT * ORDER BY trace_id
    );

ALTER TABLE traces
    MATERIALIZE PROJECTION traces_by_trace_id;

Now queries like SELECT * FROM traces WHERE trace_id = 'abc123' use the projection automatically instead of scanning the entire table.

Pre-Aggregation for Dashboard Queries

Your dashboard repeatedly queries hourly metrics:

ALTER TABLE raw_metrics
    ADD PROJECTION hourly_stats
    (
        SELECT
            toStartOfHour(timestamp) AS hour,
            metric_name,
            avg(value) AS avg_value,
            max(value) AS max_value,
            min(value) AS min_value,
            count() AS sample_count
        GROUP BY hour, metric_name
    );

ALTER TABLE raw_metrics
    MATERIALIZE PROJECTION hourly_stats;

Queries that match this GROUP BY pattern will read pre-aggregated data — potentially reducing data scanned by orders of magnitude.

Checking if Projections Are Being Used

Use EXPLAIN to verify the optimizer is selecting your projection:

EXPLAIN
SELECT
    toStartOfHour(timestamp) AS hour,
    metric_name,
    avg(value)
FROM raw_metrics
GROUP BY hour, metric_name;

Look for ReadFromMergeTree (Projection: hourly_stats) in the output. If you see ReadFromMergeTree without a projection name, the optimizer chose the base table instead.

Limitations and Gotchas

MergeTree family only: Projections require MergeTree-family table engines.
Storage overhead: Each projection stores a copy of the data (in its own sort order or aggregated form). A projection that selects all columns with a different ORDER BY roughly doubles storage for the table.
Insert overhead: Every insert must build all defined projections, which increases write latency and CPU usage. Don't create projections you don't actively need.
Merge behavior: Since ClickHouse v24.8, the deduplicate_merge_projection_mode setting controls what happens to projections during background merges. The default (throw) can cause merge failures if projections become inconsistent. Consider setting this to rebuild for production:
```
ALTER TABLE events
    MODIFY SETTING deduplicate_merge_projection_mode = 'rebuild';
```
Mutation interaction: Lightweight deletes and mutations can invalidate projection data. Since v24.7, lightweight_mutation_projection_mode controls this behavior — rebuild keeps projections consistent, drop discards them.
No partial column projections for aggregation: An aggregation projection must include all GROUP BY columns. The optimizer won't use a projection if the query's GROUP BY doesn't match.
Optimizer may not always choose the projection: The optimizer uses heuristics based on estimated data size. In some cases, it may incorrectly prefer the base table. Test with EXPLAIN and consider SETTINGS force_optimize_projection = 1 for debugging (not production).

Best Practices

Start with primary key optimization: Projections are a secondary optimization. Make sure your primary key already matches your most common query patterns before adding projections.
Monitor storage impact: Check projection sizes with SELECT name, formatReadableSize(bytes_on_disk) FROM system.projection_parts WHERE active.
Limit the number of projections: Each projection adds write-time overhead. Two to three projections per table is reasonable; more than five should raise questions.
Materialize after adding: Always run MATERIALIZE PROJECTION after ADD PROJECTION on existing tables, otherwise the projection only covers new data.
Verify with EXPLAIN: After creating a projection, run your target query with EXPLAIN to confirm the optimizer is using it. Don't assume — test.
Use aggregation projections for hot dashboards: If a specific GROUP BY query runs hundreds of times per day, an aggregation projection can reduce load dramatically.

Frequently Asked Questions

Q: Do projections automatically stay in sync with table data?

Yes, for inserts. New data written to the table automatically includes projection data. However, mutations (UPDATE/DELETE) may invalidate projections depending on your lightweight_mutation_projection_mode setting. After mutations, you may need to re-materialize.

Q: Can I use projections with ReplicatedMergeTree?

Yes. Projection operations (ADD, DROP, MATERIALIZE, CLEAR) replicate across nodes via ClickHouse Keeper or ZooKeeper, just like other metadata changes.

Q: How much extra storage do projections use?

An ORDER BY projection on all columns roughly doubles storage for the covered data. An aggregation projection is typically much smaller — often 1–5% of the source data size, depending on the aggregation's reduction factor.

Q: Do projections work with ClickHouse Cloud's SharedMergeTree?

Yes. Projections are supported on SharedMergeTree in ClickHouse Cloud and work the same way as on standard MergeTree.

Q: Can projections replace indexes?

Not exactly. Projections provide alternative sort orders (similar to covering indexes in traditional databases), but they store full data copies rather than lightweight index structures. Data skipping indexes (bloom_filter, minmax, set) are lighter-weight alternatives for simple filter acceleration.

Q: Why isn't the optimizer using my projection?

Common reasons: the query's GROUP BY or ORDER BY doesn't exactly match the projection definition, the projection hasn't been materialized for existing parts, or the optimizer estimates the base table is faster. Use EXPLAIN to diagnose, and test with SETTINGS force_optimize_projection = 1 to force projection usage for comparison.