ClickHouse CREATE TABLE: Engines, ORDER BY, and Best Practices

Q: Can I change `ORDER BY` after a table exists?

You can ALTER TABLE ... MODIFY ORDER BY to extend the key, but only with columns that are being added by an ADD COLUMN clause in the same ALTER statement and that have no default value. You cannot add existing columns to the key, and you cannot remove or reorder the existing key columns without rebuilding the table. For a real reorder, the pattern is to create a new table and INSERT INTO new SELECT FROM old .

Q: Do I need `PARTITION BY` if I have `ORDER BY`?

They serve different purposes. ORDER BY is mandatory for MergeTree (the sort key drives the index). PARTITION BY is optional but strongly recommended for time-series and retention workloads, because it makes deleting old data efficient (drop a whole partition rather than running a mutation).

Q: What's the difference between `MergeTree` and `ReplicatedMergeTree`?

MergeTree stores data on a single replica. ReplicatedMergeTree synchronizes data across replicas via ClickHouse Keeper. On ClickHouse Cloud, every table is implicitly replicated, so you usually write MergeTree and the platform substitutes the replicated form.

Q: How do I create a table that mirrors data from Kafka?

You use two tables and a materialized view: a Kafka engine table that reads the stream, a MergeTree table that stores the data, and a MATERIALIZED VIEW that pipes from one to the other. See the materialized view guide for the pattern.

Q: Can TTL aggregate old data instead of deleting it?

Yes. TTL expr GROUP BY ... SET ... downsamples aged rows in place during merges, so you keep raw data for the recent window and rollups for the old one in a single table. The GROUP BY columns must be a prefix of the ORDER BY . Use RECOMPRESS CODEC(...) in the same way if you only want to shrink old data rather than aggregate it.

Q: Can I create a table from an INSERT-style query in one go?

Yes. CREATE TABLE x ENGINE = MergeTree ORDER BY id AS SELECT ... infers the schema from the SELECT and immediately materializes the result. Combine with EMPTY AS to create the schema without the data.

To create a table in ClickHouse, write CREATE TABLE [db.]name (columns) ENGINE = <engine> .... The engine determines storage format, indexing, replication, and how the data is queried, so it is the most important decision you make at creation time. For analytical workloads the default and most useful engine is MergeTree or one of its variants. This guide covers the full CREATE TABLE syntax, the variations (CREATE AS, CLONE AS, CREATE OR REPLACE), the column modifiers that matter in production, ORDER BY and PARTITION BY choices, codecs and TTL, and how distributed creation works.

The Minimum Useful Statement

CREATE TABLE analytics.events
(
    event_time   DateTime64(3),
    user_id      UInt64,
    event_type   LowCardinality(String),
    properties   String
)
ENGINE = MergeTree
ORDER BY (event_time, user_id);

ENGINE and ORDER BY are the two clauses you almost always want to set explicitly. The MergeTree engine has no parameters; ORDER BY is the sort order used both for the primary key index and for on-disk layout. If you genuinely do not need ordering, write ORDER BY tuple(), which is the documented way to declare a heap.

Full CREATE TABLE Syntax

CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster_name]
(
    name1 [type1] [DEFAULT|MATERIALIZED|EPHEMERAL|ALIAS expr] [CODEC(codec)] [TTL expr] [COMMENT '...'],
    name2 ...,
    [INDEX idx_name expr TYPE type GRANULARITY n],
    [PROJECTION proj_name (SELECT ...)],
    [CONSTRAINT name CHECK expr]
)
ENGINE = engine
[ORDER BY expr]
[PARTITION BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr]
[TTL expr [DELETE|TO DISK 'name'|TO VOLUME 'name'|GROUP BY ...] [WHERE ...]]
[SETTINGS name = value, ...]
[COMMENT '...'];

Every storage clause after ENGINE is optional, but in practice you want ORDER BY on any MergeTree table, and PARTITION BY on any large one.

Variants

-- Empty table with the same schema as another
CREATE TABLE backup AS production.events;

-- Same schema, copy the data too (hard-links segments, very fast)
CREATE TABLE backup CLONE AS production.events;

-- From a SELECT result. Schema inferred from the query.
CREATE TABLE high_value_users
ENGINE = MergeTree
ORDER BY user_id
AS SELECT user_id, sum(revenue) AS total FROM events GROUP BY user_id;

-- From a table function (S3, URL, MySQL, etc.)
CREATE TABLE s3_imports AS s3('https://bucket.s3.amazonaws.com/data.parquet', 'Parquet');

-- Replace if it exists, atomic
CREATE OR REPLACE TABLE production.events ( ... ) ENGINE = MergeTree ORDER BY (...);

CREATE TABLE AS table copies the schema, not the data. CREATE TABLE CLONE AS copies both, very efficiently because it hard-links the underlying data parts rather than reading and re-writing them.

Column Modifiers

ClickHouse has four column types that go beyond plain name Type:

CREATE TABLE example
(
    id           UInt64,
    -- DEFAULT: filled in if not provided at INSERT time
    created_at   DateTime DEFAULT now(),
    -- MATERIALIZED: always computed at write time; cannot be set by INSERT
    day          Date MATERIALIZED toDate(created_at),
    -- EPHEMERAL: usable in DEFAULT/MATERIALIZED expressions but never stored or returned
    raw_payload  String EPHEMERAL '',
    -- ALIAS: computed at query time, never stored
    age_seconds  UInt32 ALIAS dateDiff('second', created_at, now())
)
ENGINE = MergeTree
ORDER BY id;

The trade-offs:

DEFAULT is the usual choice when a value is sometimes supplied by the writer.
MATERIALIZED is great for derived storage that should always be consistent (day MATERIALIZED toDate(ts)).
EPHEMERAL lets clients send a value that influences other columns without being stored.
ALIAS is a query-time convenience with zero storage cost. Use it for derived columns you only sometimes select.

SELECT * does not include MATERIALIZED or ALIAS columns by default. You have to name them explicitly or enable asterisk_include_materialized_columns / asterisk_include_alias_columns.

Compression Codecs

CREATE TABLE compressed_example
(
    event_time DateTime64(3) CODEC(DoubleDelta, ZSTD(3)),
    user_id    UInt64 CODEC(Delta, LZ4),
    payload    String CODEC(ZSTD(6))
)
ENGINE = MergeTree
ORDER BY (event_time, user_id);

Codecs are applied left to right and the last one is almost always a general-purpose compressor (LZ4 for speed, ZSTD for ratio). The first codec is often a domain-specific transform (Delta, DoubleDelta, Gorilla, T64) that shrinks the data before general compression. ClickHouse also supports encryption codecs (AES_128_GCM_SIV, AES_256_GCM_SIV) for encrypting columns at rest.

A useful default for time-series columns is CODEC(DoubleDelta, ZSTD). Numeric metric columns often benefit from CODEC(Gorilla, ZSTD).

Choosing a Codec

Codecs come in two groups: specialized transforms that exploit the shape of your data, and general-purpose compressors that run last. The mental model is "transform first, compress second": a transform turns the raw bytes into something a general compressor squeezes far better.

Codec	Best for	What it does
`Delta(n)`	Monotonic or slowly-moving integers (counters, IDs)	Stores the difference between neighbouring values. A preprocessing codec only; always pair it with `LZ4`/`ZSTD`.
`DoubleDelta`	Timestamps and evenly-spaced sequences	Stores the delta of deltas, so a constant-stride series compresses to almost nothing.
`GCD`	Integers/Decimals that move in multiples of a common factor (e.g. prices in cents, fixed-step sensors)	Divides each value by the greatest common divisor of the column before the next codec runs.
`Gorilla`	Floating-point gauges that change slowly (temperatures, utilization)	XORs each float against the previous one and packs the result.
`FPC`	Floating-point series, often faster than Gorilla for 64-bit doubles	Predicts the next value and XORs against the prediction.
`T64`	Bounded integers, `Date`, `DateTime`, `Enum` that do not use the full type width	Crops unused high bits across a block of 64 values.

A practical decision path:

Timestamps / DateTime / DateTime64 → CODEC(DoubleDelta, ZSTD(1)). The deltas between rows are usually constant or near-constant.
Monotonic integer IDs or counters → CODEC(Delta, ZSTD(1)), or CODEC(T64, ZSTD) if the values are bounded and not strictly monotonic.
Slowly-changing float metrics → CODEC(Gorilla, ZSTD) (or FPC for wide doubles).
High-cardinality String / JSON / blobs → CODEC(ZSTD(3)) and up; there is no useful pre-transform, so lean on ZSTD level. Reserve high levels (6+) for cold or rarely-read columns, because they cost CPU on every read.
LowCardinality(String) and small enums → leave the default. The dictionary already does most of the work; an extra codec rarely pays off.

CREATE TABLE metrics
(
    ts          DateTime64(3) CODEC(DoubleDelta, ZSTD(1)),
    sensor_id   UInt32        CODEC(T64, ZSTD(1)),
    seq         UInt64        CODEC(Delta, ZSTD(1)),
    temperature Float64       CODEC(Gorilla, ZSTD(1)),
    raw_event   String        CODEC(ZSTD(3))
)
ENGINE = MergeTree
ORDER BY (sensor_id, ts);

Two rules of thumb: the general-purpose codec almost always goes last, and a specialized transform on its own (e.g. CODEC(Delta) with nothing after it) usually expands data rather than shrinking it, because it only reshapes the bytes without entropy-coding them. Codecs are per-column and can be changed later with ALTER TABLE ... MODIFY COLUMN ... CODEC(...), which rewrites the column on the next merge — see ALTER TABLE. To measure the payoff, compare data_compressed_bytes against data_uncompressed_bytes per column in system.columns.

ORDER BY and PARTITION BY

ORDER BY defines how data is sorted on disk and what the primary key index looks like. ClickHouse uses a sparse index, by default one entry per 8192 rows (index_granularity = 8192), so the index is small even for very large tables.

The standard guidance: put the most common filter columns first, ordered from low to high cardinality, then any time column last. For an events table queried with WHERE customer_id = X AND event_time > Y, ORDER BY (customer_id, event_time) is far better than ORDER BY (event_time, customer_id).

PARTITION BY controls how data is split into manageable on-disk parts. Common patterns:

PARTITION BY toYYYYMM(event_time)  -- monthly partitions; recommended for time-series
PARTITION BY toDate(event_time)    -- daily; only if individual days are huge

The official guidance is explicit: do not partition by a high-cardinality identifier. "Don't partition your data by client identifiers or names (instead, make client identifier or name the first column in the ORDER BY expression)." A partition per customer turns every query into a fan-out across thousands of tiny parts and crushes performance.

PRIMARY KEY is optional. If omitted, ClickHouse uses the ORDER BY as the primary key. If specified, it must be a prefix of the ORDER BY (so the on-disk sort still satisfies the index).

TTL: Automatic Retention

CREATE TABLE events
(
    event_time DateTime,
    user_id    UInt64,
    payload    String
)
ENGINE = MergeTree
ORDER BY (user_id, event_time)
TTL event_time + INTERVAL 90 DAY DELETE,
    event_time + INTERVAL 30 DAY TO VOLUME 'cold';

Table-level TTL drops old data or moves it between storage tiers. Column-level TTL can blank out a single column after a deadline (payload String TTL event_time + INTERVAL 7 DAY), useful for PII retention. ClickHouse evaluates TTLs in the background during merges, so the actual cleanup lags a little behind the threshold. A TTL clause supports four actions — DELETE (the default), TO DISK/TO VOLUME, GROUP BY, and RECOMPRESS — and you can list several of them in one statement, separated by commas.

TTL GROUP BY: Downsampling Old Data

Instead of deleting aged rows, TTL ... GROUP BY ... SET ... rolls them up in place during merges. This is how you keep raw granularity for recent data and coarse aggregates for old data in a single table, without an external pipeline.

CREATE TABLE metrics_rollup
(
    sensor_id  UInt32,
    ts         DateTime,
    value      Float64,
    value_max  Float64 DEFAULT value
)
ENGINE = MergeTree
ORDER BY (sensor_id, toStartOfHour(ts), ts)
TTL ts + INTERVAL 30 DAY
    GROUP BY sensor_id, toStartOfHour(ts)
    SET value     = avg(value),
        value_max = max(value_max),
        ts        = min(ts);

After 30 days, every group of rows that shares (sensor_id, toStartOfHour(ts)) collapses into one row, with the SET expressions defining how each remaining column is aggregated. The critical constraint: the GROUP BY columns must be a prefix of the table's ORDER BY (or PRIMARY KEY if you defined one separately). You cannot skip a key column in the middle. You can chain multiple TTL rules to build a cascading rollup — for example hourly buckets after a day, then daily buckets after a month — by listing several expr GROUP BY ... clauses separated by commas.

TTL RECOMPRESS: Cheaper Storage for Cold Data

RECOMPRESS re-encodes aging parts with a higher-ratio codec. Recent data stays on a fast codec for write/read speed; old data is squeezed harder to save disk.

CREATE TABLE logs
(
    ts      DateTime,
    level   LowCardinality(String),
    message String CODEC(ZSTD(1))
)
ENGINE = MergeTree
ORDER BY ts
TTL ts + INTERVAL 7 DAY  RECOMPRESS CODEC(ZSTD(6)),
    ts + INTERVAL 30 DAY RECOMPRESS CODEC(ZSTD(12))
SETTINGS merge_with_recompression_ttl_timeout = 14400;

Parts crossing each threshold get rewritten with the named codec on the next eligible merge. merge_with_recompression_ttl_timeout (default 14400 seconds, i.e. 4 hours) is the minimum delay between recompression merges; the parallel setting for delete/move merges is merge_with_ttl_timeout. Do not set either below ~300 seconds or you trigger a storm of off-schedule merges.

Modifying and Removing TTL

TTL is metadata you can change on a live table with ALTER TABLE:

-- Replace the table TTL
ALTER TABLE events MODIFY TTL event_time + INTERVAL 180 DAY DELETE;

-- Change a single column's TTL
ALTER TABLE events MODIFY COLUMN payload String TTL event_time + INTERVAL 14 DAY;

-- Drop the table TTL entirely
ALTER TABLE events REMOVE TTL;

MODIFY TTL only updates the stored expression; it does not rewrite existing parts immediately. To force the new rule to apply to data already on disk, run ALTER TABLE events MATERIALIZE TTL. By default, MATERIALIZE TTL triggers a full re-merge of all affected parts (materialize_ttl_recalculate_only = 0). Set materialize_ttl_recalculate_only = 1 to limit the operation to recalculating TTL bookkeeping without re-merging, which is faster but does not immediately remove or move expired data. See ALTER TABLE for the full mechanics of modifying and materializing TTL. If you mostly want to expire whole partitions rather than individual rows, ttl_only_drop_parts = 1 makes TTL drop entire parts once every row in them has expired, which is far cheaper than rewriting parts to remove rows — and it pairs naturally with a time-based PARTITION BY. TTL is generally lighter-weight than mutation-based deletes for retention.

Checking TTL Settings

To see what TTL a table has, read the DDL:

SHOW CREATE TABLE events;

To see whether parts have actually crossed their thresholds, inspect system.parts, which exposes per-part TTL bookkeeping:

SELECT
    name,
    rows,
    delete_ttl_info_min,
    delete_ttl_info_max,
    move_ttl_info.expression,
    move_ttl_info.min,
    move_ttl_info.max,
    recompression_ttl_info.expression
FROM system.parts
WHERE table = 'events' AND active
ORDER BY delete_ttl_info_max;

delete_ttl_info_max is the timestamp at which a part becomes fully eligible for deletion; the move_ttl_info and recompression_ttl_info nested columns carry the same min/max bookkeeping per move and recompress rule. Comparing these against now() tells you which parts are overdue for a background TTL merge.

ON CLUSTER and Replicated Tables

For a self-managed cluster, the canonical replicated pattern is:

CREATE TABLE analytics.events ON CLUSTER my_cluster
(
    event_time DateTime,
    user_id    UInt64,
    properties String
)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/analytics/events',
    '{replica}'
)
ORDER BY (event_time, user_id)
PARTITION BY toYYYYMM(event_time);

{shard} and {replica} are macros resolved per node, so the same statement creates a coherent set of replicas across the cluster. For ClickHouse Cloud the engine alias ReplicatedMergeTree is implicit and you usually write ENGINE = MergeTree (the platform handles replication for you). See the MergeTree guide for the full engine family.

The Replicated database engine handles the ON CLUSTER plumbing for you on a per-database basis; tables created inside a Replicated database are automatically created on every node.

Indexes, Projections, and Constraints

CREATE TABLE events
(
    event_time DateTime,
    user_id    UInt64,
    properties String,
    INDEX idx_user user_id TYPE minmax GRANULARITY 4,
    INDEX idx_props properties TYPE bloom_filter() GRANULARITY 8,
    PROJECTION by_user (
        SELECT * ORDER BY user_id, event_time
    ),
    CONSTRAINT no_future CHECK event_time <= now() + INTERVAL 1 DAY
)
ENGINE = MergeTree
ORDER BY (event_time, user_id);

Skip indexes (minmax, set, bloom_filter, ngrambf_v1, tokenbf_v1) accelerate filters on columns that are not the primary sort key.
Projections are an alternative sort order plus aggregate that ClickHouse picks automatically when a matching query comes in.
Constraints are enforced at INSERT time. A failed constraint rejects the whole insert.

Common Mistakes

No ORDER BY. ClickHouse will refuse to create a MergeTree table without one. Use ORDER BY tuple() if you genuinely do not have a sort key.
High-cardinality PARTITION BY. A partition per customer is the most common production foot-gun. See too-many-parts symptoms.
PARTITION BY and ORDER BY confused. Partitions split data into separate file groups; ordering sorts inside each partition. Both matter, but they do different work.
Forgetting to set replicas on a clustered table. A non-replicated MergeTree on one node of a cluster is invisible to the rest.
Treating MATERIALIZED like DEFAULT. MATERIALIZED columns cannot be set by INSERT, period. If clients should be able to override the value, use DEFAULT.

How Pulse Helps With Table Design

The choices you bake into CREATE TABLE, especially the ORDER BY and PARTITION BY, are extremely hard to change later. The remediation is almost always a rebuild plus migration, which is painful on multi-TB tables. Pulse continuously profiles ClickHouse tables and surfaces ones with bad partitioning (too many parts, partition count exploding), ORDER BY choices that do not match real query patterns, missing codecs, missing TTL on what is clearly a time-series table, and tables that should be replicated but are not. Pulse also tracks merge backlog and part-count growth per table so you can see when a design decision is going to bite before the cluster slows down. Connect your ClickHouse cluster to Pulse and let it watch table-design health automatically.

Frequently Asked Questions

Q: Can I change ORDER BY after a table exists?

You can ALTER TABLE ... MODIFY ORDER BY to extend the key, but only with columns that are being added by an ADD COLUMN clause in the same ALTER statement and that have no default value. You cannot add existing columns to the key, and you cannot remove or reorder the existing key columns without rebuilding the table. For a real reorder, the pattern is to create a new table and INSERT INTO new SELECT FROM old.

Q: Do I need PARTITION BY if I have ORDER BY?

They serve different purposes. ORDER BY is mandatory for MergeTree (the sort key drives the index). PARTITION BY is optional but strongly recommended for time-series and retention workloads, because it makes deleting old data efficient (drop a whole partition rather than running a mutation).

Q: What's the difference between MergeTree and ReplicatedMergeTree?

MergeTree stores data on a single replica. ReplicatedMergeTree synchronizes data across replicas via ClickHouse Keeper. On ClickHouse Cloud, every table is implicitly replicated, so you usually write MergeTree and the platform substitutes the replicated form.

Q: Should I set index_granularity explicitly?

The default (8192) is correct for nearly every workload. Lower it (e.g. 2048) only if you have very selective point queries on a large table and benchmark shows it helps; raising it almost never helps in practice.

Q: How do I create a table that mirrors data from Kafka?

You use two tables and a materialized view: a Kafka engine table that reads the stream, a MergeTree table that stores the data, and a MATERIALIZED VIEW that pipes from one to the other. See the materialized view guide for the pattern.

Q: Can TTL aggregate old data instead of deleting it?

Yes. TTL expr GROUP BY ... SET ... downsamples aged rows in place during merges, so you keep raw data for the recent window and rollups for the old one in a single table. The GROUP BY columns must be a prefix of the ORDER BY. Use RECOMPRESS CODEC(...) in the same way if you only want to shrink old data rather than aggregate it.

Q: What happens if I omit ENGINE?

ClickHouse rejects the statement. Unlike PostgreSQL or MySQL, there is no implicit default engine for CREATE TABLE.

Q: Can I create a table from an INSERT-style query in one go?

Yes. CREATE TABLE x ENGINE = MergeTree ORDER BY id AS SELECT ... infers the schema from the SELECT and immediately materializes the result. Combine with EMPTY AS to create the schema without the data.