If you're coming from PostgreSQL or MySQL, you'll expect a primary key to behave like a B-tree: point lookups in O(log n), one index entry per row, and maybe a clustered index that keeps the rows themselves in key order. ClickHouse works nothing like that. Its indexing model is designed around the assumption that you're scanning massive amounts of columnar data sequentially, and the goal is to skip as much of it as possible, not to find individual rows in microseconds. Misunderstanding this distinction is the single most common reason engineers end up with slow ClickHouse queries.
The Sparse Primary Index
Every MergeTree table has an ORDER BY clause, and unless you declare a separate PRIMARY KEY, that clause is the primary index. When you write data, ClickHouse sorts it on disk by those columns within each data part. If you do specify both, PRIMARY KEY must be a prefix of ORDER BY. This lets you sort on more columns than you index, which can improve compression without inflating the index.
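As a sketch (the table and column names here are invented for illustration), a table that sorts by three columns but indexes only the first two might look like:
CREATE TABLE page_views
(
site_id UInt32,
event_time DateTime,
url String,
user_id UInt64
)
ENGINE = MergeTree
ORDER BY (site_id, event_time, user_id) -- how rows are sorted on disk
PRIMARY KEY (site_id, event_time);      -- what primary.idx actually stores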
ClickHouse does not create one index entry per row. It creates one entry per granule. A granule is the minimum unit of data that ClickHouse reads from storage - by default 8192 rows. Each granule boundary gets one entry in the primary.idx file, storing the primary key values of the first row in that granule. The result is a small, flat array that ClickHouse keeps entirely in memory. On disk it is stored compressed by default (since ClickHouse 23.5), which reduces storage and I/O on distributed or remote-storage setups. For a billion-row table sorted by (date, user_id), the primary index might be only a few MB.
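You can check this for your own tables via system.parts; against the page_views sketch above (adjust the table name), the in-memory primary index and the mark files are typically tiny compared to the data:
SELECT
table,
sum(rows) AS rows,
formatReadableSize(sum(primary_key_bytes_in_memory)) AS pk_in_memory,
formatReadableSize(sum(marks_bytes)) AS marks_size
FROM system.parts
WHERE table = 'page_views' AND active
GROUP BY table;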
When a query arrives with a WHERE clause that touches the primary key prefix, ClickHouse performs a binary search over the in-memory index marks to find the range of granules that could contain matching rows. It then reads only those granules from the per-column .bin files, guided by .mrk mark files that record each granule's byte offset within the column data. Columns that appear in SELECT or WHERE but are not part of the primary key are still fetched only for the selected granules, not the whole table. A query filtering on the leading key column eliminates everything outside the matching range with zero I/O to the column data.
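You can watch this pruning happen with EXPLAIN. Using the hypothetical page_views table from above, indexes = 1 makes ClickHouse report how many parts and granules survive the primary-key analysis:
EXPLAIN indexes = 1
SELECT count()
FROM page_views
WHERE site_id = 42 AND event_time >= '2024-05-01 00:00:00';
The PrimaryKey section of the output shows selected granules versus total granules - the smaller the selected count, the better the key fits the query.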
Granules, Marks, and What Actually Happens at Query Time
Each column has its own .bin file (compressed columnar data) and a corresponding .mrk file (mark file). A mark is a pair of offsets: the position of the compressed block within the .bin file, and the position of the granule's first row within that block after decompression. When ClickHouse decides to read granules 100 through 150, it jumps directly to the appropriate byte positions in the column files using the marks - no sequential scan from the beginning.
The index_granularity setting (default 8192) controls granule size. Smaller granules mean a more precise index that can skip data at finer granularity, but they also mean a larger primary.idx and more mark file entries. Larger granules reduce index overhead but cause more data to be read per match. Most production setups leave this at 8192. ClickHouse also uses adaptive granularity by default (index_granularity_bytes, default 10 MiB): a new granule starts when either 8192 rows are accumulated or the cumulative row size exceeds ~10 MiB, whichever comes first. This means granules on wide-row tables may be smaller than 8192 rows.
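Both knobs live in the table's SETTINGS clause. A sketch with the defaults written out explicitly (the table is hypothetical, and there is rarely a reason to change these values):
CREATE TABLE wide_rows
(
ts DateTime,
payload String
)
ENGINE = MergeTree
ORDER BY ts
SETTINGS index_granularity = 8192, index_granularity_bytes = 10485760;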
You can inspect index activity using system.query_log. The read_rows field tells you how many rows were actually read; comparing it to the table's total row count reveals how effectively the primary index pruned data. A query that reads 0.1% of rows to return its result is using the index well. One that reads 80% probably isn't.
SELECT
query,
read_rows,
read_bytes,
query_duration_ms
FROM system.query_log
WHERE type = 'QueryFinish'
AND query LIKE '%your_table%'
ORDER BY query_duration_ms DESC
LIMIT 10;
Designing the ORDER BY Key
The choice of ORDER BY columns is the most consequential schema decision you'll make in ClickHouse. Get it wrong and no amount of tuning elsewhere will fix it.
The common advice is to put low-cardinality columns first. This is correct in most cases, but the reasoning matters. If you put a high-cardinality column like user_id first, the binary search can still find the right granules for a query filtering on user_id alone - that's fine. The problem is that high-cardinality leading columns break the index for any other column that follows. With ORDER BY (user_id, date), a query filtering on date without specifying user_id cannot use the primary index at all, because rows with any given date are scattered across the entire sort order. The index can only prune a contiguous range of granules, and high-cardinality leading columns produce no contiguous ranges on trailing columns.
The practical pattern for event/log data: put your most-queried low-cardinality filter first (tenant, region, event_type), then a time column, then any higher-cardinality identifiers. A table storing application logs might look like:
CREATE TABLE events
(
tenant_id UInt32,
event_date Date,
event_time DateTime,
user_id UInt64,
session_id String,
event_type LowCardinality(String),
payload String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (tenant_id, event_date, event_time, user_id);
With this schema, a query filtering on tenant_id and a date range will read only the granules in that tenant's date window. A query filtering only on user_id will do a full scan - that's the trade-off. If you need efficient lookups on both patterns, skip indexes or projections are the tools to reach for.
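A projection is the heavier of the two options: it stores a second copy of the data inside each part, sorted by a different key, and queries filtering on that key read the copy instead of the base data. A minimal sketch, assuming the events table above (the projection name is arbitrary):
ALTER TABLE events ADD PROJECTION by_user
(
SELECT * ORDER BY user_id
);
ALTER TABLE events MATERIALIZE PROJECTION by_user;
The trade-off is storage: a SELECT * projection roughly doubles the size of the projected data, so project only the columns those lookups actually need.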
One subtlety: leading with low-cardinality columns also improves compression. Sorted data compresses far better than unsorted data, and columns sorted together tend to have long runs of repeated values. An event table with tenant_id first will have contiguous blocks of the same tenant's data, which compresses extremely well. Putting user_id first produces near-random ordering relative to tenant, destroying those compression gains.
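The effect is easy to measure from system.columns, which tracks compressed and uncompressed sizes per column (the table name is a placeholder):
SELECT
name,
formatReadableSize(data_compressed_bytes) AS compressed,
formatReadableSize(data_uncompressed_bytes) AS uncompressed,
round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE table = 'events'
ORDER BY data_uncompressed_bytes DESC;
Comparing the ratios across two copies of the table with different ORDER BY keys makes the difference concrete.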
Skip Indexes
Skip indexes are secondary structures layered on top of the primary index. Where the primary index lets ClickHouse skip granule ranges based on sort order, skip indexes let ClickHouse skip granules based on precomputed metadata - regardless of whether the column appears in ORDER BY.
ClickHouse stores skip index metadata per block of granules, where the block size is controlled by the GRANULARITY parameter on the index definition. GRANULARITY 4 means the skip index entry covers 4 primary granules (32,768 rows at the default granule size). After the primary index has narrowed down the candidate granule ranges, ClickHouse evaluates skip indexes: if the index for a block proves the filter can't match any row in that block, that block is skipped without reading column data.
There are five main skip index types.

minmax stores the minimum and maximum values of a column per index block. It's the cheapest to build and evaluate, and works well for numeric columns that are loosely correlated with the sort order - timestamps and sequential IDs often qualify.

set(N) stores up to N distinct values per block; if the query's filter value isn't in the set, the block is skipped. It works for low-cardinality columns like status or country where each block holds a small number of distinct values.

bloom_filter(p) is a probabilistic structure that can answer "definitely not present" queries on arbitrary columns, including strings and arrays. False positives are possible (controlled by the optional false-positive rate p, default 0.025), but false negatives are not.

tokenbf_v1(size_bytes, hashes, seed) splits strings into alphanumeric tokens on non-alphanumeric boundaries and stores them in a bloom filter - useful for exact-token lookups within longer strings such as URL path segments or log keywords.

ngrambf_v1(n, size_bytes, hashes, seed) splits strings into n-grams and stores them in a bloom filter, enabling substring matching optimizations. For both bloom-filter variants, size_bytes is the filter size in bytes, hashes is the number of hash functions, and seed is the hash seed.
CREATE TABLE http_logs
(
log_date Date,
timestamp DateTime,
host LowCardinality(String),
status_code UInt16,
request_id String,
trace_id String,
url String,
response_ms UInt32,
INDEX idx_status status_code TYPE set(50) GRANULARITY 4,
INDEX idx_trace trace_id TYPE bloom_filter(0.025) GRANULARITY 1,
INDEX idx_url url TYPE ngrambf_v1(3, 65536, 2, 0) GRANULARITY 2,
INDEX idx_latency response_ms TYPE minmax GRANULARITY 4
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(log_date)
ORDER BY (host, log_date, timestamp);
You can add skip indexes to existing tables and then materialize them:
ALTER TABLE http_logs ADD INDEX idx_request request_id TYPE bloom_filter(0.025) GRANULARITY 1;
ALTER TABLE http_logs MATERIALIZE INDEX idx_request;
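MATERIALIZE INDEX builds the index for existing parts as a background mutation; you can watch its progress in system.mutations:
SELECT command, parts_to_do, is_done
FROM system.mutations
WHERE table = 'http_logs' AND NOT is_done;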
Skip Index Trade-offs
Every skip index adds write overhead. Building the index structures happens during insert and merge operations, and complex index types like ngrambf_v1 over long strings can be measurably expensive at high ingest rates. Skip indexes also consume disk space - a bloom filter over a high-cardinality string column across billions of rows adds up.
More critically, a skip index only helps when it can actually skip data. A bloom filter on trace_id is useful if your queries filter on specific trace IDs and those IDs are not uniformly distributed across all blocks. If every block in the table contains every possible value (because your data is random), the index cannot skip anything. This correlation requirement is why skip indexes work best on columns that are loosely correlated with your ORDER BY columns - data written in time order tends to have temporal locality in other attributes too.
The GRANULARITY parameter deserves careful thought. Low granularity (1) means the index is evaluated at the finest possible level and can skip individual primary granules, but produces the largest index files and most overhead. High granularity covers more rows per index block, reducing overhead but coarsening skipping precision. For bloom filters on high-cardinality columns queried sparsely, start with GRANULARITY 1 and measure. For minmax on correlated numeric columns, GRANULARITY 4 or higher is often fine.
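On recent ClickHouse versions, system.data_skipping_indices reports how much space each index occupies, which is a useful sanity check while tuning (table name as in the example above):
SELECT name, type, granularity,
formatReadableSize(data_compressed_bytes) AS size_on_disk
FROM system.data_skipping_indices
WHERE table = 'http_logs';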
Common Indexing Mistakes
The most prevalent mistake is designing the ORDER BY with high-cardinality columns first because it "feels like" a unique identifier should lead. This is the B-tree mental model leaking into ClickHouse schema design. In a B-tree, a unique index on id gives you O(log n) point lookups. In ClickHouse, ORDER BY (id, date) means date-range queries read the entire table.
A related mistake is adding skip indexes to columns that are uniformly distributed relative to the primary sort order - random UUIDs, cryptographic hashes, and independently generated IDs all fall into this trap. The index exists, consumes storage and write CPU, but cannot skip any blocks because every block contains the full range of values.
Expecting skip indexes to behave like B-tree secondary indexes is another recurring problem. A bloom filter on email does not give you efficient lookup by email. It gives you efficient elimination of blocks that definitely don't contain a given email. For a table with 10,000 blocks, if only 5 blocks can possibly contain a specific email, the bloom filter reduces I/O by 99.95%. But if the data distribution means every block might contain any given email, you've paid the write overhead for nothing.
Finally: adding too many skip indexes degrades insert throughput. Each index requires building structures at write time. A table with 8 skip indexes over complex expressions will have noticeably slower inserts than one with none. Measure before adding, and remove indexes that aren't being used by actual queries - EXPLAIN indexes = 1 shows whether a skip index actually drops any granules for a given query.
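Removing one that isn't earning its keep is a single statement; for example, dropping the ngram index from the earlier table if substring searches turn out to be rare:
ALTER TABLE http_logs DROP INDEX idx_url;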