MergeTree Internal File Format: .bin, .mrk, and Storage Layout

A MergeTree data part is a self-contained directory of files on disk. Column values live in .bin files, byte offsets that let ClickHouse jump straight to a granule live in mark (.mrk) files, and a sparse primary.idx ties it all together. Understanding this layout explains why ClickHouse reads only the granules it needs, how decompression works, and what the files in a part directory actually mean when you are debugging.

This page documents the on-disk structure in detail. For the conceptual query-time view of the sparse index, see ClickHouse Indexes. For the engine itself, see MergeTree and what is MergeTree.

The Data Part Directory

Every part is a directory whose name encodes its partition, block range, and merge level (for example all_1_1_0). Inside, ClickHouse writes everything required to interpret the part without consulting any central catalog. A typical wide part directory contains:

File                               Purpose
<column>.bin                       Compressed column data, one file per column
<column>.mrk2                      Mark file per column - byte offsets mapping granules into the .bin file
primary.idx                        Sparse primary index - one entry per granule
columns.txt                        Column names and types in the part
count.txt                          Number of rows in the part
checksums.txt                      Sizes and hashes of every file, for integrity checks
partition.dat + minmax_<col>.idx   Partition value and this part's min/max for each partition-key column (only when PARTITION BY is used)
skp_idx_<name>.idx / .mrk2         Data-skipping (secondary) index files, if any are defined
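As a rough illustration of how a part directory name decomposes, the Python sketch below splits a name such as all_1_1_0 into its fields. The function name parse_part_name and the PartName type are hypothetical, and the optional trailing mutation number is an assumption that goes beyond what this page describes.

```python
import re
from typing import NamedTuple, Optional


class PartName(NamedTuple):
    partition_id: str
    min_block: int
    max_block: int
    level: int
    mutation: Optional[int]  # assumed optional suffix, absent in names like all_1_1_0


# Assumed pattern: <partition_id>_<min_block>_<max_block>_<level>[_<mutation>]
_PART_RE = re.compile(
    r"^(?P<partition_id>[^_]+)_(?P<min_block>\d+)_(?P<max_block>\d+)"
    r"_(?P<level>\d+)(?:_(?P<mutation>\d+))?$"
)


def parse_part_name(name: str) -> PartName:
    """Split a part directory name into partition, block range, and merge level."""
    match = _PART_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognizable part name: {name!r}")
    mutation = match.group("mutation")
    return PartName(
        partition_id=match.group("partition_id"),
        min_block=int(match.group("min_block")),
        max_block=int(match.group("max_block")),
        level=int(match.group("level")),
        mutation=int(mutation) if mutation is not None else None,
    )
```

A level of 0 indicates a part that has not yet been merged; each merge produces a part whose level is higher than that of its inputs.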

You can inspect the layout directly. The part directory lives under the table's data_paths:

SELECT name, data_paths
FROM system.tables
WHERE name = 'my_table';

SELECT name, part_type, rows, bytes_on_disk, path
FROM system.parts
WHERE table = 'my_table' AND active;

.bin Files: Compressed Columnar Data

There is one .bin file per column (in wide parts), holding every value for that column in primary-key sort order. The file is not one big compressed blob - it is a sequence of independently compressed blocks.

Each block has a small header (a checksum plus the compressed and uncompressed sizes) followed by the compressed payload. ClickHouse can decompress any single block without touching the rest of the file, which is what makes selective reads cheap. The default codec is LZ4 (favoring fast decompression); ZSTD is the default in ClickHouse Cloud and is common in self-hosted setups where storage cost matters more than CPU. Codecs are configurable per column.
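To make the "sequence of independently compressed blocks" idea concrete, here is a minimal Python sketch of a block stream. It is a toy model, not the real format: the header layout is simplified, zlib stands in for LZ4/ZSTD, a 16-byte BLAKE2 digest stands in for ClickHouse's checksum, and the names write_block and read_block are hypothetical. It cannot read real .bin files.

```python
import hashlib
import struct
import zlib
from typing import Tuple

# Toy header: 16-byte checksum, compressed size, uncompressed size (little-endian).
HEADER = struct.Struct("<16sII")


def _checksum(payload: bytes) -> bytes:
    # Stand-in checksum; the real format uses a different hash function.
    return hashlib.blake2b(payload, digest_size=16).digest()


def write_block(raw: bytes) -> bytes:
    """Compress one chunk of column data and prefix it with the toy header."""
    payload = zlib.compress(raw)
    return HEADER.pack(_checksum(payload), len(payload), len(raw)) + payload


def read_block(buf: bytes, offset: int) -> Tuple[bytes, int]:
    """Decompress the single block starting at `offset`.

    Returns the decompressed bytes and the offset of the next block.
    No other block in `buf` is touched.
    """
    checksum, compressed_size, uncompressed_size = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    payload = buf[start:start + compressed_size]
    if _checksum(payload) != checksum:
        raise ValueError("checksum mismatch: block is corrupted")
    raw = zlib.decompress(payload)
    if len(raw) != uncompressed_size:
        raise ValueError("decompressed size does not match header")
    return raw, start + compressed_size
```

Because every block carries its own sizes and checksum, a reader that knows a block's starting offset can verify and decompress it in isolation, which is the property the mark files rely on.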

Block sizes are bounded by two settings:

  • min_compress_block_size (default 65536 bytes) - ClickHouse tries to accumulate at least this much data before closing a block, so a block usually spans several granules.
  • max_compress_block_size (default 1048576 bytes) - the upper bound for a block.

Because a compressed block normally contains more than one granule, reading a single granule means decompressing its whole enclosing block. This is why two offsets are needed to locate a granule - see the mark files below.

Mark Files: Mapping Granules to Bytes

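A mark file is a small flat array, classically stored uncompressed (newer releases can also compress marks on disk, in which case the extension gains a c prefix, for example .cmrk2). It contains one mark per granule, and each mark is a pair of offsets: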

  1. Compressed block offset - the byte position in the .bin file where the compressed block containing this granule begins.
  2. Offset inside the decompressed block - where this granule starts once that block has been decompressed into memory.

To read granule N, ClickHouse seeks to the first offset, decompresses that one block, then jumps to the second offset within the decompressed buffer. No sequential scan from the start of the file is involved.
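The seek-decompress-jump sequence can be sketched in Python against a toy column file. Everything here is an assumption made for illustration: the helper names build_column and read_granule are hypothetical, zlib stands in for the real codec, the block header is reduced to two sizes, and each mark carries a row count in the style of the adaptive format covered in the ".mrk vs .mrk2 vs .mrk3" section.

```python
import struct
import zlib
from typing import List, Tuple

BLOCK_HEADER = struct.Struct("<II")  # toy header: compressed size, uncompressed size
MARK = struct.Struct("<QQQ")         # offset of block in .bin, offset inside block, rows
VALUE = struct.Struct("<q")          # the toy column stores Int64 values


def build_column(values: List[int], rows_per_granule: int,
                 granules_per_block: int) -> Tuple[bytes, bytes]:
    """Return (bin_bytes, mark_bytes) for a toy Int64 column."""
    bin_out, marks_out = bytearray(), bytearray()
    granules = [values[i:i + rows_per_granule]
                for i in range(0, len(values), rows_per_granule)]
    for first in range(0, len(granules), granules_per_block):
        block_offset = len(bin_out)  # where this compressed block begins
        raw = bytearray()
        for granule in granules[first:first + granules_per_block]:
            # One mark per granule: block position, position inside it, row count.
            marks_out += MARK.pack(block_offset, len(raw), len(granule))
            for value in granule:
                raw += VALUE.pack(value)
        payload = zlib.compress(bytes(raw))
        bin_out += BLOCK_HEADER.pack(len(payload), len(raw)) + payload
    return bytes(bin_out), bytes(marks_out)


def read_granule(bin_bytes: bytes, mark_bytes: bytes, n: int) -> List[int]:
    """Read granule n by decompressing only its enclosing block."""
    # Marks are fixed-width, so mark n sits at a computable position.
    block_offset, offset_in_block, rows = MARK.unpack_from(mark_bytes, n * MARK.size)
    compressed_size, _ = BLOCK_HEADER.unpack_from(bin_bytes, block_offset)
    start = block_offset + BLOCK_HEADER.size
    raw = zlib.decompress(bin_bytes[start:start + compressed_size])
    return [VALUE.unpack_from(raw, offset_in_block + i * VALUE.size)[0]
            for i in range(rows)]
```

For example, with build_column(list(range(100)), rows_per_granule=8, granules_per_block=3), granules 3, 4, and 5 share one compressed block, so their marks have the same first offset and differ only in the second.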

.mrk vs .mrk2 vs .mrk3

The mark-file extension tells you the part format and granularity mode:

Extension   Used by                                                     Contents per mark
.mrk        Wide parts with fixed (non-adaptive) granularity            compressed-block offset + in-block offset
.mrk2       Wide parts with adaptive granularity (the modern default)   the two offsets plus the granule's row count
.mrk3       Compact parts                                               adaptive-format marks for the single shared data file

With adaptive granularity (on by default), a granule does not always hold exactly index_granularity rows (8192 by default). ClickHouse starts a new granule when either 8192 rows accumulate or the accumulated row bytes reach index_granularity_bytes (default 10 MiB), whichever comes first. The .mrk2/.mrk3 formats carry the extra row-count field so the reader knows how many rows each granule actually contains. This is why tables with large rows show granules holding fewer than 8192 rows.
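The "whichever comes first" rule can be modeled with a short Python sketch. This is a simplification under stated assumptions: the function name split_into_granules is hypothetical, and it counts exact per-row byte sizes, whereas the real writer may work from size estimates rather than measuring every row.

```python
from typing import Iterable, List


def split_into_granules(row_sizes: Iterable[int],
                        index_granularity: int = 8192,
                        index_granularity_bytes: int = 10 * 1024 * 1024) -> List[int]:
    """Return the row count of each granule for rows of the given byte sizes."""
    granules: List[int] = []
    rows = size = 0
    for row_size in row_sizes:
        rows += 1
        size += row_size
        # Close the granule as soon as either limit is reached.
        if rows >= index_granularity or size >= index_granularity_bytes:
            granules.append(rows)
            rows = size = 0
    if rows:
        granules.append(rows)  # final, partially filled granule
    return granules
```

In this model, 100-byte rows hit the row limit first and produce full 8192-row granules, while 4096-byte rows hit the byte limit after 2560 rows (2560 x 4096 bytes = 10 MiB).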

The Sparse Primary Index (primary.idx)

primary.idx is also an uncompressed flat array, with one entry per granule, storing the primary-key column values of that granule's first row. It is small enough to stay resident in memory. (Since ClickHouse 23.5 it can be stored compressed on disk while still being decompressed into memory for use.)

At query time the flow is:

  1. ClickHouse searches primary.idx for the WHERE-clause key range, using binary search when the filter is on the leading key column and the slower generic exclusion search otherwise, producing a set of candidate granule numbers.
  2. For each column it needs, it looks up those granule numbers in the column's mark file to get byte offsets.
  3. It reads and decompresses only the enclosing compressed blocks from the .bin files.

So the index selects granules, the marks translate granules into byte offsets, and the .bin blocks deliver the bytes. The three structures are aligned granule-for-granule.
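The first step of that flow, for the simple binary-search case, can be sketched in Python. The sketch assumes a single integer key column and an inclusive range filter; the function name candidate_granules is hypothetical. Real primary keys are tuples of column values, but the bisection logic is the same idea.

```python
from bisect import bisect_left, bisect_right
from typing import List


def candidate_granules(index: List[int], lo: int, hi: int) -> range:
    """Granule numbers that may contain keys in the inclusive range [lo, hi].

    `index` holds the key of the first row of each granule, in sorted order,
    which is all that a sparse index knows about the data.
    """
    if not index or lo > hi:
        return range(0)
    # The granule before the first entry >= lo may still end with matching
    # rows, so step back by one (clamped at granule 0).
    first = max(bisect_left(index, lo) - 1, 0)
    # The last granule whose first key is <= hi is the last possible match.
    last = bisect_right(index, hi) - 1
    return range(first, last + 1)
```

With index = [0, 100, 200, 300], a filter on keys 150 to 250 selects granules 1 and 2. The selection is deliberately conservative: the index only knows each granule's first key, so boundary granules are included even when they turn out to hold no matching rows.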

Wide vs Compact Parts

Small parts (typical for individual inserts) are stored in compact format - all columns share a single data.bin and a single data.mrk3, with per-column offsets recorded inside. Larger parts are stored in wide format, with a separate .bin and .mrk2 per column. The threshold is controlled by table settings:

Setting                   Default             Effect
min_bytes_for_wide_part   10485760 (10 MiB)   Parts at or above this size use wide format
min_rows_for_wide_part    0                   Row-count threshold (0 = not used by default)

A part is written in wide format only when it meets or exceeds both thresholds; if it falls below either one, it is compact. Because min_rows_for_wide_part defaults to 0, no part can fall below it, so by default only the byte threshold decides. Compact parts make small inserts fast and cheap because they create only a couple of files instead of two per column. Background merges naturally produce larger parts, which cross the threshold and are written in wide format. Both formats are still strictly columnar - compact parts just pack the columns into one physical file.
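The format decision can be written as a small Python function. This is a sketch of the rule as the ClickHouse settings documentation describes it (a part is compact when its bytes or its rows fall below the corresponding setting), not a copy of the server's code; the function name choose_part_format is hypothetical.

```python
def choose_part_format(part_bytes: int, part_rows: int,
                       min_bytes_for_wide_part: int = 10485760,
                       min_rows_for_wide_part: int = 0) -> str:
    """Return "Compact" or "Wide" for a part of the given size."""
    # Falling below either configured minimum keeps the part compact.
    if part_bytes < min_bytes_for_wide_part or part_rows < min_rows_for_wide_part:
        return "Compact"
    return "Wide"
```

With the defaults, a 1 MiB insert yields a compact part and a 10 MiB part is wide regardless of row count, since no part can have fewer than 0 rows. Raising min_rows_for_wide_part makes the row count matter as well.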

-- See which format each part uses
SELECT partition, name, part_type, rows, formatReadableSize(bytes_on_disk)
FROM system.parts
WHERE table = 'my_table' AND active
ORDER BY modification_time DESC;

Inspecting Storage in Practice

You rarely need to read the raw files, but several system views expose the same information cleanly:

-- Per-column compressed vs uncompressed size and ratio
SELECT
    column,
    formatReadableSize(sum(column_data_compressed_bytes))   AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
    round(sum(column_data_uncompressed_bytes)
        / sum(column_data_compressed_bytes), 2)             AS ratio
FROM system.parts_columns
WHERE table = 'my_table' AND active
GROUP BY column
ORDER BY sum(column_data_compressed_bytes) DESC;

-- Marks bytes and primary-key size per part
SELECT name, part_type, marks, marks_bytes, primary_key_bytes_in_memory
FROM system.parts
WHERE table = 'my_table' AND active;

Common Issues

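  • "Cannot decompress" / checksum mismatch. A .bin block header or payload is corrupted on disk. ClickHouse validates each compressed block against the checksum stored in that block's own header; checksums.txt separately records the size and hash of each whole file. See Cannot decompress and too large size compressed.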
  • Queries read whole granules, not single rows. The smallest readable unit is a granule (8192 rows by default), and decompression happens at the block level. A point lookup still decompresses at least one block of every projected column. This is expected; it is not a misconfiguration.
  • Wide-row tables show tiny granules. With adaptive granularity, index_granularity_bytes (10 MiB) caps granule byte size, so granules can hold far fewer than 8192 rows. The row count is stored in the .mrk2/.mrk3 marks.
  • Too many small files. Wide parts create two files per column. Tables with hundreds of columns and frequent small inserts benefit from compact parts (keep parts under min_bytes_for_wide_part) plus healthy merging.
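The file-count pressure in the last item is easy to quantify. The Python sketch below counts only column data and mark files, using the two-files-per-column figure from this page; it ignores the fixed metadata files in each part and any extra streams that some column types may add. The function name data_file_count is hypothetical.

```python
def data_file_count(columns: int, parts: int, part_type: str) -> int:
    """Approximate number of data + mark files across `parts` parts."""
    if part_type == "Wide":
        per_part = 2 * columns  # one .bin and one mark file per column
    elif part_type == "Compact":
        per_part = 2            # a single data.bin and data.mrk3
    else:
        raise ValueError(f"unknown part type: {part_type!r}")
    return per_part * parts
```

For a 300-column table with 100 small unmerged parts, this gives 60000 files if the parts are wide and 200 if they are compact.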

Best Practices

  1. Let merges do their job. Compact parts for small inserts, wide parts after merges, is the intended lifecycle. Tune min_bytes_for_wide_part only with a measured reason.
  2. Choose codecs per column. LZ4 for hot, latency-sensitive columns; ZSTD for cold or highly compressible columns. For time-series and counter columns, place a specialized codec (Delta, DoubleDelta, Gorilla, T64) ahead of the general-purpose codec in the codec chain.
  3. Keep index_granularity at 8192 unless measured otherwise. Smaller granules give finer pruning but larger mark files and primary.idx; larger granules read more data per match.
  4. Use system tables, not raw files. system.parts, system.parts_columns, and system.columns expose sizes, marks, and part types without parsing binaries.
  5. Set the ORDER BY first. The on-disk layout only pays off when granules are sorted by the columns you filter on. See ClickHouse Indexes and CREATE TABLE.

How Pulse Helps

Pulse monitors the storage internals that this page describes - part counts and types, compression ratios per column, mark and primary-key memory, and merge health - and surfaces them as actionable signals rather than raw system-table dumps. When a table drifts toward too many small parts, an inefficient codec choice, or a stalled merge backlog that keeps parts in compact format, Pulse flags it with the context to fix it. Learn more at pulse.support.

Frequently Asked Questions

Q: What is the difference between a .bin and a .mrk file?

The .bin file holds the actual column values, compressed in independent blocks. The .mrk (or .mrk2/.mrk3) file is a small uncompressed index of byte offsets that tells ClickHouse where each granule starts inside the .bin file, so it can seek directly instead of scanning.

Q: Why does each mark store two offsets?

Because a compressed block usually contains several granules. The first offset locates the compressed block in the .bin file; the second locates the granule inside that block after it has been decompressed into memory.

Q: What do .mrk2 and .mrk3 mean?

.mrk2 is the mark format for wide parts with adaptive granularity, and .mrk3 is for compact parts. Both add a per-granule row-count field on top of the two offsets, which is required because adaptive granularity makes granules variable in row count.

Q: When does ClickHouse use compact instead of wide parts?

A part is stored compact when it is below min_bytes_for_wide_part (default 10 MiB) or below min_rows_for_wide_part (default 0, which effectively disables the row check). Compact parts pack all columns into one data.bin/data.mrk3 pair, which makes small inserts cheaper. Merges produce larger parts that switch to wide format.

Q: Can I read these files directly?

You generally should not. The .bin files are compressed block streams and the mark files are packed binary offsets. Use system.parts, system.parts_columns, and system.columns to inspect sizes, compression, marks, and part types safely.

Q: How big is the primary index in memory?

It holds one entry per granule (roughly one per 8192 rows), storing only the primary-key values of each granule's first row, so it is typically a few MB even for billion-row tables. Check primary_key_bytes_in_memory in system.parts.
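A back-of-the-envelope estimate in Python shows where the "few MB" figure comes from. The 16-byte key width is an assumption (for instance, two 8-byte integer key columns); string keys can be much larger, and the estimate ignores any in-memory overhead. The function name primary_index_bytes is hypothetical.

```python
def primary_index_bytes(rows: int, key_bytes_per_entry: int,
                        index_granularity: int = 8192) -> int:
    """Estimate raw index size: one key entry per granule."""
    # Integer ceiling division: a partially filled last granule still gets an entry.
    entries = -(-rows // index_granularity)
    return entries * key_bytes_per_entry
```

For one billion rows and a 16-byte key, this gives 122071 entries and 1953136 bytes, a little under 2 MiB. The primary_key_bytes_in_memory column remains the authoritative number for a real table.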
