ClickHouse Data Types: Storage on Disk vs Memory Representation

ClickHouse stores data on disk in a compressed, encoded columnar format, but when it reads that data to answer a query it decompresses and materializes columns into in-memory structures that can look very different from their on-disk footprint. Understanding this disk-vs-RAM distinction explains why a 2 GB column can expand to tens of gigabytes in memory, why Nullable carries a hidden cost, and why LowCardinality is fast as well as small.

This guide walks through how the common data types are laid out on disk and how they are represented in RAM during query execution, so you can make schema choices that are cheap in both places.

Disk vs Memory: The Core Difference

On disk, ClickHouse writes each column as one or more files inside a data part. The bytes are run through column encodings (Delta, DoubleDelta, Gorilla, T64) and a general-purpose compressor (LZ4 or ZSTD), so the on-disk size reflects how well the data compresses. See codecs and compression for how that disk layout is configured.

In memory, ClickHouse works with IColumn objects: contiguous arrays of decompressed, fully materialized values plus any auxiliary structures (offset arrays, null maps, dictionaries). Memory usage is driven by the uncompressed representation and the number of rows pulled into a processing block, not by the compressed disk size. This is why:

  • A column that compresses 20:1 on disk still occupies its full uncompressed width in RAM.
  • Wide types you never trimmed (UInt64 where UInt16 would do, Float64 where Float32 works) cost the same in RAM regardless of how well they compress.
  • Memory pressure during heavy GROUP BY/ORDER BY/JOIN is governed by in-RAM width, which is where the type choices below matter most. For tuning limits, see memory configuration settings.
Aspect On disk In memory (RAM)
Form Compressed + encoded column files Decompressed IColumn arrays
Size driver Compression ratio of the data Uncompressed type width × rows in block
Nullable Separate null-mask file (1 byte/row, compresses) Parallel UInt8 null map (1 byte/row, uncompressed)
LowCardinality Dictionary file(s) + index column Dictionary + positions column, processed as integers
String Length-prefixed bytes, block-compressed Offsets array + contiguous byte buffer
FixedString(N) Exactly N bytes/value, padded N bytes/value, no offsets array

Nullable: The Null-Mask Overhead

Nullable(T) is implemented as two parallel structures: the underlying column of type T, plus a null map that records, per row, whether the value is NULL. On disk this is a separate null-mask file alongside the values file; in memory it is a parallel UInt8 column where 1 means NULL and 0 means a real value.

-- Same logical column, very different cost profile
CREATE TABLE demo
(
    user_id      UInt64,
    last_login   DateTime,            -- non-nullable
    referrer     Nullable(String)     -- adds a null-mask file + in-RAM byte/row
)
ENGINE = MergeTree
ORDER BY user_id;

What this costs you:

  • Storage: one extra byte per row on disk before compression. Null maps compress well when values are mostly all-NULL or all-present, but the second file still adds I/O and another stream to read.
  • Memory: a full extra UInt8 per row materialized alongside the data column during query execution.
  • Read path: every value access must consult the null map, which disqualifies the column from some optimizations.
  • Index restriction: a Nullable field is strongly discouraged in the primary key / ORDER BY index and is blocked by default. It can be enabled via allow_nullable_key=1, but the ClickHouse team advises against it because NULL semantics in sort keys produce surprising results.

The official docs are blunt: "Using Nullable almost always negatively affects performance." When you do not need to distinguish NULL from a default value, use a sentinel instead (empty string, 0, -1, or a far-past/far-future date) and keep the column non-nullable.

To detect NULL-ness cheaply you can read the .null subcolumn, which returns 1 for NULL and 0 otherwise without materializing the full value column:

SELECT count() FROM demo WHERE referrer.null = 0;  -- count non-NULL rows

LowCardinality: Dictionary Encoding On Disk and in RAM

LowCardinality(T) replaces direct value storage with dictionary coding: ClickHouse builds a dictionary of distinct values and stores a compact integer position per row that points into the dictionary. This is true both on disk (a dictionary file plus an index/positions column) and in memory (the dictionary plus a positions column).

CREATE TABLE events
(
    ts          DateTime,
    event_type  LowCardinality(String),   -- e.g. dozens of distinct values
    country     LowCardinality(String),   -- ~200 distinct values
    raw_payload String                    -- high-cardinality, leave plain
)
ENGINE = MergeTree
ORDER BY (event_type, ts);

Why it helps in both places:

  • Disk: instead of repeating the full string on every row, each row stores a small integer index, and the distinct strings are stored once in the dictionary.
  • Memory: ClickHouse keeps the data dictionary-encoded for as much of query processing as possible, comparing integer positions instead of strings and resolving back to actual values as late as it can. Integer comparisons are faster and the smaller positions fit more rows into CPU cache.

Dictionary management is controlled by low_cardinality_max_dictionary_size (default 8192). When a part's distinct values fit under this threshold, ClickHouse uses a single shared dictionary per part; beyond it, ClickHouse monitors the dictionary size and starts writing additional dictionaries for the part. Setting low_cardinality_use_single_dictionary_for_part = 1 prevents this and instead writes overflow values without dictionary encoding.

When LowCardinality wins vs hurts (per the official docs):

Distinct values per part Effect
Fewer than ~10,000 Higher efficiency for both reading and storage — clear win
~10,000–100,000 Gray zone; measure before committing
More than ~100,000 Can perform worse than the plain type — dictionary overhead dominates

Wrapping a genuinely high-cardinality column (like a UUID or a free-text field) in LowCardinality produces a bloated dictionary that helps nothing. ClickHouse guards against the most dangerous cases — for example LowCardinality(UInt64) — and raises an error unless you explicitly opt in. See allow_suspicious_low_cardinality_types for when (and whether) to enable it.

String vs FixedString: Variable vs Fixed Layout

String is variable-length. On disk each value is stored as its raw bytes prefixed by a varint length; in memory a String column is an offsets array plus one contiguous byte buffer holding all values back-to-back. The offsets array (a UInt64 per row) is the in-RAM overhead that lets ClickHouse locate each value.

FixedString(N) is exactly N bytes per value. Shorter inputs are right-padded with null bytes (\0); inputs longer than N are rejected. Because every value is the same width, there is no offsets array — neither on disk nor in memory.

CREATE TABLE addresses
(
    ip_v6     FixedString(16),    -- binary IPv6, exactly 16 bytes/row, no length prefix
    md5_hash  FixedString(16),    -- binary MD5
    country   FixedString(2),     -- ISO country code: 2 bytes vs 3 for String
    notes     String              -- arbitrary length, variable
)
ENGINE = MergeTree
ORDER BY ip_v6;

Trade-offs:

  • For short, truly fixed-width values (IP addresses, hashes, currency/country codes), FixedString(N) removes the per-value length prefix and the in-memory offsets array, giving more compact storage and faster comparisons.
  • For anything variable-length, FixedString wastes space on padding and forces you to handle the null-byte padding in queries. Use plain String.
  • For low-distinct-count text, LowCardinality(String) usually beats both — it stores small integer positions instead of repeated bytes or padded fixed buffers.

Numeric and Date Types: Width Is What You Pay in RAM

Integer, float, decimal, and date/time types have a fixed in-memory width regardless of compression. A UInt64 is 8 bytes per row in RAM even if Delta+ZSTD shrinks it to a fraction of a byte on disk. The disk-vs-RAM gap is largest exactly here, so right-sizing matters:

  • Use the smallest integer that fits the domain (UInt8/UInt16/UInt32 before reaching for UInt64).
  • Prefer Float32 over Float64 when the extra precision is not needed — half the RAM per value.
  • For timestamps, DateTime (4 bytes) is half the width of DateTime64 (8 bytes); only pay for sub-second precision when you need it.
  • Date (2 bytes) covers a wide range at half the width of Date32 (4 bytes).

These choices barely move the disk size after compression, but they directly halve or quarter the memory a column occupies during GROUP BY, ORDER BY, and JOIN build-side hash tables.

Common Issues

  1. "My query OOMs but the table is tiny on disk." Disk size is compressed; the query materializes uncompressed columns and intermediate states. Look at the in-RAM widths of the columns and aggregation keys, not bytes_on_disk. Tune with memory configuration settings.

  2. Nullable everywhere by reflex. Schemas migrated from row stores often mark every column Nullable. Each one adds a null-mask file, blocks the column from the primary key, and adds per-row memory. Make columns non-nullable and use sentinels unless NULL has real meaning.

  3. LowCardinality on high-cardinality data. Wrapping UUIDs, request IDs, or free text in LowCardinality builds a huge dictionary that costs more than it saves. Keep it for columns with up to a few tens of thousands of distinct values.

  4. FixedString for variable-length data. Padding waste plus the surprise of trailing \0 bytes in comparisons. Reserve FixedString for genuinely fixed-width binary data.

  5. Comparing on-disk and in-RAM sizes directly. They are not the same metric. Use system.parts/system.columns for disk footprint and look at query-time memory (memory_usage in system.query_log) for RAM.

Best Practices

  1. Right-size numeric and date types first — it is the cheapest way to cut memory with almost no downside.
  2. Default to non-nullable. Add Nullable only when NULL is semantically distinct from a default, and never on primary-key columns.
  3. Use LowCardinality for low-distinct text (status codes, country, device type, log level) and plain String for high-cardinality text.
  4. Reach for FixedString only for fixed-width binary (IPv6 as FixedString(16), hashes), not for human-readable variable text.
  5. Measure both dimensions. Check disk size and query memory before and after a type change; a change that helps one can hurt the other.
  6. Pair type choices with codecs. Type controls RAM width; codecs control disk size. Optimize them together.

How Pulse Helps

Type-vs-storage problems rarely announce themselves: a schema looks fine, disk usage looks reasonable, and then a dashboard query OOMs under load or a column quietly doubles your part count. Pulse continuously profiles ClickHouse tables and surfaces the design issues that drive memory and storage cost — columns marked Nullable that carry a null-mask penalty, oversized numeric types that inflate RAM during aggregation, LowCardinality wrapped around high-cardinality data, and missing codecs on columns that should compress. It also tracks per-query memory usage and per-table part-count growth, so you can connect a memory spike back to the specific column or schema decision causing it before it slows the cluster down. Connect your ClickHouse cluster to Pulse to get this monitoring automatically.

Frequently Asked Questions

Q: Why does my column use far more memory than its on-disk size?

Disk size reflects compression; memory reflects the uncompressed type width times the number of rows pulled into a processing block, plus auxiliary structures (offsets, null maps, dictionaries). A highly compressible column can occupy 10–50× more in RAM than on disk. Right-size the type to reduce the in-memory footprint.

Q: How much does Nullable actually cost?

One extra byte per row for the null map (on disk before compression, and a full UInt8 per row in RAM), an extra file/stream to read, exclusion from the primary-key index, and a per-access null-map check. Use sentinel values instead of Nullable when NULL is not semantically required.

Q: When should I use LowCardinality instead of plain String?

When the column has a small set of distinct values — roughly under 10,000 per part is a clear win, and above ~100,000 it can be slower than plain String. It stores integer positions into a shared dictionary, which is both smaller on disk and faster in RAM because most processing happens on integers.

Q: Is FixedString always smaller than String?

Only for truly fixed-width values. FixedString(N) drops the length prefix and the in-memory offsets array, but it pads shorter values with null bytes and rejects longer ones. For variable-length data the padding wastes space and String is the right choice.

Q: Do type changes affect disk size, memory, or both?

It depends. Numeric/date width changes (e.g. UInt64UInt32, Float64Float32) mainly cut RAM and barely move compressed disk size. LowCardinality and FixedString affect both. Always measure disk and query memory separately, since improving one can regress the other.

Q: Can a Nullable column be part of the primary key or ORDER BY?

Not by default. ClickHouse blocks Nullable fields in the primary key / ORDER BY key unless you set allow_nullable_key=1. This is strongly discouraged: NULL semantics in sort keys can produce unintuitive results and hurt query planning. If you need the column in the sort key, make it non-nullable and use a sentinel value for the "missing" case instead.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.