Dictionary Hash Layout Variants: HASHED, SPARSE_HASHED, and HASHED_ARRAY

ClickHouse dictionaries that use a hashed family layout keep all of their data in an in-memory hash table for fast key-value lookups. Within that family there are three variants — HASHED, SPARSE_HASHED, and HASHED_ARRAY — that store the same logical data with very different memory and CPU tradeoffs.

Picking the right one can cut a dictionary's RAM footprint by 3-4x with little or no change in lookup latency. This guide compares the three layouts, shows their configuration syntax, and gives concrete selection criteria. For background on what dictionaries are and how to define them, see ClickHouse Dictionary Data Structure and What Is a ClickHouse External Dictionary.

The Three Hashed Variants at a Glance

All three layouts load the entire dictionary into memory and serve lookups directly from RAM. They differ only in how that memory is organized. The numbers below come from the Altinity Knowledge Base test of a dictionary with 5 million UInt64 keys (memory) and 10 million lookups (query time):

Layout Memory (5M keys) Lookup time (10M) Best when
HASHED 256.00 MiB 0.546 s Few attributes, latency-critical lookups
SPARSE_HASHED 76.30 MiB 1.422 s Memory is critically constrained, single attribute
HASHED_ARRAY 68.77 MiB 0.558 s The general-purpose default; many attributes

The headline takeaways:

  • SPARSE_HASHED uses roughly 3x less memory than HASHED but is roughly 3x slower on lookups — it trades CPU for RAM.
  • HASHED_ARRAY is the most memory-efficient of the three while staying nearly as fast as HASHED. In current ClickHouse documentation it is the recommended default for most use cases.

Exact figures depend on key distribution, attribute count, and data types, so always benchmark against your own data. But the direction of the tradeoffs is consistent.

HASHED

HASHED stores the dictionary completely in memory as a classic hash table. It supports any number of elements and requires UInt64 keys (use COMPLEX_KEY_HASHED for composite or non-UInt64 keys). It is the fastest of the three for lookups, but its per-attribute memory overhead grows as you add attributes, because each attribute is held in its own hash-table-aligned structure.

CREATE DICTIONARY country_codes
(
    id UInt64,
    country String,
    region String
)
PRIMARY KEY id
SOURCE(CLICKHOUSE(TABLE 'country_dim'))
LIFETIME(MIN 300 MAX 600)
LAYOUT(HASHED());

Since ClickHouse 22.x, HASHED (and SPARSE_HASHED) accept tuning parameters:

LAYOUT(HASHED(
    SHARDS 1
    SHARD_LOAD_QUEUE_BACKLOG 10000
    MAX_LOAD_FACTOR 0.5
))
  • SHARDS (default 1) splits the hash table into N shards so the dictionary can load in parallel across threads. This speeds up initial load and reloads of large dictionaries; it does not change lookup semantics.
  • SHARD_LOAD_QUEUE_BACKLOG (default 10000) is the size of the per-shard load queue.
  • MAX_LOAD_FACTOR (default 0.5, valid range [0.5, 0.99]) controls how full the hash table is allowed to get before it grows. A higher load factor packs entries more tightly (less memory) at the cost of slower reads due to more collisions.

The equivalent XML configuration:

<layout>
  <hashed>
    <shards>1</shards>
    <shard_load_queue_backlog>10000</shard_load_queue_backlog>
    <max_load_factor>0.5</max_load_factor>
  </hashed>
</layout>

SPARSE_HASHED

SPARSE_HASHED is identical to HASHED in behavior and configuration but uses a more compact (sparse) hash table representation. It uses substantially less memory in exchange for more CPU per lookup. In the Altinity test it cut memory from 256 MiB to ~76 MiB but raised lookup time by ~2.6x.

CREATE DICTIONARY country_codes_sparse
(
    id UInt64,
    country String
)
PRIMARY KEY id
SOURCE(CLICKHOUSE(TABLE 'country_dim'))
LIFETIME(MIN 300 MAX 600)
LAYOUT(SPARSE_HASHED());

Because SPARSE_HASHED is slower, the SHARDS parameter is especially useful here to parallelize loading. Note that current ClickHouse guidance is that SPARSE_HASHED is mainly worthwhile when the dictionary has a single attribute — with multiple attributes, HASHED_ARRAY usually wins on both memory and speed, making SPARSE_HASHED a niche choice.

HASHED_ARRAY

HASHED_ARRAY uses one hash table to map each key to an index, then stores each attribute in a contiguous array addressed by that index. This separates the key index (one hash table) from the attribute values (plain arrays), so adding attributes adds only array storage rather than additional hash-table overhead. The result is the lowest memory footprint of the three while keeping lookup speed essentially on par with HASHED.

CREATE DICTIONARY country_codes_array
(
    id UInt64,
    country String,
    region String,
    continent String
)
PRIMARY KEY id
SOURCE(CLICKHOUSE(TABLE 'country_dim'))
LIFETIME(MIN 300 MAX 600)
LAYOUT(HASHED_ARRAY());

The memory advantage of HASHED_ARRAY over HASHED grows with the number of attributes, which is why it is the recommended general-purpose layout. For a dictionary with only one or two attributes, plain HASHED may be marginally faster and the memory difference is smaller — worth measuring if lookups are extremely hot.

Complex-Key Variants

All three layouts require UInt64 keys. When your key is a String, a tuple, or otherwise not castable to UInt64, use the COMPLEX_KEY_* counterpart. They carry the same performance and memory tradeoffs as their simple-key versions:

Simple key (UInt64) Composite / non-UInt64 key
HASHED COMPLEX_KEY_HASHED
SPARSE_HASHED COMPLEX_KEY_SPARSE_HASHED
HASHED_ARRAY COMPLEX_KEY_HASHED_ARRAY
CREATE DICTIONARY geo_dim
(
    country String,
    city String,
    population UInt64
)
PRIMARY KEY country, city
SOURCE(CLICKHOUSE(TABLE 'geo_table'))
LIFETIME(MIN 300 MAX 600)
LAYOUT(COMPLEX_KEY_HASHED_ARRAY());

Choosing a Layout

A practical decision path:

  1. Default to HASHED_ARRAY (or COMPLEX_KEY_HASHED_ARRAY for non-UInt64 keys). It gives the best memory-to-speed ratio for the common case of multiple attributes.
  2. Use HASHED if the dictionary has only one or two attributes and lookups are on the hottest path, where the marginal speed edge matters and memory is not a concern.
  3. Use SPARSE_HASHED only when memory is critically constrained, the dictionary has essentially a single attribute, and you can tolerate roughly 3x slower lookups.
  4. For very small, dense UInt64 keyspaces, consider the FLAT layout instead — it is the fastest (a plain array offset lookup) but keys must be UInt64 and bounded by a maximum key value (500,000 by default), so it is unsuitable for sparse or large keyspaces.

Remember that dictionaries store data uncompressed in memory, so RAM usage is typically much larger than the compressed on-disk size of the equivalent table. This is exactly why the hashed-variant choice matters at scale. If you are weighing dictionaries against other enrichment approaches, see Dictionaries vs LowCardinality in ClickHouse.

Measuring Memory and Verifying Your Choice

Check actual per-dictionary memory and key count with the system table:

SELECT
    name,
    type,                       -- the layout, e.g. HashedArray
    status,
    element_count,
    formatReadableSize(bytes_allocated) AS memory
FROM system.dictionaries
WHERE name = 'country_codes_array';

To compare variants empirically, create the same dictionary under each layout and read bytes_allocated and element_count, then benchmark lookups with a representative dictGet workload:

SELECT count()
FROM numbers(10000000)
WHERE dictGet('country_codes_array', 'country', number % 5000000) != '';

Run each variant a few times to warm caches and average the results.

Common Issues

  1. Choosing SPARSE_HASHED for multi-attribute dictionaries. This is the most common misstep — HASHED_ARRAY usually beats it on both memory and speed when there is more than one attribute. Benchmark before settling on SPARSE_HASHED.
  2. Using a simple-key layout for String keys. HASHED and friends require UInt64 keys. A String primary key needs a COMPLEX_KEY_* layout or the dictionary will fail to create.
  3. Forgetting dictionaries are uncompressed in RAM. A 5 GB compressed table can become tens of GB of dictionary memory. Size accordingly and prefer HASHED_ARRAY.
  4. Raising MAX_LOAD_FACTOR blindly. Packing the table tighter saves memory but increases collision chains and slows lookups. Treat it as a fine-tuning knob, not a primary memory strategy — switching layout is the bigger lever.
  5. Slow loads on large dictionaries. If reloads stall query startup, set SHARDS greater than 1 to parallelize loading. Also consider deferring load with the dictionaries lazy load setting.

Best Practices

  1. Start with HASHED_ARRAY and only switch if benchmarks justify it.
  2. Use the smallest viable data types for keys and attributes; this compounds the layout savings.
  3. Measure with system.dictionaries (bytes_allocated, element_count) rather than estimating.
  4. Match the key type to the layout — simple UInt64 keys to HASHED*, composite or String keys to COMPLEX_KEY_*.
  5. Use SHARDS for parallel loading of large dictionaries, especially SPARSE_HASHED which loads more slowly.
  6. Re-evaluate as attribute count grows — the more attributes you add, the more HASHED_ARRAY pulls ahead of plain HASHED on memory.

How Pulse Helps

Dictionary memory is easy to underestimate because it is uncompressed and lives outside the usual table-size accounting. Pulse monitors ClickHouse memory broken down by consumer — including dictionaries — and surfaces dictionaries whose layout choice is costing more RAM than necessary or whose reloads are stalling queries. When a dictionary is sized or laid out sub-optimally, Pulse flags it with the relevant system.dictionaries metrics and a concrete recommendation, so you can right-size layouts before they turn into memory-limit incidents. Pulse is built and operated by ClickHouse experts who can advise on dictionary design, not just alert on it.

Frequently Asked Questions

Q: Which hashed layout should I use by default?

HASHED_ARRAY (or COMPLEX_KEY_HASHED_ARRAY for non-UInt64 keys). It is the recommended default in current ClickHouse documentation because it offers the lowest memory usage among the hashed variants while keeping lookup speed nearly identical to HASHED.

Q: How much memory does SPARSE_HASHED actually save?

In the Altinity Knowledge Base test with 5 million keys, SPARSE_HASHED used about 76 MiB versus 256 MiB for HASHED — roughly 3x less. The cost was about 2.6x slower lookups. Your savings depend on key distribution and attribute count, so benchmark with your data.

Q: Is HASHED ever faster than HASHED_ARRAY?

Marginally, when the dictionary has very few attributes. With one or two attributes, plain HASHED can edge out HASHED_ARRAY on lookup latency. As attribute count rises, HASHED_ARRAY matches the speed and uses less memory, so it becomes the better choice.

Q: When do I need the complex_key variants?

Whenever your dictionary key is not a single UInt64 — for example a String key or a composite key made of several columns. COMPLEX_KEY_HASHED, COMPLEX_KEY_SPARSE_HASHED, and COMPLEX_KEY_HASHED_ARRAY carry the same tradeoffs as their simple-key counterparts.

Q: What does MAX_LOAD_FACTOR do?

It controls how full the hash table is allowed to get (default 0.5, range [0.5, 0.99]) for HASHED and SPARSE_HASHED. A higher value packs entries more tightly to save memory but increases hash collisions and slows lookups. Switching to a more memory-efficient layout is usually a bigger win than tuning this knob.

Q: How do I see how much memory a dictionary is using?

Query system.dictionaries and read bytes_allocated (use formatReadableSize) along with element_count and type. That reflects the actual in-memory, uncompressed footprint for the chosen layout.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.