Using PostgreSQL as a Dictionary Source in ClickHouse

ClickHouse dictionaries let you keep reference data — user names, product catalogs, geo lookups — in an external system and query it with fast in-memory dictGet calls instead of JOINs. PostgreSQL is one of the most common sources: you keep the authoritative table in Postgres and ClickHouse periodically pulls it into a dictionary.

This guide covers the practical end-to-end setup: defining a dictionary with a PostgreSQL source, choosing a layout, controlling refresh, handling authentication and replicas, and the issues that come up most often. For conceptual background, see What is a ClickHouse dictionary and ClickHouse external dictionaries.

Prerequisites

The PostgreSQL dictionary source ships in the standard ClickHouse server build — no extra package is required (unlike ODBC sources). You need:

  • A reachable PostgreSQL instance and a role with SELECT on the source table.
  • Network connectivity from every ClickHouse node that will load the dictionary. In a cluster, each replica loads the dictionary independently, so every node must be able to reach Postgres.
  • The dictionary attributes you want must exist as columns in the Postgres table (or be produced by a custom query).

Creating a Dictionary with a PostgreSQL Source

The DDL form (CREATE DICTIONARY) is the modern, recommended approach. A dictionary has four parts: the attribute schema with a PRIMARY KEY, a SOURCE, a LAYOUT, and a LIFETIME.

CREATE DICTIONARY customers_dict
(
    customer_id UInt64,
    name        String,
    country     String,
    tier        String DEFAULT 'standard'
)
PRIMARY KEY customer_id
SOURCE(POSTGRESQL(
    host 'postgres.internal'
    port 5432
    user 'clickhouse_ro'
    password 'secret'
    db 'crm'
    table 'customers'
))
LAYOUT(HASHED())
LIFETIME(MIN 300 MAX 600);

Once created, look up values with dictGet:

SELECT
    order_id,
    dictGet('customers_dict', 'name', customer_id)    AS customer_name,
    dictGet('customers_dict', 'country', customer_id) AS country
FROM orders;

dictGet('dict_name', 'attr', id_expr) returns the attribute for the given key. The key expression must match the key type — for a UInt64 key, wrap raw integers with toUInt64(...) when needed. Related helpers:

  • dictGetOrDefault('customers_dict', 'tier', toUInt64(42), 'unknown') — return a fallback when the key is missing instead of the attribute's declared default.
  • dictHas('customers_dict', toUInt64(42)) — returns 1 if the key exists, 0 otherwise.

PostgreSQL Source Parameters

These parameters go inside SOURCE(POSTGRESQL(...)):

Parameter Purpose
host / port PostgreSQL server address (default port 5432).
user / password Credentials for a role with SELECT on the source.
db Database name.
table Source table name. Mutually exclusive with query.
where Optional WHERE clause to filter source rows (cannot be combined with query).
query A custom SELECT instead of table/where. Either table or query is required.
invalidate_query A SELECT returning a single value; the dictionary reloads only when that value changes.
replica(...) One or more replica blocks with their own host, port, and priority.
background_reconnect Optional; reconnect to a replica automatically after a failure.

A custom query is useful when you want to join or transform in PostgreSQL, or rename columns to match your dictionary attributes:

SOURCE(POSTGRESQL(
    host 'postgres.internal' port 5432
    user 'clickhouse_ro' password 'secret'
    db 'crm'
    query 'SELECT id AS customer_id, full_name AS name, country_code AS country FROM customers WHERE active = true'
))

Note: table and where cannot be combined with query — pick one approach.

Choosing a Layout

The LAYOUT controls how the dictionary is held in memory and which key types are allowed. This choice matters more for performance than anything in the source definition.

Layout Key Use when
FLAT Single UInt64 Keys are dense, small integers. Fastest lookups, backed by a flat array. Key values must not exceed max_array_size (default 500,000); ClickHouse throws an exception and refuses to load if any key exceeds this limit.
HASHED Single UInt64 Sparse or large integer keys. The general-purpose default.
SPARSE_HASHED Single UInt64 Same as hashed but trades some CPU for lower memory.
COMPLEX_KEY_HASHED Composite / non-integer Keys are strings, tuples, or multiple columns.
HASHED_ARRAY Single UInt64 Many attributes; more memory-efficient than HASHED.
RANGE_HASHED UInt64 + range Lookups valid for a date/time range (e.g. price history).
CACHE / COMPLEX_KEY_CACHE Single / composite Dataset too large for memory; only hot keys cached. Tuning is hard — avoid unless necessary.
DIRECT / COMPLEX_KEY_DIRECT Single / composite No in-memory storage; every lookup hits PostgreSQL. Use only for always-fresh, low-volume lookups.
IP_TRIE CIDR prefixes IP-prefix / CIDR lookups.

For a String key from Postgres, use COMPLEX_KEY_HASHED and declare the key accordingly:

CREATE DICTIONARY country_codes_dict
(
    code     String,
    name     String,
    region   String
)
PRIMARY KEY code
SOURCE(POSTGRESQL(
    host 'postgres.internal' port 5432
    user 'clickhouse_ro' password 'secret'
    db 'reference' table 'countries'
))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(MIN 600 MAX 900);

Look up complex keys by passing the key as a tuple: dictGet('country_codes_dict', 'name', tuple('US')).

Controlling Refresh with LIFETIME

LIFETIME(MIN min MAX max) sets the refresh window in seconds. ClickHouse picks a random reload time between MIN and MAX for each node, which spreads load so all replicas don't hit PostgreSQL simultaneously. LIFETIME(300) is shorthand for a fixed interval.

LIFETIME(MIN 300 MAX 600)   -- reload sometime between 5 and 10 minutes
LIFETIME(0)                 -- never reload automatically

A blind reload re-reads the whole source. To avoid unnecessary full reloads, add an invalidate_query that returns a cheap change marker (a max timestamp or a row count). ClickHouse reloads only when the returned value differs from the previous run:

SOURCE(POSTGRESQL(
    host 'postgres.internal' port 5432
    user 'clickhouse_ro' password 'secret'
    db 'crm' table 'customers'
    invalidate_query 'SELECT max(updated_at) FROM customers'
))

To force an immediate reload regardless of LIFETIME:

SYSTEM RELOAD DICTIONARY customers_dict;

For more on keeping enrichment data current, see ClickHouse dictionary enrichment and update patterns.

High Availability with Replicas

If your PostgreSQL setup has read replicas, list them with priorities. ClickHouse tries replicas in priority order — lower number means higher priority — and falls back to the next one on failure.

SOURCE(POSTGRESQL(
    port 5432
    user 'clickhouse_ro' password 'secret'
    db 'crm' table 'customers'
    replica(host 'pg-primary'  port 5432 priority 1)
    replica(host 'pg-replica-1' port 5432 priority 2)
    background_reconnect true
))

Authentication and Security

  • Use a dedicated read-only role. Grant only SELECT on the source table; the dictionary never writes back.
  • Credentials live in the dictionary definition. With DDL dictionaries the password is stored in ClickHouse metadata. Restrict who can run SHOW CREATE DICTIONARY and consider XML config files with locked-down filesystem permissions if you need the credential out of SQL.
  • SSL/TLS: terminate Postgres connections over TLS at the server side; configure sslmode on the PostgreSQL server and require encrypted connections via pg_hba.conf.
  • Network: every ClickHouse node needs reachability to Postgres. Connection or auth failures surface as load errors — see ClickHouse error: PostgreSQL connection failure and dictionary access denied.

Dictionary vs. postgresql() Table Function vs. PostgreSQL Engine

A dictionary is not the only way to read Postgres from ClickHouse. Choose based on access pattern:

Approach Best for Trade-off
Dictionary High-frequency key lookups / enrichment in hot queries Loaded into memory; refreshed on a schedule, so slightly stale
postgresql() table function Ad-hoc one-off reads of a Postgres table Reads live each time; no caching, full network round trip
PostgreSQL table engine A persistent table mapping you query like any table Live reads on every query; not optimized for point lookups

If your workload is "given an ID, get the name" repeated millions of times per query, a dictionary is almost always the right tool. If you need always-live data and can tolerate per-query latency, the DIRECT layout or the table function fits better.

Common Issues

  1. Dictionary loads but is empty. Usually the table/schema/db combination points at the wrong place, or the where clause filters everything out. Check SELECT status, last_exception FROM system.dictionaries WHERE name = 'customers_dict'. See ClickHouse dictionary is empty.

  2. query and table both set. They are mutually exclusive; ClickHouse rejects the definition. See incorrect dictionary definition.

  3. Stale data. Either LIFETIME is too long, or an invalidate_query never changes. Verify the marker actually moves when source rows change, or run SYSTEM RELOAD DICTIONARY.

  4. Connection / auth failures on load. Wrong credentials, missing SELECT grant, or a firewall blocking a node. See PostgreSQL connection failure.

  5. Cache-layout update failures. With CACHE layouts, individual key misses that can't reach the source produce update errors. See cache dictionary update fail.

  6. Type mismatches. A Postgres bigint maps to UInt64/Int64; a text to String. Mismatched declared types cause load errors — align the dictionary attribute types with the Postgres column types.

Best Practices

  1. Prefer DDL over XML for new dictionaries — it's versionable through SQL and easier to manage.
  2. Pick the layout deliberately: FLAT for dense integer keys, HASHED for sparse ones, COMPLEX_KEY_HASHED for string/composite keys. Avoid CACHE and DIRECT unless you specifically need them.
  3. Add an invalidate_query so reloads happen only on real changes, not on every LIFETIME tick.
  4. Use a read-only Postgres role scoped to the source table.
  5. Filter at the source with where or a custom query to keep the dictionary small and the memory footprint low.
  6. Monitor system.dictionaries for status, last_exception, loading_duration, and bytes_allocated.

How Pulse Helps

Pulse monitors ClickHouse in production and surfaces dictionary problems before they corrupt query results. It tracks system.dictionaries for failed loads, growing loading_duration, repeated reload errors, and dictionaries stuck in a not-loaded state — including PostgreSQL-sourced ones that silently go stale when a LIFETIME or invalidate_query is misconfigured. It also watches the connection and memory side effects of large dictionaries across the cluster. Learn more at pulse.support.

Frequently Asked Questions

Q: Do I need to install anything to use a PostgreSQL dictionary source?

No. The PostgreSQL source is part of the standard ClickHouse server. Unlike ODBC sources, no extra driver package is required.

Q: How often does the dictionary refresh from PostgreSQL?

It reloads at a random time within the LIFETIME(MIN ... MAX ...) window, per node. Add an invalidate_query to skip reloads when nothing changed, or run SYSTEM RELOAD DICTIONARY to force one.

Q: My key is a string, not a number. Which layout do I use?

Use COMPLEX_KEY_HASHED (or COMPLEX_KEY_CACHE/COMPLEX_KEY_DIRECT). Plain FLAT/HASHED require a single UInt64 key. Look up complex keys by passing a tuple to dictGet.

Q: Should I use a dictionary or the postgresql() table function?

Use a dictionary for repeated, high-volume key lookups (enrichment in hot queries), where in-memory caching pays off. Use the postgresql() table function for ad-hoc or one-off reads where you want live data and don't need caching.

Q: How do I see why a dictionary failed to load?

Query SELECT name, status, last_exception, loading_duration FROM system.dictionaries. The last_exception column usually contains the underlying PostgreSQL or connection error.

Q: Can I filter or transform the PostgreSQL data before it becomes a dictionary?

Yes. Use the where parameter for a simple filter, or a full custom query to select, rename, or join columns in PostgreSQL before they load. query cannot be combined with table/where.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.