ClickHouse DB::Exception: Sampling not supported

Q: What does SAMPLE 0.1 actually do?

SAMPLE 0.1 reads approximately 10% of the data. ClickHouse uses the sampling key to deterministically select which rows to include. Results are consistent across repeated queries with the same sample ratio.

The "DB::Exception: Sampling not supported" error in ClickHouse occurs when you use a SAMPLE clause in a query against a table that was not created with a sampling key. ClickHouse requires a SAMPLE BY expression in the table definition to enable sampling. Without it, the SAMPLING_NOT_SUPPORTED error is raised.

Impact

The query fails immediately. No data is returned. The table itself is unaffected. This error is particularly common when users try to approximate query results on large datasets without first setting up the table for sampling.

Common Causes

Table created without a SAMPLE BY clause -- the most common cause. The table definition does not include a sampling key.
Using SAMPLE on a non-MergeTree engine -- SAMPLE is only supported by MergeTree family engines, and only when configured with a sampling key.
Querying a view or subquery with SAMPLE -- views and subqueries do not directly support SAMPLE.
Distributed table pointing to tables without sampling keys -- even through a Distributed table, the underlying local tables must have sampling keys.
Confusing SAMPLE with LIMIT -- users sometimes try SAMPLE when they actually want LIMIT or a percentage-based approximation.

Troubleshooting and Resolution Steps

Check if the table has a sampling key:

SELECT sampling_key FROM system.tables
WHERE database = 'default' AND name = 'your_table';

An empty result means sampling is not configured.

If you need sampling, create the table with a SAMPLE BY clause:

CREATE TABLE sampled_table (
    id UInt64,
    event_date Date,
    user_id UInt64,
    data String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (user_id, id)
SAMPLE BY id;  -- id must be part of ORDER BY and should have good distribution

The sampling key must be included in the ORDER BY (sorting key) expression and should ideally be a hash or uniformly distributed value.

Add a sampling key to an existing table (requires data migration):

-- Create new table with sampling key
CREATE TABLE your_table_sampled AS your_table
ENGINE = MergeTree()
ORDER BY (existing_key, intHash32(id))
SAMPLE BY intHash32(id);

-- Migrate data
INSERT INTO your_table_sampled SELECT * FROM your_table;

-- Swap tables
RENAME TABLE your_table TO your_table_old, your_table_sampled TO your_table;

Use alternatives to SAMPLE for approximate results:

-- LIMIT for a fixed number of rows
SELECT * FROM your_table LIMIT 1000;

-- Approximate aggregate functions
SELECT uniqHLL12(user_id) FROM your_table;
SELECT countIf(status = 'active') FROM your_table;

-- WHERE-based sampling using modular arithmetic
SELECT * FROM your_table WHERE cityHash64(id) % 10 = 0;  -- ~10% sample

For Distributed tables, ensure all local tables have the sampling key:

SELECT database, name, sampling_key FROM system.tables
WHERE engine LIKE '%MergeTree%' AND name = 'your_local_table';

Best Practices

Plan for sampling during table design. Adding a sampling key later requires data migration.
Choose a sampling key with good distribution. Using intHash32 or intHash64 on an ID column works well.
The sampling key must be part of the primary key (ORDER BY). Plan your sorting key to accommodate this.
Consider whether you truly need SAMPLE or whether approximate functions or LIMIT would suffice for your use case.
For analytical workloads on very large tables, sampling can dramatically speed up exploratory queries.

Frequently Asked Questions

Q: Can I add a SAMPLE BY to an existing table without recreating it?
A: No. The SAMPLE BY expression must be part of the ORDER BY, and changing the ORDER BY requires recreating the table with data migration. There is no ALTER statement to add a sampling key in place.

Q: What makes a good sampling key?
A: A good sampling key has uniform distribution across its value range. Using a hash function like intHash32(id) or cityHash64(user_id) ensures even distribution regardless of the original data distribution.

Q: What does SAMPLE 0.1 actually do?
A: SAMPLE 0.1 reads approximately 10% of the data. ClickHouse uses the sampling key to deterministically select which rows to include. Results are consistent across repeated queries with the same sample ratio.

Q: Can I use SAMPLE with any table engine?
A: No. SAMPLE is only supported by MergeTree family engines (MergeTree, ReplicatedMergeTree, etc.) that have a SAMPLE BY clause in their definition. Other engines like Log, Memory, or Kafka do not support it.