ClickHouse SAMPLE Clause: Fast Approximate Queries on Large Tables

The SAMPLE clause in ClickHouse lets you query a fraction of a table's data, returning approximate results dramatically faster than a full scan. For exploratory analytics on multi-billion row tables, sampling turns minute-long queries into sub-second responses.

How Sampling Works

ClickHouse sampling works at the granule level — it selects a deterministic subset of granules (data blocks, typically 8,192 rows) rather than randomly picking individual rows. This is fast because it skips entire blocks of data rather than reading everything and filtering.

For sampling to work, the table must be created with a SAMPLE BY expression in the table definition.

Setting Up a Table for Sampling

Table Definition with SAMPLE BY

CREATE TABLE events
(
    event_date Date,
    user_id UInt64,
    event_type String,
    amount Decimal(10, 2),
    properties String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, intHash32(user_id))
SAMPLE BY intHash32(user_id);

Key requirements:

  • The SAMPLE BY expression must be included in the ORDER BY key
  • The expression should produce well-distributed values (hash functions are ideal)
  • intHash32 or intHash64 on a high-cardinality column (like user_id) ensures uniform sampling

Why a Hash Function?

If you sample by user_id directly, the sample would take consecutive ID ranges (e.g., user_id 1–1000), which may not represent the full data distribution. A hash function ensures that sampled users are uniformly distributed across the entire population.
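A quick way to see why: consecutive IDs hash to widely scattered values. This sketch uses the numbers table function to show intHash32 output for sequential IDs:

SELECT
    number AS user_id,
    intHash32(number) AS hash_value
FROM numbers(5);

Adjacent IDs land in distant parts of the 32-bit hash space, so any contiguous hash range — which is what SAMPLE selects — contains a near-uniform mix of users.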

SAMPLE Syntax

Relative Sampling (Fraction)

Sample approximately 10% of the data:

SELECT
    count() AS sampled_count,
    count() * 10 AS estimated_total,
    avg(amount) AS avg_amount
FROM events
SAMPLE 0.1;

The SAMPLE 0.1 clause reads ~10% of granules. Multiply counts by the reciprocal (10x) to estimate totals.

ClickHouse does not rescale aggregate results automatically when sampling — count() returns the count over the sample only. To estimate totals, multiply by the reciprocal of the fraction yourself, or use the _sample_factor virtual column:

-- _sample_factor holds the reciprocal of the effective sample fraction
-- (10 for SAMPLE 0.1); summing it estimates the total row count
SELECT
    sum(_sample_factor) AS estimated_count,
    avg(amount) AS avg_amount  -- avg is naturally unbiased
FROM events
SAMPLE 0.1;

Absolute Sampling (Row Count)

Sample approximately 10,000 rows:

SELECT avg(amount)
FROM events
SAMPLE 10000;

ClickHouse calculates the fraction needed to produce approximately the requested number of rows. The actual count may vary slightly since sampling operates on granules.
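With absolute sampling the effective fraction is chosen at query time, so you cannot hard-code a multiplier the way SAMPLE 0.1 allows. The _sample_factor virtual column carries the dynamically computed coefficient per row; summing it estimates the total row count (a sketch against the events table above):

SELECT sum(_sample_factor) AS estimated_total_rows
FROM events
SAMPLE 10000;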

Offset Sampling

Use OFFSET to select a different sample segment (useful for parallel sampling or validation):

-- Sample 10%, starting at the 50% mark
SELECT avg(amount) FROM events SAMPLE 0.1 OFFSET 0.5;

The OFFSET shifts the starting point of the sample range. SAMPLE 0.1 OFFSET 0 and SAMPLE 0.1 OFFSET 0.5 return different, non-overlapping subsets.
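One practical use is cross-validation: compare an estimate across disjoint segments, and if they disagree badly, the sample fraction is too small. A sketch:

-- First 10% of the hash space
SELECT avg(amount) FROM events SAMPLE 0.1;

-- A disjoint 10% segment, starting at the 10% mark
SELECT avg(amount) FROM events SAMPLE 0.1 OFFSET 0.1;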

Practical Examples

Exploratory Analysis on Billion-Row Tables

-- Quick count estimate
SELECT sum(_sample_factor) AS estimated_events
FROM events
SAMPLE 0.01
WHERE event_date >= '2025-01-01';

-- Distribution analysis
SELECT
    event_type,
    sum(_sample_factor) AS estimated_count,
    avg(amount) AS avg_amount
FROM events
SAMPLE 0.05
GROUP BY event_type
ORDER BY estimated_count DESC;

A/B Test Validation

-- Compare metrics across treatment groups using a sample
SELECT
    experiment_group,
    uniq(user_id) AS unique_users,
    avg(conversion_value) AS avg_conversion
FROM experiment_events
SAMPLE 0.1
GROUP BY experiment_group;

Since sampling is hash-based on user_id, the same users appear consistently across repeated queries — useful for reproducibility.

Dashboard Queries with Speed Priority

-- Real-time dashboard: approximate unique visitors per hour
SELECT
    toStartOfHour(event_time) AS hour,
    uniq(user_id) * 5 AS approx_unique_visitors  -- scaled by 1/0.2 for SAMPLE 0.2
FROM page_views
SAMPLE 0.2
WHERE event_date = today()
GROUP BY hour
ORDER BY hour;

When to Use SAMPLE

Scenario                           | Use SAMPLE? | Why
Exploratory data analysis          | Yes         | Speed matters more than precision
Real-time dashboards (approximate) | Yes         | Sub-second response on billions of rows
Exact billing/financial reports    | No          | Need exact numbers
Debugging specific events          | No          | Need to find exact records
Trend analysis                     | Yes         | Trends are visible in samples
Cardinality estimation             | Maybe       | uniq() is already approximate; combining with SAMPLE compounds error
Small tables (< 10M rows)          | No          | Full scan is already fast

Accuracy Considerations

Statistical Error

Relative sampling error scales as roughly 1 / sqrt(sample_size). For a 1% sample of 1 billion rows:

  • Sample size: 10 million rows
  • Expected error: ~0.03% (1/sqrt(10M))

For a 1% sample of 10 million rows:

  • Sample size: 100,000 rows
  • Expected error: ~0.3%

Averages and ratios are naturally unbiased under sampling. Counts and sums must be scaled up: multiply by the reciprocal of the sample fraction, or equivalently use _sample_factor (e.g. sum(amount * _sample_factor) for an estimated total).
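
You can also estimate the uncertainty of a sampled average directly in the query. A minimal sketch, assuming the events table from earlier — the standard error of the mean is the sample standard deviation divided by sqrt(n):

SELECT
    avg(amount) AS avg_amount,
    stddevSamp(amount) / sqrt(count()) AS approx_standard_error
FROM events
SAMPLE 0.01;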

Deterministic Results

ClickHouse sampling is deterministic — the same SAMPLE fraction on the same data returns the same rows, because selection is based on fixed ranges of the sampling key's hash space. This is useful for reproducibility, but it also means different SAMPLE fractions are not independent: the rows of SAMPLE 0.1 are a subset of the rows of SAMPLE 0.2. Use OFFSET to obtain disjoint subsets.

Combining SAMPLE with Other Features

SAMPLE + WHERE

SELECT avg(amount)
FROM events
SAMPLE 0.1
WHERE event_type = 'purchase' AND event_date >= '2025-01-01';

SAMPLE reduces data read before WHERE is applied, so the combination is very efficient.

SAMPLE + Approximate Functions

You can combine sampling with approximate aggregation functions for maximum speed:

SELECT
    uniq(user_id) AS approx_users,
    quantileTDigest(0.99)(response_time) AS approx_p99
FROM events
SAMPLE 0.05;

Be aware that this compounds approximation error — both sampling and the approximate function introduce uncertainty.

Frequently Asked Questions

Q: Can I add SAMPLE BY to an existing table?

Not in place, in most cases. ALTER TABLE ... MODIFY SAMPLE BY can change the sampling expression, but only to one that is already part of the primary key — and existing tables rarely have a hash column in their ORDER BY. The practical path is to create a new table with SAMPLE BY and re-insert the data.
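
A sketch of that migration, reusing the events schema from the setup section (events_v2 is a hypothetical name):

-- New table with the hash in both ORDER BY and SAMPLE BY
CREATE TABLE events_v2
(
    event_date Date,
    user_id UInt64,
    event_type String,
    amount Decimal(10, 2),
    properties String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, intHash32(user_id))
SAMPLE BY intHash32(user_id);

-- Backfill from the original table
INSERT INTO events_v2 SELECT * FROM events;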

Q: Does SAMPLE work with distributed tables?

Yes. Each shard samples its local data independently. The combined result approximates the requested fraction across the full dataset.
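
A minimal sketch — the cluster name my_cluster and database default are placeholders for illustration:

-- Distributed table routing queries to the local events table on each shard
CREATE TABLE events_all AS events
ENGINE = Distributed(my_cluster, default, events);

-- Each shard applies SAMPLE 0.1 to its own data before results are merged
SELECT avg(amount) FROM events_all SAMPLE 0.1;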

Q: Why does my sampled count not exactly match the expected fraction?

Sampling operates on granules, not individual rows. If you request SAMPLE 0.1, ClickHouse selects granules whose hash ranges cover ~10% of the hash space. The actual row count varies depending on data distribution across granules.
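
You can measure the actual fraction directly (a sketch; ClickHouse allows reusing select-list aliases in later expressions):

SELECT
    count() AS sampled_rows,
    (SELECT count() FROM events) AS total_rows,
    sampled_rows / total_rows AS actual_fraction
FROM events
SAMPLE 0.1;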

Q: Can I use SAMPLE in INSERT...SELECT?

Yes. This is useful for creating sample tables for development or testing:

INSERT INTO events_sample
SELECT * FROM events SAMPLE 0.01;

Q: What's the performance benefit of SAMPLE?

Roughly proportional to the sample fraction. SAMPLE 0.1 reads ~10% of the data, so queries run ~10x faster. The actual speedup depends on whether the query is I/O-bound (benefits fully) or CPU-bound (benefits partially).
