The SAMPLE clause in ClickHouse lets you query a fraction of a table's data, returning approximate results dramatically faster than a full scan. For exploratory analytics on multi-billion row tables, sampling turns minute-long queries into sub-second responses.
How Sampling Works
ClickHouse sampling works at the granule level — it selects a deterministic subset of granules (data blocks, typically 8,192 rows) rather than randomly picking individual rows. This is fast because it skips entire blocks of data rather than reading everything and filtering.
For sampling to work, the table must be created with a SAMPLE BY expression in the table definition.
Setting Up a Table for Sampling
Table Definition with SAMPLE BY
CREATE TABLE events
(
event_date Date,
user_id UInt64,
event_type String,
amount Decimal(10, 2),
properties String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, intHash32(user_id))
SAMPLE BY intHash32(user_id);
Key requirements:
- The SAMPLE BY expression must be included in the ORDER BY key
- The expression should produce well-distributed values (hash functions are ideal)
- intHash32 or intHash64 on a high-cardinality column (like user_id) ensures uniform sampling
Why a Hash Function?
If you sample by user_id directly, the sample would take consecutive ID ranges (e.g., user_id 1–1000), which may not represent the full data distribution. A hash function ensures that sampled users are uniformly distributed across the entire population.
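To see why, compare a few consecutive IDs against their hashes (a minimal sketch using ClickHouse's numbers table function):
-- Consecutive IDs land at widely scattered points in the 32-bit hash space
SELECT
number AS user_id,
intHash32(number) AS hash_value
FROM numbers(5);
Sampling a prefix of the hash range therefore picks users spread evenly across the whole ID space.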
SAMPLE Syntax
Relative Sampling (Fraction)
Sample approximately 10% of the data:
SELECT
count() AS sampled_count,
count() * 10 AS estimated_total,
avg(amount) AS avg_amount
FROM events
SAMPLE 0.1;
The SAMPLE 0.1 clause reads ~10% of granules. Multiply counts by the reciprocal (10x) to estimate totals.
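To confirm the read reduction on your own data, EXPLAIN with index details reports how many granules the query selects (a sketch; the exact output format varies by ClickHouse version):
EXPLAIN indexes = 1
SELECT count()
FROM events
SAMPLE 0.1;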
ClickHouse does not rescale aggregate results automatically when sampling; scale them yourself, or read the scaling coefficient from the _sample_factor virtual column:
-- count() returns only the sampled count
-- To estimate the true total, scale by the _sample_factor virtual column
SELECT
sum(_sample_factor) AS estimated_count, -- each sampled row stands in for _sample_factor rows
avg(amount) AS avg_amount -- avg is naturally unbiased
FROM events
SAMPLE 0.1;
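Sums rescale the same way, by weighting each sampled row with _sample_factor (a sketch against the events table above):
SELECT sum(amount * _sample_factor) AS estimated_total_amount
FROM events
SAMPLE 0.1;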
Absolute Sampling (Row Count)
Sample approximately 10,000 rows:
SELECT avg(amount)
FROM events
SAMPLE 10000;
ClickHouse calculates the fraction needed to produce approximately the requested number of rows. The actual count may vary slightly since sampling operates on granules.
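You can observe this by counting what the sample actually returns; expect a granule-aligned number near, but rarely exactly, 10,000:
SELECT count() AS actual_rows
FROM events
SAMPLE 10000;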
Offset Sampling
Use OFFSET to select a different sample segment (useful for parallel sampling or validation):
-- Sample 10%, starting at the 50% mark
SELECT avg(amount) FROM events SAMPLE 0.1 OFFSET 0.5;
The OFFSET shifts the starting point of the sample range. SAMPLE 0.1 OFFSET 0 and SAMPLE 0.1 OFFSET 0.5 return different, non-overlapping subsets.
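For example, to cross-check an estimate against a second, disjoint segment (a sketch using the events table above):
-- Two non-overlapping 10% segments; similar averages suggest a stable estimate
SELECT 'segment_a' AS segment, avg(amount) AS avg_amount
FROM events SAMPLE 0.1 OFFSET 0
UNION ALL
SELECT 'segment_b' AS segment, avg(amount) AS avg_amount
FROM events SAMPLE 0.1 OFFSET 0.5;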
Practical Examples
Exploratory Analysis on Billion-Row Tables
-- Quick count estimate
SELECT sum(_sample_factor) AS estimated_events
FROM events
SAMPLE 0.01
WHERE event_date >= '2025-01-01';
-- Distribution analysis
SELECT
event_type,
sum(_sample_factor) AS estimated_count,
avg(amount) AS avg_amount
FROM events
SAMPLE 0.05
GROUP BY event_type
ORDER BY estimated_count DESC;
A/B Test Validation
-- Compare metrics across treatment groups using a sample
SELECT
experiment_group,
uniq(user_id) AS unique_users,
avg(conversion_value) AS avg_conversion
FROM experiment_events
SAMPLE 0.1
GROUP BY experiment_group;
Since sampling is hash-based on user_id, the same users appear consistently across repeated queries — useful for reproducibility.
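A quick way to verify that reproducibility (a sketch against the experiment_events table above) is to confirm that a second run contains exactly the users of the first:
-- Returns 0: every user from one sampled run appears in the next
SELECT count() AS users_missing_from_second_run
FROM (SELECT DISTINCT user_id FROM experiment_events SAMPLE 0.1)
WHERE user_id NOT IN (SELECT user_id FROM experiment_events SAMPLE 0.1);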
Dashboard Queries with Speed Priority
-- Real-time dashboard: approximate unique visitors per hour
SELECT
toStartOfHour(event_time) AS hour,
uniq(user_id) AS approx_unique_visitors
FROM page_views
SAMPLE 0.2
WHERE event_date = today()
GROUP BY hour
ORDER BY hour;
When to Use SAMPLE
| Scenario | Use SAMPLE? | Why |
|---|---|---|
| Exploratory data analysis | Yes | Speed matters more than precision |
| Real-time dashboards (approximate) | Yes | Sub-second response on billions of rows |
| Exact billing/financial reports | No | Need exact numbers |
| Debugging specific events | No | Need to find exact records |
| Trend analysis | Yes | Trends are visible in samples |
| Cardinality estimation | Maybe | uniq() is already approximate; combining with SAMPLE compounds error |
| Small tables (< 10M rows) | No | Full scan is already fast |
Accuracy Considerations
Statistical Error
Relative sampling error scales as approximately 1 / sqrt(sample_size). For a 1% sample of 1 billion rows:
- Sample size: 10 million rows
- Expected error: ~0.03% (1/sqrt(10M))
For a 1% sample of 10 million rows:
- Sample size: 100,000 rows
- Expected error: ~0.3%
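The arithmetic is easy to reproduce in ClickHouse itself:
SELECT
1 / sqrt(10000000) AS rel_error_10m_sample, -- ~0.000316, i.e. ~0.03%
1 / sqrt(100000) AS rel_error_100k_sample; -- ~0.00316, i.e. ~0.3%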
Averages and ratios are naturally unbiased by sampling. Counts and sums must be scaled up by _sample_factor (the reciprocal of the sample fraction).
Deterministic Results
ClickHouse sampling is deterministic: the same SAMPLE fraction on the same data returns the same rows, because selection is driven by the hash of the sampling key rather than by randomness. This is useful for reproducibility, but it means different SAMPLE fractions do not give independent samples (SAMPLE 0.1 is a subset of SAMPLE 0.2). Use OFFSET to obtain non-overlapping subsets.
Combining SAMPLE with Other Features
SAMPLE + WHERE
SELECT avg(amount)
FROM events
SAMPLE 0.1
WHERE event_type = 'purchase' AND event_date >= '2025-01-01';
SAMPLE reduces data read before WHERE is applied, so the combination is very efficient.
SAMPLE + Approximate Functions
You can combine sampling with approximate aggregation functions for maximum speed:
SELECT
uniq(user_id) AS approx_users,
quantileTDigest(0.99)(response_time) AS approx_p99
FROM events
SAMPLE 0.05;
Be aware that this compounds approximation error — both sampling and the approximate function introduce uncertainty.
Frequently Asked Questions
Q: Can I add SAMPLE BY to an existing table?
Only in limited cases. The SAMPLE BY expression must be part of the ORDER BY key, which is fixed at table creation. If the desired expression already appears in the existing ORDER BY key, newer ClickHouse versions support ALTER TABLE ... MODIFY SAMPLE BY; otherwise, create a new table with SAMPLE BY and re-insert the data.
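Both routes, sketched below; events_with_sampling stands for a hypothetical table created like the earlier example, with SAMPLE BY declared:
-- Route 1: intHash32(user_id) is already in the ORDER BY key, so only the sampling key changes
ALTER TABLE events MODIFY SAMPLE BY intHash32(user_id);
-- Route 2: rebuild into a table declared with SAMPLE BY, then copy the data
INSERT INTO events_with_sampling SELECT * FROM events;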
Q: Does SAMPLE work with distributed tables?
Yes. Each shard samples its local data independently. The combined result approximates the requested fraction across the full dataset.
Q: Why does my sampled count not exactly match the expected fraction?
Sampling operates on granules, not individual rows. If you request SAMPLE 0.1, ClickHouse selects granules whose hash ranges cover ~10% of the hash space. The actual row count varies depending on data distribution across granules.
Q: Can I use SAMPLE in INSERT...SELECT?
Yes. This is useful for creating sample tables for development or testing:
INSERT INTO events_sample
SELECT * FROM events SAMPLE 0.01;
Q: What's the performance benefit of SAMPLE?
Roughly proportional to the sample fraction. SAMPLE 0.1 reads ~10% of the data, so queries run ~10x faster. The actual speedup depends on whether the query is I/O-bound (benefits fully) or CPU-bound (benefits partially).
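To measure the speedup on your own data, time the same aggregate with and without sampling (results depend entirely on your hardware and data layout):
-- Full scan baseline
SELECT avg(amount) FROM events;
-- Reads ~10% of granules; expect roughly a 10x reduction in data scanned
SELECT avg(amount) FROM events SAMPLE 0.1;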