Mean, variance, and standard deviation describe where a distribution sits and how wide it is, but they say nothing about its shape. ClickHouse exposes four aggregate functions for shape analysis: skewPop and skewSamp measure asymmetry, while kurtPop and kurtSamp measure tailedness (how heavy the tails are relative to a normal distribution).
These are useful for outlier detection, data-quality checks, and validating whether a column is normally distributed before you apply techniques that assume it is. This guide covers each function's syntax, the sample-vs-population distinction, and the edge cases that trip people up.
Skewness: Measuring Asymmetry
Skewness quantifies how lopsided a distribution is. A perfectly symmetric distribution has a skewness of 0. A positive value means a longer right tail (most values cluster low with a few large outliers); a negative value means a longer left tail.
skewPop
skewPop(expr) computes the population skewness of a sequence. Use it when your rows represent the entire population you care about.
SELECT skewPop(response_time_ms) AS skewness
FROM requests;
- Argument:
expr— an expression returning a number. - Return type:
Float64. - A symmetric input returns
0; a right-skewed input returns a positive value.
skewSamp
skewSamp(expr) computes the sample skewness and provides an unbiased estimate of the skewness of a random variable when the passed values form its sample.
SELECT skewSamp(latency_ms) AS sample_skewness
FROM request_sample;
- Return type:
Float64. - If the sample size is
1or less, the function returnsnan.
Both functions were introduced in ClickHouse v20.1, so they are available in all currently supported releases (24.x/25.x).
Kurtosis: Measuring Tailedness
Kurtosis describes how heavy a distribution's tails are. ClickHouse reports standard (non-excess) kurtosis, where a normal distribution has a value of 3. Higher values indicate heavier tails and more frequent extreme outliers; lower values indicate lighter tails. If you want excess kurtosis (the more common convention in statistics textbooks, where normal = 0), subtract 3 yourself.
kurtPop
kurtPop(expr) computes the population kurtosis of a sequence.
SELECT kurtPop(value) AS kurtosis
FROM numbers(1, 10);
-- ≈ 1.776
- Argument:
expr— a numeric expression of type(U)Int*,Float*, orDecimal. - Return type:
Float64.
kurtSamp
kurtSamp(expr) computes the sample kurtosis and represents an unbiased estimate of the kurtosis of a random variable when the passed values form its sample.
SELECT kurtSamp(value) AS sample_kurtosis
FROM numbers(1, 10);
-- ≈ 1.44
- Return type:
Float64. - If the sample size
n <= 1, the function returnsnan.
Pop vs. Samp: Which to Use
The Pop variants treat the data as the complete population. The Samp variants apply a correction so the result is an unbiased estimate of the underlying population's skewness/kurtosis from a finite sample — analogous to the difference between `varPop` and `varSamp`, or `stddevPop` and stddevSamp.
| Aspect | skewPop / kurtPop |
skewSamp / kurtSamp |
|---|---|---|
| Interpretation of input | Entire population | A sample drawn from a larger population |
| Bias correction | None | Unbiased estimator for the population statistic |
| Behavior on tiny inputs | Returns a value (can be unstable) | Returns nan when n <= 1 |
| Typical use | You have all the data (e.g. every event in a window) | You have a subset and want to infer the population |
Rule of thumb: if your query already operates over the full set of rows you care about (a complete day of events, every sensor reading), use the Pop variants. If you are sampling — for example with SAMPLE or a filtered subset — and want to generalize, use the Samp variants.
Computing Shape Statistics Together
Because these are ordinary aggregate functions, you can compute the full distribution profile of a column in a single scan alongside mean, variance, and standard deviation:
SELECT
count() AS n,
avg(response_time_ms) AS mean,
stddevPop(response_time_ms) AS stddev,
skewPop(response_time_ms) AS skewness,
kurtPop(response_time_ms) AS kurtosis
FROM requests
WHERE event_date = today();
A common normality heuristic: a roughly normal distribution has skewness near 0 and (non-excess) kurtosis near 3. A skewness far from 0 or a kurtosis well above 3 signals asymmetry or heavy tails — often a sign of outliers worth investigating.
Per-Group Distribution Analysis
Combine with GROUP BY to compare distribution shape across segments — for example, to find endpoints whose latency distribution is unusually skewed:
SELECT
endpoint,
count() AS requests,
round(skewPop(latency_ms), 3) AS skewness,
round(kurtPop(latency_ms), 3) AS kurtosis
FROM requests
WHERE event_date >= today() - 7
GROUP BY endpoint
HAVING requests > 1000
ORDER BY skewness DESC
LIMIT 20;
Endpoints with high positive skewness and elevated kurtosis have occasional very slow requests dragging out the right tail — exactly where percentile latency (`quantile`) tells a different story from the average.
Common Issues
- NULLs and non-numeric values: like other aggregate functions, these ignore
NULLinputs. Ensure the expression is numeric; cast strings explicitly before aggregating. nanresults fromSampvariants: if a group hasn <= 1,skewSampandkurtSampreturnnan. Filter small groups withHAVING count() > 1(or a larger threshold) when you need stable estimates.- Constant or near-constant columns: when all values are identical the variance is zero and skewness/kurtosis are undefined or unstable. Guard against this in data-quality pipelines.
- Kurtosis convention confusion: ClickHouse returns non-excess kurtosis (normal ≈
3). If you compare against a library that reports excess kurtosis (normal ≈0), subtract3from the ClickHouse result before comparing. - Outlier sensitivity: both statistics, especially kurtosis, are highly sensitive to extreme outliers because they depend on third and fourth powers of deviations from the mean. A single bad value can dominate the result.
Best Practices
- Pick Pop vs. Samp deliberately. Don't default to one out of habit — choose based on whether your rows are a population or a sample. The difference matters most on small datasets.
- Compute shape statistics in one pass. ClickHouse evaluates multiple aggregates over a single scan, so add
skewPop/kurtPopto existing summary queries rather than running separate jobs. - Pair with percentiles for latency work. Skewness and kurtosis describe shape; `quantile` and `quantileTDigest` give you the concrete tail values that matter for SLAs.
- Use combinators for filtered analysis. Apply
-Ifto scope the calculation, e.g.skewPopIf(value, status = 200), to profile only the rows you care about. - Set a minimum group size. Require enough rows per group (
HAVING count() > N) before trusting shape statistics — small samples produce noisy, sometimesnan, results.
How Pulse Helps
Distribution shape is a leading indicator of trouble: a latency column that suddenly becomes more skewed or more heavy-tailed often means a subset of requests is degrading before averages move. Pulse monitors ClickHouse query performance and data patterns continuously, surfacing the kind of tail behavior that skewness and kurtosis quantify — and it does so without you having to hand-roll and schedule every diagnostic query. Pulse is operated by ClickHouse experts who help teams interpret these signals, tune aggregation-heavy workloads, and keep production clusters healthy. If you are running statistical analysis at scale on ClickHouse, Pulse provides the monitoring and expert guidance to act on what the numbers reveal.
Frequently Asked Questions
Q: What is the difference between skewPop and skewSamp?
skewPop computes population skewness, treating your rows as the complete population. skewSamp computes sample skewness, applying a correction to give an unbiased estimate of the population's skewness from a finite sample. Use skewSamp when your data is a subset you want to generalize; skewPop when you have all the data.
Q: Does ClickHouse return excess kurtosis or standard kurtosis?
kurtPop and kurtSamp return standard (non-excess) kurtosis, where a normal distribution is approximately 3. To get excess kurtosis (normal = 0), subtract 3 from the result.
Q: Why do kurtSamp and skewSamp return nan?
The sample variants return nan when the sample size is 1 or less, because an unbiased estimate cannot be computed from a single value. Ensure each group has at least two rows, e.g. with HAVING count() > 1.
Q: Can I compute skewness and kurtosis per group?
Yes. They are standard aggregate functions, so they work with GROUP BY and with combinators like -If and -State. This lets you compare distribution shape across endpoints, sensors, tenants, or time buckets in a single query.
Q: How do these compare to standard deviation and variance?
`stddevPop` and `varPop` measure spread — how far values are from the mean. Skewness measures asymmetry and kurtosis measures tailedness. Together they describe the full shape of a distribution. For very large datasets where you also need percentile estimates, see approximate methods.
Q: Are skewness and kurtosis sensitive to outliers?
Very. Both depend on high powers (third for skewness, fourth for kurtosis) of deviations from the mean, so a single extreme value can dominate the result. This sensitivity is exactly why they are useful for outlier and anomaly detection, but it also means you should validate inputs before drawing conclusions.