The count(DISTINCT) function in ClickHouse is an aggregation function used to count the number of unique (distinct) values in a column or expression. By default, it provides an exact count of distinct elements using the uniqExact implementation, though this behavior can be configured via the count_distinct_implementation setting. It's useful for calculating metrics like unique users, unique products purchased, or distinct categories in a dataset.
Syntax
count(DISTINCT expr)
Implementation Details
The behavior of count(DISTINCT) is controlled by the count_distinct_implementation setting, which determines which underlying function performs the operation. The available options are:
uniqExact(default) - Provides exact countsuniq- Uses adaptive sampling for approximate countsuniqCombined- Uses a combination of array, hash table, and HyperLogLoguniqCombined64- Like uniqCombined but for 64-bit hashesuniqHLL12- Uses HyperLogLog algorithm
You can change this setting to trade accuracy for performance:
SET count_distinct_implementation = 'uniq';
SELECT count(DISTINCT user_id) FROM events;
count_distinct_implementation Setting Documentation
Example Usage
-- Count distinct users
SELECT count(DISTINCT user_id) FROM user_events;
-- Count distinct products sold per category
SELECT
category,
count(DISTINCT product_id) AS unique_products
FROM sales
GROUP BY category;
-- Count distinct values with filter
SELECT count(DISTINCT customer_id)
FROM orders
WHERE order_date >= '2024-01-01';
-- Multiple distinct counts in one query
SELECT
count(DISTINCT user_id) AS unique_users,
count(DISTINCT session_id) AS unique_sessions,
count(*) AS total_events
FROM analytics;
Common Issues
- Performance on high-cardinality columns: Counting distinct values on columns with millions of unique entries can be slow and memory-intensive.
- Memory consumption: Large distinct counts require significant memory to track all unique values during aggregation.
- Distributed query overhead: On distributed tables, each shard must communicate its unique values, creating network overhead.
- NULL handling confusion:
count(DISTINCT column)excludes NULL values, which might not be the expected behavior.
Best Practices
- Use approximate functions for large datasets: For high-cardinality columns, configure
count_distinct_implementationto useuniqoruniqCombined, or call these functions directly for much better performance when approximate results are acceptable. - Add appropriate filters: Reduce the dataset size with WHERE clauses before counting distinct values to improve performance.
- Consider sampling: For exploratory analysis on very large datasets, use sampling to get quick estimates.
- Index optimization: Ensure proper indexing on columns used in filters to speed up data retrieval before aggregation.
- Monitor memory usage: Watch memory consumption when running distinct counts on large datasets, especially on distributed clusters.
- Materialize results: For frequently accessed distinct counts, consider materializing results in summary tables.
Frequently Asked Questions
Q: What's the difference between count(DISTINCT) and uniq()?
A: By default, count(DISTINCT) uses uniqExact to provide an exact count of unique values. When called directly, uniq() uses an adaptive sampling algorithm to provide an approximate count with typically less than 2% error, and is much faster and uses less memory on large datasets. You can configure count(DISTINCT) to use uniq or other implementations via the count_distinct_implementation setting.
Q: Can I use count(DISTINCT) with multiple columns?
A: Yes, you can count distinct combinations of multiple columns: count(DISTINCT (column1, column2)). This counts unique pairs of values from both columns.
Q: How does count(DISTINCT) handle NULL values?
A: count(DISTINCT column) excludes NULL values from the count. If you need to include NULLs, use count(DISTINCT ifNull(column, 'some_default_value')).
Q: Why is my count(DISTINCT) query running out of memory?
A: High-cardinality columns with millions of unique values require ClickHouse to track all unique values in memory. Consider using approximate functions like uniq(), adding filters to reduce the dataset, or increasing the max_memory_usage setting.
Q: What's the best alternative to count(DISTINCT) for better performance?
A: You can configure count(DISTINCT) via the count_distinct_implementation setting, or call the functions directly. For approximate counts, use uniq() (adaptive sampling) for good speed and accuracy balance, uniqCombined() for a hybrid approach, or uniqHLL12() (HyperLogLog) for very high cardinality. For exact counts, uniqExact() is the default implementation. Choose based on your accuracy requirements and data characteristics.