Capacity planning and storage forecasting in ClickHouse rely on a handful of well-known queries against system.parts, system.parts_columns, and system.projection_parts. These tables expose compressed bytes, uncompressed bytes, row counts, and the number of active parts for every table on the server. Combining them with formatReadableSize gives operators a quick way to identify the largest tables, the columns driving storage cost, and the compression ratios that confirm encoding choices are paying off.
Keep these queries in your DBA toolkit and run them whenever a disk usage alert fires or before you plan a hardware upgrade.
Table Size, Compression, and Part Count
The first query reports compressed and uncompressed size, compression ratio, row count, and part count per table. Filter by database or table with the LIKE patterns.
SELECT
database,
table,
formatReadableSize(sum(data_compressed_bytes) AS size) AS compressed,
formatReadableSize(sum(data_uncompressed_bytes) AS usize) AS uncompressed,
round(usize / size, 2) AS compr_rate,
sum(rows) AS rows,
count() AS part_count
FROM system.parts
WHERE (active = 1) AND (database LIKE '%') AND (table LIKE '%')
GROUP BY database, table
ORDER BY size DESC;
Use this query as the entry point for any storage investigation. The part_count column flags tables with merge backlogs, and compr_rate (uncompressed divided by compressed) shows how effective the codec configuration is.
Column Size and Compression Ratio
When one table dominates storage, drill into its columns. system.parts_columns gives per-column byte counts and the average row size:
SELECT
database,
table,
column,
formatReadableSize(sum(column_data_compressed_bytes) AS size) AS compressed,
formatReadableSize(sum(column_data_uncompressed_bytes) AS usize) AS uncompressed,
round(usize / size, 2) AS compr_ratio,
sum(rows) AS rows_cnt,
round(usize / rows_cnt, 2) AS avg_row_size
FROM system.parts_columns
WHERE (active = 1) AND (database LIKE '%') AND (table LIKE '%')
GROUP BY database, table, column
ORDER BY size DESC;
Columns with poor compression ratios (close to 1.0) are candidates for codec tuning, ordering key adjustments, or different data types. High avg_row_size on a wide String column often points to JSON or free-text payloads that compress better with CODEC(ZSTD) than the default LZ4.
Projection Size
Projections store additional physical layouts of the same data and add to total storage. Query system.projection_parts to see how much space each projection consumes:
SELECT
database,
table,
name,
formatReadableSize(sum(data_compressed_bytes) AS size) AS compressed,
formatReadableSize(sum(data_uncompressed_bytes) AS usize) AS uncompressed,
round(usize / size, 2) AS compr_rate,
sum(rows) AS rows,
count() AS part_count
FROM system.projection_parts
WHERE (table = 'ptest') AND active
GROUP BY database, table, name
ORDER BY size DESC;
Replace 'ptest' with the table you want to inspect. The name column is the projection name as defined in the table DDL.
Database Totals
Aggregating up to the database level helps with quota planning and chargeback:
SELECT
database,
formatReadableSize(sum(data_compressed_bytes)) AS compressed,
formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
sum(rows) AS rows,
count(DISTINCT table) AS tables,
count() AS parts
FROM system.parts
WHERE active
GROUP BY database
ORDER BY sum(data_compressed_bytes) DESC;
Interpreting the Numbers
| Metric | Meaning |
|---|---|
compressed |
Bytes on disk for active parts |
uncompressed |
Original byte size before codecs |
compr_rate / compr_ratio |
Uncompressed divided by compressed, higher is better |
part_count |
Active data parts, watch for thousands per partition |
avg_row_size |
Uncompressed bytes per row for that column |
A healthy ClickHouse table typically shows compr_rate above 3 for analytics workloads. Time-series workloads with sorted timestamps and low-cardinality dimensions often exceed 10.
Common Pitfalls
system.partsandsystem.parts_columnsinclude both active and inactive parts. Always filteractive = 1or you double-count parts that are about to be removed after a merge.- These queries run on a single node. For a cluster-wide total, wrap them in
clusterAllReplicasand aggregate across hosts. - Compressed bytes do not include the marks files or primary index, so on-disk usage from
duwill be slightly larger than the query reports. formatReadableSizereturns a string, which is convenient for humans but not sortable. Sort by the raw byte total (the inner aliassizeworks because it is thesum(...)value).- Projection size is reported separately from the parent table. Add
system.projection_partsto capacity reports if your tables use projections.
Frequently Asked Questions
Q: How do I find the largest table in a ClickHouse server?
A: Run the table size query and sort by size DESC with LIMIT 10. The first row is the largest active table on that server.
Q: Why is compressed size larger than expected?
A: Possible causes include too many small parts (overhead per part), poor codec choice for the data type, or unsorted data that defeats run-length and delta codecs. Check part_count first.
Q: How do I see size per partition?
A: Add partition to the GROUP BY and SELECT clauses, then filter on the table of interest. system.parts exposes the partition name directly.
Q: Do these queries include data still in memory?
A: No. Buffer table and pending insert blocks are not counted. They only become visible to system.parts after the next flush to disk.
Q: How do I get total cluster size, not per-node?
A: Wrap the inner SELECT in clusterAllReplicas('cluster_name', ...). Be aware that ReplicatedMergeTree tables will be counted once per replica, so divide by the replication factor for the logical total.