ClickHouse is built from the ground up for online analytical processing (OLAP). Its architecture makes fundamentally different trade-offs than row-oriented databases — optimizing for scan-heavy analytical queries over massive datasets rather than single-row transactional operations.
This guide walks through the key architectural components that make ClickHouse fast, how they interact, and what you need to understand to get the most out of your deployments.
Columnar Storage
The foundation of ClickHouse's performance is columnar storage. Instead of storing data row by row (like PostgreSQL or MySQL), ClickHouse stores each column in its own file on disk.
Why this matters for analytics:
- Reduced I/O: A query that reads 5 columns from a 200-column table touches only those 5 columns on disk, not the entire dataset. A row-oriented database would read all 200 columns for every row.
- Better compression: Values in a single column tend to be similar (same data type, similar ranges, repeated values), which compresses dramatically better than mixed-type row data. Compression ratios of 10–20x are common.
- CPU cache efficiency: Processing a contiguous array of same-type values leverages CPU caches and SIMD instructions far more effectively than jumping between fields in a row.
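To make the trade-off concrete, here is a minimal sketch with a hypothetical wide table (the table and column names are illustrative, not from any particular schema):

```sql
-- Hypothetical wide events table; imagine ~200 columns in practice.
CREATE TABLE events
(
    event_time DateTime,
    user_id    UInt64,
    country    LowCardinality(String),
    revenue    Float64
    -- ... many more columns ...
)
ENGINE = MergeTree
ORDER BY (country, event_time);

-- This query reads only the country, event_time, and revenue columns from disk;
-- every other column in the table is left untouched.
SELECT country, sum(revenue)
FROM events
WHERE event_time >= now() - INTERVAL 7 DAY
GROUP BY country;
```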
The MergeTree Engine Family
MergeTree is ClickHouse's workhorse table engine. Nearly every production ClickHouse table uses a MergeTree variant. Understanding how it works is essential.
Parts
When you insert data into a MergeTree table, ClickHouse writes it as an immutable data part — a self-contained directory on disk containing column files, an index, and metadata. Each insert creates a new part.
Parts are never modified in place. This append-only design is why ClickHouse achieves high write throughput — there's no row-level locking or in-place updates.
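Continuing with the hypothetical events table from the sketch above, you can watch parts being created by inspecting the built-in system.parts table:

```sql
-- Each INSERT produces its own immutable part on disk.
INSERT INTO events VALUES (now(), 1, 'US', 9.99);
INSERT INTO events VALUES (now(), 2, 'DE', 4.50);

-- One row per part; 'active' parts are the ones queries currently read.
SELECT name, rows, active
FROM system.parts
WHERE database = currentDatabase() AND table = 'events'
ORDER BY modification_time DESC;
```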
Background Merges
Over time, ClickHouse's background merge process combines smaller parts into larger ones. This is the "Merge" in MergeTree. Merges:
- Reduce the number of parts (a high part count degrades query performance)
- Apply data transformations specific to the engine variant (deduplication in ReplacingMergeTree, aggregation in AggregatingMergeTree, etc.)
- Maintain sort order within the merged result
You don't control when merges happen — ClickHouse manages this automatically. But you need to monitor part counts; if inserts create parts faster than merges can consolidate them, you'll hit the "too many parts" error.
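A simple way to monitor this is to count active parts per partition from system.parts; the query below is a minimal sketch you can adapt to your own alerting:

```sql
-- Partitions with persistently high part counts suggest that inserts
-- are outpacing background merges.
SELECT database, table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY active_parts DESC
LIMIT 10;
```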
Partitions
Partitions are logical divisions of a table, typically by time (month, day) or another dimension. Each partition contains its own set of parts.
Partitions enable:
- Efficient data pruning: Queries with a WHERE clause on the partition key skip entire partitions
- Easy data management: You can DROP or DETACH entire partitions for fast bulk deletes
- TTL enforcement: ClickHouse uses partitions to expire old data
A common mistake is over-partitioning — creating too many partitions (e.g., partitioning by hour on a table with daily queries) leads to too many small parts and degraded performance. Keep partition counts in the low hundreds, not thousands.
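As a sketch, monthly partitioning on a hypothetical metrics table looks like this; dropping a partition is then a cheap metadata operation rather than a row-by-row delete:

```sql
CREATE TABLE metrics
(
    ts     DateTime,
    device UInt32,
    value  Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)    -- one partition per month
ORDER BY (device, ts);

-- Remove all of January 2024 in one operation.
ALTER TABLE metrics DROP PARTITION 202401;
```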
Granules and the Sparse Primary Index
Within each part, data is divided into granules — groups of rows (default 8,192 rows). Granules are the smallest unit that ClickHouse reads during a query.
The primary index in ClickHouse is a sparse index — it stores one entry per granule (the first row's key values), not one entry per row. This means:
- The index is tiny and fits entirely in memory, even for tables with billions of rows
- Index lookups identify which granules to read, not individual rows
- The primary key determines physical sort order on disk
This is fundamentally different from B-tree indexes in traditional databases. The primary key in ClickHouse doesn't enforce uniqueness — it defines the data's sort order to enable efficient range scans.
Choosing the right primary key is the single most impactful performance decision in ClickHouse. The key should match your most common query filter patterns, and its columns should generally be ordered from lowest to highest cardinality so that filters on the leading columns prune as many granules as possible.
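A minimal sketch of this principle, using a hypothetical logs table: the key leads with the low-cardinality columns that most queries filter on, then narrows down to the timestamp:

```sql
CREATE TABLE logs
(
    ts      DateTime,
    service LowCardinality(String),
    level   LowCardinality(String),
    message String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (service, level, ts);   -- primary key = sort order

-- Filters on a prefix of the key let the sparse index skip most granules.
SELECT count()
FROM logs
WHERE service = 'checkout' AND ts >= now() - INTERVAL 1 HOUR;
```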
Data Skipping Indexes
Beyond the primary index, ClickHouse supports data skipping indexes (also called secondary indexes) that store lightweight summaries per granule:
- minmax: Stores min/max values — useful for range filters
- set: Stores unique values up to a limit — useful for equality filters on low-cardinality columns
- bloom_filter: Probabilistic membership test — useful for equality and IN filters on high-cardinality columns
- tokenbf_v1 / ngrambf_v1: Token and n-gram bloom filters for text search
These indexes don't point to specific rows. They tell ClickHouse which granules it can skip — hence the name.
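As a sketch, here is how a token bloom filter index could be added to the hypothetical logs table from the previous example (the index parameters are typical values, not tuned recommendations):

```sql
-- Add the index, then build it for parts written before the index existed.
ALTER TABLE logs
    ADD INDEX idx_message_tokens message TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4;
ALTER TABLE logs MATERIALIZE INDEX idx_message_tokens;

-- Granules whose bloom filter cannot contain the token are skipped entirely.
SELECT count()
FROM logs
WHERE hasToken(message, 'timeout');
```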
Query Execution
Vectorized Processing
ClickHouse processes data in blocks (batches of rows, typically tens of thousands at a time) rather than row by row. This vectorized execution:
- Amortizes function call overhead across thousands of rows
- Enables SIMD (Single Instruction, Multiple Data) CPU instructions
- Maximizes CPU cache utilization
The result is that ClickHouse can process billions of rows per second on a single node for simple aggregations.
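If you want to see this on your own hardware, a rough (and unscientific) way is to aggregate over generated data with the built-in numbers() table function and check the rows-per-second figure the client reports:

```sql
-- Scans one billion generated rows; the client prints elapsed time
-- and throughput, which reflects the vectorized execution path.
SELECT count(), sum(number)
FROM numbers(1000000000)
WHERE number % 7 = 0;
```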
Query Pipeline
A typical query flows through these stages:
- Parsing and analysis: SQL is parsed and the query optimizer selects the best execution plan
- Part and granule pruning: The primary index and partition key eliminate irrelevant data
- Data reading: Only required columns are read from the surviving granules
- Processing: Filters, aggregations, and transformations are applied in a pipelined, parallel fashion
- Merging: Partial results from parallel processing streams are merged into the final result
ClickHouse automatically parallelizes across CPU cores. The max_threads setting controls this (defaults to the number of cores).
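You can inspect these stages yourself; the sketch below assumes the hypothetical logs table from earlier examples:

```sql
-- Shows how many partitions and granules survive pruning.
EXPLAIN indexes = 1
SELECT count()
FROM logs
WHERE service = 'checkout' AND ts >= now() - INTERVAL 1 DAY;

-- Shows the parallel processing pipeline.
EXPLAIN PIPELINE
SELECT service, count()
FROM logs
GROUP BY service;

-- Limit or raise parallelism for the current session.
SET max_threads = 8;
```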
Compression
ClickHouse compresses data at the column level using configurable codecs:
- LZ4 (default): Fast compression and decompression, moderate ratio
- ZSTD: Higher compression ratio, slower decompression — good for cold/archival data
- Delta / DoubleDelta: Encode differences between sequential values — excellent for timestamps and monotonically increasing sequences
- Gorilla: Float-specific encoding from Facebook's time-series compression paper
- T64: Strips unused bits from integer types
You can stack codecs: CODEC(Delta, ZSTD) applies delta encoding first, then ZSTD compression on the deltas. This regularly achieves 20–50x compression on time-series data.
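A sketch of per-column codecs on a hypothetical time-series table, plus a query against the built-in system.columns table to check the resulting compression ratios:

```sql
CREATE TABLE sensor_readings
(
    ts     DateTime CODEC(DoubleDelta, ZSTD),  -- timestamps compress extremely well
    sensor UInt32   CODEC(T64, LZ4),
    value  Float64  CODEC(Gorilla, ZSTD)
)
ENGINE = MergeTree
ORDER BY (sensor, ts);

-- Compression ratio per column (meaningful once data has been inserted).
SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE table = 'sensor_readings';
```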
Distributed Architecture
Replication
ClickHouse uses ClickHouse Keeper (or ZooKeeper) to coordinate replication. ReplicatedMergeTree tables maintain identical copies across multiple nodes:
- Inserts are written to one replica and asynchronously replicated to others
- Keeper tracks the replication log — which parts each replica has and needs
- Replicas independently fetch missing parts, making the system self-healing
Keeper is the most operationally complex component in a ClickHouse deployment. It's a consensus-based coordination service that must maintain quorum to function.
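A minimal sketch of a replicated table; it assumes the {shard} and {replica} macros are defined in each server's configuration, which is the usual setup:

```sql
CREATE TABLE events_replicated
(
    event_time DateTime,
    user_id    UInt64,
    revenue    Float64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_replicated', '{replica}')
ORDER BY (user_id, event_time);

-- Replication health: how far behind each replica is and how much work is queued.
SELECT table, is_leader, absolute_delay, queue_size
FROM system.replicas;
```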
Sharding
For datasets that exceed a single node's capacity, ClickHouse supports sharding — splitting data across multiple nodes. The Distributed table engine routes queries and inserts to the appropriate shards.
Sharding strategies:
- Round-robin: Even distribution, but cross-shard queries require data shuffling
- Hash-based: Consistent placement by a key, enabling local JOINs when both tables share the same sharding key
- Manual: Explicit shard assignment for maximum control
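A sketch of the usual two-table setup: a local table on every shard plus a Distributed table in front of it. The cluster name my_cluster is a placeholder for whatever is defined in your server configuration:

```sql
-- Local storage table, created on every node of the cluster.
CREATE TABLE events_local ON CLUSTER my_cluster
(
    event_time DateTime,
    user_id    UInt64,
    revenue    Float64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}')
ORDER BY (user_id, event_time);

-- Routing table: hash-based sharding on user_id keeps each user on one shard.
CREATE TABLE events_dist ON CLUSTER my_cluster AS events_local
ENGINE = Distributed(my_cluster, currentDatabase(), events_local, cityHash64(user_id));
```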
Distributed Queries
When you query a Distributed table:
- The initiator node sends the query to all relevant shards
- Each shard processes its local data and returns partial results
- The initiator merges the partial results into the final answer
For aggregations, ClickHouse pushes GROUP BY processing to the shards, transferring only aggregated results — not raw data. This is why ClickHouse scales well for analytical queries even across many shards.
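For example, with the hypothetical events_dist table from the previous sketch, the GROUP BY below runs on every shard over its local data, and only the per-shard aggregate results travel back to the initiator for the final merge:

```sql
SELECT user_id, sum(revenue) AS total
FROM events_dist
GROUP BY user_id
ORDER BY total DESC
LIMIT 10;
```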
ClickHouse Cloud vs. Self-Managed Architecture
ClickHouse Cloud introduces SharedMergeTree, a cloud-native engine that separates storage from compute:
- Data lives in object storage (S3, GCS, Azure Blob)
- Compute nodes are stateless and can scale independently
- Multiple compute groups can read and write to the same data (compute-compute separation)
This architecture isn't available in open-source ClickHouse, where each node owns its data on local disk. SharedMergeTree is the primary technical differentiator of ClickHouse Cloud.
Key Architecture Decisions for Your Deployment
| Decision | Guidance |
|---|---|
| Primary key | Match your most common WHERE clause filters. Order key columns from lowest to highest cardinality. |
| Partition key | Use time-based partitioning (monthly is a good default). Avoid more than a few hundred partitions. |
| Batch inserts | Insert in batches of 10,000+ rows. Avoid single-row inserts — each creates a new part. |
| Compression | Use LZ4 for hot data, ZSTD for cold. Stack Delta/DoubleDelta before compression for time-series. |
| Replication | Use ReplicatedMergeTree with 2–3 replicas for production. Run Keeper on dedicated nodes. |
| Sharding | Start with a single shard and replicate. Only shard when data exceeds single-node capacity. |
Frequently Asked Questions
Q: Why is ClickHouse so much faster than PostgreSQL or MySQL for analytics?
The entire architecture is optimized for scan-heavy queries: columnar storage reads only needed columns, compression reduces I/O by 10–20x, vectorized execution processes data in CPU-cache-friendly batches, and the sparse index enables skipping irrelevant data without maintaining expensive B-tree structures.
Q: What's the difference between the primary key and a traditional database index?
ClickHouse's primary key defines the physical sort order of data and creates a sparse index with one entry per granule (8,192 rows by default). It doesn't enforce uniqueness and can't be used for point lookups the way a B-tree index can. Think of it as an efficient range-scan accelerator, not a row-level locator.
Q: How many parts is too many?
ClickHouse starts delaying inserts and eventually rejects them when a partition accumulates too many active parts; the exact thresholds are configurable through the parts_to_delay_insert and parts_to_throw_insert settings. In practice, keep part counts below 100–200 per partition by batching inserts and ensuring background merges keep up with ingestion.
Q: Can ClickHouse handle real-time data?
Yes, but with caveats. ClickHouse can ingest millions of rows per second and query them immediately. However, single-row inserts are inefficient — batch data in blocks of at least several thousand rows. For true real-time streaming, use Kafka or ClickPipes to buffer and batch automatically.
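One common self-managed pattern is a Kafka engine table plus a materialized view that batches rows into a MergeTree table; the sketch below uses placeholder broker, topic, and table names:

```sql
-- Consumes the topic; rows only pass through, they are not stored here.
CREATE TABLE events_queue
(
    event_time DateTime,
    user_id    UInt64,
    revenue    Float64
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',        -- placeholder broker
         kafka_topic_list  = 'events',            -- placeholder topic
         kafka_group_name  = 'clickhouse_events',
         kafka_format      = 'JSONEachRow';

-- Durable storage.
CREATE TABLE events_store
(
    event_time DateTime,
    user_id    UInt64,
    revenue    Float64
)
ENGINE = MergeTree
ORDER BY (user_id, event_time);

-- Moves consumed rows into storage in background batches.
CREATE MATERIALIZED VIEW events_mv TO events_store AS
SELECT event_time, user_id, revenue
FROM events_queue;
```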
Q: When should I use ReplicatedMergeTree vs. plain MergeTree?
Use ReplicatedMergeTree for any production deployment. Plain MergeTree has no data redundancy — if the node fails, you lose data. Replication requires ClickHouse Keeper or ZooKeeper but provides fault tolerance and read scaling.
Q: How does ClickHouse handle UPDATEs and DELETEs?
ClickHouse supports mutations (ALTER TABLE UPDATE/DELETE), but they're heavy operations that rewrite parts. They're not designed for OLTP-style row-level updates. For use cases requiring frequent updates, consider ReplacingMergeTree (deduplicates on merge) or CollapsingMergeTree (tracks row state changes). For deletes, lightweight deletes (DELETE FROM) mark rows as deleted without immediate rewrite, but with some query overhead until the next merge.
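A short sketch of the options above, using hypothetical table names:

```sql
-- Mutations: rewrite the affected parts in the background (heavy).
ALTER TABLE events_store UPDATE revenue = 0 WHERE user_id = 42;
ALTER TABLE events_store DELETE WHERE event_time < '2023-01-01';

-- Lightweight delete: rows are masked immediately and physically
-- removed by later merges.
DELETE FROM events_store WHERE user_id = 42;

-- ReplacingMergeTree keeps the row with the highest updated_at per sort key
-- after merges; FINAL forces deduplication at query time if needed.
CREATE TABLE user_profiles
(
    user_id    UInt64,
    updated_at DateTime,
    email      String
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

SELECT * FROM user_profiles FINAL WHERE user_id = 42;
```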