INSERT INTO is the way data gets into ClickHouse. The statement is flexible: you can supply rows via VALUES, stream them in any of dozens of formats, copy from a SELECT on another table, or pull directly from a table function (S3, URL, MySQL, Kafka, and others). The single most important rule is to batch your inserts: ClickHouse is designed for large bulk inserts, and thousands of single-row inserts per second will degrade the cluster regardless of hardware. Asynchronous inserts are the supported answer when batching is genuinely impractical on the client side. This guide covers every variant, the deduplication features, and the patterns that hold up in production.
The Quick Answers
-- VALUES, for a handful of rows
INSERT INTO analytics.events VALUES
('2026-05-13 10:00:00', 42, 'page_view'),
('2026-05-13 10:00:01', 43, 'click');
-- From another table or query
INSERT INTO analytics.events_summary
SELECT event_date, count() FROM analytics.events GROUP BY event_date;
-- From a file in any supported format
INSERT INTO analytics.events FROM INFILE 'events.jsonl' FORMAT JSONEachRow;
-- From a table function (S3, URL, etc.)
INSERT INTO analytics.events
SELECT * FROM s3('https://bucket.s3.amazonaws.com/2026-05-13/*.parquet', 'Parquet');
Full INSERT Syntax
-- VALUES form
INSERT INTO [db.]table [(col1, col2, ...)]
[SETTINGS ...]
VALUES (v1, v2, ...) [, ...];
-- FORMAT form (typical for clients streaming data)
INSERT INTO [db.]table [(col1, col2, ...)]
[SETTINGS ...]
FORMAT format_name
<data in that format>;
-- SELECT form
INSERT INTO [db.]table [(col1, col2, ...)]
[SETTINGS ...]
SELECT ...;
-- FROM INFILE (clickhouse-client side)
INSERT INTO [db.]table [(col1, ...)] FROM INFILE 'path' [COMPRESSION 'method'] FORMAT format;
-- Into a table function (write to external storage)
INSERT INTO TABLE FUNCTION s3('...', 'JSONEachRow') VALUES (...);
SETTINGS must appear before FORMAT. The docs are explicit: "Settings must appear before the FORMAT clause". This is easy to get wrong; fortunately the error message is helpful.
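For example, a per-query setting on an inline-data insert sits between the table name and the FORMAT clause (the setting here is just illustrative):
INSERT INTO analytics.events
SETTINGS input_format_skip_unknown_fields = 1
FORMAT JSONEachRow
{"event_time": "2026-05-13 10:00:00", "user_id": 42, "event_type": "page_view"}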
Choosing Columns
-- Explicit column list (recommended for forward compatibility)
INSERT INTO analytics.events (event_time, user_id, event_type) VALUES (...);
-- Implicit, depends on column order
INSERT INTO analytics.events VALUES (...);
-- Column matchers for INSERT SELECT
INSERT INTO analytics.events_v2 SELECT * EXCEPT(deprecated_col) FROM analytics.events;
INSERT INTO analytics.events_v2 SELECT * REPLACE (toString(user_id) AS user_id) FROM analytics.events;
Always supply the column list for application code. Adding a new column to a table in the future should not require redeploying every producer.
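A quick illustration of why that matters, using a hypothetical country column: once producers name their columns, the table can grow without breaking them.
-- New column with a default; existing producers are unaffected
ALTER TABLE analytics.events ADD COLUMN country String DEFAULT '';
-- This INSERT still works unchanged
INSERT INTO analytics.events (event_time, user_id, event_type) VALUES
('2026-05-13 10:00:02', 44, 'page_view');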
Formats
ClickHouse supports dozens of input formats. The most useful in practice:
| Format | When to use |
|---|---|
| VALUES | Hand-written or generated SQL. |
| JSONEachRow | One JSON object per line; the default for many ingestion tools. |
| CSV, CSVWithNames, CSVWithNamesAndTypes | Spreadsheet exports, generic flat files. |
| TabSeparated, TSV | Unix-friendly plain-text streams. |
| Native | Most compact, fastest; used by ClickHouse-aware clients. |
| RowBinary | Compact binary, used by some drivers. |
| Parquet, ORC | Columnar files from data lakes. |
| Avro, Arrow | Cross-language interop. |
| Protobuf | Schema-driven binary; requires a schema file. |
For data already in a columnar format (Parquet, Arrow, ORC), prefer those formats over JSONEachRow; the schema is explicit and ClickHouse can ingest them without per-row parsing.
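For example, a local Parquet export loads directly through clickhouse-client (the file name is illustrative):
INSERT INTO analytics.events FROM INFILE 'events-2026-05-13.parquet' FORMAT Parquet;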
INSERT SELECT From Another Table
INSERT INTO analytics.events_v2
SELECT * FROM analytics.events WHERE event_time >= '2026-01-01';
A common migration pattern. Two operational notes:
- The SELECT runs as a single statement on the server, but the insert into the destination still happens in blocks. The intermediate buffer is sized by max_insert_block_size.
- If both tables are large, consider running the operation per-partition to keep memory bounded and to make resumption easy (see the sketch after this list).
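A per-partition sketch, assuming the table is partitioned by month of event_time:
-- One statement per month; rerun only the partitions that failed
INSERT INTO analytics.events_v2
SELECT * FROM analytics.events WHERE toYYYYMM(event_time) = 202601;
INSERT INTO analytics.events_v2
SELECT * FROM analytics.events WHERE toYYYYMM(event_time) = 202602;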
For cross-database or cross-cluster moves, the same pattern works with table functions:
INSERT INTO local.events
SELECT * FROM remote('clickhouse-source:9000', 'analytics.events', 'reader', 'secret');
INSERT FROM INFILE
INSERT INTO analytics.events
FROM INFILE 'data/events-2026-05-13.jsonl.gz'
COMPRESSION 'gzip'
FORMAT JSONEachRow;
-- Glob patterns are supported
INSERT INTO analytics.events
FROM INFILE 'data/events-2026-05-*.jsonl'
FORMAT JSONEachRow;
The FROM INFILE clause is evaluated by clickhouse-client on the client side, so the file must be accessible there. For server-side reads, use the file() table function or one of the cloud-storage table functions (s3(), gcs(), azureBlobStorage()).
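The server-side equivalent with file(), assuming the file has been placed under the server's user_files_path:
INSERT INTO analytics.events
SELECT * FROM file('events-2026-05-13.jsonl', 'JSONEachRow');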
INSERT From a Table Function
-- Read from S3
INSERT INTO analytics.events
SELECT * FROM s3(
'https://bucket.s3.amazonaws.com/2026-05-13/*.parquet',
'AWS_ACCESS_KEY', 'AWS_SECRET_KEY',
'Parquet'
);
-- Write to S3
INSERT INTO TABLE FUNCTION s3(
'https://bucket.s3.amazonaws.com/exports/events.parquet',
'AWS_ACCESS_KEY', 'AWS_SECRET_KEY',
'Parquet'
)
SELECT * FROM analytics.events WHERE event_date = today();
This is how you do bulk ingest and export without leaving SQL. The same shape works for gcs(), azureBlobStorage(), url(), mysql(), postgresql(), mongodb(), and others.
Asynchronous Inserts
If your producers cannot batch (many small writers, one row at a time), enable asynchronous inserts:
INSERT INTO analytics.events SETTINGS async_insert = 1, wait_for_async_insert = 1
FORMAT JSONEachRow
{"event_time": "2026-05-13 10:00:00", "user_id": 42, "event_type": "click"}
The server collects incoming inserts into an internal buffer and flushes them together. Key settings:
- async_insert = 1 enables the feature for the query.
- wait_for_async_insert = 1 makes the INSERT wait until its data has been flushed (default; safer).
- async_insert_max_data_size controls the flush threshold by bytes.
- async_insert_busy_timeout_ms controls the flush threshold by time.
Enable it server-side or per user (via SETTINGS on the user/role) when you have unavoidable small-insert traffic. It is dramatically better than direct small inserts, but still slower than client-side batching when batching is an option.
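One way to scope it, sketched for a hypothetical ingest_writer user (requires SQL-driven access management):
-- Applies to every INSERT this user runs, with no application changes
ALTER USER ingest_writer SETTINGS async_insert = 1, wait_for_async_insert = 1;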
Deduplication
ReplicatedMergeTree automatically deduplicates inserted blocks based on a hash of the data, within a configurable window (replicated_deduplication_window). Resending the same INSERT after a network blip does not duplicate rows.
For finer control, use an explicit deduplication token:
INSERT INTO analytics.events SETTINGS insert_deduplication_token = 'batch-2026-05-13-001'
FORMAT JSONEachRow
...
Two inserts with the same token are treated as the same block, regardless of content. Useful for at-least-once producers (Kafka consumers, retry-on-failure pipelines) that want to write idempotently without dealing with content hashing.
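A sketch of the pattern for a Kafka-style consumer, where the token encodes topic, partition, and offset range (the token format is up to you):
-- Retrying this exact batch with the same token is a no-op on the server
INSERT INTO analytics.events
SETTINGS insert_deduplication_token = 'events-topic-p3-offsets-1000-1999'
FORMAT JSONEachRow
{"event_time": "2026-05-13 10:00:00", "user_id": 42, "event_type": "click"}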
Batch Size Guidance
The general rule for high-throughput pipelines:
- Aim for inserts of 10,000 to 100,000 rows per call.
- Group rows by partition key. Each insert that spans many partitions creates many small parts, which the merge background has to clean up later.
- For very high write rates, increase max_insert_block_size (default 1,048,576) so the server batches larger blocks before flushing (see the sketch below).
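A minimal sketch of that last point, raising the block size for a heavy INSERT SELECT:
INSERT INTO analytics.events_v2
SETTINGS max_insert_block_size = 4194304
SELECT * FROM analytics.events;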
The docs note that ClickHouse implements automatic "squashing" of small blocks into larger ones, but that only helps so much. Client-side batching is still the right primary strategy.
Common Errors
- Too many parts in active state: producers are inserting too often, faster than merges can keep up. Batch larger or enable async_insert (see the diagnostic query below).
- Timeout exceeded while writing to socket: a long-running INSERT SELECT; raise send_timeout / receive_timeout or split the work.
- Cannot parse input: the data does not match the declared format. Set input_format_allow_errors_num = N for a quick survival mode, then fix the upstream.
- Memory limit exceeded: usually an INSERT SELECT pulling too much into a buffer. Either split by partition or raise max_memory_usage for the session.
- READONLY: the table is on a read-only replica. See the readonly error guide.
- Distributed timeout while waiting for inserts: a Distributed engine table is not flushing writes to shards. Check system.distribution_queue.
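When chasing the too-many-parts error, the active part count per partition usually tells the story:
SELECT database, table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY active_parts DESC
LIMIT 10;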
How Pulse Helps With Ingest Health
Insert problems rarely fail loudly. They fail as creeping merge backlog, growing part counts, climbing replication lag, and slowly-rising query latency. By the time someone notices, the cluster has dug a hole that takes hours to climb out of. Pulse continuously tracks insert patterns on ClickHouse clusters and surfaces tables with too-small inserts (the canonical "too many parts" precursor), producers writing into too many partitions per insert, missing deduplication tokens on retry-prone pipelines, merge backlog by table, and async_insert buffer utilization. Connect your ClickHouse cluster to Pulse and catch ingest problems before they become incidents.
Frequently Asked Questions
Q: Why are my single-row inserts slow?
ClickHouse creates a new data part per insert. Many small parts mean many small files, lots of merging, and metadata overhead. The fix is to batch: send 10,000 to 100,000 rows per insert, or enable async_insert so the server batches for you.
Q: How do I make my INSERTs idempotent?
For ReplicatedMergeTree, identical inserts (same data, same block size) are deduplicated within replicated_deduplication_window (default 100). For explicit control, pass insert_deduplication_token and your client decides the idempotency key.
Q: Can I INSERT into multiple tables at once?
Not in one statement. The canonical pattern is to INSERT into a table that has a materialized view attached, so the view fans the data out. For independent inserts, use a transaction-less pattern with two separate INSERTs and accept that they are not atomic together.
Q: What's the difference between INSERT FROM INFILE and INSERT FROM file()?
FROM INFILE reads on the client side (the file lives on the machine running clickhouse-client). The file() table function reads on the server side and is restricted to a configured user_files_path. Use the table function for server-side batch jobs and FROM INFILE for ad-hoc client uploads.
Q: Should I enable async_insert globally?
Enable it for the users or roles that need it, not globally. Async insert has a small latency cost (server-side buffering) and changes failure semantics; you probably do not want it for batch ETL jobs that already write in large chunks.
Q: How does INSERT handle backpressure?
ClickHouse will reject INSERTs that would push a table past its part limit (parts_to_throw_insert, default 300 parts per partition). The error message is the canonical "too many parts" sign that you need to batch more or merge faster.
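To check the threshold your server is actually using (per-table overrides, if any, show up in the table's CREATE statement):
SELECT name, value FROM system.merge_tree_settings WHERE name = 'parts_to_throw_insert';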
Q: Can I INSERT into a Distributed table?
Yes. Inserts go to the local distribution queue and are forwarded to shards in the background. Use insert_distributed_sync = 1 to make the INSERT block until shards acknowledge, at the cost of latency. For new pipelines, prefer inserting directly into the per-shard ReplicatedMergeTree tables when possible.
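A sketch of the synchronous variant, assuming a Distributed table named analytics.events_dist:
INSERT INTO analytics.events_dist
SETTINGS insert_distributed_sync = 1
VALUES ('2026-05-13 10:00:00', 42, 'page_view');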