ClickHouse INSERT INTO: Formats, Async Inserts, and Best Practices

INSERT INTO is the way data gets into ClickHouse. The statement is flexible: you can supply rows via VALUES, stream them in any of dozens of formats, copy from a SELECT on another table, or pull directly from a table function (S3, URL, MySQL, Kafka, and others). The single most important rule is batch your inserts. ClickHouse is designed for large bulk inserts; thousands of single-row inserts per second will degrade the cluster regardless of hardware. Asynchronous inserts are the supported answer when batching is genuinely impractical on the client side. This guide covers every variant, the deduplication features, and the patterns that hold up in production.

The Quick Answers

-- VALUES, for a handful of rows
INSERT INTO analytics.events VALUES
  ('2026-05-13 10:00:00', 42, 'page_view'),
  ('2026-05-13 10:00:01', 43, 'click');

-- From another table or query
INSERT INTO analytics.events_summary
SELECT event_date, count() FROM analytics.events GROUP BY event_date;

-- From a file in any supported format
INSERT INTO analytics.events FROM INFILE 'events.jsonl' FORMAT JSONEachRow;

-- From a table function (S3, URL, etc.)
INSERT INTO analytics.events
SELECT * FROM s3('https://bucket.s3.amazonaws.com/2026-05-13/*.parquet', 'Parquet');

Full INSERT Syntax

-- VALUES form
INSERT INTO [db.]table [(col1, col2, ...)]
    [SETTINGS ...]
    VALUES (v1, v2, ...) [, ...];

-- FORMAT form (typical for clients streaming data)
INSERT INTO [db.]table [(col1, col2, ...)]
    [SETTINGS ...]
    FORMAT format_name
    <data in that format>;

-- SELECT form
INSERT INTO [db.]table [(col1, col2, ...)]
    [SETTINGS ...]
    SELECT ...;

-- FROM INFILE (clickhouse-client side)
INSERT INTO [db.]table [(col1, ...)] FROM INFILE 'path' [COMPRESSION 'method'] FORMAT format;

-- Into a table function (write to external storage)
INSERT INTO TABLE FUNCTION s3('...', 'JSONEachRow') VALUES (...);

SETTINGS must appear before FORMAT; the docs are explicit: "Settings must appear before the FORMAT clause". This is easy to get wrong, though the error message is helpful.

Choosing Columns

-- Explicit column list (recommended for forward compatibility)
INSERT INTO analytics.events (event_time, user_id, event_type) VALUES (...);

-- Implicit, depends on column order
INSERT INTO analytics.events VALUES (...);

-- Column matchers for INSERT SELECT
INSERT INTO analytics.events_v2 SELECT * EXCEPT(deprecated_col) FROM analytics.events;
INSERT INTO analytics.events_v2 SELECT * REPLACE (toString(user_id) AS user_id) FROM analytics.events;

Always supply the column list for application code. Adding a new column to a table in the future should not require redeploying every producer.
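To see why this matters, here is a hedged sketch of schema evolution with explicit column lists (the country column and its DEFAULT are illustrative):

```sql
-- Adding a column later is safe when producers name their columns
ALTER TABLE analytics.events ADD COLUMN country String DEFAULT '';

-- Old producers keep working unchanged; the new column takes its DEFAULT
INSERT INTO analytics.events (event_time, user_id, event_type)
VALUES ('2026-05-13 10:00:00', 42, 'page_view');
```

Had the producer used the implicit form, the same INSERT would fail (or silently misalign values) as soon as the column count changed.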

Formats

ClickHouse supports dozens of input formats. The most useful in practice:

  • VALUES: hand-written or generated SQL.
  • JSONEachRow: one JSON object per line; the default for many ingestion tools.
  • CSV, CSVWithNames, CSVWithNamesAndTypes: spreadsheet exports, generic flat files.
  • TabSeparated (TSV): Unix-friendly plain-text streams.
  • Native: most compact, fastest; used by ClickHouse-aware clients.
  • RowBinary: compact binary, used by some drivers.
  • Parquet, ORC: columnar files from data lakes.
  • Avro, Arrow: cross-language interop.
  • Protobuf: schema-driven binary; requires a schema file.

For data already in a columnar format (Parquet, Arrow, ORC), prefer those formats over JSONEachRow; the schema is explicit and ClickHouse can ingest them without per-row parsing.

INSERT SELECT From Another Table

INSERT INTO analytics.events_v2
SELECT * FROM analytics.events WHERE event_time >= '2026-01-01';

A common migration pattern. Two operational notes:

  1. The SELECT runs as a single statement on the server, but the insert into the destination still happens in blocks. The intermediate buffer is sized by max_insert_block_size.
  2. If both tables are large, consider running the operation per-partition to keep memory bounded and to make resumption easy.
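A sketch of the per-partition approach, assuming (for illustration) that the table is partitioned by month on event_time:

```sql
-- Copy one partition at a time; repeat per month, resume from the last
-- completed partition if the job dies
INSERT INTO analytics.events_v2
SELECT * FROM analytics.events WHERE toYYYYMM(event_time) = 202601;

-- List the partitions to iterate over
SELECT DISTINCT partition
FROM system.parts
WHERE database = 'analytics' AND table = 'events' AND active;
```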

For cross-database or cross-cluster moves, the same pattern works with table functions:

INSERT INTO local.events
SELECT * FROM remote('clickhouse-source:9000', 'analytics.events', 'reader', 'secret');

INSERT FROM INFILE

INSERT INTO analytics.events
FROM INFILE 'data/events-2026-05-13.jsonl.gz'
COMPRESSION 'gzip'
FORMAT JSONEachRow;

-- Glob patterns are supported
INSERT INTO analytics.events
FROM INFILE 'data/events-2026-05-*.jsonl'
FORMAT JSONEachRow;

The FROM INFILE clause is evaluated by clickhouse-client on the client side, so the file must be accessible there. For server-side reads, use the file() table function or one of the cloud-storage table functions (s3(), gcs(), azureBlobStorage()).
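A sketch of the server-side equivalent using the file() table function (the path is resolved relative to the server's configured user_files_path):

```sql
-- Server-side read: the file must live under user_files_path on the server
INSERT INTO analytics.events
SELECT * FROM file('events-2026-05-13.jsonl', 'JSONEachRow');
```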

INSERT From a Table Function

-- Read from S3
INSERT INTO analytics.events
SELECT * FROM s3(
    'https://bucket.s3.amazonaws.com/2026-05-13/*.parquet',
    'AWS_ACCESS_KEY', 'AWS_SECRET_KEY',
    'Parquet'
);

-- Write to S3
INSERT INTO TABLE FUNCTION s3(
    'https://bucket.s3.amazonaws.com/exports/events.parquet',
    'AWS_ACCESS_KEY', 'AWS_SECRET_KEY',
    'Parquet'
)
SELECT * FROM analytics.events WHERE event_date = today();

This is how you do bulk ingest and export without leaving SQL. The same shape works for gcs(), azureBlobStorage(), url(), mysql(), postgresql(), mongodb(), and others.

Asynchronous Inserts

If your producers cannot batch (many small writers, one row at a time), enable asynchronous inserts:

INSERT INTO analytics.events SETTINGS async_insert = 1, wait_for_async_insert = 1
FORMAT JSONEachRow
{"event_time": "2026-05-13 10:00:00", "user_id": 42, "event_type": "click"}

The server collects incoming inserts into an internal buffer and flushes them together. Key settings:

  • async_insert = 1 enables the feature for the query.
  • wait_for_async_insert = 1 makes the INSERT wait until its data has been flushed (default; safer).
  • async_insert_max_data_size controls the flush threshold by bytes.
  • async_insert_busy_timeout_ms controls the flush threshold by time.

Enable it server-side or per user (via SETTINGS on the user/role) when you have unavoidable small-insert traffic. It is dramatically better than direct small inserts, but still slower than client-side batching when batching is an option.
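One way to scope it per user is through user-level settings; a sketch, assuming an ingestion user named ingest_writer:

```sql
-- Hypothetical: enable async inserts only for the small-insert producer
ALTER USER ingest_writer SETTINGS async_insert = 1, wait_for_async_insert = 1;
```

Batch ETL users keep the default behavior and continue writing large blocks directly.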

Deduplication

ReplicatedMergeTree automatically deduplicates inserted blocks based on a hash of the data, within a configurable window (replicated_deduplication_window). Resending the same INSERT after a network blip does not duplicate rows.

For finer control, use an explicit deduplication token:

INSERT INTO analytics.events SETTINGS insert_deduplication_token = 'batch-2026-05-13-001'
FORMAT JSONEachRow
...

Two inserts with the same token are treated as the same block, regardless of content. Useful for at-least-once producers (Kafka consumers, retry-on-failure pipelines) that want to write idempotently without dealing with content hashing.
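A sketch of the retry pattern (the token value is illustrative; in practice it would be derived from a batch ID or Kafka offset range):

```sql
-- First attempt (times out; the client cannot tell whether it landed)
INSERT INTO analytics.events
SETTINGS insert_deduplication_token = 'batch-2026-05-13-001'
VALUES ('2026-05-13 10:00:00', 42, 'click');

-- Retry with the same token: the server treats it as the same block,
-- so the rows are inserted at most once
INSERT INTO analytics.events
SETTINGS insert_deduplication_token = 'batch-2026-05-13-001'
VALUES ('2026-05-13 10:00:00', 42, 'click');
```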

Batch Size Guidance

The general rule for high-throughput pipelines:

  • Aim for inserts of 10,000 to 100,000 rows per call.
  • Group rows by partition key. An insert that spans many partitions creates many small parts, which the background merge process has to clean up later.
  • For very high write rates, increase max_insert_block_size (default 1,048,576) so the server batches larger blocks before flushing.

The docs note that ClickHouse implements automatic "squashing" of small blocks into larger ones, but that only helps so much. Client-side batching is still the right primary strategy.
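To check whether small inserts are already hurting, part counts can be inspected directly in system.parts; a sketch:

```sql
-- Active part count per partition; values creeping toward
-- parts_to_throw_insert signal that inserts are too small or too frequent
SELECT database, table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY active_parts DESC
LIMIT 10;
```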

Common Errors

  • Too many parts in active state: producers are inserting too often, faster than merges can keep up. Batch larger or enable async_insert.
  • Timeout exceeded while writing to socket: long-running INSERT SELECT; raise send_timeout / receive_timeout or split the work.
  • Cannot parse input: the data does not match the declared format. Set input_format_allow_errors_num=N (e.g. --input_format_allow_errors_num in clickhouse-client) for a quick survival mode, then fix the upstream.
  • Memory limit exceeded: usually an INSERT SELECT pulling too much into a buffer. Either split by partition or raise max_memory_usage for the session.
  • READONLY: the table is on a read-only replica. See the readonly error guide.
  • Distributed timeout while waiting for inserts: a Distributed engine table is not flushing writes to shards. Check system.distribution_queue.

How Pulse Helps With Ingest Health

Insert problems rarely fail loudly. They fail as creeping merge backlog, growing part counts, climbing replication lag, and slowly-rising query latency. By the time someone notices, the cluster has dug a hole that takes hours to climb out of. Pulse continuously tracks insert patterns on ClickHouse clusters and surfaces tables with too-small inserts (the canonical "too many parts" precursor), producers writing into too many partitions per insert, missing deduplication tokens on retry-prone pipelines, merge backlog by table, and async_insert buffer utilization. Connect your ClickHouse cluster to Pulse and catch ingest problems before they become incidents.

Frequently Asked Questions

Q: Why are my single-row inserts slow?

ClickHouse creates a new data part per insert. Many small parts mean many small files, lots of merging, and metadata overhead. The fix is to batch: send 10,000 to 100,000 rows per insert, or enable async_insert so the server batches for you.

Q: How do I make my INSERTs idempotent?

For ReplicatedMergeTree, inserts of byte-identical blocks are deduplicated within replicated_deduplication_window (default 100 blocks). For explicit control, pass insert_deduplication_token so your client decides the idempotency key.

Q: Can I INSERT into multiple tables at once?

Not in one statement. The canonical pattern is to INSERT into a table that has a materialized view attached, so the view fans the data out. For independent inserts, use a transaction-less pattern with two separate INSERTs and accept that they are not atomic together.
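A sketch of the fan-out pattern; the names and the target table analytics.events_per_day (assumed to exist, e.g. as a SummingMergeTree) are illustrative:

```sql
-- Hypothetical: one INSERT feeds both the base table and a daily rollup
CREATE MATERIALIZED VIEW analytics.events_per_day_mv
TO analytics.events_per_day
AS SELECT toDate(event_time) AS event_date, count() AS events
FROM analytics.events
GROUP BY event_date;

-- Inserting into events now also populates events_per_day via the view
INSERT INTO analytics.events (event_time, user_id, event_type)
VALUES ('2026-05-13 10:00:00', 42, 'page_view');
```

The view runs per inserted block, which is why a summing or aggregating target engine is the usual companion.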

Q: What's the difference between INSERT FROM INFILE and INSERT FROM file()?

FROM INFILE reads on the client side (the file lives on the machine running clickhouse-client). The file() table function reads on the server side and is restricted to a configured user_files_path. Use the table function for server-side batch jobs and FROM INFILE for ad-hoc client uploads.

Q: Should I enable async_insert globally?

Enable it for the users or roles that need it, not globally. Async insert has a small latency cost (server-side buffering) and changes failure semantics; you probably do not want it for batch ETL jobs that already write in large chunks.

Q: How does INSERT handle backpressure?

ClickHouse will reject INSERTs that would push a table past its part limit (parts_to_throw_insert, default 300 parts per partition). The error message is the canonical "too many parts" sign that you need to batch more or merge faster.

Q: Can I INSERT into a Distributed table?

Yes. Inserts go to the local distribution queue and are forwarded to shards in the background. Use insert_distributed_sync = 1 to make the INSERT block until shards acknowledge, at the cost of latency. For new pipelines, prefer inserting directly into the per-shard ReplicatedMergeTree tables when possible.
