Precreate ClickHouse Parts Offline With clickhouse-local

Q: Can I generate parts with skip indexes?

Yes. Define the index in the local CREATE TABLE exactly as on the server. The parts include the index files and the server uses them after attach.

Q: Does this work with MATERIALIZED columns?

Yes, but the materialized expression runs in clickhouse-local at insert time. Make sure any functions referenced are available in the local binary's version.

clickhouse-local runs the full ClickHouse engine in a single process against a directory you control. That makes it useful for more than ad hoc queries: you can use it to build real MergeTree parts on a disk, then move those parts to a production server's detached/ directory and attach them. The expensive work (parsing, sorting, compressing, building indexes) happens off-cluster, the server only takes the final parts.

This pattern is useful for large initial loads, periodic offline batch ingestion, and any scenario where you want to keep the write path off your production nodes.

Why Precreate Parts

A normal INSERT into a MergeTree server does several things at once:

Parses incoming data.
Sorts by the table's ORDER BY.
Writes a part with the appropriate granules, indexes, and codecs.
Triggers background merges.

For multi-terabyte loads, steps 1 through 3 burn CPU and memory on the production cluster while still trying to serve queries. Doing them in clickhouse-local against a scratch disk moves the cost elsewhere. The server's only job becomes "attach this directory as a part."

Walkthrough

Tested on ClickHouse 23.3 and later.

Step 1: Stage some data

cat <<EOF > /tmp/data.csv
1,2020-01-01,"String"
2,2020-02-02,"Another string"
3,2020-03-03,"One more string"
4,2020-01-02,"String for first partition"
EOF

rm -rf /tmp/precreate_parts
mkdir -p /tmp/precreate_parts
cd /tmp/precreate_parts

The directory /tmp/precreate_parts becomes the working storage that clickhouse-local reads and writes.

Step 2: Define the table schema

The schema must match the server's table exactly: same columns, types, sort order, partition expression, settings. Any difference and the resulting parts cannot be attached.

clickhouse-local --path=. --query="CREATE DATABASE local"

clickhouse-local --path=. --query="
CREATE TABLE local.test (
    id UInt64,
    d  Date,
    s  String,
    x  MATERIALIZED sleepEachRow(0.5)
)
Engine=MergeTree
ORDER BY id
PARTITION BY toYYYYMM(d);
"

--path=. tells clickhouse-local to keep data under the current directory. Without it, clickhouse-local uses a temporary directory that disappears on exit.

Step 3: Insert data

Three ways to feed data in.

A. Stream CSV directly into the target table:

cat /tmp/data.csv | clickhouse-local --path=. \
    --query="INSERT INTO local.test FORMAT CSV"

B. Use a File engine over stdin as a source:

clickhouse-local --path=. --query="
CREATE TABLE local.stdin (id UInt64, d Date, s String) Engine=File(CSV, stdin);
"

cat /tmp/data.csv | clickhouse-local --path=. \
    --query="INSERT INTO local.test SELECT * FROM local.stdin;"

C. Bind a File engine to a CSV path:

clickhouse-local --path=. --query="
CREATE TABLE local.data_csv (id UInt64, d Date, s String) Engine=File(CSV, '/tmp/data.csv');
"

clickhouse-local --path=. --query="INSERT INTO local.test SELECT * FROM local.data_csv;"

Option C is the most useful for repeating jobs: a small script can iterate over input files and run the same INSERT.

Step 4: Inspect the parts

clickhouse-local --path=. --query="SELECT _part, * FROM local.test ORDER BY id;"
ls -la data/local/test/

You will see one directory per part, with names like 202001_1_1_0, 202002_2_2_0, 202003_3_3_0. The first segment is the partition ID derived from toYYYYMM(d).

Step 5: Merge into the desired part count

OPTIMIZE TABLE ... FINAL collapses partitions into single parts. For an offline precreate workflow this is usually what you want, so the server gets large, already-merged parts:

clickhouse-local --path=. --query="OPTIMIZE TABLE local.test FINAL;"

Inactive parts (the inputs to the merge) stick around by default. To force their cleanup so the directory only contains attachable parts:

clickhouse-local --path=. --query="
ALTER TABLE local.test MODIFY SETTING
    old_parts_lifetime = 0,
    cleanup_delay_period = 0,
    cleanup_delay_period_random_add = 0;
"

clickhouse-local --path=. --query="
SELECT count() FROM numbers(100) WHERE sleepEachRow(0.1) SETTINGS max_block_size = 1;
"

The second query is a no-op stall to give the background cleanup thread time to run; tune the count for the size of the dataset.

Step 6: Move parts to the server

After cleanup, data/local/test/ contains only the final, merged parts. Copy them into the target server's detached/ directory:

SERVER_TABLE_PATH=/var/lib/clickhouse/data/analytics/events
cp -r data/local/test/* "$SERVER_TABLE_PATH/detached/"
chown -R clickhouse:clickhouse "$SERVER_TABLE_PATH/detached/"

Then attach each part on the server:

ALTER TABLE analytics.events ATTACH PART '202001_1_1_0';
ALTER TABLE analytics.events ATTACH PART '202002_2_2_0';
ALTER TABLE analytics.events ATTACH PART '202003_3_3_0';

Or attach by partition if all parts you copied belong to specific partitions:

ALTER TABLE analytics.events ATTACH PARTITION '202001';

Schema Caveats

The schemas in clickhouse-local and on the server must be byte-identical from a parts perspective:

Same column names, types, and order in the CREATE TABLE.
Same ORDER BY (primary key).
Same PARTITION BY.
Same compression codecs, if any.
Same settings that affect on-disk format (index_granularity, etc.).

A mismatch causes the attach to fail or, worse, silently produces a part the server cannot read for some queries. Generate the local table from the same DDL you use on production.

When This Helps

Initial bulk loads. Pre-build months of historical data offline, then attach in a single low-traffic window.
Periodic batch ingestion. A nightly job builds parts from yesterday's logs in a worker pod, then ATTACHes them.
Cross-region transfers. Build parts close to the data, transfer the part files (fewer, larger, already compressed) instead of streaming raw rows.
Disaster recovery rehearsals. Reconstruct a table from external storage without touching production write paths.

Common Pitfalls

Forgetting --path=.. clickhouse-local uses a temp directory by default and your parts vanish on exit.
Different index_granularity or codecs between local and server. The attach succeeds but reads can produce subtle errors. Match the server's settings exactly.
Missing partition expression. A partition expression mismatch causes attach to fail with "Partition value mismatch" or similar.
Attaching parts that overlap with existing data. ClickHouse will reject parts whose block ranges conflict with already-present parts. Use unique partition values or freeze existing partitions first.
Not cleaning up inactive parts before copying. You end up copying intermediate parts that the server then refuses or treats as duplicates.

How Pulse Helps With Bulk Loading

Offloading writes to clickhouse-local reduces production cluster load but introduces a new operational surface: jobs that fail silently, mismatched schemas, parts that pile up undetected. Pulse tracks attach operations across your cluster, flags schema drift between source pipelines and target tables, and watches detached/ directories for parts that have been waiting too long. Connect your ClickHouse cluster to Pulse and operationalize offline bulk loading.

Frequently Asked Questions

Q: Does clickhouse-local need the same version as the server?

Same major version is strongly recommended. Part format is stable across maintenance releases but has changed between feature releases. Use the same package for both.

Q: Can I use this for Replicated tables?

Yes. Attach the parts to one replica; ClickHouse replicates the new parts to the others. Make sure block numbers do not conflict by attaching to a partition that no other replica is writing to concurrently.

Q: Is this faster than parallel INSERTs?

For very large loads, usually yes, because the server's CPU is not consumed by parsing and sorting. For small loads, the overhead of generating, copying, and attaching is not worth it.

Q: What if the data does not fit on the local disk?

Process in batches. Generate parts for one partition or one chunk, attach, then move on to the next. The directory only needs space for one batch at a time.

Q: Can I generate parts with skip indexes?

Yes. Define the index in the local CREATE TABLE exactly as on the server. The parts include the index files and the server uses them after attach.

Q: Does this work with MATERIALIZED columns?

Yes, but the materialized expression runs in clickhouse-local at insert time. Make sure any functions referenced are available in the local binary's version.