ClickHouse writes its server log as multi-line text. That format is fine for grep but painful for any pipeline that wants structured fields. There are two ways to turn it into ndjson: have ClickHouse emit JSON natively (22.8+), or keep the text format and use Vector.dev to parse and re-emit. This article covers both, with the Vector pipeline as the primary route since it works on any version.
Option 1: Native JSON logs (22.8+)
Since 22.8, ClickHouse can write its log in JSON directly:
<?xml version="1.0"?>
<clickhouse>
<logger>
<formatting>
<type>json</type>
</formatting>
</logger>
</clickhouse>
Each log line becomes a JSON object with date_time, thread_name, thread_id, level, query_id, logger_name, source_file, source_line, and message. If you can run a recent version, this is the path of least resistance. Skip to "Storing structured logs in ClickHouse" below.
Option 2: Parse text logs with Vector
For versions before 22.8 or when you don't want to change ClickHouse config, Vector.dev reads the existing text log and re-emits ndjson.
Install Vector
On Debian/Ubuntu (the legacy packages.timber.io URLs were decommissioned in February 2024; use the apt.vector.dev repository hosted by Datadog):
bash -c "$(curl -L https://setup.vector.dev)"
apt-get install vector
Set up output directory and permissions:
systemctl stop vector
mkdir -p /var/log/clickhouse-server-json
chown vector:vector /var/log/clickhouse-server-json
usermod -a -G clickhouse vector
Adding vector to the clickhouse group is what lets it read /var/log/clickhouse-server/clickhouse-server.log.
Vector source
Read the file, fingerprint by inode so rotation doesn't reprocess everything, and reassemble multi-line entries by anchoring on the timestamp:
[sources.clickhouse-log]
type = "file"
include = [ "/var/log/clickhouse-server/clickhouse-server.log" ]
fingerprinting.strategy = "device_and_inode"
message_start_indicator = '^\d+\.\d+\.\d+ \d+:\d+:\d+'
multi_line_timeout = 1000
message_start_indicator matches the ClickHouse timestamp prefix (2024.06.12 13:01:55). Any line not matching is glued onto the previous one, which is how multi-line stack traces stay intact.
Vector transform
Use the remap transform with VRL's parse_regex! to split out the fields:
[transforms.clickhouse-log-text]
inputs = [ "clickhouse-log" ]
type = "remap"
source = '''
. |= parse_regex!(.message, r'^(?P<timestamp>\d+\.\d+\.\d+ \d+:\d+:\d+\.\d+) \[\s?(?P<thread_id>\d+)\s?\] \{(?P<query_id>.*)\} <(?P<severity>\w+)> (?s)(?P<message>.*$)')
'''
The regex captures five named groups: timestamp, thread_id, query_id, severity, and message. The (?s) flag at the end lets . match newlines so multi-line stack traces flow into the message field.
Sink: ndjson file
Emit one ndjson file per day to a separate directory:
[sinks.emit-clickhouse-log-json]
type = "file"
inputs = [ "clickhouse-log-text" ]
compression = "none"
path = "/var/log/clickhouse-server-json/clickhouse-server.%Y-%m-%d.ndjson"
encoding.only_fields = ["timestamp", "thread_id", "query_id", "severity", "message"]
encoding.codec = "ndjson"
Start Vector and check the output:
systemctl start vector
tail /var/log/clickhouse-server-json/clickhouse-server.$(date +%F).ndjson
You should see one JSON object per line, ready to ship to S3, Loki, OpenSearch, or anywhere else that expects ndjson.
Storing structured logs back into ClickHouse
If you want to query the logs with SQL, push them into a dedicated MergeTree table.
Table and user
CREATE TABLE default.clickhouse_logs (
timestamp DateTime64(3),
host LowCardinality(String),
thread_id LowCardinality(String),
severity LowCardinality(String),
query_id String,
message String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (toStartOfHour(timestamp), host, severity, query_id);
CREATE USER vector IDENTIFIED BY 'vector1234';
GRANT INSERT ON default.clickhouse_logs TO vector;
CREATE SETTINGS PROFILE profile_vector
SETTINGS log_queries = 0
READONLY
TO vector;
log_queries = 0 on the Vector user matters. Without it, every insert from Vector writes a row to query_log. Vector then ships that row back through itself, generating another insert, generating another row, and you have a feedback loop that fills disk in hours.
Vector ClickHouse sink
[sinks.clickhouse-output-clickhouse]
inputs = ["clickhouse-log-text"]
type = "clickhouse"
host = "http://localhost:8123"
database = "default"
auth.strategy = "basic"
auth.user = "vector"
auth.password = "vector1234"
healthcheck = true
table = "clickhouse_logs"
encoding.timestamp_format = "unix"
buffer.type = "disk"
buffer.max_size = 104900000
buffer.when_full = "block"
request.in_flight_limit = 20
encoding.only_fields = ["host", "timestamp", "thread_id", "query_id", "severity", "message"]
buffer.type = "disk" and buffer.when_full = "block" mean Vector survives a ClickHouse restart without losing data.
Querying
SELECT
timestamp,
severity,
query_id,
message
FROM default.clickhouse_logs
WHERE timestamp > now() - INTERVAL 10 MINUTE
AND severity IN ('Error', 'Warning')
ORDER BY timestamp DESC
LIMIT 50;
Cross-referencing a slow query with its server-log messages becomes a two-table join on query_id:
SELECT
q.query_duration_ms,
q.query,
l.severity,
l.message
FROM system.query_log AS q
LEFT JOIN default.clickhouse_logs AS l USING (query_id)
WHERE q.query_id = '<id>'
ORDER BY l.timestamp;
Avoiding the self-logging loop
The biggest footgun: every insert into clickhouse_logs is itself a query, which generates query_log entries and possibly text-log entries. If you don't gate that, Vector reads its own writes and infinite-loops:
- Set
log_queries = 0on the Vector user (as above). - Drop INFO-level traffic before the ClickHouse sink. Add a filter transform that keeps only Warning and above for the ClickHouse table, and ship everything to the file sink for offline use.
[transforms.warnings-only]
type = "filter"
inputs = ["clickhouse-log-text"]
condition = '.severity != "Information" && .severity != "Debug" && .severity != "Trace"'
Point the ClickHouse sink at this transform's output instead of the raw parsed stream.
Common Pitfalls
- Forgetting
log_queries = 0on the Vector user. You get a feedback loop and exponential growth of log_queries. - Running Vector as a user that is not in the
clickhousegroup. The log file is mode 0640 owned byclickhouse:clickhouseand Vector cannot read it. - Emitting native JSON logs and parsing them with the text-format regex. The Vector regex above expects the classic text format; for native JSON use Vector's
parse_jsoninstead. - Skipping
message_start_indicatorandmulti_line_timeout. Multi-line stack traces then arrive as one line per frame, each marked as a separate log event.
Frequently Asked Questions
Q: Do I need Vector if I enable native JSON logs? A: Not for parsing, but Vector is still useful for buffering, retrying, and shipping to multiple sinks.
Q: Will Vector keep up with high log volumes?
A: For TRACE-level workloads, no - it'll fall behind. Set the ClickHouse log level to information or warning in production.
Q: How do I keep the original text logs alongside the ndjson? A: ClickHouse still writes the original file; Vector only reads it. Both files exist.
Q: Can Vector send logs to multiple destinations?
A: Yes. Define multiple [sinks.*] blocks. The same transform output can fan out to a file sink, an S3 sink, and a ClickHouse sink in parallel.
Q: How do I roll logs up by hour for cheaper storage?
A: Set a TTL on the clickhouse_logs table (TTL timestamp + INTERVAL 30 DAY DELETE) or use a materialized view that aggregates by (toStartOfHour(timestamp), severity) for long-term retention.