Ship ClickHouse Logs as ndjson with Vector.dev

Q: Can Vector send logs to multiple destinations?

Yes. Define multiple [sinks.*] blocks. The same transform output can fan out to a file sink, an S3 sink, and a ClickHouse sink in parallel.

Q: How do I roll logs up by hour for cheaper storage?

Set a TTL on the clickhouse_logs table ( TTL timestamp + INTERVAL 30 DAY DELETE ) or use a materialized view that aggregates by (toStartOfHour(timestamp), severity) for long-term retention.

ClickHouse writes its server log as multi-line text. That format is fine for grep but painful for any pipeline that wants structured fields. There are two ways to turn it into ndjson: have ClickHouse emit JSON natively (22.8+), or keep the text format and use Vector.dev to parse and re-emit. This article covers both, with the Vector pipeline as the primary route since it works on any version.

Option 1: Native JSON logs (22.8+)

Since 22.8, ClickHouse can write its log in JSON directly:

<?xml version="1.0"?>
<clickhouse>
    <logger>
        <formatting>
            <type>json</type>
        </formatting>
    </logger>
</clickhouse>

Each log line becomes a JSON object with date_time, thread_name, thread_id, level, query_id, logger_name, source_file, source_line, and message. If you can run a recent version, this is the path of least resistance. Skip to "Storing structured logs in ClickHouse" below.

Option 2: Parse text logs with Vector

For versions before 22.8 or when you don't want to change ClickHouse config, Vector.dev reads the existing text log and re-emits ndjson.

Install Vector

On Debian/Ubuntu (the legacy packages.timber.io URLs were decommissioned in February 2024; use the apt.vector.dev repository hosted by Datadog):

bash -c "$(curl -L https://setup.vector.dev)"
apt-get install vector

Set up output directory and permissions:

systemctl stop vector
mkdir -p /var/log/clickhouse-server-json
chown vector:vector /var/log/clickhouse-server-json
usermod -a -G clickhouse vector

Adding vector to the clickhouse group is what lets it read /var/log/clickhouse-server/clickhouse-server.log.

Vector source

Read the file, fingerprint by inode so rotation doesn't reprocess everything, and reassemble multi-line entries by anchoring on the timestamp:

[sources.clickhouse-log]
  type                    = "file"
  include                 = [ "/var/log/clickhouse-server/clickhouse-server.log" ]
  fingerprinting.strategy = "device_and_inode"
  message_start_indicator = '^\d+\.\d+\.\d+ \d+:\d+:\d+'
  multi_line_timeout      = 1000

message_start_indicator matches the ClickHouse timestamp prefix (2024.06.12 13:01:55). Any line not matching is glued onto the previous one, which is how multi-line stack traces stay intact.

Vector transform

Use the remap transform with VRL's parse_regex! to split out the fields:

[transforms.clickhouse-log-text]
  inputs = [ "clickhouse-log" ]
  type   = "remap"
  source = '''
     . |= parse_regex!(.message, r'^(?P<timestamp>\d+\.\d+\.\d+ \d+:\d+:\d+\.\d+) \[\s?(?P<thread_id>\d+)\s?\] \{(?P<query_id>.*)\} <(?P<severity>\w+)> (?s)(?P<message>.*$)')
  '''

The regex captures five named groups: timestamp, thread_id, query_id, severity, and message. The (?s) flag at the end lets . match newlines so multi-line stack traces flow into the message field.

Sink: ndjson file

Emit one ndjson file per day to a separate directory:

[sinks.emit-clickhouse-log-json]
  type        = "file"
  inputs      = [ "clickhouse-log-text" ]
  compression = "none"
  path        = "/var/log/clickhouse-server-json/clickhouse-server.%Y-%m-%d.ndjson"
  encoding.only_fields = ["timestamp", "thread_id", "query_id", "severity", "message"]
  encoding.codec       = "ndjson"

Start Vector and check the output:

systemctl start vector
tail /var/log/clickhouse-server-json/clickhouse-server.$(date +%F).ndjson

You should see one JSON object per line, ready to ship to S3, Loki, OpenSearch, or anywhere else that expects ndjson.

Storing structured logs back into ClickHouse

If you want to query the logs with SQL, push them into a dedicated MergeTree table.

Table and user

CREATE TABLE default.clickhouse_logs (
    timestamp   DateTime64(3),
    host        LowCardinality(String),
    thread_id   LowCardinality(String),
    severity    LowCardinality(String),
    query_id    String,
    message     String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (toStartOfHour(timestamp), host, severity, query_id);

CREATE USER vector IDENTIFIED BY 'vector1234';
GRANT INSERT ON default.clickhouse_logs TO vector;
CREATE SETTINGS PROFILE profile_vector
    SETTINGS log_queries = 0
    READONLY
    TO vector;

log_queries = 0 on the Vector user matters. Without it, every insert from Vector writes a row to query_log. Vector then ships that row back through itself, generating another insert, generating another row, and you have a feedback loop that fills disk in hours.

Vector ClickHouse sink

[sinks.clickhouse-output-clickhouse]
    inputs   = ["clickhouse-log-text"]
    type     = "clickhouse"
    host     = "http://localhost:8123"
    database = "default"
    auth.strategy = "basic"
    auth.user     = "vector"
    auth.password = "vector1234"
    healthcheck   = true
    table         = "clickhouse_logs"
    encoding.timestamp_format = "unix"
    buffer.type               = "disk"
    buffer.max_size           = 104900000
    buffer.when_full          = "block"
    request.in_flight_limit   = 20
    encoding.only_fields = ["host", "timestamp", "thread_id", "query_id", "severity", "message"]

buffer.type = "disk" and buffer.when_full = "block" mean Vector survives a ClickHouse restart without losing data.

Querying

SELECT
    timestamp,
    severity,
    query_id,
    message
FROM default.clickhouse_logs
WHERE timestamp > now() - INTERVAL 10 MINUTE
  AND severity IN ('Error', 'Warning')
ORDER BY timestamp DESC
LIMIT 50;

Cross-referencing a slow query with its server-log messages becomes a two-table join on query_id:

SELECT
    q.query_duration_ms,
    q.query,
    l.severity,
    l.message
FROM system.query_log AS q
LEFT JOIN default.clickhouse_logs AS l USING (query_id)
WHERE q.query_id = '<id>'
ORDER BY l.timestamp;

Avoiding the self-logging loop

The biggest footgun: every insert into clickhouse_logs is itself a query, which generates query_log entries and possibly text-log entries. If you don't gate that, Vector reads its own writes and infinite-loops:

Set log_queries = 0 on the Vector user (as above).
Drop INFO-level traffic before the ClickHouse sink. Add a filter transform that keeps only Warning and above for the ClickHouse table, and ship everything to the file sink for offline use.

[transforms.warnings-only]
  type      = "filter"
  inputs    = ["clickhouse-log-text"]
  condition = '.severity != "Information" && .severity != "Debug" && .severity != "Trace"'

Point the ClickHouse sink at this transform's output instead of the raw parsed stream.

Common Pitfalls

Forgetting log_queries = 0 on the Vector user. You get a feedback loop and exponential growth of log_queries.
Running Vector as a user that is not in the clickhouse group. The log file is mode 0640 owned by clickhouse:clickhouse and Vector cannot read it.
Emitting native JSON logs and parsing them with the text-format regex. The Vector regex above expects the classic text format; for native JSON use Vector's parse_json instead.
Skipping message_start_indicator and multi_line_timeout. Multi-line stack traces then arrive as one line per frame, each marked as a separate log event.

Frequently Asked Questions

Q: Do I need Vector if I enable native JSON logs? A: Not for parsing, but Vector is still useful for buffering, retrying, and shipping to multiple sinks.

Q: Will Vector keep up with high log volumes? A: For TRACE-level workloads, no - it'll fall behind. Set the ClickHouse log level to information or warning in production.

Q: How do I keep the original text logs alongside the ndjson? A: ClickHouse still writes the original file; Vector only reads it. Both files exist.

Q: Can Vector send logs to multiple destinations? A: Yes. Define multiple [sinks.*] blocks. The same transform output can fan out to a file sink, an S3 sink, and a ClickHouse sink in parallel.

Q: How do I roll logs up by hour for cheaper storage? A: Set a TTL on the clickhouse_logs table (TTL timestamp + INTERVAL 30 DAY DELETE) or use a materialized view that aggregates by (toStartOfHour(timestamp), severity) for long-term retention.