ClickHouse substring Function: Syntax, Offsets, and UTF-8 Examples

The substring function in ClickHouse extracts a portion of a string starting at a byte offset and optionally limited to a given length. It uses 1-based indexing (offset 1 is the first byte), and a negative offset counts from the end of the string. Offset 0 is a special case that always returns an empty string. Aliases: substr, mid, byteSlice. For Unicode-correct slicing by code point, use substringUTF8.

Syntax

substring(s, offset[, length])

See the official ClickHouse substring reference for the canonical definition.

Parameters

Name Type Description Required
s String, FixedString, or Enum The source string. Yes
offset (U)Int* 1-based starting byte. 0 returns empty; negative counts from the end. Yes
length (U)Int* Maximum number of bytes to return. Omitted means "to end of string". No

Return type: String. If any argument is NULL, returns NULL.

Examples

Basic extraction

SELECT substring('Hello, World!', 1, 5) AS prefix;
-- Output: Hello

Offset 1 is the first byte. length is the maximum number of bytes returned.

Omitting length

SELECT substring('clickhouse-server', 12) AS suffix;
-- Output: server

When length is omitted, substring returns from offset to the end of the string.

Negative offset

SELECT substring('clickhouse', -5) AS tail;
-- Output: house

SELECT substring('clickhouse', -5, 3) AS tail3;
-- Output: hou

A negative offset counts from the right. Combine with length to slice from a known distance from the end.

Special case: offset 0

SELECT substring('anything', 0, 5) AS result;
-- Output: '' (empty string)

offset = 0 is not "the first character"; it always returns an empty string. This is the single most common source of bugs when porting Python or JavaScript code.

UTF-8 strings: use substringUTF8

SELECT
    substring('cafe\xcc\x81', 1, 4)      AS bytes,    -- "caf" plus partial code point
    substringUTF8('cafe\xcc\x81', 1, 4)  AS chars;   -- "café" (4 code points)

substring works on bytes, so it can split a multi-byte UTF-8 sequence. substringUTF8 counts code points and is the right choice for human-readable strings.

Extracting domain from email

SELECT substring(email, position(email, '@') + 1) AS domain
FROM users
LIMIT 5;

Combine substring with position, `length`, or regex helpers for fixed-pattern extraction.

Common Issues and Gotchas

  • Offset 0 returns empty, not the first character. Most languages use 0-based indexing; ClickHouse uses 1-based. Always start at 1.
  • substring counts bytes, substringUTF8 counts code points. Cutting a multi-byte character with substring produces invalid UTF-8. Use substringUTF8 for any column that may contain non-ASCII text.
  • substringUTF8 does not validate input. It assumes the source is valid UTF-8 and produces undefined results on invalid bytes. Validate ingest if input is untrusted.
  • FixedString includes trailing nulls. Slicing a FixedString(20) containing 'hi' returns padding bytes. Strip with trim(BOTH '\0' FROM substring(col, ...)) or store as String if length varies.
  • Negative length is not supported in older versions. Negative length was added in v22.x; on older builds it raises ARGUMENT_OUT_OF_BOUND. Verify against your deployment with SELECT version().

Performance Notes

substring is a constant-time slice (it returns a view over the source string's bytes). It is heavily vectorized and rarely the bottleneck. The slow case is when substring appears inside a WHERE clause on an indexed column: predicates like WHERE substring(col, 1, 3) = 'abc' cannot use the MergeTree primary index. For repeated prefix lookups, store the prefix as a materialized column or use startsWith(col, 'abc'), which can use index scans.

Frequently Asked Questions

Q: Does ClickHouse's substring use 0-based or 1-based indexing?
A: 1-based. substring(s, 1, 5) returns the first five bytes. substring(s, 0, ...) is a special case that always returns an empty string regardless of length.

Q: How does substring handle NULL values?
A: If any argument is NULL, the result is NULL. Wrap the input with `coalesce` if you need a default: substring(coalesce(col, ''), 1, 5).

Q: What's the difference between substring and substringUTF8?
A: substring operates on bytes; substringUTF8 operates on Unicode code points. For multi-byte text (emoji, accented characters, CJK), substringUTF8 preserves character boundaries. substring is faster but unsafe for non-ASCII data.

Q: Can substring be used in a WHERE clause?
A: Yes, but it prevents primary-index pruning. Use startsWith(col, 'prefix') or endsWith(col, 'suffix') when filtering, both of which can use index scans on MergeTree tables.

Q: How does substring behave with a negative offset?
A: A negative offset counts from the end of the string. substring('hello', -3) returns 'llo'. Combine with length to extract a fixed-width slice from the right edge.

Q: What are substring's aliases?
A: substr, mid, and byteSlice. All four names invoke the same function. mid matches MySQL/MS SQL Server; substr matches Postgres and Oracle.

Pulse Catches Slow String Operations Before They Page You

String-heavy queries are often the first to slow down when CPU pressure or memory limits kick in. Pulse provides AI-powered monitoring for ClickHouse, with automated root-cause analysis that links slow queries to cluster conditions - hot replicas, lagging merges, memory exhaustion - before dashboards start timing out.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.