The substring function in ClickHouse extracts a portion of a string starting at a byte offset and optionally limited to a given length. It uses 1-based indexing (offset 1 is the first byte), and a negative offset counts from the end of the string. Offset 0 is a special case that always returns an empty string. Aliases: substr, mid, byteSlice. For Unicode-correct slicing by code point, use substringUTF8.
Syntax
substring(s, offset[, length])
See the official ClickHouse substring reference for the canonical definition.
Parameters
| Name | Type | Description | Required |
|---|---|---|---|
s |
String, FixedString, or Enum |
The source string. | Yes |
offset |
(U)Int* |
1-based starting byte. 0 returns empty; negative counts from the end. |
Yes |
length |
(U)Int* |
Maximum number of bytes to return. Omitted means "to end of string". | No |
Return type: String. If any argument is NULL, returns NULL.
Examples
Basic extraction
SELECT substring('Hello, World!', 1, 5) AS prefix;
-- Output: Hello
Offset 1 is the first byte. length is the maximum number of bytes returned.
Omitting length
SELECT substring('clickhouse-server', 12) AS suffix;
-- Output: server
When length is omitted, substring returns from offset to the end of the string.
Negative offset
SELECT substring('clickhouse', -5) AS tail;
-- Output: house
SELECT substring('clickhouse', -5, 3) AS tail3;
-- Output: hou
A negative offset counts from the right. Combine with length to slice from a known distance from the end.
Special case: offset 0
SELECT substring('anything', 0, 5) AS result;
-- Output: '' (empty string)
offset = 0 is not "the first character"; it always returns an empty string. This is the single most common source of bugs when porting Python or JavaScript code.
UTF-8 strings: use substringUTF8
SELECT
substring('cafe\xcc\x81', 1, 4) AS bytes, -- "caf" plus partial code point
substringUTF8('cafe\xcc\x81', 1, 4) AS chars; -- "café" (4 code points)
substring works on bytes, so it can split a multi-byte UTF-8 sequence. substringUTF8 counts code points and is the right choice for human-readable strings.
Extracting domain from email
SELECT substring(email, position(email, '@') + 1) AS domain
FROM users
LIMIT 5;
Combine substring with position, `length`, or regex helpers for fixed-pattern extraction.
Common Issues and Gotchas
- Offset 0 returns empty, not the first character. Most languages use 0-based indexing; ClickHouse uses 1-based. Always start at
1. substringcounts bytes,substringUTF8counts code points. Cutting a multi-byte character withsubstringproduces invalid UTF-8. UsesubstringUTF8for any column that may contain non-ASCII text.substringUTF8does not validate input. It assumes the source is valid UTF-8 and produces undefined results on invalid bytes. Validate ingest if input is untrusted.FixedStringincludes trailing nulls. Slicing aFixedString(20)containing'hi'returns padding bytes. Strip withtrim(BOTH '\0' FROM substring(col, ...))or store asStringif length varies.- Negative length is not supported in older versions. Negative
lengthwas added in v22.x; on older builds it raisesARGUMENT_OUT_OF_BOUND. Verify against your deployment withSELECT version().
Performance Notes
substring is a constant-time slice (it returns a view over the source string's bytes). It is heavily vectorized and rarely the bottleneck. The slow case is when substring appears inside a WHERE clause on an indexed column: predicates like WHERE substring(col, 1, 3) = 'abc' cannot use the MergeTree primary index. For repeated prefix lookups, store the prefix as a materialized column or use startsWith(col, 'abc'), which can use index scans.
Frequently Asked Questions
Q: Does ClickHouse's substring use 0-based or 1-based indexing?
A: 1-based. substring(s, 1, 5) returns the first five bytes. substring(s, 0, ...) is a special case that always returns an empty string regardless of length.
Q: How does substring handle NULL values?
A: If any argument is NULL, the result is NULL. Wrap the input with `coalesce` if you need a default: substring(coalesce(col, ''), 1, 5).
Q: What's the difference between substring and substringUTF8?
A: substring operates on bytes; substringUTF8 operates on Unicode code points. For multi-byte text (emoji, accented characters, CJK), substringUTF8 preserves character boundaries. substring is faster but unsafe for non-ASCII data.
Q: Can substring be used in a WHERE clause?
A: Yes, but it prevents primary-index pruning. Use startsWith(col, 'prefix') or endsWith(col, 'suffix') when filtering, both of which can use index scans on MergeTree tables.
Q: How does substring behave with a negative offset?
A: A negative offset counts from the end of the string. substring('hello', -3) returns 'llo'. Combine with length to extract a fixed-width slice from the right edge.
Q: What are substring's aliases?
A: substr, mid, and byteSlice. All four names invoke the same function. mid matches MySQL/MS SQL Server; substr matches Postgres and Oracle.
Pulse Catches Slow String Operations Before They Page You
String-heavy queries are often the first to slow down when CPU pressure or memory limits kick in. Pulse provides AI-powered monitoring for ClickHouse, with automated root-cause analysis that links slow queries to cluster conditions - hot replicas, lagging merges, memory exhaustion - before dashboards start timing out.
Related Reading
- ClickHouse concat Function: rebuild strings after slicing
- ClickHouse replaceAll Function: replace substrings inline
- ClickHouse lower Function: case-normalize before extracting
- ClickHouse upper Function: companion to
lowerfor case-folding - ClickHouse coalesce Function: replace NULL inputs before
substring - ClickHouse MergeTree: how primary-key index scans interact with string predicates
- ClickHouse Projections Guide: precompute extracted substrings for faster lookups