ClickHouse Machine Learning Functions: stochasticLinearRegression, Regression Combinators, and Linear Models

Q: What is the difference between `stochasticLinearRegression` and `simpleLinearRegression`?

simpleLinearRegression solves ordinary least squares in closed form for a single (x, y) pair and returns slope and intercept. stochasticLinearRegression handles multiple features via gradient descent and supports persisting a reusable model state.

Q: How do I save a trained model so other queries can use it?

Train with the -State combinator ( stochasticLinearRegressionState ) and store the result in a table — Memory for an ephemeral model, or a MergeTree engine for persistence. Reference that state in evalMLMethod .

Q: Can I combine models trained on different shards?

Yes. Because the model is an aggregate-function state, the -Merge family of combinators can merge partial states into a single state, which evalMLMethod can then evaluate. Feature ordering and scaling must be identical across all partial states.

Q: Which gradient descent method should I choose?

Adam (the default) is a reasonable starting point and adapts the step size automatically. SGD , Momentum , and Nesterov are available if you want manual control; benchmark on a holdout set, since the best choice depends on your feature distribution and batch size.

ClickHouse ships a small set of machine learning aggregate functions that let you fit linear models without exporting data to Python or another runtime. The core functions are stochasticLinearRegression (regression), stochasticLogisticRegression (binary classification), and simpleLinearRegression (ordinary least squares for a single feature), with predictions served by evalMLMethod.

These are not a replacement for a full ML platform, but for in-database scoring, forecasting baselines, and feature pipelines that already live in ClickHouse, they remove an entire data round-trip. This guide covers the function signatures, how the -State combinator persists a trained model, how to predict with evalMLMethod, and the practical gotchas — chiefly feature scaling.

The Available ML Functions

Function	Type	Solver	Output
`simpleLinearRegression(x, y)`	OLS, single feature	Closed-form	`Tuple(k, b)` for `y = k*x + b`
`stochasticLinearRegression(...)(target, x1, …)`	Multi-feature regression	Stochastic gradient descent (MSE loss)	Weights + bias, or a model state
`stochasticLogisticRegression(...)(target, x1, …)`	Binary classification	Stochastic gradient descent	Probability of label `1`, or a model state
`evalMLMethod(state, x1, …)`	Inference	—	Predicted value per row

simpleLinearRegression is the simplest: it computes the slope and intercept of a best-fit line in closed form and returns a tuple.

SELECT arrayReduce('simpleLinearRegression', [0, 1, 2, 3], [3, 4, 5, 6]);
-- Returns: (1,3)  -> y = 1*x + 3

For anything with more than one feature, you use the stochastic (gradient descent) variants.

stochasticLinearRegression

The signature is:

stochasticLinearRegression(
    [learning_rate, l2_regularization_coef, mini_batch_size, method]
)(target, x1, x2, ...)

The four optional parameters control gradient descent:

Parameter	Meaning	Default
`learning_rate`	Step-length coefficient for each gradient step	`0.00001`
`l2_regularization_coef`	L2 penalty to reduce overfitting	`0.1`
`mini_batch_size`	Rows used to compute each gradient step	`15`
`method`	Weight-update strategy: `Adam`, `SGD`, `Momentum`, or `Nesterov`	`Adam`

Calling it directly returns the learned weights followed by the bias as an Array(Float64):

SELECT stochasticLinearRegression(0.1, 0.0, 5, 'SGD')(target, x1, x2)
FROM train_data;

The default learning rate is deliberately tiny. With unscaled features it often will not converge in a single pass, which is why feature scaling (below) matters so much.

Persisting a Model with the -State Combinator

The weights array is useful for inspection, but to reuse a trained model you persist its aggregate-function state. The -State combinator returns the intermediate aggregation state instead of a final value — the same mechanism that powers AggregatingMergeTree and materialized views. For ML functions, that state is the trained model.

Store it in a small table:

CREATE TABLE my_model
ENGINE = Memory AS
SELECT
    stochasticLinearRegressionState(0.1, 0.0, 5, 'SGD')(target, x1, x2) AS state
FROM train_data;

Memory is fine for a single-row model you rebuild on demand; use a MergeTree (ENGINE = MergeTree ORDER BY tuple()) if you want the model to survive restarts.

Predicting with evalMLMethod

evalMLMethod takes the model state as its first argument, followed by the feature columns in the same order used during training, and returns one prediction per row:

WITH (SELECT state FROM my_model) AS model
SELECT
    id,
    evalMLMethod(model, x1, x2) AS prediction
FROM test_data;

The test data must present the same features in the same order; it does not need a target column. Because the model lives in a regular table, you can join predictions back into other queries or feed them through a materialized view.

Merging Multiple Model States

Because the model is an aggregate state, you can combine states trained on different shards or partitions with the -Merge family of combinators, just like any other aggregate. This lets evalMLMethod evaluate against a state assembled from several partial states (a simple form of distributed/ensemble training). Keep feature ordering and scaling identical across the partial states, or the merged weights will be meaningless.

stochasticLogisticRegression

For binary classification, stochasticLogisticRegression has the same parameter list and gradient-descent options, but the target must be a label in [-1, 1], and evalMLMethod returns the probability of the positive class:

CREATE TABLE churn_model
ENGINE = MergeTree ORDER BY tuple() AS
SELECT
    stochasticLogisticRegressionState(1.0, 1.0, 10, 'SGD')(target, x1, x2) AS state
FROM train_data;

-- Score and threshold at 0.5
WITH (SELECT state FROM churn_model) AS model
SELECT
    id,
    evalMLMethod(model, x1, x2) AS probability,
    probability > 0.5 AS predicted_label
FROM test_data;

Feature Scaling: The Critical Step

Because both stochastic functions use gradient descent, continuous features must be scaled or the model will not converge. The standard approach is standardization: subtract the training-set mean and divide by the training-set standard deviation for each feature.

Compute the statistics once on the training set and apply the same constants to both training and test data:

-- Capture scaling constants from the training set
WITH
    (SELECT avg(distance_km) FROM train_data) AS mu,
    (SELECT stddevPop(distance_km) FROM train_data) AS sigma
SELECT
    stochasticLinearRegressionState(0.1, 0.0, 32, 'Adam')(
        target,
        (distance_km - mu) / sigma
    ) AS state
FROM train_data;

You must reapply the identical mu and sigma when calling evalMLMethod, otherwise predictions are silently wrong. A common pattern is to store the scaling constants alongside the model, or to bake them into a view. Skewed targets (counts, prices, durations) often benefit from a log() transform as well.

Best Practices

Always scale continuous features. Standardize to zero mean and unit variance; reuse the training-set statistics at inference time. This is the single most common reason a ClickHouse regression "doesn't work."
Raise the learning rate after scaling. The 0.00001 default is conservative. With scaled inputs, values like 0.1 with Adam converge far faster — test on a holdout set.
Use simpleLinearRegression when you have one feature. It is exact and closed-form; gradient descent only earns its keep with multiple features.
Persist the state, not the weights array, if you want to score new data later — only a state works with evalMLMethod.
Keep feature order stable. evalMLMethod matches features positionally, not by name. A reordered SELECT produces wrong predictions with no error.
Treat these as baselines. Linear and logistic models are interpretable and cheap, but for non-linear relationships you will likely export features to a dedicated ML stack.

Common Issues

Predictions are constant or NaN — almost always unscaled features or a learning rate that is too high (diverging). Scale first, then tune the rate.
evalMLMethod errors on state type — you passed the result of the plain function (an array) instead of a -State result. Train with stochasticLinearRegressionState.
Train/test mismatch — different scaling constants, a different feature order, or a feature present in training but missing at inference. Lock the pipeline down with a view.
Categorical features — these functions only accept numeric inputs. One-hot encode categories into separate numeric columns before training.

How Pulse Helps

In-database ML can quietly become a correctness and operations problem: a model state stored in a Memory table that vanishes on restart, scaling constants that drift between training and inference, or a regression that silently diverges because the learning rate was never tuned for unscaled inputs. Pulse monitors ClickHouse in production — surfacing the table-engine and memory choices behind model-storage tables, flagging schema or ordering changes that would break a positional evalMLMethod pipeline, and helping you keep training and scoring workloads from contending with your interactive queries. If you are building ML features on top of ClickHouse, Pulse helps you keep the surrounding data plumbing reliable.

Frequently Asked Questions

Q: Can ClickHouse train deep learning or non-linear models?

No. The built-in ML functions cover linear regression and logistic regression only, fit by stochastic gradient descent. For non-linear models, use ClickHouse for feature engineering and export to a dedicated ML framework.

Q: What is the difference between stochasticLinearRegression and simpleLinearRegression?

simpleLinearRegression solves ordinary least squares in closed form for a single (x, y) pair and returns slope and intercept. stochasticLinearRegression handles multiple features via gradient descent and supports persisting a reusable model state.

Q: Why does my model produce nonsense predictions?

The most common cause is unscaled features. Gradient descent needs standardized inputs (zero mean, unit variance), and you must apply the same scaling constants at both training and prediction time. A second cause is feature order — evalMLMethod matches features by position.

Q: How do I save a trained model so other queries can use it?

Train with the -State combinator (stochasticLinearRegressionState) and store the result in a table — Memory for an ephemeral model, or a MergeTree engine for persistence. Reference that state in evalMLMethod.

Q: Can I combine models trained on different shards?

Yes. Because the model is an aggregate-function state, the -Merge family of combinators can merge partial states into a single state, which evalMLMethod can then evaluate. Feature ordering and scaling must be identical across all partial states.

Q: Which gradient descent method should I choose?

Adam (the default) is a reasonable starting point and adapts the step size automatically. SGD, Momentum, and Nesterov are available if you want manual control; benchmark on a holdout set, since the best choice depends on your feature distribution and batch size.