ClickHouse Python Client: clickhouse-connect, Pandas Integration, and Bulk Inserts

Two Python clients dominate ClickHouse integration: clickhouse-connect and clickhouse-driver. They differ at the protocol level in ways that affect deployment, performance, and operational complexity. clickhouse-connect (maintained by ClickHouse, Inc.) communicates over HTTP/HTTPS using ports 8123 and 8443. clickhouse-driver is a community-maintained library that speaks the native TCP binary protocol on ports 9000 and 9440. The native protocol has a more compact wire format and can be faster for large result sets, but HTTP is easier to route through proxies, load balancers, and firewalls - and it's what ClickHouse Cloud exposes by default. For new projects, clickhouse-connect is the straightforward choice: it's actively maintained by the people running the database, it handles ClickHouse Cloud authentication out of the box, and its Pandas integration is first-class.

Install it with pip install clickhouse-connect. The library requires Python 3.9 or higher. If you need Pandas support, install pandas and numpy separately - they're optional dependencies.

Connecting and Running Queries

Creating a client is a single function call. For a local instance:

import clickhouse_connect

client = clickhouse_connect.get_client(
    host='localhost',
    port=8123,
    username='default',
    password='',
)

For ClickHouse Cloud, TLS is mandatory and the port is 8443:

client = clickhouse_connect.get_client(
    host='abc123.us-east-1.aws.clickhouse.cloud',
    port=8443,
    username='default',
    password='your_password',
)

The client uses connection pooling internally via urllib3. You don't manage connections manually - a single client instance is safe to use across a multi-threaded application. What catches people off guard on ClickHouse Cloud is certificate verification. The library uses Python's certifi bundle by default. If you're connecting to a self-managed instance with a self-signed certificate, pass verify=False or point ca_cert at your CA file. On Cloud, leave verification enabled.
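
For example, a hedged sketch of connecting to a self-managed instance with a private CA - the hostname and certificate path here are placeholders:

client = clickhouse_connect.get_client(
    host='ch-internal.example.com',  # placeholder hostname
    port=8443,
    username='default',
    password='your_password',
    ca_cert='/etc/ssl/certs/internal_ca.pem',  # placeholder path to your CA bundle
    # verify=False,  # last resort: disables certificate verification entirely
)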

Running queries returns a QueryResult object. The raw rows are in result_rows as a list of tuples, with column names accessible from column_names:

result = client.query('SELECT name, engine, total_rows FROM system.tables LIMIT 10')

for row in result.result_rows:
    print(row)

# Access column metadata
print(result.column_names)  # ('name', 'engine', 'total_rows')

For DDL and other commands that don't return rows, use command():

client.command('''
    CREATE TABLE IF NOT EXISTS events (
        event_id  UInt64,
        user_id   UInt32,
        event_ts  DateTime,
        payload   String
    ) ENGINE = MergeTree()
    ORDER BY (event_ts, user_id)
''')

Pandas Integration

clickhouse-connect has native Pandas support through query_df() and insert_df(). query_df() returns a DataFrame directly without an intermediate conversion step:

df = client.query_df(
    'SELECT user_id, count() AS events, max(event_ts) AS last_seen '
    'FROM events GROUP BY user_id'
)

print(df.dtypes)
print(df.head())

ClickHouse types map to Pandas/NumPy dtypes automatically. UInt32 becomes uint32, DateTime becomes datetime64[s] (in pandas 2.x; earlier pandas versions used datetime64[ns]), Float64 becomes float64. Nullable columns map to the object dtype because Pandas' integer types don't natively support NULL in the same way ClickHouse does. If you're doing downstream arithmetic on nullable integer columns, you may need to call .astype('Int64') (capital I - the pandas nullable integer type) after the query, or handle NaN values explicitly.
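
A small sketch of that cast, assuming the query returned a nullable integer column named user_id:

# Coerce the object-typed nullable integers to pandas' nullable Int64 dtype
df['user_id'] = df['user_id'].astype('Int64')
print(df['user_id'].dtype)  # Int64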

Loading a DataFrame back into ClickHouse uses insert_df(). The column names in the DataFrame must match the target table's column names exactly:

import pandas as pd

df = pd.DataFrame({
    'event_id': [1001, 1002, 1003],
    'user_id':  [42, 43, 42],
    'event_ts': pd.to_datetime(['2024-11-01 10:00:00', '2024-11-01 10:01:00', '2024-11-01 10:02:00']),
    'payload':  ['click', 'scroll', 'click'],
})

client.insert_df('events', df)

If the names don't align, pass column_names explicitly to insert_df() to map DataFrame columns to table columns.
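
A hedged sketch of that mapping, assuming clickhouse-connect pairs DataFrame columns with the given names positionally (the rename below is purely illustrative):

df2 = df.rename(columns={'event_ts': 'ts'})  # illustrative label mismatch

client.insert_df(
    'events',
    df2,
    column_names=['event_id', 'user_id', 'event_ts', 'payload'],  # target table columns, in DataFrame column order
)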

Bulk Insert Patterns

Row-by-row inserts kill ClickHouse performance. The MergeTree engine writes every INSERT as a new immutable data part on disk - the data is sorted and indexed at write time - and background merges combine those parts later. If you flood it with thousands of single-row inserts per second, the part count balloons, merges fall behind, and you hit the Too many parts error. The fix is batching.

The insert() method accepts a list of tuples or lists, which maps directly to a single INSERT statement:

batch = [
    (1001, 42, '2024-11-01 10:00:00', 'click'),
    (1002, 43, '2024-11-01 10:01:00', 'scroll'),
    (1003, 42, '2024-11-01 10:02:00', 'purchase'),
    # ... thousands more
]

client.insert(
    'events',
    batch,
    column_names=['event_id', 'user_id', 'event_ts', 'payload']
)

For high-volume pipelines, target batches of 100,000 to 1,000,000 rows per INSERT. That range keeps part sizes reasonable and gives ClickHouse's MergeTree enough data per write to be efficient. In practice, if you're consuming from a queue or stream, accumulate rows in a list until you hit a row count threshold or a time deadline (whichever comes first), then flush. Something like 100,000 rows or 5 seconds is a reasonable starting point.
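
A minimal sketch of that accumulate-and-flush loop - consume_stream() is a stand-in for whatever queue or stream consumer you actually have, and the thresholds mirror the starting point above:

import time

BATCH_ROWS = 100_000
BATCH_SECONDS = 5
COLUMNS = ['event_id', 'user_id', 'event_ts', 'payload']

buffer = []
last_flush = time.monotonic()

for row in consume_stream():  # placeholder: yields (event_id, user_id, event_ts, payload) tuples
    buffer.append(row)
    if len(buffer) >= BATCH_ROWS or time.monotonic() - last_flush >= BATCH_SECONDS:
        client.insert('events', buffer, column_names=COLUMNS)
        buffer.clear()
        last_flush = time.monotonic()

# Flush whatever is left when the stream ends
if buffer:
    client.insert('events', buffer, column_names=COLUMNS)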

ClickHouse also has a server-side async insert mode (async_insert=1), which lets the server buffer small inserts and batch them internally:

client.insert(
    'events',
    small_batch,
    column_names=['event_id', 'user_id', 'event_ts', 'payload'],
    settings={'async_insert': 1, 'wait_for_async_insert': 0},
)

wait_for_async_insert=0 means the server acknowledges the insert immediately without waiting for the buffer to flush. This can dramatically increase throughput in append-heavy workloads but should be used with care: insert errors (malformed rows, type mismatches, constraint violations) are not reported back to the client. In production, use wait_for_async_insert=1 (the default when async_insert is enabled) unless you have explicit monitoring on system.async_insert_log to catch failures.
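
If you do run with wait_for_async_insert=0, a periodic check along these lines catches flush failures after the fact - this assumes system.async_insert_log is enabled on your server, and the column names should be double-checked against your server version:

failures = client.query_df(
    "SELECT table, status, exception, count() AS failed_flushes "
    "FROM system.async_insert_log "
    "WHERE status != 'Ok' AND event_time > now() - INTERVAL 1 HOUR "
    "GROUP BY table, status, exception"
)
print(failures)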

Async Usage

clickhouse-connect is synchronous by default, but ships with an async wrapper for use in asyncio contexts:

import asyncio
import clickhouse_connect

async def run():
    client = await clickhouse_connect.get_async_client(
        host='localhost',
        port=8123,
        username='default',
        password='',
    )

    result = await client.query('SELECT count() FROM events')
    print(result.result_rows[0][0])

    rows = [
        (2001, 99, '2024-11-02 09:00:00', 'login'),
        (2002, 99, '2024-11-02 09:01:00', 'click'),
    ]
    await client.insert('events', rows, column_names=['event_id', 'user_id', 'event_ts', 'payload'])

    await client.close()

asyncio.run(run())

The async client wraps the synchronous implementation using asyncio's thread executor under the hood - it doesn't use a native async HTTP library for the actual socket I/O. For most applications that mix async Python code with ClickHouse reads and writes, this is fine. For very high-concurrency workloads where you're firing hundreds of simultaneous queries, the thread executor model can become a bottleneck; the aiochclient library offers more natural async I/O built on aiohttp, but check its maintenance status before adopting it — it has seen limited activity as of early 2026.

Common Pitfalls

The type coercion behavior during inserts deserves attention. For columns with types that have multiple valid Python representations - UUID, IPv4Address, Decimal - the library inspects the first non-None value in each column to determine how to serialize the entire column. If you have a mixed-type column where the first row is None and subsequent rows are actual UUID objects, the library may fall back to string serialization and fail silently or raise confusing errors. Always keep Python-side types consistent within a batch.
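
A hedged illustration of that advice - the sessions table and its Nullable(UUID) column are hypothetical:

import uuid

# Every non-null value is a uuid.UUID, and a typed value appears first,
# so the serializer has a concrete type to infer from.
batch = [
    (3001, uuid.uuid4()),
    (3002, None),
    (3003, uuid.uuid4()),
]
client.insert('sessions', batch, column_names=['session_id', 'device_uuid'])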

Connection pooling defaults work for most scenarios, but if you're running many concurrent threads against the same client, you may see urllib3 pool exhaustion. The client's internal pool handles 8 connections by default. For a multi-threaded web application with heavy ClickHouse usage, consider either configuring a larger pool or creating one client per worker thread.
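
If you go the larger-pool route, clickhouse-connect lets you supply your own urllib3 pool manager - a sketch, with sizes chosen arbitrarily:

from clickhouse_connect.driver import httputil

big_pool = httputil.get_pool_manager(maxsize=24, num_pools=4)  # sizes are illustrative

client = clickhouse_connect.get_client(
    host='localhost',
    port=8123,
    username='default',
    password='',
    pool_mgr=big_pool,
)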

For ClickHouse Cloud specifically: the HTTP endpoint enforces TLS, and connections that are idle for a few minutes will be reset. Clients that hold connections open in a connection pool across long idle periods will get a broken pipe on the next request. clickhouse-connect handles reconnection automatically on the next query, so this usually surfaces only as a logged warning rather than a hard failure - but it's worth knowing when reading logs.

clickhouse-driver remains the better option if you need the native binary protocol - for example, when streaming very large result sets where the HTTP overhead adds up, or when connecting to a self-managed cluster on a private network where the native port is already open. Its async story is less mature: the asynch library provides a native-protocol asyncio driver, but it's a separate project and not officially supported by ClickHouse, Inc. For anything touching ClickHouse Cloud, clickhouse-connect is the path of least resistance.
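
For reference, the minimal equivalent with clickhouse-driver over the native protocol looks like this (synchronous, default native port 9000):

from clickhouse_driver import Client

native_client = Client(host='localhost', port=9000, user='default', password='')
rows = native_client.execute('SELECT count() FROM events')
print(rows[0][0])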
