ClickHouse DB::Exception: Error processing Parquet file

The DB::Exception: PARQUET_EXCEPTION error surfaces when ClickHouse encounters a problem while reading from or writing to an Apache Parquet file. Parquet is a widely used columnar storage format, and this error can stem from corrupt files, unsupported Parquet features, or type incompatibilities between the Parquet schema and the ClickHouse table definition.

Impact

When this error is triggered, the query that attempts to read or write the Parquet file fails entirely. No partial results are returned for SELECT queries, and no data is inserted for INSERT operations. Pipelines that ingest Parquet data from S3, GCS, or local files will stall until the issue is addressed.

Common Causes

  1. The Parquet file is corrupted or was only partially written (e.g., the writer process crashed before finalizing the footer).
  2. The file uses a Parquet feature or encoding not supported by ClickHouse's Parquet reader (such as certain nested types or rare encodings).
  3. A type mismatch exists between the Parquet column types and the target ClickHouse table schema.
  4. The Parquet file was compressed with a codec not enabled in the ClickHouse build (e.g., LZ4_RAW vs LZ4_FRAME, or Brotli).
  5. Reading from a remote source where the file was truncated during transfer.
  6. The Parquet file uses encryption features that ClickHouse does not support.

Troubleshooting and Resolution Steps

  1. Validate the Parquet file outside ClickHouse using parquet-tools or PyArrow:

    pip install pyarrow
    python3 -c "import pyarrow.parquet as pq; print(pq.read_metadata('data.parquet'))"
    
  2. Check the Parquet file schema to understand column types and compare them against your ClickHouse table:

    python3 -c "import pyarrow.parquet as pq; print(pq.read_schema('data.parquet'))"
    
  3. Try reading the file in ClickHouse with DESCRIBE to see how ClickHouse interprets the schema:

    DESCRIBE file('data.parquet', 'Parquet');
    
  4. If specific columns cause the failure, select only the non-problematic columns and handle the difficult ones with explicit casting:

    SELECT col1, col2, CAST(col3 AS Nullable(String))
    FROM file('data.parquet', 'Parquet');
    
  5. For compression-related failures, check which codec the file uses and verify ClickHouse supports it:

    python3 -c "
    import pyarrow.parquet as pq
    f = pq.ParquetFile('data.parquet')
    for i in range(f.metadata.num_row_groups):
        for j in range(f.metadata.num_columns):
            print(f.metadata.row_group(i).column(j).compression)
    "
    
  6. If the file appears corrupt but its footer is still intact, attempt recovery by reading it with PyArrow and rewriting it (if PyArrow also fails to read it, the file is likely unrecoverable):

    import pyarrow.parquet as pq
    # Re-encoding the table produces a freshly written, well-formed file.
    table = pq.read_table('corrupt.parquet')
    pq.write_table(table, 'fixed.parquet')
    
    
  7. For remote files, download the file locally first to rule out network-related truncation:

    -- Instead of
    SELECT * FROM s3('https://bucket/data.parquet');
    -- Try
    SELECT * FROM file('/tmp/data.parquet', 'Parquet');
    

Best Practices

  • Use well-tested Parquet writers (Apache Spark, PyArrow, DuckDB) to produce files destined for ClickHouse.
  • Stick to widely supported compression codecs like Snappy or ZSTD for maximum compatibility.
  • Validate Parquet files in your pipeline before attempting to load them into ClickHouse.
  • When writing Parquet from ClickHouse, use the output_format_parquet_version setting to control the Parquet version for downstream compatibility.
  • Avoid deeply nested Parquet schemas unless your ClickHouse table is designed to handle them with Nested or Tuple types.

Frequently Asked Questions

Q: Which Parquet versions does ClickHouse support?
A: ClickHouse supports Parquet format versions 1.0 and 2.x (2.4, 2.6). Most standard Parquet files produced by modern tools work without issues.

Q: Can ClickHouse read Parquet files with nested structs?
A: Yes, but with limitations. Simple nested structs map to ClickHouse Tuple types. Deeply nested or repeated groups may require flattening the data before import.

Q: Does the PARQUET_EXCEPTION error ever indicate a ClickHouse bug?
A: Occasionally, yes. If your file is valid and readable by other tools but ClickHouse rejects it, consider filing a bug report with a minimal reproduction case.
