ClickHouse DB::Exception: Cannot extract table structure

Q: How do I handle S3 paths with wildcards?

ClickHouse supports glob patterns in S3 paths: s3('https://s3.amazonaws.com/bucket/data_*.csv', ...) . If schema inference fails with wildcards, it may be because the first matching file is empty or has a different format. Provide an explicit schema to avoid this.

The "DB::Exception: Cannot extract table structure" error in ClickHouse occurs when the automatic schema inference engine fails to determine the column names and types from an external data source. The error code is CANNOT_EXTRACT_TABLE_STRUCTURE. This commonly happens with the file(), url(), s3(), and similar table functions when ClickHouse cannot deduce the schema from the data format and content.

Impact

Queries and table creation statements that rely on automatic schema inference will fail. This prevents ad-hoc exploration of files, URLs, and S3 objects without an explicit schema definition. The error does not affect existing tables or stored data -- it only blocks the specific operation that requires schema detection.

Common Causes

Empty data source -- the file, URL, or S3 object contains no data or only headers, leaving nothing for ClickHouse to infer types from.
Unsupported or ambiguous format -- the data format does not provide enough information for type inference (e.g., raw CSV without clear type indicators).
Corrupted or malformed data -- the source data contains syntax errors, inconsistent column counts, or mixed types that confuse the inference engine.
Access errors disguised as inference failures -- the file does not exist, the URL returns an error page, or S3 credentials are invalid, and the error manifests as a schema inference failure.
Format auto-detection failure -- when the format is not specified explicitly, ClickHouse may guess incorrectly based on the file extension.
Binary or compressed data without format specification -- ClickHouse cannot infer schema from raw binary data or compressed files without knowing the underlying format.

Troubleshooting and Resolution Steps

Provide an explicit schema instead of relying on inference:

SELECT *
FROM file('data.csv', 'CSV', 'col1 String, col2 UInt64, col3 DateTime')
LIMIT 10;

Verify the data source is accessible and non-empty:

# For local files
ls -la /path/to/data.csv
head -5 /path/to/data.csv

# For URLs
curl -I https://example.com/data.csv

# For S3
aws s3 ls s3://bucket/path/to/data.csv

Specify the format explicitly:

-- Instead of relying on auto-detection
SELECT * FROM url('https://example.com/data', 'JSONEachRow') LIMIT 5;

Use DESCRIBE to see what ClickHouse can infer, for debugging:

DESCRIBE TABLE file('data.csv', 'CSV');
-- This may give more specific errors about what went wrong

Check for format-specific issues. For CSV, verify the delimiter and quoting with format_csv_delimiter:
```
SELECT *
FROM file('data.csv', 'CSV', 'col1 String, col2 String')
SETTINGS format_csv_delimiter = ';'
LIMIT 5;
```
For tab-separated data, use the TabSeparated format instead of overriding the CSV delimiter.

For S3, verify credentials and bucket access:

SELECT *
FROM s3('https://s3.amazonaws.com/bucket/data.csv', 'access_key', 'secret_key', 'CSV', 'col1 String, col2 UInt64')
LIMIT 5;

Increase the number of rows used for inference (the default is 25000):

SET input_format_max_rows_to_read_for_schema_inference = 100000;
SELECT * FROM file('data.csv') LIMIT 10;

Best Practices

Always provide explicit schemas for production pipelines rather than relying on automatic inference. Inference is useful for ad-hoc exploration but is fragile for automated workflows.
Specify the data format explicitly in table function calls instead of relying on file extension detection.
Validate external data sources (files, URLs, S3 objects) for accessibility and correct formatting before referencing them in queries.
Use input_format_max_rows_to_read_for_schema_inference to control how many rows ClickHouse samples for type detection.
For files with ambiguous types (e.g., columns that look numeric but occasionally contain strings), provide explicit types to avoid inference errors.

Frequently Asked Questions

Q: Which formats support automatic schema inference?
A: ClickHouse supports schema inference for many formats including JSONEachRow, CSV, TSV, Parquet, ORC, Avro, Arrow, and Native. Binary formats like Parquet and Avro provide the most reliable inference since they include type metadata. Text formats like CSV require ClickHouse to guess types from data values.

Q: Can I provide a partial schema and let ClickHouse infer the rest?
A: No. Schema inference in ClickHouse is all-or-nothing. You either provide the complete schema or let ClickHouse infer all columns. If inference fails, you need to provide the full schema.

Q: Why does inference work for small files but fail for large ones?
A: ClickHouse only reads a sample of rows for inference (controlled by input_format_max_rows_to_read_for_schema_inference). If the first rows have different types than later rows, inference may succeed on the sample but the full query could fail with type errors. This is another reason to prefer explicit schemas.

Q: How do I handle S3 paths with wildcards?
A: ClickHouse supports glob patterns in S3 paths: s3('https://s3.amazonaws.com/bucket/data_*.csv', ...). If schema inference fails with wildcards, it may be because the first matching file is empty or has a different format. Provide an explicit schema to avoid this.