SQL Query Optimization Techniques

Q: Is the PostgreSQL queryid stable across versions?

No. The PostgreSQL docs state it is not safe to assume queryid is stable across major versions, because the jumble depends on internal structures and object identifiers. It is stable between minor releases on the same architecture with matching catalog metadata, which is why WAL replicas share the same queryid as their primary.

Q: Why do two identical-looking queries sometimes get different fingerprints?

Because PostgreSQL's queryid is computed from the post-parse-analysis tree, which carries resolved object IDs and settings like search_path . If a referenced function or table is dropped and recreated, or the two queries resolve different objects, the trees differ and the fingerprints diverge even though the text looks identical.

Q: What is the difference between pg_stat_statements queryid and a libpg_query fingerprint?

pg_stat_statements queryid hashes the post-parse-analysis tree, so it includes catalog OIDs and is only comparable on servers running the same version with the same catalog. libpg_query fingerprints the raw parse tree before catalog resolution, producing a portable identifier you can compute from query text on any machine.

Q: Does query normalization replace values in IN lists?

It depends on the tool. pt-query-digest collapses any-length IN (...) list to a single placeholder, so IN (1,2,3) and IN (7,8) fingerprint identically. PostgreSQL historically gave different-length lists different queryid values; PostgreSQL 17 added merging of simple IN-list and array constants to reduce that proliferation.

A SQL query fingerprint is a stable identifier that collapses queries differing only in their literal values into a single group. SELECT * FROM orders WHERE id > 50 and SELECT * FROM orders WHERE id > 100 produce the same fingerprint because the constants are normalized away, leaving SELECT * FROM orders WHERE id > $1. This is how query-analysis tools recognize that thousands of distinct query texts are really one query pattern, so you can rank patterns by total time, call count, or rows read instead of drowning in near-duplicates.

One of the things that Pulse does when analyzing query performance is recognizing duplicates. To attribute load correctly you have to recognize that two queries are the same query with different parameters, then aggregate their execution stats under one identifier. Fingerprinting is the mechanism that makes that possible.

How Query Fingerprinting Works

Fingerprinting normalizes a query, then hashes the normalized form. Normalization replaces literal constants with placeholders ($1, ?, or a generic token), strips formatting and whitespace differences, and in some implementations collapses IN (...) lists of varying length to a single element. Two queries that normalize to the same form get the same hash, and therefore the same fingerprint.

The hard part is normalization, not hashing. Doing it on raw query text with regular expressions is fragile - string literals can contain anything, comments interleave with SQL, and users_2009 versus users_2010 may or may not be the same logical query. The robust approach parses the query into a tree first, then walks the tree and ignores the nodes that hold constant values. Working on the parse tree is why two semantically different queries with identical text (for example, the same text resolved under different search_path settings) can end up as distinct fingerprints.

There are two broad families. PostgreSQL's built-in queryid hashes the post-parse-analysis tree, which includes resolved object identifiers, so it is server-local but precise. Portable fingerprinters like libpg_query hash the raw parse tree before catalog resolution, so the same fingerprint can be computed on any machine from the query text alone.

PostgreSQL queryid and pg_stat_statements

PostgreSQL computes a 64-bit bigint query identifier by "jumbling" the post-parse-analysis tree: it walks the parsed node tree, reads the fields that define query shape, ignores the nodes holding constant values, and hashes the result. The constant nodes that were ignored are shown as $1, $2 placeholders in the normalized text that pg_stat_statements stores. From the upstream documentation: "two queries will be considered the same for this purpose if they are semantically equivalent except for the values of literal constants appearing in the query."

Since PostgreSQL 14 this computation lives in core, controlled by the compute_query_id GUC. The default is auto, which lets a module such as pg_stat_statements turn it on. Valid values are off, on, auto, and regress. With it enabled, the identifier surfaces in pg_stat_statements.queryid, in pg_stat_activity.queryid, in EXPLAIN (VERBOSE), and in the log line prefix via %Q. Before version 14, pg_stat_statements computed the value itself and it was not exposed elsewhere.

The identifier is stable and comparable as long as the server version and catalog metadata match. Two servers in a WAL streaming-replication pair share the same queryid for the same query, because they run the same binary against the same catalog OIDs. The same does not hold across major versions. The docs are explicit: "it is not safe to assume that queryid will be stable across major versions of PostgreSQL," while values "are stable between minor version releases" given the same machine architecture and catalog metadata. Because the hash depends on internal object IDs, dropping and recreating a referenced function makes two otherwise-identical queries hash differently.

-- Enable in core (postgresql.conf), then load the extension
-- compute_query_id = on   # or leave at the default 'auto'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Find the heaviest query patterns by total execution time.
-- Each row is one fingerprint (queryid), aggregating all
-- executions that differed only in their constants.
SELECT queryid,
       calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

To see what an active backend is running right now, join pg_stat_activity.queryid back to pg_stat_statements to recover the normalized text.

Portable Fingerprints and the MySQL Digest

When you need a fingerprint that is identical across servers and versions, compute it from the raw parse tree client-side. The pg_query family (the Ruby gem, pg_query_go, and the underlying `libpg_query` C library) statically links the real PostgreSQL parser, so it parses every valid PostgreSQL statement, then fingerprints the raw parse tree before catalog resolution. As of fingerprint version 3.0 the output is xxHash XXH3 64-bit, shown as a 16-character hex string. It ignores constants, node location fields, output column aliases, and trailing digit sequences in relation names, which makes it portable where queryid is not.

MySQL takes a different route. The Performance Schema parser produces a normalized "statement digest" - the digest text, with values replaced and whitespace collapsed - and a SHA-256 DIGEST hash over it. These aggregate in events_statements_summary_by_digest, keyed by SCHEMA_NAME plus DIGEST. Percona's pt-query-digest and pt-fingerprint work on the SQL text token stream instead: they lowercase the query, replace literals (including NULL and hex), collapse IN (...) and VALUES (...) lists of any length to one placeholder, and replace digits embedded in identifiers, so users_2009 and users_2010 fingerprint the same.

Mechanism	Engine	Operates on	Hash	Portable across servers/versions
`pg_stat_statements` `queryid`	PostgreSQL 14+	Post-parse-analysis tree (with OIDs)	64-bit jumble	No - tied to server version and catalog
`libpg_query` fingerprint v3	PostgreSQL (client-side)	Raw parse tree	xxHash XXH3 64-bit (hex)	Yes - computed from query text
Performance Schema `DIGEST`	MySQL 8.0+	Parser token stream	SHA-256	Per server; stable for a given parser
`pt-query-digest` / `pt-fingerprint`	MySQL (offline)	SQL text tokens	CRC of fingerprint text	Yes - text-based

Normalization examples, side by side:

-- Raw queries
SELECT * FROM orders WHERE status = 'paid' AND id IN (1, 2, 3);
SELECT * FROM orders WHERE status = 'open' AND id IN (7, 8);

-- PostgreSQL normalized text (pg_stat_statements)
SELECT * FROM orders WHERE status = $1 AND id IN ($2, $3, $4)
-- (a list of different length can yield a different queryid)

-- pt-query-digest fingerprint (IN list collapsed to one element)
select * from orders where status = ? and id in(?+)

Note the divergence on IN lists: pt-query-digest collapses any-length lists to one placeholder, so both raw queries above share a fingerprint. PostgreSQL's queryid historically treated lists of different lengths as different identifiers, though PostgreSQL 17 added the ability to merge IN-list and array constants when they are all simple values, reducing that fan-out.

Where Fingerprinting Fits in Query Optimization

Fingerprinting is the first step of any serious query-performance workflow: you cannot optimize what you cannot group. Once executions are grouped by fingerprint, you rank patterns by total time (the product of mean latency and call count), pull the normalized text and a representative parameterized example, run `EXPLAIN ANALYZE` on it, and decide whether the fix is an index, a rewrite, or a schema change. A query that is individually fast but runs millions of times often outranks a slow one-off, and only fingerprint-level aggregation surfaces that.

Pulse, an AI-native database optimization and maintenance platform, builds on exactly this. It groups and fingerprints queries to surface the heaviest query patterns across ClickHouse, OpenSearch, and Elasticsearch (with PostgreSQL support on the roadmap), runs automated root-cause analysis on the top patterns, and recommends concrete fixes that a human approves before anything changes. Grouping by fingerprint is what lets it point at the handful of patterns responsible for most of the load rather than listing raw queries.

Frequently Asked Questions

Q: What is a SQL query fingerprint?
A SQL query fingerprint is a stable identifier produced by normalizing a query - replacing literal constants with placeholders and stripping formatting - then hashing the result. Queries that differ only in their values get the same fingerprint, which lets monitoring tools group and rank query patterns instead of treating every parameter variation as a separate query.

Q: How does PostgreSQL compute the queryid in pg_stat_statements?
PostgreSQL "jumbles" the post-parse-analysis tree of the query: it walks the parsed nodes, ignores the ones holding constant values, and hashes the rest into a 64-bit bigint. Since PostgreSQL 14 this is done in core under the compute_query_id GUC (default auto), and the value appears in pg_stat_statements, pg_stat_activity, and EXPLAIN.

Q: Is the PostgreSQL queryid stable across versions?
No. The PostgreSQL docs state it is not safe to assume queryid is stable across major versions, because the jumble depends on internal structures and object identifiers. It is stable between minor releases on the same architecture with matching catalog metadata, which is why WAL replicas share the same queryid as their primary.

Q: Why do two identical-looking queries sometimes get different fingerprints?
Because PostgreSQL's queryid is computed from the post-parse-analysis tree, which carries resolved object IDs and settings like search_path. If a referenced function or table is dropped and recreated, or the two queries resolve different objects, the trees differ and the fingerprints diverge even though the text looks identical.

Q: What is the difference between pg_stat_statements queryid and a libpg_query fingerprint?
pg_stat_statements queryid hashes the post-parse-analysis tree, so it includes catalog OIDs and is only comparable on servers running the same version with the same catalog. libpg_query fingerprints the raw parse tree before catalog resolution, producing a portable identifier you can compute from query text on any machine.

Q: How does MySQL fingerprint queries?
MySQL's Performance Schema normalizes each statement into a digest text (values replaced, whitespace collapsed) and computes a SHA-256 DIGEST hash over it, aggregating rows in events_statements_summary_by_digest. Percona's pt-query-digest does text-based fingerprinting on logs instead, collapsing IN lists and digit-suffixed table names so related queries group together.

Q: Does query normalization replace values in IN lists?
It depends on the tool. pt-query-digest collapses any-length IN (...) list to a single placeholder, so IN (1,2,3) and IN (7,8) fingerprint identically. PostgreSQL historically gave different-length lists different queryid values; PostgreSQL 17 added merging of simple IN-list and array constants to reduce that proliferation.

PostgreSQL Slow Queries: Finding and Fixing Them: use fingerprinted stats to locate the queries worth optimizing.
PostgreSQL EXPLAIN ANALYZE: read the plan for a representative query once you have grouped it.
PostgreSQL Performance Tuning: turn fingerprint rankings into indexes, rewrites, and config changes.
MySQL Slow Query Log: the input that pt-query-digest fingerprints.
MySQL EXPLAIN Guide: inspect execution plans for top MySQL digests.
Database Monitoring Best Practices: where query fingerprinting fits in a broader observability setup.
PostgreSQL Connection Pooling with PgBouncer: a related operational lever for high-throughput query workloads.
PostgreSQL Tutorial: foundational PostgreSQL concepts.

SQL Query Optimization Techniques - SQL Fingerprinting

How Query Fingerprinting Works

PostgreSQL queryid and pg_stat_statements

Portable Fingerprints and the MySQL Digest

Where Fingerprinting Fits in Query Optimization

Frequently Asked Questions

Related Reading