How to Use PostgreSQL EXPLAIN ANALYZE for Query Optimization

EXPLAIN ANALYZE is PostgreSQL's primary tool for understanding why a query is slow. It executes the query and shows the actual runtime statistics alongside the planner's estimates, revealing where time was spent, how many rows were processed, and how much I/O was involved. Reading the output accurately is a prerequisite for diagnosing and fixing query performance.

Syntax and Options

-- Show the query plan without executing (estimated only)
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- Execute and show actual runtime statistics
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

-- Most useful form for performance diagnosis
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42;

-- Full diagnostic: timing, buffers, verbose column output, WAL stats, planner settings
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS, WAL) SELECT * FROM orders WHERE customer_id = 42;

-- Machine-readable JSON output for visualizer tools
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) SELECT * FROM orders WHERE customer_id = 42;

EXPLAIN ANALYZE actually executes the statement. For DML, wrap it in a transaction you roll back:

BEGIN;
EXPLAIN ANALYZE UPDATE orders SET status = 'shipped' WHERE id = 1;
ROLLBACK;

Options Reference

Option Default Description
ANALYZE off Execute the query; show actual rows, actual time, loops
BUFFERS off Show block-level I/O: cache hits, disk reads, dirtied, written, temp file usage
VERBOSE off Show column lists, schema-qualified names, function detail in trigger nodes
SETTINGS off Show planner GUCs that differ from their built-in defaults (e.g., work_mem, enable_seqscan)
WAL off Show WAL record generation: record count, full page images, bytes (requires ANALYZE)
TIMING on Per-node actual startup and execution time; disable with TIMING OFF to reduce overhead when only row counts are needed
SUMMARY on with ANALYZE Append planning time and execution time totals at the bottom (defaults to off for plain EXPLAIN)
MEMORY off Show memory consumed during planning (PostgreSQL 17+)
GENERIC_PLAN off Show a generic plan with parameter placeholders; cannot combine with ANALYZE (PostgreSQL 16+)
FORMAT TEXT Output format: TEXT, JSON, XML, YAML

JSON is the richest format and the one visualizer tools consume. Use FORMAT JSON when capturing plans for tooling.
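In psql, one convenient way to capture a JSON plan for these tools is to redirect query output to a file (a sketch; plan.json is an arbitrary filename):

\o plan.json
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) SELECT * FROM orders WHERE customer_id = 42;
\o

The file can then be pasted directly into a visualizer.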

Reading the Output

The output is a tree of plan nodes, indented with -> arrows. Child nodes execute before their parents — the innermost (deepest-indented) nodes run first, and their output flows up to parent nodes.

 Nested Loop  (cost=4.65..118.50 rows=10 width=488) (actual time=0.017..0.051 rows=10 loops=1)
   ->  Bitmap Heap Scan on orders t1  (cost=4.36..39.38 rows=10 width=244) (actual time=0.009..0.017 rows=10 loops=1)
         Recheck Cond: (customer_id < 10)
         ->  Bitmap Index Scan on orders_customer_id_idx  (cost=0.00..4.36 rows=10 width=0) (actual time=0.004..0.004 rows=10 loops=1)
               Index Cond: (customer_id < 10)
   ->  Index Scan using products_pkey on products  (cost=0.29..7.90 rows=1 width=244) (actual time=0.003..0.003 rows=1 loops=10)
         Index Cond: (id = t1.product_id)

Cost Notation

(cost=X..Y rows=Z width=W)

  • X (startup cost): Estimated cost before the node can produce its first row. A Sort node has a high startup cost because it must read all input before returning anything. A Seq Scan has 0.00 because rows start flowing immediately.
  • Y (total cost): Estimated cost to produce all output rows. Costs are in abstract internal units — one unit roughly equals one sequential disk page read (seq_page_cost = 1.0). They are not milliseconds.
  • Z (rows): Estimated number of output rows.
  • W (width): Estimated average byte width per output row.

Actual Statistics

(actual time=A..B rows=C loops=D) — shown when ANALYZE is used.

  • A: Actual startup time in milliseconds (time to first row).
  • B: Actual total time in milliseconds to produce all rows from this node.
  • C: Average rows returned per loop.
  • D: Number of times this node was executed.

When loops > 1, actual time and rows are per-loop averages. To get totals:

Total actual time = actual_time_per_loop × loops
Total rows from node = actual_rows × loops

Example: actual time=0.003..0.003 rows=1 loops=10 means the node ran 10 times, total time is 0.03 ms, total rows is 10. This is critical when diagnosing Nested Loop joins — a fast per-execution time becomes catastrophically slow at 100,000 loops.

Scan Node Types

Seq Scan

Reads the entire table sequentially, page by page.

Seq Scan on orders  (cost=0.00..9821.00 rows=49215 width=44) (actual time=0.012..892.430 rows=492150 loops=1)
  Filter: (status = 'pending')
  Rows Removed by Filter: 2008000

Rows Removed by Filter is the key signal: this query read 2.5 million rows to return 492,150. A high ratio here (many rows removed for each row returned) is a candidate for indexing.

Sequential scans are appropriate for large fractions of the table or small tables. They indicate a problem when selecting a small percentage of a large table.

Index Scan

Traverses the B-tree index to find matching row pointers, then fetches each row from the heap via random I/O.

Index Scan using orders_customer_id_idx on orders  (cost=0.43..8.45 rows=10 width=44)
  Index Cond: (customer_id = 42)

Index Cond means the condition is evaluated within the index itself — only matching rows are fetched from the heap. Best for selective queries returning a small fraction of the table (typically under 5–10%).

Index Only Scan

All needed columns are in the index; no heap fetch is required (unless the visibility map indicates pages need visibility checks).

Index Only Scan using orders_customer_status_idx on orders
  Index Cond: (customer_id = 42)
  Heap Fetches: 0

Heap Fetches: 0 means the visibility map was clean — no heap access at all. This is the most I/O-efficient scan type. Achievable by adding all query columns to the index using INCLUDE:

CREATE INDEX orders_customer_status_idx ON orders (customer_id) INCLUDE (status, amount);

Bitmap Index Scan + Bitmap Heap Scan

A two-phase operation for medium-selectivity queries. Phase 1 (Bitmap Index Scan) builds an in-memory bitmap of matching page/row locations. Phase 2 (Bitmap Heap Scan) reads matching pages in physical order to minimize random I/O.

Bitmap Heap Scan on orders  (cost=214.23..8932.10 rows=9821 width=44)
  Recheck Cond: (status = 'pending')
  Heap Blocks: exact=4821
  ->  Bitmap Index Scan on orders_status_idx  (cost=0.00..211.78 rows=9821 width=0)
        Index Cond: (status = 'pending')

Recheck Cond is always displayed but is only actually re-evaluated when the bitmap becomes "lossy" — when the number of matching pages exceeds work_mem capacity, the bitmap degrades from per-row to per-page granularity. If you see Heap Blocks: lossy=N, increasing work_mem may help. Multiple bitmap scans can be combined via BitmapAnd / BitmapOr to use multiple indexes for a single query.
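If lossy blocks appear, a session-level work_mem increase is a safe way to test whether the bitmap stays exact (a sketch; the 64MB value is an arbitrary starting point):

SET work_mem = '64MB';
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE status = 'pending';
-- Heap Blocks should now report only exact=N, with no lossy=N component
RESET work_mem;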

Join Node Types

Nested Loop

For each row in the outer (left) input, scans the inner (right) input for matches. Fast startup, efficient when the outer side produces few rows and the inner side has an index. Extremely slow when the outer side is large.

Nested Loop  (cost=0.43..24.89 rows=10 width=88) (actual time=0.031..0.089 rows=10 loops=1)
  ->  Seq Scan on customers  (actual time=0.012..0.013 rows=1 loops=1)
        Filter: (id = 42)
  ->  Index Scan using orders_customer_id_idx on orders  (actual time=0.018..0.073 rows=10 loops=1)
        Index Cond: (customer_id = 42)

Watch for the inner side with high loops. If the inner node shows actual time=0.09..0.09 rows=1 loops=50000, total inner time is 4,500 ms.

Hash Join

Builds an in-memory hash table from the smaller input, then probes it for each row of the larger input. High startup cost (building the hash table). Excellent for large unsorted inputs with equality join conditions.

Hash Join  (cost=1.04..312.39 rows=9821 width=88) (actual time=0.102..4.832 rows=9821 loops=1)
  Hash Cond: (orders.customer_id = customers.id)
  ->  Seq Scan on orders  ...
  ->  Hash  (actual time=0.045..0.046 rows=200 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 26kB

Batches: 1 — hash table fit in memory, optimal. Batches: N where N > 1 — the hash table spilled to disk in N batches, a strong indicator that work_mem is too low for this query. Increasing work_mem is usually the fix.

Merge Join

Both inputs must be sorted on the join key; PostgreSQL merges them in a single pass. Low execution overhead once sorted, but high startup if sorting is required. Good for large inputs that are already sorted, for example when both sides have indexes on the join key.

Merge Join  (cost=234.15..567.34 rows=9821 width=88)
  Merge Cond: (orders.customer_id = customers.id)
  ->  Sort  (cost=117.08..119.54 rows=983 width=44)
        Sort Key: orders.customer_id
  ->  Index Scan using customers_pkey on customers

Other Important Nodes

Sort

Sort  (cost=713.05..713.30 rows=100 width=44) (actual time=2.995..3.002 rows=100 loops=1)
  Sort Key: created_at DESC
  Sort Method: quicksort  Memory: 74kB

The sort method tells you whether the sort happened in memory or on disk:

  • quicksort  Memory: NkB — in-memory, fast.
  • external merge  Disk: NkB — spilled to disk because work_mem was insufficient. This is a performance problem. Increase work_mem and re-test.
  • top-N heapsort — used with a LIMIT clause; only tracks the N smallest/largest values without sorting everything.
  • Incremental Sort (PostgreSQL 13+) — appears as its own plan node when the input is already sorted on a prefix of the sort key; only groups sharing that prefix are re-sorted.

Hash

The inner side of a Hash Join. Shows memory usage and batch count:

Hash  (actual time=245.3..245.3 rows=500000 loops=1)
  Buckets: 131072  Batches: 8  Memory Usage: 4096kB

Batches: 8 means work_mem was too small to hold the whole hash table, so the build was split into 8 batches that were written to and re-read from disk. The fix is the same as for Hash Join spills: increase work_mem.

Aggregate

Groups and aggregates rows. In parallel plans, appears as Partial Aggregate (in workers) and Finalize Aggregate (in the leader).

Gather / Gather Merge

Gather collects results from parallel worker processes in arbitrary order. Gather Merge does the same but preserves sort order, enabling ORDER BY without a final sort step.

Gather  (actual time=892.123..892.130 rows=1 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Partial Aggregate  (actual time=888.232..888.233 rows=1 loops=3)

If Workers Launched < Workers Planned, workers were unavailable — check max_parallel_workers and max_worker_processes. Note loops=3 on the Partial Aggregate above: the leader and both workers each executed the node once.
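The number of workers the planner asks for is capped by max_parallel_workers_per_gather; a quick session-level check might look like this (a sketch; the value 4 is illustrative):

SHOW max_parallel_workers_per_gather;
-- Allow up to 4 workers per Gather node for this session
SET max_parallel_workers_per_gather = 4;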

Materialize

Stores the output of a subplan so it can be re-scanned without re-executing. Used by Nested Loop when the inner side would otherwise be re-evaluated on every outer row.

CTE Scan

Scans a previously-materialized CTE result. Since PostgreSQL 12, CTEs are not always materialized — the optimizer may inline them. Use WITH name AS MATERIALIZED (...) to force materialization.
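A minimal sketch of forcing materialization (table and column names illustrative):

WITH recent AS MATERIALIZED (
    SELECT * FROM orders WHERE created_at > now() - interval '1 day'
)
SELECT * FROM recent WHERE status = 'pending';

With MATERIALIZED, the CTE is evaluated once and its result is scanned by the outer query; without it, the optimizer is free to inline the CTE body.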

Row Estimation Errors

Row estimation errors — a large discrepancy between rows=Z (estimated) and rows=C (actual) — are a primary cause of poor plan selection. Under-estimates are usually more dangerous: the planner chooses Nested Loop (designed for small inputs) when a Hash Join is needed.

A 10x or greater mismatch between estimated and actual rows warrants investigation.

Causes

Stale statistics. Autovacuum runs ANALYZE automatically once enough rows have changed, but it can lag on heavily-written tables. Manual fix:

ANALYZE orders;

Low statistics target. By default PostgreSQL uses a statistics target of 100 per column (controlling histogram buckets and most-common-value entries). For skewed distributions, increase it:

ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;

Multi-column correlation. PostgreSQL assumes column conditions are statistically independent. A query on WHERE region = 'EU' AND city = 'Berlin' will under-estimate if city implies region. Fix with extended statistics:

CREATE STATISTICS orders_region_city ON region, city FROM orders;
ANALYZE orders;

Function expressions. The planner cannot use column statistics for WHERE lower(email) = 'foo@example.com'. An expression index helps twice: it makes the predicate indexable, and ANALYZE then gathers statistics on the expression itself.
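A minimal sketch, assuming orders has an email column (hypothetical here):

CREATE INDEX orders_lower_email_idx ON orders (lower(email));
-- ANALYZE now also collects statistics on the lower(email) expression
ANALYZE orders;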

Buffer Statistics

With BUFFERS, each node shows block-level I/O:

Buffers: shared hit=36 read=6 dirtied=2 written=1
  • shared hit=N — blocks found in PostgreSQL's shared buffer cache. Fast, in-memory.
  • read=N — blocks read from disk (or OS page cache). High values indicate I/O pressure.
  • dirtied=N — blocks dirtied by this query (rare for plain SELECT).
  • written=N — dirty blocks evicted from the cache to make room. Indicates buffer pressure.
  • temp read/written — temporary files used by Sort, Hash, or other nodes that spilled to disk. Directly tied to work_mem.

A high read with low hit on the first execution is normal (cold cache). After the first run, a warm cache should show mostly hit. Persistent high read values indicate that the working set does not fit in shared_buffers.

work_mem and Its Impact

work_mem sets the maximum memory for each sort or hash operation within a query. A single complex query may have multiple concurrent sort/hash operations, each with its own work_mem budget. The default is 4 MB — appropriate for many OLTP queries but too small for analytical workloads.

Test changes safely at the session level:

SET work_mem = '256MB';
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders ORDER BY created_at;
RESET work_mem;

Setting work_mem globally requires care: with 100 connections and multiple operations per query, max_connections × work_mem × operations can exhaust RAM. A common pattern is to keep the global default conservative and increase it per-session for analytical queries.
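One way to implement that pattern is a role-level override, so analytical sessions get more memory automatically (a sketch; reporting_user is a hypothetical role):

ALTER ROLE reporting_user SET work_mem = '256MB';

The setting applies at the start of every session for that role, without touching the global default.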

Practical Example: Before and After Indexing

Before

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;

 Seq Scan on orders  (cost=0.00..98210.00 rows=1 width=88) (actual time=14.832..892.341 rows=234 loops=1)
   Filter: (customer_id = 42)
   Rows Removed by Filter: 4999766
   Buffers: shared hit=2841 read=42469
 Planning Time: 0.134 ms
 Execution Time: 892.398 ms

5 million rows scanned, 42,469 blocks read from disk, 892 ms to find 234 rows. The row estimate (1) vs. actual (234) is also a large mismatch.

Add the Index

CREATE INDEX CONCURRENTLY orders_customer_id_idx ON orders (customer_id);
ANALYZE orders;

After

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;

 Index Scan using orders_customer_id_idx on orders  (cost=0.56..24.89 rows=234 width=88) (actual time=0.041..0.312 rows=234 loops=1)
   Index Cond: (customer_id = 42)
   Buffers: shared hit=7 read=4
 Planning Time: 0.189 ms
 Execution Time: 0.341 ms

11 buffer accesses instead of 45,310. Execution time dropped from 892 ms to 0.3 ms.

JIT Compilation

PostgreSQL 11+ compiles query execution code to native machine code using LLVM when estimated cost exceeds jit_above_cost (default: 100,000). When JIT fires, a JIT: section appears at the bottom of the plan:

JIT:
  Functions: 3
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 1.259 ms, Inlining 4.201 ms, Optimization 12.384 ms,
          Emission 15.048 ms, Total 32.892 ms

If the JIT Total time is a large fraction of execution time, JIT is adding overhead without benefit — common for short OLTP queries that happen to have a high estimated cost. Raise jit_above_cost or disable JIT for that workload.
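Both knobs can be tested at the session level before changing anything globally (a sketch; the threshold value is illustrative):

-- Raise the threshold so cheap queries skip JIT
SET jit_above_cost = 500000;
-- Or turn JIT off entirely for this session
SET jit = off;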

auto_explain: Automatic Plan Capture in Production

The auto_explain extension automatically logs EXPLAIN plans for slow queries in production without requiring manual intervention.

# postgresql.conf (requires restart)
shared_preload_libraries = 'auto_explain'

auto_explain.log_min_duration = 1000    # Log plans for queries over 1 second
auto_explain.log_analyze = true         # Include actual timing (adds measurement overhead)
auto_explain.log_buffers = true         # Include buffer statistics
auto_explain.log_format = json          # JSON format for tooling

Warning: log_analyze = true instruments every query with per-node timing, even queries below the threshold. This can add meaningful overhead on high-throughput systems. Use log_timing = false to get row counts without timing overhead, or use sample_rate (a value between 0.0 and 1.0) to capture only a fraction of qualifying queries.
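The corresponding postgresql.conf lines might look like this (values illustrative):

auto_explain.log_timing = false   # row counts without per-node timing overhead
auto_explain.sample_rate = 0.1    # log plans for roughly 1 in 10 qualifying queries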

For testing in a single session without restarting:

LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;
SET auto_explain.log_analyze = true;
-- Run your queries; plans appear in the PostgreSQL log

Visualizer Tools

Pasting raw EXPLAIN output into a visualizer makes it much easier to spot expensive nodes.

  • explain.dalibo.com — The most widely used free web visualizer. Accepts text or JSON.
  • explain.tensor.ru — Very detailed; supports multiple view modes (flowchart, piechart, tilemap). Works with Greenplum, Citus, TimescaleDB.
  • pgMustard — Commercial (free trial). Scores performance issues 1–5 stars and explains them in plain language. Supports all PostgreSQL 17 EXPLAIN options.
  • pganalyze — Commercial SaaS. Integrates plan visualization with pg_stat_statements, index advisor, and alerting.

For the most complete data in these tools, capture plans as JSON: EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON).

Quick Diagnostic Checklist

When looking at an EXPLAIN ANALYZE output:

  1. Find the most expensive nodes — look for the highest actual times near the leaves of the tree.
  2. Check loops — multiply actual time and rows by loops to get true totals.
  3. Compare estimated vs. actual rows — a 10x+ mismatch points to stale statistics or missing extended statistics.
  4. Look for Seq Scan with high Rows Removed by Filter — indexing candidate.
  5. Look for Sort Method: external merge Disk — work_mem is too low for this sort.
  6. Look for Hash Batches: N where N > 1 — work_mem is too low for this hash join.
  7. Check Nested Loop inner side — verify that actual time × loops is not unexpectedly large.
  8. Review Buffers: read=N — high disk reads indicate the working set is not cached.