How PostgreSQL Indexes Work: B-Tree Internals and Index-Only Scans

A PostgreSQL index is a separate on-disk structure that maps column values to the physical locations of matching rows, so the engine can find them without reading the whole table. The default and most common type is the B-tree, a balanced tree that keeps entries sorted and supports equality, range, sorting, and IS NULL lookups. Indexes trade write overhead and disk space for read speed: every insert, update, and delete has to maintain them, but the right index turns an O(n) sequential scan into a lookup that stays nearly constant as the table grows.

How PostgreSQL Stores Table Data

Every table and index in PostgreSQL is stored as an array of fixed-size pages, normally 8 KB (the BLCKSZ compile-time default). The engine reads and writes whole pages, so it never has to load an entire table into memory to answer a query. Internally, pages are also called blocks, and the two terms are interchangeable.

Each page has a 24-byte header, an array of line pointers (item identifiers), free space in the middle, and the row data itself, which grows backward from the end of the page. A row is stored as a heap tuple: a header followed by the column values. Two header fields drive MVCC visibility: t_xmin holds the transaction ID that inserted the tuple, and t_xmax holds the transaction ID that deleted or updated it (0 if it is still live). A snapshot taken at the start of a statement or transaction compares these against committed/in-progress transaction state to decide which version of each row the query should see.

Every tuple has a TID (tuple identifier): a pair of (page number, line pointer) that pins down its exact physical location. Index entries store the TID, not the row itself, which is why an index lookup ends with a jump back into the heap to fetch the actual columns - unless the index can answer the query on its own.

Sequential Scan vs Index Scan

Without a usable index, SELECT * FROM orders WHERE id = 123 forces a sequential scan: PostgreSQL walks every page, checks every tuple's visibility, and applies the filter, even though at most one row matches. The planner costs this with a published model. Disk cost is seq_page_cost (default 1.0) times the number of pages; CPU cost is cpu_tuple_cost (default 0.01) plus cpu_operator_cost (default 0.0025) per tuple evaluated. The total scales linearly with table size, so a query that was fast on 10,000 rows degrades steadily as the table reaches millions.

An index scan changes the shape of that cost. Instead of touching every page, the planner descends the B-tree to the matching entries and follows their TIDs into the heap. Reads into the heap are random rather than sequential, so they are priced at random_page_cost (default 4.0) per page - four times a sequential page on the assumption of spinning disks. On SSDs and cloud volumes that ratio is too pessimistic; lowering random_page_cost toward 1.1-2.0 often makes the planner pick index scans it was avoiding. If you expect an index to be used and it is not, check the plan before adding more indexes.

B-Tree Structure and Search

A PostgreSQL B-tree is a self-balancing, multi-level tree of 8 KB pages. Leaf pages hold the indexed values and their TIDs; internal pages hold separator keys that route searches downward. Pages at each level are doubly linked to their siblings, which lets the engine walk a range in sorted order after the initial descent. Page 0 is a metapage that points at the current root.

A point worth correcting: a B-tree node here is not limited to two or three children. Because each page is 8 KB and a typical key is small, a single internal page holds hundreds of entries, giving the tree a very high fan-out. High fan-out means shallow trees - even a table with hundreds of millions of rows usually has a B-tree only three or four levels deep, so a lookup touches a handful of pages rather than scanning millions of tuples. That shallowness is the whole point: search cost grows logarithmically, not linearly.

Search starts at the root and binary-searches each page's keys to pick the child to descend into, repeating until it reaches a leaf. Equality and range queries both work this way; a range scan lands on the first matching leaf entry and then follows sibling links. Multi-column indexes sort by the leftmost column first, so an index on (a, b, c) serves WHERE a = ? AND b = ? and ORDER BY a, b, c, but it cannot satisfy an ORDER BY a, c or a query that filters only on b without scanning.

Covering Indexes and the Visibility Map

A plain index scan still has to visit the heap for every matching row, both to read non-indexed columns and to confirm each tuple's visibility (the index entry carries no t_xmin/t_xmax). A covering index avoids the column fetch by storing extra payload columns with INCLUDE:

-- email is searchable; full_name rides along as a non-key payload column
CREATE INDEX idx_users_email ON users (email) INCLUDE (full_name);

A query that selects only email and full_name filtered on email can then be answered as an index-only scan, reading just the index. But the visibility problem remains, and PostgreSQL solves it with the visibility map: a per-table bitmap with bits marking heap pages where every tuple is visible to all transactions. During an index-only scan, if the target tuple's heap page is flagged all-visible, PostgreSQL skips the heap entirely; if not, it falls back to a heap fetch to check t_xmin/t_xmax. The map is maintained by VACUUM, so a table that has not been vacuumed after heavy writes will show heap fetches in EXPLAIN (ANALYZE) even with a perfect covering index. Keeping autovacuum healthy is what makes index-only scans actually skip the heap.

This is exactly the kind of regression that hides in plain sight: the index exists, the plan still says "Index Only Scan," but heap-fetch counts creep up and latency drifts. Pulse monitors PostgreSQL query plans and table-level vacuum and visibility statistics continuously, so when index-only scans start falling back to the heap or the planner abandons an index after a stats change, its agentic SRE layer surfaces the root cause instead of leaving you to reverse-engineer it from slow-query logs.

Access Methods: Choosing the Right Index Type

PostgreSQL treats every index algorithm as an access method registered in the pg_am catalog, alongside the heap table access method. New methods can be added by extensions (for example bloom). Run SELECT amname FROM pg_am; to list what is installed. The six built-in index methods differ in the data and query shapes they serve:

Access method	Best for	Notes
`btree`	Equality, range, sorting, `IS NULL`	Default; only method supporting unique indexes and `ASC`/`DESC` ordering
`hash`	Single-column equality only	Smaller than B-tree for equality; no range or ordering support
`gist`	Geometric, full-text, nearest-neighbor, range types	Lossy, extensible; supports exclusion constraints
`spgist`	Non-balanced partitioned data (quadtrees, tries, IP ranges)	Good for data that clusters unevenly
`gin`	Multi-value columns: `jsonb`, arrays, full-text `tsvector`	Indexes each element; fast containment lookups
`brin`	Very large tables with natural physical ordering (time-series)	Stores per-block-range summaries; tiny on disk

B-tree, GiST, GIN, and BRIN support multi-column indexes; only B-tree enforces uniqueness. The reason the same WHERE clause can use different methods is that each method registers operator classes and operator families, which tie a data type's operators (<, =, >, containment, distance) to the access method's strategy numbers. That indirection is what lets you index a custom type: define an operator class so the B-tree knows how to order your values, and CREATE INDEX ... USING btree starts working on it.

Frequently Asked Questions

Q: How deep is a PostgreSQL B-tree index in practice?
A: Usually three to four levels, even for hundreds of millions of rows. Each 8 KB index page holds hundreds of entries, so the fan-out is high and the tree stays shallow. A lookup reads one page per level plus the heap fetch, which is why B-tree search cost is logarithmic rather than linear.

Q: What is the difference between an index scan and an index-only scan?
A: An index scan finds matching entries in the index, then reads each matching row from the heap to get the remaining columns and confirm visibility. An index-only scan answers the query entirely from the index, skipping the heap whenever the visibility map marks the relevant pages all-visible. Index-only scans require a covering index and a well-vacuumed table.

Q: Why does an index-only scan still show heap fetches in EXPLAIN ANALYZE?
A: Heap fetches appear when the visibility map does not mark the target heap pages all-visible, so PostgreSQL must read the heap tuple to check t_xmin/t_xmax. This happens after heavy inserts or updates when VACUUM has not yet refreshed the map. Running VACUUM (or letting autovacuum catch up) usually drives the heap-fetch count back down.

Q: When should I use a hash index instead of a B-tree?
A: Use a hash index only for single-column equality lookups where you never need range queries or ordering. Hash indexes can be smaller than B-trees for pure equality, but B-trees handle equality well and also serve ranges and sorts, so most workloads default to B-tree. Hash indexes became crash-safe and WAL-logged in PostgreSQL 10.

Q: Why does a multi-column index not get used for my query?
A: A multi-column B-tree sorts by its leftmost column first, so it can only be used when the query constrains a left-to-right prefix of the index columns. An index on (a, b, c) serves filters on a, on a, b, or on a, b, c, but not a query that filters only on b or c. Reordering the index columns or creating a separate index fixes it.

Q: How does PostgreSQL decide between a sequential scan and an index scan?
A: The planner estimates the cost of each plan from table statistics and cost constants - seq_page_cost, random_page_cost, cpu_tuple_cost, and others - then picks the cheapest. For queries returning a large fraction of the table, a sequential scan is genuinely cheaper. On SSD-backed storage the default random_page_cost of 4.0 often overstates index-scan cost; lowering it makes index scans more attractive.

PostgreSQL CREATE INDEX: Syntax, Types, and Production Patterns: the syntax and options behind every index type discussed here.
Debugging Low Cache Hit Ratio in PostgreSQL: how the buffer cache and page reads interact with index and heap access.
Reading PostgreSQL EXPLAIN and EXPLAIN ANALYZE: confirm whether your query uses an index scan, index-only scan, or sequential scan.
Debugging Slow Queries in PostgreSQL: from a slow query to the missing or unused index behind it.
PostgreSQL Performance Tuning: cost-constant settings like random_page_cost that shape planner index choices.