Pulse 2025 Product Roundup: From Monitoring to AI-Native Control Plane

Read more

PostgreSQL CREATE INDEX: Syntax, Types, and Production Patterns

Indexes in PostgreSQL are the primary mechanism for avoiding sequential scans on large tables. Getting them right - choosing the correct type, building them safely in production, and knowing when to remove them - makes a measurable difference to query latency and write throughput alike.

The basic syntax is straightforward:

CREATE INDEX index_name ON table_name (column_name);

But that default hides a large number of options that matter in practice. The full form is:

CREATE [UNIQUE] INDEX [CONCURRENTLY] [IF NOT EXISTS] index_name
  ON table_name USING method (column_or_expression [operator_class] [ASC|DESC] [NULLS {FIRST|LAST}], ...)
  [INCLUDE (column, ...)]
  [WHERE predicate];

The USING clause selects the index access method. When omitted, PostgreSQL defaults to btree. The INCLUDE clause attaches non-key columns for index-only scans. The WHERE clause restricts the index to a subset of rows. Each of these options has real trade-offs worth understanding before applying them.

Choosing the Right Index Type

PostgreSQL ships with six built-in index access methods: btree, hash, gist, spgist, gin, and brin. The choice between them depends on the data type, query operators, and write patterns.

B-tree is the default and handles equality, range queries, sorting, and IS NULL predicates. It works for any data type with a defined sort order and is the right choice for the vast majority of columns. When in doubt, use B-tree.

Hash indexes store a 32-bit hash of the indexed value. They support only equality comparisons (=) and offer no benefit over B-tree for ranges or sorting. Their main advantage is a smaller index size when the indexed column is wide - longer than roughly a third of a page (8KB default) - and queries are pure equality lookups. Hash indexes became crash-safe in PostgreSQL 10 and are WAL-logged since then. Before PostgreSQL 10, they required manual REINDEX after a crash, so treat any hash index on an older cluster with suspicion.

GIN (Generalized Inverted Index) is designed for composite values where a single column contains multiple searchable elements: arrays, jsonb, tsvector for full-text search, and hstore. A GIN index on a tsvector column or a jsonb column with the default jsonb_ops operator class lets you search inside the value efficiently. GIN builds are slow and the index is large, but lookups are fast. Use GIN on relatively static data or accept longer index maintenance during heavy writes.

For JSONB columns where queries use only containment operators (@>), prefer the jsonb_path_ops operator class. It creates indexes that are often 60-80% smaller than the default jsonb_ops and is faster for containment lookups, at the cost of not supporting key-existence operators (?, ?|, ?&):

CREATE INDEX ON docs USING GIN (payload jsonb_path_ops);

GiST (Generalized Search Tree) covers geometric types, range types, and full-text search as an alternative to GIN. GiST indexes support a wider set of operators (containment, overlap, nearest-neighbor) and are faster to update than GIN. The trade-off: GiST lookups are roughly three times slower than GIN for full-text workloads, and GIN indexes are two-to-three times larger than GiST for tsvector use cases. For a full-text search column that changes frequently, GiST is often the better fit because it is smaller and cheaper to update. For a mostly-static corpus you query heavily, GIN wins on lookup speed despite the larger index size.

BRIN (Block Range Index) stores the minimum and maximum value for each range of consecutive physical pages. At a few kilobytes, a BRIN index can be orders of magnitude smaller than a B-tree on the same column. The trade-off is that BRIN is only effective when column values correlate tightly with the physical storage order - timestamps in append-only event tables, auto-incrementing IDs, or date-partitioned data. On a randomized column, BRIN degrades to a full scan. It is not a replacement for B-tree on arbitrary columns.

Building Indexes Without Downtime

A standard CREATE INDEX acquires a SHARE lock on the table for the duration of the build. This blocks all writes - inserts, updates, deletes - until the index is complete. On a large table in production, that can mean seconds to minutes of write unavailability.

CREATE INDEX CONCURRENTLY avoids that by using a SHARE UPDATE EXCLUSIVE lock, which does not block writes. The build runs in multiple passes: two sequential scans of the table with a wait between them to let concurrent transactions settle. The total build time is longer - typically two to three times a normal build - but the table stays writable throughout.

CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);

There are hard constraints to know. CONCURRENTLY cannot run inside a transaction block - it needs to commit internal transactions between phases. It cannot be used on partitioned tables at the parent level (you must create indexes on each partition individually). If the build fails partway - due to a deadlock or a uniqueness violation on a unique index - PostgreSQL leaves behind an invalid index. An invalid index is not used by the planner, but it still gets maintained during writes. You need to drop it explicitly and retry:

-- find invalid indexes
SELECT indexrelid::regclass, indisvalid
FROM pg_index
WHERE NOT indisvalid;

Drop the invalid index and rebuild. Leaving invalid indexes around is a subtle correctness issue that is easy to miss until you check pg_index directly.

Partial and Expression Indexes

Partial indexes index only rows that satisfy a WHERE predicate. They are smaller, faster to scan, and cheaper to maintain than full-table indexes on the same column when most queries target a specific subset.

A common case: a users table where active users represent 5% of rows but account for 95% of queries:

CREATE INDEX idx_users_email_active ON users (email) WHERE status = 'active';

The resulting index is far smaller than a full index on email. Queries that include WHERE status = 'active' can use it; queries without that predicate cannot. The predicate in the query must logically imply the index predicate for the planner to choose it - a loose match is not enough.

Soft-delete patterns pair well with partial indexes. Rather than indexing all rows including deleted ones:

CREATE INDEX idx_documents_slug ON documents (slug) WHERE deleted_at IS NULL;

This also allows enforcing uniqueness among non-deleted rows without a constraint that blocks duplicate slugs on deleted records - a common headache with naive unique constraints on soft-delete schemas.

Expression indexes (also called functional indexes) index the result of an arbitrary expression rather than a raw column value. They are useful when queries filter on a function of a column:

CREATE INDEX idx_users_lower_email ON users (lower(email));

Without this, a query like WHERE lower(email) = 'foo@example.com' requires a sequential scan even if email itself is indexed. With the expression index, the planner can use it directly. The expression in the query predicate must match the expression in the index definition exactly.

Covering Indexes with INCLUDE

Introduced in PostgreSQL 11 for B-tree indexes, and extended to GiST and SP-GiST in PostgreSQL 12, the INCLUDE clause lets you add non-key columns to an index. These columns are stored in the leaf pages but are not part of the sort key, so they cannot be used as search predicates or for ordering. Their function is to let the database satisfy a query entirely from the index without fetching heap pages - an index-only scan.

CREATE INDEX idx_orders_status ON orders (status) INCLUDE (created_at, total_amount);

A query like SELECT created_at, total_amount FROM orders WHERE status = 'pending' can now be served from the index alone. The benefit is most pronounced on wide tables with high row counts where heap fetches are expensive. One caveat: PostgreSQL must still check the heap if the visibility map bit for the relevant pages has not been set by autovacuum. An index-only scan on a table with stale visibility map data degrades to regular index scan performance.

Columns in INCLUDE do not participate in uniqueness checks. You can include a non-unique column in a UNIQUE index without widening the uniqueness constraint. That is intentional - the key columns alone determine uniqueness.

Index Maintenance and Monitoring

Indexes accumulate dead pages after updates and deletes. Autovacuum reclaims heap space but may not fully compact index pages below the high-water mark. Over time, an index can occupy significantly more space than the live data warrants. The pgstattuple extension exposes the avg_leaf_density metric - a value below 70% is a signal the index has substantial bloat. REINDEX CONCURRENTLY (available since PostgreSQL 12) rebuilds the index without blocking writes:

REINDEX INDEX CONCURRENTLY idx_orders_user_id;

The second maintenance concern is unused indexes. Every index adds overhead to INSERT, UPDATE, and DELETE because PostgreSQL must keep all indexes consistent with the heap. A table that has accumulated fifteen indexes but queries that use three of them pays a write tax on every modification. pg_stat_user_indexes tracks scan counts since the last statistics reset:

SELECT
  schemaname,
  tablename,
  indexname,
  idx_scan,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

Zero scans over months of normal traffic means the index is a candidate for removal. Statistics reset on cluster restart, so cross-reference with your monitoring system before dropping. Removing a 500MB index that was written on every order insert is worth doing - the write savings are immediate and compounding.

One more operational note: composite indexes deserve attention on column order. An index on (status, created_at) can satisfy queries filtering on status alone or on both status and created_at. It cannot efficiently serve queries filtering on created_at alone. Put the column used in equality predicates first, range predicates last. This ordering rule interacts with partial indexes: sometimes a partial index on a single column beats a composite index on two columns because the partial index eliminates most rows before the scan starts.

Pulse - Elasticsearch Operations Done Right

Pulse can solve your Elasticsearch issues

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.