Indexes: How They Work, When They Help, When They Hurt


Indexes are one of the most effective tools for making a slow database fast - and one of the easiest to misuse. Developers add them when queries are slow, leave them when they stop being useful, and occasionally forget that they have costs. The result is tables with a dozen indexes, most of which are doing nothing except slowing down every write.

Understanding what an index actually is - not just “it makes queries faster” - is what lets you reason about when to add one, when to remove one, and what to expect from either decision.

What an index is

An index is a separate data structure that the database maintains alongside your table. It stores a copy of the indexed column values, sorted, with pointers back to the original rows.

The most common type is a B-tree index. A B-tree is a balanced tree where each node contains sorted key values and pointers to child nodes. The leaf nodes contain the actual indexed values and, for each value, a pointer to the row’s physical location (the tuple ID).

When you query for a specific value:

SELECT * FROM users WHERE email = 'alice@example.com';

Without an index on email, the database reads every row in the table and checks whether the email matches. For a table with 5 million rows, that’s 5 million comparisons.

With a B-tree index on email, the database traverses the tree - O(log n) - finds the leaf node containing alice@example.com, gets the tuple ID, and goes directly to that page. For 5 million rows, the tree is maybe 20 levels deep. The difference is not subtle.

What the query planner actually does

The planner doesn’t always use an index even when one exists. It makes a cost estimate.

For a small table, a sequential scan may be cheaper than an index scan because the overhead of traversing the index and following tuple ID pointers adds up when the table fits in a few pages anyway.

For a query that returns a large fraction of the rows, an index scan can actually be slower than a sequential scan. If you’re fetching 60% of a table, you’re going to touch most pages anyway - visiting each row via a random page access (index scan) is slower than reading pages sequentially.

The rule of thumb: indexes are most valuable for high-cardinality columns where queries are selective - filtering for a small percentage of rows.

Range queries and sort order

B-tree indexes are sorted, which makes them useful for more than just equality checks. They support range queries efficiently:

SELECT * FROM orders WHERE created_at BETWEEN '2025-01-01' AND '2025-03-31';

The planner traverses the tree to the first matching value, then scans forward along the leaf nodes - which are linked - until it exits the range. This is fast.

The sort order also helps with ORDER BY. If your query sorts by an indexed column and the planner can read the index in order, it can skip the sort step entirely.

Composite indexes and column order

An index on multiple columns is called a composite index:

CREATE INDEX idx_orders_user_status ON orders (user_id, status);

The order of columns in a composite index matters significantly. This index supports:

  • Queries filtering on user_id alone
  • Queries filtering on both user_id and status

It does not efficiently support queries filtering on status alone. The index is sorted first by user_id, then by status within each user_id. Scanning it for a specific status without knowing the user_id would require a full index scan.

The leading column rule: an index is usable for queries that filter on the leftmost column(s) of the index. Put the most selective column first, or the one that appears alone in the most queries.

Covering indexes

When a query only needs columns that are all in the index, the database can answer the query from the index alone without touching the table. This is called an index-only scan, and it’s fast because it avoids the random I/O of fetching the actual row data.

-- If we have an index on (user_id, created_at, status):
SELECT user_id, created_at, status FROM orders WHERE user_id = 42;

The database can read this entirely from the index. For read-heavy workloads, designing indexes to cover your most common queries can make a measurable difference.

The real costs of indexes

Here is the part that’s easy to forget.

Every index on a table must be updated on every INSERT, UPDATE, and DELETE. The more indexes, the more work every write does. On a write-heavy table with many indexes, the index maintenance overhead can dominate.

Indexes also take space. A B-tree index on a large table can be nearly as large as the table itself. Index bloat is real - deleted rows leave gaps in the index that need periodic cleanup.

There’s also the planning cost. More indexes means more options for the planner to evaluate, which adds a small but non-zero cost to each query planning step.

When you’re adding an index, ask: how many writes does this table see relative to reads? What queries am I actually serving? Is this index likely to be used by the planner, or will the selectivity be too low?

Finding unused indexes

Most databases give you visibility into index usage. In PostgreSQL:

SELECT
  schemaname,
  tablename,
  indexname,
  idx_scan,
  idx_tup_read,
  idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;

Indexes with an idx_scan count near zero are candidates for removal. Before dropping one, check whether it enforces a uniqueness constraint - those have value even if they’re never used by a query plan.

Partial indexes

An index doesn’t have to cover all rows. A partial index includes only rows matching a condition:

CREATE INDEX idx_users_active ON users (email) WHERE status = 'active';

If most queries on email filter for active users, this index is smaller, cheaper to maintain, and more selective than a full index on email.

Partial indexes are underused. They’re particularly valuable when a small fraction of rows is queried frequently - pending orders, unread messages, active sessions.

When to add an index

A few situations where adding an index is almost always the right move:

  • Columns used in WHERE clauses on large tables, especially if the query is selective
  • Foreign key columns - many databases don’t create these automatically, and joins on unindexed foreign keys are slow
  • Columns used in ORDER BY for queries that don’t need to fetch many rows
  • Columns used in JOIN conditions

And when to be more careful:

  • Tables with very high write rates - every index slows down inserts and updates
  • Low-cardinality columns like boolean flags - the planner often skips the index anyway because it’s not selective enough
  • Queries that already run fast - adding an index to a query that takes 2ms is not useful

An index is not free. It’s a tradeoff with a cost on the write side and a benefit on the read side. The cases where that tradeoff is clearly worth it are common - but so are the cases where it isn’t, and those indexes accumulate quietly until someone looks at the schema and wonders why there are fifteen of them.



Read more