Indexing Foreign Keys in Postgresql
Asked Answered
S

2

6

Like many Postgres n00bs we have a lot of tables with foreign key constraints that are not indexed. I some cases this should not be a big performance hit - but this would be subject for further analysis.

I have read the following article: https://www.cybertec-postgresql.com/en/index-your-foreign-key/

And used the following query to find all foreign keys without an index:

SELECT c.conrelid::regclass AS "table",
       /* list of key column names in order */
       string_agg(a.attname, ',' ORDER BY x.n) AS columns,
       pg_catalog.pg_size_pretty(
          pg_catalog.pg_relation_size(c.conrelid)
       ) AS size,
       c.conname AS constraint,
       c.confrelid::regclass AS referenced_table
FROM pg_catalog.pg_constraint c
   /* enumerated key column numbers per foreign key */
   CROSS JOIN LATERAL
      unnest(c.conkey) WITH ORDINALITY AS x(attnum, n)
   /* name for each key column */
   JOIN pg_catalog.pg_attribute a
      ON a.attnum = x.attnum
         AND a.attrelid = c.conrelid
WHERE NOT EXISTS
        /* is there a matching index for the constraint? */
        (SELECT 1 FROM pg_catalog.pg_index i
         WHERE i.indrelid = c.conrelid
           /* the first index columns must be the same as the
              key columns, but order doesn't matter */
           AND (i.indkey::smallint[])[0:cardinality(c.conkey)-1]
               @> c.conkey::int[])
  AND c.contype = 'f'
GROUP BY c.conrelid, c.conname, c.confrelid
ORDER BY pg_catalog.pg_relation_size(c.conrelid) DESC;

This shows me for tables with composite unique constraints only "one" of the columns in the unique index:

\d topics_items;
-----------------+---------+--------------+---------------+------------------------------
 topics_items_id | integer |              | not null      | generated always as identity
 topic_id        | integer |              | not null      |
 item_id         | integer |              | not null      |
Index:
    "topics_items_pkey" PRIMARY KEY, btree (topics_items_id)
    "topic_id_item_id_unique" UNIQUE CONSTRAINT, btree (topic_id, item_id)
Foreign Keys:
    "topics_items_item_id_fkey" FOREIGN KEY (item_id) REFERENCES items(item_id) ON DELETE CASCADE
    "topics_items_topic_id_fkey" FOREIGN KEY (topic_id) REFERENCES topics(topic_id) ON DELETE CASCADE

In this case the check query only finds the item_id and not the topic_id as an not indexed field.

Is it fair to say, that this is just an issue of the query used, and I have to separately index both fields (topic_id and item_id) - or is there some black sorcery involved and only the item_id needs an index?

Soyuz answered 22/3, 2020 at 19:51 Comment(0)
M
5

tl;dr You need to add an index on item_id. The "black magic" of Postgres indexing is covered in 11. Indexes.

You have a composite index on (topic_id, item_id) and column order is important. Postgres can use this to index queries on topic_id, queries on both topic_id and item_id, but not (or less efficiently) item_id alone.

From 11.3. Multicolumn Indexes...

A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns.

-- indexed
select *
from topics_items
where topic_id = ?

-- also indexed
select *
from topics_items
where topic_id = ?
  and item_id = ?

-- probably not indexed
select *
from topics_items
where item_id = ?

This is because a composite index like (topic_id, item_id) stores the topic ID first, then an item IDs which also have that topic ID. In order to look up an item ID efficiently in this index, Postgres must first narrow the search with a topic ID.


Postgres can reverse an index if it thinks it's worth the effort. If there's a small number of possible topic IDs, and a large number of possible index IDs, it will search for the index ID in each topic ID.

For example, let's say you have 10 possible topic IDs and 1000 possible item IDs and your index (topic_id, index_id). This is like having 10 clearly labelled topic ID buckets each with 1000 clearly labelled item ID buckets inside. To get to the item ID buckets, it must look inside each topic ID bucket. To use this index on where item_id = 23 Postgres must search each of the 10 topic ID buckets for all the buckets with item ID 23.

But if you have 1000 possible topic IDs and 10 possible item IDs, Postgres would have to search 1000 topic IDs buckets. Most likely it will do a full table scan instead. In this case you'd want to reverse your index and make it (item_id, topic_id).

This depends heavily on having good table statistics, which means making sure autovacuum is working properly.

So you can get away with a single index for two columns, if one column has far less variability than another.


Postgres can also use mulitple indexes if it thinks it will make the query run faster. For example, if you had an index on topic_id and an index on item_id, it can use both indexes and combine the results. For example where topic_id = 23 or item_id = 42 could use the topic_id index to search for topic ID 23, and the item_id index to search for item ID 42, then combine the results.

This is generally slower than having a composite (topic_id, item_id) index. It can also be slower than using a single index, so don't be surprised if Postgres decides not to use multiple indexes.


In general, for b-tree indexes, when you have two columns you have three possible combinations.

  • a + b
  • a
  • b

And you need two indexes.

  • (a, b) -- a and a + b
  • (b) -- b

(a, b) covers both searches for a and a + b. (b) covers searching for b.

When you have three columns, you have seven possible combinations.

  • a + b + c
  • a + b
  • a + c
  • a
  • b + c
  • b
  • c

But you only need three indexes.

  • (a, b, c) -- a, a + b, a + b + c
  • (b, c) -- b, b + c
  • (c, a) -- c, c + a

However, you probably actually want to avoid having an index on three columns. It's often slower. What you actually want is this.

  • (a, b)
  • (b, c)
  • (c, a)

Multicolumn indexes should be used sparingly. In most situations, an index on a single column is sufficient and saves space and time. Indexes with more than three columns are unlikely to be helpful unless the usage of the table is extremely stylized.

Reading from an index is slower than reading from the table. You want your indexes to reduce the number of rows which must be read, but you don't want Postgres to have to do any more index scanning than necessary.

Constraints on columns to the right... are checked in the index, so they save visits to the table proper, but they do not reduce the portion of the index that has to be scanned. For example, given an index on (a, b, c) and a query condition WHERE a = 5 AND b >= 42 AND c < 77, the index would have to be scanned from the first entry with a = 5 and b = 42 up through the last entry with a = 5. Index entries with c >= 77 would be skipped, but they'd still have to be scanned through.

Maxwell answered 22/3, 2020 at 20:16 Comment(0)
M
1

The rows with a certain topic_id can be found efficiently using the index on (topic_id, item_id), that's why my query considers that foreign key covered.

The index is sorted on topic_id, and within all entries with the same topic_id, it is sorted by item_id. That allows it to be used for searches on topic_id alone.

Missi answered 23/3, 2020 at 7:11 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.