LATERAL JOIN not using trigram index
Asked Answered
F

3

9

I want to do some basic geocoding of addresses using Postgres. I have an address table that has around 1 million raw address strings:

=> \d addresses
  Table "public.addresses"
 Column  | Type | Modifiers
---------+------+-----------
 address | text |

I also have a table of location data:

=> \d locations
   Table "public.locations"
   Column   | Type | Modifiers
------------+------+-----------
 id         | text |
 country    | text |
 postalcode | text |
 latitude   | text |
 longitude  | text |

Most of the address strings contain postalcodes, so my first attempt was to do a like and a lateral join:

EXPLAIN SELECT * FROM addresses a
JOIN LATERAL (
    SELECT * FROM locations
    WHERE address ilike '%' || postalcode || '%'
    ORDER BY LENGTH(postalcode) DESC
    LIMIT 1
) AS l ON true;

That gave the expected result, but it was slow. Here's the query plan:

                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Nested Loop  (cost=18383.07..18540688323.77 rows=1008572 width=91)
   ->  Seq Scan on addresses a  (cost=0.00..20997.72 rows=1008572 width=56)
   ->  Limit  (cost=18383.07..18383.07 rows=1 width=35)
         ->  Sort  (cost=18383.07..18391.93 rows=3547 width=35)
               Sort Key: (length(locations.postalcode))
               ->  Seq Scan on locations  (cost=0.00..18365.33 rows=3547 width=35)
                     Filter: (a.address ~~* (('%'::text || postalcode) || '%'::text))

I tried adding a gist trigram index to the address column, like mentioned at https://mcmap.net/q/28722/-postgresql-like-query-performance-variations, but the query plan for the above query doesn't make use of it, and the query plan in unchanged.

CREATE INDEX idx_address ON addresses USING gin (address gin_trgm_ops);

I have to remove the order by and limit in the lateral join query for the index to get used, which doesn't give me the results I want. Here's the query plan for the query without ORDER or LIMIT:

                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Nested Loop  (cost=39.35..129156073.06 rows=3577682241 width=86)
   ->  Seq Scan on locations  (cost=0.00..12498.55 rows=709455 width=28)
   ->  Bitmap Heap Scan on addresses a  (cost=39.35..131.60 rows=5043 width=58)
         Recheck Cond: (address ~~* (('%'::text || locations.postalcode) || '%'::text))
         ->  Bitmap Index Scan on idx_address  (cost=0.00..38.09 rows=5043 width=0)
               Index Cond: (address ~~* (('%'::text || locations.postalcode) || '%'::text))

Is there something I can do to get the query to use the index, or is there a better way to rewrite this query?

Francophile answered 17/5, 2016 at 4:25 Comment(11)
Can you post the query and the plan for the one that does use the index?Lucielucien
Added more details and the additional query planFrancophile
Is there an PK in your addresses table? Is there a common/valid format for postalcode? E.g. 4-6 digits, or "up to 6 digits followed by up to two optional letters with no space". Are your addresses formatted in any standard way at all? Your best strategy might be finding the list of candidate valid postcode numbers (as an array) and use the postcode index in locations, instead of using a trigram index.Pinkney
What's the purpose of ORDER BY LENGTH(postalcode) DESC? I thought that postal codes have a fixed length in most countries.Erund
In the UK and US at least you have 2 types of postalcode, a short form and long form. The locations table has both, and this order by makes sure to match the more specific one, if it exists in the table.Francophile
So there will be at most two postal codes per any address? In some cases there may be more but how many do you expect to get?Lucielucien
No there can be many different postal codes per address.Francophile
I took the liberty to fix the table name in the index to match. Please also clarify: You mention a gist trigram index, but then you display a gin trigram index?Carmoncarmona
Thanks Erwin. I tried both, but you're right that the example includes a gin index. Same results with both (index used when not doing order or limit, not used when doing either order or limit).Francophile
Please clarify the question accordingly. It's a bit confusing as it is. I think I have a solution for you ...Carmoncarmona
Sure, happy to clarify. Can you let me know what's not currently clear? I've tried to include the goal, and a working query that I'd like to improve the speed of. What else would help?Francophile
C
6

Why?

The query cannot use the index on principal. You would need an index on the table locations, but the one you have is on the table addresses.

You can verify my claim by setting:

SET enable_seqscan = off;

(In your session only, and for debugging only. Never use it in production.) It's not like the index would be more expensive than a sequential scan, there is just no way for Postgres to use it for your query at all.

Aside: [INNER] JOIN ... ON true is just an awkward way of saying CROSS JOIN ...

Why is the index used after removing ORDER and LIMIT?

Because Postgres can rewrite this simple form to:

SELECT *
FROM   addresses a
JOIN   locations l ON a.address ILIKE '%' || l.postalcode || '%';

You'll see the exact same query plan. (At least I do in my tests on Postgres 9.5.)

Solution

You need an index on locations.postalcode. And while using LIKE or ILIKE you would also need to bring the indexed expression (postalcode) to the left side of the operator. ILIKE is implemented with the operator ~~* and this operator has no COMMUTATOR (a logical necessity), so it's not possible to flip operands around. Detailed explanation in these related answers:

A solution is to use the trigram similarity operator % or its inverse, the distance operator <-> in a nearest neighbour query instead (each is commutator for itself, so operands can switch places freely):

SELECT *
FROM   addresses a
JOIN   LATERAL (
   SELECT *
   FROM   locations
   ORDER  BY postalcode <-> a.address
   LIMIT  1
   ) l ON address ILIKE '%' || postalcode || '%';

Find the most similar postalcode for each address, and then check if that postalcode actually matches fully.

This way, a longer postalcode will be preferred automatically since it's more similar (smaller distance) than a shorter postalcode that also matches.

A bit of uncertainty remains. Depending on possible postal codes, there could be false positives due to matching trigrams in other parts of the string. There is not enough information in the question to say more.

Here, [INNER] JOIN instead of CROSS JOIN makes sense, since we add an actual join condition.

The manual:

This can be implemented quite efficiently by GiST indexes, but not by GIN indexes.

So:

CREATE INDEX locations_postalcode_trgm_gist_idx ON locations
USING gist (postalcode gist_trgm_ops);
Carmoncarmona answered 21/5, 2016 at 14:51 Comment(1)
Great answer, always learn a lot from you! The EXPLAIN shows this using the new index, but unfortunately the query still hasn't completed after 9 hours, which means it's going to be at least twice as slow as the original query without the index.Francophile
P
2

It's a far shot, but how does the following alternative perform?

SELECT DISTINCT ON ((x.a).address) (x.a).*, l.*
FROM (
  SELECT a, l.id AS lid, LENGTH(l.postalcode) AS pclen
  FROM addresses a
  LEFT JOIN locations l ON (a.address ilike '%' || l.postalcode || '%') -- this should be fast, but produce many rows
  ) x
LEFT JOIN locations l ON (l.id = x.lid)
ORDER BY (x.a).address, pclen DESC -- this is where it will be slow, as it'll have to sort the entire results, to filter them by DISTINCT ON
Pinkney answered 19/5, 2016 at 9:22 Comment(4)
@BenDowling found and fixed one, although without information about the error that may not be the one. Does it work now? Otherwise, what's the error?Pinkney
ERROR: syntax error at or near "," LINE 1: explain SELECT DISTINCT ON ((x.a).address), (x.a).*, l.*Francophile
@BenDowling fixed (the problem was a comma after distinct on :-/ ) how about now?Pinkney
EXPLAIN shows that this still doesn't use the index :(Francophile
L
2

It can work if you turn the lateral join inside out. But even then it might still be really slow

SELECT DISTINCT ON (address) *
FROM (
    SELECT * 
    FROM locations
       ,LATERAL(
           SELECT * FROM addresses
           WHERE address ilike '%' || postalcode || '%'
           OFFSET 0 -- force fencing, might be redundant
        ) a
) q
ORDER BY address, LENGTH(postalcode) DESC

The downside is that you can implement paging only on postalcodes, not addresses.

Lucielucien answered 20/5, 2016 at 20:30 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.