I want to do some basic geocoding of addresses using Postgres. I have an address table that has around 1 million raw address strings:
=> \d addresses
Table "public.addresses"
Column | Type | Modifiers
---------+------+-----------
address | text |
I also have a table of location data:
=> \d locations
Table "public.locations"
Column | Type | Modifiers
------------+------+-----------
id | text |
country | text |
postalcode | text |
latitude | text |
longitude | text |
Most of the address strings contain postalcodes, so my first attempt was to do a like and a lateral join:
EXPLAIN SELECT * FROM addresses a
JOIN LATERAL (
SELECT * FROM locations
WHERE address ilike '%' || postalcode || '%'
ORDER BY LENGTH(postalcode) DESC
LIMIT 1
) AS l ON true;
That gave the expected result, but it was slow. Here's the query plan:
QUERY PLAN
--------------------------------------------------------------------------------------
Nested Loop (cost=18383.07..18540688323.77 rows=1008572 width=91)
-> Seq Scan on addresses a (cost=0.00..20997.72 rows=1008572 width=56)
-> Limit (cost=18383.07..18383.07 rows=1 width=35)
-> Sort (cost=18383.07..18391.93 rows=3547 width=35)
Sort Key: (length(locations.postalcode))
-> Seq Scan on locations (cost=0.00..18365.33 rows=3547 width=35)
Filter: (a.address ~~* (('%'::text || postalcode) || '%'::text))
I tried adding a gist trigram index to the address column, like mentioned at https://mcmap.net/q/28722/-postgresql-like-query-performance-variations, but the query plan for the above query doesn't make use of it, and the query plan in unchanged.
CREATE INDEX idx_address ON addresses USING gin (address gin_trgm_ops);
I have to remove the order by and limit in the lateral join query for the index to get used, which doesn't give me the results I want. Here's the query plan for the query without ORDER
or LIMIT
:
QUERY PLAN
-----------------------------------------------------------------------------------------------
Nested Loop (cost=39.35..129156073.06 rows=3577682241 width=86)
-> Seq Scan on locations (cost=0.00..12498.55 rows=709455 width=28)
-> Bitmap Heap Scan on addresses a (cost=39.35..131.60 rows=5043 width=58)
Recheck Cond: (address ~~* (('%'::text || locations.postalcode) || '%'::text))
-> Bitmap Index Scan on idx_address (cost=0.00..38.09 rows=5043 width=0)
Index Cond: (address ~~* (('%'::text || locations.postalcode) || '%'::text))
Is there something I can do to get the query to use the index, or is there a better way to rewrite this query?
postalcode
? E.g. 4-6 digits, or "up to 6 digits followed by up to two optional letters with no space". Are your addresses formatted in any standard way at all? Your best strategy might be finding the list of candidate valid postcode numbers (as an array) and use the postcode index in locations, instead of using a trigram index. – PinkneyORDER BY LENGTH(postalcode) DESC
? I thought that postal codes have a fixed length in most countries. – Erundgist
trigram index
, but then you display a gin trigram index? – Carmoncarmona