Search in 300 million addresses with pg_trgm
I have 300 million addresses in a PostgreSQL 9.3 database and I want to use pg_trgm to fuzzy-search them. The goal is to implement a search function much like Google Maps search.
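For completeness: pg_trgm ships as a contrib extension and has to be enabled once per database before its operators and functions are available.

CREATE EXTENSION IF NOT EXISTS pg_trgm;  -- one-time setup per database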

When I use pg_trgm to search these addresses, it takes about 30 s to get results. Many rows match the default similarity threshold of 0.3, but I only need about 5 or 10 results. I created a trigram GiST index:

CREATE INDEX addresses_trgm_index ON addresses USING gist (address gist_trgm_ops);
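As an aside, since far more rows clear the default 0.3 threshold than I need: pg_trgm lets a session raise the similarity threshold that the % operator uses, which shrinks the candidate set before the sort. A minimal sketch; the 0.5 here is an arbitrary illustrative value, not a recommendation (on 9.6+ the same knob is also exposed as the pg_trgm.similarity_threshold setting):

SELECT show_limit();    -- current threshold used by %, defaults to 0.3
SELECT set_limit(0.5);  -- stricter threshold: % matches fewer rows, so less to sort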

This is my query:

SELECT address, similarity(address, '981 maun st') AS sml 
FROM addresses 
WHERE address % '981 maun st' 
ORDER BY sml DESC 
LIMIT 10;

The test table in the production environment has since been removed, so I show the EXPLAIN output from my test environment instead. It holds about 7 million rows and takes about 1.6 s to return results; with 300 million rows it takes more than 30 s.

ebdb=> explain analyse select address, similarity(address, '781 maun st') as sml from addresses where address % '781 maun st' order by sml desc limit 10;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=7615.83..7615.86 rows=10 width=16) (actual time=1661.004..1661.010 rows=10 loops=1)
   ->  Sort  (cost=7615.83..7634.00 rows=7268 width=16) (actual time=1661.003..1661.005 rows=10 loops=1)
         Sort Key: (similarity((address)::text, '781 maun st'::text))
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Index Scan using addresses_trgm_index on addresses  (cost=0.41..7458.78 rows=7268 width=16) (actual time=0.659..1656.386 rows=5241 loops=1)
               Index Cond: ((address)::text % '781 maun st'::text)
 Total runtime: 1661.066 ms
(7 rows)

Is there a good way to improve the performance, or would table partitioning be a good approach?

Pacemaker answered 27/6, 2017 at 6:16

Comments (8):
"... I just need about 5 or 10 results" ... are you placing an appropriate LIMIT on the query?Knockknee
Partitioning is available in Postgres 9.3 but is implemented using table inheritance. It is explicitly available in postgres 10.Affrica
I take it "300MM+" means 300 million? If so, you should consider using ElasticSearch or something similar.Speculative
@DavidAldridge Thanks, I have added a limit. However I need to order by the similarity, so the calculate is very slow.Pacemaker
@GaryTao I would look at parallel query to see if there's a quick win there. postgresql.org/docs/9.6/static/parallel-query.htmlKnockknee
@DavidAldridge Yeah, I will also consider to upgrade the database and use the parallel way.Pacemaker
Please Edit your question and add the execution plan generated using explain (analyze, verbose). Formatted text please, no screen shotsJorgan
@a_horse_with_no_name Have updated the question. Would appreciate hearing any suggestion.Pacemaker
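Following up on the parallel-query suggestion above, a sketch assuming an upgrade to PostgreSQL 9.6 or later (9.3 has no parallel query at all). Whether the planner actually parallelizes this particular query depends on the version and plan shape, since GiST index scans are not parallel-aware; the value 4 is illustrative, not a tuned recommendation:

SET max_parallel_workers_per_gather = 4;  -- allow up to 4 workers per Gather node
EXPLAIN (ANALYZE, VERBOSE)
SELECT address, similarity(address, '981 maun st') AS sml
FROM   addresses
WHERE  address % '981 maun st'
ORDER  BY sml DESC
LIMIT  10;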
Answer (score 10)

PostgreSQL 9.3 ... Is there a good way to improve the performance or is it a good plan to do table partitioning?

Table partitioning will not help at all.

But yes, there is a good way: Upgrade to a current version of Postgres. There have been many improvements for GiST indexes, for the pg_trgm module in particular and for big data in general. Should be substantially faster with Postgres 10.

Your "nearest neighbor" search looks correct but for a small LIMIT use this equivalent query instead:

SELECT address, similarity(address, '981 maun st') AS sml 
FROM   addresses 
WHERE  address % '981 maun st' 
ORDER  BY address <-> '981 maun st'
LIMIT  10;
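Both forms return the same top rows because, with gist_trgm_ops, the <-> operator computes the trigram distance, which is defined as 1 - similarity(). A quick sanity check with two made-up strings:

SELECT '981 maun st' <-> '981 main st'              AS distance,
       1 - similarity('981 maun st', '981 main st') AS one_minus_similarity;
-- both columns return the same value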

Quoting the manual:

It will usually beat the first formulation when only a small number of the closest matches is wanted.

The reason: with ORDER BY address <-> '981 maun st', the GiST index can return rows directly in distance order (a k-nearest-neighbor index scan), so there is no separate sort over all matching rows and the scan can stop as soon as the LIMIT is satisfied.

Alumnus answered 30/6, 2017 at 3:36
