Django Full Text Search Optimization - Postgres

I am trying to build full-text search for an address autocomplete feature using Django (v2.1) and Postgres (9.5), but the performance is currently not suitable for autocomplete, and I don't understand the pattern behind the performance results I'm getting. For reference, the table is quite sizeable, with 14 million rows.

My model:

from django.db import models
from postgres_copy import CopyManager
from django.contrib.postgres.indexes import GinIndex

class Addresses(models.Model):
    date_update = models.DateTimeField(auto_now=True, null=True)
    longitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    latitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    number = models.CharField(max_length=16, null=True, default='')
    street = models.CharField(max_length=60, null=True, default='')
    unit = models.CharField(max_length=50, null=True, default='')
    city = models.CharField(max_length=50, null=True, default='')
    district = models.CharField(max_length=10, null=True, default='')
    region = models.CharField(max_length=5, null=True, default='')
    postcode = models.CharField(max_length=5, null=True, default='')
    addr_id = models.CharField(max_length=20, unique=True)
    addr_hash = models.CharField(max_length=20, unique=True)

    objects = CopyManager()

    class Meta:
        indexes = [
            GinIndex(fields=['number', 'street', 'unit', 'city', 'region', 'postcode'], name='search_idx')
        ]

I created a little test to check performance based on number of words in the search:

import time
from datetime import timedelta

from django.contrib.postgres.search import SearchVector

search_vector = SearchVector('number', 'street', 'unit', 'city', 'region', 'postcode')

searchtext1 = "north"
searchtext2 = "north bondi"
searchtext3 = "north bondi blair"
searchtext4 = "north bondi blair street 2026"

print('Test1: 1 word')
start_time = time.time()
result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext1)[:10]
#print(len(result))
time_exec = str(timedelta(seconds=time.time() - start_time))
print(time_exec)
print(' ')

#print(AddressesAustralia.objects.annotate(search=search_vector).explain(verbose=True))

print('Test2: 2 words')
start_time = time.time()
result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext2)[:10]
#print(len(result))
time_exec = str(timedelta(seconds=time.time() - start_time))
print(time_exec)
print(' ')

print('Test3: 3 words')
start_time = time.time()
result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext3)[:10]
#print(len(result))
time_exec = str(timedelta(seconds=time.time() - start_time))
print(time_exec)
print(' ')

print('Test4: 5 words')
start_time = time.time()
result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext4)[:10]
#print(len(result))
time_exec = str(timedelta(seconds=time.time() - start_time))
print(time_exec)
print(' ')

I get the following results, which seem quite reasonable:

Test1: 1 word
0:00:00.001841

Test2: 2 words
0:00:00.001422

Test3: 3 words
0:00:00.001574

Test4: 5 words
0:00:00.001360

However, if I uncomment the print(len(result)) lines, I get the following results:

Test1: 1 word
10
0:00:00.046392

Test2: 2 words
10
0:00:06.544732

Test3: 3 words
10
0:01:12.367157

Test4: 5 words
10
0:01:17.786596

This is obviously not suitable for an autocomplete feature.

Could someone explain why the execution takes longer when performing an operation on the queryset result? It seems the database retrieval is always fast, but going through the results takes time, which does not make sense to me: since I'm limiting the results to 10, the returned queryset is always the same size.

Also, although I have created a GIN index, this index does not seem to be used. It seems it has been created correctly:

=# \d public_data_au_addresses
                                Table "public.public_data_au_addresses"
   Column    |           Type           | Collation | Nullable |                       Default
-------------+--------------------------+-----------+----------+------------------------------------------------------
 id          | integer                  |           | not null | nextval('public_data_au_addresses_id_seq'::regclass)
 date_update | timestamp with time zone |           |          |
 longitude   | numeric(9,6)             |           |          |
 latitude    | numeric(9,6)             |           |          |
 number      | character varying(16)    |           |          |
 street      | character varying(60)    |           |          |
 unit        | character varying(50)    |           |          |
 city        | character varying(50)    |           |          |
 district    | character varying(10)    |           |          |
 region      | character varying(5)     |           |          |
 postcode    | character varying(5)     |           |          |
 addr_id     | character varying(20)    |           | not null |
 addr_hash   | character varying(20)    |           | not null |
Indexes:
    "public_data_au_addresses_pkey" PRIMARY KEY, btree (id)
    "public_data_au_addresses_addr_hash_key" UNIQUE CONSTRAINT, btree (addr_hash)
    "public_data_au_addresses_addr_id_key" UNIQUE CONSTRAINT, btree (addr_id)
    "public_data_au_addresses_addr_hash_e8c67a89_like" btree (addr_hash varchar_pattern_ops)
    "public_data_au_addresses_addr_id_9ee00c76_like" btree (addr_id varchar_pattern_ops)
    "search_idx" gin (number, street, unit, city, region, postcode)

When I run the explain() method on my query, I get this:

Test1: 1 word
Limit  (cost=0.00..1110.60 rows=10 width=140)
  ->  Seq Scan on public_data_au_addresses  (cost=0.00..8081472.41 rows=72767 width=140)
        Filter: (to_tsvector((((((((((((COALESCE(number, ''::character varying))::text || ' '::text) || (COALESCE(street, ''::character varying))::text) || ' '::text) || (COALESCE(unit, ''::character varying))::text) || ' '::text) || (COALESCE(city, ''::character varying))::text) || ' '::text) || (COALESCE(region, ''::character varying))::text) || ' '::text) || (COALESCE(postcode, ''::character varying))::text)) @@ plainto_tsquery('north'::text))

So it still shows a sequential scan instead of an index scan. Does anyone know how to fix or debug this?

Would a GIN index still be efficient with so many fields to search on anyway?

And lastly, does anyone know how I can change the code to boost performance further?

Thank you! Regards

Update

I have tried creating a search vector as suggested by Paolo below, but it seems the search still runs sequentially and does not leverage the GIN index.

from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVector, SearchVectorField
from django.db import models
from postgres_copy import CopyQuerySet


class AddressesQuerySet(CopyQuerySet):

    def update_search_vector(self):
        return self.update(search_vector=SearchVector('number', 'street', 'unit', 'city', 'region', 'postcode', config='english'))


class AddressesAustralia(models.Model):
    date_update = models.DateTimeField(auto_now=True, null=True)
    longitude = models.DecimalField(max_digits=9, decimal_places=6 , null=True)
    latitude = models.DecimalField(max_digits=9, decimal_places=6 , null=True)
    number = models.CharField(max_length=16, null=True, default='')
    street = models.CharField(max_length=60, null=True, default='')
    unit = models.CharField(max_length=50, null=True, default='')
    city = models.CharField(max_length=50, null=True, default='')
    district = models.CharField(max_length=10, null=True, default='')
    region = models.CharField(max_length=5, null=True, default='')
    postcode = models.CharField(max_length=5, null=True, default='')
    addr_id = models.CharField(max_length=20, unique=True)
    addr_hash = models.CharField(max_length=20, unique=True)
    search_vector = SearchVectorField(null=True, editable=False)

    objects = AddressesQuerySet.as_manager()

    class Meta:
        indexes = [
            GinIndex(fields=['search_vector'], name='search_vector_idx')
        ]

I have then updated the search_vector field using the update command:

AddressesAustralia.objects.update_search_vector()

Then I ran a query to test with the same search vector:

import time
from datetime import timedelta

from django.contrib.postgres.search import SearchVector
from django.core.management.base import BaseCommand


class Command(BaseCommand):

    def handle(self, *args, **options):

        # Note: unused here; the filter below hits the stored search_vector field.
        search_vector = SearchVector('number', 'street', 'unit', 'city', 'region', 'postcode', config='english')

        searchtext1 = "north"

        print('Test1: 1 word')
        start_time = time.time()
        result = AddressesAustralia.objects.filter(search_vector=searchtext1)[:10].explain(verbose=True)
        print(len(result))  # explain() returns a string, so this is its character count, not a row count
        print(result)
        time_exec = str(timedelta(seconds=time.time() - start_time))
        print(time_exec)

And I get the following results, still showing a sequential scan:

Test1: 1 word
532
Limit  (cost=0.00..120.89 rows=10 width=235)
  Output: id, date_update, longitude, latitude, number, street, unit, city, district, region, postcode, addr_id, addr_hash, search_vector
  ->  Seq Scan on public.public_data_au_addressesaustralia  (cost=0.00..5061078.91 rows=418651 width=235)
        Output: id, date_update, longitude, latitude, number, street, unit, city, district, region, postcode, addr_id, addr_hash, search_vector
        Filter: (public_data_au_addressesaustralia.search_vector @@ plainto_tsquery('north'::text))
0:00:00.075262

I have also tried:

  • With and without the config="english" in the search vector (both in the update and in the query)

  • To delete the GIN index, recreate it, and then re-run update_search_vector()

But I still get the same results. Any idea what I am doing wrong, or how I can troubleshoot further?

Ecclesia answered 16/12, 2018 at 4:59 Comment(5)
What's the definition of the AddressesAustralia model? You only show the definition of the Addresses model. – Spermatic
Querysets are lazy; your first test isn't measuring any database operations (see the sketch after these comments). Either way, check the performance section of the full-text search docs. – Inconvenience
@RedCricket Addresses and AddressesAustralia are the same model, just a typo while posting, sorry. – Ecclesia
@Inconvenience Thanks for pointing out that querysets are lazy; that explains why my first tests weren't actually doing any database operations. Regarding performance, I did check the link you provided, but it says "In the event that all the fields you’re querying on are contained within one particular model, you can create a functional index which matches the search vector you wish to use", which is what I have done with my GIN index. My understanding is that SearchVectorField is mostly useful when searching across several models, which is not the case for me. – Ecclesia
@Ecclesia As I explain in my answer, SearchVectorField is useful for increasing your search performance, not only when searching across several models. Have you tried my code? Does it work for you? – Hemphill
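A minimal sketch of the laziness described in the comments above, using the question's model and search_vector annotation:

# Constructing a queryset runs no SQL; Django composes the query lazily.
qs = AddressesAustralia.objects.annotate(search=search_vector).filter(search='north')[:10]

# The SQL is only sent to the database when the queryset is evaluated,
# e.g. by list(), len(), or iteration. Timing only the line above measures
# Python object construction, not the database query.
results = list(qs)  # the expensive full-text scan happens here

This is why the first set of timings looks instant, while uncommenting print(len(result)) makes the same tests take seconds to minutes.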

As @knbk already suggested, for a performance improvement you should read the Full-text search Performance section in the Django documentation.

"If this approach becomes too slow, you can add a SearchVectorField to your model."

In your code you can add a search vector field to your model, with a related GIN index, and a queryset with a new method to update the field:

from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVector, SearchVectorField
from django.db import models
from postgres_copy import CopyQuerySet


class AddressesQuerySet(CopyQuerySet):

    def update_search_vector(self):
        return self.update(search_vector=SearchVector(
            'number', 'street', 'unit', 'city', 'region', 'postcode'
        ))


class Addresses(models.Model):
    date_update = models.DateTimeField(auto_now=True, null=True)
    longitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    latitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    number = models.CharField(max_length=16, null=True, default='')
    street = models.CharField(max_length=60, null=True, default='')
    unit = models.CharField(max_length=50, null=True, default='')
    city = models.CharField(max_length=50, null=True, default='')
    district = models.CharField(max_length=10, null=True, default='')
    region = models.CharField(max_length=5, null=True, default='')
    postcode = models.CharField(max_length=5, null=True, default='')
    addr_id = models.CharField(max_length=20, unique=True)
    addr_hash = models.CharField(max_length=20, unique=True)
    search_vector = SearchVectorField(null=True, editable=False)

    objects = AddressesQuerySet.as_manager()

    class Meta:
        indexes = [
            GinIndex(fields=['search_vector'], name='search_vector_idx')
        ]

You can update your new search vector field using the new queryset method:

>>> Addresses.objects.update_search_vector()
UPDATE "addresses_addresses"
SET "search_vector" = to_tsvector(
  COALESCE("addresses_addresses"."number", '') || ' ' ||
  COALESCE("addresses_addresses"."street", '') || ' ' ||
  COALESCE("addresses_addresses"."unit", '') || ' ' ||
  COALESCE("addresses_addresses"."city", '') || ' ' ||
  COALESCE("addresses_addresses"."region", '') || ' ' ||
  COALESCE("addresses_addresses"."postcode", '')
)
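Note that the stored vector only reflects the rows as of the last update_search_vector() call. Here is a sketch (not part of the original answer) of keeping it current with Postgres's built-in tsvector_update_trigger, wrapped in a migration; the table name comes from the generated SQL above, while the app label and previous migration name are hypothetical:

from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ('addresses', '0002_addresses_search_vector'),  # hypothetical app label and migration
    ]

    operations = [
        migrations.RunSQL(
            # tsvector_update_trigger is a built-in Postgres helper that
            # recomputes the tsvector column from the listed text columns
            # on every INSERT or UPDATE, using the given configuration.
            sql="""
                CREATE TRIGGER search_vector_update
                BEFORE INSERT OR UPDATE ON addresses_addresses
                FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(
                    search_vector, 'pg_catalog.english',
                    number, street, unit, city, region, postcode
                );
            """,
            reverse_sql="DROP TRIGGER search_vector_update ON addresses_addresses;",
        ),
    ]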

If you execute a query and read its explain output, you can see your GIN index being used:

>>> print(Addresses.objects.filter(search_vector='north').values('id').explain(verbose=True))
EXPLAIN (VERBOSE true)
SELECT "addresses_addresses"."id"
FROM "addresses_addresses"
WHERE "addresses_addresses"."search_vector" @@ (plainto_tsquery('north')) = true [0.80ms]
Bitmap Heap Scan on public.addresses_addresses  (cost=12.25..16.52 rows=1 width=4)
  Output: id
  Recheck Cond: (addresses_addresses.search_vector @@ plainto_tsquery('north'::text))
  ->  Bitmap Index Scan on search_vector_idx  (cost=0.00..12.25 rows=1 width=0)
        Index Cond: (addresses_addresses.search_vector @@ plainto_tsquery('north'::text))

If you want to dig deeper, you can read an article I wrote on the subject:

"Full-Text Search in Django with PostgreSQL"

Update

I tried executing the SQL generated by the Django ORM in SQL Fiddle: http://sqlfiddle.com/#!17/f9aa9/1

Hemphill answered 18/12, 2018 at 14:32 Comment(6)
Thanks for the help! And apologies for the slow response: my database is very large, an update takes a few hours to run, and I made a few attempts before updating this thread. Unfortunately I still get a sequential scan, and the GIN index does not seem to be used. I have updated my post above with the updated code and the results. Let me know if you have further ideas to troubleshoot this! Thanks again. – Ecclesia
I tried the code again in SQL Fiddle, as reported in my answer. Do you have a sample CSV I could import to test the same SQL with a lot of rows? – Hemphill
It is actually working now! It seems the problem was that I was testing with the search 'north' (as you did), and for some reason as soon as I entered another word it worked! If I put 'north' back, it reverts to a sequential scan... Not sure why; it may be reusing some sort of cache for that particular search from before I had implemented the search_vector properly. Anyway, it works now, thanks very much for the help!! – Ecclesia
You're welcome. I'm happy it worked for you too. FTS with Django and Postgres is very powerful. Enjoy using it and please share your experience. – Hemphill
It seems to work well using the default search, which only returns results matching whole words (using plainto_tsquery). I have tried modifying the search to use to_tsquery instead, to allow for incomplete words: tsquery = " & ".join(searchtext.split()) + ":*" with AddressesAustralia.objects.extra(where=["search_vector @@ (to_tsquery(%s)) = true"], params=[tsquery]) (written out in the sketch after these comments). Unfortunately, although it works well, it takes three times as long on my database of 14 million rows (1.5s vs 0.5s)! Not sure if there is still room for improvement anywhere. – Ecclesia
I think it's better if you add another question with all your details and the test code for your new query, so everyone can answer and help you. – Hemphill
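The prefix-matching variant from the comment above, written out as a sketch (Django 2.1's SearchQuery does not expose a raw search type, hence the raw WHERE clause):

# Incomplete-word (prefix) matching with to_tsquery, as described in the
# comment above; "&" ANDs the terms and ":*" makes the last one a prefix.
searchtext = "north bon"
tsquery = " & ".join(searchtext.split()) + ":*"  # -> "north & bon:*"

results = AddressesAustralia.objects.extra(
    where=["search_vector @@ (to_tsquery(%s)) = true"],
    params=[tsquery],
)[:10]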

You need to create a functional index on the search vector. Right now you have an index on the underlying fields, but Postgres still has to build the search vector for every row before it can filter the results. That's why it's doing a sequential scan.

Django (as of 2.1) does not support functional indexes in Meta.indexes, so you need to create one manually, for example with a RunSQL operation in a migration.

RunSQL(
    """
    CREATE INDEX ON public_data_au_addresses USING GIN 
    (to_tsvector(...))
    """
)

The to_tsvector() expression has to match the expression used in your query. Be sure to read through the Postgres docs for all the details.
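For illustration, a fuller sketch of such a migration: the index name and the migration dependency are invented, and the expression mirrors the concatenation visible in the question's EXPLAIN output, with an explicit 'english' configuration (Postgres requires it here, because the single-argument form of to_tsvector() is not IMMUTABLE and cannot be used in an index expression):

from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ('public_data_au', '0001_initial'),  # hypothetical
    ]

    operations = [
        migrations.RunSQL(
            sql="""
                CREATE INDEX address_fts_idx
                ON public_data_au_addresses USING GIN
                (to_tsvector('english',
                    COALESCE(number, '') || ' ' || COALESCE(street, '') || ' ' ||
                    COALESCE(unit, '') || ' ' || COALESCE(city, '') || ' ' ||
                    COALESCE(region, '') || ' ' || COALESCE(postcode, '')
                ));
            """,
            reverse_sql="DROP INDEX address_fts_idx;",
        ),
    ]

For the planner to use this index, the query has to generate the same expression, so the SearchVector annotation would also need config='english'.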

Inconvenience answered 16/12, 2018 at 18:57 Comment(0)
