I am trying to build full-text search for an address autocomplete feature, leveraging Django (v2.1) and Postgres (9.5), but the performance is currently not suitable for autocomplete, and I don't understand the logic behind the performance results I'm getting. For context, the table is quite sizeable, with 14 million rows.
My model:
from django.db import models
from postgres_copy import CopyManager
from django.contrib.postgres.indexes import GinIndex


class Addresses(models.Model):
    date_update = models.DateTimeField(auto_now=True, null=True)
    longitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    latitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    number = models.CharField(max_length=16, null=True, default='')
    street = models.CharField(max_length=60, null=True, default='')
    unit = models.CharField(max_length=50, null=True, default='')
    city = models.CharField(max_length=50, null=True, default='')
    district = models.CharField(max_length=10, null=True, default='')
    region = models.CharField(max_length=5, null=True, default='')
    postcode = models.CharField(max_length=5, null=True, default='')
    addr_id = models.CharField(max_length=20, unique=True)
    addr_hash = models.CharField(max_length=20, unique=True)

    objects = CopyManager()

    class Meta:
        indexes = [
            GinIndex(fields=['number', 'street', 'unit', 'city', 'region', 'postcode'], name='search_idx')
        ]
I created a little test to check performance based on the number of words in the search:
import time
from datetime import timedelta

from django.contrib.postgres.search import SearchVector

# (the AddressesAustralia model used below is imported from the app's models)

search_vector = SearchVector('number', 'street', 'unit', 'city', 'region', 'postcode')
searchtext1 = "north"
searchtext2 = "north bondi"
searchtext3 = "north bondi blair"
searchtext4 = "north bondi blair street 2026"
print('Test1: 1 word')
start_time = time.time()
result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext1)[:10]
#print(len(result))
time_exec = str(timedelta(seconds=time.time() - start_time))
print(time_exec)
print(' ')
#print(AddressesAustralia.objects.annotate(search=search_vector).explain(verbose=True))
print('Test2: 2 words')
start_time = time.time()
result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext2)[:10]
#print(len(result))
time_exec = str(timedelta(seconds=time.time() - start_time))
print(time_exec)
print(' ')
print('Test3: 3 words')
start_time = time.time()
result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext3)[:10]
#print(len(result))
time_exec = str(timedelta(seconds=time.time() - start_time))
print(time_exec)
print(' ')
print('Test4: 5 words')
start_time = time.time()
result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext4)[:10]
#print(len(result))
time_exec = str(timedelta(seconds=time.time() - start_time))
print(time_exec)
print(' ')
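For what it's worth, the four copy-pasted blocks above can be collapsed into an equivalent loop (same imports and search_vector; just a tidier sketch of the same harness):

tests = [
    ('Test1: 1 word', searchtext1),
    ('Test2: 2 words', searchtext2),
    ('Test3: 3 words', searchtext3),
    ('Test4: 5 words', searchtext4),
]

for label, text in tests:
    print(label)
    start_time = time.time()
    result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=text)[:10]
    # print(len(result))  # uncommenting forces the query to actually run
    print(timedelta(seconds=time.time() - start_time))
    print(' ')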
I get the following results, which look quite reasonable:
Test1: 1 word
0:00:00.001841
Test2: 2 words
0:00:00.001422
Test3: 3 words
0:00:00.001574
Test4: 5 words
0:00:00.001360
However, if I uncomment the print(len(result)) lines, I get the following results:
Test1: 1 word
10
0:00:00.046392
Test2: 2 words
10
0:00:06.544732
Test3: 3 words
10
0:01:12.367157
Test4: 5 words
10
0:01:17.786596
This is obviously not suitable for an autocomplete feature.
Could someone explain why the execution takes so much longer when an operation is performed on the queryset result? The database retrieval always seems fast, but iterating through the results then takes time, which does not make sense to me: since I'm limiting the results to 10, the returned queryset is always the same size.
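For reference, this is the shape of a test that forces evaluation inside the timed block (a sketch; list() plays the same role as the uncommented print(len(result)) above):

import time
from datetime import timedelta

from django.contrib.postgres.search import SearchVector

search_vector = SearchVector('number', 'street', 'unit', 'city', 'region', 'postcode')

start_time = time.time()
# Slicing alone does not hit the database; list() (like len()) forces the
# query to execute, so the rows are fetched inside the timed block.
result = list(
    AddressesAustralia.objects
    .annotate(search=search_vector)
    .filter(search="north bondi")[:10]
)
print(len(result))
print(timedelta(seconds=time.time() - start_time))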
Also, although I have created a GIN index, it does not seem to be used. The index itself appears to have been created correctly:
=# \d public_data_au_addresses
                            Table "public.public_data_au_addresses"
   Column    |           Type           | Collation | Nullable |                       Default
-------------+--------------------------+-----------+----------+------------------------------------------------------
 id          | integer                  |           | not null | nextval('public_data_au_addresses_id_seq'::regclass)
 date_update | timestamp with time zone |           |          |
 longitude   | numeric(9,6)             |           |          |
 latitude    | numeric(9,6)             |           |          |
 number      | character varying(16)    |           |          |
 street      | character varying(60)    |           |          |
 unit        | character varying(50)    |           |          |
 city        | character varying(50)    |           |          |
 district    | character varying(10)    |           |          |
 region      | character varying(5)     |           |          |
 postcode    | character varying(5)     |           |          |
 addr_id     | character varying(20)    |           | not null |
 addr_hash   | character varying(20)    |           | not null |
Indexes:
    "public_data_au_addresses_pkey" PRIMARY KEY, btree (id)
    "public_data_au_addresses_addr_hash_key" UNIQUE CONSTRAINT, btree (addr_hash)
    "public_data_au_addresses_addr_id_key" UNIQUE CONSTRAINT, btree (addr_id)
    "public_data_au_addresses_addr_hash_e8c67a89_like" btree (addr_hash varchar_pattern_ops)
    "public_data_au_addresses_addr_id_9ee00c76_like" btree (addr_id varchar_pattern_ops)
    "search_idx" gin (number, street, unit, city, region, postcode)
When I run the explain() method on my query, I get this:
Test1: 1 word
Limit (cost=0.00..1110.60 rows=10 width=140)
-> Seq Scan on public_data_au_addresses (cost=0.00..8081472.41 rows=72767 width=140)
Filter: (to_tsvector((((((((((((COALESCE(number, ''::character varying))::text || ' '::text) || (COALESCE(street, ''::character varying))::text) || ' '::text) || (COALESCE(unit, ''::character varying))::text) || ' '::text) || (COALESCE(city, ''::character varying))::text) || ' '::text) || (COALESCE(region, ''::character varying))::text) || ' '::text) || (COALESCE(postcode, ''::character varying))::text)) @@ plainto_tsquery('north'::text))
So it still shows a sequential scan instead of an index scan. Does anyone know how to fix or debug that?
Would a GIN index even be efficient with so many fields to search on?
And lastly, does anyone know how I can improve the code to get better performance?
Thank you! Regards
Update
I have tried to create a stored search vector as suggested by Paolo below, but it seems the search is still sequential and does not leverage the GIN index.
from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVector, SearchVectorField
from django.db import models
from postgres_copy import CopyQuerySet


class AddressesQuerySet(CopyQuerySet):

    def update_search_vector(self):
        return self.update(search_vector=SearchVector(
            'number', 'street', 'unit', 'city', 'region', 'postcode',
            config='english',
        ))


class AddressesAustralia(models.Model):
    date_update = models.DateTimeField(auto_now=True, null=True)
    longitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    latitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    number = models.CharField(max_length=16, null=True, default='')
    street = models.CharField(max_length=60, null=True, default='')
    unit = models.CharField(max_length=50, null=True, default='')
    city = models.CharField(max_length=50, null=True, default='')
    district = models.CharField(max_length=10, null=True, default='')
    region = models.CharField(max_length=5, null=True, default='')
    postcode = models.CharField(max_length=5, null=True, default='')
    addr_id = models.CharField(max_length=20, unique=True)
    addr_hash = models.CharField(max_length=20, unique=True)
    search_vector = SearchVectorField(null=True, editable=False)

    objects = AddressesQuerySet.as_manager()

    class Meta:
        indexes = [
            GinIndex(fields=['search_vector'], name='search_vector_idx')
        ]
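The corresponding migration should look roughly like this sketch (the app label and dependency are placeholders, not my real migration file):

import django.contrib.postgres.indexes
import django.contrib.postgres.search
from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ('public_data_au', '0002_previous'),  # placeholder app label / dependency
    ]

    operations = [
        # add the tsvector column...
        migrations.AddField(
            model_name='addressesaustralia',
            name='search_vector',
            field=django.contrib.postgres.search.SearchVectorField(editable=False, null=True),
        ),
        # ...and index it with GIN so @@ lookups can use an index scan
        migrations.AddIndex(
            model_name='addressesaustralia',
            index=django.contrib.postgres.indexes.GinIndex(fields=['search_vector'], name='search_vector_idx'),
        ),
    ]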
I have then updated the search_vector field using the update command:
AddressesAustralia.objects.update_search_vector()
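Since update() is a one-time backfill, I am planning to keep the column fresh for new rows with something along these lines (just a sketch; the inner update() does not call save(), so it will not re-trigger the signal):

from django.db.models.signals import post_save
from django.dispatch import receiver

# AddressesAustralia is the model defined above


@receiver(post_save, sender=AddressesAustralia)
def refresh_search_vector(sender, instance, **kwargs):
    # Recompute the vector for just the saved row, reusing the queryset
    # method; this issues a single UPDATE and bypasses save(), so it
    # does not recurse.
    AddressesAustralia.objects.filter(pk=instance.pk).update_search_vector()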
Then I ran a query to test with the same search vector:
import time
from datetime import timedelta

from django.contrib.postgres.search import SearchVector
from django.core.management.base import BaseCommand


class Command(BaseCommand):

    def handle(self, *args, **options):
        search_vector = SearchVector('number', 'street', 'unit', 'city', 'region', 'postcode', config='english')

        searchtext1 = "north"

        print('Test1: 1 word')
        start_time = time.time()
        # explain() returns the query plan as a string, so the len(result)
        # printed below is the length of that string, not a row count
        result = AddressesAustralia.objects.filter(search_vector=searchtext1)[:10].explain(verbose=True)
        print(len(result))
        print(result)
        time_exec = str(timedelta(seconds=time.time() - start_time))
        print(time_exec)
And I get the following results, still showing a sequential scan:
Test1: 1 word
532
Limit (cost=0.00..120.89 rows=10 width=235)
Output: id, date_update, longitude, latitude, number, street, unit, city, district, region, postcode, addr_id, addr_hash, search_vector
-> Seq Scan on public.public_data_au_addressesaustralia (cost=0.00..5061078.91 rows=418651 width=235)
Output: id, date_update, longitude, latitude, number, street, unit, city, district, region, postcode, addr_id, addr_hash, search_vector
Filter: (public_data_au_addressesaustralia.search_vector @@ plainto_tsquery('north'::text))
0:00:00.075262
I have also tried:
- with and without config='english' in the search vector (both in the update and in the query);
- deleting the GIN index, recreating it, and re-running update_search_vector.
But the results are still the same. Any idea what I am doing wrong or how I can troubleshoot further?
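One debugging idea I have noted but not yet fully explored: disabling sequential scans for a single session to see whether the planner is able to use the index at all (a sketch, run through Django's connection; the table name is taken from the explain output above):

from django.db import connection

with connection.cursor() as cursor:
    # Discourage sequential scans for this session only. If the plan then
    # switches to a bitmap/index scan on search_vector_idx, the index is
    # usable and the planner simply preferred a seq scan for LIMIT 10;
    # if it still seq-scans, the index cannot serve this query.
    cursor.execute("SET enable_seqscan = off")
    cursor.execute(
        "EXPLAIN ANALYZE "
        "SELECT id FROM public_data_au_addressesaustralia "
        "WHERE search_vector @@ plainto_tsquery('north') "
        "LIMIT 10"
    )
    for (line,) in cursor.fetchall():
        print(line)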
Comment (Spermatic): Did you run your tests on the AddressesAustralia model? You show the definition of the Addresses model.