unaccent() preventing index usage in Postgres
Asked Answered
L

1

9

I want to retrieve a way with a given name from an OpenStreetMap database imported into PostgreSQL 9.3.5, the OS is Win7 64-bit. In order to be a bit failure tolerant, I use the unaccent extension of Postgres.

My query looks as follows:

SELECT * FROM germany.ways
WHERE lower(tags->'name') like lower(unaccent('unaccent','Weststrasse'))

Query plan:

Seq Scan on ways  (cost=0.00..2958579.31 rows=122 width=465)
  Filter: (lower((tags -> 'name'::text)) ~~ lower(unaccent('unaccent'::regdictionary, 'Weststrasse'::text)))

The strange thing is that this query uses a sequential scan on ways, although an index is present on lower(tags->'name'):

CREATE INDEX ways_tags_name ON germany.ways (lower(tags -> 'name'));

Postgres uses the index as soon as I remove unaccent from the query:

SELECT * FROM germany.ways
WHERE lower(tags->'name') like lower('Weststrasse')

Query plan:

Index Scan using ways_tags_name on ways  (cost=0.57..495.43 rows=122 width=465)
  Index Cond: (lower((tags -> 'name'::text)) = 'weststrasse'::text)
  Filter: (lower((tags -> 'name'::text)) ~~ 'weststrasse'::text)

Why is unaccent preventing Postgres from using the index? In my opinion this doesn't make sense because the result of unaccent (diacritics removal, etc.) should already be completely known before the actual query is executed. So Postgres should be able to use the index. How can the seq scan be avoided when using unaccent?

Lipscomb answered 6/3, 2015 at 12:42 Comment(1)
The column tags must be of data type hstore. This should be in the question, best by providing the table definition (what you get from \d tbl in psql).Septillion
S
19

IMMUTABLE variant of unaccent()

To clarify the misinformation in the accepted, now deleted, incorrect answer:
Expression indexes only allow IMMUTABLE functions (for obvious reasons) and unaccent() is only STABLE. The solution you suggested in the the comment is also problematic. Detailed explanation and a proper solution for that:

Depending on the content of tags->name it may be useful to add unaccent() to the expression index, but that's orthogonal to the question why the index wasn't being used:

Actual problem / solution

The operator LIKE in your query is subtly wrong (most likely). You do not want to interpret 'Weststrasse' as search pattern, you want to match the (normalized) string as is. Replace with the = operator, and you will see a (bitmap) index scan with your current index, regardless of the function volatility of unaccent():

SELECT * FROM germany.ways
WHERE lower(tags->'name') = lower(unaccent('unaccent','Weststrasse'))

Why?

The right operand of LIKE is a pattern. Postgres cannot use a plain btree index for pattern matching (exceptions apply). A LIKE with a plain string as pattern (no special characters) can be optimized with an equality check on the btree index. But if there are special characters in the string, this index is out.

If there is an IMMUTABLE function to the right of LIKE, it can be evaluated immediately and the said optimization is still possible. Quoting the manual on Function Volatility Categories:

IMMUTABLE ...
This category allows the optimizer to pre-evaluate the function when a query calls it with constant arguments.

The same is not possible with a lesser function volatility (STABLE or VOLATILE). That's why your "solution" of faking an IMMUTABLE unaccent() seemed to work, but it's really putting lipstick on a pig.

To reiterate:

  • If you want to work with LIKE and patterns, use a trigram GIN index.
  • If you don't want to work with LIKE and patterns, use the equality operator = (and a B-tree index on the expression lower(tags->'name').
Septillion answered 6/3, 2015 at 17:1 Comment(2)
Wow, a quite detailed explanation. Thanks. However, I do use LIKE for a purpose. This should probably be put in another question, but names in OSM often tend to contain funny characters. So I'm replacing them by an underscore _ to allow for a more fuzzy match. I simplified the query a bit before I posted it, but the LIKE operand remained. Any idea how to use an index-supported = operator and to ignore chars like `/()-.' in the query?Lipscomb
@AlfKortig: I can only answer the question you posted. You asked why and I explained it. Many people read here and hopefully learn something. The currently accepted answer is incorrect and misleading. You are right, your new question should be put in a new question. Comments are not the place. The simplification went wrong in the case at hand. Be sure to give the full, correct picture and add the output of \d tbl in psql. You can always link to this one for context. And you can comment here to get my attention.Septillion

© 2022 - 2024 — McMap. All rights reserved.