PostgreSQL accent + case insensitive search
I'm looking for a way to support case-insensitive and accent-insensitive search with good performance. Until now we had no issues with this on MS SQL Server; on Oracle we had to use Oracle Text, and now we need it on PostgreSQL.

I've found this post about it, but we need to combine it with case-insensitive search. We also need to use indexes, otherwise performance could suffer. Any real-world experience with the best approach for large databases?

Pak answered 20/2, 2015 at 11:31 Comment(2)
Did you check the full text search functions in PostgreSQL? postgresql.org/docs/current/interactive/textsearch.html – Would
The upcoming PostgreSQL 10 release will add support for this via ICU collation support. rhaas.blogspot.com/2017/04/… – Soerabaja

If you need to "combine with case insensitive", there are a number of options, depending on your exact requirements.

Maybe simplest: make the expression index itself case-insensitive.

Building on the function f_unaccent() laid out in the referenced answer:

CREATE INDEX users_lower_unaccent_name_idx ON users(lower(f_unaccent(name)));

Then:

SELECT *
FROM   users
WHERE  lower(f_unaccent(name)) = lower(f_unaccent('João'));

Or you could build the lower() into the function f_unaccent(), to derive something like f_lower_unaccent().
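A minimal sketch of what such a combined function could look like, assuming the `IMMUTABLE` wrapper `f_unaccent()` from the referenced answer exists; the name `f_lower_unaccent` and the table `users` are illustrative:

```sql
-- Hypothetical wrapper combining lower() and f_unaccent();
-- IMMUTABLE so it can be used in an expression index
CREATE OR REPLACE FUNCTION f_lower_unaccent(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE STRICT AS
$$
SELECT lower(public.f_unaccent($1))
$$;

CREATE INDEX users_lower_unaccent_name_idx ON users (f_lower_unaccent(name));

SELECT *
FROM   users
WHERE  f_lower_unaccent(name) = f_lower_unaccent('João');
```

Queries then only have to repeat the single function call, which makes it easier to keep the predicate matching the index expression.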

Or (especially if you need to do fuzzy pattern matching anyway) you can use a trigram index provided by the additional module pg_trgm, building on the above function; it also supports ILIKE. Details:
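A sketch of the trigram approach, again assuming the `f_unaccent()` wrapper from the referenced answer; index and table names are illustrative:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN trigram index on the unaccented expression; trigram indexes
-- support ILIKE, so lower() is not needed in the index expression
CREATE INDEX users_name_trgm_idx ON users
  USING gin (f_unaccent(name) gin_trgm_ops);

-- Case-insensitive, accent-insensitive substring match
SELECT *
FROM   users
WHERE  f_unaccent(name) ILIKE ('%' || f_unaccent('lois') || '%');
```

This would also cover the "find 'Aloïse' by searching 'lois'" case from the comments, since trigram indexes handle leading wildcards that a btree expression index cannot.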

I added a note to the referenced answer.

Or you could use the additional module citext (but I'd rather avoid it):
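For completeness, a minimal citext sketch (table name illustrative). Note that citext only folds case; it does not remove accents, so it would still have to be combined with unaccenting:

```sql
CREATE EXTENSION IF NOT EXISTS citext;

-- citext columns compare case-insensitively out of the box
CREATE TABLE users_ci (name citext);

-- matches 'João' stored in the table, but would NOT match 'Joao'
SELECT * FROM users_ci WHERE name = 'joão';
```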

Marella answered 20/2, 2015 at 18:42 Comment(1)
That's quite a good list of hints to look into. Our requirements are "simple" in theory: if a column contains, say, Firstname = "Aloïse", we want to be able to find the row using for instance "Aloise", "aloise", or even "lois". – Pak

Full-text search dictionary that unaccents, combined with FTS's built-in case folding

FTS is naturally case-insensitive by default:

Converting tokens into lexemes. A lexeme is a string, just like a token, but it has been normalized so that different forms of the same word are made alike. For example, normalization almost always includes folding upper-case letters to lower-case, and often involves removal of suffixes (such as s or es in English).
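The case folding described above can be seen with any built-in configuration; a quick check (the shown lexeme assumes the standard English stemmer):

```sql
-- 'Cars' is lower-cased and stemmed to the lexeme 'car'
SELECT to_tsvector('english', 'Cars');  -- 'car':1
```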

And you can define your own dictionary using unaccent,

CREATE EXTENSION unaccent;

CREATE TEXT SEARCH CONFIGURATION mydict ( COPY = simple );
ALTER TEXT SEARCH CONFIGURATION mydict
  ALTER MAPPING FOR hword, hword_part, word
  WITH unaccent, simple;
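A quick sanity check of the configuration just created: the unaccent filtering dictionary strips the diacritics, and the simple dictionary lower-cases what remains.

```sql
-- accents stripped, case folded
SELECT to_tsvector('mydict', 'Fóó Bar');  -- 'bar':2 'foo':1
```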

Which you can then index with a functional index,

-- Just some sample data...
CREATE TABLE myTable ( myCol )
  AS VALUES ('fóó bar baz'),('qux quz');

-- No index required, but feel free to create one
CREATE INDEX ON myTable
  USING GIST (to_tsvector('mydict', myCol));

You can now query it very simply

SELECT *
FROM myTable
WHERE to_tsvector('mydict', myCol) @@ 'foo & bar'

    mycol    
-------------
 fóó bar baz
(1 row)


Plastid answered 30/5, 2018 at 2:0 Comment(1)
Very clean solution. Works great. – Alpestrine
