PostgreSQL Full Text Search Spanish character Ñ
Asked Answered
M

1

7

I am facing an issue when doing full text search with PostgreSQL on text that contains de Spanish character 'Ñ'

When I try to tokenize the Spanish word 'AÑO' (year) I get the following results depending on if input is upper or lower case:

SELECT to_tsvector('spanish','AÑO'),to_tsquery('spanish','año')
"to_tsvector"   "to_tsquery"
"'aÑo':1"   "'año'"

As you can see result is not the same and it is case sensitive, so it makes my application full text search queries case sensitive if they contain this character.

Is there any way to overcome this issue? I have been searching on PostgreSQL documentation about full text search, and I don't know how to change this behaviour on installed dictionaries.

Thank you so much. Martí

Manzo answered 8/8, 2017 at 12:7 Comment(4)
what is your client_encoding?.. and server one?.. t=# SELECT to_tsvector('spanish','AÑO'),to_tsvector('spanish','año'); to_tsvector | to_tsvector -------------+------------- 'año':1 | 'año':1 (1 row) Time: 69.873 ms t=# show client_encoding; client_encoding ----------------- UTF8 (1 row) cant reproduce with UTF8Ellita
my client and server encoding is UTF8.Import
my db parametres are: CREATE DATABASE cda_repository WITH OWNER = cda_repository_owner ENCODING = 'UTF8' TABLESPACE = cda_repository_fast_ts LC_COLLATE = 'C' LC_CTYPE = 'C' CONNECTION LIMIT = -1; ALTER DATABASE cda_repository SET default_text_search_config = 'pg_catalog.spanish'; Thank you so muchImport
please read Daniel's brilliant answerEllita
R
5

The ability for to_tsvector to convert Ñ into ñ depends on the locale, and specifically on lc_ctype. Presumably your database is using an LC_CTYPE such as C whose knowledge is limited to US-ASCII.

Example with an LC_CTYPE compatible with Unicode:

test=> show lc_ctype;
  lc_ctype   
-------------
 fr_FR.UTF-8
(1 row)

test=> SELECT to_tsvector('spanish','AÑO'),to_tsquery('spanish','año');
 to_tsvector | to_tsquery 
-------------+------------
 'año':1     | 'año'
(1 row)

Note that the downcasing is what you expect.

Opposite example with C:

creation:

CREATE DATABASE cc lc_ctype 'C' template template0;

Note the lack of downcasing, as in the question:

cc=> show lc_ctype ;
 lc_ctype 
----------
 C
(1 row)

cc=> SELECT to_tsvector('spanish','AÑO'),to_tsquery('spanish','año');
 to_tsvector | to_tsquery 
-------------+------------
 'aÑo':1     | 'año'
(1 row)
Reprehensible answered 8/8, 2017 at 21:25 Comment(1)
Yes you are right Daniel: DB parameters are CREATE DATABASE cda_repository WITH OWNER = cda_repository_owner ENCODING = 'UTF8' TABLESPACE = cda_repository_fast_ts LC_COLLATE = 'C' LC_CTYPE = 'C' CONNECTION LIMIT = -1; ALTER DATABASE cda_repository SET default_text_search_config = 'pg_catalog.spanish'; At the beginning we should store international text and for this we use these values. But finally all text are in Spanish. May be we have to think about recreating Database and change LC_COLlATE and LC_CTYPE Thank you so muchImport

© 2022 - 2024 — McMap. All rights reserved.