PostgreSQL 9.1: using COLLATE in SELECT statements

I have a PostgreSQL 9.1 database (locale "en_US.UTF-8") with the following table:

CREATE TABLE branch_language
(
    id serial NOT NULL,
    name_language character varying(128) NOT NULL,
    branch_id integer NOT NULL,
    language_id integer NOT NULL,
    ....
)

The attribute name_language contains names in various languages. The language is specified by the foreign key language_id.
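
For illustration, a few hypothetical sample rows (a sketch: it assumes the elided columns are nullable or have defaults, and uses 42 as the Catalan language_id, matching the query further down):

/* hypothetical test data, not from the original post */
INSERT INTO branch_language (name_language, branch_id, language_id) VALUES
    ('Òptica',                 1, 42),
    ('Àudio, Vídeo, CD i DVD', 2, 42),
    ('Tabac',                  3, 42);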

I have created a few indexes:

/* us english */
CREATE INDEX idx_branch_language_2
    ON branch_language
    USING btree
    (name_language COLLATE pg_catalog."en_US" );

/* catalan */
CREATE INDEX idx_branch_language_5
    ON branch_language
    USING btree
    (name_language COLLATE pg_catalog."ca_ES" );

/* portuguese */
CREATE INDEX idx_branch_language_6
    ON branch_language
    USING btree
    (name_language COLLATE pg_catalog."pt_PT" );

Now, when I run a SELECT, I am not getting the results I expect.

select name_language from branch_language
where language_id=42 -- id of catalan language
order by name_language collate "ca_ES" -- use ca_ES collation

This generates a list of names but not in the order I expected:

Aficions i Joguines
Agència de viatges
Aliments i Subministraments
Aparells elèctrics i il·luminació
Art i Antiguitats
Articles de la llar
Bars i Restaurants
...
Tabac
Àudio, Vídeo, CD i DVD
Òptica

I expected the last two entries, 'Àudio, Vídeo, CD i DVD' and 'Òptica', to be sorted in with the A and O entries rather than after 'Tabac'.

Creating the indexes works, though I don't think they are strictly necessary; they only matter for performance.

The SELECT statement, however, seems to ignore the collate "ca_ES" part.

The problem also occurs when I use other collations. I have tried "es_ES" and "pt_PT", with similar results.
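
To rule out the table data entirely, a standalone test of the collation itself could look like this (a sketch; with a working "ca_ES" locale, 'Àudio' should sort among the A entries and 'Òptica' among the O entries):

/* order three literals directly, bypassing the table */
SELECT n
FROM (VALUES ('Àudio'), ('Òptica'), ('Tabac')) AS t(n)
ORDER BY n COLLATE "ca_ES";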

Invercargill asked 17/10, 2011 at 14:25 Comment(3)
+1 it has everything a good question needs. – Mantooth
Unfortunately not enough to invoke a good answer... – Invercargill
@Invercargill Maybe try the pgsql-general mailing list (archives.postgresql.org/pgsql-general)? Reproduce your question in your post, but also link back to here if you post on the mailing list. – Chico

I can't find a flaw in your design. I have tried.

Locales and collation

I revisited this question. Consider this test case on sqlfiddle. It seems to work just fine. I even created the locale ca_ES.utf8 in my local test server (PostgreSQL 9.1.6 on Debian Squeeze) and added the locale to my DB cluster:

CREATE COLLATION "ca_ES" (LOCALE = 'ca_ES.utf8');

I get the same results as can be seen in the sqlfiddle above.

Note that collation names are identifiers and need to be double-quoted to preserve CamelCase spelling like "ca_ES". Maybe there has been some confusion with other locales in your system? Check your available collations:

SELECT * FROM pg_collation;
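
To narrow the list down, you can also filter for the Catalan entry specifically; collcollate and collctype show the underlying operating-system locale names (column names as in the 9.1 catalog):

SELECT collname, collcollate, collctype
FROM pg_collation
WHERE collname ILIKE '%ca%';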

Generally, collation rules are derived from system locales. Read about the details in the manual here. If you still get incorrect results, I would try to update your system and regenerate the locale for "ca_ES". In Debian (and related Linux distributions) this can be done with:

dpkg-reconfigure locales

NFC

I have one other idea: unnormalized Unicode strings.

Could it be that your 'Àudio' is in fact '̀ ' || 'Audio', i.e. a combining accent followed by a plain 'A'? That would be this character:

SELECT U&'\0300A';        -- combining grave accent (U+0300) followed by 'A'
SELECT ascii(U&'\0300A'); -- 768, the code point of the first character
SELECT chr(768);          -- the combining grave accent on its own

Read more about the grave accent in Wikipedia.
You have to turn on standard_conforming_strings to use Unicode escape strings like in the first line.
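
Concretely (this is the default setting from 9.1 on):

SET standard_conforming_strings = on;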

Note that some browsers cannot display unnormalized Unicode characters correctly, and many fonts have no proper glyph for the special characters, so you may see nothing here, or gibberish. But Unicode allows for that nonsense. Test to see what you have got:

SELECT octet_length('̀A');  -- combining accent + 'A': returns 3 (!)
SELECT octet_length('À');   -- precomposed 'À': returns 2
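
To check whether the table itself contains such decomposed strings, a query along these lines may help; it is only a sketch and only catches the combining grave accent (U+0300), as other combining marks have their own code points:

/* chr(768) is the combining grave accent in a UTF8 database */
SELECT id, name_language
FROM branch_language
WHERE name_language LIKE '%' || chr(768) || '%';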

If that's what your database has contracted, you need to get rid of it or suffer the consequences. The cure is to normalize your strings to NFC. Perl has superior Unicode-foo; you can make use of its libraries in a plperlu function to do this inside PostgreSQL. I have done that to save myself from madness.
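
A minimal sketch of such a function, assuming the plperlu language is installed and Perl's Unicode::Normalize module is available (David Wheeler's article mentioned below has the complete treatment):

CREATE OR REPLACE FUNCTION nfc(text) RETURNS text AS $$
    use Unicode::Normalize;  # core Perl module for Unicode normalization
    return NFC(shift);       # return the NFC form of the input string
$$ LANGUAGE plperlu IMMUTABLE STRICT;

/* hypothetical one-time cleanup of existing data */
UPDATE branch_language SET name_language = nfc(name_language);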

Read the installation instructions in this excellent article about Unicode normalization in PostgreSQL by David Wheeler.
Read all the gory details about Unicode Normalization Forms at unicode.org.

Mantooth answered 20/10, 2011 at 1:24 Comment(6)
I checked the first character from "Àudio, Vídeo, CD i DVD": select octet_length('À') returns 2. The same for "Òptica": select octet_length('Ò') also returns 2. – Invercargill
In the file postgresql.conf, under VERSION/PLATFORM COMPATIBILITY # - Previous PostgreSQL Versions -, it reads #standard_conforming_strings = on – Invercargill
I thought and hoped it would be similar to download.oracle.com/docs/cd/E17952_01/refman-5.0-en/… – Invercargill
@Henri: Sorry, my idea was a bit of a shot in the dark. No luck. I am out of ideas. For all I know, your setup should work. – Mantooth
Your remarks gave me some ideas. No solution to my problem, but a slight improvement to my application anyhow. This database is used in a Django application. I don't have plperl installed in PostgreSQL, so I can't use David Wheeler's setup, but I have changed the code in the models (ORM) so all Unicode strings are stored NFC-normalized. At least some improvement... :-) – Invercargill
@Henri: Glad it didn't all go to waste. ;) – Mantooth

The problem is with accentuation. You have to use an AI (accent-insensitive) collation. Check how to do that in PostgreSQL; in some DBMSs it is called something like ca_ES_AI.
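
PostgreSQL, at least as of 9.1, ships no accent-insensitive collations; a common workaround is the unaccent contrib extension, which strips accents before comparing. A sketch (this sorts on the unaccented value rather than using a true AI collation):

CREATE EXTENSION IF NOT EXISTS unaccent;

SELECT name_language
FROM branch_language
WHERE language_id = 42  -- Catalan
ORDER BY unaccent(name_language);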

Planimeter answered 17/12, 2020 at 12:25 Comment(0)
