postgresql 11 - create database with ICU locale
Asked Answered
R

2

9

I want to use an ICU system-insensitive sorting collation, to avoid sorting differences between postgres11-on-mac vs postgres11-on-Ubuntu. My first test was to dump out my existing Collate=en_US.UTF-8 and pg_restore them into a db created with Collate=en-US-x-icu

Create Database doc has this to say:

To create a database music with a different locale:

CREATE DATABASE music LC_COLLATE 'sv_SE.utf8' LC_CTYPE 'sv_SE.utf8' TEMPLATE template0;

I seem to have the required icu locales already:

select collname, collprovider from pg_collation where collname like 'en_US%';

        collname        | collprovider
------------------------+--------------
 en_US.UTF-8            | c
 en_US                  | c
 en_US.ISO8859-15       | c
 en_US.ISO8859-1        | c
 en_US                  | c
 en_US                  | c
 en-US-x-icu            | i 👈
 en-US-u-va-posix-x-icu | i 👈
(8 rows)

But no luck when creating a database with either icu locales.

CREATE DATABASE test LC_COLLATE = 'en-US-x-icu' TEMPLATE template0;
ksysdb=# CREATE DATABASE test LC_COLLATE = 'en-US-x-icu' TEMPLATE template0;
ERROR:  invalid locale name: "en-US-x-icu"

I can use LC_COLLATE with other locales:

The LC_COLLATE clause does seem to come with some strings attached, such as watching your encoding and specifying an appropriate template. But it seems to give error hints w non-ICU locales.

This works, for example: CREATE DATABASE test LC_COLLATE = 'en_US' TEMPLATE template0;

and this one gives a helpful user message:

ksysdb=# CREATE DATABASE test LC_COLLATE = 'en_US.ISO8859-15' TEMPLATE template0;
ERROR:  encoding "UTF8" does not match locale "en_US.ISO8859-15"
DETAIL:  The chosen LC_COLLATE setting requires encoding "LATIN9".

Note: a related question, PostgreSQL 10 on Linux - LC_COLLATE locale en_US.utf-8 not valid, doesn't seem all that relevant, as the answer talks about generating an OS-level locale to fix the issue. While the ICU locales, as far as I understand, are expressly intended to be separated from the underlying OS.

Roup answered 5/4, 2020 at 20:8 Comment(8)
Currently you can't use ICU locales as the default localeLethbridge
so.... specify it via ALTER COLUMN only? or at CREATE TABLE time? And is this also true on postgresql12, even though the question is for 11 right now?Roup
Yes, that's correctLethbridge
if you write it up and link to some postgresql doc, so that people can see when/if the restriction is lifted, I'll accept.Roup
The only documentation is the source: check_locale in src/backend/utils/adt/pg_locale.c uses setlocale to check if the locale name is valid. That is a C library function, consequently it will only work with C library collations.Amand
This seems to be a feature of the commercial 'EDB Postgres Advanced Server'. It seems you can set default ICU collations per database like CREATE DATABASE collation_db TEMPLATE template0 ENCODING 'UTF8' ICU_SHORT_FORM = 'AN_CU_EX_NX_LROOT'; and even per cluster via initdb --icu-short-form. Nice examples in the old docs.Stepaniestepbrother
Over at stackoverflow.com/questions/59168841 a nice conversion script is given to convert all text columns to icu collations.Stepaniestepbrother
Is this still true for version 15?Sonny
A
5

This is possible from PostgreSQL v15 on:

CREATE DATABASE test
   LOCALE_PROVIDER icu
   ICU_LOCALE "en-US"
   LOCALE "en_US.utf8"
   TEMPLATE template0;

You still need to provide the libc locale, but the ICU library is used for collations.

Amand answered 15/5, 2023 at 12:21 Comment(7)
This doesn't work in PG16. I guess you should skip LOCALE completely. The docs "locale [...] Can be overridden by setting lc_collate, lc_ctype, or icu_locale individually." postgresql.org/docs/16/sql-createdatabase.htmlStrabismus
@Strabismus Oh yes, that works in PostgreSQL v16. But perhaps your PostgreSQL is built without ICU support, or the C library locales have a different name on your operating system.Amand
I have a fresh install of PG 16.2 on Ubuntu and it doesn't work. It does work when I remove LOCALE "en_US.utf8".Strabismus
@Strabismus As I said, that C library locale does not exist on your system. Run locale -a in your shell.Amand
dpkg says otherwise (dpkg -l | grep libicu). I have libicu66:amd64 "International Components for Unicode". If I use bogus ICU_LOCALE it fails, but this works fine: CREATE DATABASE "test_mlnet1" WITH OWNER = molnet LOCALE_PROVIDER icu ICU_LOCALE "pl-PL" TABLESPACE = pg_default TEMPLATE template0 ;Strabismus
@Strabismus Then look at \l and see what C library locale you got.Amand
I did. You might want to check my answer. ICU should not depend on the system. That is the whole point AFAIK. And so far it seem to work as advertised :)Strabismus
S
0

PG15 and later

From PG16 (at least in version 16.2) you might want to skip "LOCALE" all together.

This will work fine and should be enough:

CREATE DATABASE temp_test
LOCALE_PROVIDER icu ICU_LOCALE "pl-PL" TEMPLATE template0;

Postgres 16.2 on Ubuntu

This is how you can test that the ICU_LOCALE actually works and override collation:

DROP TABLE IF EXISTS table1;
CREATE TABLE table1 (
    id SERIAL PRIMARY KEY,
    text_column VARCHAR(255) NOT NULL
);
INSERT INTO table1 (text_column) VALUES ('Zażółć');
INSERT INTO table1 (text_column) VALUES ('Abecki');
INSERT INTO table1 (text_column) VALUES ('Ącki');
INSERT INTO table1 (text_column) VALUES ('definicja');
INSERT INTO table1 (text_column) VALUES ('Żółć');
INSERT INTO table1 (text_column) VALUES ('<nieokreślony>');


SELECT * FROM table1 ORDER BY text_column;

Expected order (notice how national characters are mixed in):

  • <nieokreślony>
  • Abecki
  • Ącki
  • definicja
  • Zażółć
  • Żółć

Note that - quite confusingly - a database listing might show a different collation. But the test above shows that the collation is in fact provided by ICU (the listing is from PGv16.2, this might change in future). enter image description here

PG10 up to PG15

I should mention that for PG version 10 (and later) this would work, but only works on Windows:

CREATE DATABASE "temp_test"
    ENCODING 'UTF8'
    LC_COLLATE 'pl-PL-x-icu'
    LC_CTYPE 'pl-PL-x-icu'
    TEMPLATE template0
;

On Linux and Mac OS you would have to use the COLLATE keyword on each column separately:

CREATE TABLE table1 (
    id SERIAL PRIMARY KEY,
    text_column VARCHAR(255) COLLATE "pl-PL-x-icu" NOT NULL
);

So updating to Postgres version 15 or later is advised if you want to have a consistent sorting on each operating system.

Strabismus answered 20/3, 2024 at 11:48 Comment(5)
I don't know if that's so great, because you ended up with the default C library locale from the shell, which happened to be British English. So you would get for example an English decimal separator for to_char(12.34, '99D99'), which may not be what you want. It would have been better to explicitly specify "pl_PL.UTF-8" for LOCALE (note the spelling).Amand
The spelling of utf8 is not problem. It still fails: ERROR: invalid LC_COLLATE locale name: "pl_PL.UTF-8"; HINT: If the locale name is specific to ICU, use ICU_LOCALE.Strabismus
Since you are on Debian/Ubuntu: sudo locale-gen pl_PL.utf8. Then restart PostgreSQL and run SELECT pg_import_system_collations('pg_catalog');. Then you'll have the locale.Amand
The app I develop is multilingual and formatting is done in the application, but that does seem to work. Kind of weird that formatting still relays on system components in PG.Strabismus
Open source projects tend not to reinvent the wheel. We write a database server, and the resources are limited.Amand

© 2022 - 2025 — McMap. All rights reserved.