Unicode normalization in Postgres

I have a large number of Scottish and Welsh accented place names (with combining grave, acute, circumflex accents and diaereses) which I need to update to their Unicode-normalized form, e.g. the shorter precomposed form U+00E1 (\xe1) for á instead of the decomposed sequence U+0061 + U+0301 (a followed by a combining acute accent).

I found a solution in an old Postgres Nabble mailing-list thread from 2009, using PL/Python:

create or replace function unicode_normalize(str text) returns text as $$
  import unicodedata
  # PL/PythonU (Python 2) passes text arguments in as byte strings,
  # so decode to unicode before recomposing to NFC
  return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ LANGUAGE PLPYTHONU;
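
On a newer installation where only Python 3 is available, the same approach should work with plpython3u (a sketch, assuming the plpython3u extension is installed; under Python 3, PL/Python passes the text argument in as str, so the decode step goes away):

create or replace function unicode_normalize(str text) returns text as $$
  import unicodedata
  # Python 3: the argument is already a unicode string
  return unicodedata.normalize('NFC', str)
$$ LANGUAGE plpython3u;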

The PL/Python function works as expected, but it made me wonder whether there is any way of doing this directly with built-in Postgres functions. I tried various conversions using convert_to, all in vain.
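
Applying the function to the stored names would then be a straightforward UPDATE, along these lines (a sketch only; the table and column names are illustrative, not my real schema):

UPDATE place_names
SET    name = unicode_normalize(name)
WHERE  name <> unicode_normalize(name);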

EDIT: As Craig has pointed out (and it was one of the things I tried):

SELECT convert_to(E'\u00E1', 'iso-8859-1');

returns \xe1, whereas

SELECT convert_to(E'\u0061\u0301', 'iso-8859-1');

fails with the ERROR: character 0xcc81 of encoding "UTF8" has no equivalent in "LATIN1"
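
For comparison, running the decomposed string through the PL/Python helper first makes the conversion succeed, since NFC recomposes U+0061 U+0301 into U+00E1:

SELECT convert_to(unicode_normalize(E'\u0061\u0301'), 'iso-8859-1');

which should then also return \xe1.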

Urbano asked 21/7, 2014 at 11:14 Comment(7)
So you want precomposed instead of decomposed form? As in SELECT E'\u00E1', E'\u0061\u0301'; – Zaratite
Huh. Interesting. I expected Pg to compose and normalize before convert_to but it doesn't. That seems like a bug, frankly, as it means convert_to(E'\u0061\u0301', 'iso-8859-1') fails but convert_to(E'\u00E1', 'iso-8859-1') succeeds. – Zaratite
@Craig, yes, to the first comment. Yes, I tried those transforms and got the "has no equivalent in LATIN1" error. If you think that is a bug, could you post that as an answer, so I can accept it? – Urbano
Let's wait and see if anyone else has ideas first. – Zaratite
@CraigRinger, good idea. I am glad I wasn't missing anything obvious in the convert_to functions. – Urbano
@CraigRinger. A decent amount of time has passed on this. Do you still think it is a bug? – Urbano
Alternative: feed the database with good UTF-8 in the first place. For example, before loading a Microsoft CSV file with the SQL COPY command, run uconv -x any-nfc bugText.csv > goodText.csv in an Ubuntu terminal. – Adlai

I think this is a Pg bug.

In my opinion, PostgreSQL should be normalizing UTF-8 into precomposed form before performing encoding conversions. The results of the conversions shown are wrong.

I'll raise it on pgsql-bugs ... done.

http://www.postgresql.org/message-id/[email protected]

You should be able to follow the thread there.

Edit: pgsql-hackers doesn't appear to agree, so this is unlikely to change in a hurry. I strongly advise you to normalise your UTF-8 at your application input boundaries.
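
If fixing every client isn't practical, one way to enforce the same thing at the database boundary instead is a BEFORE trigger. A rough sketch reusing the unicode_normalize() helper from the question (the trigger, function and table names are illustrative only):

CREATE OR REPLACE FUNCTION place_names_nfc() RETURNS trigger AS $$
BEGIN
  -- recompose to NFC before the row is stored
  NEW.name := unicode_normalize(NEW.name);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER place_names_nfc_trg
  BEFORE INSERT OR UPDATE ON place_names
  FOR EACH ROW EXECUTE PROCEDURE place_names_nfc();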

BTW, this can be simplified down to:

regress=> SELECT 'á' = 'á';
 ?column? 
----------
 f
(1 row)

which is plain crazy-talk, but is permitted. The first is precomposed, the second is not. (To see this result you'll have to copy & paste, and it'll only work if your browser or terminal doesn't normalize the UTF-8.)
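
If your client does normalize on paste, the same comparison can be reproduced with explicit escapes instead:

regress=> SELECT E'\u00E1' = E'\u0061\u0301';
 ?column? 
----------
 f
(1 row)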

If you're using Firefox you might not see the pasted literals above as intended; Chrome renders them correctly. Here's what you should see if your browser handles decomposed Unicode properly:

[Screenshot: decomposed vs precomposed Unicode á comparing as unequal]

Zaratite answered 6/8, 2014 at 0:3 Comment(9)
I have spent a lot of time trying to normalize Unicode anomalies like this, but never could resolve it properly. NFD, NFC, NFKD, NFKC? Each of these solved some problems and created new ones. I came to believe that the concept of Unicode itself is seriously flawed. Too many ways to deceive human eyes and no proper way to normalize. – Goss
@ErwinBrandstetter I tend to agree. There really needs to be one way of doing things. Take a look at Tatsuo Ishii's post about Composition Exclusion characters. Horrible. Alas, that ship's sailed. – Zaratite
@ErwinBrandstetter. I'm glad I'm not the only one who has had my head bent out of shape by Unicode. It does seem overly complex sometimes, though expressing all known writing systems in a computer is no trivial feat, obviously :D – Urbano
@CraigRinger. I see from Tom Lane's response it is considered to be a feature, not a bug. As I said, I have a workaround and have changed this at the client level so the decomposed form doesn't get inserted in the first place. Still, thanks for filing, anyway. – Urbano
I do not agree with Tom on this point. (This is frequently the case.) However, it's true that there's no simple solution to the bug. – Zaratite
Firefox (version 33.1.1) renders it perfectly. I remember using Firefox for rendering Unicode characters since before Chrome was a "mature" browser, so I think that was a somewhat prejudiced comment. Actually, I still use Firefox for my Unicode experiments, together with SciTE (a programmer's text editor). – Tomb
Non-NFC strings typically have more characters, so to detect the problem you can use SELECT length('á'), length('á');, which returns 2 | 1. – Adlai
To feed the database with good UTF-8 (NFC characters), e.g. before loading a Microsoft CSV file with the SQL COPY command, use uconv -x any-nfc bugText.csv > goodText.csv in an Ubuntu terminal. – Adlai
I agree it's a bug. Our system, DuckDB, now normalises all strings to NFC on entry. – Segmentation

PostgreSQL 13 introduced the string function normalize(text [, form]) → text, which is available when the server encoding is UTF8.

> select 'päivää' = 'päivää' as without, normalize('päivää') = normalize('päivää') as with_norm ;
 without | with_norm
---------+-----------
 f       | t
(1 row)

Note that I expect this to bypass any indexes, so using it blindly in a hot production query is likely to be a recipe for disaster.
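
If you do need to filter on the normalized value, one option should be an expression index, so that queries written against the same expression can still use an index (a sketch only; places and name are illustrative names, and normalize() should qualify as immutable for this purpose):

CREATE INDEX places_name_nfc_idx ON places ((normalize(name, NFC)));

SELECT * FROM places WHERE normalize(name, NFC) = normalize('Dùn Èideann', NFC);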

Great news for us who have naively stored NFD filenames from Mac users in our databases.
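
PostgreSQL 13 also added the IS [form] NORMALIZED predicate, so cleaning up such rows could look roughly like this (a sketch; the files table and filename column are illustrative):

UPDATE files
SET    filename = normalize(filename, NFC)
WHERE  filename IS NOT NFC NORMALIZED;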

Killingsworth answered 6/4, 2022 at 13:33 Comment(0)
