I have a large number of Scottish and Welsh accented place names (with combining grave, acute, circumflex and diaeresis marks) which I need to update to their Unicode-normalized (NFC) form, e.g. the shorter precomposed form 00E1 (\xe1) for á instead of the decomposed pair 0061 + 0301 (\x61 \x301).
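Both spellings render as the same glyph but are distinct strings; assuming a UTF-8 database, psql shows the difference directly:

SELECT E'\u00E1' = E'\u0061\u0301';      -- false: same glyph, different code points
SELECT octet_length(E'\u00E1'),
       octet_length(E'\u0061\u0301');    -- 2 bytes vs 3 bytes in UTF-8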
I found a solution in an old Postgres Nabble mailing-list post from 2009, using PL/Python:
create or replace function unicode_normalize(str text) returns text as $$
    import unicodedata
    # plpythonu (Python 2) passes text arguments in as byte strings,
    # so decode to Unicode before normalizing
    return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ LANGUAGE PLPYTHONU;
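As an aside, on a server with plpython3u the decode step should be unnecessary, since PL/Python under Python 3 passes text arguments in as str; a minimal, untested sketch:

create or replace function unicode_normalize(str text) returns text as $$
    import unicodedata
    # plpython3u hands text in as a Python 3 str (already Unicode),
    # so no explicit decode from UTF-8 is needed
    return unicodedata.normalize('NFC', str)
$$ LANGUAGE PLPYTHON3U;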
The PL/Python function works as expected, but it made me wonder whether there is any way of doing this directly with built-in Postgres functions. I tried various conversions using convert_to, all in vain.
EDIT: As Craig has pointed out, and as one of the things I tried:

SELECT convert_to(E'\u00E1', 'iso-8859-1');

returns \xe1, whereas

SELECT convert_to(E'\u0061\u0301', 'iso-8859-1');

fails with ERROR: character 0xcc81 of encoding "UTF8" has no equivalent in "LATIN1". (0xcc81 is the UTF-8 encoding of the combining acute accent U+0301.) The two literals render identically:

SELECT E'\u00E1', E'\u0061\u0301';
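Note for current versions: PostgreSQL 13 added a built-in normalize() function and an IS NFC NORMALIZED predicate, which do exactly this without resorting to PL/Python:

SELECT normalize(E'\u0061\u0301', NFC);    -- returns the precomposed á
SELECT E'\u0061\u0301' IS NFC NORMALIZED;  -- false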
Comments:

"… convert_to … but it doesn't. That seems like a bug, frankly, as it means convert_to(E'\u0061\u0301', 'iso-8859-1'); fails but convert_to(E'\u00E1', 'iso-8859-1') succeeds." – Zaratite

"… COPY command, for a Microsoft CSV file, use uconv -x any-nfc bugText.csv > goodText.csv at Ubuntu terminal." – Adlai