Convert escaped Unicode character back to actual character in PostgreSQL
Asked Answered
U

1

6

Is there a way how I can convert the following string back to the human-readable value? I have some external data where all non-ascii characters are escaped.

Example strings:

16 StringProvider_111=Telefon\u00ED kontakty
17 StringProvider_116=Odpov\u011Bdn\u00E1 osoba

Required Result:

16 StringProvider_111=Telefoní kontakty
17 StringProvider_116=Odpovědná osoba

SQLFiddle

The database has UTF8 encoding and collation cs_CZ.UTF-8

Utrillo answered 21/11, 2013 at 15:1 Comment(1)
B
9

One old trick is using parser for this purpose:

postgres=# select e'Telefon\u00ED kontakty';
     ?column?      
-------------------
 Telefoní kontakty
(1 row)

CREATE OR REPLACE FUNCTION public.unescape(text)
RETURNS text
LANGUAGE plpgsql
AS $function$
DECLARE result text;
BEGIN
  EXECUTE format('SELECT e''%s''', $1) INTO result;
  RETURN result;
END;
$function$

It works, but it is SQL injection vulnerable - so you should to sanitize input text first!

Here is less readable, but safe version - but you have to manually specify one char as escape symbol:

CREATE OR REPLACE FUNCTION public.unescape(text, text) 
 RETURNS text
 LANGUAGE plpgsql
 AS $function$
 DECLARE result text;
 BEGIN
   EXECUTE format('SELECT U&%s UESCAPE %s', 
                         quote_literal(replace($1, '\u','^')),
                         quote_literal($2)) INTO result;
   RETURN result;
 END;
 $function$

Result

postgres=# select unescape('Odpov\u011Bdn\u00E1 osoba','^');
    unescape     
-----------------
 Odpovědná osoba
(1 row)
Baklava answered 21/11, 2013 at 15:44 Comment(5)
Hi, how to solve non-canonic UTF8? Seems de case of "Cecília" at SELECT E'Ceci\u008Dlia';, where instead í it use diacrilic "i+´"Stylography
@PeterKrauss - are you sure so \u008D is correct? I had to use SELECT e'Ceci\u0301lia';Baklava
Sorry (all readers) the comment before check UTF code points (I showed a case that is not diacrilic)... Thanks @PavelStehule, \u008D is not correct, not the expected result for Cecília. It is a bug encode on my input, seems aleatory code, only sometimes input result in correct diacrilic...I am soving here; and using a kind of spell corrector after clean the strange character.Stylography
Important about your 2013 parse-string function unescape() and PostgreSQL project: why in nowadays (2020) there is no internal function for string unescape? The 5k pageviews here + 3k pageviews here shows that many programmers was looking for it (!)... Small continuous quality improvement needs small democracy, but there is no "user community demand" consensus mechanism to vote on small optimizations for pg library. Perhaps we can use Stakoverflow as a first clue for it.Stylography
@PeterKrauss - PostgreSQL 14 will have an unistr function git.postgresql.org/gitweb/…Baklava

© 2022 - 2024 — McMap. All rights reserved.