How to convert literal \u sequences into UTF-8? [duplicate]
Asked Answered
H

1

2

I am loading data dump from external source and some strings contain \uXXXX sequences for the UTF8 chars, like this one:

\u017D\u010F\u00E1r nad S\u00E1zavou

I can check the contents by using E'' constant in psql, but cannot find any function/operator to return me proper value.

I'd like to ask, if it's possible to convert this string with unicode escapes into normal UTF8 without using PL/pgSQL functions?

Harbour answered 11/4, 2012 at 18:17 Comment(9)
The \E'' literal format is not a psql client thing -- any SQL code running on the PostgreSQL server can use that syntax. You can put the string into a literal like E'\u017D\u010F\u00E1r nad S\u00E1zavou' to have it properly interpreted; what else would you like to do?Karikaria
I want to be able to do something like: UPDATE table SET proper = somefunc('\u017D\u010F\u00E1r nad S\u00E1zavou') WHERE id=1;. And get the expected UTF8 string.Harbour
If the database is using UTF8 encoding, you should be able to do something like: UPDATE table SET proper = E'\u017D\u010F\u00E1r nad S\u00E1zavou' WHERE id=1; What do you get when you run SHOW server_encoding;? How about SHOW client_encoding;?Karikaria
Encoding is fine, UTF8. Problem is that I cannot use literal constants. I need a table-wide query, like: UPDATE table SET proper = somefunc(badtext); for a 8M row table. Copy-paste is not an option.Harbour
Where is badtext coming from? What encoding is it in? Is this in a disk file where you could just run a encoding conversion utility on it? Depending on context, one of the convert_* functions in this table might possibly help: postgresql.org/docs/9.1/static/…Karikaria
The mentioned \uXXXX sequences are not the result of the bad encoding conversion. They're just there, literally.Harbour
I found and fixed a weakness in my answer.Vikiviking
@CraigRinger, this question was asked 2 years ago, yet you've marked it as a duplicate for a quite recent one. Should be the other way around, don't you think so?Harbour
It's not just age, it's also how searchable it is, how clear the question is, and how good the answer(s) are. In this case I'd prefer to merge them entirely, but SO doesn't provide for that. Note that there's nothing bad about something being marked a duplicate, it's not a statement that there's something wrong with the question, just that there are similar ones out there.Littlefield
V
4

I don't think there is a built in method for that. Easiest way I can think of is the plpgsql function you wanted to avoid:

CREATE OR REPLACE FUNCTION str_eval(text, OUT t text)
  LANGUAGE plpgsql IMMUTABLE STRICT PARALLEL SAFE AS
$func$
BEGIN
   EXECUTE 'SELECT E''' || replace($1, '''', '''''') || ''''
   USING $1
   INTO t;
END
$func$;

The updated version safeguards against SQLi and is faster, too.

Vikiviking answered 12/4, 2012 at 2:57 Comment(4)
Hum... It is not working, no string change at SELECT E'Ceci\u008Dlia'. Expected Cecília. Same problem when using in select str_eval(x) from t(x).Consternation
@PeterKrauss: E'Ceci\u00EDia' works for me.Vikiviking
oops, sorry Erwin (and readers) my fast and non-checked comment. \u008D is not a diacrilic, it is a non-printable ASCII... Was a "clean bad encode" problem, solved here.Consternation
Now a really important comment... About your 2012 parse-string function str_eval() and PostgreSQL project: why in nowadays (2020) there are no internal function for string unescape? The 3k pageviews here + 5k pageviews here shows that many programmers was looking for it (!)... Small continuous quality improvement needs small democracy, but there is no "user community demand" consensus mechanism to vote on small optimizations for pg library.Consternation

© 2022 - 2024 — McMap. All rights reserved.