Select strange characters on text, not working with LIKE operator
Asked Answered
F

2

0

I try to use this solution and this (for str_eval()) but seems other encode or other UTF8's Normalization Form, perhaps combining diacritical marks...

select distinct  logradouro, str_eval(logradouro)  
from logradouro where logradouro like '%CECi%';
--         logradouro         |          str_eval
------------------------------+----------------------------
-- AV CECi\u008DLIA MEIRELLES | AV CECi\u008DLIA MEIRELLES

PROBLEM: how to select all rows of the table where the problem exists?
That is, where \u occurs?

  • not works with like '%CECi\u%' neither like '%CECi\\u%'
  • works with like E'%CECi\u008D%' but is not generic

For Google, edited after solved question: this is a typical XY problem. In the original question (above) I used ~wrong hypothesis. All the solutions bellow are answers to the following (objective) question:

How to select only printable ASCII text?

"Printable ASCII" is a subset of UTF8, it is "all ASCII that is not a 'control character'".

The "non-printable" control characters are UNICODE hexadecimal 00 to 1F and 7F
(HTML entity � to  +  or decimal 0 to 31 + 127).

PS1: the zero (�) is the "end of text" mark of PostgreSQL text datatype internal representation, so not need to be checked, but no problems to include it in the range.

PS2: about the secondary question "how to convert a word with encode bug to a valid word?",
see an heuristic at my answer.

Freudian answered 16/6, 2020 at 19:54 Comment(2)
Exactly which code points do you want to exclude? Only 1 to 31?Halicarnassus
Hi @LaurenzAlbe, was an ugly input with an encode bug, but 90% of strings are ASCII, so to detect the rows with bug I need only to check non-ASCII, them add some simple filter. The encode problem is that it prints as strange text on Postgresql but when export with COPY it is not visible, and when check by LIKE not selectable... It prints with \u.... but no regular or previsible code point.Freudian
H
1

This condition will exclude any strings that do not entirely consist of printable ASCII characters:

logradouro ~ '[^\u0020-\u007E]'
Halicarnassus answered 18/6, 2020 at 6:55 Comment(6)
Thanks, same count(), so confirmed solution! A secondary question: why psql prints some words as ITA\u0081LIA with \u and others, like DIONÍZIO and , with good visual representation? It is only for "control characters" of a specific range?Freudian
Right. It's the "non-printable" control characters UNICODE 01 - 1F.Halicarnassus
oops, 7F is DEL, see HTML  must be excluded.Freudian
There are more characters, changed the answer.Halicarnassus
Oops, was near perfect, the edit introduced "invalid character range" error. You can use \u0000-\u001F instead of \u0020-\u001F... But the list of strings using not-printable characters is in the condiction logradouro ~ '[\u0000-\u001F\u007F-\u00A0]', and negation to excludes is !~Freudian
Brain fart, sorry. Fixed.Halicarnassus
F
1

Solving with workaround

select distinct  logradouro, str_eval(logradouro)
from logradouro where not(logradouro ~ E'^[a-zA-Z0-9_,;\\- \\.\\(\\)\\/"\'\\*]+$');

There is a systematic bug on encode, no way to convert to correct UTF8... Even converting, the problem is that "CECi\u008DLIA" is not "CECíLIA".

The solution is to use a kind of "heuristic spell corrector" on

regexp_replace(logradouro, E'[^a-zA-Z0-9_,;\\- \\.\\(\\)\\/"\'\\*]+', '!')

Example: the i! of "Ceci!lia" is corrected to í.


NOTICE. Any heuristic solution (or neural network) trained with a specific dataset (specific systematic error source) is a black box solution, valid only for that type of systematic error. There is no generalization for this type of problem.

Freudian answered 16/6, 2020 at 20:6 Comment(0)
H
1

This condition will exclude any strings that do not entirely consist of printable ASCII characters:

logradouro ~ '[^\u0020-\u007E]'
Halicarnassus answered 18/6, 2020 at 6:55 Comment(6)
Thanks, same count(), so confirmed solution! A secondary question: why psql prints some words as ITA\u0081LIA with \u and others, like DIONÍZIO and , with good visual representation? It is only for "control characters" of a specific range?Freudian
Right. It's the "non-printable" control characters UNICODE 01 - 1F.Halicarnassus
oops, 7F is DEL, see HTML  must be excluded.Freudian
There are more characters, changed the answer.Halicarnassus
Oops, was near perfect, the edit introduced "invalid character range" error. You can use \u0000-\u001F instead of \u0020-\u001F... But the list of strings using not-printable characters is in the condiction logradouro ~ '[\u0000-\u001F\u007F-\u00A0]', and negation to excludes is !~Freudian
Brain fart, sorry. Fixed.Halicarnassus

© 2022 - 2024 — McMap. All rights reserved.