How can I find any non ASCII characters in Redshift database
Asked Answered
H

3

5

I've a database table I'd like to return all values where a column contains a non ASCII character anywhere in the string.

Is there an easy way to do this?

I've tried this

select col_name,  regexp_instr(col_name,'[^[:ascii:]]') 
from test_table s
where created > sysdate - 1 
and regexp_instr(col_name,'[^[:ascii:]]') > 0 
limit 5;

but get this error:

error:  Invalid character class name, collating name, or character range.  The error occured while parsing the regular expression: '[^[:>>>HERE>>>ascii:]]'.
  code:      8002
  context:   T_regexp_init
  query:     5059536
  location:  funcs_expr.cpp:130
  process:   query20_31 [pid=7903]

Thanks!

Houchens answered 7/9, 2016 at 16:50 Comment(0)
V
8

I was trying to accomplish something similar recently but @BigDataKid's solution (writing '[^\x00-\x7F]' in the regex expression) won't work.
Usually, a backslash in combination with a literal character can create a regex token with a special meaning, in this case \x represents "the character whose hexadecimal value is" where 00 and 7F are the hex values.
While Postgres supports it (see 9.7.3.3. Regular Expression Escapes), it seems that Redshift's regex engine does not. You can check what exactly Redshift supports here.

The shortest and cleanest workaround I've found for this is:

SELECT column_name,
       CASE
           WHEN regexp_instr(column_name, '[^[:print:][:cntrl:]]') > 0 THEN TRUE
           ELSE FALSE END AS has_non_ascii_char
FROM table_name
WHERE has_non_ascii_char;

Regex explanation:
The outer brackets [ , ] (or a list if you will) is called "bracket expression", the caret (^) means negation, so [^] translates to "match any character not in the list".
In the list we have [:print:] and [:cntrl:] which are called "POSIX character class". [:print:] refers to any printable ASCII character (including space) and [:cntrl:] refers to all ASCII control characters (e.g. tab), together they should cover all ASCII characters.

Ventriculus answered 14/9, 2019 at 19:52 Comment(2)
RiskyMaor, what does [^[:print:][:cntrl:]] do?Opportina
@MarkDickerson I made an edit that added the explanation for the regex expression.Ventriculus
H
0

You can select the range or characters you wish to use by looking at this table: http://unicode-table.com/en/ .

Look at the code below, where I filter based on a specific character range.

dev=# create table tb1 ( id integer, col1 varchar(10) , col2 varchar(10) ) distkey(id);
CREATE TABLE

dev=# insert into tb1 values (1, 'ABC', 'DEF');
INSERT 0 1

dev=# insert into tb1 select 1, chr(638), chr(640);
INSERT 0 1
dev=# insert into tb1 select 4, concat(concat('AB', chr(638)),'CD') , chr(640);                                                                                  
INSERT 0 1

dev=# select * from tb1;
 id | col1  | col2 
----+-------+------
  4 | ABɾCD | ʀ
  1 | ABC   | DEF
  1 | ɾ     | ʀ
(3 rows)

dev=# select id, col1,  regexp_instr(col1,'[^\x00-\x7F]') , col2,  regexp_instr(col2,'[^\x00-\x7F]')  from tb1;
 id | col1  | regexp_instr | col2 | regexp_instr 
----+-------+--------------+------+--------------
  4 | ABɾCD |            3 | ʀ    |            1
  1 | ABC   |            0 | DEF  |            0
  1 | ɾ     |            1 | ʀ    |            1
(3 rows)
Hawkie answered 7/9, 2016 at 21:18 Comment(1)
Thanks, that helped me along the way. I ended up with this (but for some reason it still does not seem to work, but I got what I needed anyway) 'select name2, regexp_instr(name2,'[^(a-z|A-Z|0-9|\w|\s|\t)]'), count () from reporting.data2 s where created > sysdate - 1 and regexp_instr(name2,'[^(a-z|A-Z|0-9|\w|\s|\t)]') > 0 group by artistname, regexp_instr(name2,'[^(a-z|A-Z|0-9|\w|\s|\t)]') order by count () desc'Houchens
G
0
SELECT regexp_replace(column_name, '[^'||chr(32)||'-'||chr(126)||']','') 
FROM table_name 
WHERE regexp_count(column_name, '[^'||chr(32)||'- '||chr(126)||']')> 0 

Perhaps it will be useful to someone and https://www.asciitable.com/

Gate answered 22/12, 2021 at 9:9 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.