Finding and removing Non-ASCII characters from an Oracle Varchar2
Asked Answered
R

18

37

We are currently migrating one of our oracle databases to UTF8 and we have found a few records that are near the 4000 byte varchar limit. When we try and migrate these record they fail as they contain characters that become multibyte UF8 characters. What I want to do within PL/SQL is locate these characters to see what they are and then either change them or remove them.

I would like to do :

SELECT REGEXP_REPLACE(COLUMN,'[^[:ascii:]],'')

but Oracle does not implement the [:ascii:] character class.

Is there a simple way doing what I want to do?

Reincarnate answered 10/2, 2010 at 11:42 Comment(3)
You'd probably want to replace ç by c etcetera. Throwing the entire character away is worse than discarding the diacritical mark.Titrate
We 1st need to find out what the characters are before deciding what to do with them.Reincarnate
This is a destructive process and would you want to preserve with ascii replacements of some characters? dashes, single quotes, double quotes, etc? unistr 0013 -, 0018 ', 0019 ', 001C ", 001D "Aholla
G
7

In a single-byte ASCII-compatible encoding (e.g. Latin-1), ASCII characters are simply bytes in the range 0 to 127. So you can use something like [\x80-\xFF] to detect non-ASCII characters.

Galacto answered 10/2, 2010 at 11:44 Comment(3)
I tried using the hex codes as suggested however:- regexp_replace(column,'[\x00-\xFF]','') Removes nothing by the Capital letters -- do I have escape something or is there something else I need to do?Reincarnate
I run into issues using your solution. This answer has been accepted so I believe it is not outright wrong but 1.) oracle does not support the regex syntax to specify code points/characters by their hex representation (ie.'\x80'); instead you have to specify the characters themselves ( however, the regex pattern is a string expression so you may use something like '['||chr(128)||'-'||chr(255)||']'), 2.) trying to replace all chars in '['||chr(32)||'-'||chr(127)||']' results in an ora-12728 error (invalid range in regex). my db charset is al32utf8. any ideas?Deciare
I should add that 1.) the DB is oracle 11.2.0.3.0, 2.) the ranges 32-122, 32-255 do not cause the error but 3.) applied to a string composed of mixed-case alphabet letters and digits show inverse behaviour to what you expect (ie. REGEXP_REPLACE ( 'abc', '['||chr(32)||'-'||chr(128)||']' , '_' ) produces abc, while REGEXP_REPLACE ( 'abc', '[^'||chr(32)||'-'||chr(128)||']' , '_' ) returns ___ ).Deciare
P
34

I think this will do the trick:

SELECT REGEXP_REPLACE(COLUMN, '[^[:print:]]', '')
Prospector answered 19/11, 2015 at 8:8 Comment(7)
This is neat and works well. Just as an adendum you can also use REGEXP_REPLACE(Column,'[^ -~]','') rather than all those Chr() functions and string concatenations mentioned above.Marella
@Ciaran: REGEXP_REPLACE(Column,'[^ -~]','') is excellent, since Oracle doesn't support '[\x80-\xFF]'. Should be in an answer.Backstairs
@Ciarán Works for me thanks. What does -~ actually mean? Is this standard regex or something unique to Oracle?Witticism
If you want to keep new lines, use regexp_replace(column, '[^ -~|[:space:]]', '')Witticism
It specifies an ascii character range, i.e. Space (character 32) - (to) tilda "~" (character 126) asciitable.comMarella
Expression [ -~] may fail, see #50915430Coagulum
This is a good start, but there are plenty of characters in the "print" class that are not found/removed. This definitely got me going down the right track, so thank you for adding this!Ldopa
F
28

If you use the ASCIISTR function to convert the Unicode to literals of the form \nnnn, you can then use REGEXP_REPLACE to strip those literals out, like so...

UPDATE table SET field = REGEXP_REPLACE(ASCIISTR(field), '\\[[:xdigit:]]{4}', '')

...where field and table are your field and table names respectively.

Froebel answered 29/7, 2011 at 22:42 Comment(3)
If the length of the string is close to 4000 then ASCIISTR() will extend the string beyond this limit and the string will be truncated to 4000 characters (losing the excess characters from the end). SQLFIDDLEFatling
This picks up the backslash character as well which is not desirable as it is asciiKassa
To code around this, where replace(asciistr(field),asciistr('\'),'\') <> fieldGeriatrician
C
12

I wouldn't recommend it for production code, but it makes sense and seems to work:

SELECT REGEXP_REPLACE(COLUMN,'[^' || CHR(1) || '-' || CHR(127) || '],'')
Cantlon answered 14/8, 2013 at 14:34 Comment(1)
Note that you should normally start at 32 instead of 1, since that is the first printable ascii character. The rest are control characters, which would be weird inside text columns (even weirder than >127 I'd say). But yeah technically the answer is correct, this would detect non-ascii characters, given the original 7-bit ascii standard.Selfish
E
8

The select may look like the following sample:

select nvalue from table
where length(asciistr(nvalue))!=length(nvalue)  
order by nvalue;
Eichelberger answered 9/9, 2010 at 21:34 Comment(2)
Good idea, but with this you are actually identifying fields having data where the size in bytes is not the same of the number of the symbols represented by them.Shadshadberry
Also incorrectly returns the "\" key as a non ascii character.Voracity
G
7

In a single-byte ASCII-compatible encoding (e.g. Latin-1), ASCII characters are simply bytes in the range 0 to 127. So you can use something like [\x80-\xFF] to detect non-ASCII characters.

Galacto answered 10/2, 2010 at 11:44 Comment(3)
I tried using the hex codes as suggested however:- regexp_replace(column,'[\x00-\xFF]','') Removes nothing by the Capital letters -- do I have escape something or is there something else I need to do?Reincarnate
I run into issues using your solution. This answer has been accepted so I believe it is not outright wrong but 1.) oracle does not support the regex syntax to specify code points/characters by their hex representation (ie.'\x80'); instead you have to specify the characters themselves ( however, the regex pattern is a string expression so you may use something like '['||chr(128)||'-'||chr(255)||']'), 2.) trying to replace all chars in '['||chr(32)||'-'||chr(127)||']' results in an ora-12728 error (invalid range in regex). my db charset is al32utf8. any ideas?Deciare
I should add that 1.) the DB is oracle 11.2.0.3.0, 2.) the ranges 32-122, 32-255 do not cause the error but 3.) applied to a string composed of mixed-case alphabet letters and digits show inverse behaviour to what you expect (ie. REGEXP_REPLACE ( 'abc', '['||chr(32)||'-'||chr(128)||']' , '_' ) produces abc, while REGEXP_REPLACE ( 'abc', '[^'||chr(32)||'-'||chr(128)||']' , '_' ) returns ___ ).Deciare
B
3

There's probably a more direct way using regular expressions. With luck, somebody else will provide it. But here's what I'd do without needing to go to the manuals.

Create a PLSQL function to receive your input string and return a varchar2.

In the PLSQL function, do an asciistr() of your input. The PLSQL is because that may return a string longer than 4000 and you have 32K available for varchar2 in PLSQL.

That function converts the non-ASCII characters to \xxxx notation. So you can use regular expressions to find and remove those. Then return the result.

Britt answered 10/2, 2010 at 14:15 Comment(0)
K
3

The following also works:

select dump(a,1016), a from (
SELECT REGEXP_REPLACE (
          CONVERT (
             '3735844533120%$03  ',
             'US7ASCII',
             'WE8ISO8859P1'),
          '[^!@/\.,;:<>#$%&()_=[:alnum:][:blank:]]') a
  FROM DUAL);
Kike answered 19/1, 2012 at 18:32 Comment(0)
B
3

You can try something like following to search for the column containing non-ascii character :

select * from your_table where your_col <> asciistr(your_col);
Bluster answered 22/11, 2016 at 10:55 Comment(0)
F
3

I had similar requirement (to avoid this ugly ORA-31061: XDB error: special char to escaped char conversion failed. ), but had to keep the line breaks.

I tried this from an excellent comment

'[^ -~|[:space:]]'

but got this ORA-12728: invalid range in regular expression .

but it lead me to my solution:

select t.*, regexp_replace(deta, '[^[:print:]|[:space:]]', '#') from  
    (select '-   <- strangest thing here, and I want to keep line break after
-' deta from dual ) t

displays (in my TOAD tool) as

in my toad tool

  • replace all that ^ => is not in the sets (of printing [:print:] or space |[:space:] chars)
Florous answered 16/4, 2020 at 7:13 Comment(0)
G
2

I had a similar issue and blogged about it here. I started with the regular expression for alpha numerics, then added in the few basic punctuation characters I liked:

select dump(a,1016), a, b
from
 (select regexp_replace(COLUMN,'[[:alnum:]/''%()> -.:=;[]','') a,
         COLUMN b
  from TABLE)
where a is not null
order by a;

I used dump with the 1016 variant to give out the hex characters I wanted to replace which I could then user in a utl_raw.cast_to_varchar2.

Gametophyte answered 10/2, 2010 at 22:29 Comment(0)
B
2

I found the answer here:

http://www.squaredba.com/remove-non-ascii-characters-from-a-column-255.html

CREATE OR REPLACE FUNCTION O1DW.RECTIFY_NON_ASCII(INPUT_STR IN VARCHAR2)
RETURN VARCHAR2
IS
str VARCHAR2(2000);
act number :=0;
cnt number :=0;
askey number :=0;
OUTPUT_STR VARCHAR2(2000);
begin
str:=’^'||TO_CHAR(INPUT_STR)||’^';
cnt:=length(str);
for i in 1 .. cnt loop
askey :=0;
select ascii(substr(str,i,1)) into askey
from dual;
if askey < 32 or askey >=127 then
str :=’^'||REPLACE(str, CHR(askey),”);
end if;
end loop;
OUTPUT_STR := trim(ltrim(rtrim(trim(str),’^'),’^'));
RETURN (OUTPUT_STR);
end;
/

Then run this to update your data

update o1dw.rate_ipselect_p_20110505
set NCANI = RECTIFY_NON_ASCII(NCANI);
Berri answered 9/8, 2011 at 20:47 Comment(0)
D
2

Try the following:

-- To detect
select 1 from dual
where regexp_like(trim('xx test text æ¸¬è© ¦ “xmx” number²'),'['||chr(128)||'-'||chr(255)||']','in')

-- To strip out
select regexp_replace(trim('xx test text æ¸¬è© ¦ “xmxmx” number²'),'['||chr(128)||'-'||chr(255)||']','',1,0,'in')
from dual
Darice answered 24/4, 2012 at 2:27 Comment(0)
A
1

Thanks, this worked for my purposes. BTW there is a missing single-quote in the example, above.

REGEXP_REPLACE (COLUMN,'[^' || CHR (32) || '-' || CHR (127) || ']', ' '))

I used it in a word-wrap function. Occasionally there was an embedded NewLine/ NL / CHR(10) / 0A in the incoming text that was messing things up.

Aculeus answered 29/8, 2014 at 20:25 Comment(0)
U
0

Answer given by Francisco Hayoz is the best. Don't use pl/sql functions if sql can do it for you.

Here is the simple test in Oracle 11.2.03

select s
     , regexp_replace(s,'[^'||chr(1)||'-'||chr(127)||']','') "rep ^1-127"
     , dump(regexp_replace(s,'['||chr(127)||'-'||chr(225)||']','')) "rep 127-255"
from (
select listagg(c, '') within group (order by c) s
  from (select 127+level l,chr(127+level) c from dual connect by level < 129))

And "rep 127-255" is

Typ=1 Len=30: 226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255

i.e for some reason this version of Oracle does not replace char(226) and above. Using '['||chr(127)||'-'||chr(225)||']' gives the desired result. If you need to replace other characters just add them to the regex above or use nested replace|regexp_replace if the replacement is different then '' (null string).

Udele answered 23/4, 2014 at 0:4 Comment(0)
H
0

Please note that whenever you use

regexp_like(column, '[A-Z]')

Oracle's regexp engine will match certain characters from the Latin-1 range as well: this applies to all characters that look similar to ASCII characters like Ä->A, Ö->O, Ü->U, etc., so that [A-Z] is not what you know from other environments like, say, Perl.

Instead of fiddling with regular expressions try changing for the NVARCHAR2 datatype prior to character set upgrade.

Another approach: instead of cutting away part of the fields' contents you might try the SOUNDEX function, provided your database contains European characters (i.e. Latin-1) characters only. Or you just write a function that translates characters from the Latin-1 range into similar looking ASCII characters, like

  • å => a
  • ä => a
  • ö => o

of course only for text blocks exceeding 4000 bytes when transformed to UTF-8.

Hollishollister answered 31/10, 2015 at 21:51 Comment(0)
L
0

As noted in this comment, and this comment, you can use a range.
Using Oracle 11, the following works very well:

SELECT REGEXP_REPLACE(dummy, '[^ -~|[:space:]]', '?') AS dummy FROM DUAL;

This will replace anything outside that printable range as a question mark.

This will run as-is so you can verify the syntax with your installation.
Replace dummy and dual with your own column/table.

Ldopa answered 6/10, 2021 at 14:12 Comment(0)
C
-2

Do this, it will work.

trim(replace(ntwk_slctor_key_txt, chr(0), ''))
Crossfade answered 4/8, 2015 at 22:17 Comment(1)
Welcome to Stack Overflow! This answer turned up in the low quality review queue, presumably because you didn't explain the code. If you do explain it (in your answer), you are far more likely to get more upvotes—and the questioner is more likely to learn something!Bull
P
-3

I'm a bit late in answering this question, but had the same problem recently (people cut and paste all sorts of stuff into a string and we don't always know what it is). The following is a simple character whitelist approach:

SELECT est.clients_ref
  ,TRANSLATE (
              est.clients_ref
             ,   'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
              || REPLACE (
                          TRANSLATE (
                                     est.clients_ref
                                    ,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
                                    ,'~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
                                    )
                         ,'~'
                         )
             ,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
             )
      clean_ref

FROM edms_staging_table est

Papule answered 5/11, 2015 at 23:20 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.