Skip/remove non-ascii character with sed
Chip,Dirkland,DrobæSphere Inc,[email protected],usa

I've been trying to use sed to modify email addresses in a .csv, but the line above keeps tripping me up. A command like

sed -i 's/[\d128-\d255]//' FILENAME

(from this Stack Overflow question) doesn't work: I get an 'invalid collation character' error.

Ideally I don't want to change that combined AE character at all. I'd rather sed just skip right over it, as I'm not trying to manipulate that text but rather the email addresses. As long as that AE is in there, though, it causes my sed substitution to fail after one line; if I delete the character, sed processes the whole file fine.

Any ideas?

Trapezohedron answered 20/12, 2011 at 6:34 Comment(0)
A
6

This might work for you (GNU sed):

echo "Chip,Dirkland,DrobæSphere Inc,[email protected],usa" |
sed 's/\o346/a+e/g'
Chip,Dirkland,Droba+eSphere Inc,[email protected],usa

Then do what you have to do, and afterwards revert with:

echo "Chip,Dirkland,Droba+eSphere Inc,[email protected],usa" | 
sed 's/a+e/\o346/g'
Chip,Dirkland,DrobæSphere Inc,[email protected],usa

If you have tricky characters in strings and want to understand how sed sees them, use the l0 command (see here). It is also very useful for debugging difficult regexps.

echo "Chip,Dirkland,DrobæSphere Inc,[email protected],usa" | 
sed -n 'l0'
Chip,Dirkland,Drob\346Sphere Inc,[email protected],usa$
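
Which bytes sed sees depends on how the file is encoded, and \o346 only matches the single Latin-1 byte. As a minimal sketch (the helper name show_octal is made up for illustration, and od is swapped in for sed's l0 so it does not depend on GNU sed), the following dumps the bytes around the character for two possible encodings:

```shell
# Hypothetical helper: print every byte of stdin as a three-digit octal number.
# xargs joins od's columns into one space-separated line (fine for short input).
show_octal() {
    od -An -to1 | xargs
}

# Latin-1: the ligature is the single byte \346, which is what \o346 matches.
printf 'b\346S' | show_octal        # 142 346 123

# UTF-8: the same character is the two bytes \303\246, so \o346 alone would not match it.
printf 'b\303\246S' | show_octal    # 142 303 246 123
```

If the dump shows some other sequence, the substitution would need to target those bytes instead.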
Aranyaka answered 20/12, 2011 at 10:52 Comment(5)
+1 for the l0. There is also a sedsed.py script, available here, which is useful for inspecting the pattern and hold spaces. It might not help in this case, but it is a useful debugging tool nonetheless. :) - Collective
That sed -n 'l0' command is interesting. What it prints for the company is: Drob\357\277\275Sphere Inc - Trapezohedron
And I still can't get the examples above to work with it. Perhaps the character (which shows as an AE in Windows LibreOffice but nowhere else) is actually a special character saying it can't be represented in Unicode? fileformat.info/info/unicode/char/fffd/index.htm - Trapezohedron
I never did get any of the answers on this page to work perfectly, but potong's solution got me the closest, and the command provided more exact detail on what was going wrong. - Trapezohedron
This does not help to remove all non-ASCII characters; it only removes the specific one given in the example. - Urushiol
H
5
sed -i 's/[^[:print:]]//' FILENAME

Also, this acts like dos2unix
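
What [:print:] covers depends on the locale. A minimal sketch, assuming the C locale is forced (where only the ASCII characters from space to ~ count as printable) and adding a g flag so every offending byte on a line is removed; the function name strip_nonprint is hypothetical:

```shell
# Hypothetical wrapper: delete every byte that is not printable ASCII.
# LC_ALL=C makes [:print:] mean the range space..~ only, so carriage
# returns, tabs and all bytes >= 128 are removed.
strip_nonprint() {
    LC_ALL=C sed 's/[^[:print:]]//g'
}

# "für" encoded as UTF-8, followed by a DOS line ending (\r\n)
printf 'f\303\274r\r\n' | strip_nonprint   # fr
```

Without the forced C locale, a UTF-8 environment would treat ü as printable and leave it in place.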

Hanna answered 17/1, 2012 at 18:48 Comment(1)
Does not work. [:print:] is not the same as ASCII, e.g. ü is printable but not ASCII. - Urushiol
H
3

The issue you are having is the locale.

If you want to use a collation range like that, you need to change the character type and the collation type.

This fails because the bytes \x80 through \xff are not valid characters on their own in a UTF-8 string. Note that \u0080 != \x80 in UTF-8.

To get this to work, just do

LC_ALL=C sed -i 's/[\d128-\d255]//' FILENAME

This overrides LC_CTYPE and LC_COLLATE for that one command and does what you want.
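
A minimal sketch of that approach on a stream rather than a file (assuming GNU sed, since \dNNN is a GNU extension; the function name strip_high_bytes is made up, and a g flag is added so that every match on a line is removed, not only the first):

```shell
# Hypothetical wrapper around the command above, reading stdin instead of
# editing FILENAME in place. In the C locale each byte is its own character,
# so a range over bytes 128-255 is a valid range.
strip_high_bytes() {
    LC_ALL=C sed 's/[\d128-\d255]//g'
}

# The company name with a UTF-8 encoded ligature (bytes \303\246)
printf 'Drob\303\246Sphere Inc\n' | strip_high_bytes   # DrobSphere Inc
```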

Hindward answered 11/9, 2020 at 3:50 Comment(0)
E
2

I came here after trying the sed command s/[\x00-\x1F]/ /g;, which gave me the same error message.

In this case it suffices to remove \x00 from the range, yielding s/[\x01-\x1F]/ /g;

Unfortunately it seems like all characters above and including \x7F and some others are disallowed, as can be seen with this short script:

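for (( i=0; i<255; i++ )); do
    # Print the decimal code and its hex form, then try a range starting at that code;
    # sed reports an error for every range it does not accept
    printf '== %d - \\x%02X ==\n' "$i" "$i"
    echo '' | sed -E "s/[\d$i-\d$((i+1))]//g"
done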

Note that the problem is only the use of those characters to specify a range. You can still list them all individually, either by hand or with a script. E.g. to come back to your example:

sed -i 's/[\d128-\d255]//' FILENAME

would become

c=; for (( i=128; i<=255; i++ )); do c="$c\d$i"; done
sed -i 's/['"$c"']//' FILENAME

which would translate to:

sed -i 's/[\d128\d129\d130\d131\d132\d133\d134\d135\d136\d137\d138\d139\d140\d141\d142\d143\d144\d145\d146\d147\d148\d149\d150\d151\d152\d153\d154\d155\d156\d157\d158\d159\d160\d161\d162\d163\d164\d165\d166\d167\d168\d169\d170\d171\d172\d173\d174\d175\d176\d177\d178\d179\d180\d181\d182\d183\d184\d185\d186\d187\d188\d189\d190\d191\d192\d193\d194\d195\d196\d197\d198\d199\d200\d201\d202\d203\d204\d205\d206\d207\d208\d209\d210\d211\d212\d213\d214\d215\d216\d217\d218\d219\d220\d221\d222\d223\d224\d225\d226\d227\d228\d229\d230\d231\d232\d233\d234\d235\d236\d237\d238\d239\d240\d241\d242\d243\d244\d245\d246\d247\d248\d249\d250\d251\d252\d253\d254\d255]//' FILENAME
Entail answered 2/5, 2016 at 20:43 Comment(2)
"Unfortunately it seems like all characters above and including \x7F and some others are disallowed". Thanks! That explained why I'm getting the Invalid collation character error. - Hospitality
Very helpful to identify that \u0000 can't be used as part of a range as well. - Socha
H
1

In this case there is a way to just skip the non-ASCII characters without bothering to remove them:

LANG=C sed /someemailpattern/

See https://bugzilla.redhat.com/show_bug.cgi?id=440419 and Will sed (and others) corrupt non-ASCII files?.
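
As a minimal sketch of that idea: the substitution below is only a stand-in for /someemailpattern/, the address chip@example.com is invented because the original one is not shown, and LC_ALL is used instead of LANG because it also overrides any LC_CTYPE already set in the environment.

```shell
# Hypothetical example: rewrite the domain of the e-mail field and leave
# every other byte of the line, valid UTF-8 or not, exactly as it was.
swap_domain() {
    LC_ALL=C sed 's/@[^,]*/@example.org/'
}

# \346 is a lone Latin-1 byte, i.e. not valid UTF-8; it passes through untouched
printf 'Chip,Dirkland,Drob\346Sphere Inc,chip@example.com,usa\n' | swap_domain
```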

Hayes answered 3/4, 2012 at 15:0 Comment(0)
C
0

How about using awk for this? We set the field separator to nothing, then loop over each character. An if statement checks whether the character matches our character class; if it does we print it, otherwise we ignore it.

awk -v FS="" '{for(i=1;i<=NF;i++) if($i ~ /[A-Za-z,.@ ]/) printf "%s", $i}'

Test:

[jaypal:~/Temp] echo "Chip,Dirkland,DrobæSphere Inc,[email protected],usa" | 
awk -v FS="" '{for(i=1;i<=NF;i++) if($i ~ /[A-Za-z,.@ ]/) printf "%s", $i}'
Chip,Dirkland,DrobSphere Inc,[email protected],usa

Update:

awk -v FS="" '{for(i=1;i<=NF;i++) if($i ~ /[A-Za-z,.@ ]/) printf "%s", $i; printf "\n"}' < datafile.csv > asciidata.csv

I have added printf "\n" after the loop to keep the lines separate.
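
The character class above only lets through letters, commas, dots, @ and spaces, so digits or hyphens in an address would be dropped as well. A variation under stated assumptions (C locale forced, all printable ASCII kept, function name to_ascii hypothetical, sample address invented) uses gsub instead of a per-character loop:

```shell
# Hypothetical variant: delete every byte outside the printable ASCII
# range (space through ~) and print the rest of the line unchanged.
# Note that tabs fall outside that range and are removed too.
to_ascii() {
    LC_ALL=C awk '{ gsub(/[^ -~]/, ""); print }'
}

printf 'Chip,Dirkland,Drob\303\246Sphere Inc,chip42@example.com,usa\n' | to_ascii
# Chip,Dirkland,DrobSphere Inc,chip42@example.com,usa
```

It can be applied to a whole file in the same way as before: to_ascii < datafile.csv > asciidata.csv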

Collective answered 20/12, 2011 at 7:47 Comment(2)
Thanks Jaypal, how would this be modified if you wanted to process datafile.csv and output asciidata.csv? - Trapezohedron
If you only want the e-mail addresses extracted from your input file, then awk can do that in a breeze without any complex regex. Let me know how it works out. - Collective
