Remove non-ASCII characters from CSV

J

11

69

I want to remove all the non-ASCII characters from a file in place.

I found one solution with tr, but I guess I would need to write the file back after modification.

I need to do it in place with relatively good performance.

Any suggestions?

Jacquelynejacquelynn answered 26/7, 2010 at 18:47 Comment(2)

can you provide a link to the one liner with tr? – Learning 28/6, 2016 at 19:0

The OP probably(?) meant non-printable characters (Ctrl-C, Unicode U+0003, is an ASCII character). The question should also specify the locale; without that information one could (should?) assume the "C" locale was meant. A naive answer would be to strip any byte greater than 0x7f, which would preserve characters that are not printable in the C locale but are perfectly legitimate ASCII characters. I'm downvoting the question for these reasons, which make it too vague. – Theatrics 7/3, 2018 at 0:58
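To make the distinction in that comment concrete, here is a minimal sketch using tr, the tool the question mentions. The function names are hypothetical, and the choice to keep tab, newline and carriage return in the stricter variant is an assumption, not something the question specifies.

```shell
# Delete every byte above 0x7f. ASCII control characters (0x00-0x1f, 0x7f) survive.
strip_high_bytes() {
    LC_ALL=C tr -d '\200-\377'
}

# Keep only tab, LF, CR and printable ASCII (0x20-0x7e); delete everything else.
keep_printable_ascii() {
    LC_ALL=C tr -cd '\11\12\15\40-\176'
}

# tr cannot edit a file in place, so write to a temporary file and rename it.
# The temporary file is created by mktemp, so the original file mode is not preserved.
strip_file() {
    tmp=$(mktemp "$1.XXXXXX") || return 1
    if strip_high_bytes < "$1" > "$tmp"; then
        mv "$tmp" "$1"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Usage would be something like `strip_file data.csv`, where data.csv is a placeholder name. This is the "write back" step the question refers to, done explicitly.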

P

53

# -i (inplace)

sed -i 's/[\d128-\d255]//g' FILENAME

Petta answered 26/7, 2010 at 18:51 Comment(11)

@Sujit: Note that sed -i still creates an intermediate file. It just does it behind the scenes. – Duhamel 26/7, 2010 at 19:57

@Dennis - then what would be the better solution? – Jacquelynejacquelynn 26/7, 2010 at 20:43

@Sujit: There's not a better solution. I just wanted to point out that an intermediate file is still created. Sometimes that matters. I just didn't want you to be under the assumption that it was doing it literally in place. – Duhamel 26/7, 2010 at 21:22
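One way to see the intermediate file is to compare the file's inode number before and after the edit. The sketch below assumes GNU sed (BSD/macOS sed requires an argument after -i) and uses a hypothetical helper name, sed_replaces_file.

```shell
# Returns 0 if "sed -i" replaced the file with a new one (different inode).
sed_replaces_file() {
    f=$(mktemp) || return 2
    printf 'caf\303\251\n' > "$f"
    before=$(ls -i "$f" | awk '{print $1}')
    # Same pattern as the answer, run in the C locale so bytes are matched as bytes.
    LC_ALL=C sed -i 's/[\d128-\d255]//g' "$f"
    after=$(ls -i "$f" | awk '{print $1}')
    rm -f "$f"
    [ "$before" != "$after" ]
}

if sed_replaces_file; then
    echo "sed -i wrote a new file and renamed it over the original"
else
    echo "inode unchanged"
fi
```

In practice this means the edit needs enough free disk space for a second copy of the file, and any hard links to the original will still point at the old contents.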

On MacOSX, sed: 1: "FILENAME": unterminated substitute pattern – Decentralization 8/8, 2012 at 15:1

sed -i "s/[\d128-\d255]//g" FILE works for me on CentOS w/ GNU sed. You may have to use a different quoting strategy (double quotes instead of single) depending on your OS/shell. – Admiralty 9/8, 2013 at 16:27

Prints "Invalid collation character" on GNU sed 4.2.1. – Latrishalatry 18/6, 2014 at 15:16

I can avoid the "invalid collation character" error with LANG=C sed -i 's/[\d128-\d255]//g' FILE – Dardan 30/12, 2014 at 21:58

@Dardan then your setup is broken. C locale implies 7-bit characters, and should generate that error with that pattern space. I recommend using a locale that has 8-bit characters, like iso-8859-1. That worked for me. – Poon 26/1, 2015 at 18:39

On cygwin I got the same problem as @JasonC and Patrick's solution didn't fix it for me. I used the Perl solution below. – Carlitacarlo 10/11, 2016 at 21:15

@Carlitacarlo Try using double backslashes with cygwin. related discussion – Transformism 19/1, 2018 at 16:47

I fixed the "Invalid collation character" error by prefixing the sed invocation with LC_ALL=C. – Smetana 2/1, 2021 at 12:16
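Pulling the locale comments together, a minimal sketch of the byte-oriented variant might look like this. It assumes GNU sed, since \dNNN is a GNU extension, and the function name strip_non_ascii_sed is hypothetical.

```shell
# In the C locale sed matches raw bytes, so the range 128-255 is accepted
# instead of being rejected as an invalid collation character.
strip_non_ascii_sed() {
    LC_ALL=C sed 's/[\d128-\d255]//g' "$@"
}

# Reads stdin when no file is given.
printf 'na\303\257ve caf\303\251\n' | strip_non_ascii_sed
```

Setting LC_ALL for the single command, rather than exporting it, leaves the locale of the rest of the shell session untouched.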

A

87

A perl oneliner would do: perl -i.bak -pe 's/[^[:ascii:]]//g' <your file>

-i says that the file is going to be edited inplace, and the backup is going to be saved with extension .bak.

Argonaut answered 26/7, 2010 at 18:52 Comment(5)

This one is also usable with stdin as input. – Decentralization 8/8, 2012 at 14:59

The perl solution is faster than the sed solution. Updating a 122 GB file took 3 hours with sed, while perl took just under 2 hours for me. – Empiric 15/9, 2014 at 19:1

I couldn't get the sed solution to work in my environment (Ubuntu gnu sed 4.2.2) but this worked like a charm. – Communize 1/6, 2015 at 12:2

Tried everything and this was the only one that worked for me. Gotta love the power of Perl. Thanks! – Taveda 20/12, 2016 at 19:0

However, when replacing a non-ASCII character with, say, '?', the output is '??'. I suspect perl replaces each of the two bytes of the UTF-8 encoded character, giving one '?' per byte: $ echo "é" | perl -pe 's/[^[:ascii:]]/?/g' prints ??. – Hak 24/4, 2023 at 13:9
