Remove non-ASCII characters from CSV
J

11

69

I want to remove all the non-ASCII characters from a file in place.

I found one solution using tr, but as far as I can tell I would then have to write the result back to the file myself.

I need to do it in place with relatively good performance.

Any suggestions?

Jacquelynejacquelynn answered 26/7, 2010 at 18:47 Comment(2)
can you provide a link to the one liner with tr?Learning
The OP probably(?) meant non-printable characters (Ctrl-C, U+0003, is an ASCII character). The question should also specify the locale; without that information one could (should?) assume the "C" locale was meant. A naive answer would be to strip any byte greater than 0x7f, which would preserve characters that are not printable in the C locale but are perfectly legitimate ASCII characters. I'm downvoting the question because these issues make it too vague.Theatrics
P
53
# -i (inplace)

sed -i 's/[\d128-\d255]//g' FILENAME
Petta answered 26/7, 2010 at 18:51 Comment(11)
@Sujit: Note that sed -i still creates an intermediate file. It just does it behind the scenes.Duhamel
@Dennis - then what would be the better solution?Jacquelynejacquelynn
@Sujit: There's not a better solution. I just wanted to point out that an intermediate file is still created. Sometimes that matters. I just didn't want you to be under the assumption that it was doing it literally in place.Duhamel
On MacOSX, sed: 1: "FILENAME": unterminated substitute patternDecentralization
sed -i "s/[\d128-\d255]//g" FILE works for me on centos w/ GNU sed. You may have to use different quoting strategy (double quotes instead of single) depending on your OS/shell.Admiralty
Prints "Invalid collation character" on GNU sed 4.2.1.Latrishalatry
I can avoid the "invalid collation character" error with LANG=C sed -i 's/[\d128-\d255]//g' FILEDardan
@Dardan then your setup is broken. C locale implies 7-bit characters, and should generate that error with that pattern space. I recommend using a locale that has 8-bit characters, like iso-8859-1. That worked for me.Poon
On cygwin I got the same problem as @JasonC and Patrick's solution didn't fix it for me. I used the Perl solution below.Carlitacarlo
@Carlitacarlo Try using double backslashes with cygwin. related discussionTransformism
I fixed the "Invalid collation character" error by prefixing the sed invocation with LC_ALL=C.Smetana
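For anyone wondering what "an intermediate file behind the scenes" looks like, here is a rough sketch of the same write-then-rename idea done by hand with tr. The function name strip_non_ascii is made up for illustration, and mktemp is assumed to be available (it is not part of POSIX). This is not what sed does internally line for line, just the general pattern.

```shell
#!/bin/sh
# Hypothetical helper: strip every byte >= 0x80 from a file by writing a
# temporary copy next to it and renaming that copy over the original.
strip_non_ascii() {
    file=$1
    # Create the temp file in the same directory so that mv is a plain
    # rename rather than a copy across filesystems.
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    # LC_ALL=C makes tr work on raw bytes; octal 200-377 is decimal 128-255.
    if LC_ALL=C tr -d '\200-\377' < "$file" > "$tmp"; then
        mv -f "$tmp" "$file"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Usage: strip_non_ascii data.csv
```

Note that the result is a new file with the temp file's permissions, so this sketch ignores the original's mode and ownership. Either way you need enough free disk space for a second copy while it runs.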
A
87

A perl oneliner would do: perl -i.bak -pe 's/[^[:ascii:]]//g' <your file>

-i tells perl to edit the file in place; because of the .bak suffix, a backup of the original is saved with the extension .bak.

Argonaut answered 26/7, 2010 at 18:52 Comment(5)
This one is also usable with stdin as input.Decentralization
The perl solution is faster than the sed solution. Updating a 122 GB file took 3 hours with sed, while perl took a little under 2 hours for me.Empiric
I couldn't get the sed solution to work in my environment (Ubuntu gnu sed 4.2.2) but this worked like a charm.Communize
Tried everything and this was the only one that worked for me. Gotta love the power of Perl. Thanks!Taveda
However, when I try to replace a non-ASCII character with, say, '?', I get '??'. My guess is that perl replaces each of the two bytes of the UTF-8 encoded character separately, so there is one '?' per byte. $ echo "é" | perl -pe 's/[^[:ascii:]]/?/g' ??Hak
P
41

I tried all the solutions and nothing worked. The following, however, does:

tr -cd '\11\12\15\40-\176'

Which I found here:

https://alvinalexander.com/blog/post/linux-unix/how-remove-non-printable-ascii-characters-file-unix

My problem needed it in a series of piped programs, not directly from a file, so modify as needed.

Pantomime answered 21/12, 2017 at 5:39 Comment(0)
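For readers who don't think in octal, the set in that tr command breaks down as tab (\11), line feed (\12), carriage return (\15), and the printable range from space (\40) to tilde (\176). A small sketch of it in a pipeline; the function name keep_printable_ascii is made up, and LC_ALL=C is added on the assumption that you want tr to work on raw bytes regardless of your locale:

```shell
# Keep only tab, LF, CR and space..tilde; -c complements the set and
# -d deletes everything in the complement.
keep_printable_ascii() {
    LC_ALL=C tr -cd '\11\12\15\40-\176'
}

# The two bytes of "é" and the \001 control byte are dropped.
printf 'caf\303\251\001,1\n' | keep_printable_ascii   # prints: caf,1
```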
B
20

Try tr instead of sed

tr -cd '[:print:]' < file.txt
Blocky answered 28/2, 2018 at 10:24 Comment(2)
The OP specifically mentioned he didn't want to use tr (because he wanted an "in place" conversion, which sed -i pretends to be; it really writes to a temp file and renames it behind the scenes). So this answer doesn't help the OP. BUT... for those who want to use tr, you might want to preserve newlines (the 2018-02-28 version shown here does not). A simple tweak preserves newlines and carriage returns: tr -cd '[:print:]\n\r' < file.txtTheatrics
tr -cd '[:print:]' <file.txt | sponge file.txtRutter
S
16
sed -i 's/[^[:print:]]//' FILENAME

Also, this acts like dos2unix

Schnurr answered 17/1, 2012 at 18:59 Comment(3)
Does not work. [:print:] is not the same as ASCII. There are many printable non-ASCII characters.Latrishalatry
Also the g modifier is missing. Only the first non-printable character would be removed.Uranic
@JasonC There are also many non-printable ASCII characters. It's likely the original question was poorly formed.Theatrics
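To make the point about the g modifier concrete, here is a small comparison on a line containing two control bytes. LC_ALL=C is an assumption added so that [:print:] means printable ASCII only; cat -v is used just to make the leftover control byte visible.

```shell
# Without g, only the first non-printable byte on each line is removed.
printf 'a\001b\002c\n' | LC_ALL=C sed 's/[^[:print:]]//'  | cat -v   # prints: ab^Bc

# With g, every non-printable byte on the line is removed.
printf 'a\001b\002c\n' | LC_ALL=C sed 's/[^[:print:]]//g' | cat -v   # prints: abc
```

The dos2unix-like effect comes from the same rule: carriage return is a control character, so it is not in [:print:] and gets stripped along with everything else.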
R
10
# -i (inplace)

LANG=C sed -i -E "s|[\d128-\d255]||g" /path/to/file(s)

The LANG=C part is there to avoid an "Invalid collation character" error.

Based on Ivan's answer and Patrick's comment.

Retroversion answered 2/5, 2018 at 3:41 Comment(0)
S
6

I'm using a very minimal busybox system, in which there is no support for ranges in tr or POSIX character classes, so I have to do it the crappy old-fashioned way. Here's the solution with sed, stripping ALL non-printable non-ASCII characters from the file:

sed -i 's/[^a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' FILE
Scrubby answered 28/10, 2014 at 16:40 Comment(3)
I don't have your system to test it on, but considering <SPACE> is character 32 (decimal) and tilde "~" is character 126, all of the printable ASCII characters fall between these. If your sed supports [a-z] type ranges, and [^ type "not in" syntax, you should be able to replace that long string of characters with: sed -i 's/[^ -~]//g' FILE (that's /[^<SPACE>-~]/)Trulatrull
@Trulatrull Excellent, this does indeed work! A much better solution, albeit six years down the road :)Scrubby
Sorry for the laggy response ;-)Trulatrull
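One thing to watch with the space-to-tilde range is that it also removes tabs, since tab (decimal 9) sits below space. A sketch of a variant that keeps them, assuming a sed that accepts a literal tab inside a bracket expression; the function name strip_keep_tabs is made up, and LC_ALL=C is added so the range is interpreted in byte order:

```shell
# Delete every byte that is neither in the range space..tilde nor a tab.
strip_keep_tabs() {
    tab=$(printf '\t')               # a literal tab character
    LC_ALL=C sed "s/[^ -~${tab}]//g"
}

printf 'caf\303\251\tx\001\n' | strip_keep_tabs | cat -v -t   # prints: caf^Ix
```

Newlines are never touched because sed works line by line and does not see them in the pattern space.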
D
6

This worked for me:

sed -i 's/[^[:print:]]//g' FILENAME
Dix answered 1/5, 2017 at 20:22 Comment(2)
I'm still getting unicode characters like 007F in my terminal.Pantomime
@KatasticVoyage What is your locale set to (LANG, LC_CTYPE)?Theatrics
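A quick way to check whether anything is left after a pass like this is to count the bytes that fall outside the set you consider acceptable. The sketch below treats tab, LF, CR and space..tilde as acceptable, which is my assumption rather than anything from the question; the function name count_suspect_bytes is made up. DEL (octal 177, U+007F) is an ASCII control character outside that range, so it is counted.

```shell
# Print the number of bytes in the file that are NOT tab, LF, CR or
# printable ASCII. Zero means the file is clean by that definition.
count_suspect_bytes() {
    # Delete the acceptable bytes, count what remains, and strip the
    # padding that some wc implementations add.
    LC_ALL=C tr -d '\11\12\15\40-\176' < "$1" | wc -c | tr -d ' '
}

# Usage: count_suspect_bytes FILENAME
```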
H
3

As an alternative to sed or perl, you may consider using ed(1) and POSIX character classes.

Note: ed(1) reads the entire file into memory to edit it in-place, so for really large files you should use sed -i ..., perl -i ...

# see:
# - http://wiki.bash-hackers.org/doku.php?id=howto:edit-ed
# - http://en.wikipedia.org/wiki/Regular_expression#POSIX_character_classes

# test
echo $'aaa \177 bbb \200 \214 ccc \254 ddd\r\n' > testfile
ed -s testfile <<< $',l' 
ed -s testfile <<< $'H\ng/[^[:graph:][:space:][:cntrl:]]/s///g\nwq'
ed -s testfile <<< $',l'
Haydeehayden answered 28/7, 2010 at 13:5 Comment(0)
F
3
awk '{ gsub(/[^a-zA-Z0-9"!@#$%^&*|_\[\](){}]/, ""); print }' MYinputfile.txt > pipe_out_to_CONVERTED_FILE.txt
Flap answered 19/8, 2014 at 16:56 Comment(1)
This answer is missing its educational explanation.Tester
M
0

I appreciate the tips I found on this site.

But, on my Windows 10, I had to use double quotes for this to work ...

sed -i "s/[\d128-\d255]//g" FILENAME

Noticed these things ...

  1. For FILENAME, the entire path\name needs to be quoted. This didn't work: %TEMP%\"FILENAME". This did: "%TEMP%\FILENAME"

  2. sed leaves behind temp files in the current directory, named sed*

Milline answered 7/3, 2017 at 22:22 Comment(1)
Note: this answer works with gnu sed, but is not portable to other versions of sed (e.g., bsd). Given the side effects mentioned in this answer, it seems like a weird windows compiled version that tries to emulate gnu sed. Or the user is muddying the water with unrelated shell issues.Theatrics
