I want to remove all the non-ASCII characters from a file in place.
I found one solution with tr, but I guess I need to write back that file after modification.
I need to do it in place with relatively good performance.
Any suggestions?
# -i (inplace)
sed -i 's/[\d128-\d255]//g' FILENAME
sed -i still creates an intermediate file. It just does it behind the scenes. – Duhamel
sed: 1: "FILENAME": unterminated substitute pattern – Decentralization
sed -i "s/[\d128-\d255]//g" FILE works for me on CentOS w/ GNU sed. You may have to use a different quoting strategy (double quotes instead of single) depending on your OS/shell. – Admiralty
LANG=C sed -i 's/[\d128-\d255]//g' FILE – Dardan
[...] LC_ALL=C. – Smetana
A perl one-liner would do: perl -i.bak -pe 's/[^[:ascii:]]//g' <your file>
-i says that the file is going to be edited in place, and the backup is going to be saved with the extension .bak.
[...] stdin as input. – Decentralization
[...] sed solution to work in my environment (Ubuntu, GNU sed 4.2.2) but this worked like a charm. – Communize
I tried all the solutions and nothing worked. The following, however, does:
tr -cd '\11\12\15\40-\176'
Which I found here:
https://alvinalexander.com/blog/post/linux-unix/how-remove-non-printable-ascii-characters-file-unix
My problem needed it in a series of piped programs, not directly from a file, so modify as needed.
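Since tr only reads stdin and writes stdout, one way to get the effect of an in-place edit is to write to a temporary file and rename it over the original. A minimal sketch, assuming mktemp is available; the function name strip_non_ascii is made up for illustration:

```shell
# Hypothetical helper: keep only tab, LF, CR and printable ASCII (octal 40-176),
# then replace the original file with the filtered copy.
strip_non_ascii() {
    file=$1
    # Create the temp file next to the original so mv stays on one filesystem.
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    if LC_ALL=C tr -cd '\11\12\15\40-\176' < "$file" > "$tmp"; then
        mv "$tmp" "$file"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Usage would be strip_non_ascii FILE. Note that the replaced file ends up with the temporary file's permissions and ownership, so those may need restoring afterwards.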
Try tr instead of sed:
tr -cd '[:print:]' < file.txt
tr -cd '[:print:]\n\r' < file.txt – Theatrics
tr -cd '[:print:]' <file.txt | sponge file.txt – Rutter
sed -i 's/[^[:print:]]//g' FILENAME
Also, this acts like dos2unix, because the carriage return is not a printable character and gets removed.
# -i (inplace)
LANG=C sed -i -E "s|[\d128-\d255]||g" /path/to/file(s)
The LANG=C part's role is to avoid an Invalid collation character error.
Based on Ivan's answer and Patrick's comment.
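To check that a command like the one above actually removed everything, counting the bytes outside the ASCII range is a quick test. A sketch, with count_non_ascii as a made-up name:

```shell
# Hypothetical helper: print how many bytes in the file fall outside 0-127.
count_non_ascii() {
    # Delete every ASCII byte (octal 000-177); whatever is left is non-ASCII.
    # The trailing tr strips the padding some wc implementations add.
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}
```

Running count_non_ascii FILE before and after the edit should show the count dropping to 0.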
I'm using a very minimal busybox system, in which there is no support for ranges in tr or POSIX character classes, so I have to do it the crappy old-fashioned way. Here's the solution with sed, stripping ALL non-printable non-ASCII characters from the file:
sed -i 's/[^a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' FILE
sed -i 's/[^ -~]//g' FILE (that's /[^<SPACE>-~]/) – Trulatrull
This worked for me:
sed -i 's/[^[:print:]]//g'
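A quick way to see what this expression keeps and drops is to run it as a filter on sample input rather than with -i. This sketch assumes the C locale, so that [:print:] means bytes 0x20 to 0x7E; the function name print_only and the sample bytes are made up:

```shell
# Hypothetical filter: same expression, reading stdin instead of editing a file.
print_only() {
    LC_ALL=C sed 's/[^[:print:]]//g'
}

# Sample line: 'a', tab, UTF-8 e-acute (2 bytes), 'b', carriage return, newline.
printf 'a\t\303\251b\r\n' | print_only
# prints: ab
```

The tab and the carriage return are deleted along with the two non-ASCII bytes, so this is stricter than the tr variants that keep \11 and \15.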
As an alternative to sed or perl, you may consider using ed(1) and POSIX character classes.
Note: ed(1) reads the entire file into memory to edit it in-place, so for really large files you should use sed -i ..., perl -i ...
# see:
# - http://wiki.bash-hackers.org/doku.php?id=howto:edit-ed
# - http://en.wikipedia.org/wiki/Regular_expression#POSIX_character_classes
# test
echo $'aaa \177 bbb \200 \214 ccc \254 ddd\r\n' > testfile
ed -s testfile <<< $',l'
ed -s testfile <<< $'H\ng/[^[:graph:][:space:][:cntrl:]]/s///g\nwq'
ed -s testfile <<< $',l'
awk '{ gsub(/[^][a-zA-Z0-9"!@#$%^&*|_(){}]/, ""); print }' MYinputfile.txt > pipe_out_to_CONVERTED_FILE.txt
I appreciate the tips I found on this site.
But, on my Windows 10, I had to use double quotes for this to work ...
sed -i "s/[\d128-\d255]//g" FILENAME
Noticed these things:
For FILENAME, the entire path\name needs to be quoted.
This didn't work -- %TEMP%\"FILENAME"
This did -- "%TEMP%\FILENAME"
sed leaves behind temp files in the current directory, named sed*