What would an awk script (presumably a one-liner) for removing a BOM look like?
Specification:
- print every line after the first (NR > 1)
- for the first line: if it starts with #FE #FF or #FF #FE, remove those bytes and print the rest
Try this:
awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}{print}' INFILE > OUTFILE
On the first record (line), remove the BOM characters. Print every record.
Or slightly shorter, using the knowledge that the default action in awk is to print the record:
awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' INFILE > OUTFILE
1 is the shortest condition that always evaluates to true, so each record is printed.
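To sanity-check the result, you can build a throwaway file with a BOM and inspect the output bytes (a quick sketch; test.txt and test.nobom are placeholder names):
printf '\xef\xbb\xbfhello\n' > test.txt
awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' test.txt > test.nobom
od -c test.nobom   # should start with h, not 357 273 277 (the BOM bytes in octal)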
Enjoy!
-- ADDENDUM --
The Unicode Byte Order Mark (BOM) FAQ includes the following table listing the exact BOM bytes for each encoding:
Bytes | Encoding Form
--------------------------------------
00 00 FE FF | UTF-32, big-endian
FF FE 00 00 | UTF-32, little-endian
FE FF | UTF-16, big-endian
FF FE | UTF-16, little-endian
EF BB BF | UTF-8
Thus, you can see how \xef\xbb\xbf corresponds to the EF BB BF UTF-8 BOM bytes in the table above.
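To see which of these BOMs (if any) a given file starts with, you can dump its first four bytes and compare them against the table (a sketch; somefile is a placeholder name):
head -c 4 somefile | od -A n -t x1
# e.g. " ef bb bf 68" means a UTF-8 BOM; " ff fe 00 00" means UTF-32, little-endian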
awk '{if(NR==1)sub(/^\xef\xbb\xbf/,"");print}' INFILE > OUTFILE
and make sure INFILE and OUTFILE are different! – Bigley
perl -i.orig -pe 's/^\x{FFFE}//' badfile
you could rely on your PERL_UNICODE and/or PERLIO environment variables for the encoding. PERL_UNICODE=SD would work for UTF-8; for the others, you'd need PERLIO. – Oatmeal
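For reference, the BOM codepoint itself is U+FEFF, so with Perl's Unicode-aware I/O enabled on the command line the substitution would typically look like this (a sketch, not from the comment above; badfile is its example filename):
perl -CSD -i.orig -pe 's/^\x{FEFF}//' badfile   # -CSD: treat standard streams and opened files as UTF-8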
awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' – Sinotibetan
10.11.6. – Cornejo
You can use -i inplace and not put the input file as the output file, which will erase the file! – Boondocks
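For reference, in-place editing via -i inplace is a GNU awk (gawk 4.1+) extension; a minimal sketch combining it with the answer above:
gawk -i inplace 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' INFILE   # rewrites INFILE directly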
Using GNU sed (on Linux or Cygwin):
# Removing BOM from all text files in current directory:
sed -i '1 s/^\xef\xbb\xbf//' *.txt
On FreeBSD:
sed -i .bak '1 s/^\xef\xbb\xbf//' *.txt
The advantage of using GNU or FreeBSD sed: the -i parameter means "in place" and will update files without the need for redirections or weird tricks.
On Mac:
The awk solution in another answer works, but the sed command above does not. At least on Mac (Sierra), the sed documentation does not mention support for hexadecimal escapes like \xef.
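One workaround on macOS is to let the shell produce the literal BOM bytes instead of sed (a sketch that relies on bash/zsh $'...' ANSI-C quoting; file.txt is a placeholder name):
sed -i '' $'1s/\xef\xbb\xbf//' file.txt   # the shell expands the \x escapes before BSD sed sees them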
A similar trick can be achieved with any program by piping to the sponge
tool from moreutils:
awk '…' INFILE | sponge INFILE
On macOS 10.11.6, this doesn't work, but the official answer https://mcmap.net/q/20801/-using-awk-to-remove-the-byte-order-mark works fine. – Cornejo
Not awk, but simpler:
tail -c +4 UTF8 > UTF8.nobom
To check for BOM:
hd -n 3 UTF8
If BOM is present you'll see: 00000000 ef bb bf ...
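Note that tail -c +4 drops the first three bytes unconditionally, so it will mangle files that have no BOM; a guarded sketch (UTF8 is the input filename used above):
if head -c 3 UTF8 | od -A n -t x1 | grep -q 'ef bb bf'; then
  tail -c +4 UTF8 > UTF8.nobom   # strip the 3-byte BOM
else
  cp UTF8 UTF8.nobom             # no BOM: copy unchanged
fi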
cat file1.utf8 file2.utf8 file3.utf8 > allfiles.utf8 will be broken if the inputs carry BOMs. Never use a BOM on UTF-8. Period. – Oatmeal
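A tiny demonstration of why that concatenation breaks (hypothetical files f1.utf8 and f2.utf8):
printf '\xef\xbb\xbfone\n' > f1.utf8
printf '\xef\xbb\xbftwo\n' > f2.utf8
cat f1.utf8 f2.utf8 | od -c   # the second BOM (357 273 277) now sits mid-stream, before "two"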
hd is not available on OS X (as of 10.8.2), so to check for a UTF-8 BOM there you can use the following: head -c 3 file | od -t x1. – Huggermugger
Huggermugger if [[ "
file a.txt | grep -o 'with BOM'" == "BOM" ]];
can also be used –
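Spelled out as a complete guard (a sketch; a.txt is the hypothetical filename from the comment, and the 'with BOM' phrase depends on your file(1) version's output):
if [[ "$(file a.txt | grep -o 'with BOM')" == "with BOM" ]]; then
  tail -c +4 a.txt > a.nobom.txt   # strip the 3-byte UTF-8 BOM
fi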
hexdump and xxd should work in place of hd, if that's not available on your system. – Footmark
Footmark In addition to converting CRLF line endings to LF, dos2unix
also removes BOMs:
dos2unix *.txt
dos2unix
also converts UTF-16 files with a BOM (but not UTF-16 files without a BOM) to UTF-8 without a BOM:
$ printf '\ufeffä\n'|iconv -f utf-8 -t utf-16be>bom-utf16be
$ printf '\ufeffä\n'|iconv -f utf-8 -t utf-16le>bom-utf16le
$ printf '\ufeffä\n'>bom-utf8
$ printf 'ä\n'|iconv -f utf-8 -t utf-16be>utf16be
$ printf 'ä\n'|iconv -f utf-8 -t utf-16le>utf16le
$ printf 'ä\n'>utf8
$ for f in *;do printf '%11s %s\n' $f $(xxd -p $f);done
bom-utf16be feff00e4000a
bom-utf16le fffee4000a00
bom-utf8 efbbbfc3a40a
utf16be 00e4000a
utf16le e4000a00
utf8 c3a40a
$ dos2unix -q *
$ for f in *;do printf '%11s %s\n' $f $(xxd -p $f);done
bom-utf16be c3a40a
bom-utf16le c3a40a
bom-utf8 c3a40a
utf16be 00e4000a
utf16le e4000a00
utf8 c3a40a
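Conversely, if you want the CRLF-to-LF conversion but need to keep an existing BOM, dos2unix has a keep-BOM flag:
dos2unix -b file.txt   # convert line endings but keep the byte order mark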
I know the question was directed at unix/linux, but I thought it would be worth mentioning a good option for the unix-challenged (on Windows, with a UI).
I ran into the same issue on a WordPress project (the BOM was causing problems with the RSS feed and page validation), and I had to look through all the files in a fairly big directory tree to find the ones with a BOM. I found an application called Replace Pioneer, and in it:
Batch Runner -> Search (to find all the files in the subfolders) -> Replace Template -> Binary remove BOM (there is a ready-made search-and-replace template for this).
It was not the most elegant solution and it did require installing a program, which is a downside. But once I found out what was going on, it worked like a charm (and found 3 files out of about 2300 that had a BOM).
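On unix, a rough equivalent of that recursive search is the following sketch (the *.txt pattern is a placeholder):
# List files under the current tree whose first 3 bytes are the UTF-8 BOM
find . -type f -name '*.txt' -exec sh -c \
  'head -c 3 "$1" | od -A n -t x1 | grep -q "ef bb bf" && printf "%s\n" "$1"' _ {} \;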