Using awk to remove the Byte-order mark

M

5

107

What would an awk script (presumably a one-liner) for removing a BOM look like?

Specification:

print every line after the first (NR > 1)
for the first line: If it starts with #FE #FF or #FF #FE, remove those and print the rest

Middleweight answered 1/7, 2009 at 11:37 Comment(0)

P

118

Try this:

awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}{print}' INFILE > OUTFILE

On the first record (line), remove the BOM characters. Print every record.

Or slightly shorter, using the knowledge that the default action in awk is to print the record:

awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' INFILE > OUTFILE

1 is the shortest condition that always evaluates to true, so each record is printed.
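For what it's worth, here is a small sketch for checking that the BOM is really gone, by dumping the first bytes of the file before and after. The helper names strip_bom and first_bytes are made up for this example. It also writes the BOM with octal escapes (\357\273\277) instead of \xef\xbb\xbf, on the assumption that some awk implementations only understand the octal form that POSIX specifies; the bytes are the same.

```shell
#!/bin/sh
# Hypothetical helper: same idea as the one-liner above, but with the BOM
# written as octal escapes (EF BB BF == \357\273\277).
strip_bom() {
    awk 'NR==1{sub(/^\357\273\277/,"")}1' "$1"
}

# Hypothetical helper: print the first three bytes of a file as hex digits.
first_bytes() {
    od -An -tx1 -N3 "$1" | tr -d ' \n'
}

tmp=$(mktemp -d)
printf '\357\273\277hello\nworld\n' > "$tmp/in.txt"   # sample file with a UTF-8 BOM

first_bytes "$tmp/in.txt"; echo      # efbbbf (the BOM)
strip_bom "$tmp/in.txt" > "$tmp/out.txt"
first_bytes "$tmp/out.txt"; echo     # 68656c ("hel")
```

Note that the output goes to a different file than the input.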

Enjoy!

-- ADDENDUM --

The Unicode Byte Order Mark (BOM) FAQ includes the following table listing the exact BOM bytes for each encoding:

Bytes         |  Encoding Form
--------------------------------------
00 00 FE FF   |  UTF-32, big-endian
FF FE 00 00   |  UTF-32, little-endian
FE FF         |  UTF-16, big-endian
FF FE         |  UTF-16, little-endian
EF BB BF      |  UTF-8

Thus, you can see how \xef\xbb\xbf corresponds to the EF BB BF bytes of the UTF-8 BOM in the table above.
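If you are not sure which BOM (if any) a file has, the table can be turned into a small lookup. This is only a sketch: detect_bom is a made-up name, and it simply compares the first four bytes against the table. The UTF-32 little-endian entry has to be tested before UTF-16 little-endian, because both start with FF FE.

```shell
#!/bin/sh
# Hypothetical helper: report which BOM from the table a file starts with.
detect_bom() {
    # First (up to) four bytes as lowercase hex, e.g. "efbbbf68"
    sig=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    case $sig in
        0000feff*) echo "UTF-32, big-endian" ;;
        fffe0000*) echo "UTF-32, little-endian" ;;   # must come before fffe*
        efbbbf*)   echo "UTF-8" ;;
        feff*)     echo "UTF-16, big-endian" ;;
        fffe*)     echo "UTF-16, little-endian" ;;
        *)         echo "no BOM" ;;
    esac
}
```

Usage would be something like detect_bom INFILE. A UTF-16 little-endian file whose first character is U+0000 would be misreported as UTF-32, which is an ambiguity in the byte patterns themselves.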

Proustite answered 1/7, 2009 at 11:45 Comment(9)

It seems that the dot in the middle of the sub statement is too much (at least, my awk complains about it). Besides that, it's exactly what I was looking for, thanks! – Middleweight 1/7, 2009 at 12:21

This solution, however, works only for UTF-8 encoded files. For others, like UTF-16, see Wikipedia for the corresponding BOM representation: en.wikipedia.org/wiki/Byte_order_mark – Middleweight 1/7, 2009 at 12:36

I agree with the earlier comment; the dot does not belong in the middle of this statement and makes this otherwise great little snippet an example of an awk syntax error. – Everara 8/12, 2009 at 14:37

So: awk '{if(NR==1)sub(/^\xef\xbb\xbf/,"");print}' INFILE > OUTFILE and make sure INFILE and OUTFILE are different! – Bigley 12/2, 2010 at 20:30

If you used perl -i.orig -pe 's/^\x{FEFF}//' badfile you could rely on your PERL_UNICODE and/or PERLIO environment variables for the encoding. PERL_UNICODE=SD would work for UTF-8; for the others, you’d need PERLIO. – Oatmeal 14/8, 2011 at 23:38

Maybe a little bit shorter version: awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' – Sinotibetan 6/6, 2013 at 10:2

Works great on OS X El Capitan 10.11.6. – Cornejo 13/9, 2016 at 15:41

/!\ Both commands erased my file, as a "side-effect" of changing the encoding... Quite fortunate to have had them backed up first. – Kelleekelleher 20/7, 2018 at 12:48

If you're trying to just change the file in place (not create a new one) and for some reason can't use sed (as per the answer below), make sure to use gawk's -i inplace rather than naming the input file as the output file, which will erase the file! – Boondocks 26/11, 2020 at 12:12

F

127

Using GNU sed (on Linux or Cygwin):

# Removing BOM from all text files in current directory:
sed -i '1 s/^\xef\xbb\xbf//' *.txt

On FreeBSD:

sed -i .bak '1 s/^\xef\xbb\xbf//' *.txt

Advantage of using GNU or FreeBSD sed: the -i parameter means "in place", and will update files without the need for redirections or weird tricks.

On Mac:

The awk solution in another answer works on Mac, but the sed command above does not. At least on Mac (Sierra), the sed documentation does not mention support for hexadecimal escapes à la \xef.
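One possible workaround when sed does not understand \x escapes is to let the shell produce the raw BOM bytes with printf and paste them into the sed expression. This is a sketch only: the variable bom and the function sed_strip_bom are names invented here, LC_ALL=C is an assumption to make sed treat the input as plain bytes, and I have not verified it on every macOS version.

```shell
#!/bin/sh
# The three UTF-8 BOM bytes (EF BB BF), written as octal escapes for printf.
bom=$(printf '\357\273\277')

# Hypothetical helper: strip a leading UTF-8 BOM and write the result to stdout.
# sed receives the literal bytes, so it needs no \x support of its own.
sed_strip_bom() {
    LC_ALL=C sed "1s/^${bom}//" "$1"
}
```

Usage would be sed_strip_bom INFILE > OUTFILE, with INFILE and OUTFILE being different files.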

A similar trick can be achieved with any program by piping to the sponge tool from moreutils:

awk '…' INFILE | sponge INFILE
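If sponge is not installed, a temporary file gives the same effect. Writing awk '…' INFILE > INFILE directly does not work, because the shell truncates INFILE before awk gets to read it. Below is a minimal sketch, where strip_bom_in_place is a made-up function name and the awk program is the octal-escape form of the UTF-8 BOM removal.

```shell
#!/bin/sh
# Hypothetical helper: remove a UTF-8 BOM from a file "in place" by going
# through a temporary file, so the input is never truncated while being read.
strip_bom_in_place() {
    _tmp=$(mktemp) || return 1
    if awk 'NR==1{sub(/^\357\273\277/,"")}1' "$1" > "$_tmp"; then
        # Copy the content back instead of mv, so the original file keeps
        # its permissions and ownership.
        cat "$_tmp" > "$1"
        rm -f "$_tmp"
    else
        rm -f "$_tmp"
        return 1
    fi
}
```

Usage would be strip_bom_in_place INFILE. Keeping a backup first is still a good idea.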

Fireweed answered 1/9, 2010 at 21:6 Comment(5)

I tried the second command precisely on Mac OS X and the result was "success", but the substitution didn't actually occur. – Sprout 6/12, 2012 at 5:52

It is worth noting these commands replace one specific byte sequence, which is one of the possible byte-order-marks. Maybe your file had a different BOM sequence. (I can't help other than that, as I don't have a Mac) – Yolondayon 7/12, 2012 at 17:4

When I tried the second command on OS X on a file that used 0xef 0xbb 0xbf as the BOM, it did not actually do the substitution. – Upstate 13/10, 2015 at 20:33

In OSX, I could only get this to work via perl, as shown here: https://mcmap.net/q/20887/-remove-multiple-boms-from-a-file – Cessation 19/8, 2016 at 18:41

On OS X El Capitan 10.11.6, this doesn't work, but the official answer https://mcmap.net/q/20801/-using-awk-to-remove-the-byte-order-mark works fine. – Cornejo 13/9, 2016 at 15:54
