Remove BOM from string with Perl

Asked 24/6, 2014 at 15:1 Answered 30/9, 2016 at 17:11

Solved string perl text utf-8 byte-order-mark

I have the following problem: I am reading from a UTF-8 text file (and I am telling Perl that I am doing so by ":encoding(utf-8)").

The file looks like this in a hex viewer: EF BB BF 43 6F 6E 66 65 72 65 6E 63 65

This translates to "∩╗┐Conference" when printed. I understand the "wide character" which I am being warned about is the BOM. I want to get rid of it (not because of the warning, but because it messes up a string comparison that I undertake later).

So I tried to remove it using the following code, but I fail miserably:

$line =~ s/^\xEF\xBB\xBF//;

Can anyone enlighten me as to how to remove the UTF-8 BOM from a string which I obtained by reading the first line of the UTF-8 file?

Thanks!

Tussis asked 24/6, 2014 at 15:1 Comment(1)

As long as you have the output encoding set correctly there should be no need to remove the BOM, because a zero-width space will have no effect on the result – Tragicomedy 24/6, 2014 at 16:48

EF BB BF is the UTF-8 encoding of the BOM, but you decoded it, so you must look for its decoded form. The BOM is a ZERO WIDTH NO-BREAK SPACE (U+FEFF) used at the start of a file, so any of the following will do:

s/^\x{FEFF}//;
s/^\N{U+FEFF}//;
s/^\N{ZERO WIDTH NO-BREAK SPACE}//;
s/^\N{BOM}//;   # Convenient alias

See also: File::BOM.
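As a minimal sketch of the point above, the following reads the question's bytes through an :encoding(UTF-8) layer and strips the decoded BOM. The in-memory "file" and the variable names are assumptions made so the example is self-contained; a real script would open its own path instead.

```perl
use strict;
use warnings;

# Hypothetical sample data: the bytes from the question's hex dump
# (EF BB BF followed by "Conference"), plus a newline.
my $bytes = "\xEF\xBB\xBFConference\n";

# Read from an in-memory "file" so the sketch is self-contained. The
# :encoding(UTF-8) layer decodes the three BOM bytes into one character.
open(my $fh, '<:encoding(UTF-8)', \$bytes) or die "Cannot open: $!";
my $line = <$fh>;
close($fh);
chomp($line);

my $length_before = length($line);   # the BOM counts as a single character
$line =~ s/^\x{FEFF}//;              # strip the decoded BOM, at the start only
my $length_after = length($line);

my $matches = $line eq 'Conference' ? 1 : 0;
print "before=$length_before after=$length_after match=$matches\n";
# prints: before=11 after=10 match=1
```

After the substitution the string comparison succeeds, which is what the question was after.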

I understand the "wide character" which I am being warned about is the BOM. I want to get rid of it

You're getting the "Wide character" warning because you forgot to add an :encoding layer to your output file handle. The following adds :encoding(UTF-8) to STDIN, STDOUT, and STDERR, and makes it the default for open().

use open ':std', ':encoding(UTF-8)';
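To see what the output layer changes, here is a small sketch that prints the same decoded string to two handles, one without and one with an :encoding(UTF-8) layer. It uses in-memory handles and an explicit layer rather than the pragma, purely so the result can be inspected; the variable names are illustrative assumptions.

```perl
use strict;
use warnings;

my $text = "\x{FEFF}Conference";   # decoded string that still carries the BOM

# Capture warnings so the difference between the two handles is visible.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# No encoding layer: printing a character above 0xFF triggers
# a "Wide character in print" warning.
open(my $raw, '>', \my $raw_buf) or die "Cannot open: $!";
print {$raw} $text;
close($raw);
my $warned_without_layer = scalar @warnings;

# With an :encoding(UTF-8) layer the character is encoded on output
# and no warning is raised.
@warnings = ();
open(my $enc, '>:encoding(UTF-8)', \my $enc_buf) or die "Cannot open: $!";
print {$enc} $text;
close($enc);
my $warned_with_layer = scalar @warnings;

print "warnings without layer: $warned_without_layer\n";
print "warnings with layer:    $warned_with_layer\n";
```

The encoded buffer ends up holding the bytes EF BB BF followed by "Conference", so the BOM is still written out unless it was stripped first.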

Uric answered 24/6, 2014 at 15:8 Comment(5)

to use the shorthand, I needed to add use charnames ':full'; – Tussis 24/6, 2014 at 15:20

I think 5.12 is needed for \N{...}. I think 5.14 is needed for \N{BOM}. use charnames ':full'; is needed before 5.16. – Uric 24/6, 2014 at 15:28

@user1769925: Note that the problem is that you have decoded the data from the file (because of your :encoding(utf-8) open mode) so the first character of the input string is Unicode U+FEFF, but you are using raw UTF-8-encoded data bytes in your substitution – Tragicomedy 24/6, 2014 at 16:41

These solutions caused compile time errors until I added this code: use charnames ":full";. After that, the solutions still failed in making any change. What ultimately solved this for me: use Encode; my $value = decode('UTF-8', $value); $value =~ s/\N{U+FEFF}//; – Crellen 30/9, 2016 at 17:14

@HoldOffHunger, It was already mentioned that one needs use charnames ":full"; in old versions of Perl. /// A crucial part of the question is the equivalent of decode('UTF-8', $value) had already been performed -- their code would have worked if they hadn't already decoded the text -- so adding decode('UTF-8', $value) would actually be wrong here. – Uric 30/9, 2016 at 17:39
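The distinction discussed in these comments can be checked directly. The sketch below applies the byte pattern from the question and the character pattern from the answer to both the raw and the decoded form of the same data; the sample string and variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xEF\xBB\xBFConference";   # raw, undecoded UTF-8
my $chars = decode('UTF-8', $bytes);    # what an :encoding(UTF-8) layer yields

# The byte pattern matches only the undecoded form ...
my $byte_re_on_bytes = ($bytes =~ /^\xEF\xBB\xBF/) ? 1 : 0;
my $byte_re_on_chars = ($chars =~ /^\xEF\xBB\xBF/) ? 1 : 0;

# ... and the character pattern matches only the decoded form.
my $char_re_on_bytes = ($bytes =~ /^\x{FEFF}/) ? 1 : 0;
my $char_re_on_chars = ($chars =~ /^\x{FEFF}/) ? 1 : 0;

print "byte pattern: bytes=$byte_re_on_bytes chars=$byte_re_on_chars\n";
print "char pattern: bytes=$char_re_on_bytes chars=$char_re_on_chars\n";
# prints:
# byte pattern: bytes=1 chars=0
# char pattern: bytes=0 chars=1
```

Which pattern is correct therefore depends only on whether the string has already been decoded, not on the file itself.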

To defuse the BOM, you have to know that once the text is decoded it's not 3 characters, it's 1 (U+FEFF):

s/^\x{FEFF}//;

Chinchilla answered 24/6, 2014 at 15:10 Comment(1)

As @Uric noted, decoding from UTF-8 is needed, i.e. decode_utf8(). – Dulcine 12/1, 2022 at 17:4

If you open the file using File::BOM, it will remove the BOM for you.

use File::BOM qw(open_bom);   # open_bom is not exported by default

open_bom(my $fh, $path, ':utf8');   # ':utf8' is the layer used if no BOM is found

Cassondra answered 24/6, 2014 at 15:19 Comment(1)

not in the standard library – Johathan 24/4, 2022 at 21:50

Ideally, your filehandle should be doing this for you automatically. But if you're not in an ideal situation, this worked for me:

use Encode;

my $value = decode('UTF-8', $originalvalue);
$value =~ s/^\N{U+FEFF}//;   # anchored so only a leading BOM is removed

Crellen answered 30/9, 2016 at 17:11 Comment(1)

There is decode_utf8() shortcut. – Dulcine 12/1, 2022 at 17:2
