Force UTF-8 Byte Order Mark in Perl file output

I'm writing out a CSV file using Perl. The data going into the CSV contains Unicode characters. I'm using the following to write the CSV out:

open(my $fh, ">:utf8", "rpt-".$datestring.".csv")
    or die "cannot open > rpt-$datestring.csv: $!";

The characters are written correctly inside the file, but the output doesn't appear to include the UTF-8 Byte Order Mark. This throws off my users when they try to open the file in Excel. Is there a way to force the Byte Order Mark to be written?

I attempted it the following way:

print $fh "\x{EFBBBF}";

I ended up with gibberish at the top of the file.

Gigantean answered 14/9, 2011 at 15:27 Comment(5)
A 'Byte Order Mark' for UTF-8 makes no logical sense - there is only one possible byte order for UTF8. I am aware that various Windows apps rely on the presence of the 'BOM' to trigger the use of a Unicode encoding rather than a Microsoft codepage but if you're not dealing with broken MS apps there is no value in adding a BOM to a UTF8 document. – Upchurch
@Grant: Or, to be pedantic: Since UTF-8 encodes as a stream of bytes there is no byte order. Byte order (or endianness) only makes sense for multi-byte numbers. – Hickory
@Grant I agree with you in principle. However my users are using broken MS apps. Hence the need to force the BOM. – Gigantean
Forcing the BOM sounds like a good idea anyway, as otherwise there is no way to tell from just the stream what its encoding is. – Sufi
"A 'Byte Order Mark' for UTF-8 makes no logical sense" -- false. And while predicated on the faulty notion that a name determines a thing's semantics, it's wrong even if that notion were true ... because presence/absence of a BOM in a utf8 file can be taken to imply presence/absence of a BOM in a utf16 or utf32 file it is converted to, allowing transparent round trip conversion. "if you're not dealing with broken MS apps" The OP explicitly mentioned Excel. The question was not about whether BOMs should be used, but how to output them, so that entire pedantic excursion is out of place. – Fishbein

Try doing this:

print $fh chr(65279);

after opening the file.
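
To spell out why this works: chr(65279) is the character U+FEFF (65279 is 0xFEFF), and the encoding layer on the filehandle turns it into the bytes EF BB BF. The attempt in the question, "\x{EFBBBF}", is instead read as a single code point numbered 0xEFBBBF, which is why it comes out as gibberish. A minimal sketch of the whole sequence follows; the helper name `write_csv_with_bom`, the file name and the sample rows are made up for illustration, and `:encoding(UTF-8)` is used in place of the question's `:utf8` (either should produce the same bytes here).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): writes pre-formatted
# CSV lines to a file, preceded by a UTF-8 BOM so Excel detects the encoding.
sub write_csv_with_bom {
    my ($path, @lines) = @_;
    open(my $fh, '>:encoding(UTF-8)', $path)
        or die "cannot open > $path: $!";
    print {$fh} "\x{FEFF}";          # same character as chr(65279)
    print {$fh} "$_\n" for @lines;   # naive: lines are assumed to be valid CSV already
    close($fh) or die "cannot close $path: $!";
    return $path;
}

# Illustrative call; the file name and data are invented.
write_csv_with_bom('rpt-example.csv', "name,city", "Zo\x{EB},K\x{F6}ln");
```

The BOM has to be the very first thing printed to the handle; anything written before it would push it into the middle of the data.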

Risarise answered 14/9, 2011 at 15:52 Comment(4)
use File::BOM (); open my $fh, '> :utf8 :via(File::BOM)', … would be even morer clearerer. – Terminator
Isn't that the UTF-16 BOM? Shouldn't he be doing print $fh pack("CCC",0xef,0xbb,0xbf); Although saying that, I could only get FusionCharts (which expects the BOM) to understand your example. – Goatee
@Cosmicnet: No: the same codepoint is used for BOM for all UTF-* charsets. What will make the difference is the encoding layer enabled on the filehandle. See :utf8 in the open call in the question. – Pilgrimage
@MooingDuck Both the title of the question and its content repeatedly mention UTF-8; there's no UTF-16 involved. Your assumption seems based on misunderstandings of Unicode. – Fishbein
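
The point made in the comments above, that one code point (U+FEFF) serves as the BOM and the encoding decides which bytes end up on disk, can be checked with the core Encode module. This is only a sketch; the helper name `bom_bytes_hex` is invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns the bytes that U+FEFF becomes under the
# given encoding, formatted as uppercase hex pairs.
sub bom_bytes_hex {
    my ($encoding) = @_;
    my $bytes = encode($encoding, "\x{FEFF}");
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

printf "%-8s %s\n", $_, bom_bytes_hex($_) for qw(UTF-8 UTF-16LE UTF-16BE);
```

Under UTF-8 the character becomes EF BB BF, under UTF-16LE it becomes FF FE, and under UTF-16BE it becomes FE FF, which is why printing chr(65279) to a handle opened with a UTF-8 layer yields the UTF-8 BOM.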

Is there a way to force the Byte Order Mark to be written?

One way is to use File::BOM, which writes the Byte Order Mark when the file is opened.

For example, writing a UTF-8 file with a BOM:

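use utf8;            # the string literal below is UTF-8 source text
use File::BOM ();
my $filename = "out.bin";
# Lexical filehandle with an error check instead of the bareword FH
open(my $fh, '>:encoding(UTF-8):via(File::BOM)', $filename)
    or die "cannot open > $filename: $!";
print $fh "ʇsǝ⊥\n";
close($fh) or die "cannot close $filename: $!";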

Then run the program and check the output:

% file out.bin
out.bin: Unicode text, UTF-8 (with BOM) text
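
If the `file` utility is not available (on Windows, for instance), the same check can be done in Perl by reading the first three bytes in raw mode. A rough sketch, where the helper name `has_utf8_bom` and the script name `check_bom.pl` are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the file starts with the bytes EF BB BF.
sub has_utf8_bom {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "cannot open < $path: $!";
    my $read = read($fh, my $head, 3);   # number of bytes actually read
    close($fh);
    return defined $read && $read == 3 && $head eq "\xEF\xBB\xBF";
}

# Illustrative use: perl check_bom.pl out.bin
for my $file (@ARGV) {
    printf "%s: %s\n", $file,
        has_utf8_bom($file) ? 'UTF-8 BOM present' : 'no UTF-8 BOM';
}
```

Opening with `:raw` matters here: with a decoding layer on the handle, the comparison would be against characters rather than the bytes on disk.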

Note that Perl versions before 5.8.7 had bugs with wide characters.

Erfert answered 30/3, 2023 at 0:4 Comment(0)
