How to avoid tripping over UTF-8 BOM when reading files

I'm consuming a data feed that has recently added a Unicode BOM header (U+FEFF), and my rake task is now messed up by it.

I can skip the first 3 bytes with file.gets[3..-1], but is there a more elegant way to read files in Ruby that handles this correctly, whether a BOM is present or not?

Godfather answered 12/2, 2009 at 20:59 Comment(2)
That's a Unicode BOM, not a UTF-8 one.Fite
Thanks, I just realized that. It's actually 3 bytes, not one... I edited the question to say as much.Godfather

With Ruby 1.9.2 you can use the mode r:bom|utf-8:

text_without_bom = nil # define the variable outside the block to keep the data
File.open('file.txt', "r:bom|utf-8"){|file|
  text_without_bom = file.read
}

or

text_without_bom = File.read('file.txt', encoding: 'bom|utf-8')

or

text_without_bom = File.read('file.txt', mode: 'r:bom|utf-8')

It doesn't matter whether a BOM is present in the file or not; the read works either way.
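
To convince yourself of this, you can write one file with a BOM and one without and read both with the BOM-aware mode. A minimal sketch; the file names and contents are throwaway examples:

File.write('with_bom.txt', "\xEF\xBB\xBFhello")  # starts with the UTF-8 BOM
File.write('without_bom.txt', 'hello')           # no BOM

p File.read('with_bom.txt', mode: 'r:bom|utf-8')    # => "hello" (BOM stripped)
p File.read('without_bom.txt', mode: 'r:bom|utf-8') # => "hello" (unchanged)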


You may also use the encoding option with other commands:

text_without_bom = File.readlines(@filename, encoding: 'bom|utf-8')

(You get an array of all lines.)

Or with CSV:

require 'csv'
CSV.open(@filename, 'r:bom|utf-8'){|csv|
  csv.each{ |row| p row }
}
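
CSV.foreach accepts the same encoding as an option, which is handy when you want to stream rows without opening the file yourself. A short sketch; @filename comes from the examples above:

require 'csv'
# :encoding takes the same bom|utf-8 value as File.open's mode string
CSV.foreach(@filename, encoding: 'bom|utf-8') do |row|
  p row
end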
Fiorenza answered 15/10, 2011 at 20:48 Comment(7)
Is there a way to do this with CSV files using the CSV library built into Ruby? I've tried passing :encoding => "r:bom|utf-8" to CSV's foreach, but it still reads the BOM as if it were part of the first header column.Inchon
I think it is possible. With CSV.read(filename, :encoding => 'utf-8') you can set the encoding with CSV (or is it CSV.load?). I think this should also be possible with the BOM logic: :encoding => 'bom|utf-8'. I can't test it myself at the moment - sorry.Fiorenza
The following worked for me: file = File.open(@filename, 'r:bom|utf-8'); csv = CSV.new(file, faster_csv_options); csv.each do |row| ... end; file.closeInchon
You may also use the block version of File#open: File.open(@filename, 'r:bom|utf-8'){|file| csv = CSV.new(file, faster_csv_options); csv.each{ |row| p row } } or, even shorter (I tested these successfully): CSV.open(@filename, 'r:bom|utf-8', faster_csv_options){|csv| csv.each{ |row| p row } } and CSV.read(@filename, 'r:bom|utf-8').each{|row| p row }Fiorenza
Why not just text_without_bom = File.read('file.txt', mode: 'r:bom|utf-8')?Oates
@MattHuggins For me it is a matter of habit; there is no special reason behind it. I think in the meantime I would also prefer File.read.Fiorenza
The simplest CSV read that worked for me: zips = CSV.read('zip_codes.csv', col_sep: "\t", encoding: 'bom|utf-8')Objection

I wouldn't blindly skip the first three bytes; what if the producer later stops adding the BOM? What you should do is examine the first few bytes, and if they're 0xEF 0xBB 0xBF, ignore them. That's the form the BOM character (U+FEFF) takes in UTF-8. I prefer to deal with it before trying to decode the stream, because BOM handling is so inconsistent from one language/tool/framework to the next.
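
A minimal sketch of that check in Ruby; read_skipping_bom is a made-up helper name, not a library method:

# Peek at the first three bytes and skip them only if they are the UTF-8 BOM.
def read_skipping_bom(path)
  File.open(path, 'rb') do |file|
    bom = file.read(3)
    file.rewind unless bom == "\xEF\xBB\xBF".b  # no BOM: keep those bytes
    file.read.force_encoding('UTF-8')
  end
end

text = read_skipping_bom('file.txt')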

In fact, that's how you're supposed to deal with a BOM. If a file is served as UTF-16, you have to examine the first two bytes before you start decoding, so you know whether to read it as big-endian or little-endian. The UTF-8 BOM, of course, has nothing to do with byte order; it's just there to tell you that the encoding is UTF-8, in case you didn't already know.
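
For completeness, a sketch of that byte-order check; detect_bom_encoding is a hypothetical helper that only inspects the BOM and returns nil when there is none:

# Map the file's leading bytes to an encoding name, per the BOM forms above.
# (A UTF-32LE BOM also begins with FF FE; that case is ignored here for brevity.)
def detect_bom_encoding(path)
  head = File.binread(path, 4) || ''
  if    head.start_with?("\xEF\xBB\xBF".b) then 'UTF-8'
  elsif head.start_with?("\xFE\xFF".b)     then 'UTF-16BE'
  elsif head.start_with?("\xFF\xFE".b)     then 'UTF-16LE'
  end
end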

Monomolecular answered 13/2, 2009 at 15:4 Comment(0)

I wouldn't "trust" a file to be UTF-8 encoded just because a BOM of 0xEF 0xBB 0xBF is present; that can fail. Usually, when you detect the UTF-8 BOM, the file really is UTF-8 encoded. But if, for example, someone has simply prepended a UTF-8 BOM to an ISO-8859 file, decoding will go badly wrong wherever the file contains bytes above 0x7F. You can trust the file if it contains only bytes up to 0x7F (after the BOM), because in that case it is a UTF-8 compatible ASCII file and at the same time a valid UTF-8 file.

If the file does contain bytes above 0x7F (after the BOM), then to be sure it is properly UTF-8 encoded you have to check for valid sequences, and, even when all sequences are valid, also check that each codepoint uses the shortest possible sequence and that no codepoint falls in the high- or low-surrogate range. Check as well that no sequence is longer than 4 bytes and that no codepoint exceeds 0x10FFFF. That upper limit also restricts the lead byte of a 4-byte sequence to at most 0xF4 and, when the lead byte is 0xF4, the first continuation byte to at most 0x8F. If all of these checks pass, your UTF-8 BOM is telling the truth.
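
In Ruby, String#valid_encoding? performs exactly these well-formedness checks (it rejects malformed sequences, overlong encodings, surrogates, and anything above 0x10FFFF), so the whole verification can be sketched like this; the file name is just an example:

bytes = File.binread('file.txt')
if bytes.start_with?("\xEF\xBB\xBF".b)
  body = bytes[3..-1].force_encoding('UTF-8')
  if body.valid_encoding?
    # the BOM told the truth: the content is well-formed UTF-8
  else
    # BOM present, but the content is not valid UTF-8 (e.g. a mislabeled ISO file)
  end
end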

Ursola answered 3/6, 2013 at 15:5 Comment(0)
