Ruby CSV BOM|UTF-8 encoding for StringIO

Asked 25/9, 2019 at 15:47 Answered 12/5, 2020 at 19:49

Ruby 2.6.3.

I have been trying to parse a StringIO object into a CSV instance with the bom|utf-8 encoding, so that the BOM character (undesired) is stripped and the content is encoded to UTF-8:

require 'csv'

CSV_READ_OPTIONS = { headers: true, encoding: 'bom|utf-8' }.freeze

content = StringIO.new("\xEF\xBB\xBFid\n123")
first_row = CSV.parse(content, CSV_READ_OPTIONS).first

first_row.headers.first.include?("\xEF\xBB\xBF")     # This returns true

Apparently the bom|utf-8 encoding does not work for StringIO objects, but I found that it does work for files, for instance:

require 'csv'

CSV_READ_OPTIONS = { headers: true, encoding: 'bom|utf-8' }.freeze

# File content is: "\xEF\xBB\xBFid\n12"
first_row = CSV.read('bom_content.csv', CSV_READ_OPTIONS).first

first_row.headers.first.include?("\xEF\xBB\xBF")     # This returns false

Considering that I need to work with StringIO directly, why does CSV ignore the bom|utf-8 encoding? Is there any way to remove the BOM character from the StringIO instance?

Thank you!

Viburnum answered 25/9, 2019 at 15:47 Comment(2)

Is it not possible to remove the BOM before creating the StringIO instance, or to create another one based on a UTF-8 string without the BOM? No released StringIO version supports BOM handling. – Keegan 25/9, 2019 at 16:40

The problem is that (since Ruby 2.4) BOM is a property of files, not an encoding. If you already have an encoded string, there is no such thing as BOM because the characters have already been properly read according to the BOM, and it is now unneeded. Since StringIO is backed by a string--not a file--it also does not understand BOM. – Nonparous 26/9, 2019 at 16:20

Ruby 2.7 added the set_encoding_by_bom method to IO. This method consumes the byte order mark and sets the encoding.

require 'csv'
require 'stringio'

CSV_READ_OPTIONS = { headers: true }.freeze

content = StringIO.new("\xEF\xBB\xBFid\n123")
content.set_encoding_by_bom

first_row = CSV.parse(content, **CSV_READ_OPTIONS).first # ** keeps this working on Ruby 3+
first_row.headers.first.include?("\xEF\xBB\xBF")
#=> false
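
Since the question targets Ruby 2.6.3, where set_encoding_by_bom is not available, here is a minimal sketch that uses the method when it exists and falls back to trimming the bytes otherwise. The helper name strip_utf8_bom and the UTF8_BOM constant are made up for illustration, and the fallback only handles the UTF-8 BOM.

```ruby
require 'csv'
require 'stringio'

UTF8_BOM = "\xEF\xBB\xBF".b # the three BOM bytes, as a binary string

# Hypothetical helper (not part of CSV or StringIO): returns a StringIO
# that will be read from just past a leading UTF-8 BOM, if there is one.
def strip_utf8_bom(io)
  if io.respond_to?(:set_encoding_by_bom)
    io.set_encoding_by_bom # Ruby 2.7+: consumes the BOM when present
    io
  else
    # Older Rubies: compare raw bytes and build a new StringIO without the BOM.
    data = io.string.b
    data = data.byteslice(UTF8_BOM.bytesize..-1) if data.start_with?(UTF8_BOM)
    StringIO.new(data.force_encoding(Encoding::UTF_8))
  end
end

io = strip_utf8_bom(StringIO.new("\xEF\xBB\xBFid\n123"))
table = CSV.parse(io, headers: true)
puts table.headers.inspect
```

Note that the fallback branch returns a new StringIO, so callers should use the return value rather than the object they passed in.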

Drawn answered 12/5, 2020 at 19:49 Comment(0)

Ruby doesn't like BOMs. It only handles them when reading a file, never anywhere else, and even then it only reads them so that it can get rid of them. If you want a BOM for your string, or a BOM when writing a file, you have to handle it manually.

There are probably gems for doing this, though it's easy to do yourself:

bytes = string.b # binary copy, so the comparisons below are byte-wise
if bytes.start_with?("\xEF\xBB\xBF".b)
  string = bytes.byteslice(3..-1).force_encoding('UTF-8')
elsif bytes.start_with?("\xFF\xFE".b)
  string = bytes.byteslice(2..-1).force_encoding('UTF-16LE')
# etc
end
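
The manual check above can be generalized with a small lookup table. The following is only a sketch: the BOMS constant and the strip_bom helper are names invented here, and the table lists just the common Unicode BOMs. Longer prefixes come first because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE one, which also means UTF-16LE text whose first character is U+0000 would be misdetected.

```ruby
# Common BOM byte sequences, longest first so that UTF-32LE (FF FE 00 00)
# is tested before UTF-16LE (FF FE).
BOMS = [
  ["\x00\x00\xFE\xFF".b, Encoding::UTF_32BE],
  ["\xFF\xFE\x00\x00".b, Encoding::UTF_32LE],
  ["\xEF\xBB\xBF".b,     Encoding::UTF_8],
  ["\xFE\xFF".b,         Encoding::UTF_16BE],
  ["\xFF\xFE".b,         Encoding::UTF_16LE]
].freeze

# Hypothetical helper: returns a copy of +string+ without its BOM, tagged
# with the encoding that BOM announces. A string without a known BOM is
# returned as-is.
def strip_bom(string)
  bytes = string.b # compare raw bytes, whatever the current encoding tag is
  bom, encoding = BOMS.find { |prefix, _| bytes.start_with?(prefix) }
  return string unless bom

  bytes.byteslice(bom.bytesize..-1).force_encoding(encoding)
end

clean = strip_bom("\xEF\xBB\xBFid\n123")
puts clean.inspect
puts clean.encoding
```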

Nonparous answered 26/9, 2019 at 16:34 Comment(0)

I found out that forcing the encoding of the StringIO's string to UTF-8 and removing the BOM to generate a new StringIO worked:

require 'csv'
CSV_READ_OPTIONS = { headers: true }.freeze
content = StringIO.new("\xEF\xBB\xBFid\n123")
# Anchor the match with \A so that only a leading BOM is removed.
csv_file = StringIO.new(content.string.force_encoding('utf-8').sub(/\A\uFEFF/, ''))
first_row = CSV.parse(csv_file, **CSV_READ_OPTIONS).first

first_row.headers.first.include?("\xEF\xBB\xBF") # => false

The encoding option is no longer needed. It may not be the best option memory-wise, because the content is copied into a new string, but it works.
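
If the extra copy is a concern, one possible alternative is to move the read position of the existing StringIO past the BOM instead of building a new string. This is only a sketch: skip_utf8_bom! and UTF8_BOM_BYTES are hypothetical names, and it handles the UTF-8 BOM only.

```ruby
require 'stringio'

UTF8_BOM_BYTES = "\xEF\xBB\xBF".b # the three BOM bytes, as a binary string

# Hypothetical helper: advances +io+ past a leading UTF-8 BOM, or restores
# the original position when there is none. Returns +io+ either way.
def skip_utf8_bom!(io)
  start = io.pos
  head = io.read(UTF8_BOM_BYTES.bytesize) # read(n) returns binary bytes, or nil at EOF
  io.pos = start unless head == UTF8_BOM_BYTES
  io
end

io = skip_utf8_bom!(StringIO.new("\xEF\xBB\xBFid\n123"))
puts io.read.inspect
```

The returned object can then be passed to CSV.parse in place of csv_file, on the assumption that CSV starts reading at the IO's current position rather than rewinding it.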

Liverpool answered 8/10, 2019 at 12:9 Comment(0)
