Ruby `split': invalid byte sequence in UTF-8 (ArgumentError)
I am trying to populate the movie object, but when parsing through the u.item file I get this error:

`split': invalid byte sequence in UTF-8 (ArgumentError)

File.open("Data/u.item", "r") do |infile|
  while line = infile.gets
    line = line.split("|")
  end
end

The error occurs only when splitting lines that contain international (non-ASCII) characters, such as accented letters.

Here's a sample

543|Misérables, Les (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Mis%E9rables%2C%20Les%20%281995%29|0|0|0|0|0|0|0|0|1|0|0|0|1|0|0|0|0|0|0

Is there a workaround?

Shumaker answered 16/6, 2012 at 18:22 Comment(2)
What does od -c say about the line in question? - Moult
It works for me with the corpus as posted. @IgnacioVazquez-Abrams is probably right: you need to use a hex editor to see if you have hidden characters in your data file. - Johnnie
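As a rough Ruby alternative to od -c or a hex editor, the sketch below lists the lines of a file that are not valid UTF-8, together with their non-ASCII bytes in hex. The method name invalid_utf8_lines is made up for illustration and is not part of any library.

```ruby
# Hypothetical helper: returns [line_number, hex_bytes] pairs for every
# line in the file that is not valid UTF-8.
def invalid_utf8_lines(path)
  bad = []
  # "rb" reads raw bytes, so nothing is transcoded or rejected on the way in.
  File.open(path, "rb") do |infile|
    infile.each_line.with_index(1) do |line, number|
      # Check a copy tagged as UTF-8; skip lines that are already valid.
      next if line.dup.force_encoding("UTF-8").valid_encoding?
      high_bytes = line.bytes.select { |b| b > 0x7F }
      bad << [number, high_bytes.map { |b| format("%02X", b) }]
    end
  end
  bad
end

# Possible usage against the file from the question:
# invalid_utf8_lines("Data/u.item").each { |n, bytes| puts "line #{n}: #{bytes.join(' ')}" }
```

A lone E9 in the output would suggest the file is ISO-8859-1 rather than UTF-8, since 0xE9 is "é" in that encoding.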
S
21

I had to force the encoding of each line to ISO-8859-1 (Latin-1, a Western European character set): http://en.wikipedia.org/wiki/ISO/IEC_8859-1

# Tag each line as ISO-8859-1, then split it on the pipe delimiter.
m = []
IO.foreach("u.item") do |line|
  m << line.force_encoding("iso-8859-1").split("|")
end
Shumaker answered 17/6, 2012 at 14:7 Comment(1)
You can specify what encoding Ruby should use when using open, e.g. File.open 'data.txt', 'r:iso-8859-1' do .... See the docs. - Gennie
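A minimal sketch of the approach from that comment, assuming the data file really is ISO-8859-1. The method name read_latin1_records and the sample record written to a temporary file are illustrative stand-ins, not the real u.item.

```ruby
require "tempfile"

# Hypothetical helper: reads a pipe-delimited ISO-8859-1 file and returns
# an array of field arrays, with every field transcoded to UTF-8.
def read_latin1_records(path)
  records = []
  # "r:iso-8859-1:utf-8" means: the file is ISO-8859-1 on disk (external
  # encoding) and lines are handed to the program as UTF-8 (internal).
  File.open(path, "r:iso-8859-1:utf-8") do |infile|
    infile.each_line { |line| records << line.chomp.split("|") }
  end
  records
end

# Stand-in data: the title contains "é" as the single Latin-1 byte 0xE9.
sample = "543|Mis\xE9rables, Les (1995)|01-Jan-1995\n".b

records = Tempfile.create("u_item") do |f|
  f.binmode
  f.write(sample)
  f.flush
  read_latin1_records(f.path)
end

puts records.first[1]
```

Declaring the encoding once in the open mode avoids calling force_encoding on every line, and the resulting strings are ordinary UTF-8.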
J
15

Ruby is somewhat sensitive to character encoding issues. You can do a number of things that might solve your problem. For example:

  1. Put an encoding comment at the top of your source file. This sets the encoding of string literals in the source itself, not of data read from files.

    # encoding: utf-8
    
  2. Explicitly encode your line before splitting.

    line = line.encode('UTF-8').split("|")
    
  3. Replace invalid characters, instead of raising an Encoding::InvalidByteSequenceError exception.

    line.encode('UTF-8', :invalid => :replace).split("|")
    

Give these suggestions a shot, and update your question if none of them work for you. Hope it helps!
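For comparison, here is a hedged sketch of two ways to handle a line that Ruby has tagged as UTF-8 but that actually holds Latin-1 bytes. The method names split_record and split_record_lossy are hypothetical, and the ISO-8859-1 default is an assumption about the data. String#scrub requires Ruby 2.1 or later.

```ruby
# Hypothetical helper: splits a record, first repairing the encoding if
# the line is not valid in its current encoding.
def split_record(line, source_encoding = Encoding::ISO_8859_1)
  return line.split("|") if line.valid_encoding?
  # Reinterpret the same bytes as the assumed source encoding, then
  # transcode to UTF-8 so accented characters survive.
  line.dup.force_encoding(source_encoding).encode("UTF-8").split("|")
end

# Hypothetical helper: lossy variant that discards what cannot be decoded.
def split_record_lossy(line)
  # String#scrub replaces each invalid byte sequence with the given string.
  line.scrub("?").split("|")
end

# A line as read under a UTF-8 default: the byte 0xE9 is invalid UTF-8.
raw = "543|Mis\xE9rables, Les (1995)|01-Jan-1995"

puts raw.valid_encoding?
puts split_record(raw)[1]
puts split_record_lossy(raw)[1]
```

The first variant keeps the accented character but only gives correct text if the guess about the source encoding is right. The second never guesses, at the cost of replacing the undecodable byte.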

Johnnie answered 16/6, 2012 at 18:42 Comment(4)
The error he's getting implies the encoding already is UTF-8. - Schiedam
So, I inspected each line before the program tries to split it. It turns out that the error occurs on lines with non-ASCII characters. Here is the record where the error occurred: 543|Misérables, Les (1995)|01-Jan-1995||us.imdb.com/M/… I tried the third option as well, but it didn't work. Any ideas or alternative approaches? - Shumaker
This seems to address your edge case: https://mcmap.net/q/588370/-ruby-string-encode-still-gives-quot-invalid-byte-sequence-in-utf-8-quot - Johnnie
Found a working solution from this question: #7048444 - Banff
