Ruby `split': invalid byte sequence in UTF-8 (ArgumentError)

Asked 16/6, 2012 at 18:22 Answered 17/6, 2012 at 14:7

I am trying to populate the movie object, but when parsing through the u.item file I get this error:

`split': invalid byte sequence in UTF-8 (ArgumentError)

```
File.open("Data/u.item", "r") do |infile|
  while line = infile.gets
    line = line.split("|")
  end
end
```

The error occurs only when trying to split the lines with fancy international punctuation.

Here's a sample line:

543|Misérables, Les (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Mis%E9rables%2C%20Les%20%281995%29|0|0|0|0|0|0|0|0|1|0|0|0|1|0|0|0|0|0|0

Is there a workaround?

Shumaker answered 16/6, 2012 at 18:22 Comment(2)

What does od -c say about the line in question? – Moult 16/6, 2012 at 18:30

It works for me with the corpus as posted. @IgnacioVazquez-Abrams is probably right: you need to use a hex editor to see if you have hidden characters in your data file. – Johnnie 16/6, 2012 at 18:57
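If `od` or a hex editor is not handy, the same check can be sketched in Ruby itself. The sample below is hypothetical: it assumes the file stores "é" as the single ISO-8859-1 byte 0xE9, which the thread suggests but the posted text cannot confirm. It lists the lines that are not valid UTF-8 along with their non-ASCII bytes.

```ruby
# Hypothetical stand-in for two lines of u.item, held as raw bytes.
# 0xE9 is "é" in ISO-8859-1 but is not a valid UTF-8 sequence on its own.
raw = "543|Mis\xE9rables, Les (1995)|01-Jan-1995\n1|Toy Story (1995)|01-Jan-1995\n".b

bad_lines = []
raw.each_line.with_index(1) do |line, lineno|
  # Tag the bytes as UTF-8 (Ruby's usual default) and test their validity.
  next if line.force_encoding("UTF-8").valid_encoding?

  # Collect the non-ASCII bytes so they can be looked up by hand.
  suspects = line.bytes.select { |b| b > 0x7F }.map { |b| format("0x%02X", b) }
  bad_lines << [lineno, suspects]
end

bad_lines.each { |lineno, bytes| puts "line #{lineno}: #{bytes.join(' ')}" }
```

A lone byte such as 0xE9 in the report is a strong hint that the file is in a single-byte encoding rather than UTF-8.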

I had to force the encoding of each line to ISO-8859-1 (Latin-1, a Western European character set): http://en.wikipedia.org/wiki/ISO/IEC_8859-1

```
# Read each line, tag its bytes as ISO-8859-1, then split on "|".
m = []
IO.foreach("u.item") { |line| m << line.force_encoding("iso-8859-1").split("|") }
```

Shumaker answered 17/6, 2012 at 14:7 Comment(1)

You can specify what encoding Ruby should use when using open, e.g. File.open 'data.txt', 'r:iso-8859-1' do .... See the docs. – Gennie 17/6, 2012 at 16:46
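As a rough sketch of that suggestion, the mode string can also name an internal encoding, so each line arrives already transcoded to UTF-8. The temporary file and its one record are hypothetical stand-ins for `Data/u.item`, written here as raw ISO-8859-1 bytes so the example runs on its own.

```ruby
require "tempfile"

# Hypothetical stand-in for Data/u.item: one record in raw ISO-8859-1 bytes.
sample = Tempfile.new("u.item")
sample.binmode
sample.write("543|Mis\xE9rables, Les (1995)|01-Jan-1995\n".b)
sample.close

# "r:iso-8859-1:utf-8" reads the bytes as ISO-8859-1 (external encoding)
# and hands back strings transcoded to UTF-8 (internal encoding).
rows = File.open(sample.path, "r:iso-8859-1:utf-8") do |infile|
  infile.each_line.map { |line| line.chomp.split("|") }
end
sample.unlink

puts rows.first[1]
```

Unlike `force_encoding`, which only relabels the bytes, this converts them, so the resulting strings can be mixed safely with other UTF-8 text.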

Ruby is somewhat sensitive to character encoding issues. You can do a number of things that might solve your problem. For example:

Put an encoding comment at the top of your source file.
```
# encoding: utf-8
```
Explicitly encode your line before splitting.
```
line = line.encode('UTF-8').split("|")
```
Replace invalid characters, instead of raising an Encoding::InvalidByteSequenceError exception.
```
line.encode('UTF-8', :invalid => :replace).split("|")
```

Give these suggestions a shot, and update your question if none of them work for you. Hope it helps!
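Two further variants, offered as a sketch rather than a definitive fix. Both assume the offending byte is ISO-8859-1, as the accepted answer suggests, and the sample line is hypothetical. The first tells `encode` what the source encoding is. The second uses `String#scrub`, which needs Ruby 2.1 or later and so was not available when the question was asked.

```ruby
# Hypothetical line holding a raw ISO-8859-1 byte (0xE9) while tagged as UTF-8.
line = "543|Mis\xE9rables, Les (1995)|01-Jan-1995"

# Option 1: name the source encoding so 0xE9 is converted to a real UTF-8 "é".
converted = line.encode("UTF-8", "ISO-8859-1").split("|")

# Option 2 (Ruby 2.1+): replace undecodable bytes when the true encoding is unknown.
scrubbed = line.scrub("?").split("|")

puts converted[1]
puts scrubbed[1]
```

Option 1 preserves the accented character. Option 2 loses it, so it is best kept as a fallback for data whose encoding cannot be determined.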

Johnnie answered 16/6, 2012 at 18:42 Comment(4)

The error he's getting implies the encoding already is UTF-8. – Schiedam 16/6, 2012 at 19:34

So, I inspected each line before the program tries to split it. It turns out that the error occurs in lines with fancy punctuation. Here is the record where the error occurred: 543|Misérables, Les (1995)|01-Jan-1995||us.imdb.com/M/… I tried the third option as well, but it didn't work. Any ideas or alternative approaches? – Shumaker 16/6, 2012 at 20:9

This seems to address your edge case: https://mcmap.net/q/588370/-ruby-string-encode-still-gives-quot-invalid-byte-sequence-in-utf-8-quot – Johnnie 16/6, 2012 at 21:23

Found a working solution from this question: #7048444 – Banff 30/11, 2015 at 20:52
