Is there a way in ruby 1.9 to remove invalid byte sequences from strings?

Asked 3/1, 2012 at 9:57 Answered 26/4, 2017 at 16:22

Solved ruby encoding character-encoding ruby-1.9 utf

Suppose you have a string like "€foo\xA0", encoded as UTF-8. Is there a way to remove invalid byte sequences from this string (so that you get "€foo")?

In Ruby 1.8 you could use Iconv.iconv('UTF-8//IGNORE', 'UTF-8', "€foo\xA0"), but that is now deprecated. "€foo\xA0".encode('UTF-8') doesn't do anything, since the string is already UTF-8. I tried:

"€foo\xA0".force_encoding('BINARY').encode('UTF-8', :undef => :replace, :replace => '')

which yields

"foo"

But that also loses the valid multibyte character €.

Peonage answered 3/1, 2012 at 9:57 Comment(1)

See #12147949 for some newer 2.1+ options. – Gigahertz 28/2, 2023 at 16:28
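A minimal sketch of why the BINARY attempt in the question drops the € as well, assuming a UTF-8 source file. Once the string is tagged as BINARY, every byte above 0x7F has no mapping to UTF-8, so the three bytes of € are discarded along with the stray \xA0. The variable names are illustrative only.

```ruby
# Assumes the source file is UTF-8 (the default since Ruby 2.0).
original = "€foo\xA0"

# Work on a copy, because force_encoding changes the receiver in place.
binary = original.dup.force_encoding('BINARY')

# The euro sign occupies three bytes, all of them above 0x7F.
euro_bytes = binary.bytes.first(3).map { |b| format('%02X', b) }

# In BINARY, bytes above 0x7F are undefined for UTF-8, so :undef drops them all.
stripped = binary.encode('UTF-8', undef: :replace, replace: '')

puts euro_bytes.join(' ')
puts stripped
```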

"€foo\xA0".chars.select(&:valid_encoding?).join

Inodorous answered 3/1, 2012 at 10:50 Comment(2)

It doesn't remove the \xF1 in this string "eEspa\xF1a;FB" – Munn 24/9, 2014 at 15:12

@Dorian, on 1.9.3 IRB console, "eEspa\xF1a;FB".chars.select{|i| i.valid_encoding?}.join returns "eEspaa;FB" ...do you not get that behavior or have I misunderstood? – Nullify 20/3, 2015 at 17:40
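To check the two strings discussed above, the per-character filter can be wrapped in a small helper. This is only a sketch: drop_invalid_chars is a hypothetical name, and the input is assumed to be tagged as UTF-8.

```ruby
# Hypothetical helper: keep only the characters that are valid in the
# string's own encoding (assumed to be UTF-8 here).
def drop_invalid_chars(str)
  str.chars.select(&:valid_encoding?).join
end

puts drop_invalid_chars("€foo\xA0")
puts drop_invalid_chars("eEspa\xF1a;FB")
```

Each invalid byte is yielded by chars as its own one-byte "character", which is why the filter can discard it without touching its neighbours.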

"€foo\xA0".encode('UTF-16le', invalid: :replace, replace: '').encode('UTF-8')

Easter answered 3/1, 2012 at 10:50 Comment(10)

I was under the impression that UTF-16 has a larger character set than UTF-8, meaning you don't lose any valid data. Unfortunately the following doesn't work: "€foo\xA0".encode('UTF-8', :invalid => :replace, :replace => '') because the string is already UTF-8, so it will not be encoded again. – Easter 29/4, 2012 at 18:9

FWIW, running a test on a large file I found this method to be an order of magnitude faster than the valid_encoding approach. – Madian 4/10, 2012 at 20:37

UTF-8 and UTF-16 can both represent all Unicode characters. The only difference is the way the characters are encoded. – Wall 10/11, 2012 at 11:10

UTF-32 is also an option, but UTF-16 seems to work well enough. The new emoji characters might need the extra space. – Faience 12/12, 2012 at 21:6

All UTF encodings are equally capable of encoding all possible Unicode characters; there's no difference in that regard between UTF-8, UTF-16 and UTF-32. The only practical difference is the output size. – Wall 2/6, 2013 at 7:9

Throws an error with this string: "eEspa\xF1a;FB" – Munn 24/9, 2014 at 15:12

@Dorian: what Ruby version? – Easter 5/3, 2015 at 23:50

@VanderHoorn: it was ruby < 2.1 because it works with ruby 2.1+ – Munn 10/3, 2015 at 13:5

@Dorian: I see. Could it be a Ruby 2.0.x issue? Because I think I used Ruby 1.9.3 when I answered the original question. – Easter 11/3, 2015 at 13:56

FWIW, with Ruby 2.1, encoding from "the same encoding to the same encoding" is no longer a no-op, so the double-encoding trick is hopefully no longer necessary? – Gigahertz 28/2, 2023 at 15:58
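The two variants discussed in these comments can be compared side by side. This is a sketch under the assumption of Ruby 2.1 or later and UTF-8 input; the method names are hypothetical. Per the comments above, the round trip raised an error for some inputs on older versions, and the same-encoding call did nothing on 1.9.

```ruby
# Round trip through UTF-16LE so that a real transcoding step happens
# and :invalid => :replace gets a chance to drop the bad bytes.
def strip_via_utf16(str)
  str.encode('UTF-16LE', invalid: :replace, replace: '').encode('UTF-8')
end

# Assumption: Ruby 2.1+, where a UTF-8 to UTF-8 encode with
# invalid: :replace is no longer skipped.
def strip_same_encoding(str)
  str.encode('UTF-8', invalid: :replace, replace: '')
end

puts strip_via_utf16("€foo\xA0")
puts strip_same_encoding("€foo\xA0")
```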

Ruby 2.0 and 1.9.3

"€foo\xA0".encode(Encoding::UTF_8, Encoding::UTF_8, :invalid => :replace)

Ruby 2.1+

"€foo\xA0".scrub

By default, these replace the \xA0 with a � symbol; you can specify a different replacement string as a parameter.

Axinomancy answered 26/4, 2017 at 16:22 Comment(0)
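A short sketch of the replacement options for scrub mentioned above, assuming Ruby 2.1+ and a UTF-8 source file. The variable names are illustrative only.

```ruby
s = "€foo\xA0"

# Default: each invalid sequence becomes U+FFFD (the replacement character).
with_marker = s.scrub

# Passing an empty string removes the invalid bytes entirely.
removed = s.scrub('')

# A block receives the offending bytes and returns their replacement.
tagged = s.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }

puts with_marker
puts removed
puts tagged
```

scrub returns a new string and leaves s untouched; scrub! is the in-place variant.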

-2

    data = '' unless data.force_encoding("UTF-8").valid_encoding?

Seanseana answered 11/10, 2014 at 7:37 Comment(2)

This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post - you can always comment on your own posts, and once you have sufficient reputation you will be able to comment on any post. – Treacy 11/10, 2014 at 12:8

@Treacy how come not? It looks like an (incorrect) answer to the question. It removes all invalid byte sequences from a string. It just removes all the valid ones as well. – Swipple 11/10, 2014 at 15:46
