Force strings to UTF-8 from any encoding

4

51

In my Rails app I'm working with RSS feeds from all around the world, and some feeds have links that are not in UTF-8. The original feed links are out of my control, and they need to be in UTF-8 before I can use them in other parts of the app.

How can I detect encoding and convert to UTF-8?

Hallo answered 18/10, 2012 at 5:48 Comment(1)
To detect an encoding, you need to parse the accompanying meta information of the documents, i.e. HTTP headers or <meta> tags. - Luciano
71

Ruby 1.9

"Forcing" an encoding is easy; however, it won't convert the characters, it only changes the encoding the string is labelled with:

str = str.force_encoding('UTF-8')

str.encoding.name # => 'UTF-8'
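To make the difference concrete, here is a small sketch that assumes the bytes are really ISO-8859-1 (Latin-1); the sample string and variable names are made up for illustration. It shows that force_encoding leaves the bytes alone, while encode rewrites them.

```ruby
# "\xE9" is the single byte for "é" in ISO-8859-1 (Latin-1).
latin1 = "caf\xE9".dup.force_encoding('ISO-8859-1')

# Relabelling leaves the bytes untouched, so the result is not valid UTF-8.
relabelled = latin1.dup.force_encoding('UTF-8')
relabelled.bytes == latin1.bytes # => true
relabelled.valid_encoding?       # => false

# Converting rewrites the bytes: "é" becomes the two bytes 0xC3 0xA9.
converted = latin1.encode('UTF-8')
converted.valid_encoding?        # => true
converted.bytes.last(2)          # => [195, 169]
```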

If you want to perform a conversion, use encode:

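begin
  # encode returns a new string, so keep the result
  str = str.encode("UTF-8")
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  # a character has no UTF-8 equivalent, or the source bytes are invalid
  # ...
end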
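When the source encoding is not known in advance, one possible heuristic is to keep the string if its bytes are already valid UTF-8 and otherwise treat them as a fallback encoding. The sketch below is only that: `to_utf8` and its `fallback` parameter are hypothetical names, and the Windows-1252 default is an assumption, not something the feeds guarantee.

```ruby
# Hypothetical helper: returns a valid UTF-8 copy of +str+.
# The Windows-1252 default for +fallback+ is an assumption; pick whatever
# encoding your feeds most often use when they are not UTF-8.
def to_utf8(str, fallback: 'Windows-1252')
  utf8 = str.dup.force_encoding('UTF-8')
  return utf8 if utf8.valid_encoding?

  # Not valid UTF-8: reinterpret the bytes as the fallback encoding and convert,
  # replacing anything that cannot be mapped with "?".
  str.dup.force_encoding(fallback)
     .encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
end

to_utf8("caf\xC3\xA9".b).valid_encoding? # bytes were already UTF-8, kept as they are
to_utf8("caf\xE9".b).bytes               # 0xE9 is converted to 0xC3 0xA9
```

A guess like this can silently produce the wrong characters when the fallback does not match the real encoding, so metadata such as HTTP headers is still the more reliable signal when it is available.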

I would definitely read the following post for more information:
http://graysoftinc.com/character-encodings/ruby-19s-string

Romansh answered 18/10, 2012 at 6:39 Comment(3)
Doesn't work: whois = whois.force_encoding("UTF-8"); whois.encoding.name # => "UTF-8"; whois.scan(/role:\s+(.+)/i) throws ArgumentError: invalid byte sequence in UTF-8 - Harpy
As stated, force_encoding does not convert the characters and certainly cannot magically interpret invalid UTF-8 byte sequences. - Romansh
Current syntax for Ruby 2.2.0 and above is: str.force_encoding(Encoding::UTF_8) Encoding - Perish
42

This ensures that you end up with a valid UTF-8 string and that the call won't raise, because every invalid or undefined character is replaced with an empty string. The replaced characters are lost.

str.encode(Encoding.find('UTF-8'), {invalid: :replace, undef: :replace, replace: ''})

For Ruby 3.0+:

str.encode(Encoding.find('UTF-8'), invalid: :replace, undef: :replace, replace: '')
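For a string that is already labelled UTF-8 but contains stray invalid bytes, String#scrub is a built-in alternative, assuming Ruby 2.1 or later. A minimal sketch, with a made-up sample string:

```ruby
# A string tagged UTF-8 that contains a stray 0xFF byte, which is never valid UTF-8.
whois = "role: admin\xFF".dup.force_encoding('UTF-8')
whois.valid_encoding? # => false

# scrub('') removes the invalid bytes and returns a new string.
clean = whois.scrub('')
clean.valid_encoding?       # => true
clean.scan(/role:\s+(.+)/i) # => [["admin"]]
```

Like the encode call, this discards the offending bytes rather than recovering the characters they were meant to represent.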
Interchange answered 4/6, 2015 at 16:53 Comment(2)
This will raise "no implicit conversion of Hash into String" on modern Ruby (probably after 3.0). Use str.encode(Encoding.find('UTF-8'), invalid: :replace, undef: :replace, replace: '') - Postprandial
Thanks, I had this problem; your solution solves it with Ruby 3+ ;) - Tarkington
5

Only this solution worked for me:

string.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

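Note the 'binary' argument: it names the source encoding, so every byte outside the ASCII range counts as undefined and is dropped.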
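As a rough illustration of what the 'binary' source encoding does, the sketch below uses a made-up sample whose bytes happen to be valid UTF-8. With 'binary' as the source, the non-ASCII bytes are removed even though they formed a valid character.

```ruby
# "café ok" as UTF-8 bytes, but tagged as binary (ASCII-8BIT) by .b
raw = "caf\xC3\xA9 ok".b

# With 'binary' as the source, bytes 0xC3 and 0xA9 are "undefined" and get dropped.
stripped = raw.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
stripped # => "caf ok"

# Relabelling as UTF-8 first keeps the valid character and removes only bad bytes.
kept = raw.dup.force_encoding('UTF-8').scrub('')
kept.bytes.length # => 8
```

So this form is a good fit when plain ASCII output is acceptable, and a poor one when accented or non-Latin characters need to survive.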

Orozco answered 12/6, 2020 at 9:0 Comment(0)
4

Iconv

require 'iconv'
i = Iconv.new('UTF-8','LATIN1')
a_with_hat = i.iconv("\xc2")

Summary: the iconv gem does all the work of converting encodings. Make sure it's installed with:

gem install iconv

You need to know which encoding your string is currently in, because Ruby 1.8 treats a String as an array of bytes with no intrinsic encoding. For example, say your string is in Latin-1 and you want to convert it to UTF-8:

require 'iconv'

string_in_utf8_encoding = Iconv.conv("UTF-8", "LATIN1", string_in_latin1_encoding)
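On Ruby 1.9 and later, the same Latin-1 to UTF-8 conversion can be written without Iconv by passing the source encoding to String#encode. A minimal sketch, reusing the variable names above and assuming the input really is Latin-1:

```ruby
# Bytes known to be Latin-1: 0xC2 is "Â", the same byte used in the Iconv example.
string_in_latin1_encoding = "\xC2".b

# encode(destination, source) converts the bytes from source to destination.
string_in_utf8_encoding = string_in_latin1_encoding.encode('UTF-8', 'ISO-8859-1')
string_in_utf8_encoding.bytes # => [195, 130]
```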
Lucy answered 18/10, 2012 at 5:56 Comment(2)
Thanks for the answer, but in my case the source data is inconsistent and I don't really have a reliable way to preempt encodings. - Hallo
Iconv should not be used anymore (deprecated). #8149262 - Dumbhead