Ruby, Nokogiri: how do I ensure UTF-8 throughout Nokogiri parsing, the ERB template, and the generated HTML file?

I finally managed to parse parts of a website:

get '/' do
  url = '<website>'
  data = Nokogiri::HTML(open(url))
  @rows = data.css("td[valign=top] table tr") 
  erb :muster
end

Now I am trying to extract a certain row in my view, so I put this in my ERB template:

<%= @rows[2] %> 

It does return the markup, but with an encoding problem. I expect:

<td class="class_name">&nbsp;</td>

but instead I get:

<td class="class_name">�</td>

How do I ensure UTF-8 during Nokogiri parsing, ERB rendering, and HTML generation?

Acromegaly answered 31/1, 2015 at 16:37 Comment(0)

See: http://www.nokogiri.org/tutorials/parsing_an_html_xml_document.html#encoding

It looks like, in your case, the document declares that it is encoded as ISO-8859-1:

<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1">

You can do the following to force Nokogiri to treat the stream as UTF-8:

data = Nokogiri::HTML(open(url), nil, Encoding::UTF_8.to_s)
Defector answered 31/1, 2015 at 18:28 Comment(5)
Maybe the website you're hitting is not UTF-8. What's the URL? - Defector
Updated my answer to show how you can force Nokogiri to use UTF-8. - Defector
If you're parsing a fragment, you can just do Nokogiri::HTML::DocumentFragment.parse(html, Encoding::UTF_8.to_s) - Lauralauraceous
This doesn't seem to be enough; Nokogiri doesn't seem to handle it as expected. I use the following to get the protection I need: doc = Nokogiri::HTML(email.try(:force_encoding, 'ISO-8859-1').try(:encode, 'UTF-8').to_s) - Mise
doc.text works either way, but doc.text.match(/string/) doesn't unless you add the extra force_encoding. - Mise
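The force_encoding / encode step mentioned in the last two comments can be sketched with plain Ruby strings, without Nokogiri. The sample bytes and variable names below are assumptions for illustration: force_encoding only relabels the bytes, while encode actually converts them to UTF-8.

```ruby
# Assumed sample data: "café" as ISO-8859-1 bytes (0xE9 is "é" in Latin-1),
# held in a binary string as it might arrive from a socket or file.
raw = "caf\xE9".b

# Relabel the bytes as ISO-8859-1, then transcode them to real UTF-8.
utf8 = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')

# For contrast: the same bytes merely relabelled as UTF-8 are not valid UTF-8.
mislabelled = raw.dup.force_encoding('UTF-8')

puts utf8.encoding               # the string is now tagged UTF-8
puts utf8.valid_encoding?        # valid after transcoding
puts mislabelled.valid_encoding? # invalid: 0xE9 alone is not a UTF-8 sequence
puts utf8.match?(/caf\u00E9/)    # regex matching works on the transcoded string
```

This only helps when the input really is ISO-8859-1; applying it to bytes that are already UTF-8 would double-encode any non-ASCII characters.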
