Ruby, Nokogiri: How do I ensure UTF-8 throughout Nokogiri parsing, the ERB template, and the generated HTML file?

Asked 31/1, 2015 at 16:37 Answered 31/1, 2015 at 18:28

I finally managed to parse parts of a website:

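require 'sinatra'
require 'nokogiri'
require 'open-uri'  # lets open() fetch URLs (use URI.open on Ruby 3.0+)

get '/' do
  url = '<website>'
  data = Nokogiri::HTML(open(url))
  @rows = data.css("td[valign=top] table tr")
  erb :muster
end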

Now I am trying to output a particular row in my view, so I put this in my ERB template:

<%= @rows[2] %>

It does return the markup, but there is a UTF-8 problem. Instead of

<td class="class_name">&nbsp;</td>

it shows

<td class="class_name">�</td>

How do I ensure UTF-8 during Nokogiri parsing, ERB rendering, and HTML generation?

Acromegaly asked 31/1, 2015 at 16:37 Comment(0)

See: http://www.nokogiri.org/tutorials/parsing_an_html_xml_document.html#encoding

It looks like, in your case, the document declares that it is encoded as ISO-8859-1:

<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1">

You can do the following to force Nokogiri to treat the stream as UTF-8:

data = Nokogiri::HTML(open(url), nil, Encoding::UTF_8.to_s)
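For what it's worth, here is a minimal sketch of why an encoding mismatch shows up as �, using only core Ruby String methods (no Nokogiri, no network). The sample markup and the variable names are made up for illustration, and it assumes the page's bytes really are ISO-8859-1, where the non-breaking space is the single byte 0xA0.

```ruby
# Sketch only: the sample bytes stand in for a cell from an ISO-8859-1 page.
latin1_cell = "<td class=\"class_name\">\xA0</td>".b  # raw bytes; 0xA0 is nbsp in ISO-8859-1

# Relabeling the bytes as UTF-8 converts nothing; a lone 0xA0 is not valid UTF-8.
mislabeled = latin1_cell.dup.force_encoding(Encoding::UTF_8)
mislabeled_valid = mislabeled.valid_encoding?

# scrub replaces each invalid byte with U+FFFD, the "�" seen in the browser.
as_rendered = mislabeled.scrub

# Labeling the bytes correctly first, then transcoding, yields real UTF-8.
converted = latin1_cell.dup
                       .force_encoding(Encoding::ISO_8859_1)
                       .encode(Encoding::UTF_8)

puts mislabeled_valid          # false
puts as_rendered
puts converted.valid_encoding? # true
```

A true result from valid_encoding? does not prove the label is the right one, but a false result on a string labeled UTF-8 does tell you the bytes were never converted.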

Defector answered 31/1, 2015 at 18:28 Comment(5)

Maybe the website you're hitting is not UTF-8. What's the URL? – Defector 1/2, 2015 at 22:34

Updated my answer to show how you can force nokogiri to use UTF-8 – Defector 2/2, 2015 at 17:7

If you're parsing a fragment, you can just do Nokogiri::HTML::DocumentFragment.parse(html, Encoding::UTF_8.to_s) – Lauralauraceous 23/3, 2016 at 23:59

This doesn't seem to be enough; Nokogiri doesn't seem to handle it as expected. I use the following to get the protection I need: doc = Nokogiri::HTML( email.try(:force_encoding,'ISO-8859-1').try(:encode,'UTF-8').to_s ) – Mise 30/11, 2016 at 16:32

doc.text works either way, but doc.text.match(/string/) doesn't, unless you add the extra force_encoding. – Mise 30/11, 2016 at 16:32
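The transcoding step from the comment above can be sketched in plain Ruby, without ActiveSupport's try. The helper name to_utf8 and the sample input are hypothetical, and the sketch assumes the raw bytes really are ISO-8859-1; a nil check stands in for the try calls.

```ruby
# Hypothetical helper (not part of Nokogiri or Rails): transcode raw input to UTF-8.
# Assumes the bytes really are in source_encoding; nil becomes an empty string.
def to_utf8(raw, source_encoding = Encoding::ISO_8859_1)
  return "" if raw.nil?

  # dup so the caller's string keeps its original encoding label
  raw.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
end

raw_email = "Caf\xE9 string".b  # sample ISO-8859-1 bytes; 0xE9 is "é"
clean = to_utf8(raw_email)

puts clean.encoding            # UTF-8
puts clean.valid_encoding?     # true
puts clean.match(/string/)[0]  # string
```

The cleaned string could then be handed to Nokogiri::HTML as in the comment. Whether the extra step is needed probably depends on how the input bytes were labeled to begin with.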
