I can't remove whitespace from a string parsed by Nokogiri

I can't remove whitespace from a string.

My HTML is:

<p class='your-price'>
Cena pro Vás: <strong>139&nbsp;<small>Kč</small></strong>
</p>

My code is:

#encoding: utf-8
require 'rubygems'
require 'mechanize'

agent = Mechanize.new
site  = agent.get("http://www.astratex.cz/podlozky-pod-raminka/doplnky")
price = site.search("//p[@class='your-price']/strong/text()")

val = price.first.text  # => "139 "
val.strip               # => "139 "
val.gsub(" ", "")       # => "139 "

gsub, strip, etc. don't work. Why, and how do I fix this?

val.class      # => String
val.dump       # => "\"139\\u{a0}\""      !
val.encoding   # => #<Encoding:UTF-8>

__ENCODING__               # => #<Encoding:UTF-8>
Encoding.default_external  # => #<Encoding:UTF-8>

I'm using Ruby 1.9.3, so Unicode shouldn't be a problem.

Galatia answered 2/1, 2013 at 18:52 Comment(2)
Tip: Instead of that XPath, you could use val = site.at('p.your-price > strong').text. - Maziemazlack
Yup, but CSS is not my cup of tea. :) - Galatia

strip only removes ASCII whitespace and the character you've got here is a Unicode non-breaking space.

Removing the character is easy. You can use gsub by providing a regex with the character code:

gsub(/\u00a0/, '')

You could also call

gsub(/[[:space:]]/, '')

to remove all Unicode whitespace. For details, check the Regexp documentation.
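To make the difference concrete, here is a minimal sketch comparing the approaches on a hypothetical sample value (the string literal below stands in for what Nokogiri returned in the question; the variable names are illustrative):

```ruby
# Hypothetical sample: "139" followed by a non-breaking space (U+00A0).
val = "139\u00A0"

stripped      = val.strip                   # unchanged: NBSP is not ASCII whitespace
ascii_only    = val.gsub(/\s/, '')          # unchanged: \s covers ASCII whitespace only
by_code_point = val.gsub(/\u00A0/, '')      # removes only the NBSP
posix_space   = val.gsub(/[[:space:]]/, '') # removes any Unicode whitespace

# dump makes the invisible character visible as an escape sequence
puts [stripped, ascii_only, by_code_point, posix_space].map(&:dump)
```

The code-point form is the narrower tool; the [[:space:]] form also removes ordinary spaces, tabs and newlines, which may or may not be what you want.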

Grodin answered 2/1, 2013 at 19:24 Comment(9)
You could also use \p{Space} as an alternative to [[:space:]] if you prefer (I think they’re the same). - Auroraauroral
An alternative is to use gsub('&nbsp;', '') or gsub('&nbsp;', ' ') before parsing and get them all in one pass. - Daedalus
@theTinMan using gsub on an HTML document seems like a good idea provided that there are many elements to extract. An unnecessary effort, when there's only one... And a spectacularly bad one if you want to parse a page like this one to grab the content of your very comment :) Whoopsie Daisy, wasn't that one wrapped with <code>? - Grodin
I'll try that @home, but I'm sure I tried gsub with /\s+/ even /\s+/u. And why strip (and probably others) works only with ASCII? Programmers like me assume that Ruby will care about this automatically ;) - Galatia
@Galatia /\s/ is ASCII-only as well - Grodin
"Programmers like me assume that Ruby will care about this automatically" Don't assume, educate yourself on what your language does. If the language did everything, it would be worthless for those times we need it to do something different or new. As programmers we engineer solutions from smaller pieces of code that are designed to be general purpose tools. We plug them in, use them to shape data into whatever we need, and we don't blindly "assume" things will work magically. ASCII vs. UTF-8/Unicode will be a battle for years to come, as long as the internet is full of HTML. - Daedalus
I agree that programmer can't assume everything, but Class: String says: "A String object holds and manipulates an arbitrary sequence of bytes, typically representing characters." And when documentation says "Removes leading and trailing whitespace from str." I assume it removes all whitespace. I have to dig deeper, maybe it's for another question. - Galatia
Leading and trailing whitespace is much different than "all whitespace" as a string can contain spaces between characters to form words. - Daedalus
C'mon. ;) I assume it removes all [leading/trailing] whitespace. Even back then I understood what is whitespace. - Galatia

If I wanted to remove non-breaking spaces "\u00A0" AKA &nbsp; I'd do something like:

require 'nokogiri'

doc = Nokogiri::HTML("&nbsp;")

s = doc.text # => " "

# s is the NBSP
s.ord.to_s(16)                   # => "a0"

# and here's the translate changing the NBSP to a SPACE
s.tr("\u00A0", ' ').ord.to_s(16) # => "20"

So tr("\u00A0", ' ') gets you where you want to be: at this point, the NBSP is now a regular space.

tr is extremely fast and easy to use.
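Applied to the value from the question, this could look like the sketch below. The sample string is a hypothetical stand-in for the scraped text, and String#delete is shown as a swapped-in alternative when the character should be dropped rather than translated:

```ruby
# Hypothetical sample: "139" followed by a non-breaking space (U+00A0).
val = "139\u00A0"

# Translate the NBSP to a regular space, then let strip handle it.
cleaned = val.tr("\u00A0", ' ').strip

# String#delete removes the character outright, with no regex involved.
deleted = val.delete("\u00A0")

# Once the NBSP is gone the value converts cleanly to a number.
price = cleaned.to_i
```

Translating first is handy when NBSPs also appear between words and should survive as ordinary spaces; delete is simpler when they are pure noise.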

An alternative is to pre-process the encoded entity "&nbsp;" in the raw HTML before it has been parsed. This is simplified, but it would work for an entire HTML file just as well as for a single entity in a string:

s = "&nbsp;"

s.gsub('&nbsp;', ' ') # => " "

Using a fixed string for the target is faster than using a regular expression:

s = "&nbsp;" * 10000

require 'fruity'

compare do
  fixed { s.gsub('&nbsp;', ' ') }
  regex { s.gsub(/&nbsp;/, ' ') }
end

# >> Running each test 4 times. Test will take about 1 second.
# >> fixed is faster than regex by 2x ± 0.1

Regular expressions are useful if you need their capability, but they can slow code down noticeably.
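If the fruity gem isn't installed, a rough version of the same comparison can be sketched with Ruby's standard Benchmark module. The iteration count n is an arbitrary assumption, and the timings depend on the Ruby version and machine, so treat the numbers as indicative only:

```ruby
require 'benchmark'

s = "&nbsp;" * 10_000
n = 50 # arbitrary repeat count, chosen only to keep the run short

# Time the fixed-string pattern and the regex pattern on the same input.
fixed_time = Benchmark.realtime { n.times { s.gsub('&nbsp;', ' ') } }
regex_time = Benchmark.realtime { n.times { s.gsub(/&nbsp;/, ' ') } }

puts format("fixed: %.4fs, regex: %.4fs", fixed_time, regex_time)
```

Both calls produce the same string; only the pattern type differs, which is what makes the timing comparison meaningful.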

Daedalus answered 14/2, 2020 at 2:8 Comment(0)
