I can't remove whitespace from a string parsed by Nokogiri

Asked 2/1, 2013 at 18:52 Answered 14/2, 2020 at 2:8

Solved ruby nokogiri whitespace mechanize mechanize-ruby

I can't remove whitespace from a string.

My HTML is:

<p class='your-price'>
Cena pro Vás: <strong>139&nbsp;<small>Kč</small></strong>
</p>

My code is:

#encoding: utf-8
require 'rubygems'
require 'mechanize'

agent = Mechanize.new
site  = agent.get("http://www.astratex.cz/podlozky-pod-raminka/doplnky")
price = site.search("//p[@class='your-price']/strong/text()")

val = price.first.text  # => "139 "
val.strip               # => "139 "
val.gsub(" ", "")       # => "139 "

gsub, strip, etc. don't work. Why, and how do I fix this?

val.class      # => String
val.dump       # => "\"139\\u{a0}\""   (!)
val.encoding   # => #<Encoding:UTF-8>

__ENCODING__               # => #<Encoding:UTF-8>
Encoding.default_external  # => #<Encoding:UTF-8>

I'm using Ruby 1.9.3, so Unicode shouldn't be a problem.

Galatia answered 2/1, 2013 at 18:52 Comment(2)

Tip: Instead of that XPath, you could use val = site.at('p.your-price > strong').text. – Maziemazlack 4/1, 2013 at 22:20

Yup, but CSS is not my cup of tea. :) – Galatia 5/1, 2013 at 0:22

strip only removes ASCII whitespace and the character you've got here is a Unicode non-breaking space.

Removing the character is easy. You can use gsub by providing a regex with the character code:

gsub(/\u00a0/, '')

You could also call

gsub(/[[:space:]]/, '')

to remove all Unicode whitespace. For details, check the Regexp documentation.
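Note that gsub(/[[:space:]]/, '') removes every whitespace character, including any in the middle of the string. If you only want what strip does, but for Unicode whitespace too, anchoring the pattern to the ends of the string is one option. This is a minimal sketch; unicode_strip is a hypothetical helper name, not something Ruby or Nokogiri provides:

```ruby
# Hypothetical helper (not part of Ruby or Nokogiri): like String#strip,
# but it also trims Unicode whitespace such as U+00A0 (NO-BREAK SPACE).
def unicode_strip(str)
  # \A and \z anchor to the very start and end of the string,
  # so whitespace in the middle is left untouched.
  str.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')
end

val = "139\u00A0"              # same shape as the value in the question

p val.strip                    # the NBSP survives: strip only trims ASCII whitespace
p unicode_strip(val)           # the trailing NBSP is trimmed
p unicode_strip(" a\u00A0b ")  # only the ends are trimmed; the inner NBSP stays
```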

Grodin answered 2/1, 2013 at 19:24 Comment(9)

You could also use \p{Space} as an alternative to [[:space:]] if you prefer (I think they’re the same). – Auroraauroral 2/1, 2013 at 19:29

An alternative is to use gsub('&nbsp;', '') or gsub('&nbsp;', ' ') before parsing and get them all in one pass. – Daedalus 2/1, 2013 at 20:12

@theTinMan using gsub on an HTML document seems like a good idea provided that there are many elements to extract. An unnecessary effort, when there's only one... And a spectacularly bad one if you want to parse a page like this one to grab the content of your very comment :) Whoopsie Daisy, wasn't that one wrapped with <code>? – Grodin 2/1, 2013 at 20:24

I'll try that @home, but I'm sure I tried gsub with /\s+/ and even /\s+/u. And why do strip (and probably others) work only with ASCII? Programmers like me assume that Ruby will care about this automatically ;) – Galatia 2/1, 2013 at 22:1

@Galatia /\s/ is ASCII-only as well – Grodin 2/1, 2013 at 22:37
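For anyone who wants to check the difference between these character classes themselves, here is a small sketch. whitespace_classes is a hypothetical helper name used only for illustration; it reports whether /\s/, /[[:space:]]/ and /\p{Space}/ match a given character:

```ruby
# Hypothetical helper: report which whitespace classes match a character.
def whitespace_classes(char)
  {
    '\s'          => !(char =~ /\s/).nil?,
    '[[:space:]]' => !(char =~ /[[:space:]]/).nil?,
    '\p{Space}'   => !(char =~ /\p{Space}/).nil?
  }
end

p whitespace_classes(" ")       # ASCII space: all three match
p whitespace_classes("\u00A0")  # NBSP: \s does not match, the other two do
```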

"Programmers like me assume that Ruby will care about this automatically" Don't assume, educate yourself on what your language does. If the language did everything, it would be worthless for those times we need it to do something different or new. As programmers we engineer solutions from smaller pieces of code that are designed to be general purpose tools. We plug them in, use them to shape data into whatever we need, and we don't blindly "assume" things will work magically. ASCII vs. UTF-8/Unicode will be a battle for years to come, as long as the internet is full of HTML. – Daedalus 2/1, 2013 at 22:43

I agree that a programmer can't assume everything, but the String class documentation says: "A String object holds and manipulates an arbitrary sequence of bytes, typically representing characters." And when the documentation says "Removes leading and trailing whitespace from str." I assume it removes all whitespace. I have to dig deeper; maybe it's for another question. – Galatia 3/1, 2013 at 11:43

Leading and trailing whitespace is much different than "all whitespace" as a string can contain spaces between characters to form words. – Daedalus 14/2, 2020 at 1:43

C'mon. ;) I assume it removes all [leading/trailing] whitespace. Even back then I understood what is whitespace. – Galatia 14/2, 2020 at 14:2

If I wanted to remove non-breaking spaces ("\u00A0", AKA &nbsp;), I'd do something like:

require 'nokogiri'

doc = Nokogiri::HTML("&nbsp;")

s = doc.text # => " "

# s is the NBSP
s.ord.to_s(16)                   # => "a0"

# and here's the translate changing the NBSP to a SPACE
s.tr("\u00A0", ' ').ord.to_s(16) # => "20"

So tr("\u00A0", ' ') gets you where you want to be: at this point, the NBSP is a regular space.

tr is extremely fast and easy to use.
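Once the NBSP has been translated to a regular space, strip works as expected. If the character should be dropped rather than replaced, String#delete is another option. A small sketch follows; normalize_nbsp and drop_nbsp are hypothetical helper names chosen for illustration:

```ruby
# Hypothetical helper: turn NBSPs into plain spaces, then let strip
# trim the ends as usual.
def normalize_nbsp(str)
  str.tr("\u00A0", ' ').strip
end

# Hypothetical helper: drop every NBSP outright.
def drop_nbsp(str)
  str.delete("\u00A0")
end

p normalize_nbsp("139\u00A0")         # trailing NBSP becomes a space and is stripped
p normalize_nbsp("1\u00A0390\u00A0")  # the inner NBSP becomes a regular space
p drop_nbsp("1\u00A0390\u00A0")       # every NBSP is removed
```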

An alternative is to replace the encoded entity "&nbsp;" in the raw HTML before it's parsed. This example is simplified, but the same call works on an entire HTML file just as well as on a single entity in a string:

s = "&nbsp;"

s.gsub('&nbsp;', ' ') # => " "
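Keep in mind that a non-breaking space can also appear in raw HTML as the numeric entity &#160; or as the literal character, so replacing only '&nbsp;' may miss some. The sketch below handles those three spellings with fixed-string targets; replace_nbsp_in_html is a hypothetical helper name, and the hexadecimal form (&#xA0;) is deliberately left out to keep it short. As noted in the comments on the other answer, a blanket replace like this will also alter places where the page means to show the entity literally:

```ruby
# Hypothetical helper: replace common spellings of a non-breaking space
# in raw HTML with a plain space, using fixed-string targets only.
def replace_nbsp_in_html(html)
  html.gsub('&nbsp;', ' ')   # named entity
      .gsub('&#160;', ' ')   # decimal numeric entity
      .gsub("\u00A0", ' ')   # literal NBSP character
end

html = "<strong>139&nbsp;<small>Kč</small></strong>"
puts replace_nbsp_in_html(html)
```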

Using a fixed string for the target is faster than using a regular expression:

s = "&nbsp;" * 10000

require 'fruity'

compare do
  fixed { s.gsub('&nbsp;', ' ') }
  regex { s.gsub(/&nbsp;/, ' ') }
end

# >> Running each test 4 times. Test will take about 1 second.
# >> fixed is faster than regex by 2x ± 0.1

Regular expressions are useful when you need their capabilities, but they can slow code down considerably when a fixed string would do.
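fruity is a third-party gem. If it isn't installed, a rough version of the same comparison can be sketched with the Benchmark module from Ruby's standard library. time_gsub is a hypothetical helper name, and the run count of 100 is an arbitrary choice; the timings, and the ratio between them, will vary with the Ruby version and the machine:

```ruby
require 'benchmark'

# Hypothetical helper: run `runs` substitutions of `target` in `str` and
# return the elapsed seconds together with the last result.
def time_gsub(str, target, runs = 100)
  result = nil
  seconds = Benchmark.realtime do
    runs.times { result = str.gsub(target, ' ') }
  end
  [seconds, result]
end

s = "&nbsp;" * 10_000

fixed_seconds, fixed_result = time_gsub(s, '&nbsp;')   # fixed-string target
regex_seconds, regex_result = time_gsub(s, /&nbsp;/)   # regular-expression target

puts format('fixed: %.4fs  regex: %.4fs  same output: %s',
            fixed_seconds, regex_seconds, fixed_result == regex_result)
```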

Daedalus answered 14/2, 2020 at 2:8 Comment(0)
