How to avoid creating non-significant white space text nodes when creating a `Nokogiri::XML` or `Nokogiri::HTML` object

Asked 14/1, 2014 at 13:37 Answered 2/8, 2014 at 17:29

Solved xml-parsing html-parsing nokogiri

When parsing an indented XML document, non-significant white space text nodes are created from the white space between a closing tag and the next opening tag. For example, from the following XML:

<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>

whose string representation is as follows,

 "<note>\n  <to>Tove</to>\n  <from>Jani</from>\n  <heading>Reminder</heading>\n  <body>Don't forget me this weekend!</body>\n</note>\n"

the following Document is created:

#(Document:0x3fc07e4540d8 {
  name = "document",
  children = [
    #(Element:0x3fc07ec8629c {
      name = "note",
      children = [
        #(Text "\n  "),
        #(Element:0x3fc07ec8089c {
          name = "to",
          children = [ #(Text "Tove")]
          }),
        #(Text "\n  "),
        #(Element:0x3fc07e8d8064 {
          name = "from",
          children = [ #(Text "Jani")]
          }),
        #(Text "\n  "),
        #(Element:0x3fc07e8d588c {
          name = "heading",
          children = [ #(Text "Reminder")]
          }),
        #(Text "\n  "),
        #(Element:0x3fc07e8cf590 {
          name = "body",
          children = [ #(Text "Don't forget me this weekend!")]
          }),
        #(Text "\n")]
      })]
  })

Here, there are lots of white space nodes of type Nokogiri::XML::Text.

I would like to count the children of each node in a Nokogiri XML Document, and access the first or last child, excluding non-significant white space. I do not want to parse the white space nodes, or to have to distinguish them from significant text nodes such as "Tove" inside the element <to>. Here is an RSpec example of what I am looking for:

require 'nokogiri'
require_relative 'spec_helper'

xml_text = <<XML
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>
XML

xml = Nokogiri::XML(xml_text)

def significant_nodes(node)
  return 0
end

describe "Stackoverflow Question" do
  it "should return the number of significant nodes in nokogiri." do
    expect(significant_nodes(xml.css('note'))).to eq 4
  end
end

I want to know how to create the significant_nodes function.

If I change the XML to:

<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
  <footer></footer>
</note>

then when I create the Document, I still would like the footer represented; using config.noblanks is not an option.

Schaller answered 14/1, 2014 at 13:37 Comment(2)

Tove is placed inside the tag to, so you should find the tag, then just get the text: doc.css( 'to' ).text – Ebbie 14/1, 2014 at 13:47

amolnpujari.wordpress.com/2012/03/31/reading_huge_xml-rb I also found that ox is 5 times faster than nokogiri when reading a large XML file. I have also written a wrapper that lets you search through large XML files using ox and iterate over a specified element. gist.github.com/amolpujari/5966431 – Bundle 11/3, 2014 at 10:33

You can use the NOBLANKS option when parsing the XML string. Consider this example:

require 'nokogiri'

string = "<foo>\n  <bar>bar</bar>\n</foo>"
puts string
# <foo>
#   <bar>bar</bar>
# </foo>

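# Default parsing keeps the whitespace-only text nodes.
document_with_blanks = Nokogiri::XML.parse(string)

# With NOBLANKS, ignorable whitespace nodes are dropped.
document_without_blanks = Nokogiri::XML.parse(string) do |config|
  config.noblanks
end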

document_with_blanks.root.children.each { |child| p child }
#<Nokogiri::XML::Text:0x3ffa4e153dac "\n  ">
#<Nokogiri::XML::Element:0x3fdce3f78488 name="bar" children=[#<Nokogiri::XML::Text:0x3fdce3f781f4 "bar">]>
#<Nokogiri::XML::Text:0x3ffa4e15335c "\n">

document_without_blanks.root.children.each { |child| p child }
#<Nokogiri::XML::Element:0x3f81bef42034 name="bar" children=[#<Nokogiri::XML::Text:0x3f81bef43ee8 "bar">]>

The NOBLANKS option shouldn't remove empty nodes:

doc = Nokogiri.XML('<foo><bar></bar></foo>') do |config|
  config.noblanks
end

doc.root.children.each { |child| p child }
#<Nokogiri::XML::Element:0x3fad0fafbfa8 name="bar">

As the OP pointed out, the documentation on the Nokogiri website (and also on the libxml website) about the parser options is quite cryptic. The following is a specification of the behaviour of the NOBLANKS option:

require 'rspec/autorun'
require 'nokogiri'

def parse_xml(xml_string)
  Nokogiri.XML(xml_string) { |config| config.noblanks }
end

describe "Nokogiri NOBLANKS parser option" do

  it "removes whitespace nodes if they have siblings" do
    doc = parse_xml("<root>\n <child></child></root>")
    expect(doc.root.children.size).to eq(1)
    # Text is also a Node, so check for Element to prove the whitespace is gone.
    expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Element)
  end

  it "doesn't remove whitespace nodes if they have no siblings" do
    doc = parse_xml("<root>\n </root>")
    expect(doc.root.children.size).to eq(1)
    expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Text)
  end

  it "doesn't remove empty nodes" do
    doc = parse_xml('<root><child></child></root>')
    expect(doc.root.children.size).to eq(1)
    expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Element)
  end

end

Akkadian answered 14/1, 2014 at 14:3 Comment(7)

Brilliant! Thanks ever so much. EDIT: Not the right answer, reason given below. – Schaller 14/1, 2014 at 15:7

Sorry, was not the correct answer. The reason for that is that if I then add an empty tag, such as <empty></empty>, it would not get parsed and represented. Empty nodes need to be included unfortunately. – Schaller 14/1, 2014 at 15:11

@Schaller Actually the NOBLANKS option should keep empty nodes. Can you post the code that strips the footer node in your example? – Akkadian 14/1, 2014 at 15:31

You are right. It does seem like it includes empty nodes. Although, if you follow this link I give, it does say that it removes empty nodes. nokogiri.org/Nokogiri/XML/ParseOptions.html – Schaller 14/1, 2014 at 16:5

Thanks, this is the answer I'm looking for. Thanks! Seems like I misread the manual. – Schaller 14/1, 2014 at 16:8

@Schaller It's not your fault, it's the manual that is quite cryptic about this, the documentation for libxml isn't much more clear. I'll try to extend the answer to explain how NOBLANKS works. – Akkadian 14/1, 2014 at 16:20

@CodingMo, I just discovered Nokogiri's noblanks config option, but now I'm finding that it doesn't ignore every Text node that consists of only whitespace, as I would like. So finding your post was timely. However, there's another twist: see my post (feel free to add it to your post, or some modification thereof). Also, the 'strict' config option is the default, so most people are probably going to want config.strict.noblanks. – Cedillo 2/8, 2014 at 17:27

You can create a query that only returns element nodes, and ignores text nodes. In XPath, * only returns elements, so the query could look like (querying the whole doc):

doc.xpath('//note/*')

or if you want to use CSS:

doc.css('note > *')

If you want to implement your significant_nodes method, you would need to make the query relative to the node passed in:

def significant_nodes(node)
  node.xpath('./*').size
end

I don’t know how to do a relative query with CSS, you might need to stick with XPath.

Affra answered 14/1, 2014 at 15:26 Comment(4)

The trouble with doing .xpath('./*'), is that if you do it on an element with a text node that has significant text, those text nodes won't be represented. So if we take ` #(Element:0x3fc07e8d8064 { name = "from", children = [ #(Text "Jani")]})` and do the .xpath('./*') on it, it will not return the text node that has "Jani" in it. – Schaller 14/1, 2014 at 15:50

@Schaller well don’t use it on such nodes then :-) – Affra 14/1, 2014 at 16:4

that's a fair point and this is a great answer, I'll find it useful in the future! – Schaller 14/1, 2014 at 16:6

@Schaller you could use an XPath query like '//note/node()[self::* or self::text()[normalize-space()]]' to get elements and non-blank text nodes, although in this specific example that’s pretty much the same as using the noblanks option. – Affra 14/1, 2014 at 16:28

Nokogiri's noblanks config option doesn't remove all whitespace Text nodes when they have siblings:

describe "Nokogiri NOBLANKS parser option" do

  it "doesn't remove whitespace Text nodes if they're surrounded by non-whitespace Text node siblings" do
    doc = parse_xml("<root>1 <two></two> \n <three></three> \n <four></four> 5</root>")
    children = doc.root.children

    expect(children.size).to_not eq(5)
    expect(children.size).to eq(7)  #Because the two newline Text nodes are not ignored
    expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Node)
  end
end

I'm not sure why Nokogiri was programmed to work that way. I think it would be better to either ignore all whitespace Text nodes or not ignore any Text nodes.

Cedillo answered 2/8, 2014 at 17:29 Comment(0)
