Recursively removing empty tags from an XML document with Nokogiri?

4

1

I have a nested XML document that looks like this:

<?xml version="1.0"?>
<phone>
  <name>test</name>
  <descr>description</descr>
  <empty/>
  <lines>
    <line>12345</line>
    <css/>
  </lines>
</phone>

I need to remove all empty XML nodes, like <empty/> and <css/>.

I ended up with something like:

doc = Nokogiri::XML::DocumentFragment.parse <<-EOXML
<phone>
  <name>test</name>
  <descr>description</descr>
  <empty/>
  <lines>
    <line>12345</line>
    <css/>
  </lines>
</phone>
EOXML

phone = doc.css("phone")
phone.children.each do | child |
    child.remove if child.inner_text == ''
end

The code above removes only the top-level empty tag, <empty/>. It never descends into the nested <lines> block, so <css/> is left in place. I think I need a recursive strategy here. I have read the Nokogiri documentation carefully and checked many examples, but I haven't found a solution yet.

How can I fix this?

I'm using Ruby 1.9.3 and Nokogiri 1.5.10.

Strigose answered 21/11, 2013 at 14:10 Comment(0)
R
3

A latecomer with a different approach, hoping to add additional insight. This approach removes the annoying extra new lines and gives you the option to keep the empty fields that have attributes with values set.

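require 'nokogiri'
# String#blank? and Hash#blank? come from ActiveSupport, not core Ruby
require 'active_support/core_ext/object/blank'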

doc = Nokogiri::XML::Document.parse <<-EOXML
<phone>
  <name>test</name>
  <descr>description</descr>
  <empty/>
  <lines>
    <line>12345</line>
    <css/>
  </lines>
</phone>
EOXML

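def traverse_and_clean(kid)
  # Clean the children first, so a parent emptied by the cleanup is removed too.
  kid.children.each { |child| traverse_and_clean(child) }
  # Whitespace-only text nodes are blank as well, which drops the leftover blank lines.
  kid.remove if kid.content.blank?
end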

traverse_and_clean(doc)

Output

<?xml version="1.0"?>
<phone>
  <name>test</name>
  <descr>description</descr>
  <lines>
    <line>12345</line>
  </lines>
</phone>

If you need to keep empty elements that have attributes set, slightly change the traverse_and_clean method:

def traverse_and_clean(kid)
  kid.children.each { |child| traverse_and_clean(child) }
  # Keep empty elements that carry at least one attribute.
  kid.remove if kid.content.blank? && kid.attributes.blank?
end
Rezzani answered 18/2, 2015 at 3:20 Comment(1)
This is the only recursive solution, thanks! (It also cleans nodes that become empty after their empty children are deleted.) - Caveator
T
2

You should be able to find all nodes without any text using the XPath "/phone//*[not(text())]".

require 'nokogiri'

doc = Nokogiri::XML::Document.parse <<-EOXML
<phone>
  <name>test</name>
  <descr>description</descr>
  <empty/>
  <lines>
    <line>12345</line>
    <css/>
  </lines>
</phone>
EOXML

doc.xpath("/phone//*[not(text())]").remove

puts doc.to_s.gsub(/\n\s*\n/, "\n")
#=> <?xml version="1.0"?>
#=> <phone>
#=>   <name>test</name>
#=>   <descr>description</descr>
#=>   <lines>
#=>     <line>12345</line>
#=>   </lines>
#=> </phone>
Trimetrogon answered 21/11, 2013 at 14:42 Comment(2)
+1. Okay! Nokogiri::XML::NodeSet also has a remove method; I was not aware of that! - Governess
Something to note: this alters the document in place. I hadn't counted on it being invasive to my XML initially. Works great though! :) The return value of the remove call is the set of tags that were pulled out, FWIW. - Honeyhoneybee
G
1
require 'nokogiri'

doc = Nokogiri::XML::Document.parse <<-EOXML
<phone>
  <name>test</name>
  <descr>description</descr>
  <empty/>
  <lines>
    <line>12345</line>
    <css/>
  </lines>
</phone>
EOXML

nodes = doc.xpath("//phone//*[not(text())]")

nodes.each{|n| n.remove if n.elem? }

puts doc

Output

<?xml version="1.0"?>
<phone>
  <name>test</name>
  <descr>description</descr>

  <lines>
    <line>12345</line>

  </lines>
</phone>
Governess answered 21/11, 2013 at 14:25 Comment(2)
I don't know how to remove those blank lines. If anybody knows the trick, please let me know! :( - Governess
Do you know why your solution doesn't work if the XML fragment is put entirely on one line, like this: <phone><name>test</name><descr>description</descr><empty/><lines><line>12345</line><css/></lines></phone> - Strigose
C
1

Similar to @JustinKo's answer only using CSS selectors:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<?xml version="1.0"?>
<phone>
  <name>test</name>
  <descr>description</descr>
  <empty/>
  <lines>
    <line>12345</line>
    <css/>
  </lines>
</phone>
EOT

doc.search(':empty').remove
puts doc.to_xml

Looking at what it did:

<?xml version="1.0"?>
<phone>
  <name>test</name>
  <descr>description</descr>

  <lines>
    <line>12345</line>

  </lines>
</phone>

Nokogiri implements a lot of jQuery's selectors, so it's always worth looking to see what those extensions can do.

Culverin answered 27/11, 2013 at 3:12 Comment(0)
