Inserting and deleting XML nodes and elements using Nokogiri
I want to extract parts of an XML file and leave a note in that file where something was extracted, like "here something was extracted".

I'm trying to do this with Nokogiri, but I can't find documentation on how to:

  1. delete all children of a Nokogiri::XML::Element
  2. change the inner text of that element as a whole

Any clues?

Kentiga answered 13/8, 2009 at 21:44 Comment(2)
Nokogiri's tutorials on modifying an HTML / XML document cover this. Also, node.unlink will remove a node from the DOM. – Oospore
See "How to Ask". As is, this is lacking important information, such as a minimal XML example for input and the expected output, plus the code that was written toward solving the problem. – Oospore
17

Nokogiri makes this pretty easy. Using this document as an example, the following code will find all vitamins tags, remove their children (and the children's children, etc.), and change their inner text to say "Children removed.":

require 'nokogiri'

# The block form of File.open closes the file automatically.
doc = File.open('sample.xml', 'r') { |io| Nokogiri::XML(io) }

doc.search('//vitamins').each do |node|
  node.children.remove
  node.content = 'Children removed.'
end

A given food node will go from looking like this:

<food>
    <name>Avocado Dip</name>
    <mfr>Sunnydale</mfr>
    <serving units="g">29</serving>
    <calories total="110" fat="100"/>
    <total-fat>11</total-fat>
    <saturated-fat>3</saturated-fat>
    <cholesterol>5</cholesterol>
    <sodium>210</sodium>
    <carb>2</carb>
    <fiber>0</fiber>
    <protein>1</protein>
    <vitamins>
        <a>0</a>
        <c>0</c>
    </vitamins>
    <minerals>
        <ca>0</ca>
        <fe>0</fe>
    </minerals>
</food>

to this:

<food>
    <name>Avocado Dip</name>
    <mfr>Sunnydale</mfr>
    <serving units="g">29</serving>
    <calories total="110" fat="100"/>
    <total-fat>11</total-fat>
    <saturated-fat>3</saturated-fat>
    <cholesterol>5</cholesterol>
    <sodium>210</sodium>
    <carb>2</carb>
    <fiber>0</fiber>
    <protein>1</protein>
    <vitamins>Children removed.</vitamins>
    <minerals>
        <ca>0</ca>
        <fe>0</fe>
    </minerals>
</food>
Carbazole answered 14/8, 2009 at 13:57 Comment(0)
3

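You can do it like this:

doc = Nokogiri::XML(your_document)
note = doc.search("note") # find all elements with the node name "note"
note.remove

Note that this removes the <note> elements themselves, not just their children. To keep each element and drop only its children, call children.remove on it instead. I am not sure how to "change the inner_text" of all note elements; as far as I can tell, inner_text only reads the text of a Nokogiri::XML::Element, and the text is set by assigning to content.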

Kryska answered 14/8, 2009 at 10:40 Comment(0)
3

The previous Nokogiri example pointed me in the right direction, but using doc.search with //vitamins left me with malformed output, so I used a CSS selector instead:

require "rubygems"
require "nokogiri"

f = File.open("food.xml")
doc = Nokogiri::XML(f)

doc.css("food vitamins").each do |node|
  puts "\r\n[debug] Before: vitamins= \r\n#{node}"
  node.children.remove
  node.content = "Children removed"
  puts "\r\n[debug] After: vitamins=\r\n#{node}"
end
f.close

Which results in:

[debug] Before: vitamins=
<vitamins>
        <a>0</a>
        <c>0</c>
    </vitamins>

[debug] After: vitamins=
<vitamins>Children removed</vitamins>
Hemicellulose answered 20/1, 2010 at 10:7 Comment(0)
2

Here's what I'd do:

Parse some XML first:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="nutrition.css"?>
<nutrition>

  <daily-values>
    <total-fat units="g">65</total-fat>
    <saturated-fat units="g">20</saturated-fat>
    <cholesterol units="mg">300</cholesterol>
    <sodium units="mg">2400</sodium>
    <carb units="g">300</carb>
    <fiber units="g">25</fiber>
    <protein units="g">50</protein>
  </daily-values>

  <food>
    <name>Avocado Dip</name>
    <mfr>Sunnydale</mfr>
    <serving units="g">29</serving>
    <calories total="110" fat="100"/>
    <total-fat>11</total-fat>
    <saturated-fat>3</saturated-fat>
    <cholesterol>5</cholesterol>
    <sodium>210</sodium>
    <carb>2</carb>
    <fiber>0</fiber>
    <protein>1</protein>
    <vitamins>
      <a>0</a>
      <c>0</c>
    </vitamins>
    <minerals>
      <ca>0</ca>
      <fe>0</fe>
    </minerals>
  </food>

</nutrition>
EOT

If I want to delete a node's content, I can remove its children or assign nil to its content:

doc.at('total-fat').to_xml # => "<total-fat units=\"g\">65</total-fat>"
doc.at('total-fat').children.remove
doc.at('total-fat').to_xml # => "<total-fat units=\"g\"/>"

or:

doc.at('saturated-fat').to_xml # => "<saturated-fat units=\"g\">20</saturated-fat>"
doc.at('saturated-fat').content = nil
doc.at('saturated-fat').to_xml # => "<saturated-fat units=\"g\"/>"

If I want to extract the text from a node for use some other way:

food = doc.at('food').text
# => "\n    Avocado Dip\n    Sunnydale\n    29\n    \n    11\n    3\n    5\n    210\n    2\n    0\n    1\n    \n      0\n      0\n    \n    \n      0\n      0\n    \n  "

or:

food = doc.at('food').children.map(&:text)
# => ["\n    ",
#     "Avocado Dip",
#     "\n    ",
#     "Sunnydale",
#     "\n    ",
#     "29",
#     "\n    ",
#     "",
#     "\n    ",
#     "11",
#     "\n    ",
#     "3",
#     "\n    ",
#     "5",
#     "\n    ",
#     "210",
#     "\n    ",
#     "2",
#     "\n    ",
#     "0",
#     "\n    ",
#     "1",
#     "\n    ",
#     "\n      0\n      0\n    ",
#     "\n    ",
#     "\n      0\n      0\n    ",
#     "\n  "]

or however else you want to mangle the text.

And, if you want to mark that you've removed the text:

doc.at('food').content = 'REMOVED'
doc.at('food').to_xml # => "<food>REMOVED</food>"

You could also use an XML comment instead:

doc.at('food').children = '<!-- REMOVED -->'
doc.at('food').to_xml # => "<food>\n  <!-- REMOVED -->\n</food>"
Oospore answered 8/10, 2015 at 0:39 Comment(0)