How can I use Nokogiri to write a HUGE XML file?

2

9

I have a Rails application that uses delayed_job in a reporting feature to run some very large reports. One of these generates a massive XML file and it can take literally days in the bad, old way the code is written. I thought that, having seen impressive benchmarks on the internet, Nokogiri could afford us some nontrivial performance gains.

However, the only examples I can find involve using the Nokogiri Builder to create an xml object, then using .to_xml to write the whole thing. But there isn't enough memory in my zip code to handle that for a file of this size.

So can I use Nokogiri to stream or write this data out to file?

Shaw answered 9/2, 2011 at 1:52 Comment(2)
Usually string concatenation is sufficient for many XML-writing tasks; avoid building a tree whenever you can...Bust
The string concatenation was taking forever. Regular old builder is already showing an improvement. File can be over a gigabyte.Shaw
5

Nokogiri is designed to work in memory: you build a DOM and it converts that DOM to XML when asked. It's easy to use, but there are trade-offs, and holding the whole tree in memory is one of them.

You might want to look into using Erubis to generate the XML. In Rails we'd normally gather all the data in a controller before rendering. To save memory, you can instead put the logic in the template and have it iterate over your data as it renders, which should help with the resource demands.

If you need the XML in a file you might need to do that using redirection:

erubis options templatefile.erb > xmlfile

This is a very simple example, but it shows you could easily define a template to generate XML:

<% 
asdf = (1..5).to_a 
%>
<xml>
  <element>
<% asdf.each do |i| %>
    <subelement><%= i %></subelement>
<% end %>
  </element>
</xml>

which, when I run erubis test.erb, outputs:

<xml>
  <element>
    <subelement>1</subelement>
    <subelement>2</subelement>
    <subelement>3</subelement>
    <subelement>4</subelement>
    <subelement>5</subelement>
  </element>
</xml>
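
If you'd rather drive this from Ruby than from the command line, here is a minimal sketch of a related idea. Note that it swaps in the standard library's ERB rather than Erubis, and it renders one small template per record instead of one big template, so only a single row is ever held as a string. The names ROW_TEMPLATE, render_row and write_document are made up for illustration, and values are not XML-escaped here.

```ruby
require 'erb'

# One tiny template per row (hypothetical name); compiled once, reused for every record.
ROW_TEMPLATE = ERB.new("    <subelement><%= value %></subelement>\n")

# Renders a single row; `value` is picked up from this method's binding.
def render_row(value)
  ROW_TEMPLATE.result(binding)
end

# Writes the wrapper elements by hand and appends each rendered row to io.
def write_document(io, values)
  io << "<xml>\n  <element>\n"
  values.each { |value| io << render_row(value) }
  io << "  </element>\n</xml>\n"
end

write_document($stdout, 1..5)
```

Passing an open File instead of $stdout should write the same text straight to disk.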

EDIT:

The string concatenation was taking forever...

Yes, it can, simply because of garbage collection. You haven't shown how you're building your strings, but Ruby works better when you use << to append one string to another than when using +, because + allocates a new string on every call while << modifies the existing one.

It also might work better to not try to keep everything in a string, but instead to write it immediately to disk, appending to an open file as you go.
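
As a rough sketch of that idea, assuming the report rows can be enumerated one at a time: the method below appends each fragment to an open IO with <<, so no large string is ever built. The names xml_escape and write_report, the row hash keys, and the report.xml path are all hypothetical, not taken from the asker's code.

```ruby
require 'tmpdir'

# Minimal escaping for text and attribute values (hypothetical helper).
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

def xml_escape(value)
  value.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Appends each piece to io with <<; only one row's worth of text exists at a time.
def write_report(io, rows)
  io << %(<?xml version="1.0" encoding="UTF-8"?>\n)
  io << "<report>\n"
  rows.each do |row|
    io << %(  <row id="#{xml_escape(row[:id])}">#{xml_escape(row[:name])}</row>\n)
  end
  io << "</report>\n"
  io
end

# A lazy enumerator stands in for whatever produces the real report data.
rows = (1..3).lazy.map { |i| { id: i, name: "item #{i}" } }
path = File.join(Dir.tmpdir, 'report.xml')
File.open(path, 'w') { |file| write_report(file, rows) }
```

Because write_report only needs something that responds to <<, the same method works with a File, $stdout, or a StringIO in tests.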

Again, without code examples I'm shooting in the dark about what you might be doing or why things run slow.

Racket answered 9/2, 2011 at 2:50 Comment(2)
Your dark shooting is accurate, the Force is strong with this one. This code is actually for a ruby report that resides in lib, and I found that Builder, which does write to an io object on the fly, has mightily improved performance. Thanks!Shaw
Ah. Glad it helped. Just don't run out of disk or it will run infinitely slow.Racket
2

You don't need to build the whole XML document in memory with Nokogiri; just use Nokogiri to build whatever subtree of the document makes sense, and use Element#write_to to write one element at a time.

Here's an example that can write as long a document as you're willing to wait for:

#!/usr/bin/env ruby

require 'nokogiri'

if (count = ARGV[0].to_i) < 1
  $stderr.puts("Usage: #{File.basename(__FILE__)} <count>")
  exit 1
end

def build_child(index)
  builder = Nokogiri::XML::Builder.new do |xml|
    xml.child_element(index: index) do |child|
      child.text("This is child #{index}")
    end
  end
  builder.doc.root
end

nokogiri_options = { encoding: 'UTF-8' }

puts '<?xml version="1.0" encoding="UTF-8"?>'
puts '<root_element>'

(0...count).each do |index|
  child_element = build_child(index)
  child_element.write_to($stdout, nokogiri_options)
  puts
end

puts '</root_element>'

If you want to be extra-fancy (or support more complex Nokogiri options) you can even use Nokogiri to generate the XML declaration and root element by writing an empty root document to a StringIO:

require 'stringio' # StringIO is used below to capture the serialized root document

def build_root_doc
  builder = Nokogiri::XML::Builder.new do |xml|
    xml.root_element do |root|
      root.text("\n") # ensure separate opening/closing tags
    end
  end
  builder.doc
end

root_xml = StringIO.open do |tmp|
  build_root_doc.write_to(tmp, nokogiri_options)
  tmp.string
end
# <?xml version="1.0" encoding="UTF-8"?>
# <root_element>
# </root_element>

# split at start of closing tag
header, footer = %r{([^/]+)(</.*)}.match(root_xml)[1..2]

puts header

(0...count).each do |index|
  child_element = build_child(index)
  child_element.write_to($stdout, nokogiri_options)
  puts
end

puts footer

Output:

$ ./big-nokogiri.rb 2
<?xml version="1.0" encoding="UTF-8"?>
<root_element>
<child_element index="0">This is child 0</child_element>
<child_element index="1">This is child 1</child_element>
</root_element>

$ ./big-nokogiri.rb 1000000 | tail -f
<child_element index="999996">This is child 999996</child_element>
<child_element index="999997">This is child 999997</child_element>
<child_element index="999998">This is child 999998</child_element>
<child_element index="999999">This is child 999999</child_element>
</root_element>
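
One caveat on the header/footer split: the pattern %r{([^/]+)(</.*)} assumes there is no "/" anywhere before the closing tag, so it would likely pick the wrong header if the root element carried, say, a namespace URI. A possible alternative, sketched below with plain String#rpartition, splits at the last "</" instead. The split_root name and the sample string (which stands in for Nokogiri's output) are made up for illustration.

```ruby
# Splits a serialized, childless root document at its final closing tag.
# Returns [header, footer]; raises if no closing tag is present.
def split_root(root_xml)
  header, marker, rest = root_xml.rpartition('</')
  raise ArgumentError, 'no closing tag found' if marker.empty?
  [header, marker + rest]
end

# Hard-coded sample standing in for the StringIO contents built above.
sample = %(<?xml version="1.0" encoding="UTF-8"?>\n) +
         %(<root_element xmlns="http://example.com/ns">\n</root_element>\n)

header, footer = split_root(sample)
puts header
puts footer
```

This relies on the root document having no children when it is serialized, which is the case in the example above.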
Tintometer answered 7/4, 2022 at 17:22 Comment(0)
