Parsing large file with SaxMachine seems to be loading the whole file into memory

I have a 1.6 GB XML file, and when I parse it with SAX Machine it does not seem to be streaming or eating the file in chunks - rather, it appears to be loading the whole file into memory (or maybe there is a memory leak somewhere?), because my Ruby process climbs upwards of 2.5 GB of RAM. I don't know where it stops growing because I ran out of memory.

On a smaller file (50 MB) it also appears to be loading the whole file. My task iterates over the records in the XML file and saves each record to a database. It takes about 30 seconds of "idling" and then all of a sudden the database queries start executing.

I thought SAX was supposed to allow you to work with large files like this without loading the whole thing in memory.

Is there something I am overlooking?

Many thanks

Update to add code sample

class FeedImporter

  class FeedListing
    include ::SAXMachine

    element :id
    element :title
    element :description
    element :url

    def to_hash
      {}.tap do |hash|
        self.class.column_names.each do |key|
          hash[key] = send(key)
        end
      end
    end
  end

  class Feed
    include ::SAXMachine
    elements :listing, :as => :listings, :class => FeedListing
  end

  def perform
    open('~/feeds/large_feed.xml') do |file|

      # I think that SAXMachine is trying to load all of the listing elements into this one Ruby object.
      puts 'Parsing'
      feed = Feed.parse(file)

      # We are now iterating over each of the listing elements, but they have been "parsed" from the feed already.
      puts 'Importing'
      feed.listings.each do |listing|
        Listing.import(listing.to_hash)
      end

    end
  end

end

As you can see, I don't care about the <listings> element in the feed. I just want the attributes of each <listing> element.

The output looks like this:

Parsing
... wait forever
Importing (actually, I never see this on the big file (1.6 GB) because too much memory is used) :(
Bobo answered 8/2, 2012 at 19:12 Comment(1)
Simple answer to your question: yes, there is something you are overlooking. Unfortunately you haven't told us what it is. No-one can find memory leaks in code they can't see.Coenobite

Here's a Nokogiri::XML::Reader loop that handles each listing's XML as it is read, so you can process each Listing without loading the entire document into memory:

reader = Nokogiri::XML::Reader(file)
while reader.read
  # Act only on opening <listing> tags; everything else streams past.
  if reader.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT && reader.name == 'listing'
    listing = FeedListing.parse(reader.outer_xml)
    Listing.import(listing.to_hash)
  end
end

If listing elements could be nested, and you wanted to parse the outermost listings as single documents, you could do this:

require 'rubygems'
require 'nokogiri'


# Monkey-patch Nokogiri to make this easier
class Nokogiri::XML::Reader
  def element?
    node_type == TYPE_ELEMENT
  end

  def end_element?
    node_type == TYPE_END_ELEMENT
  end

  def opens?(name)
    element? && self.name == name
  end

  def closes?(name)
    (end_element? && self.name == name) || 
      (self_closing? && opens?(name))
  end

  # Advance the reader until this element's matching close tag
  # (or return immediately if the element is self-closing).
  def skip_until_close
    raise "node must be TYPE_ELEMENT" unless element?
    name_to_close = self.name

    if self_closing?
      # DONE!
    else
      level = 1
      while read
        level += 1 if opens?(name_to_close)
        level -= 1 if closes?(name_to_close)

        return if level == 0
      end
    end
  end

  # Yield the outer XML of each element named +name+, then skip past its
  # contents so nested elements with the same name are not yielded again.
  def each_outer_xml(name, &block)
    while read
      if opens?(name)
        yield(outer_xml)
        skip_until_close
      end
    end
  end

end

Once you have it monkey-patched, it's easy to deal with each listing individually:

File.open(File.expand_path('~/feeds/large_feed.xml')) do |file|
  reader = Nokogiri::XML::Reader(file)
  reader.each_outer_xml('listing') do |outer_xml|
    listing = FeedListing.parse(outer_xml)
    Listing.import(listing.to_hash)
  end
end
Finance answered 10/2, 2012 at 6:40 Comment(2)
Awesome, that works super well. It seems pretty fast, too, as my db on my local machine becomes the bottleneck for importing. Thanks, John!Bobo
I was able to parse my large xml doc using this approach along with the canonical sax-machine gem. Thanks!Wyon

Unfortunately there are now three different repos for sax-machine. And worse, the gemspec version was not bumped.

Despite the comment on Greg Weber's blog, I don't think this code was integrated into pauldix's or ezkl's forks. To use the lazy, fiber-based version of the code, I think you need to specifically reference gregwebs's version in your Gemfile like this:

gem 'sax-machine', :git => 'https://github.com/gregwebs/sax-machine'
Lysias answered 7/11, 2012 at 16:37 Comment(1)
Looks like you're correct. The Github network graph (github.com/gregwebs/sax-machine/network ) shows that Greg's changes haven't been merged into the canonical SAXMachine repo (maintained by pauldix)Tomikotomkiel

I forked sax-machine so that it uses constant memory: https://github.com/gregwebs/sax-machine

Good news: there is a new maintainer who is planning to merge my changes. The new maintainer and I have been using my fork without issue for a year now.
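
If you want to try the fork before those changes are merged upstream, you can point Bundler straight at it; this is the same Gemfile reference shown in another answer here:

gem 'sax-machine', :git => 'https://github.com/gregwebs/sax-machine'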

Campball answered 30/5, 2012 at 14:41 Comment(2)
This branch seems out of sync with the canonical repository and hasn't been touched in two years. It was also throwing errors about yielding from a root fiber...Wyon
I too get the "(FiberError) can't yield from root fiber" error, looks like this branch has been abandoned.Tarkington

You are right, SAXMachine reads the whole document eagerly. Have a look at its handler source: https://github.com/pauldix/sax-machine/blob/master/lib/sax-machine/sax_handler.rb

To solve your problem, I would use http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/SAX/Parser.html directly and implement the handler yourself.
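
For illustration, a minimal hand-rolled SAX handler might look something like this sketch. The element names and the Listing.import call are taken from the question's code; the ListingHandler class name and the string-keyed hash are my own assumptions:

require 'nokogiri'

# Sketch only: assumes each <listing> contains id/title/description/url
# child elements, as in the question's FeedListing class.
class ListingHandler < Nokogiri::XML::SAX::Document
  FIELDS = %w[id title description url].freeze

  def start_element(name, attrs = [])
    if name == 'listing'
      @listing = {}            # start collecting a new record
    elsif @listing && FIELDS.include?(name)
      @current = name          # remember which field we're inside
      @buffer  = ''
    end
  end

  def characters(string)
    @buffer << string if @buffer
  end

  def end_element(name)
    if name == 'listing'
      Listing.import(@listing) if @listing   # hand off the finished record
      @listing = nil
    elsif @listing && name == @current
      @listing[@current] = @buffer
      @current = @buffer = nil
    end
  end
end

parser = Nokogiri::XML::SAX::Parser.new(ListingHandler.new)
parser.parse(File.open(File.expand_path('~/feeds/large_feed.xml')))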

Princeling answered 9/2, 2012 at 8:17 Comment(1)
Thanks for confirming my suspicion. It's a shame sax-machine doesn't do lazy evaluation or provide a true callback mechanism - that would be splendid.Bobo
