Parsing large XML (500 MB) with Node.js
I am using isaacs' sax (also recommended by La Gentz) to parse a huge XML file.

The process uses about 650 MB of memory. How can I reduce this, or allow Node to use more?

FATAL ERROR: CALL_AND_RETRY_0 Allocation failed - process out of memory

My XML file is larger than 300 MB, and it could grow to 1 GB.

Stagy answered 3/1, 2012 at 2:18 Comment(4)
Sounds like you need to allow for more memory usage in your php.ini file. Look for memory_limit in your .ini file and change the value to allow for the largest file you expect to handle.Cottony
Are you trying to save the XML file as a JSON file? Or do you need to keep the entire JavaScript object in memory?Olivier
I am using node.js - I could not locate a parameter to raise the memory limit. (man node, lots of options)Stagy
@DeaDEnD I don't need the entire tree. Getting the error while parsing: parser.write(file_buf.toString('utf8'), lenght).close();Stagy
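On the memory-limit point raised above: V8's old-generation heap cap can be raised with Node's --max-old-space-size flag, given in megabytes (app.js below is a placeholder for your script):

```shell
# Allow the V8 old-generation heap to grow to ~4 GB before the
# "Allocation failed - process out of memory" abort kicks in.
node --max-old-space-size=4096 app.js
```

Raising the limit only postpones the problem for a growing file, though; streaming the parse (as in the accepted answer) keeps memory flat regardless of file size.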
You should stream the file into the parser; that's the whole point of a streaming parser, after all.

var fs = require('fs');
var sax = require('sax');
var parser = sax.createStream(true, {}); // strict mode, default options
fs.createReadStream(file).pipe(parser);  // `file` is the path to your XML
Olivier answered 3/1, 2012 at 3:23 Comment(10)
This is the way to do it if you don't want/need the whole document in memory. Node is actually not a great solution because it is single-threaded. So, while parsing this enormous document, the process won't be able to do anything else, such as respond to HTTP requests.Berkow
@danmactough, what would you recommend to use? For now this solution is great and works for me. I go through that document and let my worker do the one-time job, which is not critical.Stagy
@DeaDEnD, thanks. Do you or others know how to emit an end signal on that parser, so that the parser stops and parser.onend is called while parsing?Stagy
You could try calling readstream.destroy() to stop the stream from reading the file and parser.end() to signal the parser that the stream has ended.Olivier
@Stagy If you don't care about the blocking, then node is fine. You'll get better performance out of one of the C++-based parsers -- sax.js, while awesome, is pure JavaScript. Not sure if other XML parsers provide a Stream interface, though.Berkow
@DeaDEnD I had tried parser.end, but got an error. What I am doing now is if (index > max) { parser.emit('end'); return } which stops after a few more iterations. Thank you for your help.Stagy
@danmactough, I thought you would come up with C++, but that is out of the question :). My team and I are not so good with C++. If it gets critical, I'll consider C++ or other languages.Stagy
@Stagy What I meant was continue using node, but instead of sax.js, use libxmljs or another C++ parser. They all allow you to continue coding in JavaScript without knowing a lick about C++. :)Berkow
@Stagy & @DeaDEnd You may want to look at .destroySoon(). nodejs.org/docs/latest/api/streams.html#stream.destroySoonBerkow
Forgive the shameless plug... I have built a simpler parser on top of sax that makes things super easy: github.com/matthewmatician/xml-flow.Twobit

© 2022 - 2024 — McMap. All rights reserved.