How to use the XML SAX parser to read and write a large XML file?

I'm trying to remove all the project1 nodes (along with their child elements) from the sample XML document below (the original document is about 30 GB) using a SAX parser. Either writing a separate modified file or editing the file in place would be fine.

sample.xml

<ROOT>
    <test src="http://dfs.com">Hi</test>
    <project1>This is old data<foo></foo></project1>
    <bar>
        <project1>ty</project1>
        <foo></foo>
    </bar>
</ROOT>

Here is my attempt:

parser.py

from xml.sax.handler import ContentHandler
import xml.sax

class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self, out_file):
        self._charBuffer = []
        self._result = []
        self._out = open(out_file, 'w')

    def _createElement(self, name, attrs):
        attributes = attrs.items()
        if attributes:
            out = ''
            for key, value in attributes:
                out += ' {}={}'.format(key, value)
            return '<{}{}>'.format(name, out)
        return '<{}>'.format(name)


    def _getCharacterData(self):
        data = ''.join(self._charBuffer).strip()
        self._charBuffer = []
        self._out.write(data.strip()) #remove strip() if whitespace is important

    def parse(self, f):
        xml.sax.parse(f, self)

    def characters(self, data):
        self._charBuffer.append(data)

    def startElement(self, name, attrs):
        if not name == 'project1': 
            self._result.append({})
            self._out.write(self._createElement(name, attrs))

    def endElement(self, name):
        if not name == 'project1': self._result[-1][name] = self._getCharacterData()

MyHandler('out.xml').parse("sample.xml")

I can't get it to work.

Gamache answered 19/2, 2017 at 8:43 Comment(10)
What's the problem with processing the data as text? Simply: check flag, is it down, grab line, is it project1, raise flag, write/append or not, repeat... Just an outline of a strategy. - Unexperienced
But this approach will result in loading the whole file into memory. - Gamache
I mean: read line - process line - update state - decide whether to write or not. Don't work with the whole file at once. There is no need. - Unexperienced
You can even use a buffer to reduce the write count. For example, flush the buffer only every 1000 lines. Measure it yourself if it's important. - Unexperienced
What you are doing now is overcomplicated. SAX parsing is good for some situations, but it's just an abstraction over simply reading the XML file line by line and dealing with events (startElement, endElement). Every time, a bunch of objects would be created, and then you have to grab the data and produce a new bunch of objects just to write this data to a file. - Unexperienced
This was the first task given; there are many follow-up tasks that deal with modifying the XML, such as modifying the attributes of a specific element. So I thought it would be better to get a SAX-based answer. - Gamache
elementtree.iterparse is easier to use, and allows good control over the objects created by the parser. - Slivovitz
@Slivovitz I saw solutions using iterparse that do only the parsing job, but I couldn't find any about parsing and writing serially. - Gamache
@ar7max: The problem with processing XML as text is well known: it leads to brittle solutions that break in a myriad of ways when perfectly reasonable variations in the XML occur. Please do not make such recommendations. Thanks. - Baudekin
Yesterday I filtered XML using simple text processing and nothing broke. Want to know why? 1) It's a text file 2) I know how tags work 3) A little magic 4) Do you know how parsers work? They read TEXT. Now he needs to remove redundant elements from XML. Want to use SAX (or similar)? Use SAX (or similar) (I don't even know why my SAX solution received a -1 from you, maybe it's a kind of joke). Do you need to use SAX (or similar)? No. Want to know why? goto 1. - Unexperienced

You could use an xml.sax.saxutils.XMLFilterBase implementation to filter out your project1 nodes.

Instead of assembling the XML strings yourself, you could use xml.sax.saxutils.XMLGenerator.

The following is Python 3 code; adjust the super() calls if you require Python 2.

from xml.sax import make_parser
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class Project1Filter(XMLFilterBase):
    """This decides which SAX events to forward to the ContentHandler

    We will not forward events when we are inside any element with a
    name specified in the 'tag_names_to_exclude' parameter
    """

    def __init__(self, tag_names_to_exclude, parent=None):
        super().__init__(parent)

        # set of tag names to exclude
        self._tag_names_to_exclude = tag_names_to_exclude

        # _project_1_count counts the currently open excluded elements
        self._project_1_count = 0

    def _forward_events(self):
        # will return True when we are not inside an excluded element
        return self._project_1_count == 0

    def startElement(self, name, attrs):
        if name in self._tag_names_to_exclude:
            self._project_1_count += 1

        if self._forward_events():
            super().startElement(name, attrs)

    def endElement(self, name):
        if self._forward_events():
            super().endElement(name)

        if name in self._tag_names_to_exclude:
            self._project_1_count -= 1

    def characters(self, content):
        if self._forward_events():
            super().characters(content)

    # override other content handler methods on XMLFilterBase as necessary


def main():
    tag_names_to_exclude = {'project1', 'project2', 'project3'}
    reader = Project1Filter(tag_names_to_exclude, make_parser())

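    with open('out-small.xml', 'w', encoding='utf-8') as f:
        # keep the encoding declared in the XML header in sync with the file's encoding
        handler = XMLGenerator(f, encoding='utf-8')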
        reader.setContentHandler(handler)
        reader.parse('input.xml')


if __name__ == "__main__":
    main()
Nahum answered 23/2, 2017 at 9:19 Comment(5)
Nice, even with empty lines. Want to check the time cost. - Unexperienced
26 seconds slower on a ~700 MB file. - Unexperienced
Hi @Jeremy. Your solution works for me. May I know how to do the same for a list of nodes, say project1, project2, project3? - Gamache
if name in ['project1','project2','project3']: self._project_1_count += 1, and the same for the endElement method. - Unexperienced
@AvinashRaj I have updated the code to exclude a set of tag names. - Nahum