Read XML file while it is being written (in Python)
Asked Answered
H

2

1

I have to monitor an XML file being written by a tool running all the day. But the XML file is properly completed and closed only at the end of the day.

Same constraints as XML stream processing:

  1. Parse an incomplete XML file on-the-fly and trigger actions
  2. Keep track of the last position within the file to avoid processing it again from the beginning

On answer of Need to read XML files as a stream using BeautifulSoup in Python, slezica suggests xml.sax, xml.etree.ElementTree and cElementTree. But no success with my attempts to use xml.etree.ElementTree and cElementTree. There are also xml.dom, xml.parsers.expat and lxml but I do not see support for "on-the-fly parsing".

I need more obvious examples...

I am currently using Python 2.7 on Linux, but I will migrate to Python 3.x => please also provide tips on new Python 3.x features. I also use watchdog to detect XML file modifications => Optionally, reuse the watchdog mechanism. Optionally support also Windows.

Please provide easy to understand/maintain solutions. If it is too complex, I may just use tell()/seek() to move within the file, use stupid text search in the raw XML and finally extract the values using basic regex.


XML sample:

<dfxml xmloutputversion='1.0'>
   <creator version='1.0'>
     <program>TCPFLOW</program>
     <version>1.4.6</version>
   </creator>
   <configuration>
     <fileobject>
       <filename>file1</filename>
       <filesize>288</filesize>
       <tcpflow packets='12' srcport='1111' dstport='2222' family='2' />
     </fileobject>
     <fileobject>
       <filename>file2</filename>
       <filesize>352</filesize>
       <tcpflow packets='12' srcport='3333' dstport='4444' family='2' />
     </fileobject>
     <fileobject>
       <filename>file3</filename>
       <filesize>456</filesize>
       ...
       ...

First test using SAX failed:

import xml.sax

class StreamHandler(xml.sax.handler.ContentHandler):
    def startElement(self, name, attrs):
        print 'start: name=', name
    def endElement(self, name):
        print 'end:   name=', name
        if name == 'root':
            raise StopIteration

if __name__ == '__main__':
    parser = xml.sax.make_parser()
    parser.setContentHandler(StreamHandler())
    with open('f.xml') as f:
        parser.parse(f)

Shell:

$ while read line; do echo $line; sleep 1; done <i.xml >f.xml &
...
$ ./test-using-sax.py
start: name= dfxml
start: name= creator
start: name= program
end:   name= program
start: name= version
end:   name= version
Traceback (most recent call last):
  File "./test-using-sax.py", line 17, in <module>
    parser.parse(f)
  File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib64/python2.7/xml/sax/xmlreader.py", line 125, in parse
    self.close()
  File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 220, in close
    self.feed("", isFinal = 1)
  File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 214, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib64/python2.7/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: report.xml:15:0: no element found
Hesperidin answered 6/6, 2017 at 16:16 Comment(0)
H
1

Three hours after posting my question, no answer received. But I have finally implemented the simple example I was looking for.

My inspiration is from saaj's answer and is based on xml.sax and watchdog.

from __future__ import print_function, division
import time
import watchdog.events
import watchdog.observers
import xml.sax

class XmlStreamHandler(xml.sax.handler.ContentHandler):
  def startElement(self, tag, attributes):
    print(tag, 'attributes=', attributes.items())
    self.tag = tag
  def characters(self, content):
    print(self.tag, 'content=', content)

class XmlFileEventHandler(watchdog.events.PatternMatchingEventHandler):
  def __init__(self):
    watchdog.events.PatternMatchingEventHandler.__init__(self, patterns=['*.xml'])
    self.file = None
    self.parser = xml.sax.make_parser()
    self.parser.setContentHandler(XmlStreamHandler())
  def on_modified(self, event):
    if not self.file:
      self.file = open(event.src_path)
    self.parser.feed(self.file.read())

if __name__ == '__main__':
  observer = watchdog.observers.Observer()
  event_handler = XmlFileEventHandler()
  observer.schedule(event_handler, path='.')
  try:
    observer.start()
    while True:
      time.sleep(10)
  finally:
    observer.stop()
    observer.join()

While the script is running, do not forget to touch one XML file, or simulate the on-the-fly writing using the following command:

while read line; do echo $line; sleep 1; done <in.xml >out.xml &
Hesperidin answered 6/6, 2017 at 19:47 Comment(0)
H
1

Since yesterday I found the Peter Gibson's answer about the undocumented xml.etree.ElementTree.XMLTreeBuilder._parser.EndElementHandler.

This example is similar to the other one but uses xml.etree.ElementTree (and watchdog).

It does not work when ElementTree is replaced by cElementTree :-/

import time
import watchdog.events
import watchdog.observers
import xml.etree.ElementTree

class XmlFileEventHandler(watchdog.events.PatternMatchingEventHandler):
    def __init__(self):
        watchdog.events.PatternMatchingEventHandler.__init__(self, patterns=['*.xml'])
        self.xml_file = None
        self.parser = xml.etree.ElementTree.XMLTreeBuilder()
        def end_tag_event(tag):
            node = self.parser._end(tag)
            print 'tag=', tag, 'node=', node
        self.parser._parser.EndElementHandler = end_tag_event

    def on_modified(self, event):
        if not self.xml_file:
            self.xml_file = open(event.src_path)
        buffer = self.xml_file.read()
        if buffer:
            self.parser.feed(buffer)

if __name__ == '__main__':
    observer = watchdog.observers.Observer()
    event_handler = XmlFileEventHandler()
    observer.schedule(event_handler, path='.')
    try:
        observer.start()
        while True:
            time.sleep(10)
    finally:
        observer.stop()
        observer.join()

While the script is running, do not forget to touch one XML file, or simulate the on-the-fly writing using this one line script:

while read line; do echo $line; sleep 1; done <in.xml >out.xml &

For information, the xml.etree.ElementTree.iterparse does not seem to support a file being written. My test code:

from __future__ import print_function, division
import xml.etree.ElementTree

if __name__ == '__main__':
    context = xml.etree.ElementTree.iterparse('f.xml', events=('end',))
    for action, elem in context:
        print(action, elem.tag)

My output:

end program
end version
end creator
end filename
end filesize
end tcpflow
end fileobject
end filename
end filesize
end tcpflow
end fileobject
end filename
end filesize
Traceback (most recent call last):
  File "./iter.py", line 9, in <module>
    for action, elem in context:
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1281, in next
    self._root = self._parser.close()
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1654, in close
    self._raiseerror(v)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: no element found: line 20, column 0
Hesperidin answered 7/6, 2017 at 13:31 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.