Python XML Parsing without root
Asked Answered
D

3

11

I wanted to parse a fairly huge xml-like file which doesn't have any root element. The format of the file is:

<tag1>
<tag2>
</tag2>
</tag1>

<tag1>
<tag3/>
</tag1>

What I tried:

  1. tried using ElementTree but it returned a "no root" error. (Is there any other python library which can be used for parsing this file?)
  2. tried adding an extra tag to wrap the entire file and then parse it using Element-Tree. However, I would like to use some more efficient method, in which I would not need to alter the original xml file.
Dominions answered 27/5, 2014 at 13:50 Comment(3)
How large is the file?Prostitution
It contains over 3 million useful terms (apart from the tags and other unnecessary data)Dominions
Approximate file size? Are you looking for time efficiency or memory efficiency? Can the whole file be read into memory?Prostitution
P
10

lxml.html can parse fragments:

from lxml import html
s = """<tag1>
 <tag2>
 </tag2>
</tag1>

<tag1>
 <tag3/>
</tag1>"""
doc = html.fromstring(s)
for thing in doc:
    print thing
    for other in thing:
        print other
"""
>>> 
<Element tag1 at 0x3411a80>
<Element tag2 at 0x3428990>
<Element tag1 at 0x3428930>
<Element tag3 at 0x3411a80>
>>>
"""

Courtesy this SO answer

And if there is more than one level of nesting:

def flatten(nested):
    """recusively flatten nested elements

    yields individual elements
    """
    for thing in nested:
        yield thing
        for other in flatten(thing):
            yield other
doc = html.fromstring(s)
for thing in flatten(doc):
    print thing

Similarly, lxml.etree.HTML will parse this. It adds html and body tags:

d = etree.HTML(s)
for thing in d.iter():
    print thing

""" 
<Element html at 0x3233198>
<Element body at 0x322fcb0>
<Element tag1 at 0x3233260>
<Element tag2 at 0x32332b0>
<Element tag1 at 0x322fcb0>
<Element tag3 at 0x3233148>
"""
Prostitution answered 27/5, 2014 at 14:14 Comment(0)
M
10

ElementTree.fromstringlist accepts an iterable (that yields strings).

Using it with itertools.chain:

import itertools
import xml.etree.ElementTree as ET
# import xml.etree.cElementTree as ET

with open('xml-like-file.xml') as f:
    it = itertools.chain('<root>', f, '</root>')
    root = ET.fromstringlist(it)

# Do something with `root`
root.find('.//tag3')
Meeks answered 27/5, 2014 at 14:16 Comment(4)
I guess this would not be an efficient method in case of large files. Also as I said earlier, I would like to implement this using a different method rather than adding tags to the input.Dominions
@sgp, I should write f instead of f.read(). updated.; At least this does not read the whole content at once.Meeks
Don't you think that perhaps using a different library would be better? I mean again, you are just adding extra tags to the xml, right? Could you explain me why do you think your method is efficient? Thanks :)Dominions
@sgp, Because this does not load the whole contents at once as I said in the previous comment. I didn't benchmark the solutions; I can't tell that which one perform better. (BTW, try cElementTree instead of ElementTree)Meeks
A
8

How about instead of editing the file do something like this

import xml.etree.ElementTree as ET

with file("xml-file.xml") as f:
    xml_object = ET.fromstringlist(["<root>", f.read(), "</root>"])
Atchley answered 27/5, 2014 at 14:10 Comment(5)
In my question, I have said that I already tried this and I would like a better method than that. As the file is large enough, the method you suggested is not an efficient one.Dominions
@Dominions You said you edited the original file. That's not what this does. I would like to use some more efficient method, in which I would not need to alter the original xml file ... The original file is unchanged.Atchley
In practice, both are the same methods aren't they? You basically added the extra tag to the string. I would rather like to have a library that doesn't give a "no root error". This method is not a good one because as the file size is large, the process of adding a tag to the string would take some time, thereby causing the inefficiency.Dominions
@Dominions See my updated answer I know longer edit the string. This is more efficient than editing the feel. I only write to memory and not to disk.Atchley
@Dominions - adding the two tags in this and falsetru's solution is trivial. It adds them on-the-fly - there is no extra iteration over the file contents, it is not constructing a complete new string with the added tags.Prostitution

© 2022 - 2024 — McMap. All rights reserved.