Using Python ElementTree's iterparse function and writing the modified tree to an output file

I need to parse a very large (~40GB) XML file, remove certain elements from it, and write the result to a new XML file. I've been trying to use iterparse from Python's ElementTree, but I'm confused about how to modify the tree and then write the resulting tree into a new XML file. I've read the documentation on iterparse but it hasn't cleared things up. Is there a simple way to do this?

Thank you!

EDIT: Here's what I have so far.

import xml.etree.ElementTree as ET
import re 

date_pages = []
f=open('dates_texts.xml', 'w+')

tree = ET.iterparse("sample.xml")

for event, element in tree:  # with default arguments, only 'end' events are reported
    if element.tag == 'page':
        for page_element in element:
            if page_element.tag == 'revision':
                for revision_element in page_element:
                    if revision_element.tag == 'text':
                        if not re.search(r'20\d\d', revision_element.text or ''):
                            element.clear()
Lanti answered 14/3, 2013 at 2:4 Comment(5)
Could you show the code from your attempt (even if it's incomplete)? Helping you fix it instead of writing something from scratch would save time.Beverage
Added the code to my question, above.Lanti
I spotted that earlier. Sorry, I've been busy with other stuff, but I promise I'll take a look soon. In the meantime, I've brought up your question on chat to bring it some more attention.Beverage
How do you know that it doesn't work? Do you get an exception? A good idea is to use a small xml file instead of your 40GB to see if the behaviour is correct, before trying the big file.Heptangular
It's not that the current behavior doesn't work. What I have right now is fine, but it's only a parser. I need a way to write the modified xml back out.Lanti

If you have a large XML file that doesn't fit in memory, you could serialize it one element at a time. For example, assuming a <root><page/><page/><page/>...</root> document structure and ignoring possible namespace issues:

import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9

def getelements(filename_or_file, tag):
    context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context) # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear() # free memory

with open('output.xml', 'wb') as file:
    # start root
    file.write(b'<root>')

    for page in getelements('sample.xml', 'page'):
        if keep(page):
            file.write(etree.tostring(page, encoding='utf-8'))

    # close root
    file.write(b'</root>')

where keep(page) returns True if the page should be kept, e.g.:

import re

def keep(page):
    # every <revision> element must contain a year of the form 20xx
    return all(re.search(r'20\d\d', rev.text or '')  # treat missing text as ''
               for rev in page.iterfind('revision'))
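
As suggested in the comments on the question, it may help to try the approach on a tiny document before running it on the 40GB file. The following is only a sketch: SAMPLE and filter_sample are made-up names, and an in-memory buffer stands in for sample.xml and output.xml.

```python
import io
import re
import xml.etree.ElementTree as etree

def getelements(source, tag):
    context = iter(etree.iterparse(source, events=('start', 'end')))
    _, root = next(context)  # the first event is the 'start' of the root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear()  # drop already processed children to free memory

def keep(page):
    # every <revision> element must contain a year of the form 20xx
    return all(re.search(r'20\d\d', rev.text or '')
               for rev in page.iterfind('revision'))

# Hypothetical sample data: only the first page mentions a 20xx year.
SAMPLE = (b'<root>'
          b'<page><revision>edited 2012</revision></page>'
          b'<page><revision>no year</revision></page>'
          b'<page><revision/></page>'
          b'</root>')

def filter_sample():
    out = io.BytesIO()
    out.write(b'<root>')
    for page in getelements(io.BytesIO(SAMPLE), 'page'):
        if keep(page):
            out.write(etree.tostring(page, encoding='utf-8'))
    out.write(b'</root>')
    return out.getvalue()

print(filter_sample())
```

Only the first page should survive, because the other two have no 20xx year in their revision text.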

For comparison, to modify a small xml file, you could:

# parse small xml
tree = etree.parse('sample.xml')

# remove some root/page elements from xml
root = tree.getroot()
for page in root.findall('page'):
    if not keep(page):
        root.remove(page) # modify in place

# write the modified xml tree to a file
tree.write('output.xml', encoding='utf-8')
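
The streaming version hardcodes <root>. If the root's attributes matter, one possible approach is to write the start tag from the root element before the first root.clear() call, since clear() also drops the attributes. This is only a sketch: start_tag and copy_filtered are hypothetical helper names, and it assumes a document without namespaces.

```python
import io
import xml.etree.ElementTree as etree
from xml.sax.saxutils import quoteattr

def start_tag(elem):
    # Rebuild the opening tag by hand; assumes no namespaces are in use.
    attrs = ''.join(' %s=%s' % (name, quoteattr(value))
                    for name, value in elem.attrib.items())
    return ('<%s%s>' % (elem.tag, attrs)).encode('utf-8')

def copy_filtered(source, out, tag, keep):
    context = iter(etree.iterparse(source, events=('start', 'end')))
    _, root = next(context)
    out.write(start_tag(root))  # attributes are available at the 'start' event
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            if keep(elem):
                out.write(etree.tostring(elem, encoding='utf-8'))
            root.clear()  # frees memory, and also clears root.attrib
    out.write(('</%s>' % root.tag).encode('utf-8'))

# Illustrative input; the attribute and page contents are made up.
source = io.BytesIO(b'<root version="1"><page>2013</page><page>old</page></root>')
out = io.BytesIO()
copy_filtered(source, out, 'page', lambda page: '20' in (page.text or ''))
print(out.getvalue())
```

Namespace declarations are not reproduced by this sketch, because ElementTree expands prefixes into {uri}tag form while parsing.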
Aikido answered 17/3, 2013 at 3:59 Comment(9)
Is there a way to get the library to print out <root> and </root> for you, preserving attributes and, e.g., namespace declarations in the start tag while not keeping the root element in memory?Allegorist
@binki: do you see root variable in getelements()? What do you think it refers to?Aikido
Just why do you have file.write(b'<root>') then?Allegorist
@binki: for simplicity (I assume 40GB xml files, do not contain anything useful in <root>).Aikido
@gardenhead what does the comment say?Aikido
@J.F.Sebastian Yes, but why do you need to clear the root on every iteration? Shouldn't it just be once, after the whole operation is complete?Brade
@Brade the xml document is assumed to be larger than the available memoryAikido
@Aikido - nice solution, it works a treat. One bit I don't understand (and have checked the docs, but nothing) is _, root = next(context). Could you explain how that works please? I gather that it's setting each 'context'/element as the root for each iteration (is that correct?), but I don't understand what the underscore does.Demulcent
@rong _ is just a throwaway name. a, b = [1, 2] binds the name a to the int object 1 and the name b to 2.Aikido

Perhaps the answer to my similar question can help you out.

As for how to write this back to an .xml file, I ended up doing this at the bottom of my script:

with open('File.xml', 'wb') as t: # I'd suggest using a different file name here than your original
    t.write(ET.tostring(doc)) # write the serialized document in one call; the with block closes the file
print('File.xml Complete') # Console message that file wrote successfully, can be omitted

The variable doc comes from earlier in my script. Where you have tree = ET.iterparse("sample.xml"), I have this:

doc = ET.parse(filename)

I've been using lxml instead of ElementTree but I think the write out part should still work (I think it's mainly just xpath stuff that ElementTree can't handle.) I'm using lxml imported with this line:

from lxml import etree as ET
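
For readers on the standard library rather than lxml, note that xml.etree.ElementTree.tostring() expects an Element, not the ElementTree object returned by parse(). A minimal sketch of the write-out step under that assumption follows; the in-memory sample and the 'keep'/'drop' texts are invented for illustration, and a buffer stands in for File.xml.

```python
import io
import xml.etree.ElementTree as ET

# Illustrative document; in practice this would be ET.parse(filename).
doc = ET.parse(io.BytesIO(b'<root><page>keep</page><page>drop</page></root>'))
root = doc.getroot()
for page in root.findall('page'):  # findall returns a list, so removal is safe
    if page.text == 'drop':
        root.remove(page)

# Option 1: let the tree write itself (a file name works here as well).
buffer = io.BytesIO()
doc.write(buffer, encoding='utf-8', xml_declaration=True)
written = buffer.getvalue()

# Option 2: serialize the root Element to bytes in one call.
serialized = ET.tostring(root, encoding='utf-8')
```

Either option requires the whole tree to fit in memory, so neither replaces the streaming approach for the 40GB case.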

Hopefully this (along with my linked question for some additional code context if you need it) can help you out!

Hartle answered 17/3, 2013 at 2:8 Comment(1)
To write tree = ET.parse(source) to a file after you've modified it, you could use tree.write('File.xml'). Note: looping over ET.tostring(doc) writes one character at a time; if you want to use ET.tostring(), you could write it all at once with t.write(ET.tostring(doc)). The with statement closes the file automatically, so you don't need t.close() inside it. See the examples in my answer on how to write both large and small xml files.Aikido
