I need to parse a very large (~40GB) XML file, remove certain elements from it, and write the result to a new xml file. I've been trying to use iterparse from python's ElementTree, but I'm confused about how to modify the tree and then write the resulting tree into a new XML file. I've read the documentation on itertree but it hasn't cleared things up. Are there any simple ways to do this?
Thank you!
EDIT: Here's what I have so far.
import xml.etree.ElementTree as ET
import re
date_pages = []
f=open('dates_texts.xml', 'w+')
tree = ET.iterparse("sample.xml")
for i, element in tree:
if element.tag == 'page':
for page_element in element:
if page_element.tag == 'revision':
for revision_element in page_element:
if revision_element.tag == '{text':
if len(re.findall('20\d\d', revision_element.text.encode('utf8'))) == 0:
element.clear()