So I realize this is an older question with a highly up-voted and accepted answer, but if you are reading LARGE files and find yourself in the same predicament I did, I hope this helps you out.
The issue with the accepted approach is, in fact, the iteration. Regardless of how fast the parser is, doing anything, say... a few hundred thousand times, is gonna eat your execution time. With that said, it came down to really thinking about the problem and understanding how namespaces work (or are "intended to work", because honestly they are often not needed). Now, if your xml truly uses namespaces, meaning you see prefixed tags that look like this: <xs:table>, then you'll need to tweak the approach here for your use case (see the sketch just below). I'll include the full, safe way of handling things as well.
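To see why that matters, here's a minimal sketch (the tag names and uri are made up) of what xpath looks like when namespaces are truly in play — every query needs an explicit prefix-to-uri map, which is exactly the busywork stripping them avoids:

from lxml import etree

# Hypothetical document that really uses a namespace prefix on its tags
doc = etree.fromstring(
    '<xs:root xmlns:xs="http://example.com/ns">'
    '<xs:table>data</xs:table>'
    '</xs:root>'
)
# With namespaces alive, every xpath needs an explicit prefix->uri map...
tables = doc.xpath('//x:table', namespaces={'x': 'http://example.com/ns'})
print(tables[0].text)  # data
# ...whereas once they're stripped, a plain doc.xpath('//table') would do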
DISCLAIMER: I cannot, in good conscience, tell you to use regular expressions when parsing html/xml. Go look at SergiyKolesnikov's answer, as it WORKS, but I had an edge case, so with that said... let's dive into some regex!
Problem: namespace stripping takes forever... and most of the time the namespace declarations only live inside the very first opening tag, our "root". So, thinking about how python reads the file in, and given that our only problem child is that root node, why not use that to our advantage? For example:
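A made-up file of the shape I mean looks like this — every declaration piles up on the root's opening tag, and the children underneath are clean:

<catalog xmlns="http://example.com/default"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <table>...</table>
  <table>...</table>
</catalog>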
Please NOTE: the file I'm using as my example comes as a raw, horrid, remarkably senseless structure of lulz with the promise of data in there somewhere. my_file is the path to the file I'm using for our example; I cannot share it with you for professional reasons, and it has been cut down in size just to get through this answer.
import os, subprocess, re, io, json
from lxml import etree

# Your file would be '_biggest_file' if playing along at home
my_file = _biggest_file

meta_stuff = dict(
    exists = os.path.exists(my_file),
    sizeof = os.path.getsize(my_file),
    extension_is_a_real_thing = any(re.findall(r"\.(html|xml)$", my_file, re.I)),
    system_thinks_its_a = subprocess.check_output(
        ["file", "-i", my_file]
    ).decode().split(":")[-1].strip()
)
print(json.dumps(meta_stuff, indent = 2))
So for starters: decently sized, and the system thinks it's html at best; the file extension is neither xml nor html either...
{
  "exists": true,
  "sizeof": 24442371,
  "extension_is_a_real_thing": false,
  "system_thinks_its_a": "text/html; charset=us-ascii"
}
Approach:
- In order to parse an xml file, it should at the very least be xml, so we'll need to check for an xml declaration and add one if it doesn't exist
- If I have namespaces, that's bad, because I can't use plain xpaths, which is what I want to do
- If my file is huge, I should only operate on the smallest imaginable parts that I need to clean before I'm ready to parse it
Function
def speed_read(file_path):
    # We're gonna be low-brow and add our own declaration using this string. It's fine
    _xml_dec = '<?xml version="1.0" encoding="utf-8"?>'
    # Even worse... regex for xml, here we go
    #
    # We'll need to extract the very first node that we find in our document,
    # because for our purposes that's the one we know holds the namespace uri's,
    # ie: the "attributes"
    # FiRsT node : <actual_name xmlns:xsi="idontactuallydoanything.com">
    # We're going to pluck out that first node and get the tag's actual name,
    # which means from:
    # <actual_name xmlns:xsi="idontactuallydoanything.com">...</actual_name>
    # We pluck:
    # actual_name
    # Then we're gonna replace the entire tag with one we make from that name
    # by simple string substitution
    #
    # -> 'starting from the beginning, capture everything between the < and the >'
    _first_node = re.compile(r'^(<.*?>)', re.I | re.M | re.U)
    # -> 'starting from the beginning, but don't you get me the <; grab everything
    #    up to the first white-space or the closing >, neither of which I want either, man'
    _first_tagname = re.compile(r'(?<=^<)[^\s>]+', re.I | re.M | re.U)
    # open the file context
    with open(file_path, "r", encoding = "utf-8") as f:
        # go ahead and strip leading and trailing whitespace, cause why not...
        # plus it adds safety for our regexes
        _raw = f.read().strip()
        # Now, if the file somehow happens to magically have the xml declaration,
        # we wanna go ahead and remove it, as we plan to add our own. For
        # efficiency, startswith only ever looks at the first few characters
        if _raw.startswith('<?xml'):
            _raw = re.sub(r'<\?xml.*?\?>\n?', '', _raw).strip()
        # Here we grab that first node that has those meaningless namespaces
        root_element = _first_node.search(_raw).group()
        # here we get its name
        first_tag = _first_tagname.search(root_element).group()
        # Here we substitute the entire element with a new one that only
        # contains the element's name. Note: plain str.replace, not re.sub,
        # because the captured tag is full of regex metacharacters (dots and
        # question marks in the uri's), and we only want the first occurrence
        _raw = _raw.replace(root_element, '<{}>'.format(first_tag), 1)
        # Now we add our declaration tag in the worst way you have ever
        # seen, but I miss sprintf, so this is how I'm rolling
        _raw = "{}{}".format(_xml_dec, _raw)
        # Hand lxml an in-memory byte stream, since we already did our string surgery
        return etree.parse(io.BytesIO(_raw.encode("utf-8")))
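Using it is then just vanilla xpath with no namespace map (a quick sketch; //table is a hypothetical tag name, swap in whatever your document actually contains):

root = speed_read(my_file)
# Namespaces are gone, so plain paths just work
print(root.getroot().tag)
print(len(root.xpath("//table")))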
# a good answer from above (SergiyKolesnikov's approach):
def safe_read(file_path):
    root = etree.parse(file_path)
    # getiterator() is deprecated in lxml; iter() walks every element the same way
    for elem in root.iter():
        elem.tag = etree.QName(elem).localname
    # Remove unused namespace declarations
    etree.cleanup_namespaces(root)
    return root
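And if you want to convince yourself the string surgery didn't mangle anything, here's a quick sanity check (a sketch; it assumes your document only declares namespaces on the root without prefixed children, which is the case this answer targets):

fast_tree = speed_read(my_file)
safe_tree = safe_read(my_file)
# Same root name and the same node count means nothing was lost
assert fast_tree.getroot().tag == safe_tree.getroot().tag
assert sum(1 for _ in fast_tree.iter()) == sum(1 for _ in safe_tree.iter())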
Benchmarking - yes, I know there are better ways to do this.

import time
import pandas as pd

safe_times = []
for i in range(5):
    s = time.time()
    safe_read(_biggest_file)
    safe_times.append(time.time() - s)

fast_times = []
for i in range(5):
    s = time.time()
    speed_read(_biggest_file)
    fast_times.append(time.time() - s)

pd.DataFrame({"safe": safe_times, "fast": fast_times})
Results
| safe | fast |
|------|------|
| 2.36 | 0.61 |
| 2.15 | 0.58 |
| 2.47 | 0.49 |
| 2.94 | 0.60 |
| 2.83 | 0.53 |