Escaping strings for use in XML
Asked Answered
P

8

43

I'm using Python's xml.dom.minidom to create an XML document. (Logical structure -> XML string, not the other way around.)

How do I make it escape the strings I provide so they won't be able to mess up the XML?

Pharmacopsychosis answered 10/10, 2009 at 0:51 Comment(1)
Any XML DOM serialiser will escape character data appropriately as it goes out... that's what DOM manipulation is for, to prevent you having to get your hands dirty with the markup.Tartrate
S
15

Do you mean you do something like this:

from xml.dom.minidom import Text, Element

t = Text()
e = Element('p')

t.data = '<bar><a/><baz spam="eggs"> & blabla &entity;</>'
e.appendChild(t)

Then you will get nicely escaped XML string:

>>> e.toxml()
'<p>&lt;bar&gt;&lt;a/&gt;&lt;baz spam=&quot;eggs&quot;&gt; &amp; blabla &amp;entity;&lt;/&gt;</p>'
Schoolmarm answered 10/10, 2009 at 1:9 Comment(1)
How would you do it for a file? for example from xml.dom.minidom import parse, parseString dom1 = parse('Test-bla.ddf') (example from docs.python.org/3/library/xml.dom.minidom.html)Shadshadberry
T
88

Something like this?

>>> from xml.sax.saxutils import escape
>>> escape("< & >")   
'&lt; &amp; &gt;'
Test answered 10/10, 2009 at 1:5 Comment(5)
Just what I was looking for. Most of my XML handling is done using lxml and I wonder if importing (yet) another XML module would be too polluted? Is there an equivalent in lxml? (Can't seem to find one.)Orebro
This does not handle escaping of quotes.Zane
>>> from xml.sax.saxutils import quoteattr >>> quoteattr('value containing " a double-quote \' and an apostrophe') '"value containing &quot; a double-quote \' and an apostrophe"'Fanya
This will cause existing escaped characters to become malformed. For example, &amp;& becomes &amp;amp;&amp;Mucker
Re: "This will cause existing escaped characters to become malformed" - this is wrong. The existing escapes will not become malformed, but double-escaped. This is expected and correct behaviour: if your input contains both escaped and unescaped such characters, then either it's invalid input, or you want the escaped ones to be displayed verbatim, like in text "In HTML, & is encoded using &amp;", where the final "&amp" should be shown to user in this form. Double-escaping here is wanted.Immune
B
23

xml.sax.saxutils does not escape quotation characters (")

So here is another one:

def escape( str_xml: str ):
    str_xml = str_xml.replace("&", "&amp;")
    str_xml = str_xml.replace("<", "&lt;")
    str_xml = str_xml.replace(">", "&gt;")
    str_xml = str_xml.replace("\"", "&quot;")
    str_xml = str_xml.replace("'", "&apos;")
    return str_xml

if you look it up then xml.sax.saxutils only does string replace

Bluegrass answered 24/2, 2015 at 18:30 Comment(6)
Might want to also escape the single quotation character, ie. 'Hallway
Best to avoid using the keyword str as your variable name.Ulphi
You forgot str = str.replace("'", "&apos;").Ragucci
Also, an alternative to str = str.replace("\"", "&quot;") is str = str.replace('"', "&quot;"), which is more readable I think, as the backslash (\) just looks out of place to me.Ragucci
If you don't copy-paste from here, you should notice that the first replace is of the ampersand ("&"). If it's not the first statment, you will replace the ampersand of the other statments...Applicator
I think there is quoteattr function for that purpose. Note, this is different from escaping text that goes in between tags. You also need to escape double quote. Finally, I read that backslash does not need escaping. See https://mcmap.net/q/18036/-what-characters-do-i-need-to-escape-in-xml-documentsMetzgar
S
19

xml.sax.saxutils.escape only escapes &, <, and > by default, but it does provide an entities parameter to additionally escape other strings:

from xml.sax.saxutils import escape

def xmlescape(data):
    return escape(data, entities={
        "'": "&apos;",
        "\"": "&quot;"
    })

xml.sax.saxutils.escape uses str.replace() internally, so you can also skip the import and write your own function, as shown in MichealMoser's answer.

Stupendous answered 9/2, 2016 at 21:2 Comment(0)
S
15

Do you mean you do something like this:

from xml.dom.minidom import Text, Element

t = Text()
e = Element('p')

t.data = '<bar><a/><baz spam="eggs"> & blabla &entity;</>'
e.appendChild(t)

Then you will get nicely escaped XML string:

>>> e.toxml()
'<p>&lt;bar&gt;&lt;a/&gt;&lt;baz spam=&quot;eggs&quot;&gt; &amp; blabla &amp;entity;&lt;/&gt;</p>'
Schoolmarm answered 10/10, 2009 at 1:9 Comment(1)
How would you do it for a file? for example from xml.dom.minidom import parse, parseString dom1 = parse('Test-bla.ddf') (example from docs.python.org/3/library/xml.dom.minidom.html)Shadshadberry
A
11

The accepted answer from Andrey Vlasovskikh is the most complete answer to the OP. But this topic comes up for most frequent searches for python escape xml and I wanted to offer a time comparison of the three solutions discussed in this article along with offering a fourth option we choose to deploy due to the enhanced performance it offered.

All four rely on either native python data handling, or python standard library. The solutions are offered in order from the slowest to the fastest performance.

Option 1 - regex

This solution uses the python regex library. It yields the slowest performance:

import re
table = {
    "<": "&lt;",
    ">": "&gt;",
    "&": "&amp;",
    "'": "&apos;",
    '"': "&quot;",
}
pat = re.compile("({})".format("|".join(table)))

def xmlesc(txt):
    return pat.sub(lambda match: table[match.group(0)], txt)

>>> %timeit xmlesc('<&>"\'')
1.48 µs ± 1.73 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

FYI: µs is the symbol for microseconds, which is 1-millionth of second. The other implementation's completion times are measured in nanoseconds (ns) which is billionths of a second.

Option 2 -- xml.sax.saxutils

This solution uses python xml.sax.saxutils library.

from xml.sax.saxutils import escape
def xmlesc(txt):
    return escape(txt, entities={"'": "&apos;", '"': "&quot;"})

>>> %timeit xmlesc('<&>"\'')
832 ns ± 4.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Option 3 - str.replace

This solution uses the string replace() method. Under the hood, it implements similar logic as python's xml.sax.saxutils. The saxutils code has a for loop that costs some performance, making this version slightly faster.

def xmlesc(txt):
    txt = txt.replace("&", "&amp;")
    txt = txt.replace("<", "&lt;")
    txt = txt.replace(">", "&gt;")
    txt = txt.replace('"', "&quot;")
    txt = txt.replace("'", "&apos;")
    return txt

>>> %timeit xmlesc('<&>"\'')
503 ns ± 0.725 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Option 4 - str.translate

This is the fastest implementation. It uses the string translate() method.

table = str.maketrans({
    "<": "&lt;",
    ">": "&gt;",
    "&": "&amp;",
    "'": "&apos;",
    '"': "&quot;",
})
def xmlesc(txt):
    return txt.translate(table)

>>> %timeit xmlesc('<&>"\'')
352 ns ± 0.177 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Amylase answered 25/12, 2020 at 18:40 Comment(0)
W
4

If you don't want another project import and you already have cgi, you could use this:

>>> import cgi
>>> cgi.escape("< & >")
'&lt; &amp; &gt;'

Note however that with this code legibility suffers - you should probably put it in a function to better describe your intention: (and write unit tests for it while you are at it ;)

def xml_escape(s):
    return cgi.escape(s) # escapes "<", ">" and "&"
Walterwalters answered 14/1, 2014 at 7:13 Comment(2)
It's also worth noting that this API is now deprecatedInrush
Instead of this deprecated function you can use html.escape("< & >")Schizont
S
2
xml_special_chars = {
    "<": "&lt;",
    ">": "&gt;",
    "&": "&amp;",
    "'": "&apos;",
    '"': "&quot;",
}

xml_special_chars_re = re.compile("({})".format("|".join(xml_special_chars)))

def escape_xml_special_chars(unescaped):
    return xml_special_chars_re.sub(lambda match: xml_special_chars[match.group(0)], unescaped)

All the magic happens in re.sub(): argument repl accepts not only strings, but also functions.

Scarce answered 3/3, 2020 at 11:54 Comment(0)
J
2

You can use the builtin library available since Python 3.2

import html

clean_txt = html.escape(dirty_txt)

If the dirt_txt is already cleaned using an XSS library, you'll need to unescape first.

import html

clean_txt = html.escape(html.unescape(dirty_txt))
Jacintojack answered 3/5 at 12:26 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.