This answer provides XML sanitizer functions. Note that they do not escape the offending unescaped characters; they simply drop them.
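For contrast, escaping rather than dropping is only straightforward while the raw text is still available, before it is embedded in markup. A minimal standard-library sketch of that case; the sample string and variable names are made up for illustration:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical raw text that has not been placed into markup yet.
raw_text = "1 & 2 < 3"

# escape() replaces &, < and > with their entity references.
escaped = escape(raw_text)
element = f"<a>{escaped}</a>"

# The escaped document parses, and the original text round-trips.
parsed = ET.fromstring(element)
print(escaped)
print(parsed.text == raw_text)
```

Once the text is already mixed into broken markup, it is no longer obvious which characters were meant as markup, which is why the functions in this answer fall back to a recovering parser.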
Using bs4 with lxml
The question asked how to do this with Beautiful Soup. Here is a function that sanitizes a small XML bytes object with it. It was tested with the package requirements beautifulsoup4==4.8.0 and lxml==4.4.0. Note that lxml is required here by bs4.
import xml.etree.ElementTree

import bs4


def sanitize_xml(content: bytes) -> bytes:
    # Ref: https://stackoverflow.com/a/57450722/
    try:
        xml.etree.ElementTree.fromstring(content)
    except xml.etree.ElementTree.ParseError:
        return bs4.BeautifulSoup(content, features='lxml-xml').encode()
    return content  # already valid XML
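The validity check that guards the function can be tried on its own with only the standard library. This is a small sketch; the helper name is_well_formed and the sample inputs are hypothetical and not part of the original answer:

```python
import xml.etree.ElementTree as ET


def is_well_formed(content: bytes) -> bool:
    """Return True if the standard-library parser accepts the content."""
    try:
        ET.fromstring(content)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(b"<a>1 &amp; 2</a>"))  # escaped ampersand: True
print(is_well_formed(b"<a>1 & 2</a>"))      # bare ampersand: False
```

Only inputs for which this check fails are handed to the recovering parser; everything else is returned byte for byte.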
Using only lxml
There is not much point in using both bs4 and lxml when this can be done with lxml alone. This sanitizer function, tested with lxml==4.4.0, is essentially derived from the answer by jfs.
import lxml.etree


def sanitize_xml(content: bytes) -> bytes:
    # Ref: https://stackoverflow.com/a/57450722/
    try:
        lxml.etree.fromstring(content)
    except lxml.etree.XMLSyntaxError:
        root = lxml.etree.fromstring(content, parser=lxml.etree.XMLParser(recover=True))
        return lxml.etree.tostring(root)
    return content  # already valid XML
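A short usage sketch of the lxml-only function, repeated here so the snippet is self-contained. The sample inputs are made up, and the exact repaired bytes may differ between lxml/libxml2 versions, so no specific output is claimed:

```python
import lxml.etree


def sanitize_xml(content: bytes) -> bytes:
    try:
        lxml.etree.fromstring(content)
    except lxml.etree.XMLSyntaxError:
        # Fall back to a parser that recovers from errors where it can.
        root = lxml.etree.fromstring(content, parser=lxml.etree.XMLParser(recover=True))
        return lxml.etree.tostring(root)
    return content  # already valid XML


valid = b"<a>1 &amp; 2</a>"
broken = b"<a>1 & 2</a>"  # bare ampersand, not well-formed

unchanged = sanitize_xml(valid)   # valid input comes back as-is
repaired = sanitize_xml(broken)   # malformed input is re-serialized

# The repaired bytes should now parse with the strict default parser.
reparsed = lxml.etree.fromstring(repaired)
print(unchanged == valid)
print(repaired)
```

Because the recovering parser discards what it cannot interpret, it is worth comparing the repaired output with the input when data loss matters.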