Python UTF-8 XML parsing (SUDS): Removing 'invalid token'

Here's a common error when dealing with UTF-8: 'invalid token'.

In my case, it comes from a SOAP service provider that has no respect for Unicode characters: it simply truncates values to 100 bytes, ignoring that the 100th byte may fall in the middle of a multi-byte character. For example:

<name xsi:type="xsd:string">浙江家庭教会五十人遭驱散及抓打 圣诞节聚会被断电及抢走物品(图、视频\xef\xbc</name>

The last two bytes are what remains of a 3-byte UTF-8 character, after the truncation knife assumed that the world uses 1-byte characters. Next stop, the SAX parser, and:

xml.sax._exceptions.SAXParseException: <unknown>:1:2392: not well-formed (invalid token)
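
The failure is easy to reproduce without the SOAP service. The following is a minimal sketch, assuming Python 3 (where the reply is a bytes object) and a made-up one-element document. `truncate_bytes` is a hypothetical stand-in for the provider's truncation, not its actual code.

```python
import xml.sax
from xml.sax.handler import ContentHandler

def truncate_bytes(text, limit):
    """Hypothetical stand-in for the provider: cut the UTF-8 bytes at `limit`."""
    return text.encode("utf-8")[:limit]

# Three 3-byte characters (9 bytes); cutting at 8 bytes splits the last one,
# leaving the dangling bytes \xef\xbc seen in the reply above.
value = truncate_bytes("视频）", 8)
document = b"<name>" + value + b"</name>"

try:
    xml.sax.parseString(document, ContentHandler())
    parse_error = None
except xml.sax.SAXParseException as exc:
    # The parser rejects the whole document because of the partial character.
    parse_error = exc
```

Only the last two bytes of the value are at fault; everything before them is well-formed UTF-8.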

I don't care about this character anymore. It should be removed from the document so that the SAX parser can function.

The XML reply is valid in every other respect except for these values.

Question: How do you remove this character without parsing the entire document and re-inventing UTF-8 encoding to check every byte?

Using: Python+SUDS

Fog answered 3/1, 2012 at 22:8 Comment(0)

It turns out that SUDS handles the XML as type 'string' (not unicode), so these are encoded values.

1) The FILTER:

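badXML = "your bad utf-8 xml here"  # type <str> (a Python 2 byte string)

# Turn it into a Python unicode string, ignoring errors so invalid bytes are dropped.
# The error handler is passed positionally because str.decode() only accepts
# keyword arguments from Python 2.7 onward.
decoded = badXML.decode('utf-8', 'ignore')  # type <unicode>

# Turn it back into a byte string, using UTF-8 encoding.
goodXML = decoded.encode('utf-8')  # type <str>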
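
On Python 3 the same filter applies to bytes rather than str. A minimal sketch, assuming Python 3; `strip_invalid_utf8` is an illustrative name, not part of SUDS or the standard library:

```python
def strip_invalid_utf8(data):
    """Drop byte sequences that are not valid UTF-8 and return clean UTF-8 bytes."""
    return data.decode("utf-8", "ignore").encode("utf-8")

# Two complete 3-byte characters followed by the two dangling bytes \xef\xbc.
bad_xml = b"<name>\xe8\xa7\x86\xe9\xa2\x91\xef\xbc</name>"
good_xml = strip_invalid_utf8(bad_xml)
```

The round trip leaves valid characters and the markup untouched; only the partial character is dropped.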

2) SUDS: see https://fedorahosted.org/suds/wiki/Documentation#MessagePlugin

from suds.plugin import MessagePlugin

class UnicodeFilter(MessagePlugin):
    def received(self, context):
        # context.reply holds the raw reply as a byte string; drop invalid UTF-8
        decoded = context.reply.decode('utf-8', 'ignore')
        reencoded = decoded.encode('utf-8')
        context.reply = reencoded

and

from suds.client import Client
client = Client(WSDL_url, plugins=[UnicodeFilter()])
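
To check the filter's logic without a live service, something like the following can be used. This is a sketch under assumptions: it targets Python 3, `FakeContext` is a hypothetical stand-in for the context object SUDS passes to `received`, and the class deliberately does not subclass `MessagePlugin` so that it runs without SUDS installed.

```python
import xml.dom.minidom

class FakeContext:
    """Hypothetical stand-in for the SUDS reply context; only `reply` is modeled."""
    def __init__(self, reply):
        self.reply = reply

class UnicodeFilterSketch:
    """Same body as UnicodeFilter, minus the MessagePlugin base class."""
    def received(self, context):
        context.reply = context.reply.decode("utf-8", "ignore").encode("utf-8")

context = FakeContext(b"<name>\xe8\xa7\x86\xe9\xa2\x91\xef\xbc</name>")
UnicodeFilterSketch().received(context)

# The cleaned reply now parses; the truncated character is simply gone.
parsed = xml.dom.minidom.parseString(context.reply)
name_text = parsed.documentElement.firstChild.data
```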

Hope this helps someone.


Note: Thanks to John Machin!

See: Why is python decode replacing more than the invalid bytes from an encoded string?

Python issue8271, regarding errors='ignore', can get in your way here. In Python versions without the fix for this bug, 'ignore' consumes the next few bytes as well, to satisfy the length announced by the start byte.

"during the decoding of an invalid UTF-8 byte sequence, only the start byte and the continuation byte(s) are now considered invalid, instead of the number of bytes specified by the start byte"

Issue was fixed in:
Python 2.6.6 rc1
Python 2.7.1 rc1 (and all future releases of 2.7)
Python 3.1.3 rc1 (and all future releases of 3.x)

Python 2.5 and below will contain this issue.

In the example above, "\xef\xbc</name".decode('utf-8', 'ignore') should return "</name", but in 'bugged' versions of Python it returns "/name".

The first four bits of 0xef (0xe) announce a 3-byte UTF-8 character, so the bytes 0xef, 0xbc, and then (erroneously) 0x3c ('<') are consumed.

0x3c is not a valid continuation byte, which is what makes the 3-byte sequence invalid in the first place.

Fixed versions of Python remove only the start byte and any valid continuation bytes, leaving 0x3c unconsumed.
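
The byte-level reasoning above can be sketched as follows. `sequence_length` and `is_continuation` are illustrative helpers, not Python's actual decoder, and the last line assumes a Python 3 interpreter, which behaves like the fixed versions.

```python
def sequence_length(start_byte):
    """Length of the UTF-8 sequence announced by a start byte (0 if not a valid start)."""
    if start_byte < 0x80:
        return 1
    if 0xC2 <= start_byte <= 0xDF:
        return 2
    if 0xE0 <= start_byte <= 0xEF:
        return 3
    if 0xF0 <= start_byte <= 0xF4:
        return 4
    return 0

def is_continuation(byte):
    """Continuation bytes have the bit pattern 10xxxxxx."""
    return 0x80 <= byte <= 0xBF

data = b"\xef\xbc</name"
announced = sequence_length(data[0])                       # 0xef announces 3 bytes
checks = [is_continuation(b) for b in data[1:announced]]   # test 0xbc, then 0x3c
survivor = data.decode("utf-8", "ignore")                  # fixed decoders keep '<'
```

A decoder that trusts `announced` alone skips three bytes and swallows the '<'; one that also checks each continuation byte stops at 0x3c.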

Fog answered 3/1, 2012 at 22:18 Comment(2)
Self Learner badge earned... (that was the point, really!) thank you. - Fog
Re which Python version fixed the problem: See bugs.python.org/issue8271 ... The problem was that the code decided the length of the dud sequence by a lookup table based solely on the first byte of the sequence. There was a patch that fixed the symptom of ignoring too many bytes when recovering (i.e. your problem) and most cases of the less important symptom of not generating the correct number of U+FFFD in 'replace' mode (not your problem). AFAIK, your problem is fixed in 2.6.6+ & 3.1.3+ and all releases of 2.7, 3.2 and 3.3. - Adamina

@FlipMcF's is the correct answer - I'm just posting my filter for his solution, because the original one didn't work out for me (I had some emoji characters in my XML, which were correctly encoded in UTF-8, but they still crashed XML parsers):

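from StringIO import StringIO  # Python 2 module

from lxml import etree
from suds.plugin import MessagePlugin

class UnicodeFilter(MessagePlugin):
    def received(self, context):
        # recover=True is important here: lxml skips malformed input instead of raising
        parser = etree.XMLParser(recover=True)
        doc = etree.parse(StringIO(context.reply), parser)
        # Serialize the recovered tree back into the reply for SUDS to parse
        context.reply = etree.tostring(doc)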
Tsang answered 23/1, 2018 at 18:18 Comment(0)
