Here's a common error when dealing with UTF-8: "invalid token".
In my case it comes from a SOAP service provider that has no respect for Unicode: it simply truncates values to 100 bytes, ignoring that the 100th byte may fall in the middle of a multi-byte character. For example:
<name xsi:type="xsd:string">浙江家庭教会五十人遭驱散及抓打 圣诞节聚会被断电及抢走物品(图、视频\xef\xbc</name>
The last two bytes are what remains of a 3-byte UTF-8 sequence after the truncation knife assumed that the world uses 1-byte characters. Next stop, the SAX parser, and:
xml.sax._exceptions.SAXParseException: <unknown>:1:2392: not well-formed (invalid token)
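The failure can be reproduced in isolation. The following is a minimal sketch, assuming Python 3 and the standard-library `xml.sax`; `truncate_bytes` and `parses_cleanly` are hypothetical helpers that only imitate the provider's behaviour and are not part of SUDS or of the real service.

```python
import xml.sax


def truncate_bytes(text, limit):
    """Hypothetical stand-in for the provider: cut the UTF-8 bytes at a
    fixed length, with no regard for character boundaries."""
    return text.encode("utf-8")[:limit]


def parses_cleanly(document):
    """Return True if the SAX parser accepts the document, False otherwise."""
    try:
        xml.sax.parseString(document, xml.sax.ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False


# Three characters of 3 bytes each (9 bytes); cutting at 8 splits the last one.
raw = truncate_bytes("视频）", 8)
document = b"<name>" + raw + b"</name>"

print(raw[-2:])                  # the dangling lead bytes of the split character
print(parses_cleanly(document))  # whether the parser accepts the result
```

The value itself is short here, but the mechanism is the same as with the 100-byte limit: the cut lands inside a multi-byte sequence and leaves an incomplete character in front of the closing tag.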
I don't care about this character anymore. It should be removed from the document so that the SAX parser can function.
Apart from these truncated values, the XML reply is valid in every respect.
Question: How do you remove these stray bytes without parsing the entire document and reinventing UTF-8 decoding to check every byte?
Using: Python+SUDS
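For reference, one candidate cleanup is a decode/encode round trip that lets Python's built-in codec discard whatever it cannot decode. This is only a sketch under assumptions: it assumes Python 3, that the reply is available as raw bytes before it reaches the parser, and that the document really is UTF-8. The name `drop_invalid_utf8` is hypothetical, and how to hook such a step into SUDS is not shown here.

```python
def drop_invalid_utf8(raw):
    """Discard byte sequences that are not valid UTF-8.

    Assumption: `raw` is a UTF-8 document as bytes. Any undecodable bytes
    are dropped silently, wherever they occur in the document.
    """
    return raw.decode("utf-8", errors="ignore").encode("utf-8")


damaged = b"<name>\xe8\xa7\x86\xef\xbc</name>"  # one full character, then a split one
cleaned = drop_invalid_utf8(damaged)
print(cleaned.decode("utf-8"))
```

This still walks every byte, but the codec does the checking rather than hand-written code. The trade-off is that it is indiscriminate: it removes every invalid sequence, not only the ones produced by the truncation.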