Random text from /dev/random raises an error in lxml: All strings must be XML compatible: Unicode or ASCII, no NULL bytes

To test my web app, I am pasting random characters from /dev/random into its web frontend. The last line below throws an error:

import html5lib
print repr(comment)
print html5lib.parse(comment, treebuilder="lxml")

'a\xef\xbf\xbd\xef\xbf\xbd\xc9\xb6E\xef\xbf\xbd\xef\xbf\xbd`\xef\xbf\xbd]\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd2 \x14\xef\xbf\xbd\xc7\xbe\xef\xbf\xbdy\xcb\x9c\xef\xbf\xbdi1O\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbdZ\xef\xbf\xbd.\xef\xbf\xbd\x17^C'

Unhandled Error
    Traceback (most recent call last):
      File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 893, in _inlineCallbacks
        result = g.send(result)
      File "/home/work/random/social/social/item.py", line 389, in _new
        convId, conv = yield plugin.create(request)
      File "/home/work/random/social/social/logging.py", line 47, in wrapper
        ret = func(*args, **kwargs)
      File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 1014, in unwindGenerator
        return _inlineCallbacks(None, f(*args, **kwargs), Deferred())
    --- <exception caught here> ---
      File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 893, in _inlineCallbacks
        result = g.send(result)
      File "/home/work/random/social/twisted/plugins/status.py", line 63, in create
        print html5lib.parse(comment, treebuilder="lxml")
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 38, in parse
        return p.parse(doc, encoding=encoding)
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 211, in parse
        parseMeta=parseMeta, useChardet=useChardet)
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 111, in _parse
        self.mainLoop()
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 174, in mainLoop
        self.phase.processCharacters(token)
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 572, in processCharacters
        self.parser.phase.processCharacters(token)
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 611, in processCharacters
        self.parser.phase.processCharacters(token)
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 652, in processCharacters
        self.parser.phase.processCharacters(token)
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 711, in processCharacters
        self.parser.phase.processCharacters(token)
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 804, in processCharacters
        self.parser.phase.processCharacters(token)
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 948, in processCharacters
        self.tree.insertText(token["data"])
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/treebuilders/_base.py", line 288, in insertText
        parent.insertText(data)
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/treebuilders/etree_lxml.py", line 225, in insertText
        builder.Element.insertText(self, data, insertBefore)
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/treebuilders/etree.py", line 114, in insertText
        self._element.text += data
      File "lxml.etree.pyx", line 821, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:33308)

      File "apihelpers.pxi", line 646, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:15287)

      File "apihelpers.pxi", line 1295, in lxml.etree._utf8 (src/lxml/lxml.etree.c:20212)

    exceptions.ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes

Before committing a user-entered string, I do this:

comment.decode('utf-8').encode('utf-8', "replace")

but this does not help in this case.

-- Abhi

Ragnar answered 12/8, 2011 at 8:26 Comment(1)
I had the same error; [this][1] solution fixed it for me. [1]: https://mcmap.net/q/590605/-how-to-solve-problem-with-parsing-html-file-with-cyrillic-symbol
Conjecture

The problem is that text in XML cannot include certain characters, mainly control characters with code points below 32 (other than tab, line feed and carriage return). The XML 1.0 Recommendation defines a Char as:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

/dev/random can produce bytes that don't match this production, e.g. control characters and some multi-byte characters. In the string shown in the question, \x14 and \x17 are two such control characters.

So you have to filter out these characters before doing any encoding or parsing.
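
As a rough illustration of that filtering step, here is a minimal sketch. It assumes Python 3 and text that has already been decoded to str; on the Python 2.6 setup from the question the same regular expression would have to be applied to a unicode object, and narrow builds may not accept the astral range. The helper name strip_invalid_xml_chars is made up for this example and is not part of lxml or html5lib. Note that a UTF-8 decode/encode round trip does not remove these characters, because control characters such as \x14 are perfectly valid UTF-8.

```python
import re

# Matches every character that is NOT allowed by the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Return text with all characters outside the XML 1.0 Char range removed.

    Hypothetical helper for illustration; expects decoded text (str).
    """
    return _INVALID_XML_CHARS.sub("", text)


# A shortened version of the byte string from the question.
raw = b"a\xef\xbf\xbd\xc9\xb6E 2 \x14\xc7\xbey\x17^C"

# Decode first (replacing undecodable bytes), then drop what XML forbids.
text = raw.decode("utf-8", "replace")
clean = strip_invalid_xml_chars(text)
```

The cleaned string can then be handed to html5lib.parse(clean, treebuilder="lxml"). Whether to drop the offending characters or substitute something like U+FFFD is a design choice; dropping is simply the shortest variant.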

Countermeasure answered 20/8, 2011 at 7:55 Comment(0)
