Python sys.stdin throws a UnicodeDecodeError

I'm trying to write a (very) basic web crawler using cURL and Python's BeautifulSoup library (since this is much easier to understand than GNU awk and a mess of regular expressions).

Currently, I'm trying to pipe the contents of a webpage to the program with cURL (i.e., curl http://www.example.com/ | ./parse-html.py).

For some reason, Python throws a UnicodeDecodeError because of an invalid start byte (I've looked at this answer and this answer about invalid start bytes, but did not figure out how to solve the issue from them).

Specifically, I've tried to use a.encode('utf-8').split() from the first answer. The second answer simply explained the issue (that Python found an invalid start byte), though it didn't give a solution.

I've attempted redirecting the output of cURL to a file (i.e., curl http://www.example.com/ > foobar.html) and modifying the program to accept a file as a command-line argument, though this causes the same UnicodeDecodeError.

I've checked, and the output of locale charmap is UTF-8, which, as far as I know, means that my system is encoding characters in UTF-8 (which makes me especially confused about this UnicodeDecodeError).

At the moment, the exact line causing the error is html_doc = sys.stdin.readlines().encode('utf-8').strip(). I've tried rewriting this as a for-loop, though I get the same issue.

What exactly is causing the UnicodeDecodeError and how should I fix the issue?

EDIT: Changing the line html_doc = sys.stdin.readlines().encode('utf-8').strip() to html_doc = sys.stdin fixes the issue.
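
For reference, the relevant part of parse-html.py now boils down to something like this (a simplified sketch, not my exact script):

#!/usr/bin/env python3
# Simplified sketch: hand the stream straight to BeautifulSoup and let it
# handle the markup, instead of decoding/encoding manually.
import sys
from bs4 import BeautifulSoup

html_doc = sys.stdin
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title)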

Friction answered 20/1, 2016 at 2:19

The problem is during reading, not encoding; the input resource is simply not encoded in UTF-8, but in some other encoding. In a UTF-8 shell, you can easily reproduce the problem with

$ echo 2¥ | iconv -t iso8859-1 | python3 -c 'import sys;sys.stdin.readline()'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 1: invalid start byte

You can read the input as binary (sys.stdin.buffer.read(), or with open(..., 'rb') as f: f.read()), which gives you a bytes object; you can then examine it and guess the encoding. The actual algorithm for doing that is documented in the HTML standard.
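
A minimal sketch of that idea - reading from stdin, checking for a BOM, and then trying a couple of candidate encodings (not the full algorithm from the HTML standard):

import sys

raw = sys.stdin.buffer.read()  # bytes, bypassing the text layer entirely

if raw.startswith(b'\xef\xbb\xbf'):
    text = raw.decode('utf-8-sig')      # UTF-8 with BOM
elif raw.startswith((b'\xff\xfe', b'\xfe\xff')):
    text = raw.decode('utf-16')         # UTF-16, BOM determines endianness
else:
    for candidate in ('utf-8', 'iso-8859-1'):
        try:
            text = raw.decode(candidate)
            break
        except UnicodeDecodeError:
            continue                    # iso-8859-1 never fails, so text is always set

print(text[:200])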

However, in many cases the encoding is not specified in the file itself, but via the HTTP Content-Type header. Unfortunately, your invocation of curl does not capture this header. Instead of combining curl and Python, you can simply use Python on its own - it can already download URLs. Stealing the encoding detection algorithm from youtube-dl, we get something like:

import re
import urllib.request


def guess_encoding(content_type, webpage_bytes):
    # 1. An explicit charset in the Content-Type header,
    #    e.g. "text/html; charset=iso-8859-1".
    m = re.match(
        r'[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+\s*;\s*charset="?([a-zA-Z0-9_-]+)"?',
        content_type)
    if m:
        encoding = m.group(1)
    else:
        # 2. A <meta ... charset=...> declaration near the start of the document.
        m = re.search(br'<meta[^>]+charset=[\'"]?([a-zA-Z0-9_-]+)[ /\'">]',
                      webpage_bytes[:1024])
        if m:
            encoding = m.group(1).decode('ascii')
        # 3. A UTF-16 (little-endian) byte order mark.
        elif webpage_bytes.startswith(b'\xff\xfe'):
            encoding = 'utf-16'
        # 4. Otherwise, assume UTF-8.
        else:
            encoding = 'utf-8'

    return encoding


def download_html(url):
    with urllib.request.urlopen(url) as urlh:
        content = urlh.read()  # raw bytes
        encoding = guess_encoding(urlh.getheader('Content-Type'), content)
        return content.decode(encoding)


print(download_html('https://phihag.de/2016/iso8859.php'))

There are also some libraries (though not in the standard library) which support this out of the box, namely requests.
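
For example, with requests the decoding is handled for you based on the Content-Type header; roughly:

import requests

r = requests.get('https://phihag.de/2016/iso8859.php')
if 'charset' not in r.headers.get('Content-Type', ''):
    # No charset declared by the server: fall back to requests' detector,
    # which guesses from the raw body.
    r.encoding = r.apparent_encoding
print(r.text)  # decoded str, using r.encoding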

I also recommend that you read up on the basics of what encodings are.

Zucchetto answered 20/1, 2016 at 2:42
