Python sys.stdin throws a UnicodeDecodeError

I'm trying to write a (very) basic web crawler using cURL and Python's BeautifulSoup library (since this is much easier to understand than GNU awk and a mess of regular expressions).

Currently, I'm trying to pipe the contents of a webpage to the program with cURL (i.e., curl http://www.example.com/ | ./parse-html.py).

For some reason, Python throws a UnicodeDecodeError because of an invalid start byte (I've looked at this answer and this answer about invalid start bytes, but did not figure out how to solve the issue from them).

Specifically, I've tried to use a.encode('utf-8').split() from the first answer. The second answer simply explained the issue (that Python found an invalid start byte), though it didn't give a solution.

I've attempted redirecting the output of cURL to a file (i.e., curl http://www.example.com/ > foobar.html) and modifying the program to accept a file as a command-line argument, though this causes the same UnicodeDecodeError.

I've checked, and the output of locale charmap is UTF-8, which, as far as I know, means that my system is encoding characters in UTF-8 (which makes me especially confused about this UnicodeDecodeError).

At the moment, the exact line causing the error is html_doc = sys.stdin.readlines().encode('utf-8').strip(). I've tried rewriting this as a for-loop, though I get the same issue.

What exactly is causing the UnicodeDecodeError and how should I fix the issue?

EDIT: Changing the line html_doc = sys.stdin.readlines().encode('utf-8').strip() to html_doc = sys.stdin fixes the issue.

    $ echo 2¥ | iconv -t iso8859-1 | python3 -c 'import sys;sys.stdin.readline()'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/usr/lib/python3.5/codecs.py", line 321, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 1: invalid start byte
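The traceback above comes from sys.stdin decoding its input as UTF-8 before your code ever sees it. One way to sidestep that is to read the raw bytes from sys.stdin.buffer and decode them yourself. The following is only a minimal sketch: the helper names decode_with_fallback and read_stdin_text, and the choice of ISO-8859-1 as the fallback encoding, are assumptions made for illustration and may not suit every page.

```python
import sys


def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-1")):
    """Return raw decoded with the first encoding in the list that succeeds.

    Hypothetical helper: the encoding list is an assumption, not a universal rule.
    """
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Only reached if every listed encoding failed (ISO-8859-1 itself never
    # fails, since it maps every byte); undecodable bytes become U+FFFD.
    return raw.decode("utf-8", errors="replace")


def read_stdin_text(stream=None):
    """Read all bytes from the binary layer of stdin and decode them explicitly."""
    stream = sys.stdin.buffer if stream is None else stream
    return decode_with_fallback(stream.read())


# Possible use in a script fed by a pipe:
#     html_doc = read_stdin_text()
```

With this, the byte 0xa5 from the example no longer raises: UTF-8 decoding fails, and the ISO-8859-1 fallback turns it into ¥. Note that a fallback like this can silently pick the wrong encoding, so it is best treated as a last resort.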

    import re
    import urllib.request


    def guess_encoding(content_type, webpage_bytes):
        # First, look for a charset in the HTTP Content-Type header
        m = re.match(
            r'[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+\s*;\s*charset="?([a-zA-Z0-9_-]+)"?',
            content_type)
        if m:
            encoding = m.group(1)
        else:
            # Otherwise, look for a <meta ... charset=...> tag near the top of the page
            m = re.search(
                br'<meta[^>]+charset=[\'"]?([a-zA-Z0-9_-]+)[ /\'">]',
                webpage_bytes[:1024])
            if m:
                encoding = m.group(1).decode('ascii')
            elif webpage_bytes.startswith(b'\xff\xfe'):
                # UTF-16 little-endian byte order mark
                encoding = 'utf-16'
            else:
                encoding = 'utf-8'
        return encoding


    def download_html(url):
        with urllib.request.urlopen(url) as urlh:
            content = urlh.read()
            encoding = guess_encoding(urlh.getheader('Content-Type'), content)
        return content.decode(encoding)


    print(download_html('https://phihag.de/2016/iso8859.php'))
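When the page arrives through a pipe from cURL instead of urllib, there is no Content-Type header to inspect, so only the meta-tag part of the approach above applies. A rough sketch of that variant follows; decode_piped_html and META_CHARSET are hypothetical names introduced here, and falling back to UTF-8 with replacement characters is an assumption, not the only reasonable choice.

```python
import re
import sys

# Looks for charset=... inside a <meta> tag; same idea as the regex above,
# but case-insensitive and without the trailing-delimiter check.
META_CHARSET = re.compile(br'<meta[^>]+charset=[\'"]?([a-zA-Z0-9_-]+)', re.IGNORECASE)


def decode_piped_html(raw, default="utf-8"):
    """Decode HTML bytes using a charset declared in the first 1024 bytes, if any."""
    m = META_CHARSET.search(raw[:1024])
    encoding = m.group(1).decode("ascii") if m else default
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset declaration: decode leniently instead of crashing.
        return raw.decode(default, errors="replace")


# Possible use with: curl http://www.example.com/ | ./parse-html.py
#     html_doc = decode_piped_html(sys.stdin.buffer.read())
```

The key point is that the bytes are read from sys.stdin.buffer, so Python never attempts its default UTF-8 decoding on the piped data.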
