I'm trying to write a (very) basic web crawler using cURL and Python's BeautifulSoup library (since this is much easier to understand than GNU awk and a mess of regular expressions).
Currently, I'm trying to pipe the contents of a webpage to the program with cURL (i.e., curl http://www.example.com/ | ./parse-html.py).
For some reason, Python throws a UnicodeDecodeError because of an invalid start byte (I've looked at this answer and this answer about invalid start bytes, but could not figure out how to solve the issue from them).
Specifically, I've tried to use a.encode('utf-8').split() from the first answer. The second answer simply explained the issue (that Python found an invalid start byte), though it didn't give a solution.
I've attempted redirecting the output of cURL to a file (i.e., curl http://www.example.com/ > foobar.html) and modifying the program to accept a file as a command-line argument, though this causes the same UnicodeDecodeError.
I've checked, and the output of locale charmap is UTF-8, which, as far as I know, means that my system is encoding characters in UTF-8 (which makes me especially confused about this UnicodeDecodeError).
At the moment, the exact line causing the error is html_doc = sys.stdin.readlines().encode('utf-8').strip(). I've tried rewriting this as a for-loop, though I get the same issue.
What exactly is causing the UnicodeDecodeError, and how should I fix the issue?
EDIT:
Changing the line html_doc = sys.stdin.readlines().encode('utf-8').strip() to html_doc = sys.stdin fixes the issue.
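
For anyone who hits the same error, here is a minimal sketch of the general mechanism as I understand it, not necessarily the exact cause in my script. It assumes Python 3 and that the page contains bytes that are not valid UTF-8. The helper name read_page and the sample bytes are made up for illustration; io.BytesIO stands in for the piped cURL output.

```python
import io


def read_page(stream, encoding="utf-8"):
    """Read raw bytes from a binary stream and decode them leniently.

    Hypothetical helper for illustration; `stream` is any binary file object.
    """
    raw = stream.read()  # bytes: nothing has been decoded yet
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
    return raw.decode(encoding, errors="replace")


# Stand-in for piped cURL output: 0xFF can never start a UTF-8 sequence.
sample = io.BytesIO(b"<p>caf\xff</p>")

# A strict decode of the same bytes raises the error from the question.
try:
    sample.getvalue().decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc.reason)

# The lenient read succeeds and marks the bad byte with U+FFFD.
html_doc = read_page(sample)
print(html_doc)
```

In the real script, the same helper could presumably be called as read_page(sys.stdin.buffer) (after import sys), since sys.stdin.buffer yields the raw bytes rather than text decoded with the locale's encoding.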