How to handle response encoding from urllib.request.urlopen() to avoid TypeError: can't use a string pattern on a bytes-like object [duplicate]
M

7

68

I'm trying to open a webpage using urllib.request.urlopen() then search it with regular expressions, but that gives the following error:

TypeError: can't use a string pattern on a bytes-like object

I understand why: urllib.request.urlopen() returns a byte stream, so re doesn't know which encoding to use. What am I supposed to do in this situation? Is there a way to specify the encoding in the request, or do I need to decode the bytes myself? If so, how should I go about it? I assume I should read the encoding from the response headers, or from the encoding declared in the HTML if there is one, and then decode using that?
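
To make the problem concrete, here is a minimal sketch that reproduces the error without any network access. The `body` value is a made-up stand-in for what `urlopen(...).read()` returns, and UTF-8 is only an assumption about the page.

```python
import re

# Stand-in for what urllib.request.urlopen(url).read() returns: bytes, not str.
body = b"<html><title>Example</title></html>"

# A str pattern applied to bytes raises the TypeError from the question.
try:
    re.search(r"<title>(.*?)</title>", body)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Option 1: decode the bytes first, then use an ordinary str pattern.
text = body.decode("utf-8")  # assumes the page really is UTF-8
title_from_text = re.search(r"<title>(.*?)</title>", text).group(1)

# Option 2: keep the bytes and search with a bytes pattern instead.
title_from_bytes = re.search(rb"<title>(.*?)</title>", body).group(1)
```

Option 2 avoids choosing an encoding at all, but the match you get back is still bytes.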

Mission answered 13/2, 2011 at 2:5 Comment(1)
None of these answers work for me in Python 3.5.x using urllib.request, because urllib.request.urlopen(url) literally returns ONLY a byte stream. It has NO member functions to parse any form of header in the HTML. So no info(), no headers, etc. I'd have to parse it myself to find the encoding, but without the encoding I can't convert it to text to parse it. It's a catch-22.Sunstone
E
66

You just need to decode the response body, using the charset given in the Content-Type header (typically the last value in that header). There is an example given in the tutorial too.

output = response.read().decode('utf-8')
Encincture answered 13/2, 2011 at 2:9 Comment(2)
What if the charset is not utf-8? Would it be a better idea to somehow determine it from the response instead of hard-coding this assumption?Soares
The Content-Type header on the response includes the charset value, which is what you need to properly decode the response (at least, before guessing utf-8). For example: Content-Type: text/html; charset=utf-8Loco
T
120

For me, the solution is as follows (Python 3):

resource = urllib.request.urlopen(an_url)
content = resource.read().decode(resource.headers.get_content_charset())
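
Note that get_content_charset() returns None when the server sends no charset, and decode(None) then fails. Below is a rough sketch of one way to fall back in that case. The helper name `decode_body`, the 1024-byte sniffing window, and the UTF-8 default are all my own assumptions, and the `<meta>` check is a simplification, not the full HTML encoding-detection rules.

```python
import re
from typing import Optional


def decode_body(body: bytes, header_charset: Optional[str], default: str = "utf-8") -> str:
    """Decode bytes using the header charset, else a <meta> charset, else a default."""
    charset = header_charset
    if charset is None:
        # Simplified sniff: look for charset=... inside a <meta> tag near the top.
        match = re.search(rb'<meta[^>]+charset=["\']?([\w-]+)', body[:1024], re.IGNORECASE)
        if match:
            charset = match.group(1).decode("ascii")
    try:
        return body.decode(charset or default, errors="replace")
    except LookupError:
        # The declared encoding name is not one Python knows about.
        return body.decode(default, errors="replace")
```

With the snippet above it would be called as `decode_body(resource.read(), resource.headers.get_content_charset())`.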
Terramycin answered 3/10, 2013 at 9:54 Comment(3)
Looks like the best answer but what if the server doesn't send the charset info?Parlay
If the server doesn't send charset info your best bet at that point is to guess.Hydrosol
@rvighne: if the server doesn't pass charset in Content-Type header then there are complex rules to figure out the character encoding e.g., it may be specified inside html document: <meta charset="utf-8">.Dandle
H
11

I had the same issues for the last two days. I finally have a solution. I'm using the info() method of the object returned by urlopen():

req = urllib.request.urlopen(URL)
charset = req.info().get_content_charset()
content = req.read().decode(charset)
Hermilahermina answered 17/11, 2015 at 12:41 Comment(1)
this is exactly the same answer that Ivan Klass posted 2 years before, except using info instead of headers. :-/ With no explanation as to why pick this instead of that, this answer looks like a duplicate to me.Adenaadenauer
D
5

Here is a simple example of an HTTP request (I tested it and it works):

address = "http://stackoverflow.com"
urllib.request.urlopen(address).read().decode('utf-8')

Make sure to read the documentation.

https://docs.python.org/3/library/urllib.request.html

If you want to do something more detailed, such as a GET/POST request:

import urllib.request
# HTTP REQUEST of some address
def REQUEST(address):
    req = urllib.request.Request(address)
    req.add_header('User-Agent', 'NAME (Linux/MacOS; FROM, USA)')
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')  # decode the bytes into text (assumes UTF-8)
    print("REQUEST (ONLINE): " + address)
    return html
Discriminative answered 13/12, 2019 at 6:18 Comment(1)
Does this not have the same issue as the accepted answer? To quote a comment from there: What if the charset is not utf-8? Would it be a better idea to somehow determine it from the response instead of hard-coding this assumption?Bezant
F
4

With requests:

import requests

response = requests.get(URL).text
Fundamentalism answered 28/4, 2016 at 9:18 Comment(1)
This is using a different library entirely.Bezant
W
1
urllib.request.urlopen(url).headers.get('Content-Type')

Will output something like this:

text/html; charset=utf-8
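
If you only have a header value like the one above as a string, here is a small sketch of pulling the charset out of it with the standard library's email.message.Message, instead of splitting the string by hand. The function name `charset_from_content_type` is just something I made up for illustration.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() lowercases the name and returns None if it is absent.
    return msg.get_content_charset()
```

A result of None means the server did not declare a charset, so you still need a fallback before decoding.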

Weihs answered 1/12, 2011 at 16:48 Comment(0)
C
-3

After you make a request with req = urllib.request.urlopen(...), you have to read the response by calling html_string = req.read(). That will give you the response string, which you can then parse the way you want.

Calamite answered 13/2, 2011 at 2:9 Comment(2)
I do, that's how I get it, but it returns a byte stream, b'<HTML>...'.Mission
I see. Then you can use .decode() as @Senthil pointed out, or you can use urllib2, which should handle this transparently for you.Calamite
