How to handle response encoding from urllib.request.urlopen(), to avoid TypeError: can't use a string pattern on a bytes-like object [duplicate]

M

7

68

I'm trying to open a webpage using urllib.request.urlopen() then search it with regular expressions, but that gives the following error:

TypeError: can't use a string pattern on a bytes-like object

I understand why: urllib.request.urlopen() returns a byte stream, so re doesn't know which encoding to use. What am I supposed to do in this situation? Is there a way to specify the encoding in the request, or will I need to decode the bytes myself? If so, how should I go about it? I assume I should read the encoding from the response headers, or from the encoding declared in the HTML if there is one, and then decode using that.
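
A minimal sketch of the mismatch described above, using a hypothetical hard-coded `raw` value in place of a real response body. It shows the error and two possible ways around it: decode to str first, or keep the bytes and use a bytes pattern. The UTF-8 choice is an assumption for this sample only.

```python
import re

# Hypothetical stand-in for what urlopen(...).read() returns: bytes, not str.
raw = b"<html><title>Example</title></html>"

# A str pattern applied to bytes raises the TypeError from the question.
try:
    re.search(r"<title>(.*?)</title>", raw)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Option 1: decode the bytes to str, then search with a str pattern.
text = raw.decode("utf-8")  # assumes the sample really is UTF-8
title_from_text = re.search(r"<title>(.*?)</title>", text).group(1)

# Option 2: keep the bytes and search with a bytes pattern.
title_from_bytes = re.search(rb"<title>(.*?)</title>", raw).group(1)
```

Option 2 avoids guessing an encoding, but the match comes back as bytes, so it only helps when the pattern itself is plain ASCII.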

Mission answered 13/2, 2011 at 2:5 Comment(1)

Not one of these answers works for me in Python 3.5x using urllib.request, because urllib.request.urlopen(url) literally returns ONLY a byte stream: it has NO member functions to parse any form of header in the HTML. So no info(), no headers, etc. I'd have to parse it myself to find the encoding, but without the encoding I can't convert it to text to parse it. It's a catch-22. – Sunstone 19/12, 2016 at 22:2

E

66

You just need to decode the response, using the charset given in the Content-Type header (typically the last value in that header). There is an example in the tutorial too.

output = response.decode('utf-8')

Encincture answered 13/2, 2011 at 2:9 Comment(2)

What if the charset is not utf-8? Would it be a better idea to somehow determine it from the response instead of hard-coding this assumption? – Soares 23/6, 2014 at 17:56

The Content-Type header on the response includes the charset value, which is what you need to properly decode the response (at least, before guessing utf-8). For example: Content-Type: text/html; charset=utf-8 – Loco 19/9, 2018 at 21:4
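
As a rough illustration of the comment above, the charset can be pulled out of a Content-Type value with the standard library's email parser, which is the same kind of object urllib exposes as the response headers. The helper name `charset_from_content_type` and the UTF-8 default are assumptions made for this sketch.

```python
from email.message import Message


def charset_from_content_type(value, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default
```

For a header such as `text/html; charset=utf-8` this yields `utf-8`; for a bare `text/html` it falls back to the default, which is only a guess.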

T

120

For me, the solution is as follows (Python 3):

import urllib.request
resource = urllib.request.urlopen(an_url)
content = resource.read().decode(resource.headers.get_content_charset())

Terramycin answered 3/10, 2013 at 9:54 Comment(3)

Looks like the best answer but what if the server doesn't send the charset info? – Parlay 16/7, 2014 at 18:5

If the server doesn't send charset info your best bet at that point is to guess. – Hydrosol 6/8, 2014 at 16:30

@rvighne: if the server doesn't pass charset in Content-Type header then there are complex rules to figure out the character encoding e.g., it may be specified inside html document: <meta charset="utf-8">. – Dandle 22/10, 2014 at 4:38
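
One possible way to act on these comments is a fallback chain: use the header charset if there is one, otherwise look for a `<meta>` charset hint near the start of the document, otherwise guess. This is a simplified sketch, not the full detection rules browsers follow. The helper name `decode_body`, the 1024-byte scan window, and the UTF-8 default are all assumptions.

```python
import re

# Matches <meta charset="..."> and the charset=... part of the older
# <meta http-equiv="Content-Type" content="...; charset=..."> form.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset=[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def decode_body(body, header_charset=None, default="utf-8"):
    """Decode bytes using the header charset, then a <meta> hint, then a default."""
    charset = header_charset
    if charset is None:
        # Only scan the start of the document (assumed window for this sketch).
        match = META_CHARSET.search(body[:1024])
        if match:
            charset = match.group(1).decode("ascii")
    try:
        return body.decode(charset or default, errors="replace")
    except LookupError:
        # The declared encoding name is unknown to Python: fall back to the guess.
        return body.decode(default, errors="replace")
```

With urllib this could be called as `decode_body(resource.read(), resource.headers.get_content_charset())`. Using `errors="replace"` trades exactness for never raising on a wrong guess.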

H

11

I had the same issues for the last two days. I finally have a solution. I'm using the info() method of the object returned by urlopen():

req = urllib.request.urlopen(URL)
charset = req.info().get_content_charset()
content = req.read().decode(charset)

Hermilahermina answered 17/11, 2015 at 12:41 Comment(1)

This is exactly the same answer that Ivan Klass posted 2 years before, except using info instead of headers. :-/ With no explanation of why to pick this over that, this answer looks like a duplicate to me. – Adenaadenauer 29/12, 2018 at 1:18
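
On the info() versus headers point: as far as can be told from the standard library, info() just returns the same header object that the headers attribute exposes, and recent Python documentation lists info() as deprecated in favour of headers. A small offline check, using a hypothetical `data:` URL so that no network access is needed:

```python
import urllib.request

# A data: URL keeps the example offline; the payload is a made-up sample.
url = "data:text/plain;charset=utf-8,hello"

with urllib.request.urlopen(url) as response:
    # Both spellings should hand back the very same header object.
    same_object = response.info() is response.headers
    charset_via_info = response.info().get_content_charset()
    charset_via_headers = response.headers.get_content_charset()
```

If that holds for the response type in use, the two answers differ only in spelling.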

D

5

Here is a simple example HTTP request (tested and working):

import urllib.request
address = "http://stackoverflow.com"
urllib.request.urlopen(address).read().decode('utf-8')

Make sure to read the documentation.

https://docs.python.org/3/library/urllib.request.html

If you want to make a more detailed GET/POST request:

import urllib.request
# HTTP REQUEST of some address
def REQUEST(address):
    req = urllib.request.Request(address)
    req.add_header('User-Agent', 'NAME (Linux/MacOS; FROM, USA)')
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')  # decode the bytes to text (assumes UTF-8)
    print("REQUEST (ONLINE): " + address)
    return html

Discriminative answered 13/12, 2019 at 6:18 Comment(1)

Does this not have the same issue as the accepted answer? To quote a comment from there: What if the charset is not utf-8? Would it be a better idea to somehow determine it from the response instead of hard-coding this assumption? – Bezant 22/7, 2020 at 21:12
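
A possible variant of the function above that addresses this comment by reading the charset from the response headers and only falling back to UTF-8 when none is declared. The name `fetch_text`, the placeholder User-Agent string, and the fallback choice are assumptions for this sketch.

```python
import urllib.request


def fetch_text(address, default_charset="utf-8"):
    """Fetch a URL and decode the body using the charset the server declares."""
    req = urllib.request.Request(address)
    # Placeholder User-Agent; replace it with a value identifying your client.
    req.add_header("User-Agent", "example-client/0.1")
    with urllib.request.urlopen(req) as response:
        # get_content_charset() returns None when no charset is declared.
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset, errors="replace")
```

Usage would look like `html = fetch_text("http://stackoverflow.com")`. Pages that declare their encoding only inside the HTML would still need extra handling.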

F

4

With requests:

import requests

response = requests.get(URL).text

Fundamentalism answered 28/4, 2016 at 9:18 Comment(1)

This is using a different library entirely. – Bezant 22/7, 2020 at 21:11

W

1

urllib.request.urlopen(url).headers.get('Content-Type')

Will output something like this:

text/html; charset=utf-8

Weihs answered 1/12, 2011 at 16:48 Comment(0)

C

-3

After you make a request with req = urllib.request.urlopen(...), you have to read the response by calling html_string = req.read(). That will give you the string response, which you can then parse the way you want.

Calamite answered 13/2, 2011 at 2:9 Comment(2)

I do, that's how I get it, but it returns a bytestream, b'<HTML>...'. – Mission 13/2, 2011 at 2:10

I see. Then you can use .decode() as @Senthil pointed out, or you can use urllib2, which should handle this transparently for you. – Calamite 13/2, 2011 at 2:13
