urllib for Python 3

Asked 13/11, 2015 at 8:49 Answered 13/11, 2015 at 19:35

This code in Python 3 is problematic:

import urllib.request
fhand=urllib.request.urlopen('http://www.py4inf.com/code/romeo.txt')
print(fhand.read())

Its output is:

b'But soft what light through yonder window breaks'
b'It is the east and Juliet is the sun'
b'Arise fair sun and kill the envious moon'
b'Who is already sick and pale with grief'

Why did I get b'...'? What can I do to get the right response?

The right text should be

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

Quartas answered 13/11, 2015 at 8:49 Comment(1)

possible duplicate of: #6270265 – Odious 13/11, 2015 at 8:53

The b'...' is a byte string: an array of bytes, not a real string.

To convert to a real string, use

fhand.read().decode()

This uses the default encoding, 'UTF-8'. To decode as ASCII instead, use, for example,

fhand.read().decode("ASCII")
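As a minimal sketch of the two decode() calls above: the bytes literal below stands in for what fhand.read() returns (an assumption, so no network request is needed), and the accented sample word is only there to show where the two codecs differ.

```python
# Stand-in for the bytes that fhand.read() would return (assumed sample data).
raw = b'But soft what light through yonder window breaks\n'

text_default = raw.decode()        # no argument means UTF-8
text_ascii = raw.decode('ascii')   # same result, because the data is pure ASCII

print(type(raw).__name__, '->', type(text_default).__name__)
print(text_default == text_ascii)

# With non-ASCII data the choice of codec starts to matter.
accented = 'caf\u00e9'.encode('utf-8')   # hypothetical sample word
print(accented.decode())                 # decodes cleanly as UTF-8
try:
    accented.decode('ascii')
except UnicodeDecodeError as exc:
    # ASCII cannot represent bytes above 0x7F, so decoding fails.
    print('ASCII decode failed:', exc.reason)
```

For text that contains only ASCII characters, as in the question, either call gives the same string.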

Goofy answered 13/11, 2015 at 8:57 Comment(4)

Is there a simpler way? 'fhand.read().decode("ASCII")' OMG, it's so long! – Quartas 13/11, 2015 at 9:35

@Fourier: decode() has sensible defaults. Just leave the "ASCII" out. Short enough? – Goofy 13/11, 2015 at 9:56

@Fourier: Thank you for asking about the shorter form. – Goofy 13/11, 2015 at 10:35

Thanks for your patience! – Quartas 13/11, 2015 at 10:56

As the documentation says, urlopen returns an object whose read method gives you a sequence of bytes, not a sequence of characters. In order to convert the bytes to printable characters, which is what you want, you will need to apply the decode method, using the encoding that the bytes are in.

The reason the result seems to make sense is that, when Python displays a bytes object, it shows every byte in the printable ASCII range as the corresponding ASCII character, and ASCII happens to be the right encoding, or at least matches the right one, for these characters.

To do this properly, you should call read().decode(encoding), where encoding is the charset value from the Content-Type HTTP header, accessible through the HTTPResponse object (that is, fhand, in your code). If there is no Content-Type header, or if it doesn't specify an encoding, you're reduced to guessing which encoding to use. For typical English text it rarely matters, because ASCII characters are encoded the same way in ASCII, UTF-8, and Latin-1, and in many other cases it's probably going to be UTF-8.
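A hedged sketch of that approach: the helper name fetch_text and its fallback parameter are made up for illustration, while headers.get_content_charset() is the standard-library call that extracts the charset from the Content-Type header. The demo uses a data: URL (which urlopen also accepts) so that it runs without network access; with the URL from the question you would pass that instead.

```python
from urllib.request import urlopen

def fetch_text(url, fallback='utf-8'):
    """Fetch url and decode the body using the charset from Content-Type.

    fetch_text and fallback are hypothetical names for this sketch.
    If the server declares no charset, fall back to a guess (UTF-8 here).
    """
    with urlopen(url) as response:
        # Returns e.g. 'iso-8859-1', or None when no charset is declared.
        charset = response.headers.get_content_charset() or fallback
        return response.read().decode(charset)

# Demo with data: URLs so no network is needed (assumed sample content).
# The first declares a Latin-1 charset; %E9 is the Latin-1 byte for e-acute.
print(fetch_text('data:text/plain;charset=iso-8859-1,caf%E9'))
# The second declares no charset, so the UTF-8 fallback is used.
print(fetch_text('data:text/plain,hello'))
```

Falling back to UTF-8 is a guess, as the answer says; a server that sends another encoding without declaring it would still be decoded incorrectly.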

Viridian answered 13/11, 2015 at 9:3 Comment(0)

Python 3 distinguishes between byte sequences and strings. The "b" in front of the string tells you that urllib returned the contents as "raw" bytes. It might be worth reading up on the Python 3 bytes/strings situation, but basically, you did get the right text back. If you don't want the result to be bytes, you just have to convert it to a "real" Python string.
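To make the distinction concrete, here is a small sketch using an assumed sample line rather than a live download. It shows that bytes and str are separate types and that converting between them is a lossless round trip when the encoding is right.

```python
raw = b'Arise fair sun and kill the envious moon'   # assumed sample bytes
text = raw.decode('utf-8')                          # bytes -> str

# The b prefix only appears in the representation of the bytes object.
print(repr(raw))
print(repr(text))

# Indexing bytes yields an integer byte value; indexing str yields a character.
first_byte = raw[0]
first_char = text[0]
print(first_byte, first_char)

# Encoding the string again gives back exactly the original bytes.
round_trip = text.encode('utf-8')
print(round_trip == raw)
```

So the data is the same in both cases; only the type, and therefore how print displays it, differs.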

Otocyst answered 13/11, 2015 at 8:56 Comment(0)

The third-party requests library handles decoding to Unicode strings automatically. It does its best to infer the correct encoding, so you don't need to guess it yourself.

>>> import requests
>>> r = requests.get('http://www.py4inf.com/code/romeo.txt')
>>> print(r.text)
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

Same thing with urllib.request and an assumed UTF-8 encoding:

>>> from urllib.request import urlopen
>>> r = urlopen('http://www.py4inf.com/code/romeo.txt')
>>> print(r.read().decode('UTF-8'))
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

Canty answered 13/11, 2015 at 19:35 Comment(0)
