urllib for python 3
Q

4

0

This code in Python 3 does not give the output I expect:

import urllib.request
fhand=urllib.request.urlopen('http://www.py4inf.com/code/romeo.txt')
print(fhand.read())

Its output is:

b'But soft what light through yonder window breaks'
b'It is the east and Juliet is the sun'
b'Arise fair sun and kill the envious moon'
b'Who is already sick and pale with grief'

Why do I get b'...'? What can I do to get the right output?

The right text should be:

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
Quartas answered 13/11, 2015 at 8:49 Comment(1)
Possible duplicate of: #6270265 - Odious
G
2

The b'...' is a bytes object: a sequence of raw bytes, not a text string (str).

To convert to a real string, use

fhand.read().decode()

This uses the default encoding, 'UTF-8'. To decode as ASCII instead, use, for example,

fhand.read().decode("ASCII")
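A minimal sketch of the difference between the two calls. The sample bytes below are an assumption standing in for what fhand.read() returns, so the snippet runs without a network connection:

```python
# Sample bytes standing in for fhand.read() (assumption: no network needed here).
raw = b"But soft what light through yonder window breaks\n"

# decode() with no argument uses UTF-8 and returns a str.
text = raw.decode()
print(type(raw).__name__, "->", type(text).__name__)

# Pure ASCII data decodes identically as ASCII or UTF-8.
same_result = raw.decode("ascii") == text

# Bytes outside the ASCII range make the ASCII codec raise an error.
non_ascii = "café".encode("utf-8")
try:
    non_ascii.decode("ascii")
except UnicodeDecodeError:
    ascii_failed = True
else:
    ascii_failed = False
```

For plain English text either call works; the difference only shows up once the data contains non-ASCII bytes.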

Goofy answered 13/11, 2015 at 8:57 Comment(4)
Is there a simpler way? 'fhand.read().decode("ASCII")' OMG, it's so long! - Quartas
@Fourier: decode() has sensible defaults. Just leave the "ASCII" out. Short enough? - Goofy
@Fourier: Thank you for asking about the shorter form. - Goofy
Thanks for your patience! - Quartas
V
1

As the documentation says, urlopen returns an object whose read method gives you a sequence of bytes, not a sequence of characters. In order to convert the bytes to printable characters, which is what you want, you will need to apply the decode method, using the encoding that the bytes are in.

The output is still readable because Python displays any byte in the printable ASCII range as the corresponding character and, apart from the line breaks, this text consists entirely of such bytes. That is only a display convention for bytes; no decoding has taken place.
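To illustrate how bytes are displayed, here is a small sketch using made-up sample strings (not the downloaded file): ASCII bytes print as readable characters, while anything else prints as escape sequences until it is decoded.

```python
# Sample data (assumption): one pure-ASCII string, one with an accented letter.
ascii_bytes = "Juliet is the sun".encode("utf-8")
accented_bytes = "café".encode("utf-8")

print(ascii_bytes)     # b'Juliet is the sun'
print(accented_bytes)  # b'caf\xc3\xa9'

# Decoding with the matching encoding recovers the original text.
print(accented_bytes.decode("utf-8"))  # café
```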

To do this properly, you should call read().decode(encoding), where encoding is the charset value from the Content-Type HTTP header, accessible through the HTTPResponse object (that is, fhand, in your code). If there is no Content-Type header, or if it doesn't specify an encoding, you're reduced to guessing which encoding to use. For plain ASCII English text the guess rarely matters, because most common encodings agree on those characters, and in many other cases the encoding is probably going to be UTF-8.
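One way this could look, as a sketch rather than a definitive implementation. The helper names decode_body and fetch_text and the UTF-8 fallback are assumptions for illustration; get_content_charset() is the standard method on the response's headers object, which returns None when no charset is declared.

```python
from email.message import Message
from urllib.request import urlopen


def decode_body(raw: bytes, headers: Message, fallback: str = "utf-8") -> str:
    """Decode raw bytes with the charset from Content-Type, else a fallback guess."""
    charset = headers.get_content_charset() or fallback
    return raw.decode(charset)


def fetch_text(url: str) -> str:
    """Fetch a URL and return its body as str (hypothetical helper)."""
    with urlopen(url) as response:
        # response.headers is an email.message.Message subclass.
        return decode_body(response.read(), response.headers)
```

With the code from the question, this would be called as print(fetch_text('http://www.py4inf.com/code/romeo.txt')).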

Viridian answered 13/11, 2015 at 9:3 Comment(0)
O
0

Python 3 distinguishes between byte sequences and strings. The "b" in front of the string tells you that urllib returned the contents as "raw" bytes. It might be worth reading up on how Python 3 handles bytes and strings, but basically, you did get the right text back. If you don't want the result to be bytes, you just have to convert it to a "real" Python string by decoding it.

Otocyst answered 13/11, 2015 at 8:56 Comment(0)
C
0

The third-party requests library handles decoding to Unicode strings automatically. It does its best to infer the correct encoding, so you don't need to guess the encoding yourself.

>>> import requests
>>> r = requests.get('http://www.py4inf.com/code/romeo.txt')
>>> print(r.text)
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

Same thing with urllib.request and an assumed UTF-8 encoding:

>>> from urllib.request import urlopen
>>> r = urlopen('http://www.py4inf.com/code/romeo.txt')
>>> print(r.read().decode('UTF-8'))
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
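If you would rather process the text line by line than read it all at once, one option is to wrap the binary response in io.TextIOWrapper. This is a sketch under the same assumed UTF-8 encoding; iter_text_lines is a hypothetical helper name, and it accepts any binary stream, so it can be tried on an in-memory io.BytesIO object as well as on the object returned by urlopen.

```python
import io


def iter_text_lines(binary_stream, encoding="utf-8"):
    """Yield decoded lines, without trailing newlines, from a binary stream."""
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding)
    for line in text_stream:
        yield line.rstrip("\n")


# In-memory sample standing in for the HTTP response (assumption).
sample = io.BytesIO(b"But soft what light through yonder window breaks\n"
                    b"It is the east and Juliet is the sun\n")
lines = list(iter_text_lines(sample))
for line in lines:
    print(line)
```

With a real request, the same helper would be used as: for line in iter_text_lines(urlopen(url)): print(line).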
Canty answered 13/11, 2015 at 19:35 Comment(0)
