Python 2 vs. Python 3 - urllib formats

4

22

I'm getting really tired of trying to figure out why this code works in Python 2 and not in Python 3. I'm just trying to grab a page of json and then parse it. Here's the code in Python 2:

import urllib, json
response = urllib.urlopen("http://reddit.com/.json")
content = response.read()
data = json.loads(content)

I thought the equivalent code in Python 3 would be this:

import urllib.request, json
response = urllib.request.urlopen("http://reddit.com/.json")
content = response.read()
data = json.loads(content)

But it blows up in my face, because the data returned by read() is a "bytes" type. However, I cannot for the life of me get it to convert to something that json will be able to parse. I know from the headers that reddit is trying to send utf-8 back to me, but I can't seem to get the bytes to decode into utf-8:

import urllib.request, json
response = urllib.request.urlopen("http://reddit.com/.json")
content = response.read()
data = json.loads(content.decode("utf8"))

What am I doing wrong?

Edit: the problem is that I cannot get the data into a usable state; even though json loads the data, part of it is undisplayable, and I want to be able to print the data to the screen.

Second edit: The problem has more to do with print than parsing, it seems. Alex's answer provides a way for the script to work in Python 3, by setting the IO to utf8. But a question still remains: why is it that the code worked in Python 2, but not Python 3?

Ilmenite answered 27/6, 2010 at 23:50 Comment(0)
15

The code you post is presumably due to wrong cut-and-paste operations because it's clearly wrong in both versions (f.read() fails because there's no f barename defined).

In Py3, ur = response.decode('utf8') works perfectly well for me, as does the following json.loads(ur). Maybe the wrong copy-and-pastes affected your 2-to-3 conversion attempts.
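The decode-then-parse step can be tried without touching the network. This is a minimal sketch: the payload is made up and stands in for what response.read() returns in Python 3, and the names raw, text and data are only illustrative.

```python
import json

# Stand-in for response.read(): in Python 3, urlopen(...).read() returns bytes.
# The payload is invented for illustration.
raw = '{"title": "caf\u00e9", "score": 22}'.encode("utf-8")

text = raw.decode("utf-8")  # bytes -> str
data = json.loads(text)     # str -> Python objects (here, a dict)

print(type(raw).__name__, type(text).__name__, data["score"])
```

Since Python 3.6, json.loads also accepts bytes directly (UTF-8, UTF-16 or UTF-32), so the explicit decode mainly matters on older 3.x releases or when the server uses a different encoding.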

Pansy answered 28/6, 2010 at 0:6 Comment(9)
Whoops, I will fix the code mistakes... I tried reformatting it for display but screwed it all up in the process. :P Regardless, I can't view the data after I parse it (using a simple "print(data)") because it gives me charmap errors. - Ilmenite
@Daniel, the problems after you've gotten the data seem to be a separate question from this one about getting the data (which my answer, it appears, responded to -- though seemingly you don't agree, since you didn't even upvote it!). If by data you mean the json.loads(response), I can print it without any problem (on my Mac Terminal.app, which supports UTF-8). What's your sys.stdout.encoding? Have you set properly the environment variable PYTHONIOENCODING: Encoding[:errors] used for stdin/stdout/stderr before starting Python 3? Etc, etc -- totally different issues, see. - Pansy
Sorry if I was unclear at first. The core problem is I can't use the data after parsing, for whatever reason (the print is just the beginning of it; if I can't print it, then somewhere down the line I'm going to run into trouble reading the data). I'll check out the encoding, suffice to say it doesn't work on my W7 machine. - Ilmenite
@Daniel, if you can't print it, it's perfectly possible that the problem has nothing to do with anything else except the output capability of your Windows terminal -- as en.wikipedia.org/wiki/Code_page says, "Most well-known code pages [...] fit all their code-points into 8 bits and do not involve anything more than mapping each code-point to a single bitmap", meaning they just can't show most Unicode characters. This would not stop you from using your data in any other way -- and we could discuss Unicode woes on Windows much better in a Q & A rather than cramped in comments! - Pansy
If it were just the output capability of the Windows terminal, then why does the code work in Python 2? - Ilmenite
@Daniel, perhaps by a different setting of sys.stdout.encoding (e.g. via PYTHONIOENCODING, etc) -- I've already asked about that and I've heard nothing from you in response in this interminable thread of comments you insist on perpetuating. Why not just print(repr(data)) in both cases and check if anything is different? If not, then you know it's all about output/terminal issues, as I suspect it may well be -- if specific differences, then of course let us know (editing your Q please, not in yet another cramped comment!-). - Pansy
I can't test the code at the moment anyways because reddit itself is down; once I can I'll edit the question with details. I do know that the sys.stdout.encoding is the same between my 2.6 and 3.1 instances (cp437, which I could try setting to something else). - Ilmenite
@Daniel, CP437 (like most CPs) just won't let you show every Unicode character (a tiny subset, in fact). Type into the Windows console "chcp 65001" (this sets the code page to UTF-8) and change the terminal font to a Unicode font: Right click title bar, Properties, Font, Lucida Console; then SET PYTHONIOENCODING=utf8. - Pansy
The PYTHONIOENCODING solved the problem, but I still want to know why it worked in P2 but not P3. - Ilmenite
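One likely explanation for the Python 2 vs. Python 3 difference, offered as an assumption since the thread never confirms it: printing a dict writes the repr of its contents. Python 2's repr escaped non-ASCII characters in unicode strings as \uXXXX, so the output was pure ASCII and any code page could show it. Python 3's repr keeps printable non-ASCII characters as they are, so they have to be encoded to the console's code page. The sketch below imitates that with the built-in ascii(), which behaves like the old repr; the sample data is hypothetical, and cp437 is the code page mentioned in the comments above.

```python
# Hypothetical parsed data containing a character that cp437 cannot represent.
data = {"title": "snowman \u2603"}

py3_style = repr(data)   # keeps the snowman as a literal character
py2_style = ascii(data)  # escapes it as \u2603, as Python 2's repr did

try:
    py3_style.encode("cp437")  # roughly what print() must do on a cp437 console
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(encodable)                  # False: the Python 3 repr does not fit cp437
print(py2_style.encode("cp437"))  # the escaped form encodes without error
```

If the console cannot be switched to UTF-8, print(ascii(data)) is one way to get Python 2-like output in Python 3.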
7

Depending on your Python version, you have to choose the correct library.

For Python 3.5:

import urllib.request
data = urllib.request.urlopen(url).read().decode('utf8')

For Python 2.7:

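import urllib
# serviceurl and address are assumed to be defined earlier; the original answer does not show them
url = serviceurl + urllib.urlencode({'sensor': 'false', 'address': address})
uh = urllib.urlopen(url)
data = uh.read()  # in Python 2, read() returns a byte string (str)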
Errancy answered 27/10, 2015 at 0:29 Comment(1)
You might want to provide an explanation to clarify your code. - Are
0

Please see that answer in another Unicode related question.

Now: the Python 3 str type (which was the Python 2 unicode type) is an idealised object, in the sense that it deals with “characters”, not “bytes”. To be written to or read from disk or network data, these characters need to be encoded into, or decoded from, bytes by a “conversion table”, a.k.a. an encoding, a.k.a. a codepage. Because of operating system variety, Python historically avoided guessing what that encoding should be; this has been changing over the years, but the principle of “In the face of ambiguity, refuse the temptation to guess.” still applies.

Thankfully, a web server makes your work easier. Your response above should give you all extra information needed:

>>> response.headers['content-type']
'application/json; charset=UTF-8'

So, every time you issue a request to a web server, check the Content-Type header for a charset value, and decode the response's data into Unicode by using that charset (Python 3: bytes.decode(charset) returns a str).
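As a rough sketch of that advice, the charset can be pulled out of a Content-Type value with the standard library's email.message.Message, which is the base class of the headers object that urlopen responses carry in Python 3. The helper names charset_from_content_type and decode_body, the UTF-8 fallback, and the sample body are assumptions made for this illustration.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None if absent.
    return msg.get_content_charset() or default


def decode_body(raw, content_type):
    """Decode a bytes body using the charset announced in Content-Type."""
    return raw.decode(charset_from_content_type(content_type))


# Made-up body standing in for response.read().
body = '{"name": "Zo\u00eb"}'.encode("utf-8")
print(decode_body(body, "application/json; charset=UTF-8"))
```

With a live Python 3 response, response.headers.get_content_charset() gives the same value directly, so the helper is mainly useful when only the header string is at hand.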

Narcissus answered 17/10, 2011 at 0:10 Comment(0)
0

Here is an approach that is compatible across both versions - it works by first converting bytes data to string, and then loading the string.

import json
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2

url = 'https://jsonfeed.org/feed.json'
request = Request(url)
response_json_string = urlopen(request).read().decode('utf8')
response_json_object = json.loads(response_json_string)
Overbid answered 7/2, 2020 at 22:8 Comment(0)
