BOM in server response screws up json parsing
I'm trying to write a Python script that posts some JSON to a web server and gets some JSON back. I patched together a few different examples on StackOverflow, and I think I have something that's mostly working.

import urllib2
import json

url = "http://foo.com/API.svc/SomeMethod"
payload = json.dumps( {'inputs': ['red', 'blue', 'green']} )
headers = {"Content-type": "application/json;"}

req = urllib2.Request(url, payload, headers)
f = urllib2.urlopen(req)
response = f.read()
f.close()

data = json.loads(response) # <-- Crashes

The last line throws an exception:

ValueError: No JSON object could be decoded

When I look at response, I see valid JSON, but the first few characters are a BOM:

>>> response
'\xef\xbb\xbf[\r\n  {\r\n    ... Valid JSON here

So, if I manually strip out the first three bytes:

data = json.loads(response[3:])

Everything works, and the response is parsed into a Python data structure.

My Question:

It seems kinda silly that json barfs when you give it a BOM. Is there anything different I can do with urllib or the json library to let it know this is a UTF8 string and to handle it as such? I don't want to manually strip out the first 3 bytes.

Freestyle answered 25/1, 2013 at 23:49 Comment(2)
What happens if you add # -*- coding:UTF-8 -*- at the top of your file? - Lovesick
The magic encoding comment only affects which encoding the Python interpreter uses when reading and compiling your code, which for Python 2 means string literals. It has absolutely zero bearing on how Python handles strings at runtime. - Genovera
G
12

You should probably yell at whoever's running this service, because a BOM on UTF-8 text serves no byte-order purpose. The BOM exists to disambiguate byte order, and UTF-8 is a sequence of single bytes, so it has no byte order to disambiguate.
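
To illustrate the byte-order point, here is a small sketch that encodes U+FEFF (the code point used as the BOM) in a few encodings. The variable names are only for illustration. The two UTF-16 byte orders give different bytes, while UTF-8 gives the same three bytes on every platform, which are the bytes seen at the start of the response in the question.

```python
import codecs

bom = u"\ufeff"  # U+FEFF, the code point used as a byte order mark

# UTF-16 uses two-byte code units, so the two byte orders produce different bytes.
utf16_le = bom.encode("utf-16-le")
utf16_be = bom.encode("utf-16-be")
assert utf16_le == codecs.BOM_UTF16_LE == b"\xff\xfe"
assert utf16_be == codecs.BOM_UTF16_BE == b"\xfe\xff"

# UTF-8 is a sequence of single bytes, so there is only one possible result.
utf8 = bom.encode("utf-8")
assert utf8 == codecs.BOM_UTF8 == b"\xef\xbb\xbf"
```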

That said, ideally you should decode bytes before doing anything else with them. Luckily, Python has a codec that recognizes and removes the BOM: utf-8-sig.

>>> '\xef\xbb\xbffoo'.decode('utf-8-sig')
u'foo'

So you just need:

data = json.loads(response.decode('utf-8-sig'))
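
As a quick check that this is safe to use unconditionally, here is a minimal sketch showing that 'utf-8-sig' also decodes input that has no BOM. The sample payloads and the helper name parse are made up for illustration.

```python
import json

# Hypothetical sample payloads: the same JSON with and without a UTF-8 BOM.
with_bom = b'\xef\xbb\xbf[{"color": "red"}]'
without_bom = b'[{"color": "red"}]'

def parse(raw):
    # 'utf-8-sig' drops a leading BOM if one is present and otherwise
    # behaves like plain 'utf-8'.
    return json.loads(raw.decode('utf-8-sig'))

assert parse(with_bom) == parse(without_bom) == [{"color": "red"}]
```

So the decode call does not need to be guarded by a check for the BOM.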
Genovera answered 25/1, 2013 at 23:59 Comment(6)
That fixed it! The person who wrote the service, though, is me. Why it's adding a BOM to the output, I have no idea. It's a .NET web service, and uses JSON.NET to serialize the output. I'll have to dig into it to see why in the world it's adding these bytes. - Freestyle
Ok I fixed the web service too :) It turns out, if you pass an Encoding into StreamWriter() then it adds that encoding's preamble. If you leave it off, the writer defaults to UTF-8 with no BOM. Problem solved! - Freestyle
A BOM on UTF-8 is specifically allowed by the standard, and used all over the place by Windows to distinguish UTF-8 from whatever the OEM charset is. This is stupid, and the standard recommends that applications not do this, but it's prevalent enough that it's also stupid to not be able to handle it when you see it. Refusing to accept a UTF-8 BOM means refusing to interact with .NET services, open Windows text files, etc. See Wikipedia for more details. - Obi
@abarnert, RFC4627 (the json standard) does not allow BOM. - Bataan
@avakar: you must not generate json with BOM but you may accept it: "In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error." rfc 7159 (btw, the old rfc 4627 doesn't mention BOM) - Haruspex
@J.F.Sebastian, oh, thanks, I didn't realize there was a new RFC. - Bataan
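
Following up on the RFC 7159 wording quoted in the comments above, here is a rough sketch of a parser wrapper that ignores a leading BOM instead of treating it as an error. The helper name loads_ignoring_bom is hypothetical, and the sketch assumes byte input is UTF-8 encoded.

```python
import codecs
import json

def loads_ignoring_bom(raw):
    """Parse JSON from bytes or text, ignoring one leading UTF-8 BOM.

    Hypothetical helper; assumes any byte input is UTF-8 encoded.
    """
    if isinstance(raw, bytes):
        # Drop the three BOM bytes if present, then decode as plain UTF-8.
        if raw.startswith(codecs.BOM_UTF8):
            raw = raw[len(codecs.BOM_UTF8):]
        text = raw.decode("utf-8")
    else:
        # Already-decoded text carries the BOM as the single character U+FEFF.
        text = raw[1:] if raw.startswith(u"\ufeff") else raw
    return json.loads(text)

assert loads_ignoring_bom(b'\xef\xbb\xbf{"a": 1}') == {"a": 1}
assert loads_ignoring_bom(u'\ufeff[1, 2]') == [1, 2]
```

Using codecs.BOM_UTF8 avoids hard-coding the length 3 that the question wanted to get rid of.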
R
5

In case anyone else runs into the same problem while using the requests module instead of urllib2, here is a solution that works in Python 2.6 as well as 3.3:

import requests
# 'pass' is a reserved word in Python, so the variable is named 'password'
r = requests.get(url, params=my_dict, auth=(user, password))
print(r.headers['content-type'])  # 'application/json; charset=utf8'
if r.text.startswith(u'\ufeff'):  # bytes \xef\xbb\xbf in UTF-8 encoding
    r.encoding = 'utf-8-sig'
print(r.json())
Roobbie answered 9/5, 2014 at 15:35 Comment(0)
F
0

Since I lack enough reputation for a comment, I'll write an answer instead.

I usually run into this problem when I need to leave the underlying Stream of a StreamWriter open. The overload that has the option to leave the underlying Stream open also requires an encoding (which will be UTF-8 in most cases). Here's how to use it without emitting the BOM.

/* Encoding.UTF8 (the one you'd normally use in these cases) emits the BOM,
 * so use the instance below instead.
 */

// The UTF8Encoding constructor takes a flag that enables or disables the BOM in the output
UTF8Encoding encoding = new UTF8Encoding(false);

using (MemoryStream ms = new MemoryStream())
using (StreamWriter sw = new StreamWriter(ms, encoding, 4096, true))
using (JsonTextWriter jtw = new JsonTextWriter(sw))
{
    serializer.Serialize(jtw, myObject);
}
Faucal answered 18/4, 2015 at 11:49 Comment(0)
