OverflowError: Unsupported UTF-8 sequence length when encoding string

Inside a Twisted Resource, I am returning a JSON-encoded dict (the response variable below). The data is a list of 5 people, each with a name, a GUID, and a couple of other fields under 32 characters long, so it is not a lot of data.

I get this OverflowError exception pretty often, but I don't quite understand what the unsupported UTF-8 sequence length refers to.

self.request.write(ujson.dumps(response))

exceptions.OverflowError: Unsupported UTF-8 sequence length when encoding string

Retrospective answered 7/12, 2011 at 20:41 Comment(3)
Look at the content of response and try applying base64.urlsafe_b64encode to the byte string you have in it. - Indecorum
I had this error when I had a list of uuid.uuid4() values, but should have used str(uuid.uuid4()). - Kensell
@MartinThoma: Thanks for the insight. That's what solved the issue for me too. - Chapa
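
The UUID fix mentioned in the comments can be sketched roughly as follows. This uses the standard json module rather than ujson so that it runs without extra packages; the variable names are made up for illustration, and the standard module reports the problem as a TypeError rather than the OverflowError discussed here.

```python
import json
import uuid

# Three raw UUID objects, as in the comment above.
ids = [uuid.uuid4() for _ in range(3)]

try:
    json.dumps(ids)
except TypeError as exc:
    # The standard encoder does not know how to serialize UUID objects.
    print("raw UUID objects fail:", exc)

# Converting each UUID to its string form first makes the list encodable.
id_strings = [str(u) for u in ids]
encoded = json.dumps(id_strings)
print(encoded)
```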
U
4

When in doubt, check the source: http://code.google.com/p/rapidjson/source/browse/trunk/thirdparty/ultrajson/ultrajsonenc.c

This error happens when the UTF-8 sequence length is 5 or 6 bytes. This JSON implementation doesn't support sequences that long. Those characters won't work if you're using the data in a browser anyway, since they're outside the range of UTF-16.

I'd be surprised if this actually happened often; it'd only happen with codepoints over U+1FFFFF, which lie beyond the Unicode maximum of U+10FFFF and are not even supported in Unicode strings by most builds of Python. You should find out why these characters are showing up in your data.
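
To make the byte-length boundaries concrete, here is a small sketch of the original 1-to-6-byte UTF-8 length scheme, the one that allowed 5- and 6-byte sequences. The function name utf8_sequence_length is made up for illustration; it is not part of ujson and is not how ujson computes lengths.

```python
import sys

def utf8_sequence_length(codepoint):
    """Bytes needed for a codepoint under the original 1-to-6-byte UTF-8 scheme."""
    if codepoint < 0:
        raise ValueError("codepoint must be non-negative")
    # Exclusive upper bounds for 1, 2, 3, 4, 5 and 6 byte sequences.
    limits = [0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000]
    for length, limit in enumerate(limits, start=1):
        if codepoint < limit:
            return length
    raise ValueError("codepoint too large for UTF-8")

print(utf8_sequence_length(0x41))      # 1
print(utf8_sequence_length(0x1FFFFF))  # 4, the last codepoint that fits in 4 bytes
print(utf8_sequence_length(0x200000))  # 5, the first codepoint that needs 5 bytes

# Python 3.3+ strings cannot hold anything above this value.
print(hex(sys.maxunicode))
```

Since Python strings stop at U+10FFFF, a genuine 5- or 6-byte sequence cannot come from ordinary text, which is a hint that the value being encoded is not really a string.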

Unhandy answered 7/12, 2011 at 21:9 Comment(4)
Thanks Glenn. I'm still getting used to Python and thought it was a Twisted issue; I didn't think to look at ujson since it was working fine with other data. The data does come into the app over a socket connection, so that is most likely the culprit. Thanks a lot. - Retrospective
I don't see why "outside the BMP" is particularly relevant to the question of whether a browser can render a glyph for a particular code point. It also seems to me like this really qualifies as a bug in the implementation; the JSON spec is quite explicit that a "char" is "any Unicode character except double-quote or backslash or a control character". - Bessbessarabia
@Karl: Just a typo; it's the range that matters: [0,0x1FFFFF]. JavaScript uses UTF-16, which can only represent codepoints up to U+10FFFF, well inside that range. In practice, JSON serializers that output ASCII use UTF-16 surrogate pairs, so they are limited in the same way; JSON has no 8-digit Unicode escape. - Unhandy
The verdict: I was storing the data in MongoDB. The error came from the default _id value Mongo returns from the db. I unset that field and the errors went away. Thanks again for pointing me in the right direction. - Retrospective
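
The point about surrogates in the comments above can be checked with a short sketch. It uses the standard json module, not ujson, and the sample character is an arbitrary choice.

```python
import json

text = "\U0001F600"  # one codepoint above U+FFFF, so outside the BMP

# With the default ensure_ascii=True, the encoder writes a UTF-16 surrogate pair.
escaped = json.dumps(text)
print(escaped)  # "\ud83d\ude00"

# Decoding the pair gives back the single original codepoint.
print(json.loads(escaped) == text)  # True

# The same character takes 4 bytes in UTF-8, below the 5-byte threshold.
print(len(text.encode("utf-8")))  # 4
```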
T
19

Just a note that I recently encountered this same error, and can give a little background.

If you see this, it's possible you're trying to JSON-encode a MongoDB object with ujson in Python.

Using Python's standard json library, we get a more helpful error message:

TypeError: ObjectId('510652d322fc956ca9e41342') is not JSON serializable

ujson is somehow trying to encode an ObjectId Python object and getting lost. There are a few options, the most direct being to remove the '_id' field from the Mongo document before encoding it. You could also convert the ObjectIds into simple character strings before handing the data to ujson.
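
Both options can be sketched as below. To keep the example runnable without pymongo, FakeObjectId is a hypothetical stand-in for the real ObjectId class, make_json_safe is a made-up helper name, and the standard json module is used in place of ujson.

```python
import json

class FakeObjectId:
    """Hypothetical stand-in for an ObjectId so the example runs without pymongo."""
    def __init__(self, hex_id):
        self.hex_id = hex_id

    def __str__(self):
        return self.hex_id

def make_json_safe(document):
    """Return a shallow copy of a dict with non-JSON values converted to strings."""
    json_types = (str, int, float, bool, type(None), list, dict)
    return {
        key: value if isinstance(value, json_types) else str(value)
        for key, value in document.items()
    }

doc = {"_id": FakeObjectId("510652d322fc956ca9e41342"), "name": "Alice"}

# Option 1: drop the '_id' field before encoding.
without_id = {k: v for k, v in doc.items() if k != "_id"}
print(json.dumps(without_id))

# Option 2: turn the id into a plain string first.
print(json.dumps(make_json_safe(doc)))
```

The helper only looks at top-level values, so nested documents would need a recursive version. After either step the dict holds only plain types, so it should be safe to pass to ujson.dumps as well.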

Titular answered 28/1, 2013 at 17:17 Comment(3)
I tried to modify json_util in bson.py (pymongo) and replaced the import json with import ujson as json. It didn't work; they don't share the same methods :( - Wouldst
You saved the day. - Elative
This can be solved by setting the default_handler argument to str, like this: jsonResult = df.to_json(default_handler=str). The issue has been discussed here: github.com/pandas-dev/pandas/issues/14256 and contains explanations. - Anklet
