How to JSON serialize sets? [duplicate]
Asked Answered
D

12

227

I have a Python set that contains objects with __hash__ and __eq__ methods in order to make certain no duplicates are included in the collection.

I need to json encode this result set, but passing even an empty set to the json.dumps method raises a TypeError.

  File "/usr/lib/python2.7/json/encoder.py", line 201, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 264, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python2.7/json/encoder.py", line 178, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: set([]) is not JSON serializable

I know I can create an extension of the json.JSONEncoder class with a custom default method, but I'm not even sure where to begin in converting the set. Should I create a dictionary out of the set values within the default method, and then return the encoding of that? Ideally, I'd like to make the default method able to handle all the datatypes the original encoder chokes on (I'm using Mongo as a data source, so dates seem to raise this error too).

Any hint in the right direction would be appreciated.

EDIT:

Thanks for the answer! Perhaps I should have been more precise.

I utilized (and upvoted) the answers here to get around the limitations of the set being translated, but there are internal keys that are an issue as well.

The objects in the set are complex objects that translate to __dict__, but they themselves can also contain values for their properties that could be ineligible for the basic types in the json encoder.

There's a lot of different types coming into this set, and the hash basically calculates a unique id for the entity, but in the true spirit of NoSQL there's no telling exactly what the child object contains.

One object might contain a date value for starts, whereas another may have some other schema that includes no keys containing "non-primitive" objects.

That is why the only solution I could think of was to extend the JSONEncoder, replacing the default method to switch on different cases - but I'm not sure how to go about this, and the documentation is ambiguous. In nested objects, does the value returned from default go by key, or is it just a generic include/discard that looks at the whole object? How does that method accommodate nested values? I've looked through previous questions and can't seem to find the best approach to case-specific encoding (which unfortunately seems like what I'm going to need to do here).

Deutzia answered 22/11, 2011 at 16:38 Comment(8)
why dicts? I think you want to make just a list out of the set and then pass it to the encoder... e.g: encode(list(myset))Addi
Instead of using JSON, you could use YAML (JSON is essentially a subset of YAML).Engineer
@PaoloMoretti: Does it bring any advantage though? I don't think sets are among the universally-supported data types of YAML, and it's less widely supported, especially regarding APIs.Majolica
@PaoloMoretti Thank you for your input, but the application frontend requires JSON as a return type and this requirement is for all purposes fixed.Deutzia
@delnan I was suggesting YAML because it has a native support for both sets and dates.Engineer
FWIW, my answer shows how to handle the nested case without clobbering your ability to use regular lists and dicts. This approach is easily extended to handle many different datatypes.Pyrometer
@RaymondHettinger - I'm in the process of implementing your solution right now as well. Strange coincidence - the ratings system for these datasets was built with your neural network code as a guide! Perhaps you remember me tweeting you =)Deutzia
If the question really is "what data should I use, in JSON format, in order to represent the set's information?", then that is an overly-broad question about a design decision. Otherwise, it's a duplicate.Majunga
P
131

JSON notation has only a handful of native datatypes (objects, arrays, strings, numbers, booleans, and null), so anything serialized in JSON needs to be expressed as one of these types.

As shown in the json module docs, this conversion can be done automatically by a JSONEncoder and JSONDecoder, but then you would be giving up some other structure you might need (if you convert sets to a list, then you lose the ability to recover regular lists; if you convert sets to a dictionary using dict.fromkeys(s) then you lose the ability to recover dictionaries).
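To see that loss concretely, here is a minimal sketch: a set converted to a list on the way out comes back as an ordinary list, indistinguishable from a value that was a list all along.

```python
import json

data = {'colors': {'red', 'green'}}  # a set inside a dict

# convert the set to a list for serialization
encoded = json.dumps({'colors': list(data['colors'])})
decoded = json.loads(encoded)

# The set-ness is gone: we get back a plain list
print(type(decoded['colors']))  # <class 'list'>
```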

A more sophisticated solution is to build out a custom type that can coexist with other native JSON types. This lets you store nested structures that include lists, sets, dicts, decimals, datetime objects, etc.:

from json import dumps, loads, JSONEncoder, JSONDecoder
import pickle

class PythonObjectEncoder(JSONEncoder):
    def default(self, obj):
        try:
            return {'_python_object': pickle.dumps(obj).decode('latin-1')}
        except pickle.PickleError:
            return super().default(obj)

def as_python_object(dct):
    if '_python_object' in dct:
        return pickle.loads(dct['_python_object'].encode('latin-1'))
    return dct

Here is a sample session showing that it can handle lists, dicts, and sets:

>>> data = [1,2,3, set(['knights', 'who', 'say', 'ni']), {'key':'value'}, Decimal('3.14')]

>>> j = dumps(data, cls=PythonObjectEncoder)

>>> loads(j, object_hook=as_python_object)
[1, 2, 3, set(['knights', 'say', 'who', 'ni']), {'key': 'value'}, Decimal('3.14')]

Alternatively, it may be useful to use a more general purpose serialization technique such as YAML, Twisted Jelly, or Python's pickle module. These each support a much greater range of datatypes.

Pyrometer answered 22/11, 2011 at 16:41 Comment(17)
In your sample session, should PythonSetEncoder be PythonObjectEncoder?Reube
Just make sure not to use that on untrusted input, as pickle is not intended to be secure against erroneous or maliciously constructed data, while JSON is (until customized with pickle).Torero
This is the first I've heard that YAML is more general purpose than JSON... o_OMajunga
@KarlKnechtel YAML is a superset of JSON (very nearly). It also adds tags for binary data, sets, ordered maps, and timestamps. Supporting more datatypes is what I meant by "more general purpose". You seem to be using the phrase "general purpose" in a different sense.Pyrometer
@Raymond: +1, however I believe the if isinstance(obj, (list, dict, ... etc)): statement in PythonObjectEncoder.default() is redundant since it will only be called if the object is something the base JSONEncoder class can't handle.Gateshead
Don't forget also jsonpickle, which is intended to be a generalized library for pickling Python objects to JSON, much as this answer suggests.Hotspur
As of version 1.2, YAML is a strict superset of JSON. All legal JSON now is legal YAML. yaml.org/spec/1.2/spec.htmlGrapnel
this code example imports JSONDecoder but doesn't use itBrig
I like that this answer extends the functionality of JSON, however one should keep in mind, this is only for Python and human readability of the output suffers. Actually YAML seems to be the better alternative if readability is an issue.Cosmonaut
@watsonic: It needs to import JSONDecoder because it's the base class of PythonObjectEncoder.Gateshead
Raymond: From the JSONEncoder documentation it sounds like the default() method is only called "for objects that can’t otherwise be serialized" which means it's not necessary to do the isinstance() check in your answer. This seems borne out by my own testing, although I would put the return {'_python_object': pickle.dumps(obj)} inside a try/except TypeError: in case pickle.dumps() can't deal with the object. Am I misunderstanding something?Gateshead
Raymond: On a related topic, instead of subclassing JSONEncoder, couldn't one just use the default= keyword argument of the standard class to specify a function that does something similar to what PythonObjectEncoder.default() method does in your answer? This seemed to work when I tested it, as well.Gateshead
This answer ought to mention the security implications of pickleSlenderize
Sadly i get a lot of RecursionError: maximum recursion depth exceeded in __instancecheck__ with your answerLahr
@Lahr I just updated this code for Python 3. It worked fine in Python 2.Pyrometer
Is there some reason you keep ignoring my comments about whether the isinstance() check in the default() method is really needed?Gateshead
@Gateshead Updated the recipe to remove the isinstance() check. Added a try/except to call the parent in case of a pickling error.Pyrometer
D
183

You can create a custom encoder that returns a list when it encounters a set. Here's an example:

import json
class SetEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, set):
            return list(obj)
        return json.JSONEncoder.default(self, obj)

data_str = json.dumps(set([1,2,3,4,5]), cls=SetEncoder)
print(data_str)
# Output: '[1, 2, 3, 4, 5]'

You can detect other types this way too. If you need to retain that the list was actually a set, you could use a custom encoding. Something like return {'type':'set', 'list':list(obj)} might work.
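A minimal sketch of that tagged-dict idea, paired with an object_hook so the set can be recovered on load (the 'type'/'list' key names follow the suggestion above and are just an illustrative choice):

```python
import json

class SetEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, set):
            # tag the list so the decoder can tell it apart from a real list
            return {'type': 'set', 'list': list(obj)}
        return json.JSONEncoder.default(self, obj)

def set_hook(dct):
    # undo the tagging on load
    if dct.get('type') == 'set' and 'list' in dct:
        return set(dct['list'])
    return dct

j = json.dumps({'ids': {1, 2, 3}}, cls=SetEncoder)
restored = json.loads(j, object_hook=set_hook)
print(restored)  # {'ids': {1, 2, 3}}
```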

To illustrate nested types, consider serializing this:

class Something(object):
    pass
json.dumps(set([1,2,3,4,5,Something()]), cls=SetEncoder)

This raises the following error:

TypeError: <__main__.Something object at 0x1691c50> is not JSON serializable

This indicates that the encoder will take the list result returned and recursively call the serializer on its children. To add a custom serializer for multiple types, you can do this:

class SetEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, set):
            return list(obj)
        if isinstance(obj, Something):
            return 'CustomSomethingRepresentation'
        return json.JSONEncoder.default(self, obj)
 
data_str = json.dumps(set([1,2,3,4,5,Something()]), cls=SetEncoder)
print(data_str)
# Output: '[1, 2, 3, 4, 5, "CustomSomethingRepresentation"]'
Dogmatic answered 22/11, 2011 at 16:49 Comment(7)
Thanks, I edited the question to better specify that this was the type of thing I needed. What I can't seem to grasp is how this method will handle nested objects. In your example the return value is list for set, but what if the object passed in was a set with dates (another bad datatype) inside it? Should I drill through the keys within the default method itself? Thanks a ton!Deutzia
I think the JSON module handles nested objects for you. Once it gets the list back, it will iterate over the list items trying to encode each one. If one of them is a date, the default function will get called again, this time with obj being a date object, so you just have to test for it and return a date-representation.Dogmatic
So the default method could conceivably run several times for any one object passed to it, since it will also look at the individual keys once it is "listified"?Deutzia
Sort of, it won't get called multiple times for the same object, but it can recurse into children. See updated answer.Dogmatic
Worked exactly as you described. I still have to figure some of the faults out, but most of it is probably stuff that can be refactored out. Thanks a ton for your guidance!Deutzia
@Dogmatic Any ideas to recover this back (list to set) while json.loads? Like encoding this information or something during SetEncoder?Clute
@Dogmatic Also interested in creating the corresponding SetDecoder class, but my naive attempt failed to correctly convert arrays into sets. Any ideas?Deserted
Z
40

You don't need to make a custom encoder class to supply the default method - it can be passed in as a keyword argument:

import json

def serialize_sets(obj):
    if isinstance(obj, set):
        return list(obj)

    # raising here (rather than returning obj unchanged) surfaces other
    # unserializable types as a TypeError instead of a circular-reference error
    raise TypeError(f"{obj!r} is not JSON serializable")

json_str = json.dumps(set([1,2,3]), default=serialize_sets)
print(json_str)

results in [1, 2, 3] in all supported Python versions.

Zollverein answered 5/3, 2020 at 11:40 Comment(2)
Most simple, readable and elegant solution. I'd personally prefer dicts over lists, as a dict is, in fact, a set (with benefits).Copulative
@BerryTsakala but json objects cannot have integers as keys...Thermopylae
A
20

If you know for sure that the only non-serializable data will be sets, there's a very simple (and dirty) solution:

json.dumps({"Hello World": {1, 2}}, default=tuple)

Only non-serializable data will be treated with the function given as default, so only the set will be converted to a tuple.
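Since the question also mentions dates coming out of Mongo, the same default= hook can branch on more than one type; a hedged sketch (the fallback function name is just illustrative):

```python
import json
from datetime import datetime

def fallback(obj):
    # called only for objects json can't serialize natively
    if isinstance(obj, set):
        return list(obj)
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"{obj!r} is not JSON serializable")

doc = {"tags": {"a", "b"}, "starts": datetime(2011, 11, 22, 16, 38)}
print(json.dumps(doc, default=fallback))
```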

Albertinaalbertine answered 4/8, 2021 at 12:48 Comment(1)
json.dumps({"Hello World": {1, 2}}, default=list) works tooOrris
S
10

I adapted Raymond Hettinger's solution to Python 3.

Here is what has changed:

  • the unicode type disappeared
  • updated the call to the parent's default with super()
  • using base64 to serialize the bytes type into str (because bytes in Python 3 can't be converted to JSON directly)

from decimal import Decimal
from base64 import b64encode, b64decode
from json import dumps, loads, JSONEncoder
import pickle

class PythonObjectEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, (list, dict, str, int, float, bool, type(None))):
            return super().default(obj)
        return {'_python_object': b64encode(pickle.dumps(obj)).decode('utf-8')}

def as_python_object(dct):
    if '_python_object' in dct:
        return pickle.loads(b64decode(dct['_python_object'].encode('utf-8')))
    return dct

data = [1,2,3, set(['knights', 'who', 'say', 'ni']), {'key':'value'}, Decimal('3.14')]
j = dumps(data, cls=PythonObjectEncoder)
print(loads(j, object_hook=as_python_object))
# prints: [1, 2, 3, {'knights', 'who', 'say', 'ni'}, {'key': 'value'}, Decimal('3.14')]
Swellfish answered 27/3, 2016 at 20:26 Comment(1)
The code shown at the end of this answer to a related question accomplishes the same thing by [only] decoding and encoding the bytes object json.dumps() returns to/from 'latin1', skipping the base64 stuff which isn't necessary.Gateshead
G
8

If you just need a quick dump and don't want to implement a custom encoder, you can use the following:

json_string = json.dumps(data, iterable_as_array=True)

This will convert all sets (and other iterables) into arrays. Just beware that those fields will stay arrays when you parse the JSON back; if you want to preserve the types, you need to write a custom encoder.

Also make sure you have simplejson installed and imported - iterable_as_array is a simplejson extension, not part of the standard library json module.
You can find it on PyPI.

Giovanna answered 6/12, 2018 at 14:8 Comment(4)
When I try this I get: TypeError: __init__() got an unexpected keyword argument 'iterable_as_array'Valencia
You need to install simplejsonKerstinkerwin
import simplejson as json and then json_string = json.dumps(data, iterable_as_array=True) works well in Python 3.6Allene
This is the only answer that worked for me but it definitely requires simplejson.Koan
I
6

Only dictionaries, lists and primitive object types (int, string, bool) are available in JSON.

Innocuous answered 22/11, 2011 at 16:42 Comment(2)
"Primitive object type" makes no sense when talking about Python. "Built-in object" makes more sense, but is too broad here (for starters: it includes dicts, lists and also sets). (JSON terminology may be different though.)Majolica
string number object array true false nullInnocuous
U
6

Shortened version of @AnttiHaapala:

json.dumps(dict_with_sets, default=lambda x: list(x) if isinstance(x, set) else x)
Upstretched answered 9/1, 2021 at 5:24 Comment(1)
Best to me. In my case [set1, set2, set3, set4]. I can read the stringified back this way: [set(i) for i in json.loads(s)].Ideal
G
5

If you only need to encode sets, not general Python objects, and want to keep it easily human-readable, a simplified version of Raymond Hettinger's answer can be used:

import json
import collections.abc

class JSONSetEncoder(json.JSONEncoder):
    """Use with json.dumps to allow Python sets to be encoded to JSON

    Example
    -------

    import json

    data = dict(aset=set([1,2,3]))

    encoded = json.dumps(data, cls=JSONSetEncoder)
    decoded = json.loads(encoded, object_hook=json_as_python_set)
    assert data == decoded     # Should assert successfully

    Any object that is matched by isinstance(obj, collections.abc.Set) will
    be encoded, but the decoded value will always be a normal Python set.

    """

    def default(self, obj):
        if isinstance(obj, collections.abc.Set):
            return dict(_set_object=list(obj))
        else:
            return json.JSONEncoder.default(self, obj)

def json_as_python_set(dct):
    """Decode json {'_set_object': [1,2,3]} to set([1,2,3])

    Example
    -------
    decoded = json.loads(encoded, object_hook=json_as_python_set)

    Also see :class:`JSONSetEncoder`

    """
    if '_set_object' in dct:
        return set(dct['_set_object'])
    return dct
Gillard answered 5/2, 2015 at 8:37 Comment(0)
S
2
>>> import json
>>> set_object = set([1,2,3,4])
>>> json.dumps(list(set_object))
'[1, 2, 3, 4]'
Southeastwardly answered 23/9, 2021 at 14:12 Comment(1)
This does not preserve the type of the object, it turns it into a list.Gateshead
D
1

One shortcoming of the accepted solution is that its output is very Python-specific, i.e. its raw JSON output cannot be read by a human or loaded by another language (e.g. JavaScript). Example:

db = {
        "a": [ 44, set((4,5,6)) ],
        "b": [ 55, set((4,3,2)) ]
        }

j = dumps(db, cls=PythonObjectEncoder)
print(j)

Will get you:

{"a": [44, {"_python_object": "gANjYnVpbHRpbnMKc2V0CnEAXXEBKEsESwVLBmWFcQJScQMu"}], "b": [55, {"_python_object": "gANjYnVpbHRpbnMKc2V0CnEAXXEBKEsCSwNLBGWFcQJScQMu"}]}

I can propose a solution which downgrades the set to a dict containing a list on the way out, and back to a set when loaded into Python using the matching object hook, thereby preserving readability and language agnosticism:

from decimal import Decimal
from base64 import b64encode, b64decode
from json import dumps, loads, JSONEncoder
import pickle

class PythonObjectEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, (list, dict, str, int, float, bool, type(None))):
            return super().default(obj)
        elif isinstance(obj, set):
            return {"__set__": list(obj)}
        return {'_python_object': b64encode(pickle.dumps(obj)).decode('utf-8')}

def as_python_object(dct):
    if '__set__' in dct:
        return set(dct['__set__'])
    elif '_python_object' in dct:
        return pickle.loads(b64decode(dct['_python_object'].encode('utf-8')))
    return dct

db = {
        "a": [ 44, set((4,5,6)) ],
        "b": [ 55, set((4,3,2)) ]
        }

j = dumps(db, cls=PythonObjectEncoder)
print(j)
ob = loads(j)
print(ob["a"])

Which gets you:

{"a": [44, {"__set__": [4, 5, 6]}], "b": [55, {"__set__": [2, 3, 4]}]}
[44, {'__set__': [4, 5, 6]}]

Note that serializing a dictionary which has an element with a key "__set__" will break this mechanism. So __set__ has now become a reserved dict key. Obviously feel free to use another, more deeply obfuscated key.
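To make that caveat concrete, here is a minimal sketch (using a reduced version of the decoding hook above) showing the collision:

```python
import json

def as_python_object(dct):
    # reduced version of the decoding hook above: only the set branch
    if '__set__' in dct:
        return set(dct['__set__'])
    return dct

# an ordinary dict that merely happens to use the reserved key...
original = {"__set__": [1, 2, 3]}
j = json.dumps(original)

# ...is silently decoded back as a set, not the dict we serialized
restored = json.loads(j, object_hook=as_python_object)
print(restored)  # {1, 2, 3}
```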

Dermatology answered 16/1, 2020 at 13:33 Comment(0)
H
0

You could try jsonwhatever (https://pypi.org/project/jsonwhatever/):

pip install jsonwhatever

from jsonwhatever import JsonWhatEver

set_a = {1, 2, 3}
jsonwe = JsonWhatEver()
string_res = jsonwe.jsonwhatever('set_string', set_a)
print(string_res)
Hyaloid answered 6/12, 2022 at 12:56 Comment(0)
