Deserializing a huge json string to python objects
I am using simplejson to deserialize a JSON string to Python objects. I have a custom-written object_hook that takes care of converting the JSON back into my domain objects.

The problem is that when the JSON string is huge (i.e. the server returns around 800K domain objects as a single JSON string), my Python deserializer takes almost 10 minutes to deserialize them.

I drilled down a bit further, and it looks like simplejson itself is not doing much work; it delegates everything to the object_hook. I tried optimizing my object_hook, but that did not improve performance much either (I barely gained a minute).
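For reference, this kind of drill-down can be reproduced with a profile run roughly like the following (just a sketch; "response.json" is only an illustrative name for a saved copy of the server payload):

import cProfile
import simplejson

# "response.json" is an illustrative name for a saved copy of the server response
with open("response.json") as f:
    payload = f.read()

# Profile the whole parse; if the hook really is the bottleneck, most of the
# cumulative time shows up under _object_hook and the calls it makes.
cProfile.runctx(
    "simplejson.loads(payload, object_hook=_object_hook)",
    globals(), locals(), sort="cumulative",
)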

My question is: is there another standard framework that is optimized to handle huge data sets, or is there a way to leverage the framework's capabilities rather than doing everything at the object_hook level?

I see that without an object_hook the framework returns just a list of dictionaries, not a list of domain objects.
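(To illustrate, a bare loads call with no hook gives back plain dicts; the snippet below is just a toy example, not my real payload.)

import simplejson

parsed = simplejson.loads('{"@CLASS": "sample.counter", "val2": 1131}')
print(type(parsed))  # a plain dict -- no domain object without an object_hook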

Any pointers here will be useful.

FYI I am using simplejson version 3.7.2

Here is my sample _object_hook:

def _object_hook(dct):
    if '@CLASS' in dct: # server sends domain objects with this @CLASS 
        clsname = dct['@CLASS']
        # This is like Class.forName (This imports the module and gives the class)
        cls = get_class(clsname)
        # As my server is in java, I convert the attributes to python as per python naming convention.
        dct = dict( (convert_java_name_to_python(k), dct[k]) for k in dct.keys())
        if cls != None:
            obj_key = None
            if "@uuid" in dct:
                obj_key = dct["@uuid"]
                del(dct["@uuid"])
            else:
                info("Class missing uuid: " + clsname)
            dct.pop("@CLASS", None)

            obj = cls(**dct) # This I found to be the most time-consuming step. In my domain object's __init__ method I have the logic to set all attributes based on the kwargs passed.
            if obj_key is not None:
                shared_objs[obj_key] = obj #I keep all uuids along with the objects in shared_objs dictionary. This shared_objs will be used later to replace references.
        else:
            warning("class not found: " + clsname)
            obj = dct

        return obj
    else:
        return dct
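For completeness, the hook is wired into the parse call like this (a minimal sketch; payload stands for the raw JSON text received from the server):

import simplejson

# payload: the raw JSON text returned by the server (illustrative name)
domain_objects = simplejson.loads(payload, object_hook=_object_hook)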

A Sample response:

    {"@CLASS":"sample.counter","@UUID":"86f26a0a-1a58-4429-a762-  9b1778a99c82","val1":"ABC","val2":1131,"val3":1754095,"value4":  {"@CLASS":"sample.nestedClass","@UUID":"f7bb298c-fd0b-4d87-bed8-  74d5eb1d6517","id":1754095,"name":"XYZ","abbreviation":"ABC"}}

I have many levels of nesting, and the number of records I receive from the server is more than 800K.

Shawna answered 16/6, 2016 at 14:11 Comment(2)
Seems interesting. Any sample snippet to quickly check it would be useful. – Holohedral
If you could post the code of your object_hook function and a sample of the JSON you want to parse, that would help us answer your question. – Undulatory

I don't know of any framework that offers what you seek out of the box, but you may apply a few optimizations to the way your class instances are set up.

Since unpacking the dictionary into keyword arguments and assigning them to instance attributes takes the bulk of the time, you may consider passing dct directly to your class's __init__ and setting the instance dictionary self.__dict__ to dct:

Trial 1

In [1]: data = {"name": "yolanda", "age": 4}

In [2]: class Person:
   ...:     def __init__(self, name, age):
   ...:         self.name = name
   ...:         self.age = age
   ...:
In [3]: %%timeit
   ...: Person(**data)
   ...:
1000000 loops, best of 3: 926 ns per loop

Trial 2

In [4]: data = {"name": "yolanda", "age": 4}

In [5]: class Person2:
   ....:     def __init__(self, data):
   ....:         self.__dict__ = data
   ....:
In [6]: %%timeit
   ....: Person2(data)
   ....:
1000000 loops, best of 3: 541 ns per loop

There is no need to worry about self.__dict__ being modified through another reference, since the reference to dct is dropped before _object_hook returns.

This will of course mean changing the setup of your __init__, with the attributes of your instances depending strictly on the items in dct. It's up to you.
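To make that concrete, the hook could be reshaped roughly along these lines. This is only a sketch: it assumes your domain classes can accept the whole converted dict in __init__, and it reuses get_class, convert_java_name_to_python, shared_objs and warning from your question; the DomainObject base class is my own assumption.

class DomainObject:
    # Assumed common base for the domain classes: adopt the dict wholesale
    # instead of setting attributes one by one from **kwargs.
    def __init__(self, attributes):
        self.__dict__ = attributes

def _object_hook(dct):
    if '@CLASS' not in dct:
        return dct
    clsname = dct.pop('@CLASS')
    cls = get_class(clsname)                       # helper from the question
    dct = {convert_java_name_to_python(k): v for k, v in dct.items()}
    obj_key = dct.pop('@uuid', None)               # None when the key is absent
    if cls is None:
        warning("class not found: " + clsname)
        return dct
    obj = cls(dct)                                 # one dict assignment, no **-unpacking
    if obj_key is not None:
        shared_objs[obj_key] = obj
    return obj

Combined with the pop and identity-check points below, this keeps the per-object work in the hook to a handful of dictionary operations.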


You may also replace cls != None with cls is not None (there is only one None object so an identity check is more pythonic):

Trial 1

In [38]: cls = 5
In [39]: %%timeit
   ....: cls != None
   ....:
10000000 loops, best of 3: 85.8 ns per loop

Trial 2

In [40]: %%timeit
   ....: cls is not None
   ....:
10000000 loops, best of 3: 57.8 ns per loop

And you can replace two lines with one:

obj_key = dct["@uuid"]
del(dct["@uuid"])

becoming:

obj_key = dct.pop('@uuid') # not an optimization; equivalent to the two lines above, just more concise

At the scale of 800K domain objects, these changes should save you a good amount of time by letting the object_hook create your objects more quickly.

Regen answered 18/6, 2016 at 21:3 Comment(2)
Thanks for looking into it. With your suggestions I was able to cut 2 minutes off the object_hook deserialization time, but the end-to-end time for 800K records is still ~8 minutes. I see that for 800K records my object_hook is called by simplejson 3709170 times. I was wondering if there is any framework that is optimized to reduce these calls. Any thoughts about lambdaJSON (jsontree/jsonpickle or any other framework)? – Shawna
@Shawna If it works out well with lambdaJSON, you can post your hack as an answer for others who may have the same problem in the future. – Regen
