Pickle or json? [duplicate]
Y

8

167

I need to save a small dict to disk, with keys of type str and int values, and then load it back. Something like this:

{'juanjo': 2, 'pedro':99, 'other': 333}

What is the best option and why? Serialize it with pickle or with simplejson?

I am using Python 2.6.

Yoshida answered 13/2, 2010 at 22:12 Comment(7)
convert it to what? Also, in what sense better?Helenehelenka
In 2.6 you wouldn't use simplejson, you'd use the builtin json module (which has the same exact interface).Alter
"best"? Best for what? Speed? Complexity? Flexibility? Cost?Grisham
see also #8969384Odense
@Trilarion: YAML is a superset of JSONDecasyllable
For posterity: JSON has a problem with tuples as keys. Pickle doesn't. e.g. Pickle can handle {('a','b'):'c'}, not JSON as of mid-2016. So bear that in mind. See: #7002106Doubletongue
There's much more than that that JSON cannot handle as keys, @Salmonstrikes. Numbers for a start. JSON is a good firm format, especially when compared to something like YAML with all its formatting and interpretation problems, but it is quite restricted.Mews
R
98

If you do not have any interoperability requirements (e.g. you are just going to use the data with Python) and a binary format is fine, go with cPickle which gives you really fast Python object serialization.

If you want interoperability or you want a text format to store your data, go with JSON (or some other appropriate format depending on your constraints).
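For the small str-to-int dict in the question, both routes take only a few lines. Here is a minimal sketch, assuming Python 3 (where the plain pickle module takes the place of cPickle). The file names counts.json and counts.pkl and the temporary directory are made up for illustration:

```python
import json
import os
import pickle
import tempfile

data = {'juanjo': 2, 'pedro': 99, 'other': 333}

# Hypothetical file locations, chosen only for this example.
workdir = tempfile.mkdtemp()
json_path = os.path.join(workdir, 'counts.json')
pickle_path = os.path.join(workdir, 'counts.pkl')

# JSON: a text file that other languages can read too.
with open(json_path, 'w') as f:
    json.dump(data, f)
with open(json_path) as f:
    from_json = json.load(f)

# pickle: a binary, Python-only file. Note the 'wb'/'rb' modes.
with open(pickle_path, 'wb') as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
with open(pickle_path, 'rb') as f:
    from_pickle = pickle.load(f)

print(from_json == data, from_pickle == data)
```

For a dict this simple, both round trips give back an equal dict, so the choice comes down to the interoperability and format concerns above.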

Ruination answered 13/2, 2010 at 22:22 Comment(8)
My answer highlights the concerns I think are most important to consider when choosing either solution. I make no claim about either being faster than the other. If JSON is faster AND otherwise suitable, go with JSON! (I.e., there's no reason for your down-vote.)Closed
My point is: there is no real reason for using cPickle (or pickle) based on your premises over JSON. When I first read your answer I thought the reason might have been speed, but since this is not the case... :)Octamerous
The benchmark cited by @Octamerous only tests strings. I tested str, int and float separately and found out that json is slower than cPickle with float serialization, but faster with float unserialization. For int (and str), json is faster both ways. Data and code: gist.github.com/marians/f1314446b8bf4d34e782Warila
Given that json is more interoperable, more secure and in many cases faster than cPickle, for simple data structures I would prefer json over cPickle.Warila
cPickle's latest protocol is now faster than JSON. The up-voted comment about JSON being faster is outdated by a few years. https://mcmap.net/q/197351/-pickle-or-json-duplicateEdrick
@JDiMatteo: I suspect cPickle would have been faster even at the time of that comment if the test suite had used protocol 2 (available since 2.3 or so, but not the default for back compat reasons) rather than the default Python 2 protocol, 0. 0 is severely limited, only using 7 of 8 bits in each byte (this hurts a lot for raw binary data, which has to be reencoded, instead of dumped raw), not supporting new-style classes well, etc. Protocol 2 with cPickle (or on Python 3, plain pickle with the default protocol 3 or higher) would likely beat JSON in all but the most contrived cases.Birkenhead
A (possibly minor) downside of JSON: JSON doesn't have tuples. A Python tuple will end up being a list after serializing/deserializing. If your data contains tuples and you want to deserialize them as tuples, you need to avoid JSON.Nadenenader
Inter-language portability aside, did someone mention lack of intra-language portability (between minor versions of the same language)?Barabarabarabas
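The type caveats raised in the comments (tuples becoming lists, tuple keys, numeric keys) are easy to check. A small sketch, assuming Python 3 and the standard json and pickle modules; the sample values are invented for illustration:

```python
import json
import pickle

record = {'point': (1, 2)}

# JSON has no tuple type: the tuple is written as an array and read back as a list.
via_json = json.loads(json.dumps(record))
print(via_json)    # {'point': [1, 2]}

# pickle preserves the tuple.
via_pickle = pickle.loads(pickle.dumps(record))
print(via_pickle)  # {'point': (1, 2)}

# Tuple keys are rejected by json.dumps with a TypeError.
try:
    json.dumps({('a', 'b'): 'c'})
    tuple_key_ok = True
except TypeError:
    tuple_key_ok = False
print(tuple_key_ok)  # False

# Integer keys are accepted, but come back as strings.
int_keys = json.loads(json.dumps({1: 'one'}))
print(int_keys)    # {'1': 'one'}
```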
A
132

I prefer JSON over pickle for my serialization. Unpickling can run arbitrary code, and using pickle to transfer data between programs or store data between sessions is a security hole. JSON does not introduce a security hole and is standardized, so the data can be accessed by programs in different languages if you ever need to.
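To make the "arbitrary code" point concrete, here is a harmless sketch, assuming Python 3. The class name Innocent is made up, and eval('6 * 7') stands in for whatever a hostile pickle would really call:

```python
import json
import pickle


class Innocent:
    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('6 * 7')".
        # A hostile pickle could name os.system here instead.
        return (eval, ('6 * 7',))


blob = pickle.dumps(Innocent())

# Loading runs the callable; no Innocent instance comes back.
result = pickle.loads(blob)
print(result)  # 42

# json.loads only builds dicts, lists, strings, numbers, booleans and None.
parsed = json.loads('{"__reduce__": "eval"}')
print(type(parsed).__name__)  # dict
```

The loader never needs the Innocent class; the bytes alone decide which callable runs, which is why a pickle from an untrusted source is dangerous.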

Alter answered 13/2, 2010 at 22:33 Comment(7)
Thanks. Anyway I'll be dumping and loading in the same program.Yoshida
Though the security risks may be low in your current application, JSON allows you to close the hole altogether.Alter
One can create a pickle-virus that pickles itself into everything that is pickled after loaded. With json this is not possible.Marabou
Apart from security, JSON has the additional advantage that it makes migrations easy, so you can load data that was saved by an older version of your application. Meanwhile you could have added a field, or replaced a whole sub structure. Writing such a converter (migration) for dict/list is straightforward, but with Pickle you'll have a hard time loading it in the first place, before you can even think of converting.Rotenone
I hadn't thought about this aspect (security and the ability for pickled objects to run arbitrary code). Thanks for pointing that out!Jacintajacinth
'only unpickle data you trust' - docs.python.org/3/library/pickle.htmlUsherette
Another argument against pickle format is lack of portability guarantees between python minor versions (builds differences are (most likely) fine).Barabarabarabas
L
52

You might also find this interesting, with some charts to compare: http://kovshenin.com/archives/pickle-vs-json-which-is-faster/

Languishing answered 2/6, 2011 at 14:30 Comment(2)
The article compares performance only related to strings. Here is a script you can run in order to test strings, floats and ints separately: gist.github.com/marians/f1314446b8bf4d34e782Warila
In Python 3.4, pickle beats json at int, str, and float.Jer
E
33

If you are primarily concerned with speed and space, use cPickle because cPickle is faster than JSON.

If you are more concerned with interoperability, security, and/or human readability, then use JSON.


The test results referenced in other answers were recorded in 2010. Updated tests from 2016, using cPickle with protocol 2, show:

  • cPickle 3.8x faster loading
  • cPickle 1.5x faster dumping
  • cPickle slightly smaller encoding

Reproduce this yourself with this gist, which is based on Konstantin's benchmark referenced in other answers, but uses cPickle with protocol 2 instead of pickle, and json instead of simplejson (since json is faster than simplejson), e.g.

wget https://gist.github.com/jdimatteo/af317ef24ccf1b3fa91f4399902bb534/raw/03e8dbab11b5605bc572bc117c8ac34cfa959a70/pickle_vs_json.py
python pickle_vs_json.py

Results with python 2.7 on a decent 2015 Xeon processor:

Dir   Entries  Method  Time    Length

dump  10       JSON    0.017   1484510
load  10       JSON    0.375   -
dump  10       Pickle  0.011   1428790
load  10       Pickle  0.098   -
dump  20       JSON    0.036   2969020
load  20       JSON    1.498   -
dump  20       Pickle  0.022   2857580
load  20       Pickle  0.394   -
dump  50       JSON    0.079   7422550
load  50       JSON    9.485   -
dump  50       Pickle  0.055   7143950
load  50       Pickle  2.518   -
dump  100      JSON    0.165   14845100
load  100      JSON    37.730  -
dump  100      Pickle  0.107   14287900
load  100      Pickle  9.907   -

Python 3.4 with pickle protocol 3 is even faster.
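Since the results depend on the pickle protocol, it may help to see how the protocol is selected and how to compare encoded sizes on your own data. A rough sketch, assuming Python 3 (plain pickle rather than cPickle). The payload is invented for illustration and is not the data used in the benchmark above:

```python
import json
import pickle

# Made-up sample data; substitute your own to get meaningful numbers.
payload = {
    'ints': list(range(1000)),
    'names': ['user%d' % i for i in range(1000)],
}

# Encoded size in bytes for each format / protocol.
sizes = {
    'json': len(json.dumps(payload).encode('utf-8')),
    'pickle protocol 0': len(pickle.dumps(payload, protocol=0)),
    'pickle protocol 2': len(pickle.dumps(payload, protocol=2)),
    'pickle highest': len(pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)),
}
for name, size in sizes.items():
    print('%-18s %d bytes' % (name, size))

# Whatever the protocol, loads() detects it automatically.
roundtrip = pickle.loads(pickle.dumps(payload, protocol=2))
print(roundtrip == payload)
```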

Edrick answered 21/9, 2016 at 3:42 Comment(0)
A
16

JSON or pickle? How about JSON and pickle!

You can use jsonpickle. It is easy to use and the file on disk is readable because it's JSON.

See jsonpickle Documentation

Antigorite answered 13/2, 2010 at 22:55 Comment(4)
Has anyone benchmarked its performance against the other options? Is it comparable in performance to raw json as seen here benfrederickson.com/dont-pickle-your-data ?Vitia
This is not a wide ranging benchmark, but I had an existing game where it was saving the levels using pickle (python3). I wanted to try jsonpickle for the human readable aspect - however the level saves were sadly much slower. 1597ms for jsonpickle and 88ms for regular pickle on level save. For level load, 1604ms for jsonpickle and 388 for pickle. Pity as I like the human readable saves.Stockroom
I tested this in our trading system, the readability comes with about 2x serialization+deserialization speed penalty compared to pickle. Great for anything else, though.Bijection
Note that using it without any further options (like you would normally do with a loads/dumps-protocol in python) may lead to objects that don't roundtrip: #30376687Purposely
K
9

I tried several methods and found that cPickle with the protocol argument of dumps set to the highest protocol, i.e. cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL), is the fastest way to dump.

import msgpack
import json
import pickle
import timeit
import cPickle
import numpy as np

num_tests = 10

obj = np.random.normal(0.5, 1, [240, 320, 3])

command = 'pickle.dumps(obj)'
setup = 'from __main__ import pickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("pickle:  %f seconds" % result)

command = 'cPickle.dumps(obj)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle:   %f seconds" % result)


command = 'cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle highest:   %f seconds" % result)

command = 'json.dumps(obj.tolist())'
setup = 'from __main__ import json, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("json:   %f seconds" % result)


command = 'msgpack.packb(obj.tolist())'
setup = 'from __main__ import msgpack, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("msgpack:   %f seconds" % result)

Output:

pickle         :   0.847938 seconds
cPickle        :   0.810384 seconds
cPickle highest:   0.004283 seconds
json           :   1.769215 seconds
msgpack        :   0.270886 seconds
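The benchmark above only times dumps. The same timeit pattern extends to loading. The following is a sketch under different assumptions from the answer: Python 3, standard library only, and a small nested list of floats standing in for the NumPy array, so the numbers are not comparable with the output above:

```python
import json
import pickle
import random
import timeit

random.seed(0)
# Nested list of floats as a stand-in for the array (smaller, no NumPy needed).
obj = [[random.gauss(0.5, 1.0) for _ in range(32)] for _ in range(24)]

num_tests = 10

# Serialize once up front so that only deserialization is timed.
pickled = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
jsoned = json.dumps(obj)

timings = {
    'pickle.loads': timeit.timeit(lambda: pickle.loads(pickled), number=num_tests),
    'json.loads': timeit.timeit(lambda: json.loads(jsoned), number=num_tests),
}
for name, seconds in timings.items():
    print('%-13s %f seconds' % (name, seconds))
```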
Keijo answered 12/12, 2017 at 2:42 Comment(1)
Interesting - how about deserializing though?Edgebone
E
6

Personally, I generally prefer JSON because the data is human-readable. Definitely, if you need to serialize something that JSON won't take, then use pickle.

But for most data storage, you won't need to serialize anything weird and JSON is much easier and always allows you to pop it open in a text editor and check out the data yourself.

The speed is nice, but for most datasets the difference is negligible; Python generally isn't too fast anyway.
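If the point is to open the file in a text editor, the indent and sort_keys arguments of json.dumps make the output easier to read. A short sketch using the dict from the question:

```python
import json

data = {'juanjo': 2, 'pedro': 99, 'other': 333}

# indent and sort_keys make the text easy to read and to diff.
text = json.dumps(data, indent=2, sort_keys=True)
print(text)
# {
#   "juanjo": 2,
#   "other": 333,
#   "pedro": 99
# }
```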

Epigeous answered 26/6, 2014 at 12:35 Comment(2)
In Python 3.4, pickle is over twice as fast as json.Jer
True. But for 100 elements in a list, the difference is completely negligible to the human eye. Definitely different when working with larger datasets.Epigeous
W
0

Most answers are quite old and miss some info.

For the statement "Unpickling can run arbitrary code":
  1. Check the example in https://docs.python.org/3/library/pickle.html#restricting-globals
import pickle
pickle.loads(b"cos\nsystem\n(S'echo hello world'\ntR.")
pickle.loads(b"cos\nsystem\n(S'pwd'\ntR.")

pwd can be replaced e.g. by rm to delete files.

  2. Check https://checkoway.net/musings/pickle/ for a more sophisticated "run arbitrary code" template. The code is written for Python 2.7, but I guess it could also work in Python 3 with some modification. If you make it work in Python 3, please add the Python 3 version to my answer. :)
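The "restricting globals" section of the pickle documentation linked in the first item describes one mitigation: subclass pickle.Unpickler and override find_class so that only an allow-list of globals can be loaded. A sketch along those lines, assuming Python 3. The allow-list contents and the names RestrictedUnpickler and restricted_loads are choices made for this example:

```python
import builtins
import io
import pickle

# Example allow-list; adjust to whatever your data legitimately needs.
SAFE_BUILTINS = {'range', 'complex', 'set', 'frozenset', 'slice'}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Only allow the listed names from the builtins module.
        if module == 'builtins' and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        # Everything else (os.system, eval, ...) is refused.
        raise pickle.UnpicklingError(
            "global '%s.%s' is forbidden" % (module, name))


def restricted_loads(data):
    """Like pickle.loads(), but with the restricted find_class above."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


print(restricted_loads(pickle.dumps([1, 2, range(15)])))  # [1, 2, range(0, 15)]

try:
    restricted_loads(b"cos\nsystem\n(S'echo hello world'\ntR.")
    blocked = False
except pickle.UnpicklingError:
    blocked = True
print(blocked)  # True: os.system is refused before it can be called
```

This narrows what a pickle can reach but it is still an allow-list you have to maintain; for untrusted input a data-only format such as JSON remains the simpler choice.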
For the "pickle speed vs json" part:

Firstly, Python 3 no longer has a separate cPickle module; the plain pickle module uses the C implementation automatically when it is available.
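Code that has to run on both Python 2 and Python 3 commonly handles the missing cPickle module with a fallback import. A minimal sketch; the sample dict is the one from the question:

```python
try:
    import cPickle as pickle  # Python 2: the explicit C implementation
except ImportError:
    import pickle             # Python 3: the C accelerator is used automatically

data = {'juanjo': 2, 'pedro': 99, 'other': 333}

# Protocol 2 is understood by both Python 2 and Python 3.
restored = pickle.loads(pickle.dumps(data, protocol=2))
print(restored == data)  # True
```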

And for this test code borrowed from another answer, pickle beats json in all:

import pickle
import json, random
from time import time
from hashlib import md5

test_runs = 100000

if __name__ == "__main__":
    payload = {
        "float": [(random.randrange(0, 99) + random.random()) for i in range(1000)],
        "int": [random.randrange(0, 9999) for i in range(1000)],
        "str": [md5(str(random.random()).encode('utf8')).hexdigest() for i in range(1000)]
    }
    modules = [json, pickle]

    for payload_type in payload:
        data = payload[payload_type]
        for module in modules:
            start = time()
            for i in range(test_runs):
                serialized = module.dumps(data)
            w = time() - start
            start = time()
            for i in range(test_runs):
                unserialized = module.loads(serialized)
            r = time() - start
            print("%s %s W %.3f R %.3f" % (module.__name__, payload_type, w, r))

result:

tian@tian-B250M-Wind:~/playground/pickle_vs_json$ p3 pickle_test.py 
json float W 41.775 R 26.738
pickle float W 1.272 R 2.286
json int W 5.142 R 4.974
pickle int W 0.589 R 1.352
json str W 10.379 R 4.626
pickle str W 3.062 R 3.294
Welter answered 25/7, 2022 at 16:32 Comment(0)
