To oversimplify things, it looks something like: {((('word','list'),(1,2),(1,3)),(...)):0.0, ....}
Here you have a dictionary with (nested) tuples as keys. This means that dumping it to JSON straight away (with either json or simplejson) will not work
TypeError: keys must be str, int, float, bool or None, not tuple
because the JSON standard is specific about what it allows as keys, and tuples are not part of that.
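For example, on a recent Python 3 (older versions word the error slightly differently):
>>> import json
>>> json.dumps({(('word', 'list'), (1, 2), (1, 3)): 0.0})
Traceback (most recent call last):
  ...
TypeError: keys must be str, int, float, bool or None, not tuple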
Which means you have to pick your poison:
- use pickle
- use numpy (which uses pickle)
- turn the keys (tuples) into strings and use json or derivatives
- change the structure of your dictionary and use json or derivatives
- use "plain" writing, a.k.a. transform the dict manually
- ... bson, ... , ... (not getting into these)
To get a better picture of how these options perform, I created a test; you'll find its results at the bottom.
The tiniest amount of theory first
I feel that python must have a better way than me looping around
through some string looking for : and ( trying to parse it into a
dictionary.
Let's think about what dictionaries are for a moment. To us, they look like pairs of keys and values; to Python, a dictionary is a hash table, a relation between the memory addresses of keys and values. Very simply speaking, those addresses vanish for Python after program execution, and so does whatever is stored there (this really is the super-simplified version, for brevity). That means when storing your dict to disk, you can't simply store those memory addresses: the operating system will not allow a new Python instance access to those addresses (everything else would be a security nightmare), and Python cannot claim specific memory addresses it may have used on a previous run. So even if you stored the names of those addresses and the values stored there before, you can't simply "rebuild" your dict by putting value x back at memory address y at will (in actuality, there may be ways to do parts of that, but that's asking for memory corruption).
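As a quick illustration (CPython only, where id() happens to be the object's memory address; the number below is arbitrary and will be different on every run):
>>> dd = {(('word', 'list'), (1, 2), (1, 3)): 0.0}
>>> id(next(iter(dd)))   # "address" of the first key during this run
140234517614016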
All of this means that Python indeed has to (re-)build the dict by dumping and then loading, a.k.a. parsing, organized (serialized) data. pickle does this by converting a hierarchy of Python objects into a byte stream; json and derivatives do it by creating strings that follow the JSON specs.
1. Pickle
Practically speaking, this is the overall best choice: lowest file size, fast write, decently fast read, no need for any type transformations. This holds as long as your dictionary contains only picklable objects. The largest downside is that you need to be certain you can trust the data, or you tear open a huge security hole (loading a manipulated pickle allows arbitrary code execution). Even if you have the data stored locally, consider employing measures to verify that it has not been tampered with; one possible measure is sketched after the basic example below. Another (potential) downside of pickle is that there are different protocol versions (governing how the serialization/deserialization is done), so you need to make sure the pickle versions are compatible if you use different setups for dumping and loading.
>>> import pickle
>>> with open('test.pickle', 'wb') as f:
...     pickle.dump(dd, f)
>>> with open('test.pickle', 'rb') as f:
...     rd = pickle.load(f)
>>> rd == dd
True
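A minimal sketch of such a tamper check, assuming the writer and the reader can share a secret key (the key value and the helper names below are made up for illustration): sign the pickled bytes with an HMAC and refuse to unpickle if the signature does not match.
import hashlib
import hmac
import pickle

SECRET_KEY = b'replace-with-a-real-secret'   # assumption: secret shared by writer and reader

def dump_signed(obj, path):
    data = pickle.dumps(obj)
    signature = hmac.new(SECRET_KEY, data, hashlib.sha256).digest()
    with open(path, 'wb') as f:
        f.write(signature + data)            # 32-byte signature first, then the pickle itself

def load_signed(path):
    with open(path, 'rb') as f:
        blob = f.read()
    signature, data = blob[:32], blob[32:]
    expected = hmac.new(SECRET_KEY, data, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError('signature mismatch - refusing to unpickle')
    return pickle.loads(data)
This only detects tampering by someone who does not know the key; it does not make pickle safe for data that comes from genuinely untrusted sources.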
2. Numpy
As suggested here and discussed here in depth, numpy is indeed another option, despite not being originally meant for this purpose. To be fair, numpy uses pickle to store and load object arrays, meaning that a lot of the speed in this particular use case comes from pickle. The downsides are a larger file size (around 10-20% larger than the pickle file) and, for the reason just mentioned, the same security implications as for pickle.
>>> import numpy as np
>>> np.save('test_large', dd)
>>> rd = np.load('test_large.npy', allow_pickle=True).item()
>>> rd == dd
True
3. JSON
As mentioned before, with the JSON standard you can't use tuples as keys. One thing you can do is cast the tuples to strings before writing to the file and then evaluate the strings back into tuples after reading. This creates a lot of overhead in terms of processing time (speed) in both directions. Security-wise, you can't be care-free here either: malicious strings can be used for resource-exhaustion attacks, and ast.literal_eval (while much safer than the dreaded eval) is not without its problems either.
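To illustrate the difference: ast.literal_eval only accepts Python literals and refuses anything that would require executing code (the exact error text varies between Python versions).
>>> import ast
>>> ast.literal_eval("(('word', 'list'), (1, 2), (1, 3))")
(('word', 'list'), (1, 2), (1, 3))
>>> ast.literal_eval("__import__('os').system('echo oops')")
Traceback (most recent call last):
  ...
ValueError: malformed node or string: ...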
There are several modules you could use, such as the built-in json, its externally developed sister simplejson, third-party modules such as ujson and orjson, and a bunch more, many of which become unmaintained over time.
(A mild warning: it may be questionable whether this process (tuple to string, string to tuple) always works. Many people use and suggest it, but it may be possible to create tuples that, after "stringification", cannot be evaluated back to their original form. I'd certainly be tempted to compare my original against the dump at least once to make sure.)
>>> import json
>>> import ast
>>> with open('test.json', 'w') as f:
...     json.dump({str(k): v for k, v in dd.items()}, f)
>>> with open('test.json') as f:
...     rd = {ast.literal_eval(k): v for k, v in json.load(f).items()}
>>> rd == dd
True
Note: there is at least one module, ujson, that allows dumping tuples as keys, but after loading the data you will still have to evaluate the string keys back into tuples. I find that this goes against the "explicit is better than implicit" axiom.
4. Change your structure and JSON
You can avoid the troubles mentioned in the point above by changing the structure of your data into something that can be dumped as JSON directly.
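A minimal sketch of one possible restructuring (the list-of-pairs layout and the file name are just assumptions; any JSON-friendly shape works): turn each tuple key into nested lists and store the dict as a list of [key, value] pairs.
import json

dd = {(('word', 'list'), (1, 2), (1, 3)): 0.0}   # stand-in for your original dict

# restructure: each tuple key becomes a JSON-friendly list of lists
restructured = [[[list(part) for part in k], v] for k, v in dd.items()]

with open('test_restructured.json', 'w') as f:
    json.dump(restructured, f)

with open('test_restructured.json') as f:
    pairs = json.load(f)

# rebuild the original tuple keys from the nested lists
rd = {tuple(tuple(part) for part in k): v for k, v in pairs}
assert rd == dd
The upside is that no string-to-tuple evaluation is needed on the way back, so the ast.literal_eval concerns from point 3 disappear; the price is that the on-disk format no longer mirrors your in-memory structure one-to-one.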
5. "plain" writing
As a comment here says, you can't simply write a dict to a file. As for why, see the tiny theory section above. What you can do is do the serialization and deserialization yourself. This is much slower than any of the other methods.
>>> import ast
>>> sd = {str(k): v for k, v in dd.items()}
>>> with open('test.txt', 'w') as f:
...     f.write(str(sd))
# double serialization is needed for your example: first the tuple keys to strings, then the whole dict to one string
>>> with open('test.txt') as f:
...     rd = {ast.literal_eval(k): v for k, v in ast.literal_eval(f.read()).items()}
# double deserialization is needed as well (due to the tuples as keys)
>>> rd == dd
True
Testing
I've created three different dictionaries, each with one million keys. Two of those have tuples as keys (case 1: long strings in the tuples, case 2: large integers) and the last one uses plain long strings as keys. simplejson was tested with simplejson._toggle_speedups(True). I used the same three dictionaries for each test case.
The integrity check was done to ensure that after dumping and loading I end up with the same dict as before. Regarding your example, the first test is the one closest to it in terms of key composition. Size is the size of the file after writing to disk (pickle clearly wins here). The first value is the time in seconds to serialize, write to file, read from file, and deserialize (resp. evaluate) back into (another) dictionary in one go.
My personal conclusion: if changing the structure is not an option, pickle certainly is the way to go.
Results string-tupled-dict:
Length str-dict: 1000000
Numpy (str): 2.0617s, size: 107.77 MB, integrity: True
Pickle (str): 2.0056s, size: 86.80 MB, integrity: True
Json (str): 17.4388s, size: 93.46 MB, integrity: True
Simplejson (str): 16.5553s, size: 93.46 MB, integrity: True
Ujson (str): 16.3620s, size: 91.55 MB, integrity: True
Orjson (str): 15.9673s, size: 91.55 MB, integrity: True
Plain (str): 21.8850s, size: 93.46 MB, integrity: True
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Results int-tupled-dict:
Length int-dict: 1000000
Numpy (int): 1.1442s, size: 29.57 MB, integrity: True
Pickle (int): 1.1685s, size: 21.94 MB, integrity: True
Json (int): 14.9805s, size: 37.94 MB, integrity: True
Simplejson (int): 14.5882s, size: 37.94 MB, integrity: True
Ujson (int): 14.4269s, size: 36.03 MB, integrity: True
Orjson (int): 14.1546s, size: 36.03 MB, integrity: True
Plain (int): 18.9656s, size: 37.94 MB, integrity: True
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Results non-tupled-dict:
Length reg-dict: 1000000
Numpy (reg): 0.6880s, size: 45.78 MB, integrity: True
Pickle (reg): 0.6475s, size: 39.11 MB, integrity: True
Json (reg): 1.7580s, size: 41.01 MB, integrity: True
Simplejson (reg): 1.5863s, size: 41.01 MB, integrity: True
Ujson (reg): 0.6543s, size: 39.10 MB, integrity: True
Orjson (reg): 0.4691s, size: 39.10 MB, integrity: True
Plain (reg): 5.9042s, size: 41.01 MB, integrity: True