How can I write structured data to a file and then read it back into the same structure later?

I have a rather large dict (6 GB) with a structure along the lines of:

{((('word','list'),(1,2),(1,3)),(...)):0.0, ....}

and I need to do some processing on it. I'm trying out several document clustering methods, so I need to have the whole thing in memory at once. I have other functions to run on this data, but the contents will not change.

Currently, every time I think of new functions I have to write them, and then re-generate the dict.

I want to store the dict in a file, in such a way that I can easily load it into memory later, preserving the original structure and data types, instead of recalculating all its values.

How can I do this in Python, without manually parsing dictionary etc. syntax?

Adelaadelaida answered 20/5, 2009 at 22:2 Comment(1)
I would use ZODB if you need a persistent dict that is too large to fit into memory.Gales
68

Why not use Python's pickle? Python has a great serialization module called pickle, and it is very easy to use.

import cPickle
# write the object out, then read it back later
with open('save.p', 'wb') as f:
    cPickle.dump(obj, f)
with open('save.p', 'rb') as f:
    obj = cPickle.load(f)

There are two disadvantages with pickle:

  • It's not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.
  • The format is not human readable.

If you are using Python 2.6 or later, there is a built-in module called json. It is as easy to use as pickle:

import json
encoded = json.dumps(obj)
obj = json.loads(encoded)
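
Writing to and reading from a file is just as short; a minimal sketch (the filename is only illustrative):

import json
with open('save.json', 'w') as f:
    json.dump(obj, f)
with open('save.json') as f:
    obj = json.load(f)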

The JSON format is human readable and very similar to Python's dictionary string representation. It doesn't have pickle's security issues, but it might be slower than cPickle.

Extraterritorial answered 20/5, 2009 at 22:7 Comment(2)
I've also seen that pickle takes up more memory than a text file.Quadrille
Using pickle is probably the best option here, and as the sole creator and user, the security issues shouldn't be a problem. Also, note that in Python 3.x the native version of libraries (e.g. cPickle) shouldn't/can't be imported directly; instead, import pickle will try to import the native version automatically, if available. Also note, you might have issues unpickling data created with a different Python version, but cross-platform use is reported as okay.Desimone
12

I'd use shelve, json, yaml, or whatever, as suggested by other answers.

shelve is especially cool because you can keep the dict on disk and still use it; values are loaded on demand.
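
A minimal sketch of how that might look (the filename is illustrative; note that shelve keys must be strings, so tuple keys like yours would first need converting, e.g. with repr()):

import shelve

db = shelve.open('clusters.shelf')            # backed by a file on disk
db[repr((('word', 'list'), (1, 2)))] = 0.0    # written to disk, not kept in memory
db.close()

db = shelve.open('clusters.shelf')
value = db[repr((('word', 'list'), (1, 2)))]  # loaded on demand
db.close()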

But if you really want to parse the text of the dict, and it contains only strings, ints and tuples as you've shown, you can use ast.literal_eval to parse it. It is a lot safer, since you can't evaluate arbitrary expressions with it; it only works with strings, numbers, tuples, lists, dicts, booleans, and None:

>>> import ast
>>> print ast.literal_eval("{12: 'mydict', 14: (1, 2, 3)}")
{12: 'mydict', 14: (1, 2, 3)}
Depolymerize answered 20/5, 2009 at 23:12 Comment(0)
4

I would suggest that you use YAML for your file format, so you can tinker with it on disk.

How does it look:
  - It is indent based
  - It can represent dictionaries and lists
  - It is easy for humans to understand
An example (a dict holding a list and a string) is shown in the sketch further down.
Full syntax: http://www.yaml.org/refcard.html

To get it in python, just easy_install pyyaml. See http://pyyaml.org/

It comes with easy file save/load functions that I can't remember right this minute.
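
A rough sketch of those save/load functions, as I understand them (the filename and the toy data are only illustrative; safe_dump/safe_load are the variants that avoid arbitrary object construction):

import yaml

data = {'words': ['list', 'of', 'words'], 'label': 'cluster-0'}  # a dict holding a list and a string

with open('data.yaml', 'w') as f:
    yaml.safe_dump(data, f, default_flow_style=False)
# data.yaml is now indent-based, human-readable text:
#   label: cluster-0
#   words:
#   - list
#   - of
#   - words

with open('data.yaml') as f:
    restored = yaml.safe_load(f)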

Obelia answered 20/5, 2009 at 22:57 Comment(0)
2

To oversimplify things, it looks something like: {((('word','list'),(1,2),(1,3)),(...)):0.0, ....}

Here you have a dictionary with (nested) tuples as keys. This means that dumping it straight to JSON (with either json or simplejson) will not work:

TypeError: keys must be str, int, float, bool or None, not tuple

because the JSON standard is specific about what it allows as keys and tuples are not part of that.

Which means you have to pick your poison:

  1. use pickle
  2. use numpy (which uses pickle)
  3. turn the keys (tuples) to strings and use json or derivatives
  4. change the structure of your dictionary and use json or derivatives
  5. use "plain" writing a.k.a. transform dict manually
  6. ... bson, ... , ... (not getting into these)

To get a better picture of the performance, I created a test whose results you will find at the bottom.

The tiniest amount of theory first

I feel that python must have a better way than me looping around through some string looking for : and ( trying to parse it into a dictionary.

Let's think about what dictionaries are for a moment. For us, they look like pairs of keys and values; for Python, dictionaries are hash tables, a relation between the memory addresses of keys and values. Very simply speaking, those addresses vanish for Python after program execution, and so does whatever is stored there (this is really the super-simplified version, for brevity). That means that when storing your dict to disk, you can't simply store those memory addresses, because the operating system will not allow a new Python instance access to them; everything else would be a security nightmare. Also, Python cannot claim specific memory addresses that it may have used on a previous run. So, even if you stored the names of those addresses and the values stored there before, you can't simply "rebuild" your dict by writing value x to memory address y at will (in actuality, there may be ways to do parts of that, but that's asking for memory corruption).

All of this means that Python indeed has to (re-)build the dict by dumping and then loading, i.e. parsing, organized (serialized) data. Pickle does this by converting a hierarchy of Python objects into a byte stream; JSON and its derivatives do it by creating strings that follow the JSON specs.

1. Pickle

Practically speaking, this is the overall best choice: lowest file size, fast writes, decently fast reads, and no need for any type transformations, as long as your dictionary contains only pickleable objects. The largest downside is that you need to be certain that you can trust the data, or you rip a huge security hole into your setup. Even if the data is stored locally, consider employing measures to verify that it has not been tampered with (arbitrary code execution is possible). Another (potential) downside of pickle is that there are different protocol versions (i.e. how the serialization/deserialization is done), so you need to make sure the pickle versions are compatible if you use different setups for dumping and loading.

>>> import pickle
>>> with open('test.pickle', 'wb') as f:
        pickle.dump(dd, f)
>>> with open('test.pickle', 'rb') as f:
        rd = pickle.load(f)
>>> rd == dd
True
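
If the protocol-version concern applies to your setup, the protocol can be pinned explicitly when dumping; a minimal sketch (protocol 4 is just an example value):

>>> import pickle
>>> with open('test.pickle', 'wb') as f:
        pickle.dump(dd, f, protocol=4)  # pick a protocol every involved Python version supports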

2. Numpy

As suggested here and discussed here in depth, numpy is indeed another option, despite not originally being meant for this purpose. To be fair, numpy uses pickle to store and load object arrays, meaning that much of the speed in this particular use case comes from pickle. Downsides are the larger file size (around 10-20% larger than the pickle file) and the same security implications as for pickle, for the reason mentioned before.

>>> import numpy as np
>>> np.save('test_large', dd)
>>> rd = np.load('test_large.npy', allow_pickle=True).item()
>>> rd == dd
True 

3. JSON

As mentioned before, with the JSON standard you can't use tuples as keys. One thing that can be done is to cast the tuples to strings before writing to the file and then evaluate the strings back into tuples after reading. This creates a lot of overhead in terms of processing time (speed) in both directions. Security-wise, you can't be care-free here either: malicious strings can be used for resource-exhaustion attacks, and using ast.literal_eval (while safer than the dreaded eval) is not without its problems either.

There are several modules you could use, such as the built-in json, its externally developed sister simplejson, and modules such as ujson and orjson, and a bunch more, many of which become unmaintained over time.

(A mild warning: it may be questionable whether this process (tuple to string, string to tuple) always works. Many people use and suggest it, but it may be possible to create tuples that, after "stringification", cannot be evaluated back to their originals. I'd certainly be tempted to test my original against the dump at least once, to make sure.)

>>> import json
>>> import ast
>>> with open('test.json', 'w') as f:
        json.dump({str(k): v for k, v in dd.items()}, f)
>>> with open('test.json') as f:
        rd = {ast.literal_eval(k):v for k,v in json.load(f).items()}
>>> rd == dd
True 

Note: There is at least one module, ujson, which allows tuples as keys, but after loading the data you'll still have to evaluate the string keys back into tuples. I find that this goes against the "explicit is better than implicit" axiom.

4. change your structure and JSON

You can avoid the troubles mentioned in the point above by changing the structure of your data into something that can be dumped as JSON directly; one possible restructuring is sketched below.
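
One possibility (just a sketch, assuming a list of [key, value] pairs is acceptable for your processing) is to dump the items, letting JSON turn the tuples into nested lists, and rebuild the tuples on load:

import json

# toy stand-in for the real dict (keys contain only strings, ints and tuples)
dd = {((('word', 'list'), (1, 2), (1, 3)),): 0.0}

# dump: store the items as [key, value] pairs; JSON turns tuples into nested lists
with open('test_pairs.json', 'w') as f:
    json.dump(list(dd.items()), f)

# load: recursively turn the nested lists back into tuples to restore the keys
def to_tuple(x):
    return tuple(to_tuple(i) for i in x) if isinstance(x, list) else x

with open('test_pairs.json') as f:
    rd = {to_tuple(k): v for k, v in json.load(f)}

assert rd == dd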

5. "plain" writing

As a comment here says, you can't simply write a dict to a file. As for why, see the tiny theory section above. What you can do is perform the serialization and deserialization yourself. This is much slower than any of the other methods.

>>> dd = {str(k):v for k,v in dd.items()}
>>> with open('test.txt', 'w') as f:
        f.write(str(dd))
        # double-serialization is needed for your example!
>>> with open('test.txt') as f:
        rd = {ast.literal_eval(k):v for k,v in ast.literal_eval(f.read()).items()}
        # double-deserialization is needed as well (due to tuples as keys)
    

Testing

I've created three different dictionaries, each with one million keys. Two of those have tuples as keys (case 1: long strings inside the tuples, case 2: large integers) and the last one uses only long strings as keys. Simplejson was tested with simplejson._toggle_speedups(True). I used the same three dictionaries for each test case.

The integrity check was done to ensure that after dumping and loading I have the same dict as before. Regarding your example, the first test is the one closest to it in terms of key composition. The first value in each result line is the time in seconds to serialize, write to file, read from file, and deserialize (resp. evaluate) back into (another) dictionary in one go. Size is the size of the file after writing to disk (pickle clearly wins here).

My personal conclusion: if changing the structure is not an option, pickle certainly is the way to go.

Results string-tupled-dict:
Length str-dict: 1000000
Numpy (str): 2.0617s, size: 107.77 MB, integrity: True
Pickle (str): 2.0056s, size: 86.80 MB, integrity: True
Json (str): 17.4388s, size: 93.46 MB, integrity: True
Simplejson (str): 16.5553s, size: 93.46 MB, integrity: True
Ujson (str): 16.3620s, size: 91.55 MB, integrity: True
Orjson (str): 15.9673s, size: 91.55 MB, integrity: True
Plain (str): 21.8850s, size: 93.46 MB, integrity: True


–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––

Results int-tupled-dict:
Length int-dict: 1000000
Numpy (int): 1.1442s, size: 29.57 MB, integrity: True
Pickle (int): 1.1685s, size: 21.94 MB, integrity: True
Json (int): 14.9805s, size: 37.94 MB, integrity: True
Simplejson (int): 14.5882s, size: 37.94 MB, integrity: True
Ujson (int): 14.4269s, size: 36.03 MB, integrity: True
Orjson (int): 14.1546s, size: 36.03 MB, integrity: True
Plain (int): 18.9656s, size: 37.94 MB, integrity: True


–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––

Results non-tupled-dict:
Length reg-dict: 1000000
Numpy (reg): 0.6880s, size: 45.78 MB, integrity: True
Pickle (reg): 0.6475s, size: 39.11 MB, integrity: True
Json (reg): 1.7580s, size: 41.01 MB, integrity: True
Simplejson (reg): 1.5863s, size: 41.01 MB, integrity: True
Ujson (reg): 0.6543s, size: 39.10 MB, integrity: True
Orjson (reg): 0.4691s, size: 39.10 MB, integrity: True
Plain (reg): 5.9042s, size: 41.01 MB, integrity: True
Monnet answered 20/3 at 3:29 Comment(0)
0

Write it out in a serialized format, such as pickle (a python standard library module for serialization) or perhaps by using JSON (which is a representation that can be evaluated to produce the memory representation again).

Sackcloth answered 20/5, 2009 at 22:27 Comment(1)
This describes the general approach, but is not actionable as is. It would be useful as an answer if it at least clearly explained what a "serialized format" (or "serialization") is, and why it is necessary. Aside from that, JSON is not general-purpose, and parsing JSON is not really "evaluating" it. If the intended suggestion really was to use the built-in eval (I edited to correct "evaled" to "evaluated"), then that is all three of wrong (because JSON literals like null are not valid in Python), dangerous (for obvious reasons) and unnecessary.Dishman
-2

Here are a few alternatives depending on your requirements:

  • numpy stores your plain data in a compact form and performs group/mass operations well

  • shelve is like a large dict backed up by a file

  • some 3rd party storage module, e.g. stash, stores arbitrary plain data

  • proper database, e.g. mongodb for hairy data, or mysql or sqlite for plain data and faster retrieval (a minimal sqlite sketch follows this list)
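
A minimal sketch of the sqlite route, assuming keys are stored as their repr() and values as JSON text (the file name and table layout are only illustrative):

import json
import sqlite3

con = sqlite3.connect('clusters.db')
con.execute('CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)')

key = (('word', 'list'), (1, 2), (1, 3))  # toy key
con.execute('INSERT OR REPLACE INTO kv VALUES (?, ?)', (repr(key), json.dumps(0.0)))
con.commit()

# retrieve a single value without loading the whole dict into memory
row = con.execute('SELECT v FROM kv WHERE k = ?', (repr(key),)).fetchone()
value = json.loads(row[0])  # -> 0.0
con.close()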

Antiphrasis answered 5/11, 2012 at 15:15 Comment(1)
Numpy is designed to store completely differently structured data.Dishman
-2

If you don't care about the read speed of your dictionary, you could consider using the file system as the dict: each filename is a key of the dict, and the data is stored in a JSON file.

I created a file-based JSON package to support working with huge dict data; maybe it's useful for you.
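
A rough sketch of the idea (hashing repr(key) to derive a filename is just one way to do it; the directory name is illustrative):

import hashlib
import json
import os

STORE = 'dict_store'
os.makedirs(STORE, exist_ok=True)

def _path(key):
    # derive a safe filename from the (possibly tuple) key
    return os.path.join(STORE, hashlib.sha1(repr(key).encode()).hexdigest() + '.json')

def put(key, value):
    with open(_path(key), 'w') as f:
        json.dump(value, f)

def get(key):
    with open(_path(key)) as f:
        return json.load(f)

put((('word', 'list'), (1, 2)), 0.0)
print(get((('word', 'list'), (1, 2))))  # -> 0.0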

Marimaria answered 20/3 at 1:23 Comment(0)
