Efficient ways to write a large NumPy array to a file
I've currently got a project running on PiCloud that involves multiple iterations of an ODE solver. Each iteration produces a NumPy array of about 30 rows and 1500 columns, with each iteration's output being appended to the bottom of the array of previous results.

Normally, I'd just let these fairly big arrays be returned by the function, hold them in memory and deal with them all at once. Except PiCloud has a fairly restrictive cap on the size of the data that can be returned outright by a function, to keep down transmission costs. Which is fine, except that means I'd have to launch thousands of jobs, each running one iteration, with considerable overhead.

It appears the best solution to this is to write the output to a file, and then collect the file using another function they have that doesn't have a transfer limit.

Is my best bet to do this just dumping it into a CSV file? Should I add to the CSV file each iteration, or hold it all in an array until the end and then just write once? Is there something terribly clever I'm missing?

Satan answered 8/1, 2012 at 6:12 Comment(0)

Unless there is a reason for the intermediate files to be human-readable, do not use CSV, as this will inevitably involve a loss of precision.

The most efficient is probably tofile (doc), which is intended for quick dumps of an array to disk when you know all of the attributes of the data ahead of time.
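
For example, a minimal sketch of the tofile/fromfile round trip (the file name and the 30 x 1500 shape are just placeholders taken from the question); note that tofile writes raw bytes with no header, so you must track the dtype and shape yourself:

```python
import numpy as np

# Placeholder for one iteration's 30 x 1500 result array.
results = np.random.rand(30, 1500)

# tofile writes only the raw bytes: no shape, dtype, or byte-order metadata.
results.tofile("results.bin")

# To read it back you must already know the dtype and shape.
loaded = np.fromfile("results.bin", dtype=results.dtype).reshape(30, 1500)
assert np.array_equal(results, loaded)
```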

For platform-independent, but numpy-specific, saves, you can use save (doc).
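
A quick sketch of save and its counterpart load (file name is arbitrary); the .npy header records shape, dtype, and byte order for you:

```python
import numpy as np

results = np.random.rand(30, 1500)  # placeholder for one iteration's output

# save writes shape, dtype, and byte order into the .npy header,
# so the file can be reloaded on any platform without extra bookkeeping.
np.save("results.npy", results)

loaded = np.load("results.npy")
assert loaded.shape == (30, 1500)
```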

Numpy and scipy also have support for various scientific data formats like HDF5 if you need portability.
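
For instance, one common route to HDF5 is the third-party h5py package (an assumption here; the answer only mentions HDF5 in general, and PyTables is another option):

```python
import numpy as np
import h5py  # third-party package; not part of numpy/scipy itself

results = np.random.rand(30, 1500)  # placeholder array

# Write the array to an HDF5 dataset; HDF5 files are portable across
# platforms and readable from many other languages and tools.
with h5py.File("results.h5", "w") as f:
    f.create_dataset("results", data=results)

# Read it back.
with h5py.File("results.h5", "r") as f:
    loaded = f["results"][:]
```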

Salinometer answered 8/1, 2012 at 7:9 Comment(1)
There really isn't a reason for them to be human-readable - just so used to using CSV files to move around data sets, where precision really isn't a factor (most things are integers). This seems to be about what I was looking for.Satan

I would recommend looking at the pickle module. The pickle module allows you to serialize python objects as streams of bytes (e.g., strings). This allows you to write them to a file or send them over a network, and then reinstantiate the objects later.
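
A short sketch of pickling a NumPy array to disk and reading it back (file name is arbitrary):

```python
import pickle
import numpy as np

results = np.random.rand(30, 1500)  # placeholder array

# Serialize the array to a byte stream on disk...
with open("results.pkl", "wb") as f:
    pickle.dump(results, f, protocol=pickle.HIGHEST_PROTOCOL)

# ...and reinstantiate it later (or on another machine).
with open("results.pkl", "rb") as f:
    loaded = pickle.load(f)
```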

Appling answered 8/1, 2012 at 7:1 Comment(2)
use cPickle instead of pickle, it is way faster.Dews
pickle is good for immediate use, but it should not be used when you have to port data across versions of Python (it is not backward compatible, i.e., 3.x can't read binary data pickled by 2.x, despite whatever the documentation says); use the npy format native to numpy. (bugs.python.org/issue6784)Konstanz

Try Joblib - Fast compressed persistence

One of the key components of joblib is its ability to persist arbitrary Python objects and read them back very quickly. It is particularly efficient for containers that do their heavy lifting with numpy arrays. The trick to achieving great speed has been to save the numpy arrays in separate files and load them via memmapping.
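
A minimal sketch using joblib's dump/load (file name and compression level are arbitrary choices):

```python
import numpy as np
import joblib  # third-party package

results = np.random.rand(30, 1500)  # placeholder array

# dump/load handle numpy arrays efficiently; compress trades speed for file size.
joblib.dump(results, "results.joblib", compress=3)
loaded = joblib.load("results.joblib")
```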

Edit: Newer (2016) blog entry on data persistence in Joblib

Bathulda answered 8/1, 2012 at 7:12 Comment(0)
