best way to preserve numpy arrays on disk

I am looking for a fast way to preserve large numpy arrays. I want to save them to disk in a binary format, then read them back into memory relatively quickly. cPickle is not fast enough, unfortunately.

I found numpy.savez and numpy.load. But the weird thing is, numpy.load loads an npy file into a "memory-map", which makes regular manipulation of the arrays really slow. For example, something like this would be really slow:

#!/usr/bin/python
import numpy as np
import time
from tempfile import TemporaryFile

n = 10000000

a = np.arange(n)
b = np.arange(n) * 10
c = np.arange(n) * -0.5

outfile = TemporaryFile()
np.savez(outfile, a=a, b=b, c=c)

outfile.seek(0)
t = time.time()
z = np.load(outfile)
print("loading time = ", time.time() - t)

t = time.time()
aa = z['a']
bb = z['b']
cc = z['c']
print("assigning time = ", time.time() - t)

More precisely, the first line is really fast, but the remaining lines that assign the arrays to objects are ridiculously slow:

loading time =  0.000220775604248
assigning time =  2.72940087318

Is there any better way of preserving numpy arrays? Ideally, I want to be able to store multiple arrays in one file.

Novick answered 8/3, 2012 at 14:28 Comment(11)
By default, np.load should not mmap the file.Wendall
What about pytables?Cudbear
@larsmans, thanks for the reply. But why is the lookup time (z['a'] in my code example) so slow?Novick
@Cudbear Thanks for your reply. I am considering it... but before I add more 3rd-party libraries I wanted to find a numpy solution first...Novick
It would be nice if there were a little more information in your question, like the kind of array which is stored in ifile and its size, or whether there are several arrays in different files, or how exactly you save them. From your question, I have got the impression that the first line does nothing and that the actual loading happens afterwards, but those are only guesses.Cudbear
@larsmans - For what it's worth, for an "npz" file (i.e. multiple arrays saved with numpy.savez), the default is to "lazily load" the arrays. It isn't memmapping them, but it doesn't load them until the NpzFile object is indexed. (Thus the delay the OP is referring to.) The documentation for load skips this, and is therefore a touch misleading...Polyphagia
@JoeKington Thanks Joe. But how do I "not lazily load" an npz file?Novick
If pickle was slow, maybe you did not set its "protocol" flag? pickle.dump(obj, file, -1) Without the "protocol" flag, "pickle" will use a slow ASCII format. Here is the documentation: pickle.dumpRhymester
I got loading time = 0.00024962425231933594 assigning time = 0.3003871440887451 with python 3 and numpy 1.13. I doubt the lazy-loading time can be significantly reduced by other packages, and as I'm not too concerned about compression, I'm perfectly happy with numpy.savez.Salamanca
one warning that some people might care about is that pickle can execute arbitrary code, which makes it less secure than other protocols for saving data.Kong
Loading NumPy arrays from disk: mmap() vs. Zarr/HDF5 (2023) is informativeCristen
Answer (score 77)

I'm a big fan of hdf5 for storing large numpy arrays. There are two options for dealing with hdf5 in python:

http://www.pytables.org/

http://www.h5py.org/

Both are designed to work with numpy arrays efficiently.
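
For instance, a minimal h5py sketch (the file name arrays.h5 and the dataset name 'a' are just illustrative choices):

import numpy as np
import h5py

a = np.arange(10000000)

# write the array to an HDF5 file under the dataset name 'a'
with h5py.File('arrays.h5', 'w') as f:
    f.create_dataset('a', data=a)

# read it back; [:] pulls the whole dataset into memory as a numpy array
with h5py.File('arrays.h5', 'r') as f:
    a_loaded = f['a'][:]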

Ardene answered 8/3, 2012 at 15:2 Comment(5)
would you be willing to provide some example code using these packages to save an array?Widget
h5py example and pytables examplePoinciana
In my experience, hdf5 reads and writes very slowly with chunked storage and compression enabled. For example, I have two 2-D arrays with shape (2,500,000 × 2,000) and chunk size (10,000 × 2,000). A single write operation of an array with shape (2,000 × 2,000) takes about 1~2 s to complete. Do you have any suggestion on improving the performance? thx.Cuthburt
1 to 2 s doesn't look so long for such a big array. What is the performance compared to the .npy format?Stormy
Does hdf5 have problems with memory consumption? I encountered problems with multi-worker training when the hdf5 file is large, while npz can use a memory map to avoid this.Duty
Answer (score 312)

I've compared performance (space and time) for a number of ways to store numpy arrays. Only a few of them support multiple arrays per file, but perhaps it's useful anyway.

[figure: benchmark for numpy array storage (space and time for each format)]

Npy and binary files are both really fast and small for dense data. If the data is sparse or very structured, you might want to use npz with compression, which will save a lot of space at the cost of some load time.

If portability is an issue, binary is better than npy. If human readability is important, then you'll have to sacrifice a lot of performance, but it can be achieved fairly well using csv (which is also very portable of course).
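
As a rough sketch of what these options look like for a single array (file names are illustrative):

import numpy as np

a = np.arange(10000000, dtype=np.float64)

# npy: fast, keeps dtype and shape, numpy-specific
np.save('a.npy', a)
a1 = np.load('a.npy')

# compressed npz: smaller for sparse/structured data, slower to load
np.savez_compressed('a.npz', a=a)
a2 = np.load('a.npz')['a']

# raw binary: maximally portable, but you must track dtype and shape yourself
a.tofile('a.bin')
a3 = np.fromfile('a.bin', dtype=np.float64).reshape(a.shape)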

More details and the code are available at the github repo.

Elery answered 2/1, 2017 at 11:21 Comment(14)
Could you explain why binary is better than npy for portability? Does this also apply for npz?Owlish
@Owlish Because any language can read binary files if they just know the shape, data type and whether it's row or column based. If you're just using Python then npy is fine, probably a little easier than binary.Elery
Thank you! One more question: am I overlooking something, or did you leave out HDF5? Since it is pretty common, I would be interested in how it compares to the other methods.Owlish
I tried to use png and npy to save the same image. The png only takes 2K of space while the npy takes 307K. This result is really different from your work. Am I doing something wrong? The image is a greyscale image containing only 0 and 255. I think this is sparse data, correct? I also tried npz but the size is exactly the same.Petal
@YorkYang This situation is different from the benchmark, where I saved float64 data as npy and png, rather than an image (probably 3D uint8 data?). Maybe something is going wrong with the data type, and perhaps png has better compression for this type of thing. But I doubt that explains the factor 154 difference completely. I'd personally just use png; for actual images I think it's unlikely that it can be improved upon much (except with lossy storage).Elery
Why is h5py missing? Or am I missing something?Owlish
Unofficial results for hdf5 vs npy. The code broke in several places for the other formats. Also, jt_dump was complaining in several places; it has been commented out. You can find the results at this fork: github.com/epignatelli/array_storage_benchmarkOdelet
your answer is extremely good. To make it better and have all the pros/cons of the options available: one con that some people might care about is that pickle can execute arbitrary code, which makes it less secure than other protocols for saving data.Kong
What is 'FortUnf'?Joviality
@Joviality Fortran unformatted, a binary format used by Fortran with decades of history but I think not a lot of users today. See the repo for more details.Elery
@YorkYang, PNG supports up to 16 bits, so I suspect that the PNG is converting the data type from numpy while NPY leaves the format unaltered.Lactoprotein
@YorkYang, The example used in the answer is random numbers and therefore has a huge information content and can't really be compressed (png is losslessly compressed). I think the key reason that your png is much smaller is because it is an actual image that is highly compressible. The data-types may be a factor, too, as others have said.Abreaction
@EduardoPignatelli So is hdf5 faster for reading than npy?Biysk
@Biysk from the benchmark, yes. I also often prefer hdf5 to npy in practice.Odelet
Answer (score 54)

There is now an HDF5-based clone of pickle called hickle!

https://github.com/telegraphic/hickle

import hickle as hkl 

data = {'name': 'test', 'data_arr': [1, 2, 3, 4]}

# Dump data to file
hkl.dump(data, 'new_data_file.hkl')

# Load data from file
data2 = hkl.load('new_data_file.hkl')

print(data == data2)

EDIT:

There is also the possibility to "pickle" directly into a compressed archive:

import pickle, gzip, lzma, bz2

pickle.dump(data, gzip.open('data.pkl.gz', 'wb'))
pickle.dump(data, lzma.open('data.pkl.lzma', 'wb'))
pickle.dump(data, bz2.open('data.pkl.bz2', 'wb'))
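
Reading these back is symmetric; for instance, for the gzip variant:

import pickle, gzip

with gzip.open('data.pkl.gz', 'rb') as fin:
    data_loaded = pickle.load(fin)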

[figure: Pickle Compression Comparison (file size and saving time), produced by the appendix code below]


Appendix

import numpy as np
import matplotlib.pyplot as plt
import pickle, os, time
import gzip, lzma, bz2, h5py

compressions = ['pickle', 'h5py', 'gzip', 'lzma', 'bz2']
modules = dict(
    pickle=pickle, h5py=h5py, gzip=gzip, lzma=lzma, bz2=bz2
)

labels = ['pickle', 'h5py', 'pickle+gzip', 'pickle+lzma', 'pickle+bz2']
size = 1000

data = {}

# Random data
data['random'] = np.random.random((size, size))

# Not that random data
data['semi-random'] = np.zeros((size, size))
for i in range(size):
    for j in range(size):
        data['semi-random'][i, j] = np.sum(
            data['random'][i, :]) + np.sum(data['random'][:, j]
        )

# Not random data
data['not-random'] = np.arange(
    size * size, dtype=np.float64
).reshape((size, size))

sizes = {}

for key in data:

    sizes[key] = {}

    for compression in compressions:
        path = 'data.pkl.{}'.format(compression)

        if compression == 'pickle':
            time_start = time.time()
            # use a context manager so the file is flushed before its size is measured
            with open(path, 'wb') as fout:
                pickle.dump(data[key], fout)
            time_tot = time.time() - time_start
            sizes[key]['pickle'] = (
                os.path.getsize(path) * 10**-6,
                time_tot,
            )
            os.remove(path)

        elif compression == 'h5py':
            time_start = time.time()
            with h5py.File(path, 'w') as h5f:
                h5f.create_dataset('data', data=data[key])
            time_tot = time.time() - time_start
            sizes[key][compression] = (os.path.getsize(path) * 10**-6, time_tot)
            os.remove(path)

        else:
            time_start = time.time()
            with modules[compression].open(path, 'wb') as fout:
                pickle.dump(data[key], fout)
            time_tot = time.time() - time_start
            sizes[key][labels[compressions.index(compression)]] = (
                os.path.getsize(path) * 10**-6, 
                time_tot,
            )
            os.remove(path)


f, ax_size = plt.subplots()
ax_time = ax_size.twinx()

x_ticks = labels
x = np.arange(len(x_ticks))

y_size = {}
y_time = {}
for key in data:
    y_size[key] = [sizes[key][x_ticks[i]][0] for i in x]
    y_time[key] = [sizes[key][x_ticks[i]][1] for i in x]

width = .2
viridis = plt.cm.viridis

p1 = ax_size.bar(x - width, y_size['random'], width, color = viridis(0))
p2 = ax_size.bar(x, y_size['semi-random'], width, color = viridis(.45))
p3 = ax_size.bar(x + width, y_size['not-random'], width, color = viridis(.9))
p4 = ax_time.bar(x - width, y_time['random'], .02, color='red')

ax_time.bar(x, y_time['semi-random'], .02, color='red')
ax_time.bar(x + width, y_time['not-random'], .02, color='red')

ax_size.legend(
    (p1, p2, p3, p4), 
    ('random', 'semi-random', 'not-random', 'saving time'),
    loc='upper center', 
    bbox_to_anchor=(.5, -.1), 
    ncol=4,
)
ax_size.set_xticks(x)
ax_size.set_xticklabels(x_ticks)

f.suptitle('Pickle Compression Comparison')
ax_size.set_ylabel('Size [MB]')
ax_time.set_ylabel('Time [s]')

f.savefig('sizes.pdf', bbox_inches='tight')
Ashy answered 5/3, 2014 at 13:10 Comment(3)
one warning that some people might care about is that pickle can execute arbitrary code, which makes it less secure than other protocols for saving data.Kong
This is great! Can you also provide the code for reading the files pickled directly into compression using lzma or bz2?Prut
@ErnestSKirubakaran It's basically the same: If you saved it using pickle.dump( obj, gzip.open( 'filename.pkl.gz', 'wb' ) ), you can load it using pickle.load( gzip.open( 'filename.pkl.gz', 'r' ) )Ashy
Answer (score 18)

savez() saves the data in a zip file, and it may take some time to zip and unzip it. You can use the save() and load() functions instead:

import numpy as np

with open("tmp.bin", "wb") as f:
    np.save(f, a)
    np.save(f, b)
    np.save(f, c)

with open("tmp.bin", "rb") as f:
    aa = np.load(f)
    bb = np.load(f)
    cc = np.load(f)

To save multiple arrays in one file, you just need to open the file first, and then save or load the arrays in sequence.

Liatrice answered 9/3, 2012 at 6:45 Comment(0)
Answer (score 8)

Another possibility to store numpy arrays efficiently is Bloscpack:

#!/usr/bin/python
import numpy as np
import bloscpack as bp
import time

n = 10000000

a = np.arange(n)
b = np.arange(n) * 10
c = np.arange(n) * -0.5
tsizeMB = sum(i.size*i.itemsize for i in (a,b,c)) / 2**20.

blosc_args = bp.DEFAULT_BLOSC_ARGS
blosc_args['clevel'] = 6
t = time.time()
bp.pack_ndarray_file(a, 'a.blp', blosc_args=blosc_args)
bp.pack_ndarray_file(b, 'b.blp', blosc_args=blosc_args)
bp.pack_ndarray_file(c, 'c.blp', blosc_args=blosc_args)
t1 = time.time() - t
print "store time = %.2f (%.2f MB/s)" % (t1, tsizeMB / t1)

t = time.time()
a1 = bp.unpack_ndarray_file('a.blp')
b1 = bp.unpack_ndarray_file('b.blp')
c1 = bp.unpack_ndarray_file('c.blp')
t1 = time.time() - t
print "loading time = %.2f (%.2f MB/s)" % (t1, tsizeMB / t1)

and the output for my laptop (a relatively old MacBook Air with a Core2 processor):

$ python store-blpk.py
store time = 0.19 (1216.45 MB/s)
loading time = 0.25 (898.08 MB/s)

That means it can store data really fast, i.e. the bottleneck is typically the disk. However, as the compression ratios are pretty good here, the effective speed is multiplied by them. Here are the sizes for these 76 MB arrays:

$ ll -h *.blp
-rw-r--r--  1 faltet  staff   921K Mar  6 13:50 a.blp
-rw-r--r--  1 faltet  staff   2.2M Mar  6 13:50 b.blp
-rw-r--r--  1 faltet  staff   1.4M Mar  6 13:50 c.blp

Please note that the use of the Blosc compressor is fundamental for achieving this. The same script but using 'clevel' = 0 (i.e. disabling compression):

$ python bench/store-blpk.py
store time = 3.36 (68.04 MB/s)
loading time = 2.61 (87.80 MB/s)

is clearly bottlenecked by the disk performance.

Hazaki answered 6/3, 2014 at 13:1 Comment(1)
To whom it may concern: although Bloscpack and PyTables are different projects (the former focuses only on disk dumps, not on slicing stored arrays), I tested both, and for pure "file dump" use cases Bloscpack is almost 6x faster than PyTables.Chez
Answer (score 5)

The lookup time is slow because mmap does not load the array contents into memory when you invoke the load method. Data is lazily loaded when a particular piece of it is needed, and in your case this happens at lookup time. A second lookup of the same array won't be so slow.

This is a nice feature of mmap: when you have a big array, you do not have to load the whole thing into memory.
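
For example, with a plain .npy file you can ask numpy for a memory map explicitly and only pay for the slices you actually access (big.npy is just an illustrative file name):

import numpy as np

np.save('big.npy', np.arange(10000000))

# mmap_mode='r' maps the file instead of reading it eagerly;
# data is pulled from disk only when it is accessed
z = np.load('big.npy', mmap_mode='r')
chunk = np.array(z[:1000])   # copies just this slice into memory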

To solve this, you can use joblib: you can dump any object you want using joblib.dump, even two or more numpy arrays. See the example:

import numpy as np
import joblib

firstArray = np.arange(100)
secondArray = np.arange(50)
# Put the two arrays in a dictionary and save them to one file
my_dict = {'first': firstArray, 'second': secondArray}
joblib.dump(my_dict, 'file_name.dat')
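
Loading it back with joblib.load returns the same dictionary:

my_dict_back = joblib.load('file_name.dat')
print(my_dict_back['first'])   # the first array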
Spallation answered 27/3, 2014 at 11:25 Comment(1)
The library is no longer available.Metaphase
Answer (score 0)

'Best' depends on what your goal is. As others have said, flat binary is maximally portable, but the problem is that you need to know how the data is stored.

Darr saves your numpy array in a self-documenting way based on flat binary and text files. This maximizes wide readability. It also automatically includes code for reading your array in a variety of data science languages, such as numpy itself, but also R, Matlab, Julia, etc.

Disclosure: I wrote the library.

Car answered 19/7, 2021 at 14:10 Comment(0)
