Save Numpy Array using Pickle
Asked Answered
B

9

47

I've got a NumPy array (130,000 x 3) that I would like to save using pickle, with the following code. However, I keep getting the error "EOFError: Ran out of input" or "UnsupportedOperation: read" at the pkl.load line. This is my first time using pickle; any ideas?

Thanks,

Anant

import pickle as pkl
import numpy as np

arrayInput = np.zeros((1000,2)) #Trial input
save = True
load = True

fileName = path + 'CNN_Input'
fileObject = open(fileName, 'wb')

if save:
    pkl.dump(arrayInput, fileObject)
    fileObject.close()

if load:
    fileObject2 = open(fileName, 'wb')
    modelInput = pkl.load(fileObject2)
    fileObject2.close()

if arrayInput == modelInput:
    print(True)
Berbera answered 21/9, 2018 at 13:34 Comment(3)
fileObject2 should be opened for read, not writeMariko
I'm confused, what are the pros/cons of pickle vs np.save/z etc?Rockett
did any of the solutions here help you? If not do you mind clarifying what was wrong with them? #28440201Rockett
A
60

You should use numpy.save and numpy.load.
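For the array in the question, a minimal sketch might look like this (the file name is arbitrary; np.save appends the .npy extension if it is missing):

import numpy as np

arrayInput = np.zeros((130000, 3))  # stand-in for the real data

np.save('CNN_Input.npy', arrayInput)           # write
modelInput = np.load('CNN_Input.npy')          # read back

print(np.array_equal(arrayInput, modelInput))  # True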

Antonietta answered 21/9, 2018 at 13:40 Comment(7)
Upvoting. pickle is good for arbitrary python data. np.save and np.load will be much more efficient for numeric data.Numismatics
I can't test it now, but I thought save was the numpy pickling method. Conversely, save uses pickle to write non-array elements.Mariko
as suggested, please have a look at complete sample code hereAssyria
can you comment what is wrong with using pickle please? thanks in advance.Rockett
perhaps a comment on np.save vs np.savez would be helpful.Rockett
Tried using save/load for numpy.ndarrays and they always came back empty. Urgh.Bobolink
@Antonietta can you elaborate, why using numpy.save and load are better than pickle?Numerary
M
29

I have no problems using pickle:

In [126]: arr = np.zeros((1000,2))
In [127]: with open('test.pkl','wb') as f:
     ...:     pickle.dump(arr, f)
     ...:     
In [128]: with open('test.pkl','rb') as f:
     ...:     x = pickle.load(f)
     ...:     print(x.shape)
     ...:     
     ...:     
(1000, 2)

pickle and np.save/load have a deep reciprocity. For example, I can load this pickle with np.load (on newer numpy versions you may need allow_pickle=True):

In [129]: np.load('test.pkl').shape
Out[129]: (1000, 2)

If I open the pickle file in the wrong mode, I do get your error:

In [130]: with open('test.pkl','wb') as f:
     ...:     x = pickle.load(f)
     ...:     print(x.shape)
     ...:    
UnsupportedOperation: read

But that shouldn't be surprising - you can't read a freshly opened write file. It will be empty.

np.save/load is the usual pair for writing numpy arrays. But pickle uses save to serialize arrays, and save uses pickle to serialize non-array objects (within the array). The resulting file sizes are similar. Curiously, in timings the pickle version is faster.

Mariko answered 21/9, 2018 at 16:35 Comment(2)
I'm confused, what are the pros/cons of pickle vs np.save/z etc?Rockett
seems the pickle library won't work unless one uses binary mode: rb when loading and wb when saving, right? Without it I get this error: TypeError: a bytes-like object is required, not 'str'Rockett
B
13

It's been a while, but if you're finding this: in my tests pickle completed the save/load in a fraction of the time that numpy's compressed save took.

with open('filename','wb') as f: pickle.dump(arrayname, f)

with open('filename','rb') as f: arrayname1 = pickle.load(f)

numpy.array_equal(arrayname,arrayname1) #sanity check

On the other hand, numpy's compressed save took my 5.2 GB of data down to 0.4 GB by default, while the pickle file came in at 1.7 GB.
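If you want to reproduce a comparison like this on your own data, here is a minimal benchmarking sketch (the array size and file names are arbitrary stand-ins):

import os
import pickle
import time

import numpy as np

arr = np.random.rand(130000, 3)

# time a plain pickle dump
t0 = time.perf_counter()
with open('arr.pkl', 'wb') as f:
    pickle.dump(arr, f, protocol=pickle.HIGHEST_PROTOCOL)
pkl_time = time.perf_counter() - t0

# time a compressed numpy save
t0 = time.perf_counter()
np.savez_compressed('arr.npz', arr=arr)
npz_time = time.perf_counter() - t0

print(f"pickle:           {pkl_time:.3f}s, {os.path.getsize('arr.pkl') / 1e6:.1f} MB")
print(f"savez_compressed: {npz_time:.3f}s, {os.path.getsize('arr.npz') / 1e6:.1f} MB")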

Bennir answered 18/4, 2019 at 16:19 Comment(2)
basically the main advantage of using pickle vs numpy.save/z is that numpy is optimized to use less space when saving? is that right?Rockett
I wonder if your comment of storage efficiency only applies to np.save or it also applies to np.savezRockett
R
8

Don't use pickle for numpy arrays; for an extended discussion that links to all the resources I could find, see my answer here.

Short reasons:

  • there is already a nice interface that the developers of numpy made, and it will save you lots of debugging time (most important reason)
  • np.save, np.load, and np.savez have pretty good performance in most metrics (see this), which is to be expected since it's an established library and the developers of numpy wrote those functions.
  • pickle executes arbitrary code and is a security issue
  • to use pickle you have to open a file yourself, which can lead to bugs (e.g. I wasn't aware that binary mode b was required and it stopped working; that took time to debug)
  • if you refuse to accept this advice, at least really articulate the reason you need to use something else. Make sure it's crystal clear in your head.

Avoid repeating code at all costs if a solution already exists!

Anyway, here are all the interfaces I tried, hopefully it saves someone time (probably my future self):

import numpy as np
import pickle
from pathlib import Path

path = Path('~/data/tmp/').expanduser()
path.mkdir(parents=True, exist_ok=True)

lb,ub = -1,1
num_samples = 5
x = np.random.uniform(low=lb,high=ub,size=(1,num_samples))
y = x**2 + x + 2

# using save (to npy), savez (to npz), and pickle (to pkl)
np.save(path/'x', x)
np.save(path/'y', y)
np.savez(path/'db', x=x, y=y)
with open(path/'db.pkl', 'wb') as db_file:
    pickle.dump(obj={'x':x, 'y':y}, file=db_file)

## loading the npy, npz, and pkl files
x_loaded = np.load(path/'x.npy')
y_load = np.load(path/'y.npy')
db = np.load(path/'db.npz')
with open(path/'db.pkl', 'rb') as db_file:
    db_pkl = pickle.load(db_file)

print(x is x_loaded)
print(x == x_loaded)
print(x == db['x'])
print(x == db_pkl['x'])
print('done')

But most usefully, see my answer here.

Rockett answered 13/7, 2020 at 20:0 Comment(1)
I'm using a library that embeds numpy arrays inside of python objects, but I need to store data along the way. Seems like using pickle is preferable in this case so a person does not need to deconstruct the entire library in order to save the numpy arrays with save and the rest with pickle.Spendthrift
A
2

Here is one more possible way. Sometimes you should add the extra protocol option. For example,

import pickle
import numpy as np

# Your array
arrayInput = np.zeros((1000,2))

Here is your approach:

pickle.dump(arrayInput, open('file_name.pickle', 'wb'))

Which you can change to:

# in two lines of code
with open("file_name.pickle", "wb") as f:
    pickle.dump(arrayInput, f, protocol=pickle.HIGHEST_PROTOCOL)

or

# Save in line of code
pickle.dump(arrayInput, open("file_name.pickle", "wb"), protocol=pickle.HIGHEST_PROTOCOL)

Afterwards, you can easily read your numpy array back like:

arrayInput = pickle.load(open("file_name.pickle", 'rb'))

Hope it is useful for you.

Ahmedahmedabad answered 19/10, 2021 at 16:41 Comment(0)
O
2

Many are forgetting one very important thing: security.

Pickled data is a binary instruction stream, and those instructions are executed as soon as you call pickle.load. If loading from an untrusted source, the file could contain instructions that run arbitrary code, enabling things like man-in-the-middle attacks over a network, among other things (e.g. see this realpython.com article).

Pure pickled data may be faster to save/load if you don't follow it with bz2 compression (at the cost of a larger file size), but numpy save/load may be more secure.

Alternatively, you may save purely pickled data along with an HMAC signature computed from a secret key, using the builtin hashlib and hmac libraries, and, prior to loading, recompute the signature and compare it with the stored one:

import hashlib
import hmac

def calculate_hash(
    key_,
    file_path,
    hash_=hashlib.sha256
):
    with open(file_path, "rb") as fp:
        file_hash = hmac.new(key_, fp.read(), hash_).hexdigest()

    return file_hash

def compare_hash(
    hash1,
    hash2,
):
    """
    Warning:
        Do not use `==` directly to compare hash values. Timing attacks can be used
        to learn your security key. Use ``compare_digest()``.
    """
    return hmac.compare_digest(hash1, hash2)
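A minimal usage sketch with the helpers above (the file name and the hard-coded key are hypothetical; in practice the key should come from a secure store rather than the source code):

import pickle

import numpy as np

key_ = b'replace-with-a-real-secret'

arr = np.zeros((1000, 2))

# save: write the pickle, then record its HMAC for later verification
with open('arr.pkl', 'wb') as f:
    pickle.dump(arr, f)
saved_hash = calculate_hash(key_, 'arr.pkl')

# load: recompute the HMAC and refuse to unpickle if it does not match
if compare_hash(saved_hash, calculate_hash(key_, 'arr.pkl')):
    with open('arr.pkl', 'rb') as f:
        arr_loaded = pickle.load(f)
else:
    raise ValueError("arr.pkl failed its integrity check; refusing to unpickle")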

In a corporate setting, always be sure to confirm with your IT department. You want to be sure proper authentication, encryption, and authorization are all "set to go" when loading and saving data over servers and networks.

Pickle/CPickle

If you are confident you are using nothing but trusted sources and speed is a major concern over security and file size, pickle might be the way to go. In addition, you can take a few extra security measures when unpickling. In Python 2 these used cPickle; in Python 3 the C implementation is used by the pickle module automatically, and the equivalent hook is overriding Unpickler.find_class (a minimal sketch follows this list):

  1. Use a cPickle.Unpickler instance, and set its "find_global" attribute to None to disable importing any modules (thus restricting loading to builtin types such as dict, int, list, string, etc).

  2. Use a cPickle.Unpickler instance, and set its "find_global" attribute to a function that only allows importing of modules and names from a whitelist.

  3. Use something like the itsdangerous package to authenticate the data before unpickling it if you're loading it from an untrusted source.
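A minimal Python 3 sketch of options 1 and 2, along the lines of the "Restricting Globals" recipe in the pickle documentation (note that this particular whitelist only covers a few builtin types, so it would reject numpy arrays unless you also whitelist numpy's reconstruction helpers):

import builtins
import io
import pickle

SAFE_BUILTINS = {
    'dict', 'list', 'set', 'tuple', 'frozenset',
    'str', 'bytes', 'int', 'float', 'complex', 'bool',
}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Only allow a small whitelist of builtin types; refuse everything else.
        if module == 'builtins' and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data):
    """Analogue of pickle.loads that uses the restricted unpickler."""
    return RestrictedUnpickler(io.BytesIO(data)).load()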

Numpy

If you are only saving numpy data and no other python data, and security is a greater priority than file size and speed, then numpy might be the way to go.

HDF5/H5PY

If your data is truly large and complex, hdf5 format via h5py is good.
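A minimal h5py sketch, assuming h5py is installed (the file and dataset names are arbitrary):

import h5py
import numpy as np

arr = np.zeros((130000, 3))

# write: one dataset per array, with optional gzip compression
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('cnn_input', data=arr, compression='gzip')

# read: slicing with [:] pulls the dataset back into memory as a numpy array
with h5py.File('data.h5', 'r') as f:
    loaded = f['cnn_input'][:]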

JSON

And of course, this discussion wouldn't be complete without mentioning json. You may need to do extra work setting up encoding and decoding of your data, but nothing gets immediately run when you use json.load, so you can check the template/structure of the loaded data before you use it.
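A minimal sketch of that extra work: convert the array to nested lists on the way out, and check the structure before rebuilding the array on the way in (the file name and keys are arbitrary):

import json

import numpy as np

arr = np.zeros((1000, 2))

# encode: json cannot serialize ndarrays directly, so convert to nested lists
with open('data.json', 'w') as f:
    json.dump({'shape': list(arr.shape), 'data': arr.tolist()}, f)

# decode: inspect the structure before reconstructing the array
with open('data.json') as f:
    payload = json.load(f)

if isinstance(payload, dict) and 'data' in payload:
    restored = np.asarray(payload['data'])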

DISCLAIMER: I take no responsibility for end-user security with this provided information. The above information is for informational purposes only. Please use proper discretion and appropriate measures (including corporate policies, where applicable) with regard to security needs.

Outskirts answered 20/2, 2022 at 19:39 Comment(0)
L
1

You should use numpy.save() for saving numpy matrices.

Lashawn answered 26/6, 2019 at 6:19 Comment(1)
I'm confused, what are the pros/cons of pickle vs np.save/z etc?Rockett
D
0

In your code, you're using

if load:
    fileObject2 = open(fileName, 'wb')
    modelInput = pkl.load(fileObject2)
    fileObject2.close()

The second argument to the open function is the mode. w stands for writing, r for reading. The second character, b, denotes that the file is opened in binary mode (bytes will be read/written). A file that has been opened for writing cannot be read from, and vice versa. Therefore, opening the file with fileObject2 = open(fileName, 'rb') will do the trick.
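Putting it together, a corrected version of the question's round trip might look like this (the final comparison uses np.array_equal, since == on arrays returns an element-wise array rather than a single boolean):

import pickle as pkl

import numpy as np

arrayInput = np.zeros((1000, 2))  # trial input
fileName = 'CNN_Input.pkl'

with open(fileName, 'wb') as fileObject:     # write in binary mode
    pkl.dump(arrayInput, fileObject)

with open(fileName, 'rb') as fileObject2:    # read in binary mode
    modelInput = pkl.load(fileObject2)

print(np.array_equal(arrayInput, modelInput))  # True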

Darden answered 24/6, 2020 at 11:15 Comment(0)
D
0

The easiest way to save and load a NumPy array:

# a numpy array

result.importances_mean
array([-1.43651529e-03, -2.73401297e-03,  9.26784059e-05, -7.41427247e-04,
        3.56811863e-03,  2.78035218e-03,  3.70713624e-03,  5.51436515e-03,
        1.16821131e-01,  9.26784059e-05,  9.26784059e-04, -1.80722892e-03,
       -1.71455051e-03, -1.29749768e-03, -9.26784059e-05, -1.43651529e-03,
        0.00000000e+00, -1.11214087e-03, -4.63392030e-05, -4.63392030e-04,
        1.20481928e-03,  5.42168675e-03, -5.56070436e-04,  8.34105653e-04,
       -1.85356812e-04,  0.00000000e+00, -9.73123262e-04, -1.43651529e-03,
       -1.76088971e-03])

# save the array: np.save(filename.npy, array)

np.save(os.path.join(model_path, "permutation_imp.npy"), result.importances_mean)

# load the array: np.load(filename.npy)

res = np.load(os.path.join(model_path, "permutation_imp.npy"))
res
array([-1.43651529e-03, -2.73401297e-03,  9.26784059e-05, -7.41427247e-04,
        3.56811863e-03,  2.78035218e-03,  3.70713624e-03,  5.51436515e-03,
        1.16821131e-01,  9.26784059e-05,  9.26784059e-04, -1.80722892e-03,
       -1.71455051e-03, -1.29749768e-03, -9.26784059e-05, -1.43651529e-03,
        0.00000000e+00, -1.11214087e-03, -4.63392030e-05, -4.63392030e-04,
        1.20481928e-03,  5.42168675e-03, -5.56070436e-04,  8.34105653e-04,
       -1.85356812e-04,  0.00000000e+00, -9.73123262e-04, -1.43651529e-03,
       -1.76088971e-03])
Dna answered 16/6, 2021 at 17:21 Comment(0)
