The short answer is: you should use np.save
and np.load
.
The advantage of using these functions is that they are made by the developers of the Numpy library and they already work (plus are likely optimized nicely for processing speed).
For example:
import numpy as np
from pathlib import Path
path = Path('~/data/tmp/').expanduser()
path.mkdir(parents=True, exist_ok=True)
lb,ub = -1,1
num_samples = 5
x = np.random.uniform(low=lb,high=ub,size=(1,num_samples))
y = x**2 + x + 2
np.save(path/'x', x)
np.save(path/'y', y)
x_loaded = np.load(path/'x.npy')
y_load = np.load(path/'y.npy')
print(x is x_loaded) # False
print(x == x_loaded) # [[ True True True True True]]
Expanded answer:
In the end it really depends in your needs because you can also save it in a human-readable format (see Dump a NumPy array into a csv file) or even with other libraries if your files are extremely large (see best way to preserve numpy arrays on disk for an expanded discussion).
However, (making an expansion since you use the word "properly" in your question) I still think using the numpy function out of the box (and most code!) most likely satisfy most user needs. The most important reason is that it already works. Trying to use something else for any other reason might take you on an unexpectedly LONG rabbit hole to figure out why it doesn't work and force it work.
Take for example trying to save it with pickle. I tried that just for fun and it took me at least 30 minutes to realize that pickle wouldn't save my stuff unless I opened & read the file in bytes mode with wb
. It took time to google the problem, test potential solutions, understand the error message, etc... It's a small detail, but the fact that it already required me to open a file complicated things in unexpected ways. To add to that, it required me to re-read this (which btw is sort of confusing): Difference between modes a, a+, w, w+, and r+ in built-in open function?.
So if there is an interface that meets your needs, use it unless you have a (very) good reason (e.g. compatibility with matlab or for some reason your really want to read the file and printing in Python really doesn't meet your needs, which might be questionable). Furthermore, most likely if you need to optimize it, you'll find out later down the line (rather than spending ages debugging useless stuff like opening a simple Numpy file).
So use the interface/numpy provide. It might not be perfect, but it's most likely fine, especially for a library that's been around as long as Numpy.
I already spent the saving and loading data with numpy in a bunch of way so have fun with it. Hope this helps!
import numpy as np
import pickle
from pathlib import Path
path = Path('~/data/tmp/').expanduser()
path.mkdir(parents=True, exist_ok=True)
lb,ub = -1,1
num_samples = 5
x = np.random.uniform(low=lb,high=ub,size=(1,num_samples))
y = x**2 + x + 2
# using save (to npy), savez (to npz)
np.save(path/'x', x)
np.save(path/'y', y)
np.savez(path/'db', x=x, y=y)
with open(path/'db.pkl', 'wb') as db_file:
pickle.dump(obj={'x':x, 'y':y}, file=db_file)
## using loading npy, npz files
x_loaded = np.load(path/'x.npy')
y_load = np.load(path/'y.npy')
db = np.load(path/'db.npz')
with open(path/'db.pkl', 'rb') as db_file:
db_pkl = pickle.load(db_file)
print(x is x_loaded)
print(x == x_loaded)
print(x == db['x'])
print(x == db_pkl['x'])
print('done')
Some comments on what I learned:
np.save
as expected, this already compresses it well (see https://mcmap.net/q/151742/-save-numpy-array-using-pickle), works out of the box without any file opening. Clean. Easy. Efficient. Use it.
np.savez
uses a uncompressed format (see docs) Save several arrays into a single file in uncompressed .npz
format. If you decide to use this (you were warned about going away from the standard solution so expect bugs!) you might discover that you need to use argument names to save it, unless you want to use the default names. So don't use this if the first already works (or any works use that!)
- Pickle also allows for arbitrary code execution. Some people might not want to use this for security reasons.
- Human-readable files are expensive to make etc. Probably not worth it.
- There is something called
hdf5
for large files. Cool! https://mcmap.net/q/88749/-best-way-to-preserve-numpy-arrays-on-disk
Note that this is not an exhaustive answer. But for other resources check this:
np.save()
andnp.load()
. – Gradelyscipy.io.savemat
andscipy.io.loadmat
. – Gradelyfromfile
is to read the data as binary.loadtxt
is the correct pairing withsavetxt
. Look at the function documentation. – Lucindalucinepickle
? – Calcaneustorch.save
? – Calcaneus