Many answers overlook one very important thing: security.
Pickled data is not inert. A pickle is a stream of instructions for Python's unpickler, and those instructions can import modules and call arbitrary callables, so code can run as soon as you call `pickle.load`. If you load a file from an untrusted source, it can execute arbitrary code on your machine, which an attacker could use to stage further attacks over a network, among other things (e.g. see this realpython.com article).
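To make the risk concrete, here is a deliberately harmless sketch. The `Payload` class is hypothetical and uses the built-in `len` as a stand-in for whatever callable an attacker would actually choose (for example, something that runs a shell command).

```python
import pickle


class Payload:
    """Hypothetical object whose pickle tells the loader to call a function."""

    def __reduce__(self):
        # On load, the unpickler calls len("abc") instead of rebuilding Payload.
        # A malicious file would name a dangerous callable here instead.
        return (len, ("abc",))


data = pickle.dumps(Payload())

# Loading runs the call recorded in the stream; no Payload instance comes back.
result = pickle.loads(data)
print(type(result).__name__, result)
```

The loaded value is whatever the recorded callable returned, which shows that `pickle.load` executes instructions from the file rather than just reading values.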
Pure pickled data may be faster to save/load if you skip bz2 compression (at the cost of a larger file), but `numpy` load/save may be more secure.
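Alternatively, you can save purely pickled data together with a keyed hash (HMAC) computed using the built-in `hashlib` and `hmac` libraries and, before loading, recompute the hash of the file with your secret key and compare it against the stored value. Note that this authenticates the file; it does not encrypt it: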

    import hashlib
    import hmac


    def calculate_hash(
        key_,
        file_path,
        hash_=hashlib.sha256
    ):
        with open(file_path, "rb") as fp:
            file_hash = hmac.new(key_, fp.read(), hash_).hexdigest()
        return file_hash


    def compare_hash(
        hash1,
        hash2,
    ):
        """
        Warning:
            Do not use `==` directly to compare hash values. Timing attacks can be used
            to learn your security key. Use ``compare_digest()``.
        """
        return hmac.compare_digest(hash1, hash2)

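As a rough illustration of how such a check can sit in front of `pickle`, the following self-contained sketch verifies the HMAC and then unpickles the exact bytes it verified. The names `SECRET_KEY`, `sign_bytes`, and `load_verified` are hypothetical, and the hard-coded key is an assumption for the demo only; a real key would come from a secure store.

```python
import hashlib
import hmac
import os
import pickle
import tempfile

# Hypothetical key for illustration; never hard-code a real secret.
SECRET_KEY = b"replace-with-a-real-secret"


def sign_bytes(key, payload):
    """Return the hex HMAC-SHA256 digest of payload."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def load_verified(key, file_path, expected_digest):
    """Unpickle file_path only if its HMAC matches expected_digest."""
    with open(file_path, "rb") as fp:
        payload = fp.read()
    if not hmac.compare_digest(sign_bytes(key, payload), expected_digest):
        raise ValueError("HMAC mismatch: refusing to unpickle")
    # Deserialize the same bytes that were verified, not a second read.
    return pickle.loads(payload)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.pkl")
    with open(path, "wb") as fp:
        pickle.dump({"a": 1}, fp)
    with open(path, "rb") as fp:
        digest = sign_bytes(SECRET_KEY, fp.read())

    restored = load_verified(SECRET_KEY, path, digest)

    # Simulate tampering by appending bytes to the file.
    with open(path, "ab") as fp:
        fp.write(b"tampered")
    try:
        load_verified(SECRET_KEY, path, digest)
        tamper_detected = False
    except ValueError:
        tamper_detected = True
```

Reading the file once and passing the verified bytes to `pickle.loads` avoids a gap in which the file could change between the check and the load.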
In a corporate setting, always confirm with your IT department. Make sure proper authentication, encryption, and authorization are all in place when loading and saving data over servers and networks.
Pickle/CPickle
If you are confident you are using nothing but trusted sources, and speed matters more than security and file size, `pickle` might be the way to go. You can also take a few extra security measures. In Python 2 these relied on `cPickle`; in Python 3 `cPickle` no longer exists as a separate module, because `pickle` uses the C implementation automatically, and the hook is `Unpickler.find_class` rather than `find_global`:
Python 2: use a `cPickle.Unpickler` instance and set its `find_global` attribute to `None` to disable importing any modules (thus restricting loading to built-in types such as `dict`, `int`, `list`, `str`, etc.).
Python 2: use a `cPickle.Unpickler` instance and set its `find_global` attribute to a function that only allows importing modules and names from a whitelist. In Python 3, subclass `pickle.Unpickler` and override `find_class(module, name)` to the same effect.
Use something like the `itsdangerous` package to authenticate the data before unpickling it if you're loading it from an untrusted source.
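For Python 3, a whitelist-based unpickler can be sketched as follows, closely following the pattern shown in the standard library documentation. The names `SAFE_BUILTINS`, `RestrictedUnpickler`, and `restricted_loads` are illustrative, and the whitelist contents are an assumption you would adapt to your own data.

```python
import builtins
import io
import pickle

# Illustrative whitelist: only these built-in names may be resolved on load.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream asks for a global (class/function).
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        # Everything else is refused, including os.system, eval, etc.
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads(), but resolves only whitelisted globals."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


# Plain containers and whitelisted types load normally.
ok = restricted_loads(pickle.dumps([1, 2, range(3)]))

# A stream that references a non-whitelisted global is rejected.
try:
    restricted_loads(pickle.dumps(eval))
    blocked = False
except pickle.UnpicklingError:
    blocked = True
```

This narrows what a pickle can reach, but it is a mitigation rather than a guarantee; authenticating the data first is still advisable for untrusted sources.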
Numpy
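If you are only saving `numpy` data and no other Python data, and security is a higher priority than file size and speed, then `numpy` might be the way to go. Keep `allow_pickle=False` when calling `np.load` (the default since NumPy 1.16.3), because object arrays are stored with pickle and carry the same risk.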
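A small sketch of the safer NumPy round trip, assuming `numpy` is installed; the file names and array contents are made up for the example. It also shows that an object array, which NumPy stores via pickle, is refused when `allow_pickle=False`.

```python
import os
import tempfile

import numpy as np

arr = np.arange(6, dtype=np.float64).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    # Plain numeric arrays never need pickle, so forbid it on both sides.
    path = os.path.join(tmp, "arr.npy")
    np.save(path, arr, allow_pickle=False)
    loaded = np.load(path, allow_pickle=False)

    # Object arrays are serialized with pickle; loading them is rejected
    # when allow_pickle=False.
    obj_path = os.path.join(tmp, "obj.npy")
    np.save(obj_path, np.array([{"a": 1}], dtype=object))
    try:
        np.load(obj_path, allow_pickle=False)
        rejected = False
    except ValueError:
        rejected = True
```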
HDF5/H5PY
If your data is truly large and complex, the `hdf5` format via `h5py` is a good choice.
JSON
And of course, this discussion wouldn't be complete without mentioning `json`. You may need to do extra work setting up encoding and decoding of your data, but `json.load` only builds plain Python objects (dicts, lists, strings, numbers, booleans, and `None`) and does not execute anything, so you can check the structure of the loaded data before you use it.
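Checking the structure after loading might look like the sketch below. The schema in `EXPECTED_FIELDS` and the function name `load_checked` are hypothetical; substitute whatever shape your own data should have.

```python
import json

# Hypothetical schema: required top-level keys and the type each must have.
EXPECTED_FIELDS = {"name": str, "values": list}


def load_checked(text):
    """Parse JSON text and validate its shape before returning it."""
    data = json.loads(text)  # builds plain Python objects only
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    for key, expected_type in EXPECTED_FIELDS.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError(
                f"field {key!r} is missing or not {expected_type.__name__}"
            )
    return data


record = load_checked('{"name": "run-1", "values": [1, 2, 3]}')
```

Malformed input fails with a `ValueError` before any of it reaches the rest of the program.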
DISCLAIMER: I take no responsibility for end-user security with this provided information. The above information is for informational purposes only. Please use proper discretion and appropriate measures (including corporate policies, where applicable) with regard to security needs.