Pickle incompatibility of numpy arrays between Python 2 and 3

I am trying to load the MNIST dataset linked here in Python 3.2 using this program:

import pickle
import gzip
import numpy


with gzip.open('mnist.pkl.gz', 'rb') as f:
    l = list(pickle.load(f))
    print(l)

Unfortunately, it gives me the error:

Traceback (most recent call last):
   File "mnist.py", line 7, in <module>
     train_set, valid_set, test_set = pickle.load(f)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x90 in position 614: ordinal not in range(128)

I then tried to decode the pickled file in Python 2.7, and re-encode it. So, I ran this program in Python 2.7:

import pickle
import gzip
import numpy


with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f)

    # Printing out the three objects reveals that they are
    # all pairs containing numpy arrays.

    with gzip.open('mnistx.pkl.gz', 'wb') as g:
        pickle.dump(
            (train_set, valid_set, test_set),
            g,
            protocol=2)  # I also tried protocol 0.

It ran without error, so I reran this program in Python 3.2:

import pickle
import gzip
import numpy

# note the filename change
with gzip.open('mnistx.pkl.gz', 'rb') as f:
    l = list(pickle.load(f))
    print(l)

However, it gave me the same error as before. How do I get this to work?


This is a better approach for loading the MNIST dataset.

Hash answered 3/7, 2012 at 6:46 Comment(9)
There are compatibility breaks between 2.7 and 3.x, especially string vs unicode. And pickling a numpy object requires that both systems load the numpy module, but those modules are different. Sorry, I don't have an answer, but this might not be do-able and is probably not advisable. If these are big things (gzip), maybe hdf5 with pytables?Cockup
@PhilCooper: Thanks, your comment (post this as an answer?) clued me in to the right answer. I could have used hdf5, but it seemed complicated to learn, so I went with numpy.save/load and this worked.Hash
h5py is very simple to use, almost certainly much easier than solving nebulous compatibility problems with pickling numpy arrays.Synchronous
You say you "ran this program under Python 2.7". OK but what did you run under 3.2? :-) The same?Truthful
@LennartRegebro: After running the second program that pickles the arrays, I ran the first program (substituting the filename mnistx.pkl.gz) in Python 3.2. It didn't work, which I think illustrates some kind of incompatibility.Hash
@NeilG: It would be a good idea if you listed the program that actually gives you the error. Specifically, this error looks like you opened the file in text mode.Truthful
@LennartRegebro: Okay, done. Please let me know if you end up reproducing the error. Thanks.Hash
@NeilG thanks for the link to the 'better approach,' but could you clarify how to do it? What code did you run for this?Cursor
@KevinZhao Read the docs I linked or ask a question including what you tried if it's unclear how to use that function.Hash

This seems like some sort of incompatibility. It's trying to load a "binstring" object, which is assumed to be ASCII, while in this case it is binary data. Whether this is a bug in the Python 3 unpickler or a "misuse" of the pickler by numpy, I don't know.

Here is something of a workaround, but I don't know how meaningful the data is at this point:

import pickle
import gzip
import numpy

with open('mnist.pkl', 'rb') as f:   # mnist.pkl is the decompressed file; use gzip.open for mnist.pkl.gz
    u = pickle._Unpickler(f)         # the pure-Python unpickler allows setting the encoding after construction
    u.encoding = 'latin1'            # decode Python 2 8-bit strings byte for byte
    p = u.load()
    print(p)

Unpickling it in Python 2 and then repickling it is only going to create the same problem again, so you need to save it in another format.
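For example, here is a rough sketch of that idea using numpy's own .npz format (the mnist.npz filename and key names are just illustrative; the question notes that each set is a pair of numpy arrays). Run something like this under Python 2:

import gzip
import pickle
import numpy

# Unpickle once under Python 2, then re-save in numpy's .npz format,
# which both Python 2 and Python 3 can read.
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f)

numpy.savez('mnist.npz',
            train_x=train_set[0], train_y=train_set[1],
            valid_x=valid_set[0], valid_y=valid_set[1],
            test_x=test_set[0], test_y=test_set[1])

and then load it back under Python 3 with:

import numpy

data = numpy.load('mnist.npz')
train_x, train_y = data['train_x'], data['train_y']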

Truthful answered 3/7, 2012 at 15:48 Comment(6)
You can use pickle.load(file_obj, encoding='latin1') (at least in Python 3.3). This seems to work.Vitrine
For those who are using numpy.load and facing a similar problem: it is possible to pass an encoding there as well: np.load('./bvlc_alexnet.npy', encoding='latin1')Fresno
This worked for me when adding encoding='latin1' failed. Thanks!Privilege
For my case, only pickle.load(open(file_path, "rb"), encoding="latin1") worked.Target
Why is the encoding latin1? Why isn't it bytes?Toolmaker
Latin1 is an encoding that just maps each byte in a byte string to the character with the same value in unicode, so that's why.Truthful

If you are getting this error in Python 3, it could be an incompatibility issue between Python 2 and Python 3. For me, the solution was to load with latin1 encoding:

pickle.load(file, encoding='latin1')
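
Applied to the gzipped MNIST file from the question, that looks roughly like this (a sketch; the three-way unpacking assumes the train/validation/test split described in the question):

import gzip
import pickle

with gzip.open('mnist.pkl.gz', 'rb') as f:
    # latin1 maps every byte to the code point of the same value,
    # so the numpy array data passes through unchanged
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
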
Lanfranc answered 28/12, 2016 at 17:17 Comment(0)

It appears to be an incompatibility issue between Python 2 and Python 3. I tried loading the MNIST dataset with

    train_set, valid_set, test_set = pickle.load(file, encoding='iso-8859-1')

and it worked in Python 3.5.2.

Thill answered 3/2, 2017 at 3:49 Comment(1)
it worked when 'latin1' failedArbitrage

It looks like there are some compatibility issues in pickle between 2.x and 3.x due to the move to unicode. Your file appears to have been pickled with Python 2.x, and decoding it in 3.x could be troublesome.

I'd suggest unpickling it with python 2.x and saving to a format that plays more nicely across the two versions you're using.

Terenceterencio answered 3/7, 2012 at 6:57 Comment(2)
That's what I was trying to do. Which format do you recommend?Hash
I think the problem might have been encoding numpy dtype, which might be a string. In any case, I ended up using numpy.save/load to bridge the gap between python 2 and 3, and this worked.Hash

I just stumbled upon this snippet. Hope this helps to clarify the compatibility issue.

import sys
import gzip
import pickle

with gzip.open('mnist.pkl.gz', 'rb') as f:
    if sys.version_info.major > 2:
        train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
    else:
        train_set, valid_set, test_set = pickle.load(f)
Accumulator answered 28/10, 2017 at 17:12 Comment(2)
Consider adding more amplifying information. How does this solve the problem?Wandis
@Accumulator That helped; please add an explanation to the answer.Luminal

Try:

l = list(pickle.load(f, encoding='bytes'))   # if you are loading image data, or
l = list(pickle.load(f, encoding='latin1'))  # if you are loading text data

From the documentation of the pickle.load method:

Optional keyword arguments are fix_imports, encoding and errors, which are used to control compatibility support for pickle stream generated by Python 2.

If fix_imports is True, pickle will try to map the old Python 2 names to the new names used in Python 3.

The encoding and errors tell pickle how to decode 8-bit string instances pickled by Python 2; these default to 'ASCII' and 'strict', respectively. The encoding can be 'bytes' to read these 8-bit string instances as bytes objects.
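
One caveat, as a rough note rather than part of the documentation quoted above: with encoding='bytes', anything that was a Python 2 str comes back as bytes, including dict keys, so you may need to decode them yourself. A minimal sketch, assuming the pickled object is a dict and using a hypothetical filename:

import pickle

# Hypothetical example: a file pickled under Python 2 whose top-level
# object is a dict with str keys.
with open('data_py2.pkl', 'rb') as f:
    d = pickle.load(f, encoding='bytes')

# The keys come back as bytes; decode them if you want ordinary str keys.
d = {k.decode('latin1'): v for k, v in d.items()}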

Ebersole answered 28/11, 2018 at 21:28 Comment(0)

There is hickle, which is faster than pickle and easier to use. I tried to save and read the data with a pickle dump, but while reading there were a lot of problems; I wasted an hour and still didn't find a solution (I was working on my own data to create a chatbot).

vec_x and vec_y are numpy arrays:

import hickle as hkl

data = [vec_x, vec_y]
hkl.dump(data, 'new_data_file.hkl')

Then you just read it and perform the operations:

data2 = hkl.load('new_data_file.hkl')
Runabout answered 13/7, 2018 at 20:18 Comment(0)
