How to unpack pkl file

I have a pkl file from the MNIST dataset, which consists of handwritten digit images.

I'd like to take a look at each of those digit images, so I need to unpack the pkl file. Is there a way to unpack/unzip a pkl file?

Poachy answered 23/7, 2014 at 8:58 Comment(0)

Generally

Your pkl file is, in fact, a serialized pickle file, which means it has been dumped using Python's pickle module.

To un-pickle the data you can:

import pickle


with open('serialized.pkl', 'rb') as f:
    data = pickle.load(f)
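Once the file is loaded, it usually helps to check what kind of object came out of it. The following is a minimal round-trip sketch; the file name demo.pkl and the sample dictionary are made up for illustration and are not part of the question:

```python
import os
import pickle
import tempfile

# Hypothetical sample object; any picklable Python object works the same way.
sample = {"digits": [0, 1, 2], "label": "demo"}

# Write the object to a temporary .pkl file (the name is an assumption).
path = os.path.join(tempfile.mkdtemp(), "demo.pkl")
with open(path, "wb") as f:
    pickle.dump(sample, f)

# Read it back; the file must be opened in binary mode ('rb').
with open(path, "rb") as f:
    restored = pickle.load(f)

# Inspect the type first, then compare with the original object.
print(type(restored).__name__)
print(restored == sample)
```

Printing the type before anything else tells you whether you are dealing with a tuple, a dict, an array, or a custom class.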

For the MNIST data set

Note gzip is only needed if the file is compressed:

import gzip
import pickle


with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f)

Each set can be further divided (e.g. for the training set):

train_x, train_y = train_set

Those would be the inputs (digits) and outputs (labels) of your sets.
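On Python 3, a pickle that was written by Python 2 may fail to load unless an encoding is passed to pickle.load. The sketch below assumes that this applies to your copy of the file and uses encoding='latin1', which the pickle documentation names for Python 2 pickles containing NumPy arrays. load_pickled_sets and describe_split are hypothetical helper names, not library functions:

```python
import gzip
import os
import pickle


def load_pickled_sets(path, encoding="latin1"):
    """Load a gzip-compressed pickle, passing an encoding for Python 2 pickles.

    Hypothetical helper; the default encoding is an assumption about the file.
    """
    with gzip.open(path, "rb") as f:
        return pickle.load(f, encoding=encoding)


def describe_split(name, split):
    """Return a one-line summary of an (inputs, labels) pair."""
    inputs, labels = split
    return f"{name}: {len(inputs)} inputs, {len(labels)} labels"


# Only runs if the file from the answer is present in the working directory.
if os.path.exists("mnist.pkl.gz"):
    train_set, valid_set, test_set = load_pickled_sets("mnist.pkl.gz")
    print(describe_split("train", train_set))
```

If the file was written by Python 3 in the first place, the encoding argument is simply ignored for the data and the plain call works just as well.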

If you want to display the digits:

import matplotlib.cm as cm
import matplotlib.pyplot as plt


plt.imshow(train_x[0].reshape((28, 28)), cmap=cm.Greys_r)
plt.show()

[image: mnist_digit]

The other alternative would be to look at the original data:

http://yann.lecun.com/exdb/mnist/

But that will be harder, as you'll need to write a program to read the binary data in those files. So I recommend using Python and loading the data with pickle. As you've seen, it's very easy. ;-)

Liard answered 1/8, 2014 at 11:22 Comment(4)
Is there also a way to make a pkl file out of the image files that I have? – Poachy
Could be plain-old pickled, right? As opposed to cPickled? I'm not sure about the MNIST dataset, but for pkl files in general, pickle.load works for unpacking, though I guess it performs less well than cPickle.load. For pkl files on the smaller side, the performance difference is not noticeable. – Pappy
Also remember that the open function defaults to mode r (text read), so it is important to open the file in rb mode. If b (binary) is not added, unpickling might result in a UnicodeDecodeError. – Weep
People using the pickle module should keep in mind that it is not secure and should only be used to unpickle data from trusted sources, as arbitrary code can be executed during the unpickling process. If you are producing pickles, consider signing the data with hmac to ensure it has not been tampered with, or use an alternative form of serialisation such as JSON. – Bipod
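The hmac suggestion in the last comment could look roughly like the sketch below. It is one possible layout, not a standard format: the tag-then-payload framing and the helper names dumps_signed and loads_signed are assumptions made for illustration, and the scheme only helps if the key stays secret.

```python
import hashlib
import hmac
import pickle

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def dumps_signed(obj, key):
    """Pickle obj and prepend an HMAC-SHA256 tag computed with key."""
    payload = pickle.dumps(obj)
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def loads_signed(blob, key):
    """Verify the tag before unpickling; raise ValueError on mismatch."""
    tag, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking information through timing differences.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)
```

The important design point is that pickle.loads is only reached after the signature check has passed.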

Handy one-liner

pkl() (
  python -c 'import pickle,sys;d=pickle.load(open(sys.argv[1],"rb"));print(d)' "$1"
)
pkl my.pkl

This prints the __str__ representation of the unpickled object.

The generic problem of visualizing an object is of course undefined, so if __str__ is not enough, you will need a custom script. @dataclass + pprint may be of interest: Is there a built-in function to print all the current properties and values of an object?
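As an example of such a custom script, the sketch below prints an outline of nested lists and tuples (type and length only) instead of the full contents. The function name outline and its parameters are hypothetical and chosen for illustration:

```python
def outline(obj, max_depth=2, max_children=3, _depth=0):
    """Return lines describing the type (and length, if any) of obj.

    Lists and tuples are expanded up to max_depth levels, showing at most
    max_children items per container.
    """
    indent = "  " * _depth
    try:
        size = f" len={len(obj)}"
    except TypeError:  # object has no usable length
        size = ""
    lines = [f"{indent}{type(obj).__name__}{size}"]
    if _depth < max_depth and isinstance(obj, (list, tuple)):
        for child in obj[:max_children]:
            lines.extend(outline(child, max_depth, max_children, _depth + 1))
    return lines


print("\n".join(outline(([1, 2], [3]))))
```

For a large nested structure this gives a quick picture of the layout without flooding the terminal.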

Mass direct extraction of MNIST -idx3-ubyte.gz files to PNG

You can also easily download the official dataset files from http://yann.lecun.com/exdb/mnist/ and expand them to PNGs using the script from: https://github.com/myleott/mnist_png

Related: How to put my dataset in a .pkl file in the exact format and data structure used in "mnist.pkl.gz"?
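For the reverse direction, a gzip-compressed pickle can be written with the same two modules. The sketch below assumes the layout described in the accepted answer (a tuple of three (inputs, labels) pairs) and uses toy lists instead of real image data; save_sets and the file name are hypothetical:

```python
import gzip
import os
import pickle
import tempfile


def save_sets(path, train_set, valid_set, test_set):
    """Write three (inputs, labels) pairs as one gzip-compressed pickle."""
    with gzip.open(path, "wb") as f:
        pickle.dump((train_set, valid_set, test_set), f,
                    protocol=pickle.HIGHEST_PROTOCOL)


# Toy stand-ins: each input is a flat list of pixel values (assumption).
train = ([[0.0, 0.5], [1.0, 0.25]], [3, 7])
valid = ([[0.1, 0.2]], [1])
test = ([[0.9, 0.8]], [4])

out_path = os.path.join(tempfile.mkdtemp(), "my_dataset.pkl.gz")
save_sets(out_path, train, valid, test)

# Read the file back to confirm that the structure survived.
with gzip.open(out_path, "rb") as f:
    loaded_train, loaded_valid, loaded_test = pickle.load(f)
print(loaded_train[1])
```

This does not reproduce the exact dtypes or shapes of the real file; it only mirrors the overall tuple structure.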

Triparted answered 8/12, 2016 at 11:1 Comment(0)

In case you want to work with the original MNIST files, here is how you can deserialize them.

If you haven't downloaded the files yet, do that first by running the following in the terminal:

wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz

Then save the following as deserialize.py and run it.

import numpy as np
import gzip

IMG_DIM = 28

def decode_image_file(fname):
    n_bytes_per_img = IMG_DIM*IMG_DIM

    with gzip.open(fname, 'rb') as f:
        bytes_ = f.read()

    # Skip the 16-byte header (magic number, image count, rows, columns).
    data = bytes_[16:]

    if len(data) % n_bytes_per_img != 0:
        raise ValueError('Pixel data is not a whole number of images')

    # One row per image; count the rows from the pixel data, not the whole file.
    return np.frombuffer(data, dtype=np.uint8).reshape(
        len(data)//n_bytes_per_img, n_bytes_per_img)

def decode_label_file(fname):
    with gzip.open(fname, 'rb') as f:
        bytes_ = f.read()

    # Skip the 8-byte header (magic number, label count); one byte per label.
    data = bytes_[8:]

    return np.frombuffer(data, dtype=np.uint8)

train_images = decode_image_file('train-images-idx3-ubyte.gz')
train_labels = decode_label_file('train-labels-idx1-ubyte.gz')

test_images = decode_image_file('t10k-images-idx3-ubyte.gz')
test_labels = decode_label_file('t10k-labels-idx1-ubyte.gz')

The script doesn't scale the pixel values to the range [0, 1] as the pickled file does. To do that, divide by 255:

train_images = train_images/255
test_images = test_images/255
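
The fixed offsets of 16 and 8 bytes used in the script come from the IDX header: two zero bytes, one byte for the data type, one byte for the number of dimensions, then one 4-byte big-endian integer per dimension. A sketch that reads the header instead of hard-coding the offsets (parse_idx_header is a hypothetical helper name):

```python
import struct


def parse_idx_header(buf):
    """Return (dtype_code, dims, data_offset) for an IDX-formatted buffer."""
    # Magic number: two zero bytes, a type code, and the dimension count.
    zero, dtype_code, ndim = struct.unpack(">HBB", buf[:4])
    if zero != 0:
        raise ValueError("not an IDX file: first two bytes must be zero")
    # Each dimension size is a 4-byte big-endian unsigned integer.
    dims = struct.unpack(">" + "I" * ndim, buf[4:4 + 4 * ndim])
    return dtype_code, dims, 4 + 4 * ndim


# Synthetic header for two 28x28 images of unsigned bytes (type code 0x08).
fake_images = struct.pack(">HBBIII", 0, 8, 3, 2, 28, 28) + bytes(2 * 28 * 28)
print(parse_idx_header(fake_images))
```

With three dimensions the data starts at byte 16, and with one dimension (the label files) at byte 8, which matches the slices used in the script.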
Supersensual answered 23/11, 2018 at 14:6 Comment(0)

You need to use the pickle module (and gzip, if the file is compressed).

NOTE: Both are already in the Python standard library, so there is no need to install anything new.
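
If you don't know in advance whether a file is compressed, one option is to look at its first two bytes: gzip files start with 0x1f 0x8b. A minimal sketch, where load_any_pickle is a hypothetical helper name:

```python
import gzip
import pickle

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip file


def load_any_pickle(path):
    """Unpickle path, using gzip only when the file is gzip-compressed."""
    with open(path, "rb") as f:
        is_gzip = f.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    with opener(path, "rb") as f:
        return pickle.load(f)
```

The same call then works for both serialized.pkl and mnist.pkl.gz style files, regardless of the file extension.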

Baddie answered 10/9, 2019 at 15:20 Comment(0)

The pandas library makes unpickling very simple. It uses pickle from the standard library under the hood but also takes care of some common issues (for example, to open the pickled MNIST data you probably need to pass an encoding, as in pickle.load(f, encoding='bytes'); pandas handles such issues for you).

In general, to unpickle a file:

import pandas as pd
data = pd.read_pickle("serialized.pkl")

It can handle compressed files as well.

train_set, valid_set, test_set = pd.read_pickle("mnist.pkl.gz")

You can also pass a URL to it (see note 1 below):

url = "https://raw.githubusercontent.com/mnielsen/neural-networks-and-deep-learning/master/data/mnist.pkl.gz"
train_set, valid_set, test_set = pd.read_pickle(url)

If we plot the first image in the training set and title it with its label:

import matplotlib.pyplot as plt
plt.imshow(train_set[0][0].reshape((28, 28)), cmap='gray')
plt.gca().set(title=train_set[1][0], xticks=[], yticks=[])

we get the following:

[image: result]

1. This URL points to the MNIST dataset stored in the GitHub repo of Neural Networks and Deep Learning by Michael Nielsen.

Peak answered 6/3 at 9:27 Comment(0)
