I have a pkl file from the MNIST dataset, which consists of handwritten digit images.
I'd like to take a look at each of those digit images, so I need to unpack the pkl file. Is there a way to unpack/unzip a pkl file?
Your pkl file is, in fact, a serialized pickle file, which means it has been dumped using Python's pickle module.
To un-pickle the data you can:
import pickle
with open('serialized.pkl', 'rb') as f:
    data = pickle.load(f)
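As a minimal round-trip sketch of the same idea, the snippet below first writes a small object and then reads it back. The sample object and the temporary file location are hypothetical stand-ins, not part of the MNIST data.

```python
import os
import pickle
import tempfile

# Hypothetical sample object standing in for whatever was pickled.
original = {"labels": [5, 0, 4], "shape": (28, 28)}

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "serialized.pkl")

# Writing: pickle produces bytes, so the file is opened in binary mode.
with open(path, "wb") as f:
    pickle.dump(original, f)

# Reading: binary mode again.
with open(path, "rb") as f:
    data = pickle.load(f)

print(type(data), data == original)
```

Whatever type went into pickle.dump comes back out of pickle.load, so inspecting type(data) is usually the first step with an unfamiliar pkl file.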
Note that gzip is only needed if the file is compressed:
import gzip
import pickle
with gzip.open('mnist.pkl.gz', 'rb') as f:
    # On Python 3, a pickle written by Python 2 may need encoding='latin1' here.
    train_set, valid_set, test_set = pickle.load(f)
Where each set can be further divided (e.g. for the training set):
train_x, train_y = train_set
Those would be the inputs (digits) and outputs (labels) of your sets.
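If you are not sure whether a given file is compressed, one possible approach is to check for the two-byte gzip magic number before choosing how to open it. The sketch below assumes a hypothetical helper named load_pickle and uses a small synthetic tuple that only mimics the three-set layout described above.

```python
import gzip
import os
import pickle
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def load_pickle(path):
    """Load a pickle, using gzip only if the file is compressed (hypothetical helper)."""
    with open(path, "rb") as f:
        is_gzip = f.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    with opener(path, "rb") as f:
        return pickle.load(f)


# Synthetic stand-in for the MNIST layout: three (inputs, labels) pairs.
fake_mnist = (
    ([[0.0] * 784, [1.0] * 784], [5, 0]),  # training set
    ([[0.5] * 784], [4]),                  # validation set
    ([[0.25] * 784], [1]),                 # test set
)

tmp = tempfile.mkdtemp()
plain_path = os.path.join(tmp, "fake.pkl")
gz_path = os.path.join(tmp, "fake.pkl.gz")
with open(plain_path, "wb") as f:
    pickle.dump(fake_mnist, f)
with gzip.open(gz_path, "wb") as f:
    pickle.dump(fake_mnist, f)

train_set, valid_set, test_set = load_pickle(gz_path)
train_x, train_y = train_set
print(len(train_x), len(train_x[0]), train_y)
```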
If you want to display the digits:
import matplotlib.cm as cm
import matplotlib.pyplot as plt
plt.imshow(train_x[0].reshape((28, 28)), cmap=cm.Greys_r)
plt.show()
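If no display is available, a rough text rendering can also give a quick look at a digit. This is only a sketch: render_ascii is a hypothetical helper, and it assumes the pixel values are already scaled to the range 0 to 1.

```python
def render_ascii(pixels, width=28, threshold=0.5):
    """Return a text picture of a flat sequence of pixel values in [0, 1]."""
    rows = []
    for start in range(0, len(pixels), width):
        row = pixels[start:start + width]
        # Bright pixels become '#', dark pixels become '.'.
        rows.append("".join("#" if p > threshold else "." for p in row))
    return "\n".join(rows)


# Tiny 4x4 synthetic "image" used only to demonstrate the helper.
tiny = [0, 1, 1, 0,
        0, 0, 1, 0,
        0, 0, 1, 0,
        0, 1, 1, 1]
print(render_ascii(tiny, width=4))
```

For a real digit, something like print(render_ascii(train_x[0])) should work, since each image is a flat vector of 784 values.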
The other alternative would be to look at the original data:
http://yann.lecun.com/exdb/mnist/
But that will be harder, as you'll need to create a program to read the binary data in those files. So I recommend using Python and loading the data with pickle. As you've seen, it's very easy. ;-)
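One caveat, raised in the comments below, is that pickle should only be loaded from trusted sources, and that signing with hmac is one way to detect tampering. A minimal sketch of that idea follows; the key and the helper names dumps_signed and loads_signed are hypothetical, and a real setup would need proper key management.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key, for illustration only


def dumps_signed(obj, key=SECRET_KEY):
    """Pickle obj and prepend a SHA-256 HMAC of the payload."""
    payload = pickle.dumps(obj)
    digest = hmac.new(key, payload, hashlib.sha256).digest()
    return digest + payload


def loads_signed(blob, key=SECRET_KEY):
    """Verify the HMAC before unpickling; raise ValueError on mismatch."""
    digest, payload = blob[:32], blob[32:]  # a SHA-256 digest is 32 bytes
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(digest, expected):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)


blob = dumps_signed({"digit": 7})
print(loads_signed(blob))

# Flip the last byte to simulate tampering.
tampered = blob[:-1] + bytes([blob[-1] ^ 0xFF])
try:
    loads_signed(tampered)
except ValueError as exc:
    print("rejected:", exc)
```

The signature is checked before pickle.loads is ever called, which is the point: a modified payload is rejected without being unpickled.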
For pkl files in general, pickle.load works for unpacking -- though I guess it performs less well than cPickle.load. For pkl files on the smaller side, the performance difference is not noticeable. – Pappy
The open function has a default mode of r (read), so it's important to open the file in rb mode. If b (binary) mode is not added, unpickling might result in a UnicodeDecodeError. – Weep
Anyone using the pickle module should keep in mind that it is not secure and should only be used to unpickle data from trusted sources, as there is the possibility of arbitrary code execution during the unpickling process. If you are producing pickles, consider signing data with hmac to ensure it has not been tampered with, or using alternative forms of serialisation like JSON. – Bipod
Handy one-liner
pkl() (
  python -c 'import pickle,sys;d=pickle.load(open(sys.argv[1],"rb"));print(d)' "$1"
)
pkl my.pkl
Will print __str__ for the pickled object.
The generic problem of visualizing an object is of course undefined, so if __str__ is not enough, you will need a custom script. @dataclass + pprint may be of interest: Is there a built-in function to print all the current properties and values of an object?
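As one possible shape for such a custom script, the sketch below summarizes the nested structure of an unpickled object instead of printing all of its contents. The describe function is a hypothetical helper, and the sample object is synthetic.

```python
import pickle


def describe(obj, depth=0, max_depth=3):
    """Return lines summarizing the nested structure of obj (hypothetical helper)."""
    indent = "  " * depth
    name = type(obj).__name__
    if hasattr(obj, "shape"):  # NumPy arrays and similar objects
        return [f"{indent}{name} shape={obj.shape}"]
    if isinstance(obj, dict) and depth < max_depth:
        lines = [f"{indent}dict len={len(obj)}"]
        for key, value in obj.items():
            lines.append(f"{indent}  key {key!r}:")
            lines.extend(describe(value, depth + 2, max_depth))
        return lines
    if isinstance(obj, (list, tuple)) and depth < max_depth:
        lines = [f"{indent}{name} len={len(obj)}"]
        for item in obj[:3]:  # only the first few items, to keep output short
            lines.extend(describe(item, depth + 1, max_depth))
        return lines
    return [f"{indent}{name}"]


# Synthetic object, round-tripped through pickle for the demonstration.
sample = pickle.loads(pickle.dumps(([1, 2, 3], {"a": 1.5})))
print("\n".join(describe(sample)))
```

For large objects such as the MNIST arrays this tends to be more readable than print(d), because only types, lengths and shapes are shown.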
Mass direct extraction of MNIST -idx3-ubyte.gz files to PNG
You can also easily download the official dataset files from http://yann.lecun.com/exdb/mnist/ and expand them to PNGs using the script from: https://github.com/myleott/mnist_png
Related: How to put my dataset in a .pkl file in the exact format and data structure used in "mnist.pkl.gz"?
In case you want to work with the original MNIST files, here is how you can deserialize them.
If you haven't downloaded the files yet, do that first by running the following in the terminal:
wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Then save the following as deserialize.py and run it.
import gzip

import numpy as np

IMG_DIM = 28


def decode_image_file(fname):
    # Each image is IMG_DIM x IMG_DIM unsigned bytes, stored after a 16-byte header.
    n_bytes_per_img = IMG_DIM*IMG_DIM
    with gzip.open(fname, 'rb') as f:
        bytes_ = f.read()
    data = bytes_[16:]
    if len(data) % n_bytes_per_img != 0:
        raise Exception('Something wrong with the file')
    # One row per image, one column per pixel.
    result = np.frombuffer(data, dtype=np.uint8).reshape(
        len(data)//n_bytes_per_img, n_bytes_per_img)
    return result


def decode_label_file(fname):
    # Labels are single unsigned bytes, stored after an 8-byte header.
    with gzip.open(fname, 'rb') as f:
        bytes_ = f.read()
    data = bytes_[8:]
    result = np.frombuffer(data, dtype=np.uint8)
    return result


train_images = decode_image_file('train-images-idx3-ubyte.gz')
train_labels = decode_label_file('train-labels-idx1-ubyte.gz')
test_images = decode_image_file('t10k-images-idx3-ubyte.gz')
test_labels = decode_label_file('t10k-labels-idx1-ubyte.gz')
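The 16- and 8-byte offsets skipped above correspond to the IDX header: two zero bytes, a data type code, the number of dimensions, and then one 4-byte big-endian size per dimension. A small sketch of reading that header instead of hard-coding the offsets is shown below; parse_idx_header is a hypothetical helper, and the headers are built by hand rather than read from the real files.

```python
import struct


def parse_idx_header(raw):
    """Return (dtype_code, shape, data_offset) for raw IDX bytes."""
    zero1, zero2, dtype_code, n_dims = struct.unpack(">BBBB", raw[:4])
    if zero1 != 0 or zero2 != 0:
        raise ValueError("not an IDX file: first two bytes must be zero")
    # Each dimension size is a 4-byte big-endian unsigned integer.
    shape = struct.unpack(">" + "I" * n_dims, raw[4:4 + 4 * n_dims])
    return dtype_code, shape, 4 + 4 * n_dims


# Synthetic headers mimicking an image file (3 dims) and a label file (1 dim).
# Type code 0x08 means unsigned byte.
image_header = bytes([0, 0, 0x08, 3]) + struct.pack(">III", 60000, 28, 28)
label_header = bytes([0, 0, 0x08, 1]) + struct.pack(">I", 60000)

print(parse_idx_header(image_header))
print(parse_idx_header(label_header))
```

With the real files, the returned offset would replace the literal 16 or 8, and the shape could be passed straight to reshape.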
The script doesn't normalize the pixel values as the pickled file does. To do that, all you have to do is:
train_images = train_images/255
test_images = test_images/255
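One detail worth knowing is that dividing a uint8 array by 255 yields float64. If memory matters, casting to float32 first is a possible alternative. The sketch below uses a tiny synthetic array in place of the decoded images.

```python
import numpy as np

# Synthetic stand-in for decoded images: 2 images of 784 uint8 pixels each.
images = np.array([[0] * 784, [255] * 784], dtype=np.uint8)

# Plain division promotes the result to float64.
normalized64 = images / 255

# Casting first keeps the result in float32, halving memory use.
normalized32 = images.astype(np.float32) / 255

print(normalized64.dtype, normalized32.dtype)
print(normalized32.min(), normalized32.max())
```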
The pandas library makes unpickling very simple. It uses pickle from the standard library under the hood but takes care of some common issues for us as well (for example, to open MNIST pickled data, you probably need to pass an encoding, as in pickle.load(f, encoding='bytes'); such issues are handled by pandas).
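To show what the encoding argument does in plain pickle, the sketch below loads a hand-written protocol 0 pickle of a Python 2 byte string. The byte literal is synthetic data constructed for the demonstration, not taken from the MNIST file.

```python
import pickle

# Hand-written protocol-0 pickle of the one-byte Python 2 str '\xe9'
# (synthetic data standing in for a pickle produced by Python 2).
PY2_PICKLE = b"S'\\xe9'\n."

try:
    pickle.loads(PY2_PICKLE)  # the default encoding is ASCII
except UnicodeDecodeError as exc:
    print("default encoding failed:", type(exc).__name__)

as_text = pickle.loads(PY2_PICKLE, encoding="latin1")  # decoded to str
as_bytes = pickle.loads(PY2_PICKLE, encoding="bytes")  # left as bytes
print(repr(as_text), repr(as_bytes))
```

The same encoding keyword is accepted by pickle.load, which is why it can resolve a UnicodeDecodeError when reading files that were written by Python 2.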
In general, to unpickle a file with pandas:
import pandas as pd
data = pd.read_pickle("serialized.pkl")
It can handle compressed files as well.
train_set, valid_set, test_set = pd.read_pickle("mnist.pkl.gz")
You can just pass a URL to it as well [1]:
url = "https://raw.githubusercontent.com/mnielsen/neural-networks-and-deep-learning/master/data/mnist.pkl.gz"
train_set, valid_set, test_set = pd.read_pickle(url)
If we plot the first image in the training set and title it with its label:
import matplotlib.pyplot as plt
plt.imshow(train_set[0][0].reshape((28, 28)), cmap='gray')
plt.gca().set(title=train_set[1][0], xticks=[], yticks=[])
we get the following: [image: the first training digit, titled with its label]
[1] This URL points to the MNIST dataset stored in the GitHub repo of Neural Networks and Deep Learning by Michael Nielsen.