Can't load 'mnist-original' dataset using sklearn [duplicate]
G

9

5

This question is similar to what was asked here and here. Unfortunately, in my case the suggested solutions didn't fix the problem.

I need to work with the MNIST dataset but I can't fetch it, even when I specify the path of the scikit_learn_data/mldata/ folder (see below). How can I fix this?

In case it might help, I'm using Anaconda.

Code:

from sklearn.datasets.mldata import fetch_mldata

dataset = fetch_mldata('mnist-original', data_home='/Users/michelangelo/scikit_learn_data/mldata/')
mnist = fetch_mldata('MNIST original')

Error:

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-5-dc4d45bc928e> in <module>()
----> 1 mnist = fetch_mldata('MNIST original')

/Users/michelangelo/anaconda2/lib/python2.7/site-packages/sklearn/datasets/mldata.pyc in fetch_mldata(dataname, target_name, data_name, transpose_data, data_home)
    168     # load dataset matlab file
    169     with open(filename, 'rb') as matlab_file:
--> 170         matlab_dict = io.loadmat(matlab_file, struct_as_record=True)
    171 
    172     # -- extract data from matlab_dict

/Users/michelangelo/anaconda2/lib/python2.7/site-packages/scipy/io/matlab/mio.pyc in loadmat(file_name, mdict, appendmat, **kwargs)
    134     variable_names = kwargs.pop('variable_names', None)
    135     MR = mat_reader_factory(file_name, appendmat, **kwargs)
--> 136     matfile_dict = MR.get_variables(variable_names)
    137     if mdict is not None:
    138         mdict.update(matfile_dict)

/Users/michelangelo/anaconda2/lib/python2.7/site-packages/scipy/io/matlab/mio5.pyc in get_variables(self, variable_names)
    290                 continue
    291             try:
--> 292                 res = self.read_var_array(hdr, process)
    293             except MatReadError as err:
    294                 warnings.warn(

/Users/michelangelo/anaconda2/lib/python2.7/site-packages/scipy/io/matlab/mio5.pyc in read_var_array(self, header, process)
    250            `process`.
    251         '''
--> 252         return self._matrix_reader.array_from_header(header, process)
    253 
    254     def get_variables(self, variable_names=None):

mio5_utils.pyx in scipy.io.matlab.mio5_utils.VarReader5.array_from_header()

mio5_utils.pyx in scipy.io.matlab.mio5_utils.VarReader5.array_from_header()

mio5_utils.pyx in scipy.io.matlab.mio5_utils.VarReader5.read_real_complex()

mio5_utils.pyx in scipy.io.matlab.mio5_utils.VarReader5.read_numeric()

mio5_utils.pyx in scipy.io.matlab.mio5_utils.VarReader5.read_element()

streams.pyx in scipy.io.matlab.streams.FileStream.read_string()

IOError: could not read bytes
Girt answered 16/11, 2017 at 8:32 Comment(7)
If you type this: from sklearn.datasets import fetch_mldata, mnist = fetch_mldata('MNIST original'), does it work? - Penninite
Nope, and I get SyntaxError: invalid syntax - Girt
Copy and paste each command separately: 1) from sklearn.datasets import fetch_mldata 2) mnist = fetch_mldata('MNIST original') - Penninite
Unfortunately that was not the problem. - Girt
What is your sklearn version? Use import sklearn and sklearn.__version__ to print the version. - Penninite
Version number: '0.19.1' - Girt
Let us continue this discussion in chat. - Penninite
S
5

I just faced the same issue and it took me some time to find the problem. One possible cause is that the data was corrupted during the first download. If so, remove the cached data. Find the scikit-learn data home directory as follows:

from sklearn.datasets import get_data_home
print(get_data_home())

Empty that directory and download the dataset again. This worked for me. For reference: https://github.com/ageron/handson-ml/issues/143
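A minimal sketch of the clean-up step, using only the standard library. The helper name clear_cached_dataset is hypothetical, and it assumes the corrupted file sits in an mldata subfolder of the data home, as in the question; pass it the path printed by get_data_home().

```python
import shutil
from pathlib import Path


def clear_cached_dataset(data_home, subdir="mldata"):
    """Delete <data_home>/<subdir> so the dataset is downloaded again.

    Hypothetical helper: `subdir` defaults to the folder named in the
    question; adjust it if your cache is laid out differently.
    Returns True if a folder was removed, False if there was nothing there.
    """
    cache_dir = Path(data_home).expanduser() / subdir
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True
```

For example, calling clear_cached_dataset(get_data_home()) and then running the fetch again should trigger a fresh download.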

This is also related to the following question: How to use datasets.fetch_mldata() in sklearn?

Shinberg answered 12/9, 2018 at 14:23 Comment(0)
S
21

Unfortunately, fetch_mldata() has been replaced by fetch_openml() in recent versions of sklearn (it was deprecated in 0.20 and removed in 0.22).

So, instead of using:

from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')

You must use:

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')
x = mnist.data
y = mnist.target

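The shape of x will be (70000, 784).
The shape of y will be (70000,). Note that fetch_openml returns the labels as strings ('0' to '9'), so convert them to integers if you need numbers.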

Splinter answered 27/11, 2019 at 7:45 Comment(0)
R
7

A quick update for the question here:

mldata.org still seems to be down, and scikit-learn is going to remove fetch_mldata.

Solution for the moment: running the lines above only creates an empty folder at the data_home location, so download a copy of the data from here: https://github.com/amplab/datascience-sp14/blob/master/lab7/mldata/mnist-original.mat. Then place it in the ~/scikit_learn_data/mldata/ folder, which is empty.

It worked for me.
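A minimal sketch of the "place it in the folder" step, assuming the file has already been downloaded in a browser. The helper name install_mnist_mat is hypothetical, and the default data_home is the scikit_learn_data folder mentioned in the question; check get_data_home() on your machine.

```python
import shutil
from pathlib import Path


def install_mnist_mat(downloaded_file, data_home="~/scikit_learn_data"):
    """Copy a manually downloaded mnist-original.mat into the mldata cache.

    Assumptions: `downloaded_file` is wherever your browser saved the file,
    and `data_home` is the scikit-learn data folder on this machine.
    Returns the path of the installed copy.
    """
    target_dir = Path(data_home).expanduser() / "mldata"
    # Create the mldata folder if the failed fetch did not leave one behind.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "mnist-original.mat"
    shutil.copyfile(Path(downloaded_file).expanduser(), target)
    return target
```

With the file in place, fetch_mldata('MNIST original') on an older scikit-learn should read the local copy instead of contacting mldata.org.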

Reinert answered 4/3, 2019 at 15:21 Comment(0)
N
3

Instead of:

from sklearn.datasets.mldata import fetch_mldata

use:

from sklearn.datasets import fetch_mldata

And then:

mnist = fetch_mldata('MNIST original')
X = mnist.data.astype('float64')
y = mnist.target

Please see this example:

Nash answered 16/11, 2017 at 9:9 Comment(5)
Thanks for the reply Vivek! I still get IOError: could not read bytes - Girt
@albus_c Possibly the download is corrupt. Please check the size of the downloaded file in scikit_learn_data/mldata. It should be at least 52 MB. If not, delete it and try again. - Nash
@albus_c Precisely 52.9 MB. If that is still not successful, then please download the file from this link in a browser and replace the file in that folder. - Nash
Most likely that was the problem. I will now try with the direct link. - Girt
fetch_mldata is now deprecated :( - Maureenmaureene
G
3

For people having the same issue: it was a connection problem. If you get a similar error, check that you have the entire mnist-original.mat file, as suggested by @vivek-kumar. Current file size: 55.4 MB.
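A minimal sketch of that size check, using only the standard library. The helper name looks_complete is hypothetical, and the 52 MB threshold is an assumption taken from the sizes quoted in this thread, not an official figure.

```python
import os


def looks_complete(mat_path, min_bytes=52_000_000):
    """Return True if the file exists and is at least `min_bytes` long.

    The default threshold is an assumption based on the sizes reported in
    this thread (52.9 MB while downloading, 55.4 MB on disk); a truncated
    download should be noticeably smaller.
    """
    mat_path = os.path.expanduser(mat_path)
    return os.path.isfile(mat_path) and os.path.getsize(mat_path) >= min_bytes
```

If looks_complete('~/scikit_learn_data/mldata/mnist-original.mat') returns False, delete the file and download it again.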

Girt answered 16/11, 2017 at 9:49 Comment(1)
Yes, in the system it shows as 55.4 MB, but during the download it's shown as 52.9 MB. Please consider upvoting and accepting the answer if it helped. - Nash
P
2

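In the latest sklearn version (0.21) you can use the code below. Note that load_digits loads the small 8x8 digits dataset bundled with scikit-learn (1,797 samples), not the full MNIST: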

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits

digits = load_digits()

X = digits.data
y = digits.target
Penninite answered 16/11, 2019 at 21:16 Comment(0)
P
2

Just use these two lines:

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, cache=True)
Panther answered 5/4, 2020 at 21:16 Comment(1)
This now throws urlopen error [Errno -3] Temporary failure in name resolution - Priggery
C
0

Try this:

import sklearn
print(sklearn.__version__)

try:
    from sklearn.datasets import fetch_openml
    mnist = fetch_openml('mnist_784', version=1, cache=True)
except ImportError:
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata('MNIST original')
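The try/except above only guards against ImportError. Download failures such as the IOError and urlopen error reported in this thread surface as OSError in Python 3, so a more general fallback can be sketched as below. The helper name first_working is hypothetical, and the loaders are whatever zero-argument functions you choose to pass in.

```python
def first_working(loaders):
    """Call each zero-argument loader in order and return the first result.

    ImportError covers a function missing from the installed scikit-learn;
    OSError covers IOError and urllib's URLError in Python 3.
    Raises RuntimeError if every loader fails.
    """
    errors = []
    for loader in loaders:
        try:
            return loader()
        except (ImportError, OSError) as exc:
            # Remember why this loader failed and move on to the next one.
            errors.append(exc)
    raise RuntimeError("all loaders failed: %r" % (errors,))
```

One loader could wrap the fetch_openml call and another could read a local copy of the data, so that an offline machine still gets a dataset.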
Culminate answered 22/11, 2021 at 21:8 Comment(0)
E
-1

Try this one, this will work.

from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
Elector answered 16/10, 2019 at 14:48 Comment(1)
Hi Puneet! Could you add an explanation as to why this is the correct answer? - Scapegoat
