Change size of train and test set from MNIST Dataset

I'm using MNIST and Keras to learn about CNNs. I'm downloading the MNIST database of handwritten digits through the Keras API as shown below. The dataset is already split into 60,000 images for training and 10,000 images for testing (see Dataset - Keras Documentation).

from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

How can I join the training and test sets and then separate them into 70% for training and 30% for testing?

Emmert asked 22/1, 2019 at 21:34

mnist.load_data has no argument for changing the split. Instead, you can concatenate the data with numpy and then split it with sklearn (or numpy):

from keras.datasets import mnist
import numpy as np
from sklearn.model_selection import train_test_split

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x = np.concatenate((x_train, x_test))
y = np.concatenate((y_train, y_test))

train_size = 0.7
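# train_test_split takes random_state (there is no random_seed argument)
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=train_size, random_state=2019)

Set a random seed (random_state in train_test_split) for reproducibility.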

Via numpy (if you don't use sklearn):

# do the same concatenation
np.random.seed(2019)
train_size = 0.7
index = np.random.rand(len(x)) < train_size  # boolean index
x_train, x_test = x[index], x[~index]  # index and its negation
y_train, y_test = y[index], y[~index]

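You'll get arrays of approximately the required size (~210xx instead of exactly 21000 for the test set), because each sample is assigned to the training set independently with probability 0.7.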
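
If you need exact sizes without sklearn, one option is to shuffle the indices once and cut at a fixed position. This is only a sketch: the helper name split_exact and the small stand-in arrays are made up for illustration, and in practice you would pass the concatenated x and y from above.

```python
import numpy as np

def split_exact(x, y, train_size=0.7, seed=2019):
    """Shuffle x and y with the same permutation and cut at an exact index."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))            # random ordering of all sample indices
    n_train = int(round(train_size * len(x)))  # exact number of training samples
    train_idx, test_idx = order[:n_train], order[n_train:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

# Stand-in arrays with MNIST-like image shape (hypothetical data, not the real dataset)
x = np.zeros((700, 28, 28), dtype=np.uint8)
y = np.arange(700) % 10

# Try several train fractions, e.g. to compare results across split sizes
for frac in (0.5, 0.7, 0.9):
    x_tr, x_te, y_tr, y_te = split_exact(x, y, train_size=frac)
    print(frac, len(x_tr), len(x_te))
```

Looping over train_size like this is a simple way to compare how the split ratio affects results while keeping the seed fixed.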

Judging by the source code of mnist.load_data, the function just fetches the data from a URL, already split as 60000 training / 10000 test, so concatenation is the only workaround.

You could also download the MNIST dataset from http://yann.lecun.com/exdb/mnist/, preprocess it manually, and then concatenate it as you need. But, as far as I understand, it was divided into 60000 examples for training and 10000 for testing because this split is used in standard benchmarks.
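
For the manual route, the files on that page use the IDX format: a 4-byte header (two zero bytes, a type code, the number of dimensions), then one 4-byte big-endian size per dimension, then the raw data. A rough sketch of a reader, assuming unsigned-byte data; the helper names parse_idx and load_idx_gz and the path argument are hypothetical:

```python
import gzip
import struct
import numpy as np

def parse_idx(raw):
    """Decode an IDX byte string holding unsigned bytes into a NumPy array."""
    # Header: two zero bytes, type code (0x08 = unsigned byte), number of dimensions
    zeros, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zeros != 0 or dtype_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    # Each dimension size is a 4-byte big-endian unsigned integer
    shape = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    data = np.frombuffer(raw, dtype=np.uint8, offset=4 + 4 * ndim)
    return data.reshape(shape)

def load_idx_gz(path):
    """Read a gzip-compressed IDX file from disk (path is supplied by the caller)."""
    with gzip.open(path, "rb") as f:
        return parse_idx(f.read())

# Tiny synthetic check: two 2x3 "images" encoded in IDX form
demo = struct.pack(">HBBIII", 0, 0x08, 3, 2, 2, 3) + bytes(range(12))
print(parse_idx(demo).shape)
```

After loading the image and label files for both subsets this way, the same np.concatenate and split steps apply.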

Antonelli answered 22/1, 2019 at 21:51 Comment(1)
Thanks for the answer. I knew the split was standard, but I'm working on a university project that needs to use different train/test set sizes to see the effect of this change on the results. - Emmert
