How to create train, test and validation splits in TensorFlow 2.0
I am new to TensorFlow and have started using TensorFlow 2.0.

I have built a TensorFlow dataset for a multi-class classification problem; let's call it labeled_ds. I prepared this dataset by loading all the image files from their respective class-wise directories, following the tutorial here: tensorflow guide to load image dataset

Now, I need to split labeled_ds into three disjoint pieces: train, validation and test. I went through the TensorFlow API, but found no example that lets you specify split percentages. I found something in the load method, but I am not sure how to use it. Further, how can I make the splits stratified?

# labeled_ds contains multi-class data, which is unbalanced.
# This does not work: tfds.load expects a registered dataset name, not an existing tf.data.Dataset.
train_ds, val_ds, test_ds = tf.data.Dataset.tfds.load(labeled_ds, split=["train", "validation", "test"])

I am stuck here, would appreciate any advice on how to progress from here. Thanks in advance.

Criticism answered 15/10, 2019 at 21:37 Comment(4)
Refer to this answer to split a tf.data datasetAwad
@SWAPNILMASUREKAR the solution provided there will work for splitting data into multiple subsets. The problem is, the resulting splits will still not be stratified.Criticism
I came across the same problem, and didn't find a solution in TensorFlow that makes sure the dataset is in fact stratified. The solution I ended up using is this: a function that splits your dataset into train and validation subdirectories; then you can create train and validation TensorFlow datasets from each directoryHoch
@ofirdubi thanks for sharing the link to the code. I too did something similar since TensorFlow does not provide such a functionality out of the box.Criticism
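Since TensorFlow does not provide stratified splitting out of the box, one way is to split each class separately with filter/take/skip and concatenate the pieces. A minimal sketch, assuming the dataset yields (feature, label) pairs with integer class labels; the function name is illustrative, and note that it iterates the dataset once per class, so it is only practical for small-to-medium datasets:

```python
import tensorflow as tf

def stratified_split(ds, num_classes, train_frac=0.8, val_frac=0.1):
    """Split a (feature, label) dataset into train/val/test so that
    each class keeps roughly the same proportion in every subset."""
    train_parts, val_parts, test_parts = [], [], []
    for c in range(num_classes):
        # Keep only the elements of class c (c=c binds the loop variable)
        class_ds = ds.filter(lambda x, y, c=c: tf.equal(tf.cast(y, tf.int32), c))
        n = sum(1 for _ in class_ds)  # eager count of this class
        n_train = int(train_frac * n)
        n_val = int(val_frac * n)
        train_parts.append(class_ds.take(n_train))
        val_parts.append(class_ds.skip(n_train).take(n_val))
        test_parts.append(class_ds.skip(n_train + n_val))

    def concat(parts):
        # Stitch the per-class pieces back into one dataset
        out = parts[0]
        for p in parts[1:]:
            out = out.concatenate(p)
        return out

    return concat(train_parts), concat(val_parts), concat(test_parts)
```

You would typically shuffle each resulting subset afterwards, since the concatenation groups examples by class.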
Please refer to the code below to create train, test and validation splits using the TensorFlow dataset "oxford_flowers102":

!pip install tensorflow==2.0.0

import tensorflow as tf
print(tf.__version__)
import tensorflow_datasets as tfds

labeled_ds, summary = tfds.load('oxford_flowers102', split='train+test+validation', with_info=True)

# Count the elements by iterating over the dataset once
labeled_all_length = sum(1 for _ in labeled_ds)

train_size = int(0.8 * labeled_all_length)
val_test_size = int(0.1 * labeled_all_length)

df_train = labeled_ds.take(train_size)
df_test = labeled_ds.skip(train_size)
df_val = df_test.skip(val_test_size)
df_test = df_test.take(val_test_size)

df_train_length = sum(1 for _ in df_train)
df_val_length = sum(1 for _ in df_val)
df_test_length = sum(1 for _ in df_test)

print('Original:', labeled_all_length)
print('Train:', df_train_length)
print('Validation:', df_val_length)
print('Test:', df_test_length)
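When TensorFlow already knows the dataset's size, the counting loop can be skipped. A small sketch of the same take/skip split (the function name is illustrative) using `tf.data.experimental.cardinality`, falling back to iteration when the size is unknown:

```python
import tensorflow as tf

def take_skip_split(ds, train_frac=0.8, val_frac=0.1):
    """Split a dataset into train/val/test by position (not stratified)."""
    n = int(tf.data.experimental.cardinality(ds).numpy())
    if n < 0:  # UNKNOWN or INFINITE cardinality
        n = sum(1 for _ in ds)  # fall back to counting by iteration
    n_train = int(train_frac * n)
    n_val = int(val_frac * n)
    train_ds = ds.take(n_train)
    val_ds = ds.skip(n_train).take(n_val)
    test_ds = ds.skip(n_train + n_val)
    return train_ds, val_ds, test_ds
```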
Wrongful answered 18/3, 2020 at 12:51 Comment(1)
The solution looks good, but this method of choosing training, test and validation subsets does not ensure that the data is stratified. Stratified means that all three subsets preserve the class proportions of the full dataset.Criticism

I had the same problem.

It depends on the dataset: most have a train and a test set. In that case you can do the following (assuming an 80-10-10 split):

splits, info = tfds.load('fashion_mnist', with_info=True, as_supervised=True,
                         # note the percent signs: '[:80]' would take the first 80 examples, not 80%
                         split=['train[:80%]+test[:80%]', 'train[80%:90%]+test[80%:90%]', 'train[90%:]+test[90%:]'],
                         data_dir=filePath)
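The TFDS slice expressions are plain strings, so they can also be generated instead of hand-written. A small helper (purely illustrative string formatting, assuming TFDS's percent-slicing syntax) that builds an 80-10-10 set of split strings over a list of base splits:

```python
def make_tfds_splits(train_pct=80, val_pct=10, base_splits=("train", "test")):
    """Build TFDS split strings that take the same percentage slice
    from each base split, e.g. 'train[:80%]+test[:80%]'."""
    cut1 = train_pct              # end of the train slice
    cut2 = train_pct + val_pct    # end of the validation slice
    train = "+".join(f"{s}[:{cut1}%]" for s in base_splits)
    val = "+".join(f"{s}[{cut1}%:{cut2}%]" for s in base_splits)
    test = "+".join(f"{s}[{cut2}%:]" for s in base_splits)
    return [train, val, test]
```

The result can be passed directly as the `split` argument of `tfds.load`.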
Monro answered 21/10, 2020 at 8:0 Comment(1)
Thanks, Francesco, I was looking for a solution on a custom dataset. However, your solution will help others using TensorFlow provided datasets.Criticism
Francesco Boi's solution works well for me.

splits, info = tfds.load('fashion_mnist', with_info=True, as_supervised=True, split=['train[:80%]+test[:80%]', 'train[80%:90%]+test[80%:90%]', 'train[90%:]+test[90%:]'])

(train_examples, validation_examples, test_examples) = splits
Pollinate answered 15/5, 2021 at 15:56 Comment(0)
Importing TensorFlow and TensorFlow Datasets (tf is needed later for tf.cast):

import tensorflow as tf
import tensorflow_datasets as tfds

MNIST_info holds the dataset metadata once the MNIST dataset is loaded:

MNIST_dataset, MNIST_info = tfds.load(name='mnist', with_info=True, as_supervised=True)

Splitting the MNIST dataset into its train and test parts:

MNIST_train, MNIST_test = MNIST_dataset['train'], MNIST_dataset['test']

num_validation_samples = 0.1 * MNIST_info.splits['train'].num_examples
# (allocating 10 percent of the training dataset to create the validation dataset)

Since this count is a float, cast it to an integer:

num_validation_samples = tf.cast(num_validation_samples, tf.int64)

Similarly, compute the test and train sample counts:

num_test_samples = MNIST_info.splits['test'].num_examples
num_test_samples = tf.cast(num_test_samples, tf.int64)
num_train_samples = 0.8 * MNIST_info.splits['train'].num_examples

(allocating 80 percent of the training dataset for training.)

num_train_samples = tf.cast(num_train_samples, tf.int64)
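The counts above still need to be applied to the datasets themselves. A minimal sketch of carving the validation set out of the training data with shuffle/take/skip (the sizes here are hypothetical stand-ins for the MNIST counts; note reshuffle_each_iteration=False, so take and skip see the same order and the subsets stay disjoint):

```python
import tensorflow as tf

# Hypothetical counts standing in for the ones computed above
num_examples = 1000
num_validation_samples = tf.cast(0.1 * num_examples, tf.int64)

train_ds = tf.data.Dataset.range(num_examples)
# Shuffle once with a fixed seed, and do NOT reshuffle on each iteration,
# otherwise take/skip would produce overlapping subsets across epochs.
shuffled = train_ds.shuffle(buffer_size=num_examples, seed=42,
                            reshuffle_each_iteration=False)
validation_data = shuffled.take(num_validation_samples)
train_data = shuffled.skip(num_validation_samples)
```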

Hope this has answered your question 🙂👍

Ampliate answered 16/3, 2023 at 7:47 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.