How to make tf.data.Dataset return all of the elements in one call?
Is there an easy way to get the entire set of elements in a tf.data.Dataset? That is, I want to set the batch size of the Dataset to the size of my dataset without explicitly passing it the number of elements. This would be useful for a validation dataset, where I want to measure accuracy on the entire dataset in one go. I'm surprised there isn't a method to get the size of a tf.data.Dataset.

Briones answered 6/1, 2018 at 11:4 Comment(2)
You can also use tf.metrics.accuracy and run sess.run(update_op) on each batch of the validation data. At the end, calling sess.run(accuracy) should give you the total accuracy (see the sketch after these comments). – Preamplifier
I am getting convinced it is a waste of time to use TensorFlow APIs and Estimators. I spent so much time learning them, and then you face one limitation after another, like the one you have mentioned. I would just create my own dataset and batch generator. – Wampler
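A minimal TF1-style sketch of the streaming-accuracy pattern from the first comment; the toy pipeline, the stand-in predictions, and the batch count are all illustrative assumptions:

import tensorflow as tf  # TF 1.x graph-mode API

# Hypothetical validation pipeline: three batches of two labels each.
labels_ds = tf.data.Dataset.from_tensor_slices([0, 1, 1, 0, 1, 0]).batch(2)
labels = labels_ds.make_one_shot_iterator().get_next()
predictions = tf.ones_like(labels)  # stand-in for your model's output

# tf.metrics.accuracy accumulates running totals in local variables.
accuracy, update_op = tf.metrics.accuracy(labels=labels, predictions=predictions)

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())
    for _ in range(3):             # one update_op run per validation batch
        sess.run(update_op)
    print(sess.run(accuracy))      # total accuracy over all batches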
In TensorFlow 2.0

You can enumerate the dataset using as_numpy_iterator:

# Iterate over the dataset, printing each element as a NumPy value.
for element in Xtrain.as_numpy_iterator():
    print(element)
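If you instead want everything materialized at once, a common follow-up (assuming the dataset fits in memory and all elements share a shape) is to collect the iterator into a NumPy array:

import numpy as np

# Stack every element of the dataset into one array (fits-in-memory assumption).
all_elements = np.array(list(Xtrain.as_numpy_iterator()))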
Tartrate answered 20/4, 2020 at 23:14 Comment(0)
In short, there is not a good way to get the size/length; tf.data.Dataset is built for pipelines of data, so it has an iterator structure (in my understanding, and according to my read of the Dataset ops code). From the programmer's guide:

A tf.data.Iterator provides the main way to extract elements from a dataset. The operation returned by Iterator.get_next() yields the next element of a Dataset when executed, and typically acts as the interface between input pipeline code and your model.

And, by their nature, iterators do not have a convenient notion of size/length; see here: Getting number of elements in an iterator in Python

More generally though, why does this problem arise? If you are calling batch, you are also getting a tf.data.Dataset, so whatever you are running on a batch you should be able to run on the whole dataset; it will iterate through all the elements and calculate validation accuracy. Put differently, I don't think you actually need the size/length to do what you want to do.
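For illustration, here is a minimal eager-mode (TF 2) sketch of that idea; the random data and the zero predictions are stand-ins for your own pipeline and model:

import tensorflow as tf

# Stand-in validation pipeline: 100 examples in batches of 32.
val_dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([100, 4]),
     tf.random.uniform([100], maxval=2, dtype=tf.int64))
).batch(32)

correct, total = 0, 0
for features, labels in val_dataset:     # visits every batch, i.e. the whole set
    predictions = tf.zeros_like(labels)  # stand-in for model(features)
    correct += int(tf.reduce_sum(tf.cast(predictions == labels, tf.int32)))
    total += int(labels.shape[0])
print(correct / total)                   # accuracy over the entire dataset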

Betulaceous answered 6/1, 2018 at 13:0 Comment(2)
My code accepts training and validation tfrecords files and turns them into two tf.data.Datasets with a single iterator that can be initialised to both Datasets (similar to examples in TF's documentation). The number of epochs and the batch size for the training data are in my control, and I can easily apply the .batch() and .repeat() methods on the training dataset. However, for the validation data I want to create a single batch containing all the samples, but I don't necessarily know how many samples are in the tfrecord file. – Briones
I see; thanks for the explanation. What I was trying to say is that, when you run .batch(), it returns an object of the same type as your dataset. Thus whatever you are calling on a batch you should be able to call on the dataset itself (just without the call to batch). – Betulaceous
Not sure if this still works in the latest versions of TensorFlow, but if it is absolutely needed, a hacky solution is to create a batch that's bigger than the dataset. You don't need to know how big the dataset is, just request a batch size that's larger.
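A sketch of that hack in TF 2 style; the 1_000_000_000 cap is an arbitrary "bigger than any plausible dataset" assumption:

import tensorflow as tf

dataset = tf.data.Dataset.range(10)         # stand-in for your validation data
huge_batch = dataset.batch(1_000_000_000)   # any bound larger than the dataset
everything = next(iter(huge_batch))         # one tensor holding all 10 elements
print(everything.shape)                     # (10,)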

Briones answered 5/6, 2018 at 16:4 Comment(1)
Ugly, but still a working solution in TensorFlow 2.5.0. – Juniorjuniority
The tf.data API creates a tensor called 'tensors/component' (with an appropriate prefix/suffix, if applicable) after you create the instance. You can evaluate that tensor by name and use its size as the batch size.

# Ignore the warnings
import warnings
warnings.filterwarnings("ignore")

import tensorflow as tf  # TF 1.x API
import numpy as np

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/")

Xtrain = mnist.train.images[mnist.train.labels < 2]
ytrain = mnist.train.labels[mnist.train.labels < 2]

print(Xtrain.shape)
#(11623, 784)
print(ytrain.shape)
#(11623,)  

#Data parameters
num_inputs = 28
num_classes = 2
num_steps=28

# create the training dataset
Xtrain = tf.data.Dataset.from_tensor_slices(Xtrain).map(lambda x: tf.reshape(x,(num_steps, num_inputs)))
# apply a one-hot transformation to each label for use in the neural network
ytrain = tf.data.Dataset.from_tensor_slices(ytrain).map(lambda z: tf.one_hot(z, num_classes))
# zip the x and y training data together, then batch and prefetch for faster consumption
train_dataset = tf.data.Dataset.zip((Xtrain, ytrain)).batch(128).prefetch(128)

iterator = tf.data.Iterator.from_structure(train_dataset.output_types,train_dataset.output_shapes)
X, y = iterator.get_next()

training_init_op = iterator.make_initializer(train_dataset)

def get_tensors(graph=tf.get_default_graph()):
    return [t for op in graph.get_operations() for t in op.values()]

get_tensors()
# (output abridged)
# <tf.Tensor 'tensors_1/component_0:0' shape=(11623,) dtype=uint8>,
# <tf.Tensor 'batch_size:0' shape=() dtype=int64>,
# <tf.Tensor 'drop_remainder:0' shape=() dtype=bool>,
# <tf.Tensor 'buffer_size:0' shape=() dtype=int64>,
# <tf.Tensor 'IteratorV2:0' shape=() dtype=resource>,
# <tf.Tensor 'IteratorToStringHandle:0' shape=() dtype=string>,
# <tf.Tensor 'IteratorGetNext:0' shape=(?, 28, 28) dtype=float32>,
# <tf.Tensor 'IteratorGetNext:1' shape=(?, 2) dtype=float32>,
# <tf.Tensor 'TensorSliceDataset:0' shape=() dtype=variant>,
# <tf.Tensor 'MapDataset:0' shape=() dtype=variant>,
# <tf.Tensor 'TensorSliceDataset_1:0' shape=() dtype=variant>,
# <tf.Tensor 'MapDataset_1:0' shape=() dtype=variant>,
# <tf.Tensor 'ZipDataset:0' shape=() dtype=variant>,
# <tf.Tensor 'BatchDatasetV2:0' shape=() dtype=variant>,
# <tf.Tensor 'PrefetchDataset:0' shape=() dtype=variant>]

sess = tf.InteractiveSession()
print('Size of Xtrain: %d' % tf.get_default_graph().get_tensor_by_name('tensors/component_0:0').eval().shape[0])
#Size of Xtrain: 11623
Awlwort answered 24/11, 2018 at 0:12 Comment(0)
Adding on to the as_numpy_iterator answer above:

import numpy as np

# Collect the second component (e.g. the labels) of every element.
total = []
for element in val_ds.as_numpy_iterator():
    total.append(element[1])

all_total = np.concatenate(total)  # merge the per-batch arrays into one
print(all_total)
Lament answered 18/6, 2021 at 15:18 Comment(0)
TensorFlow's get_single_element() is finally around, and it does exactly this: return all of the elements in one call.

This avoids the need to generate and use an iterator via .map() or iter() (which could be costly for big datasets).

get_single_element() returns a tensor (or a tuple or dict of tensors) encapsulating all the members of the dataset, so the dataset must first be batched into a single element.

This can be used to get features as a tensor array, or features and labels as a tuple or dictionary (of tensor arrays), depending upon how the original dataset was created.

Check this answer on SO for an example that unpacks features and labels into a tuple of tensor-arrays.
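For illustration, a minimal sketch with a toy (features, labels) dataset; in older TF versions the same call is spelled tf.data.experimental.get_single_element(dataset):

import tensorflow as tf

# Toy dataset of five (feature, label) pairs; shapes are illustrative.
ds = tf.data.Dataset.from_tensor_slices((tf.zeros([5, 3]), tf.range(5)))

# Batch the whole dataset into one element, then pull that element out.
features, labels = ds.batch(5).get_single_element()
print(features.shape, labels.shape)  # (5, 3) (5,)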

Anatomy answered 19/8, 2021 at 16:42 Comment(0)
You can get all the elements of the dataset with

`dataset.take(length_of_the_dataset)`

Args:
  count: A tf.int64 scalar tf.Tensor, representing the number of elements of this dataset that should be taken to form the new dataset. If count is -1, or if count is greater than the size of this dataset, the new dataset will contain all elements of this dataset.
  name: (Optional.) A name for the tf.data operation.

Returns:
  Dataset: A Dataset.

from tensorflow.org

Let's suppose that my dataset is as follows:

<BatchDataset element_spec=TensorSpec(shape=(None, 400, 32, 32, 3), dtype=tf.float32, name=None)>

I'm going to iterate over the dataset and get each datapoint as a numpy object.

for img in data.unbatch().take(10):  # because the length is 10
    print(img.numpy().shape)
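Per the quoted docs, count=-1 also works when the length is unknown; a minimal sketch:

import tensorflow as tf

ds = tf.data.Dataset.range(10)
for element in ds.take(-1):  # -1 means "take everything", no length required
    print(element.numpy())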
Nightspot answered 7/6, 2022 at 9:5 Comment(2)
All I got was <TakeDataset>. – Saar
Oh, you have to iterate over it with a loop! – Nightspot
The following example will batch all the elements in the dataset as a single item, and extract them as an array.

data = data.batch(len(data))      # one batch containing every element
data = data.get_single_element()  # extract that batch as a single tensor

This will add an outer dimension to the data equal to the length of the batch. For example, if you start with a dataset containing 456 elements of dimension (32, 100), you will receive an array of shape (456, 32, 100).
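Note that len(data) only works when the dataset's cardinality is known. A fallback sketch for the unknown case (cardinality reporting and the get_single_element method both assume reasonably recent TF 2 versions):

import tensorflow as tf

data = tf.data.Dataset.range(456)  # stand-in dataset
n = data.cardinality()             # may be UNKNOWN (e.g. after filter())
if n == tf.data.experimental.UNKNOWN_CARDINALITY:
    n = sum(1 for _ in data)       # count by iterating once
everything = data.batch(int(n)).get_single_element()
print(everything.shape)            # (456,)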

Wershba answered 16/8, 2022 at 10:24 Comment(0)
