Is there an easy way to get the entire set of elements in a tf.data.Dataset? I.e., I want to set the batch size of the Dataset to be the size of my dataset without explicitly passing it the number of elements. This would be useful for a validation dataset, where I want to measure accuracy on the entire dataset in one go. I'm surprised there isn't a method to get the size of a tf.data.Dataset.
In TensorFlow 2.0 you can enumerate the dataset using as_numpy_iterator():
for element in Xtrain.as_numpy_iterator():
    print(element)
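If you want every element at once rather than one at a time, you can also collect the iterator into a list. A minimal sketch, assuming the dataset fits in memory (the toy dataset below is an illustration, not from the original answer):
import numpy as np
import tensorflow as tf

# Toy dataset of 3 rows, 2 columns
ds = tf.data.Dataset.from_tensor_slices(np.arange(6).reshape(3, 2))

# Materialize every element as a list of numpy arrays
elements = list(ds.as_numpy_iterator())
print(len(elements))  # 3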
In short, there is not a good way to get the size/length; tf.data.Dataset is built for data pipelines, so it has an iterator structure (in my understanding, and according to my read of the Dataset ops code). From the programmer's guide:
A tf.data.Iterator provides the main way to extract elements from a dataset. The operation returned by Iterator.get_next() yields the next element of a Dataset when executed, and typically acts as the interface between input pipeline code and your model.
And, by their nature, iterators do not have a convenient notion of size/length; see here: Getting number of elements in an iterator in Python
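Accordingly, the only generic way to count a dataset's elements is to consume them. A minimal sketch, assuming TF2 eager execution (the dataset here is hypothetical):
import tensorflow as tf

ds = tf.data.Dataset.range(100)

# Exhaust the dataset once, counting elements as they stream by
num_elements = sum(1 for _ in ds)
print(num_elements)  # 100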
More generally, though, why does this problem arise? If you are calling batch, you are still getting back a tf.data.Dataset, so whatever you are running on a batch you should be able to run on the whole dataset; it will iterate through all the elements and calculate validation accuracy. Put differently, I don't think you actually need the size/length to do what you want to do.
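For example, here is a hedged sketch of measuring accuracy over an entire validation set without ever knowing its length, assuming TF2 eager execution; the model and val_ds below are toy stand-ins, not from the original answer:
import tensorflow as tf

# Toy stand-ins: a tiny untrained model and a batched validation set
model = tf.keras.Sequential([tf.keras.layers.Dense(2)])
val_ds = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((100, 8)),
     tf.random.uniform((100,), maxval=2, dtype=tf.int32))
).batch(32)

# Accumulate accuracy batch by batch; the dataset length is never needed
metric = tf.keras.metrics.SparseCategoricalAccuracy()
for features, labels in val_ds:
    metric.update_state(labels, model(features, training=False))
print("validation accuracy:", metric.result().numpy())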
Not sure if this still works in the latest versions of TensorFlow, but if it is absolutely needed, a hacky solution is to create a batch that's bigger than the dataset size. You don't need to know how big the dataset is, just request a batch size that's larger.
The tf.data API creates a tensor called 'tensors/component' (with the appropriate prefix/suffix, if applicable) after you create the instance. You can evaluate that tensor by name and use it as a batch size.
#Ignore the warnings
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (8,7)
%matplotlib inline
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/")
Xtrain = mnist.train.images[mnist.train.labels < 2]
ytrain = mnist.train.labels[mnist.train.labels < 2]
print(Xtrain.shape)
#(11623, 784)
print(ytrain.shape)
#(11623,)
#Data parameters
num_inputs = 28
num_classes = 2
num_steps = 28
# create the training dataset
Xtrain = tf.data.Dataset.from_tensor_slices(Xtrain).map(lambda x: tf.reshape(x,(num_steps, num_inputs)))
# apply a one-hot transformation to each label for use in the neural network
ytrain = tf.data.Dataset.from_tensor_slices(ytrain).map(lambda z: tf.one_hot(z, num_classes))
# zip the x and y training data together, then batch and prefetch for faster consumption
train_dataset = tf.data.Dataset.zip((Xtrain, ytrain)).batch(128).prefetch(128)
iterator = tf.data.Iterator.from_structure(train_dataset.output_types,train_dataset.output_shapes)
X, y = iterator.get_next()
training_init_op = iterator.make_initializer(train_dataset)
def get_tensors(graph=tf.get_default_graph()):
    return [t for op in graph.get_operations() for t in op.values()]
get_tensors()
#<tf.Tensor 'tensors_1/component_0:0' shape=(11623,) dtype=uint8>,
#<tf.Tensor 'batch_size:0' shape=() dtype=int64>,
#<tf.Tensor 'drop_remainder:0' shape=() dtype=bool>,
#<tf.Tensor 'buffer_size:0' shape=() dtype=int64>,
#<tf.Tensor 'IteratorV2:0' shape=() dtype=resource>,
#<tf.Tensor 'IteratorToStringHandle:0' shape=() dtype=string>,
#<tf.Tensor 'IteratorGetNext:0' shape=(?, 28, 28) dtype=float32>,
#<tf.Tensor 'IteratorGetNext:1' shape=(?, 2) dtype=float32>,
#<tf.Tensor 'TensorSliceDataset:0' shape=() dtype=variant>,
#<tf.Tensor 'MapDataset:0' shape=() dtype=variant>,
#<tf.Tensor 'TensorSliceDataset_1:0' shape=() dtype=variant>,
#<tf.Tensor 'MapDataset_1:0' shape=() dtype=variant>,
#<tf.Tensor 'ZipDataset:0' shape=() dtype=variant>,
#<tf.Tensor 'BatchDatasetV2:0' shape=() dtype=variant>,
#<tf.Tensor 'PrefetchDataset:0' shape=() dtype=variant>]
sess = tf.InteractiveSession()
print('Size of Xtrain: %d' % tf.get_default_graph().get_tensor_by_name('tensors/component_0:0').eval().shape[0])
#Size of Xtrain: 11623
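In TF2 the same oversized-batch trick works without a session. A minimal sketch, assuming the whole dataset fits in memory (the 1000-element dataset is an illustration):
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(tf.range(1000))

# Request a batch far bigger than the dataset could plausibly be;
# the single (partial) batch that comes back contains every element
everything = next(iter(ds.batch(1 << 30)))
print(everything.shape)  # (1000,)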
Adding on to John's answer:
total = []
for element in val_ds.as_numpy_iterator():
    total.append(element[1])
all_total = np.concatenate(total)
print(all_total)
TensorFlow's get_single_element() is finally around, and it does exactly this: it returns all of the elements of the dataset in one call. This avoids the need to generate and use an iterator via .map() or iter() (which could be costly for big datasets).
get_single_element() returns a tensor (or a tuple or dict of tensors) encapsulating all the members of the dataset. We need to pass all the members of the dataset, batched, into a single element. This can be used to get the features as a tensor-array, or the features and labels as a tuple or dictionary (of tensor-arrays), depending on how the original dataset was created.
Check this answer on SO for an example that unpacks features and labels into a tuple of tensor-arrays.
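For instance, a minimal sketch: batch the whole dataset into one element, then extract it (the toy dataset of 10 pairs is an assumption for illustration):
import tensorflow as tf

# Toy dataset of 10 (features, label) pairs
ds = tf.data.Dataset.from_tensor_slices(
    (tf.zeros((10, 4)), tf.ones((10,))))

# Batch all 10 members into a single element, then pull it out
features, labels = ds.batch(10).get_single_element()
print(features.shape, labels.shape)  # (10, 4) (10,)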
You can get all the elements of the dataset with `dataset.take(length_of_the_dataset)`.
Args:
count: A tf.int64 scalar tf.Tensor, representing the number of elements of this dataset that should be taken to form the new dataset. If count is -1, or if count is greater than the size of this dataset, the new dataset will contain all elements of this dataset.
name: (Optional.) A name for the tf.data operation.
Returns:
Dataset: A Dataset.
(from tensorflow.org)
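In particular, count=-1 sidesteps knowing the length at all. A minimal sketch:
import tensorflow as tf

ds = tf.data.Dataset.range(5)

# count=-1 takes every element, so the exact length is never needed
all_elements = list(ds.take(-1).as_numpy_iterator())
print(all_elements)  # [0, 1, 2, 3, 4]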
Let's suppose that my dataset is as follows:
<BatchDataset element_spec=TensorSpec(shape=(None, 400, 32, 32, 3), dtype=tf.float32, name=None)>
I'm going to iterate over the dataset and get each datapoint as a numpy object.
for img in data.unbatch().take(10):  # because the length is 10
    print(img.numpy().shape)
The following example will batch all the elements in the dataset as a single item, and extract them as an array.
data = data.batch(len(data))
data = data.get_single_element()
This will add an outer dimension to the data, equal to the length of the batch. For example, if you start with a dataset containing 456 elements of dimension (32, 100), you will receive an array of shape (456, 32, 100).
Use tf.metrics.accuracy and run sess.run(update_op) on each batch of the validation data. At the end, calling sess.run(accuracy) should give you the total accuracy. – Preamplifier