Tensorflow: How to find the size of a tf.data.Dataset API object

I understand that the Dataset API is a kind of iterator which does not load the entire dataset into memory, and that this is why it cannot report the size of the dataset. I am talking in the context of a large corpus of data stored in text files or TFRecord files. Such files are generally read with tf.data.TextLineDataset or something similar. By contrast, it is trivial to find the size of a dataset loaded with tf.data.Dataset.from_tensor_slices.

The reason I am asking for the size of the Dataset is the following: let's say my Dataset size is 1000 elements and the batch size is 50 elements. Then the number of training steps/batches (assuming 1 epoch) is 20. During these 20 steps, I would like to exponentially decay my learning rate from 0.1 to 0.01, as follows:

tf.train.exponential_decay(
    learning_rate=0.1,
    global_step=global_step,
    decay_steps=20,
    decay_rate=0.1,
    staircase=False,
    name=None
)

In the above code, I have set decay_steps = 20, and I would like decay_steps to equal the number of steps/batches per epoch, i.e. num_elements / batch_size. This can be calculated only if the number of elements in the dataset is known in advance.

Another reason to know the size in advance is to split the data into train and test sets using the tf.data.Dataset.take() and tf.data.Dataset.skip() methods.
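
For concreteness, here is a rough sketch of what I intend to do once the element count is known. num_elements is exactly the quantity I cannot obtain, and global_step and dataset are assumed to be defined elsewhere:

num_elements = 1000                            # this is what I cannot obtain in advance
batch_size = 50
steps_per_epoch = num_elements // batch_size   # 20

learning_rate = tf.train.exponential_decay(
    learning_rate=0.1,
    global_step=global_step,
    decay_steps=steps_per_epoch,
    decay_rate=0.1,
    staircase=False)

# Train/test split, which again requires knowing the element count up front.
train_size = int(0.8 * num_elements)
train_dataset = dataset.take(train_size)
test_dataset = dataset.skip(train_size)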

PS: I am not looking for brute-force approaches like iterating through the whole dataset while updating a counter to count the elements, or using a very large batch size and then checking the size of the resulting dataset, etc.

Sociology asked 19/6, 2018 at 1:17 Comment(0)

You can easily get the number of data samples using:

dataset.__len__()

You can get each element like this:

for step, element in enumerate(dataset.as_numpy_iterator()):
    print(step, element)

You can also get the shape and dtype of each element:

dataset.element_spec

If you want to take specific elements, you can use the shard() method as well.
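
Putting these together, here is a small self-contained illustration (the in-memory toy dataset is just an example, so its length is known up front):

import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(tf.range(10))

print(dataset.__len__())     # 10 (len(dataset) is equivalent)
print(dataset.element_spec)  # TensorSpec(shape=(), dtype=tf.int32, name=None)

for step, element in enumerate(dataset.as_numpy_iterator()):
    print(step, element)

# shard(num_shards, index) keeps every num_shards-th element, starting at index.
first_half = dataset.shard(num_shards=2, index=0)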

Gatian answered 13/8, 2020 at 1:51 Comment(2)
dataset.__len__() does not work for a dataset made with tf.data.TextLineDataset. The error is TypeError: dataset length is unknown. – Vick
I found that TextLineDataset reads each file and takes each row of those files as a sample, so what you can do is len(list(dataset.as_numpy_iterator())). However, the number of characters per row varies, so you can pad to some fixed size to make it trainable. – Gatian

I realize this question is two years old, but perhaps this answer will be useful.

If you are reading your data with tf.data.TextLineDataset, then a way to get the number of samples could be to count the number of lines in all of the text files you are using.

Consider the following example:

import random
import string
import tensorflow as tf

filenames = ["data0.txt", "data1.txt", "data2.txt"]

# Generate synthetic data.
for filename in filenames:
    with open(filename, "w") as f:
        lines = [random.choice(string.ascii_letters) for _ in range(random.randint(10, 100))]
        print("\n".join(lines), file=f)

dataset = tf.data.TextLineDataset(filenames)

Trying to get the length with len raises a TypeError:

len(dataset)

But one can calculate the number of lines in a file relatively quickly.

# https://mcmap.net/q/45407/-how-to-get-the-line-count-of-a-large-file-cheaply-in-python/5666087
def get_n_lines(filepath):
    """Count the lines in a file without loading it all into memory."""
    i = -1
    with open(filepath) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1

n_lines = sum(get_n_lines(f) for f in filenames)

In the above, n_lines is equal to the number of elements found when iterating over the dataset with

for i, _ in enumerate(dataset):
    pass
assert n_lines == i + 1
Vick answered 13/8, 2020 at 2:57 Comment(0)

Is it an option for you to specify the size of your dataset by hand?

Here is how I load my data:

sample_id_hldr = tf.placeholder(dtype=tf.int64, shape=(None,), name="samples")

sample_ids = tf.Variable(sample_id_hldr, validate_shape=False, name="samples_cache")
num_samples = tf.size(sample_ids)

data = tf.data.Dataset.from_tensor_slices(sample_ids)
# "load" data by id:
# return (id, data) for each id
data = data.map(
    lambda id: (id, some_load_op(id))
)

Here you can specify all your sample IDs by initializing sample_ids once via the placeholder.
Your sample IDs could be, for example, file paths or simple numbers (np.arange(num_elems)).

The number of elements is then available in num_samples.
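
For completeness, a minimal sketch of the one-time initialization, assuming TF 1.x graph mode and simple numeric IDs (the concrete numbers are only illustrative):

import numpy as np
import tensorflow as tf  # TF 1.x (or tf.compat.v1 with eager execution disabled)

sample_id_hldr = tf.placeholder(dtype=tf.int64, shape=(None,), name="samples")
sample_ids = tf.Variable(sample_id_hldr, validate_shape=False, name="samples_cache")
num_samples = tf.size(sample_ids)

with tf.Session() as sess:
    # Feed the IDs exactly once, when initializing the cache variable.
    sess.run(sample_ids.initializer,
             feed_dict={sample_id_hldr: np.arange(1000, dtype=np.int64)})
    print(sess.run(num_samples))  # 1000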

Snider answered 24/6, 2018 at 1:56 Comment(1)
Thank you for your answer. However, it seems you did not get my question; sorry for that, I will revise it. I am not using from_tensor_slices, which is used when you have small datasets; in that case it is trivial to find the size of the dataset. My question is about reading a large corpus of data stored in text files, which can only be read using tf.data.TextLineDataset(). In this case, how do I find the size of the entire dataset? – Sociology

Here is what I did to solve the problem: apply tf.data.experimental.assert_cardinality(len_of_data) to your dataset, and this will fix it.

ast = Audioset(df) # the generator class
db = tf.data.Dataset.from_generator(ast, output_types=(tf.float32, tf.float32, tf.int32))
db = db.apply(tf.data.experimental.assert_cardinality(len(ast))) # number of samples
db = db.batch(batch_size)

The dataset length changes based on the batch_size: after batching, len(db) returns the number of batches. To get the dataset length, just run len(db).
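
This could also be combined with the line-counting approach from the earlier answer for the TextLineDataset case in the question; a rough sketch (filenames and get_n_lines are borrowed from that answer):

filenames = ["data0.txt", "data1.txt", "data2.txt"]
n_lines = sum(get_n_lines(f) for f in filenames)  # line-count helper from the answer above

dataset = tf.data.TextLineDataset(filenames)
dataset = dataset.apply(tf.data.experimental.assert_cardinality(n_lines))
print(len(dataset))            # n_lines
print(len(dataset.batch(50)))  # number of batches per epoch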

Prussia answered 9/11, 2021 at 14:50 Comment(0)
print(len(list(dataset.as_numpy_iterator())))  

This will get the size by returning the length of the iterator object.

Lynea answered 19/4 at 8:17 Comment(3)
This is not an answer, because the OP specifically said they were "not looking for brute-force approaches like iterating through the whole dataset..." – Nervine
It won't iterate through the whole dataset; rather, it will just return the iterator object. – Lynea
Not true: the list() constructor will force the iteration. Worse, it will create a temporary copy for no good reason. – Nervine
